CN114998411B - Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss - Google Patents
Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss
- Publication number: CN114998411B (application CN202210475411.0A)
- Authority: CN (China)
- Legal status: Active (granted)
Classifications
- G06T7/55: Depth or shape recovery from multiple images
- G06T7/70: Determining position or orientation of objects or cameras
- G06T2207/10028: Range image; depth image; 3D point clouds
- G06T2207/20081: Training; learning
- G06T2207/20084: Artificial neural networks [ANN]
- Y02T10/40: Engine management systems
Abstract
The invention relates to a self-supervised monocular depth estimation method and device combining a spatiotemporally enhanced photometric loss. The method includes: acquiring several adjacent frames from an image sequence; and feeding the images into a trained deep learning network to obtain depth information and pose information, where the photometric loss of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional auto-mask keeps the pixels of moving objects out of the photometric error computation. The invention improves the accuracy of the photometric loss and thereby better supervises the learning of the depth network.
Description
Technical Field
The present invention relates to the field of computer vision, and in particular to a self-supervised monocular depth estimation method and device combining a spatiotemporally enhanced photometric loss.
Background Art
Estimating the depth information of a scene from an image, i.e., image depth estimation, is a fundamental and important task in computer vision. A good image depth estimation algorithm can be applied to outdoor driving scenes, small indoor robots, and similar settings, and therefore has great application value. While a robot or a self-driving car operates, a depth estimation algorithm provides scene depth information that assists path planning and obstacle avoidance for its next movement.
Image-based depth estimation is divided into supervised and self-supervised methods. Supervised methods mainly use neural networks to learn a mapping from images to depth maps and are trained under the supervision of ground-truth depth, so that the network gradually acquires the ability to regress depth. However, because ground truth is expensive to obtain, self-supervised methods have gradually become mainstream in recent years. Compared with methods that require stereo image pairs for training, methods based on image sequences have attracted broad attention from researchers because of their wider applicability.
A sequence-based self-supervised monocular depth framework mainly consists of a depth estimation network and a pose estimation network, which respectively predict the depth of the target frame and the pose transformation between the target frame and a source frame. Combining the estimated depth and pose, the source frame can be warped into the coordinate system of the target frame to obtain a reconstructed image. The photometric difference between the target frame and the reconstructed image, i.e., the photometric loss, supervises the simultaneous training of the two networks. As the photometric loss decreases, the depth estimated by the network becomes increasingly accurate.
Generating the photometric loss requires a spatial transformation model. Although the existing spatial transformation model follows the theoretical rigid-body transformation, errors in the translation vector of the pose introduce a depth-dependent error during the computation: the larger the depth, the larger the depth estimation error. In addition, to address the inaccurate photometric loss caused by moving pixels that violate photometric consistency, the main idea of existing methods is to filter out, during training, pixels whose photometric error does not decrease from one frame to the next, generating a binary mask; however, such a binary mask can only identify objects moving in the same direction as the camera.
Summary of the Invention
The inventors found that the reason why a larger depth leads to a larger depth estimation error is as follows. The purpose of the spatial transformation is to make corresponding pixels of the target frame and the source frame coincide on the image plane after the transformation, for example when a nearby point $P_N$ is used to solve the correspondence between the pixels $p_t$ and $p_s$, as shown in Figure 1. The principle of self-supervised depth estimation is to make the estimated pose and depth more accurate by minimizing the photometric error between $p_t$ and $p_s$. For nearby regions, as shown in Figure 1, given a certain number of points the estimated pose only becomes more accurate, and the depth performance better, when $p_t$ and the transformed point $p_F$ largely coincide. For distant regions, as shown in Figure 2, an accurate predicted rotation matrix alone is sufficient to make the photometric error between $p_t$ and $p_s$ small. Therefore, if the estimated rotation matrix and translation vector are used to construct the photometric error without distinguishing near from far, the uncertainty of the photometric error increases greatly, which degrades the depth estimation results.
The technical problem to be solved by the present invention is to provide a self-supervised monocular depth estimation method and device combining a spatiotemporally enhanced photometric loss, which can improve the accuracy of the photometric loss and thereby better supervise the learning of the depth network.
The technical solution adopted by the present invention to solve this technical problem is a self-supervised monocular depth estimation method combining a spatiotemporally enhanced photometric loss, comprising the following steps:
acquiring several adjacent frames from an image sequence;
inputting the images into a trained deep learning network to obtain depth information and pose information, wherein the photometric loss of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional auto-mask is used to keep the pixels of moving objects out of the photometric error computation.
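As a rough illustration of this inference step only, the sketch below feeds three adjacent frames through stand-in depth and pose networks; the network architectures, image size, and depth range are placeholders of this sketch, not the networks of the invention.

```python
# Illustrative sketch: inference interface of a depth net and a pose net (placeholders).
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):      # placeholder for a trained depth network
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, 3, padding=1)
    def forward(self, x):
        return torch.sigmoid(self.conv(x)) * 80.0 + 0.1      # per-pixel depth map

class TinyPoseNet(nn.Module):       # placeholder for a trained pose network
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(6, 6, 3, padding=1)
    def forward(self, target, source):
        x = self.conv(torch.cat([target, source], dim=1))
        return x.mean(dim=[2, 3])    # 6-DoF pose: axis-angle rotation + translation

depth_net, pose_net = TinyDepthNet().eval(), TinyPoseNet().eval()
frames = [torch.rand(1, 3, 192, 640) for _ in range(3)]       # three adjacent RGB frames
target, sources = frames[1], (frames[0], frames[2])
with torch.no_grad():
    depth = depth_net(target)                                  # depth of the target frame
    poses = [pose_net(target, s) for s in sources]             # target-to-source poses
```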
Obtaining the photometric loss from the spatial transformation model based on depth-aware pixel correspondence specifically comprises:
performing the spatial transformation for distant regions with a homography matrix and constructing a first reconstructed image, where the distant region is treated as a plane at infinity;
performing the spatial transformation with the basic matrix and constructing a second reconstructed image;
solving, from the two pixel correspondences, a photometric error map based on the first reconstructed image and a photometric error map based on the second reconstructed image, and then taking the pixel-wise minimum to obtain the final photometric loss.
Using the omnidirectional auto-mask to keep the pixels of moving objects out of the photometric error computation specifically comprises:
predicting the initial depth and initial pose of the target frame with a pre-trained network and generating an initial reconstructed image;
adding perturbation terms to the initial pose and obtaining several hypothetical reconstructed frames via the spatial transformation; generating multiple photometric error maps from the hypothetical reconstructed frames combined with the photometry of the target frame, and obtaining multiple binary masks from the photometric error maps;
taking the pixel-wise minimum over the multiple binary masks as the final mask.
The perturbation terms are translation perturbations: $[t_{\max},0,0]$, $[-t_{\max},0,0]$, $[0,0,t_{\max}]$ and $[0,0,-t_{\max}]$, where $t_{\max}$ denotes the maximum value of the initial translation vector.
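The following sketch shows one way to build these four perturbation vectors; treating $t_{\max}$ as the largest absolute component of the initial translation is an assumption of this illustration.

```python
# Illustrative sketch: the four translation perturbations
# [t_max,0,0], [-t_max,0,0], [0,0,t_max], [0,0,-t_max].
import torch

def translation_perturbations(t_init: torch.Tensor) -> list:
    """t_init: (3,) initial translation vector predicted by the pose network."""
    t_max = float(t_init.abs().max())   # assumed reading of "maximum value of the translation vector"
    return [torch.tensor([ t_max, 0.0, 0.0]),
            torch.tensor([-t_max, 0.0, 0.0]),
            torch.tensor([0.0, 0.0,  t_max]),
            torch.tensor([0.0, 0.0, -t_max])]
```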
The technical solution adopted by the present invention to solve this technical problem also provides a self-supervised monocular depth estimation device combining a spatiotemporally enhanced photometric loss, comprising:
an acquisition module, configured to acquire several adjacent frames from an image sequence;
an estimation module, configured to input the images into a trained deep learning network to obtain depth information and pose information, where the photometric loss of the deep learning network is obtained from the spatial transformation model of a depth-aware pixel correspondence module, and an omnidirectional auto-mask module is used to keep the pixels of moving objects out of the photometric error computation.
The depth-aware pixel correspondence module comprises:
a first construction unit, configured to perform the spatial transformation for distant regions with a homography matrix and construct a first reconstructed image, where the distant region is treated as a plane at infinity;
a second construction unit, configured to perform the spatial transformation with the basic matrix and construct a second reconstructed image;
a photometric loss acquisition unit, configured to solve, from the two pixel correspondences, a photometric error map based on the first reconstructed image and a photometric error map based on the second reconstructed image, and then take the pixel-wise minimum to obtain the final photometric loss.
The omnidirectional auto-mask module comprises:
an initial reconstruction generation unit, configured to predict the initial depth and initial pose of the target frame with a pre-trained network and generate an initial reconstructed image;
a binary mask generation unit, configured to add perturbation terms to the initial pose and obtain several hypothetical reconstructed frames via the spatial transformation, generate multiple photometric error maps from the hypothetical reconstructed frames combined with the photometry of the target frame, and obtain multiple binary masks from the photometric error maps;
a mask selection unit, configured to take the pixel-wise minimum over the multiple binary masks as the final mask.
The perturbation terms are translation perturbations: $[t_{\max},0,0]$, $[-t_{\max},0,0]$, $[0,0,t_{\max}]$ and $[0,0,-t_{\max}]$, where $t_{\max}$ denotes the maximum value of the initial translation vector.
Beneficial Effects
Owing to the above technical solution, the present invention has the following advantages and positive effects compared with the prior art. The invention mines pixel correspondences in distant regions through depth-aware pixel correspondence, mitigating the problem of inaccurate pixel correspondence in distant regions, and uses an omnidirectional auto-masking scheme to obtain an omnidirectional binary mask that keeps the pixels of moving objects out of the photometric error computation. By improving the spatial transformation and automatically masking dynamic objects, the invention improves the accuracy of the photometric loss and thereby better supervises the learning of the depth network.
Brief Description of the Drawings
Figure 1 is a schematic diagram of pose estimation from nearby points;
Figure 2 is a schematic diagram of pose estimation from distant points;
Figure 3 is a schematic diagram of the basic Monodepth2 framework;
Figure 4 is a schematic diagram of photometric loss generation in the first embodiment of the present invention;
Figure 5 is a schematic diagram of the omnidirectional auto-mask in the first embodiment of the present invention.
Detailed Description
The present invention is further described below with reference to specific embodiments. It should be understood that these embodiments are only intended to illustrate the invention and not to limit its scope. In addition, it should be understood that, after reading the teachings of the present invention, those skilled in the art can make various changes or modifications to the invention, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.
The first embodiment of the present invention relates to a self-supervised monocular depth estimation method combining a spatiotemporally enhanced photometric loss, comprising the following steps: acquiring several adjacent frames from an image sequence; and inputting the images into a trained deep learning network to obtain depth information and pose information, where the photometric loss of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional auto-mask is used to keep the pixels of moving objects out of the photometric error computation.
The method of this embodiment can be applied directly to general self-supervised monocular depth estimation; any work built on the SfMLearner framework can use it. One only needs to replace the spatial transformation part of the original framework with the spatial transformation model based on depth-aware pixel correspondence of this embodiment, and the auto-masking part with the omnidirectional auto-mask of this application.
The invention is further described below using the basic framework of Monodepth2 by Godard et al. as an example.
For ease of understanding, the overall framework of Monodepth2 is introduced first, as shown in Figure 3. Its input is three adjacent RGB frames of a sequence; its outputs are the depth of the target frame and the pose transformation between the target frame and each source frame.
The basic framework of this embodiment is the same as that of Figure 3. Since this embodiment mainly improves the spatial transformation that generates the photometric loss and the auto-masking part, these two parts of Monodepth2 are introduced first:
Monodepth2 uses the same spatial transformation model as SfMLearner, based on the depth $D_t$ of the target frame $I_t$ and the pose $T_{t\to s}=[R_{t\to s}\,|\,t_{t\to s}]$ between the target frame $I_t$ and the source frame $I_s$. For corresponding pixels $p_t$ and $p_s$ of the target and source frames, if they correspond to the same 3D point, the following should hold:
$D_s K^{-1} p_s = T_{t\to s}\, D_t K^{-1} p_t$
where $K$ is the camera intrinsic matrix. Since monocular depth has scale ambiguity, this can be rewritten in the following form for the spatial transformation:
$p_s \sim K T_{t\to s} D_t K^{-1} p_t$
In the spatial geometric transformation, $K T_{t\to s} K^{-1}$ is defined as the basic matrix $F$, which encodes the pixel correspondence between frames. This relationship can then be used to construct the reconstructed frame, denoted here $I_{s\to t}$.
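A minimal sketch of this warping step is given below: it back-projects target pixels with $D_t$, applies $T_{t\to s}$, re-projects with $K$, and bilinearly samples the source frame. Tensor shapes, the border padding mode, and the small epsilon are assumptions of the sketch, not specified by the patent.

```python
# Illustrative sketch: inverse warping p_s ~ K T_{t->s} D_t K^{-1} p_t.
import torch
import torch.nn.functional as F

def reconstruct_from_source(I_s, D_t, K, T_t2s):
    """I_s: (B,3,H,W) source frame; D_t: (B,1,H,W) target depth;
    K: (B,3,3) intrinsics; T_t2s: (B,4,4) target-to-source pose (shapes assumed)."""
    B, _, H, W = I_s.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()      # homogeneous p_t, (3,H,W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                           # (B,3,H*W)
    cam = torch.inverse(K) @ pix * D_t.view(B, 1, -1)                    # D_t K^{-1} p_t
    cam = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)               # homogeneous 3D points
    cam_s = (T_t2s @ cam)[:, :3]                                         # apply T_{t->s}
    pix_s = K @ cam_s                                                    # project with K
    pix_s = pix_s[:, :2] / (pix_s[:, 2:3] + 1e-7)                        # perspective division
    gx = 2.0 * pix_s[:, 0] / (W - 1) - 1.0                               # normalise to [-1, 1]
    gy = 2.0 * pix_s[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_s, grid, padding_mode="border", align_corners=True)
```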
From the target frame and the reconstructed frame, the photometric loss $pe$ can be constructed from an L1 error and a structural similarity (SSIM) error, in the standard Monodepth2 form:
$pe(I_t, I_{s\to t}) = \frac{\alpha}{2}\bigl(1-\mathrm{SSIM}(I_t, I_{s\to t})\bigr) + (1-\alpha)\,\lVert I_t - I_{s\to t}\rVert_1$
where $\alpha$ is a hyperparameter, set to 0.85 in Monodepth2.
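A sketch of this photometric error in PyTorch, assuming the commonly used 3x3 average-pooling SSIM (the window size is not stated here):

```python
# Illustrative sketch: pe = alpha/2 * (1 - SSIM) + (1 - alpha) * |.|_1, alpha = 0.85.
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sig_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sig_x + sig_y + C2)
    return torch.clamp(num / den, 0, 1)

def photometric_error(I_t, I_rec, alpha=0.85):
    l1 = (I_t - I_rec).abs().mean(1, keepdim=True)                 # per-pixel L1 term
    ssim_term = (1 - ssim(I_t, I_rec).mean(1, keepdim=True)) / 2   # per-pixel SSIM term
    return alpha * ssim_term + (1 - alpha) * l1                    # (B,1,H,W) error map
```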
The auto-masking of Monodepth2 is mainly intended to solve the inaccurate photometric loss caused by moving pixels that violate photometric consistency. Its main idea is to filter out, during training, pixels whose photometric error does not become smaller from one frame to the other after warping. The resulting binary mask $\mu$ takes the standard Monodepth2 form:
$\mu = \bigl[\min_s pe(I_t, I_{s\to t}) < \min_s pe(I_t, I_s)\bigr]$
where $[\,\cdot\,]$ is the Iverson bracket, used to generate the binary mask, $I_t$ is the target frame, $I_s$ is the source frame, and $I_{s\to t}$ is the reconstructed frame obtained by the spatial transformation.
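A minimal sketch of this mask as a tensor comparison (the Iverson bracket becomes an element-wise `<`); `pe_warped` and `pe_identity` are stand-in names for the per-source minima written above.

```python
# Illustrative sketch: binary auto-mask from two photometric error maps.
import torch

def auto_mask(pe_warped: torch.Tensor, pe_identity: torch.Tensor) -> torch.Tensor:
    """pe_warped: min_s pe(I_t, I_{s->t}); pe_identity: min_s pe(I_t, I_s)."""
    return (pe_warped < pe_identity).float()   # 1 = keep pixel, 0 = masked out
```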
For the photometric loss generated by the spatial transformation, this embodiment obtains the photometric loss from a spatial transformation model based on depth-aware pixel correspondence, as shown in Figure 4 and detailed below:
During the spatial transformation, a sufficiently distant region can be regarded as a plane at infinity, and a plane satisfies:
$n^{T} P + D = 0$
where $n$ is the normal vector of the plane, $P$ is a 3D point on the plane, and $D$ is the depth of the point. Rearranging gives:
$-\,n^{T} P / D = 1$
Substituting this into the spatial transformation relation yields:
$p_s \sim K\bigl(R_{t\to s} - t_{t\to s}\, n^{T} / D_t\bigr) D_t K^{-1} p_t$
When $D_t$ tends to infinity, i.e., for the plane at infinity:
$p_s \sim K R_{t\to s} D_t K^{-1} p_t$
$K R_{t\to s} K^{-1}$ is defined as the homography at infinity $H_\infty$; thus for distant regions only the rotation matrix is used for the spatial transformation, and the corresponding reconstructed image is denoted here $I^{H}_{s\to t}$. To distinguish it, the reconstructed image obtained with the basic matrix is denoted $I^{F}_{s\to t}$. Since the depth estimated by a monocular network has scale ambiguity, the two pixel correspondences cannot be selected between directly from the predicted depth. This embodiment therefore designs an adaptive selection method: two photometric error maps are computed from the two pixel correspondences, and the minimum is taken pixel by pixel, i.e., the final photometric error is:
$pe_{\mathrm{final}} = \min\bigl(pe(I_t, I^{H}_{s\to t}),\; pe(I_t, I^{F}_{s\to t})\bigr)$
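The sketch below illustrates this adaptive selection using the warping and photometric-error sketches given earlier; zeroing the translation of $T_{t\to s}$ to realize the rotation-only branch is one possible way to implement the homography at infinity for this illustration.

```python
# Illustrative sketch: per-pixel minimum over the F-branch and H-branch error maps.
# Uses reconstruct_from_source and photometric_error from the earlier sketches.
import torch

def depth_aware_photometric_error(I_t, I_s, D_t, K, T_t2s):
    # Branch F: full rigid transform T_{t->s} = [R | t]
    pe_F = photometric_error(I_t, reconstruct_from_source(I_s, D_t, K, T_t2s))
    # Branch H: rotation only, i.e. the homography at infinity H_inf = K R K^{-1}
    T_rot = T_t2s.clone()
    T_rot[:, :3, 3] = 0.0                 # drop the translation
    pe_H = photometric_error(I_t, reconstruct_from_source(I_s, D_t, K, T_rot))
    return torch.minimum(pe_F, pe_H)      # per-pixel selection of the smaller error
```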
For the omnidirectional auto-mask, this embodiment feeds the image sequence directly into the module and, after obtaining the mask, applies it to the photometric error to block out the unreliable parts, as shown in Figure 5 and detailed below:
This embodiment introduces a pre-trained Monodepth2 network to predict the initial depth $D_{init}$ of the target frame and the initial pose $T_{init}$, and further generates an initial reconstructed image $I_{init}$. Since the depth and pose are already fairly accurate, the photometric error of regions that satisfy photometric consistency is already small, whereas the error of regions that violate photometric consistency can still be driven smaller.
Following this idea, perturbation terms are added to the initial pose to introduce several perturbed poses, and the spatial transformation then yields several hypothetical reconstructed frames. Using these reconstructed frames $I_i$, where $i\in\{1,2,\dots\}$, together with the photometry of the target frame, multiple photometric error maps can be generated, and from the magnitudes of these photometric errors multiple binary masks are obtained, corresponding to the pixels of objects moving in the various directions, as follows:
$M_i = \bigl[\,pe(I_t, I_{init}) < pe(I_t, I_i)\,\bigr]$
To capture objects moving in any direction, the pixel-wise minimum over the generated masks is taken as the final mask, namely:
$M_{oA} = \min(M_1, M_2, \dots)$
In the implementation of this embodiment, only the translation vector is perturbed, with the specific translation perturbation terms $t_i$: $t_1=[t_{\max},0,0]$, $t_2=[-t_{\max},0,0]$, $t_3=[0,0,t_{\max}]$ and $t_4=[0,0,-t_{\max}]$, where $t_{\max}$ is the maximum value of the initial translation vector.
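Putting the pieces together, the following sketch builds the omnidirectional auto-mask from the earlier illustrative functions (`reconstruct_from_source`, `photometric_error`, `translation_perturbations`); it is a schematic reading of the procedure, not the patented implementation.

```python
# Illustrative sketch: omnidirectional auto-mask M_oA.
import torch

def omnidirectional_auto_mask(I_t, I_s, D_init, K, T_init):
    """Inputs follow the conventions of the earlier sketches; returns a (B,1,H,W) 0/1 mask."""
    pe_init = photometric_error(I_t, reconstruct_from_source(I_s, D_init, K, T_init))
    masks = []
    for dt in translation_perturbations(T_init[0, :3, 3]):     # four perturbed translations
        T_i = T_init.clone()
        T_i[:, :3, 3] = T_i[:, :3, 3] + dt                     # perturbed pose
        pe_i = photometric_error(I_t, reconstruct_from_source(I_s, D_init, K, T_i))
        masks.append((pe_init < pe_i).float())                 # M_i via the Iverson bracket
    return torch.stack(masks, dim=0).min(dim=0).values         # M_oA = min(M_1, ..., M_4)
```

A static pixel keeps a lower error under the accurate initial pose than under any perturbed pose, so all $M_i$ are 1 and the pixel is kept; a pixel on an object moving in some direction is compensated by one of the perturbed poses, so the corresponding $M_i$ is 0 and the minimum masks it out.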
It is easy to see that the invention mines the pixel correspondence of distant regions through depth-aware pixel correspondence, mitigating the inaccuracy of pixel correspondence in distant regions, and uses omnidirectional auto-masking to obtain an omnidirectional binary mask that keeps the pixels of moving objects out of the photometric error computation. By improving the spatial transformation and automatically masking dynamic objects, the invention improves the accuracy of the photometric loss and thereby better supervises the learning of the depth network. Applying the depth-aware pixel correspondence and the omnidirectional auto-mask of this embodiment to the Monodepth2 framework of Godard et al. therefore yields monocular depth estimation results of higher accuracy.
The second embodiment of the present invention relates to a self-supervised monocular depth estimation device combining a spatiotemporally enhanced photometric loss, comprising: an acquisition module, configured to acquire several adjacent frames from an image sequence; and an estimation module, configured to input the images into a trained deep learning network to obtain depth information and pose information, where the photometric loss of the deep learning network is obtained from the spatial transformation model of a depth-aware pixel correspondence module, and an omnidirectional auto-mask module is used to keep the pixels of moving objects out of the photometric error computation.
The depth-aware pixel correspondence module comprises: a first construction unit, configured to perform the spatial transformation for distant regions with a homography matrix and construct a first reconstructed image, where the distant region is treated as a plane at infinity; a second construction unit, configured to perform the spatial transformation with the basic matrix and construct a second reconstructed image; and a photometric loss acquisition unit, configured to solve, from the two pixel correspondences, a photometric error map based on the first reconstructed image and a photometric error map based on the second reconstructed image, and then take the pixel-wise minimum to obtain the final photometric loss.
The omnidirectional auto-mask module comprises: an initial reconstruction generation unit, configured to predict the initial depth and initial pose of the target frame with a pre-trained network and generate an initial reconstructed image; a binary mask generation unit, configured to add perturbation terms to the initial pose and obtain several hypothetical reconstructed frames via the spatial transformation, generate multiple photometric error maps from the hypothetical reconstructed frames combined with the photometry of the target frame, and obtain multiple binary masks from the photometric error maps; and a mask selection unit, configured to take the pixel-wise minimum over the multiple binary masks as the final mask. The perturbation terms are translation perturbations: $[t_{\max},0,0]$, $[-t_{\max},0,0]$, $[0,0,t_{\max}]$ and $[0,0,-t_{\max}]$, where $t_{\max}$ denotes the maximum value of the initial translation vector.
Claims (6)
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210475411.0A | 2022-04-29 | 2022-04-29 | Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss

Publications (2)

Publication Number | Publication Date
---|---
CN114998411A | 2022-09-02
CN114998411B | 2024-01-09
Family ID: 83025390
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116245927B * | 2023-02-09 | 2024-01-16 | Hubei University of Technology | A self-supervised monocular depth estimation method and system based on ConvDepth |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11082681B2 (en) * | 2018-05-17 | 2021-08-03 | Niantic, Inc. | Self-supervised training of a depth estimation system |
US10970856B2 (en) * | 2018-12-27 | 2021-04-06 | Baidu Usa Llc | Joint learning of geometry and motion with three-dimensional holistic understanding |
US11176709B2 (en) * | 2019-10-17 | 2021-11-16 | Toyota Research Institute, Inc. | Systems and methods for self-supervised scale-aware training of a model for monocular depth estimation |
US11257231B2 (en) * | 2020-06-17 | 2022-02-22 | Toyota Research Institute, Inc. | Camera agnostic depth network |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110264509A (en) * | 2018-04-27 | 2019-09-20 | Tencent Technology (Shenzhen) Co., Ltd. | Method, apparatus and storage medium for determining the pose of an image-capturing device |
CN111260680A (en) * | 2020-01-13 | 2020-06-09 | Hangzhou Dianzi University | An unsupervised pose estimation network construction method based on RGBD cameras |
CN111369608A (en) * | 2020-05-29 | 2020-07-03 | Nanjing Xiaozhuang University | A visual odometry method based on image depth estimation |
CN111739078A (en) * | 2020-06-15 | 2020-10-02 | Dalian University of Technology | A monocular unsupervised depth estimation method based on a context attention mechanism |
CN111783582A (en) * | 2020-06-22 | 2020-10-16 | Southeast University | An unsupervised monocular depth estimation algorithm based on deep learning |
CN113160390A (en) * | 2021-04-28 | 2021-07-23 | Beijing Institute of Technology | Three-dimensional dense reconstruction method and system |
CN113240722A (en) * | 2021-04-28 | 2021-08-10 | Zhejiang University | Self-supervised depth estimation method based on multi-frame attention |
CN113570658A (en) * | 2021-06-10 | 2021-10-29 | Xidian University | Depth estimation method for monocular video based on a deep convolutional network |
CN113313732A (en) * | 2021-06-25 | 2021-08-27 | Nanjing University of Aeronautics and Astronautics | Forward-looking scene depth estimation method based on self-supervised learning |
CN113450410A (en) * | 2021-06-29 | 2021-09-28 | Zhejiang University | Monocular depth and pose joint estimation method based on epipolar geometry |
CN114022799A (en) * | 2021-09-23 | 2022-02-08 | National Defense Science and Technology Innovation Institute, PLA Academy of Military Sciences | Self-supervised monocular depth estimation method and device |
CN114170286A (en) * | 2021-11-04 | 2022-03-11 | Xi'an University of Technology | Monocular depth estimation method based on unsupervised deep learning |
Non-Patent Citations (5)
Title |
---|
Unsupervised learning of depth and ego-motion from video; T. Zhou et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 1851-1858 * |
Research on image depth information estimation methods based on domain adaptation; Zhan Yan; China Master's Theses Full-text Database, Information Science and Technology, No. 2021(04); I138-811 * |
Monocular image depth estimation based on unsupervised learning; Hu Zhicheng; China Master's Theses Full-text Database, Information Science and Technology, No. 2021(08); I138-615 * |
RGB-D SLAM algorithm for indoor dynamic scenes based on semantic priors and depth constraints; Jiang Haochen et al.; Information and Control, Vol. 50, No. 2021(03); 275-286 * |
Monocular depth estimation combining attention and unsupervised deep learning; Cen Shijie et al.; Journal of Guangdong University of Technology, No. 04; 35-41 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
Jin et al. | Neu-nbv: Next best view planning using uncertainty estimation in image-based neural rendering | |
CN110610486B (en) | Monocular image depth estimation method and device | |
CN110599522B (en) | Method for detecting and removing dynamic target in video sequence | |
Ubina et al. | Intelligent underwater stereo camera design for fish metric estimation using reliable object matching | |
CN110910437A (en) | A Depth Prediction Method for Complex Indoor Scenes | |
Miao et al. | Ds-depth: Dynamic and static depth estimation via a fusion cost volume | |
Gao et al. | Joint optimization of depth and ego-motion for intelligent autonomous vehicles | |
Goncalves et al. | Deepdive: An end-to-end dehazing method using deep learning | |
CN114998411B (en) | Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss | |
CN118822906A (en) | Indoor dynamic environment map construction method and system based on image restoration and completion | |
Gong et al. | Skipcrossnets: Adaptive skip-cross fusion for road detection | |
Su et al. | Omnidirectional depth estimation with hierarchical deep network for multi-fisheye navigation systems | |
CN113920270B (en) | Layout reconstruction method and system based on multi-view panorama | |
Zhang et al. | Spatiotemporally enhanced photometric loss for self-supervised monocular depth estimation | |
Rohan et al. | A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision | |
CN114943762A (en) | A Binocular Visual Odometry Method Based on Event Camera | |
Liao et al. | VI-NeRF-SLAM: A real-time visual–inertial SLAM with NeRF mapping | |
CN116188550A (en) | Self-supervision depth vision odometer based on geometric constraint | |
Guo et al. | A simple baseline for supervised surround-view depth estimation | |
Bhutani et al. | Unsupervised Depth and Confidence Prediction from Monocular Images using Bayesian Inference | |
Han et al. | Weakly Supervised Monocular 3D Object Detection by Spatial-Temporal View Consistency | |
Wang et al. | Unsupervised Scale Network for Monocular Relative Depth and Visual Odometry | |
CN116630953A (en) | Monocular image 3D target detection method based on nerve volume rendering | |
Hirose et al. | Depth360: Self-supervised Learning for Monocular Depth Estimation using Learnable Camera Distortion Model |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant