WO2024051184A1 - Unsupervised monocular depth estimation method based on an optical flow mask - Google Patents

Unsupervised monocular depth estimation method based on an optical flow mask

Info

Publication number
WO2024051184A1
WO2024051184A1, PCT/CN2023/092180, CN2023092180W
Authority
WO
WIPO (PCT)
Prior art keywords
optical flow
estimation
depth
image
network
Prior art date
Application number
PCT/CN2023/092180
Other languages
English (en)
French (fr)
Inventor
王梦凡
方效林
杨明
吴文甲
罗军舟
Original Assignee
南京逸智网络空间技术创新研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京逸智网络空间技术创新研究院有限公司 filed Critical 南京逸智网络空间技术创新研究院有限公司
Publication of WO2024051184A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose

Definitions

  • the invention belongs to the technical field of image recognition.
  • Depth estimation is a basic problem in the field of computer vision, which can be applied in fields such as robot navigation, augmented reality, three-dimensional reconstruction, and autonomous driving.
  • at present, most depth estimation is based on converting two-dimensional RGB images into RGB-D images, mainly including Shape from X methods that recover scene depth from shading, viewpoint, photometric and texture cues, as well as algorithms that predict camera pose using SfM (Structure from Motion) and SLAM (Simultaneous Localization And Mapping).
  • although some devices can acquire depth directly, such devices are expensive; binocular depth estimation is also used, but since binocular images require stereo matching for pixel correspondence and disparity calculation, the computational complexity is high, and matching quality is poor for low-texture scenes in particular.
  • to solve these problems of the prior art, the present invention provides an unsupervised monocular depth estimation method based on an optical flow mask.
  • the loss function is L = μL_p + λL_s, where λ and μ are hyperparameters, L_p is the photometric loss error, and L_s is the smoothness loss
  • L_s is built from the derivatives with respect to x and y of the mean-normalized depth of each pixel in the current frame image
  • the photometric error pe(.) compares any two image frames I_a and I_b using the structural similarity function SSIM(.), weighted by the hyperparameter α, together with their intensity difference
  • M_a is a 0/1 mask obtained by thresholding the difference between the optical-flow-reconstructed image and the adjacent frame with a preset threshold r
  • the optical flow estimation network performs the following processing on two adjacent frames I_t and I_t' in the training samples:
  • Step 1: a pyramid-structured encoder extracts feature image pairs at n scales between I_t and I_t', i = 1, 2, ..., n
  • Step 2: the pyramid-structured decoder of the optical flow estimation network includes n encoder modules and n upsampling modules; for the feature image pair and optical flow at the i-th scale, the corresponding upsampling module performs the following processing:
  • Step 2.1: bilinear interpolation is used to increase the resolution of the flow, giving the initial optical flow, where p denotes the coordinates of any pixel in the initial optical flow, N(p/s) denotes the four pixels adjacent to the point p/s in the low-resolution flow, s is the scaling ratio, and ω(p/s, k) is the bilinear interpolation weight
  • Step 2.2: an encoder computes the interpolation flow between the feature pair, and the interpolation flow warps the initial optical flow; here N(d) denotes the four pixels adjacent to pixel d in the initial optical flow, the optical flow values of the pixels k' in the initial flow are used, the interpolation flow at pixel p performs the transformation, and ω(d, k') denotes the weight
  • Step 2.3: the two flows are fused according to a learned interpolation map to obtain the output of the corresponding upsampling module
  • the depth estimation network adopts a ResNet network.
  • the present invention designs an unsupervised monocular depth estimation method based on an optical flow mask: a pyramid structure is used to estimate optical flow at different granularities, and an upsampling module with interpolation flow is added, alleviating the bilinear-interpolation mixing problem in motion boundary regions; the image reconstructed from the optical flow estimate is then compared with the current image, regions with large differences are regarded as self-moving objects, and those regions are masked during the depth-estimation reconstruction to reduce the influence of moving objects on depth estimation and improve its accuracy; overall, the present invention achieves depth estimation of images and a partial improvement in depth estimation accuracy.
  • Figure 1(a) is the depth estimation network structure diagram
  • Figure 1(b) shows the hierarchical parameter setting diagram of the depth estimation network
  • Figure 2 is a schematic diagram of the camera pose estimation model
  • Figure 3 is a schematic diagram of the decoder with pyramid structure in the optical flow estimation network
  • Figure 4 is a schematic diagram of the overall training architecture of the present invention based on optical flow as a mask.
  • the present invention provides an unsupervised monocular depth estimation method based on an optical flow mask; according to the following steps S1 to S5, the depth estimation network and the camera pose estimation model are obtained, and these two models are then applied to perform depth estimation on an image, yielding the depth estimate D_t.
  • the depth estimation network uses the raw sensor data images as input, the calibration files are used to read the camera intrinsics, and the velodyne_points files are used to read the lidar data as ground truth.
  • the depth estimation network uses the ResNet network; based on the ResNet network, each frame of the video is used as input to estimate the depth value of each pixel of the image.
  • the estimated pose transformation matrix T t'-t is used as the output, which includes two parts, one is the rotation transformation of the camera, and the other is the camera translation transformation.
  • the encoding sub-module for depth estimation uses the ResNet network, uses the residual learning structure to prevent degradation, and uses a feed-forward network with shortcut connections so that feature maps with richer semantic information are output during encoding.
  • the specific steps are as follows:
  • the encoder in the ResNet network takes a single image as input and outputs a feature map of dimension C*H*W, where C is the number of channels, H is the height of the feature map and W is its width.
  • the ResNet network outputs five levels of features. The higher the level, the lower the feature space resolution, the stronger the representation ability, and the greater the number of features.
  • after the input image, it first passes through the first 7×7 convolutional layer of the ResNet network, with 64 output channels, a stride of 2 and a padding of 3; it then passes through a 3×3 max pooling layer with a stride of 2 and a padding of 1; apart from the max pooling layer, all other downsampling is implemented with convolutional layers, organized into four convolution groups: layer1, layer2, layer3 and layer4; except for layer1, whose downsampling is performed by the max pooling layer, the downsampling of the other layers is implemented by the residual block adjoining the previous convolution group.
  • in the residual structure, the main branch uses three convolutional layers: a 1×1 convolutional layer to compress the channel dimension, a 3×3 convolutional layer, and a 1×1 convolutional layer to restore the channel dimension.
  • the decoder uses upsampling to perform depth estimation combined with the features output by the encoder in the ResNet network to obtain preset depth estimates at different scales.
  • for an input feature map, it is first upsampled by a factor of two, copying each pixel to its row and column so that one pixel produces a 2×2 output; a convolution that keeps the resolution unchanged then halves the number of channels, so the channel count is halved while the resolution is preserved.
  • the upsampled feature map and the feature map output by the encoder are skip-connected, and a disparity map of the corresponding number of channels is output.
  • the depth estimate is obtained through two 3 ⁇ 3 convolutional layers and a sigmoid activation function.
  • according to the photometric consistency principle, the external environment hardly changes over a short time, so the photometric appearance of the same object is consistent across adjacent frames separated by a short interval.
  • the image is thus reconstructed based on the depth obtained by the depth estimation network and the camera pose estimation model.
  • the reconstructed photometric loss error can be obtained and propagated back into the two networks, training the depth estimation network and the camera pose estimation model to improve the accuracy of the estimation results.
  • this embodiment continues to add depth estimation smoothing as a regularization term and image structure similarity (SSIM) loss, which can obtain better depth estimation effects.
  • the depth-reconstructed image is based on the principle that the image transformation is produced entirely by the motion of the camera.
  • the reconstruction uses the results estimated by the depth estimation network and by the camera pose estimation model; in real scenes, however, most scenes contain self-moving objects, and reconstructing with this method introduces calculation errors.
  • a large gap between the reconstructed image and the original current frame image I_t may not be due to an erroneous depth estimation result, but because pure camera motion cannot correctly reconstruct moving objects, so that even a correct depth-reconstructed image differs greatly from the current frame, ultimately making the depth estimation result inaccurate.
  • therefore an optical flow estimation network is added during training, and the optical-flow-reconstructed image is added to the loss computation of the depth estimation to estimate the motion of moving objects.
  • the optical-flow-reconstructed image serves as part of the constraint on the depth estimation of moving objects, and the difference between the optical-flow-reconstructed image and the current frame image is used as a constraint in the loss calculation.
  • Step S4 specifically includes the following steps:
  • a pyramid structure is generally used to capture global motion and local motion from coarse-grained to fine-grained.
  • the two adjacent images I_t and I_t' are input into the optical flow estimation network; H denotes the optical flow estimation network with parameters θ, and V_f denotes the forward flow field that moves each pixel in I_t to its corresponding pixel in I_t'.
  • the optical flow estimation model H has a pyramid structure and is divided into two stages: pyramid encoding and pyramid decoding.
  • in the encoding stage, two consecutive frames are used as the input image pair.
  • in the decoding stage, the first decoder module D decodes the first-scale feature image pair, estimating from coarse to fine.
  • the first upsampling module S then upsamples the resulting motion optical flow.
  • S(.) is the upsampling module S and D(.) is the decoder module D.
  • the upsampling module in this embodiment is a self-guided upsampling module; this embodiment improves on the mixed interpolation caused by boundary blending in bilinear upsampling.
  • bilinear interpolation is generally used for upsampling, but near a motion boundary where the motion on the two sides is inconsistent, the interpolation mixes motion 1 and motion 2: the region of motion 1 close to the boundary is affected by motion 2, and the region of motion 2 close to the boundary is affected by the interpolation of motion 1, producing mixed interpolation; in reality they belong to two different motion regions and should not interfere with each other.
  • to avoid this, a self-guided upsampling module is used in the upsampling process.
  • for motion boundary regions, nearby points with the same motion direction are used for the interpolation calculation.
  • the interpolated value is then moved by the learned interpolation flow, and the region that finally moves to the boundary position serves as the interpolation point of that region.
  • for the motion optical flow corresponding to the feature image pair at the (i-1)-th scale (at low resolution at this point), the resolution is first increased by bilinear interpolation to obtain the initial optical flow.
  • p denotes the coordinates of any pixel in the initial optical flow, N(p/s) denotes the four pixels adjacent to the point p/s in the low-resolution flow, s is the scaling ratio, and ω(p/s, k) is the linear interpolation weight.
  • the encoder is then used to compute the interpolation flow, which warps the initial optical flow: plain bilinear interpolation would turn edge regions into a blend of the motions on both sides, which is unrealistic, so points near an edge are transformed through the interpolation flow; if an edge point d can be reached from a point p in the same motion region via the interpolation flow, bilinear interpolation is computed over the four points around p.
  • N(d) denotes the four pixels adjacent to pixel d in the initial optical flow, ω(d, k') denotes the weight, and ⊙ is the element-wise weighted product used when fusing the flows.
  • a dense block with five convolutional layers is used.
  • the specific implementation is to concatenate the bilinearly upsampled flow and the features as the input of the dense block.
  • the number of convolution kernels in each convolutional layer in the dense block is 32, 32, 32, 16, and 8; the output of the dense block is a 3-channel tensor map.
  • the first two channels of the tensor graph are used as the interpolation flow, and the last channel is used to form the interpolation map through the sigmoid layer.
  • the final self-learning interpolation map is almost an edge map, and the interpolation flow is also concentrated in the object edge area.
  • based on the depth estimation network, the camera pose estimation model and the optical flow estimation network, the reconstructed images from the adjacent image to the current frame image can be obtained, namely the depth-reconstructed image and the optical-flow-reconstructed image.
  • the final loss is L = μL_p + λL_s, where λ and μ are hyperparameters, L_p is the photometric loss error and L_s is the smoothness loss computed from the derivatives of the mean-normalized depth of the current frame.
  • pe(.) compares any two image frames I_a and I_b using the similarity calculation function SSIM(.) and the hyperparameter α, and r is the preset threshold used to build the mask M_a (a sketch of this joint loss is given after this list).
  • M_a denotes the mask applied to the original current frame image (i.e. the original image in Figure 4) according to the optical-flow reconstruction result; it is a mask of 0s and 1s, set according to the magnitude of the difference between the optical-flow-reconstructed image and the actual image (i.e. the image adjacent to the current frame), and it is added as a weight to the original pe(.) loss function: if the difference between the optical-flow-reconstructed image and I_t' is greater than 0.8, that location is considered very likely to be a moving object and is masked.
  • this embodiment uses the estimated optical flow to synthesize the reconstructed image: the optical flow contains the motion between the two adjacent frames, including the rigid motion of the static background of the scene and the non-rigid motion of the moving objects, so the optical-flow-reconstructed image can be synthesized from the flow and the image adjacent to the current frame.
  • the image synthesized in this step therefore accounts for the moving objects in the scene, whereas the formula for the depth-reconstructed image assumes that there are no moving objects, so the depth-reconstructed image only considers the rigid flow part.
  • This embodiment uses the optical flow estimation network to further improve the depth estimation effect of moving objects, which can increase the accuracy of depth estimation.
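To make the joint loss described in the list above concrete, the following is a minimal PyTorch-style sketch of L = μL_p + λL_s with the optical-flow mask M_a. It is an illustration under assumptions: the exact SSIM weighting, the smoothness form, and the names (photometric_error, flow_mask, etc.) are not fixed by the patent and are chosen here for clarity.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    # Simplified SSIM dissimilarity over 3x3 neighbourhoods (assumed form; the patent only names SSIM(.)).
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(i_a, i_b, alpha=0.85):
    # pe(.): SSIM term weighted by alpha plus an L1 term (standard self-supervised form, assumed here).
    return alpha * ssim(i_a, i_b).mean(1, True) + (1 - alpha) * (i_a - i_b).abs().mean(1, True)

def flow_mask(i_flow_recon, i_adjacent, r=0.8):
    # M_a: 0/1 mask that zeroes out pixels whose flow-reconstruction error exceeds the preset threshold r.
    diff = (i_flow_recon - i_adjacent).abs().mean(1, True)
    return (diff <= r).float()

def smoothness_loss(depth, eps=1e-7):
    # L_s from derivatives of the mean-normalized depth (edge weighting omitted; exact form not given in the text).
    d = depth / (depth.mean(dim=(2, 3), keepdim=True) + eps)
    return (d[:, :, :, :-1] - d[:, :, :, 1:]).abs().mean() + \
           (d[:, :, :-1, :] - d[:, :, 1:, :]).abs().mean()

def total_loss(i_t, i_depth_recon, i_flow_recon, i_adjacent, depth, mu=1.0, lam=1e-3):
    m_a = flow_mask(i_flow_recon, i_adjacent)
    l_p = (m_a * photometric_error(i_depth_recon, i_t)).mean()  # masked photometric loss
    l_s = smoothness_loss(depth)
    return mu * l_p + lam * l_s                                  # L = mu*L_p + lambda*L_s
```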

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised monocular depth estimation method based on an optical flow mask. The method specifically comprises: using a depth estimation network to perform depth estimation on image frames; introducing a camera pose estimation model and an optical flow estimation network when training the depth estimation network; reconstructing the current frame according to the optical flow estimated by the optical flow estimation network between two adjacent image frames, to obtain an optical-flow-reconstructed image; reconstructing the current frame according to the pose transformation matrix between the two adjacent image frames estimated by the camera pose estimation model, to obtain a depth-reconstructed image; and establishing, from the depth-reconstructed image and the optical-flow-reconstructed image, a loss function for joint training of the depth estimation network, the camera pose estimation model and the optical flow estimation network. The invention improves the accuracy of depth estimation.

Description

Unsupervised monocular depth estimation method based on an optical flow mask
Technical Field
The present invention belongs to the technical field of image recognition.
Background Art
Understanding and perceiving three-dimensional scenes from video is a fundamental topic that has attracted wide attention. It covers many classic computer vision tasks such as depth recovery, optical flow estimation, and visual odometry. These techniques have broad industrial applications, including autonomous driving platforms, interactive collaborative robots, and localization and navigation systems. Traditional Structure from Motion (SfM) methods handle them jointly, aiming to reconstruct the scene structure and the camera motion simultaneously.
Depth estimation is a fundamental problem in computer vision and can be applied in fields such as robot navigation, augmented reality, three-dimensional reconstruction, and autonomous driving. At present, most depth estimation is based on converting two-dimensional RGB images into RGB-D images, mainly including Shape from X methods that recover scene depth and shape from shading, different viewpoints, photometric and texture information, as well as algorithms that predict camera pose by combining SfM (Structure from Motion) and SLAM (Simultaneous Localization And Mapping). Although many devices can acquire depth directly, such devices are expensive. Binocular depth estimation is also used, but since binocular images require stereo matching for pixel correspondence and disparity calculation, the computational complexity is high, and the matching quality is poor for low-texture scenes in particular.
Summary of the Invention
Purpose of the invention: in order to solve the above problems of the prior art, the present invention provides an unsupervised monocular depth estimation method based on an optical flow mask.
Technical solution: the present invention provides an unsupervised monocular depth estimation method based on an optical flow mask. Specifically, a depth estimation network is used to estimate depth for image frames; a camera pose estimation model and an optical flow estimation network are introduced when training the depth estimation network; according to the optical flow estimated by the optical flow estimation network between two adjacent image frames I_t and I_t' of the video sequence, the current frame image I_t is reconstructed to obtain the optical-flow-reconstructed image, where t' = t-1 or t' = t+1; according to the pose transformation matrix between the two adjacent image frames estimated by the camera pose estimation model, the current frame image is reconstructed to obtain the depth-reconstructed image; on this basis, a loss function L is established to jointly train the depth estimation network, the camera pose estimation model and the optical flow estimation network:
L = μL_p + λL_s
where λ and μ are hyperparameters, L_p is the photometric loss error, and L_s is the smoothness loss; L_s is built from the mean-normalized depth value of the pixel at coordinates (x, y) in the current frame image and its derivatives with respect to x and y.
The expression of L_p uses the function pe(.), in which I_a and I_b denote any two image frames, α is a hyperparameter and SSIM(.) is the similarity calculation function; the mask M_a is defined with a preset threshold r.
Further, the optical flow estimation network processes two adjacent frames I_t and I_t' in the training samples as follows:
Step 1: a pyramid-structured encoder in the optical flow estimation network extracts feature image pairs at n scales between I_t and I_t'; the i-th scale pair consists of the i-th scale feature image of I_t and the i-th scale feature image of I_t', i = 1, 2, ..., n.
Step 2: the pyramid-structured decoder of the optical flow estimation network includes n encoder modules and n upsampling modules. When i = 1, the first-scale feature image pair is input into the first encoder module to obtain the motion optical flow between the pair. When i > 1, the i-th scale feature image pair and the upsampled optical flow output by the (i-1)-th upsampling module are input into the i-th encoder module to obtain the motion optical flow between the pair, which is then input into the i-th upsampling module to obtain the corresponding upsampled optical flow. When i = n, I_t and I_t' are input into a convolution module, and the n-th upsampling module upsamples the output of the convolution module together with the motion optical flow, outputting the final optical flow estimate.
Further, in Step 2, for the feature image pair at the i-th scale and the corresponding optical flow, the corresponding upsampling module performs the following processing:
Step 2.1: bilinear interpolation is used to increase the resolution of the flow, giving the initial optical flow, where p denotes the coordinates of any pixel in the initial optical flow, N(p/s) denotes the four pixels adjacent to the point p/s in the low-resolution flow, s is the scaling ratio, and ω(p/s, k) is the bilinear interpolation weight; the initial optical flow value at pixel p is a weighted combination of the optical flow values of the pixels k.
Step 2.2: an encoder is used to compute the interpolation flow between the feature pair, and the interpolation flow is used to warp the initial optical flow; here N(d) denotes the four pixels adjacent to pixel d in the initial optical flow, the optical flow values of the pixels k' in the initial flow are used, the interpolation flow at pixel p performs the transformation, and ω(d, k') denotes the weight.
Step 2.3: the two flows are fused according to a learned interpolation map to obtain the output of the corresponding upsampling module, where ⊙ denotes the element-wise product.
Further, the depth estimation network adopts a ResNet network.
Beneficial effects: the present invention designs an unsupervised monocular depth estimation method based on an optical flow mask. A pyramid structure is used to estimate optical flow at different granularities, and an upsampling module with interpolation flow is added, which alleviates the bilinear-interpolation mixing problem in motion boundary regions. The image reconstructed from the optical flow estimate is then compared with the current image; regions with large differences are regarded as self-moving objects and are masked during the depth-estimation reconstruction, reducing the influence of moving objects on depth estimation and improving its accuracy. Overall, the invention achieves depth estimation of images and a partial improvement in depth estimation accuracy.
Brief Description of the Drawings
Figure 1(a) is the structure diagram of the depth estimation network;
Figure 1(b) is the layer-wise parameter setting diagram of the depth estimation network;
Figure 2 is a schematic diagram of the camera pose estimation model;
Figure 3 is a schematic diagram of the pyramid-structured decoder in the optical flow estimation network;
Figure 4 is a schematic diagram of the overall training architecture of the present invention using optical flow as a mask.
Detailed Description of the Embodiments
The accompanying drawings, which form a part of the present invention, are provided for further understanding of the invention; the illustrative embodiments and their descriptions serve to explain the invention and do not constitute an undue limitation of it.
The unsupervised monocular depth estimation method based on an optical flow mask provided by the present invention obtains the depth estimation network and the camera pose estimation model according to the following steps S1 to S5, and then applies these two models to perform depth estimation on an image, obtaining the depth estimate D_t.
S1. The KITTI data set is obtained, using the standard raw data files (about 180 GB of data in total, divided into the four sequences Road, City, Residential and Person). The depth estimation network uses the raw sensor images as input, the calibration files are used to read the camera intrinsics, and the velodyne_points files are used to read the lidar data as ground truth.
S2. In this embodiment the depth estimation network adopts a ResNet network; based on the ResNet network, each frame of the video is taken as input and the depth value of every pixel of the image is estimated.
S3. The camera pose estimation model takes two consecutive video frames as input and outputs the estimated pose transformation matrix T_t'-t, which consists of two parts: the rotation transformation of the camera and the translation transformation of the camera.
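As an illustration of how these two parts can be assembled, the sketch below converts an assumed 6-dimensional pose output (three Euler rotation angles and a 3-vector translation) into a 4×4 transformation matrix; the patent does not fix this parameterization, so the Euler-angle choice here is an assumption.

```python
import torch

def pose_vec_to_mat(rot, trans):
    """rot: (B, 3) Euler angles (rx, ry, rz); trans: (B, 3). Returns (B, 4, 4) transforms."""
    b = rot.shape[0]
    rx, ry, rz = rot[:, 0], rot[:, 1], rot[:, 2]
    zeros, ones = torch.zeros_like(rx), torch.ones_like(rx)
    cx, sx = rx.cos(), rx.sin()
    cy, sy = ry.cos(), ry.sin()
    cz, sz = rz.cos(), rz.sin()

    rot_x = torch.stack([ones, zeros, zeros,
                         zeros, cx, -sx,
                         zeros, sx, cx], dim=1).view(b, 3, 3)
    rot_y = torch.stack([cy, zeros, sy,
                         zeros, ones, zeros,
                         -sy, zeros, cy], dim=1).view(b, 3, 3)
    rot_z = torch.stack([cz, -sz, zeros,
                         sz, cz, zeros,
                         zeros, zeros, ones], dim=1).view(b, 3, 3)
    r = rot_z @ rot_y @ rot_x                      # camera rotation

    t = torch.eye(4, device=rot.device).repeat(b, 1, 1)
    t[:, :3, :3] = r                               # rotation part
    t[:, :3, 3] = trans                            # translation part
    return t
```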
As shown in Figure 1(a), Figure 1(b) and Figure 2, the encoding sub-module for depth estimation uses the ResNet network; the residual learning structure is used to prevent degradation, and a feed-forward network with shortcut connections is used so that feature maps with richer semantic information are output during encoding. The specific steps are as follows:
S21: the encoder in the ResNet network takes a single image as input and outputs a feature map of dimension C*H*W, where C is the number of channels, H is the height of the feature map and W is its width. In this embodiment the ResNet network outputs five levels of features; the higher the level, the lower the spatial resolution of the features, the stronger the representation ability and the larger the number of features.
After the input image, it first passes through the first 7×7 convolutional layer of the ResNet network, with 64 output channels, a stride of 2 and a padding of 3; it then passes through a 3×3 max pooling layer with a stride of 2 and a padding of 1. Apart from the max pooling layer, all other downsampling is implemented with convolutional layers, organized into four convolution groups: layer1, layer2, layer3 and layer4. Except for layer1, whose downsampling is performed by the max pooling layer, the downsampling of the other layers is implemented by the residual block adjoining the previous convolution group.
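A minimal sketch of how the five feature levels described above can be collected from a standard ResNet encoder (torchvision's resnet50 is assumed here for illustration, since its bottleneck residual blocks match the 1×1/3×3/1×1 structure described next; the patent only states that a ResNet is used):

```python
import torch
import torch.nn as nn
import torchvision

class ResNetEncoder(nn.Module):
    """Collects the five feature levels: conv1+relu, layer1, layer2, layer3, layer4."""
    def __init__(self):
        super().__init__()
        self.net = torchvision.models.resnet50(weights=None)

    def forward(self, x):
        feats = []
        x = self.net.relu(self.net.bn1(self.net.conv1(x)))  # 7x7 conv, stride 2, 64 channels
        feats.append(x)                                      # level 1 (1/2 resolution)
        x = self.net.maxpool(x)                              # 3x3 max pool, stride 2
        for layer in (self.net.layer1, self.net.layer2, self.net.layer3, self.net.layer4):
            x = layer(x)                                     # residual convolution groups
            feats.append(x)                                  # levels 2-5 (1/4 ... 1/32 resolution)
        return feats

feats = ResNetEncoder()(torch.randn(1, 3, 192, 640))
print([f.shape for f in feats])
```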
In the residual structure of the ResNet network, the main branch uses three convolutional layers: a 1×1 convolutional layer to compress the channel dimension, a 3×3 convolutional layer, and a 1×1 convolutional layer to restore the channel dimension.
S22: the decoder uses upsampling, combined with the features output by the encoder of the ResNet network, to perform depth estimation and obtain depth estimates at the preset scales.
For an input feature map, it is first upsampled by a factor of two, copying each pixel of the feature map to its row and column so that one pixel produces a 2×2 output; a convolution that keeps the resolution unchanged then adjusts the number of channels to one half, so that the channel count is halved while the resolution is preserved. The upsampled feature map is skip-connected with the feature map output by the encoder, and a disparity map with the corresponding number of channels is output; finally, the depth estimate is obtained through two 3×3 convolutional layers and a sigmoid activation function.
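The decoder step just described (×2 upsampling that copies each pixel into a 2×2 block, a convolution that halves the channels, a skip connection to the encoder feature, and a two-convolution sigmoid head) could look like the sketch below; layer sizes and names are illustrative rather than taken from the patent figures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, in_ch // 2, 3, padding=1)           # halve channels, keep resolution
        self.fuse = nn.Conv2d(in_ch // 2 + skip_ch, in_ch // 2, 3, padding=1)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")               # copy each pixel into a 2x2 block
        x = F.relu(self.reduce(x))
        x = torch.cat([x, skip], dim=1)                                    # skip connection to encoder feature
        return F.relu(self.fuse(x))

class DepthHead(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):
        return torch.sigmoid(self.conv2(F.relu(self.conv1(x))))            # two 3x3 convs + sigmoid
```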
According to the photometric consistency principle, for the same object the external environment hardly changes over a short time, so the photometric appearance of the same object is consistent across adjacent frames separated by a short interval. Hence the depth-reconstructed image is obtained from the depth estimation network and the camera pose estimation model, with t' = t+1 or t' = t-1, where t denotes the t-th frame; the reconstructed photometric loss error can then be computed and propagated back into the two networks to train the depth estimation network and the camera pose estimation model and improve the accuracy of the estimates. On top of this loss, this embodiment further adds depth-estimation smoothing as a regularization term and a structural similarity (SSIM) loss, which yields better depth estimation results.
The depth-reconstructed image is based on the principle that the image transformation is produced entirely by the motion of the camera; the reconstruction uses the results estimated by the depth estimation network and by the camera pose estimation model. In real scenes, however, most scenes contain self-moving objects, and reconstructing with this method introduces errors: a large gap between the reconstructed image and the original current frame image I_t may not be due to an erroneous depth estimate, but because pure camera motion cannot correctly reconstruct a moving object, so that even a correct depth-reconstructed image differs greatly from the current frame, ultimately making the depth estimation result inaccurate. Based on this observation, an optical flow estimation network is added during training, and the optical-flow-reconstructed image is added to the loss computation of the depth estimation to estimate the motion of moving objects; the optical-flow-reconstructed image serves as part of the constraint on the depth estimation of moving objects, and the difference between the optical-flow-reconstructed image and the current frame image is used as a constraint in the loss calculation.
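The depth-and-pose reconstruction referred to here (warping the adjacent frame into the current view using the estimated depth, the pose T_t'-t and the camera intrinsics K) can be sketched as follows; this is the standard inverse-warping formulation assumed from the context, with illustrative names.

```python
import torch
import torch.nn.functional as F

def depth_reconstruct(i_adj, depth, T, K, K_inv):
    """Warp adjacent frame i_adj (B,3,H,W) into the current view using depth (B,1,H,W),
    pose T (B,4,4) and intrinsics K, K_inv (B,3,3)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()        # homogeneous pixel grid (3,H,W)
    pix = pix.view(1, 3, -1).expand(b, -1, -1).to(depth.device)

    cam = K_inv @ pix * depth.view(b, 1, -1)                               # back-project to 3-D camera points
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w, device=depth.device)], dim=1)
    proj = K @ (T @ cam_h)[:, :3, :]                                       # rigid transform + projection
    uv = proj[:, :2, :] / (proj[:, 2:3, :] + 1e-7)

    u = 2 * uv[:, 0, :] / (w - 1) - 1                                      # normalize to [-1, 1] for grid_sample
    v = 2 * uv[:, 1, :] / (h - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(b, h, w, 2)
    return F.grid_sample(i_adj, grid, padding_mode="border", align_corners=True)
```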
S4. The optical flow estimation network takes two consecutive video frames as input and outputs the estimated motion optical flow between the two frames, which expresses the motion of each pixel of one image towards the next image; different colors and brightness represent the magnitude and direction of the flow.
The pyramid structure of the optical flow estimation network is used to estimate the motion optical flow between the two consecutive frames and obtain the optical-flow-reconstructed image. Step S4 specifically includes the following steps:
S41: a pyramid structure is generally adopted in the optical flow estimation network to capture global and local motion from coarse to fine granularity. The two adjacent images I_t and I_t' are input into the optical flow estimation network; H denotes the optical flow estimation network with parameters θ, and V_f denotes the forward flow field that moves each pixel in I_t to its corresponding pixel in I_t'.
The optical flow estimation model H has a pyramid structure divided into two stages: pyramid encoding and pyramid decoding. In the encoding stage, two consecutive frames are taken as the input image pair and, after passing through different convolutional layers, feature image pairs at n scales are extracted; the i-th scale pair consists of the i-th scale feature image of I_t and the i-th scale feature image of I_t', i = 1, 2, ..., n (n = 5 in this embodiment).
As shown in Figure 3, for the feature image pair at the first scale (i.e. when i = 1), the decoding stage applies the first decoder module D to this pair and estimates from coarse to fine, obtaining the motion optical flow between the feature pair at i = 1; the first upsampling module S upsamples this motion optical flow to obtain the upsampled optical flow at i = 1. When i > 1, the i-th scale feature image pair and the upsampled optical flow output by the (i-1)-th upsampling module are input into the i-th encoder module to obtain the motion optical flow between the pair, which is input into the i-th upsampling module to obtain the corresponding upsampled optical flow. When i = n, I_t and I_t' are input into a convolution module, and the n-th upsampling module upsamples the output of the convolution module together with the motion optical flow, outputting the final optical flow estimate.
In practical applications, considering efficiency, using five scales for optical flow estimation usually works best. The implementation logic applies the decoder module D(.) and the upsampling module S(.) recursively across the scales, where S(.) is the upsampling module S and D(.) is the decoder module D.
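The coarse-to-fine recursion over the scales can be sketched as the loop below, with the decoder modules D_i and upsampling modules S_i passed in as placeholders; module internals are not shown, since the patent describes them only at the block-diagram level.

```python
import torch.nn as nn

class PyramidFlowDecoder(nn.Module):
    def __init__(self, decoders, upsamplers):
        super().__init__()
        self.decoders = nn.ModuleList(decoders)      # D_1 ... D_n, one per scale
        self.upsamplers = nn.ModuleList(upsamplers)  # S_1 ... S_n (self-guided upsampling)

    def forward(self, feats_t, feats_t1):
        """feats_t / feats_t1: lists of per-scale features of I_t and I_t', coarsest first."""
        up_flow = None
        for i, (f_t, f_t1) in enumerate(zip(feats_t, feats_t1)):
            if i == 0:
                flow = self.decoders[i](f_t, f_t1)           # coarsest scale: no prior flow
            else:
                flow = self.decoders[i](f_t, f_t1, up_flow)  # refine using the upsampled flow
            up_flow = self.upsamplers[i](flow, f_t)          # upsample towards the next, finer scale
        return up_flow                                       # final optical flow estimate
```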
S42: the upsampling module in this embodiment is a self-guided upsampling module; this embodiment improves on the mixed interpolation caused by boundary blending in bilinear upsampling.
When a pyramid structure is used for optical flow estimation, an upsampling module is needed to upsample the flow estimated on small images, and bilinear interpolation is generally used. Near a motion boundary, however, where the motion on the two sides is inconsistent, bilinear interpolation mixes the two motions: the region of motion 1 close to the boundary is affected by motion 2, and the region of motion 2 close to the boundary is affected by the interpolation of motion 1, producing a mixed-interpolation phenomenon. In reality the two regions belong to two different motions and should not be interfered with by the other motion region.
To avoid this phenomenon, a self-guided upsampling module is used in the upsampling process: for a motion boundary region, nearby points with the same motion direction are used for the interpolation calculation, after which the value is moved by the learned interpolation flow, and the region finally moved to the boundary position serves as the interpolation point of that region.
For the motion optical flow corresponding to the feature image pair at the (i-1)-th scale (which is at low resolution at this point), its resolution is first increased by bilinear interpolation to generate the initial optical flow,
where p denotes the coordinates of any pixel in the initial optical flow, N(p/s) denotes the four pixels adjacent to the point p/s in the low-resolution flow, s is the scaling ratio, and ω(p/s, k) is the linear interpolation weight; the initial optical flow value at pixel p is a weighted combination of the optical flow values of the pixels k in the motion optical flow.
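In code, generating the initial optical flow amounts to bilinearly interpolating the low-resolution flow and rescaling its values by the magnification s, for example:

```python
import torch.nn.functional as F

def upsample_flow_bilinear(flow, s=2):
    """flow: (B, 2, h, w) low-resolution optical flow; returns the initial flow at s times the resolution.
    Flow vectors are multiplied by s because displacements are measured in pixels of the larger grid."""
    up = F.interpolate(flow, scale_factor=s, mode="bilinear", align_corners=True)
    return up * s
```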
Then the corresponding interpolation flow is computed from the features (in this embodiment an encoder is used to compute the interpolation flow), and the interpolation flow is used to warp the initial optical flow. The initial optical flow is the result of bilinear interpolation, but this interpolation turns the edge region into a mixture of the motions on both sides, which does not match reality; therefore the interpolation flow is used to transform points near the edge. If an edge point d can be obtained from a point p in the same motion region through the interpolation flow, bilinear interpolation is computed over the four points around p,
where N(d) denotes the four pixels adjacent to pixel d in the initial optical flow, the optical flow values of the pixels k' in the flow are used, the interpolation flow at pixel p performs the transformation, and ω(d, k') denotes the weight.
Because mixed interpolation only occurs at object edges, there is no need to learn the interpolation flow in non-edge regions. An interpolation map is therefore used to force the model to learn the interpolation flow only at edges, and the final output of the upsampling module is the fusion of the two flows,
where ⊙ is the element-wise weighted product. In this embodiment, a dense block with five convolutional layers is used to produce the interpolation flow and the interpolation map. Concretely, the bilinearly upsampled flow and the features are concatenated as the input of the dense block; the numbers of convolution kernels of the convolutional layers in the dense block are 32, 32, 32, 16 and 8 in turn, and the output of the dense block is a 3-channel tensor map. The first two channels of the tensor map are used as the interpolation flow, and the last channel passes through a sigmoid layer to form the interpolation map; the final self-learned interpolation map is almost an edge map, and the interpolation flow is also concentrated in object edge regions.
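A sketch of the self-guided upsampling module described above: a five-layer dense block (32, 32, 32, 16, 8 kernels, then a 3-channel head) predicts the interpolation flow and, through a sigmoid, the interpolation map; the initial flow is warped by the interpolation flow and fused with the bilinear flow. The fusion rule and the warping convention are written in an assumed standard form, since the patent's formulas are given only in its figures, and the feature map is assumed to already be at the upsampled resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_with_flow(x, flow):
    # Sample x at positions displaced by flow (assumed backward-warping convention).
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(x.device).unsqueeze(0)
    coords = base + flow
    u = 2 * coords[:, 0] / (w - 1) - 1
    v = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(x, torch.stack([u, v], dim=-1), align_corners=True)

class SelfGuidedUpsampler(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        chs, layers, in_ch = [32, 32, 32, 16, 8], [], feat_ch + 2
        for c in chs:                                   # dense block: each layer sees all earlier outputs
            layers.append(nn.Sequential(nn.Conv2d(in_ch, c, 3, padding=1), nn.LeakyReLU(0.1)))
            in_ch += c
        self.dense = nn.ModuleList(layers)
        self.head = nn.Conv2d(in_ch, 3, 3, padding=1)   # 3 channels: 2 = interpolation flow, 1 = interpolation map

    def forward(self, low_flow, feat, s=2):
        init_flow = F.interpolate(low_flow, scale_factor=s, mode="bilinear", align_corners=True) * s
        x = torch.cat([init_flow, feat], dim=1)
        for layer in self.dense:
            x = torch.cat([x, layer(x)], dim=1)
        out = self.head(x)
        interp_flow, interp_map = out[:, :2], torch.sigmoid(out[:, 2:3])
        warped = warp_with_flow(init_flow, interp_flow)            # move values along the interpolation flow
        return interp_map * warped + (1 - interp_map) * init_flow  # fuse the two flows (assumed fusion rule)
```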
S5. As shown in Figure 4, based on the depth estimation network, the camera pose estimation model and the optical flow estimation network, the reconstructed images from the adjacent image to the current frame image can be obtained, namely the depth-reconstructed image and the optical-flow-reconstructed image.
The final loss function is L = μL_p + λL_s,
where λ and μ are hyperparameters, L_p is the photometric loss error and L_s is the smoothness loss; L_s is built from the mean-normalized depth value of the pixel at coordinates (x, y) in the current frame image and its derivatives with respect to x and y.
The expression of L_p uses the function pe(.),
whose original expression involves I_a and I_b, which denote any two image frames, the hyperparameter α and the similarity calculation function SSIM(.).
In this embodiment the optical-flow-reconstructed image is added into the function pe(.), so that pe(.) additionally contains the mask M_a, which is defined with a preset threshold r.
Here M_a denotes the mask applied to the estimate of the original current frame image (i.e. the original image in Figure 4) according to the optical-flow reconstruction result. It is a mask of 0s and 1s, set according to the magnitude of the difference between the optical-flow-reconstructed image and the actual image (i.e. the image adjacent to the current frame), and it is then added as a weight to the original pe(.) loss function: if the difference between the optical-flow-reconstructed image and I_t' is greater than 0.8, that location is considered very likely to be a moving object and is masked.
This embodiment uses the estimated optical flow to synthesize the reconstructed image. Because the optical flow contains the motion between the two adjacent frames, including the rigid motion of the static background of the whole scene and the non-rigid motion of the moving objects in the scene, the optical-flow-reconstructed image can be synthesized from the optical flow and the image adjacent to the current frame; the image synthesized in this step takes the moving objects in the scene into account. The formula of the depth-reconstructed image, by contrast, assumes that there are no moving objects in the scene, so the depth-reconstructed image only considers the rigid flow part. By using the optical flow estimation network, this embodiment further improves the depth estimation of moving objects and increases the accuracy of depth estimation.
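Synthesizing the optical-flow-reconstructed image is again a warping operation, this time driven by the estimated flow rather than by depth and pose; a minimal sketch, assuming the same backward-warping convention as above, is:

```python
import torch
import torch.nn.functional as F

def flow_reconstruct(i_adjacent, flow):
    """Warp the adjacent frame (B,3,H,W) towards the current frame using optical flow (B,2,H,W)."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(flow.device).unsqueeze(0)   # pixel grid
    coords = base + flow                                                        # follow the flow vectors
    u = 2 * coords[:, 0] / (w - 1) - 1                                          # normalize for grid_sample
    v = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack([u, v], dim=-1)
    return F.grid_sample(i_adjacent, grid, padding_mode="border", align_corners=True)
```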
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments; various changes can be made within the knowledge of a person of ordinary skill in the art without departing from the spirit of the present invention.

Claims (4)

  1. An unsupervised monocular depth estimation method based on an optical flow mask, characterized in that the method specifically comprises: using a depth estimation network to perform depth estimation on image frames; introducing a camera pose estimation model and an optical flow estimation network when training the depth estimation network; reconstructing the current frame image I_t according to the optical flow estimated by the optical flow estimation network between two adjacent image frames I_t and I_t' in the video sequence, to obtain the optical-flow-reconstructed image, where t' = t-1 or t' = t+1; reconstructing the current frame image according to the pose transformation matrix between the two adjacent image frames estimated by the camera pose estimation model, to obtain the depth-reconstructed image; and, on this basis, establishing a loss function L to jointly train the depth estimation network, the camera pose estimation model and the optical flow estimation network:
    L = μL_p + λL_s
    where λ and μ are hyperparameters, L_p is the photometric loss error, and L_s is the smoothness loss; L_s is expressed in terms of the mean-normalized depth value of the pixel at coordinates (x, y) in the current frame image and its derivatives with respect to x and y;
    L_p is expressed in terms of the function pe(.), in which I_a and I_b denote any two image frames, α is a hyperparameter, SSIM(.) is the similarity calculation function, and M_a is a mask,
    where r is a preset threshold.
  2. The unsupervised monocular depth estimation method based on an optical flow mask according to claim 1, characterized in that the optical flow estimation network processes two adjacent frames I_t and I_t' in the training samples as follows:
    Step 1: a pyramid-structured encoder in the optical flow estimation network extracts feature image pairs at n scales between I_t and I_t'; the i-th scale pair consists of the i-th scale feature image of I_t and the i-th scale feature image of I_t', i = 1, 2, ..., n;
    Step 2: the pyramid-structured decoder of the optical flow estimation network includes n encoder modules and n upsampling modules; when i = 1, the first-scale feature image pair is input into the first encoder module to obtain the motion optical flow between the pair; when i > 1, the i-th scale feature image pair and the upsampled optical flow output by the (i-1)-th upsampling module are input into the i-th encoder module to obtain the motion optical flow between the pair, which is input into the i-th upsampling module to obtain the corresponding upsampled optical flow; when i = n, I_t and I_t' are input into a convolution module, and the n-th upsampling module upsamples the output of the convolution module together with the motion optical flow, outputting the final optical flow estimate.
  3. The unsupervised monocular depth estimation method based on an optical flow mask according to claim 2, characterized in that in Step 2, for the feature image pair at the i-th scale and the corresponding motion optical flow, the corresponding upsampling module performs the following processing:
    Step 2.1: bilinear interpolation is used to increase the resolution of the motion optical flow, giving the initial optical flow,
    where p denotes the coordinates of any pixel in the initial optical flow, N(p/s) denotes the four pixels adjacent to the point p/s in the low-resolution flow, s is the scaling ratio, and ω(p/s, k) is the bilinear interpolation weight; the initial optical flow value at pixel p is a weighted combination of the optical flow values of the pixels k;
    Step 2.2: an encoder is used to compute the interpolation flow between the feature pair, and the interpolation flow is used to warp the initial optical flow,
    where N(d) denotes the four pixels adjacent to pixel d in the initial optical flow, the optical flow values of the pixels k' in the initial optical flow are used, the interpolation flow at pixel p performs the transformation, and ω(d, k') denotes the weight;
    Step 2.3: the two flows are fused according to the interpolation map to obtain the output of the corresponding upsampling module,
    where ⊙ denotes the product.
  4. The unsupervised monocular depth estimation method based on an optical flow mask according to claim 1, characterized in that the depth estimation network adopts a ResNet network.
PCT/CN2023/092180 2022-09-07 2023-05-05 Unsupervised monocular depth estimation method based on an optical flow mask WO2024051184A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211091218.3A CN115187638B (zh) 2022-09-07 2022-09-07 Unsupervised monocular depth estimation method based on an optical flow mask
CN202211091218.3 2022-09-07

Publications (1)

Publication Number Publication Date
WO2024051184A1 (zh)

Family

ID=83522691

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092180 WO2024051184A1 (zh) 2022-09-07 2023-05-05 Unsupervised monocular depth estimation method based on an optical flow mask

Country Status (2)

Country Link
CN (1) CN115187638B (zh)
WO (1) WO2024051184A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187638B (zh) * 2022-09-07 2022-12-27 南京逸智网络空间技术创新研究院有限公司 一种基于光流遮罩的无监督单目深度估计方法
CN116228834B (zh) * 2022-12-20 2023-11-03 阿波罗智联(北京)科技有限公司 图像深度获取方法、装置、电子设备及存储介质
CN116452638B (zh) * 2023-06-14 2023-09-08 煤炭科学研究总院有限公司 位姿估计模型的训练方法、装置、设备和存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490928A (zh) * 2019-07-05 2019-11-22 天津大学 一种基于深度神经网络的相机姿态估计方法
CN110782490A (zh) * 2019-09-24 2020-02-11 武汉大学 一种具有时空一致性的视频深度图估计方法及装置
CN111105432A (zh) * 2019-12-24 2020-05-05 中国科学技术大学 基于深度学习的无监督端到端的驾驶环境感知方法
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN114693720A (zh) * 2022-02-28 2022-07-01 苏州湘博智能科技有限公司 基于无监督深度学习的单目视觉里程计的设计方法
CN115187638A (zh) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 一种基于光流遮罩的无监督单目深度估计方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127557B (zh) * 2019-12-13 2022-12-13 中国电子科技集团公司第二十研究所 一种基于深度学习的视觉slam前端位姿估计方法
CN112991450B (zh) * 2021-03-25 2022-11-01 武汉大学 一种基于小波的细节增强无监督深度估计方法


Also Published As

Publication number Publication date
CN115187638B (zh) 2022-12-27
CN115187638A (zh) 2022-10-14

Similar Documents

Publication Publication Date Title
CN111325794B (zh) 一种基于深度卷积自编码器的视觉同时定位与地图构建方法
CN111739078B (zh) 一种基于上下文注意力机制的单目无监督深度估计方法
WO2024051184A1 (zh) 一种基于光流遮罩的无监督单目深度估计方法
Zhu et al. Unsupervised event-based learning of optical flow, depth, and egomotion
Mitrokhin et al. EV-IMO: Motion segmentation dataset and learning pipeline for event cameras
CN110443842B (zh) 基于视角融合的深度图预测方法
CN110490919B (zh) 一种基于深度神经网络的单目视觉的深度估计方法
CN111105432B (zh) 基于深度学习的无监督端到端的驾驶环境感知方法
CN110782490A (zh) 一种具有时空一致性的视频深度图估计方法及装置
CN111783582A (zh) 一种基于深度学习的无监督单目深度估计算法
Qu et al. Depth completion via deep basis fitting
CN111902826A (zh) 定位、建图和网络训练
CN113313732A (zh) 一种基于自监督学习的前视场景深度估计方法
CN113850900B (zh) 三维重建中基于图像和几何线索恢复深度图的方法及系统
Wang et al. Depth estimation of video sequences with perceptual losses
Hwang et al. Self-supervised monocular depth estimation using hybrid transformer encoder
Hwang et al. Lidar depth completion using color-embedded information via knowledge distillation
CN114677479A (zh) 一种基于深度学习的自然景观多视图三维重建方法
Wang et al. Unsupervised learning of 3d scene flow from monocular camera
Yang et al. SAM-Net: Semantic probabilistic and attention mechanisms of dynamic objects for self-supervised depth and camera pose estimation in visual odometry applications
Shi et al. PanoFlow: Learning 360° optical flow for surrounding temporal understanding
CN116188550A (zh) 一种基于几何约束的自监督深度视觉里程计
Zhang et al. Self-supervised monocular depth estimation with self-perceptual anomaly handling
Wang et al. Cbwloss: constrained bidirectional weighted loss for self-supervised learning of depth and pose
CN115731280A (zh) 基于Swin-Transformer和CNN并行网络的自监督单目深度估计方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23861890

Country of ref document: EP

Kind code of ref document: A1