CN111028282A

CN111028282A - Unsupervised pose and depth calculation method and system

Info

Publication number: CN111028282A
Application number: CN201911196111.3A
Authority: CN
Inventors: 蔡行; 张兰清; 李承远; 王璐瑶; 李宏
Original assignee: Advanced Institute of Information Technology AIIT of Peking University; Hangzhou Weiming Information Technology Co Ltd
Current assignee: Advanced Institute of Information Technology AIIT of Peking University; Hangzhou Weiming Information Technology Co Ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-04-17
Also published as: CN111325784A

Abstract

The invention discloses an unsupervised pose and depth calculation method and a system, wherein the unsupervised pose and depth calculation method mainly adopts the following modules: the pose prediction network model TNet, the depth estimation network model DMNet, the visual reconstruction model V and the error loss function module; calculating a forward motion relative pose and a backward motion relative pose, calculating a depth estimation result of an image and a corresponding depth of the image, summing a reconstruction error, a smoothing error and a twinning consistency error to obtain a loss function, carrying out iterative updating until the loss function is converged, and finally calculating a camera relative pose and a prediction depth map according to the trained model Tnet and the trained model DNet.

Description

Unsupervised pose and depth calculation method and system

Technical Field

The invention belongs to the fields of SLAM (simultaneous Localization And mapping) And SfM (structural from motion), And particularly relates to an unsupervised pose And depth calculation method And system.

Background

In recent years, monocular dense depth estimation based on a depth learning method and algorithms of visual odometry vo (visual odometry) have rapidly developed, and are also key modules of SfM and SLAM systems. Studies have shown that VO and depth estimation based on supervised depth learning achieve good performance in many challenging environments and mitigate performance degradation problems such as scale drift. However, in practical applications it is difficult and expensive to train these supervised models to obtain sufficient data with authentic signatures. In contrast, the unsupervised approach has the great advantage that only unlabeled video sequences are required.

Depth unsupervised models of depth and pose estimation typically employ two modules, one of which predicts the depth map and the other of which estimates the relative pose of the camera. And then, after the image is projected and transformed from the source image to the target image by using the estimated depth map and the estimated posture, the models are trained in an end-to-end mode by using photometric error loss as an optimization target. However, the prior art rarely considers the following key problems: the VO is time-sequenced, the defect that the unmanned data set only has a single motion direction is ignored, the model can only process the motion in a single direction, and the motion constraint of the forward direction and the backward direction is not utilized. The existing model does not consider the complexity of the model, has large parameter quantity, and is difficult to be suitable for the practical application scene of VO.

Disclosure of Invention

The working principle of the invention is as follows: and (3) by utilizing a Twin pose network model and the time sequence information of ConvLSTM learning data, improving a depth estimation network, and providing DispNet (visibility Mobile Net) to enable the pose and depth estimation accuracy to reach higher levels.

In order to solve the above problems, the present invention provides an unsupervised absolute scale calculation method and system.

The technical scheme adopted by the invention is as follows:

an unsupervised pose and depth calculation method comprises a pose network model TNet, a depth network model DNet, an image visual reconstruction model V and a loss function, and comprises the following steps:

s1, preparing a monocular video data set;

s2, extracting continuous images from the monocular video data set in the step S1, sequentially inputting adjacent images into the position network model TNet to obtain a common feature F between the images, inputting the feature F into the position network model TNet, and respectively obtaining a forward motion relative position and a backward motion relative position;

s3, inputting the continuous images in the step S2 into a depth network model DNet, and obtaining a depth estimation result of the images and the corresponding depth of the images through forward propagation;

s4, inputting the continuous images, the forward motion relative pose, the backward motion relative pose and the image corresponding depth in the S2 into an image visual reconstruction model V to obtain a distorted image;

s5, calculating the reconstruction error between the distorted image and the continuous image in S2, calculating the smooth error of the depth estimation result, and calculating the twin consistency error;

s6, obtaining a loss function through the summation of the reconstruction error, the smoothing error and the twin consistency error, carrying out reverse propagation, and carrying out iterative updating until the loss function is converged;

and S7, forecasting, and respectively propagating forward by using the pose network model Tnet and the depth network model DNet to calculate the relative pose of the camera and a forecast depth map.

A brand new twin module is adopted to simultaneously process forward and backward motion of a video sequence, and meanwhile, forward and backward motion is restrained by utilizing a time sequence consistency error item under the constraint of reversal consistency, so that the pose estimation accuracy is greatly improved; by adopting the DispmNet model based on the MobileNet structure, the parameter quantity is reduced by 37%, and meanwhile, the depth estimation accuracy of the model is improved.

Further, the calculation formula of the reconstruction error lreprjection between the warped image in step S5 and the consecutive image in step S2 is:

L_reprojection＝α*L_photometric+(1-α)*L_ssim

where Lphotometric is the photometric error, Lssim is the inter-image similarity, and α is the weight coefficient.

Further, the Lphotomeric is:

where It Is the continuous image, Is the warped image, and L Is the number of continuous image images minus 1.

Further, Lssim is:

where It Is a continuous image and Is a warped image.

Further, the twin consistency error Ltwin in step S6 is:

wherein, I is a unit matrix, L is the number of continuous images minus 1, and T is a relative pose.

Further, the loss function in step S6 is:

L_Total＝L_reprojection+β*LS_mooth+γ*L_Twin

where lreprejection is a reconstruction error, Lsoooth is a smoothing error of the depth estimation result, and β and γ are weight coefficients.

Further, the loss function in step S6 is trained by using Adam optimization method.

A system for unsupervised pose and depth calculation comprises a pose network module TNet, a depth network module DNet, an image vision reconstruction module V and a loss function module; the position and pose network module TNet carries out position and pose estimation, the depth network module DNet carries out depth estimation, the image visual reconstruction module V carries out image projection, and the position and pose network module TNet and the depth network module DNet are restrained through the loss function module.

Preferably, the module TNet comprises an encoder and a twin module, the encoder comprises a convolutional layer and an activation function, the twin module comprises a pose prediction module with the same structure, and the pose prediction module comprises ConvLstm and a convolutional layer; the module DNet comprises an encoder comprising a convolutional layer and a Dwise, and a decoder comprising an anti-convolutional layer, a convolutional layer and a Dwise.

Compared with the prior art, the invention has the following advantages and effects:

1. a novel unsupervised framework for monocular vision and depth estimation is provided, and a pose network model of the framework adopts time sequence information using ConvLSTM learning data to improve pose estimation accuracy.

2. The pose network adopts a brand new twin module, simultaneously processes forward and backward movement of a video sequence, and simultaneously restrains forward and backward movement by utilizing a time sequence consistency error item under the constraint of reversal consistency, thereby greatly improving the pose estimation accuracy.

3. The DispmNet model based on the MobileNet structure is provided, the parameter quantity is reduced by 37%, and meanwhile, the depth estimation accuracy of the model is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a block diagram of a model TNet of the present invention;

FIG. 3 is a block diagram of a model DMNet of the present invention;

FIG. 4 is a comparison of the depth map results of the present invention with the GrountTruth algorithm, SfmLearner algorithm;

FIG. 5 is a comparison of pose estimation results of the present invention with other algorithms;

FIG. 6 is a comparison of depth estimation results of the present invention with other algorithms.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Example 1:

as shown in fig. 1-6, an unsupervised pose and depth calculation method mainly employs the following modules: the system comprises a pose prediction network model TNet, a depth estimation network model DMNet, a visual reconstruction model V and an error loss function module. The TNet model comprises an encoder and a twin module, wherein the encoder comprises 7 convolutional layers, an activation function is connected behind each convolutional layer, and the sizes of convolutional kernels are 7, 5, 3 and 3 respectively; the twin module comprises two sub-network modules with the same structure, and the two sub-network modules are respectively used for processing pose prediction in forward movement or backward movement, and each sub-module is composed of a ConvLstm layer and a convolution layer Conv with the convolution kernel size of 1. DMNet contains encoder, decoder, tie layer triplex, and wherein the encoder comprises 7 layers of convolution module, and each convolution module specifically contains: convolutional layers (convolutional kernel size 1x1, Relu (activation function)), Dwise (3x3, Relu), convolutional layers (1x1, Relu), Dwise (3x3, Relu), convolutional layers (1x1, Relu); the decoder comprises 6 layers of deconvolution modules, and each deconvolution module specifically comprises: deconvolution (convolution kernel size 3x3, Relu), convolution (1x1, Relu), Dwise (3x3, Relu), convolution (1x1, Relu); the connection layer is used for transmitting the network shallow feature to a back-end decoder and cascading with the back-end feature.

Step 1, accurately obtaining monocular video sequences, such as KITTI unmanned data set, EuRoc data set, TUM data set and Oxford data set.

Step 2, each time a video segment V with a fixed frame length is taken, two adjacent frames are input into the pose network in sequence, for example, the length of the video segment V is 5 frames, and two adjacent frames (t) are input₀And t₁，t₁And t₂，t₂And t₃，t₃And t₄) Inputting into network, 4 groups of characteristics F common to two frames can be obtained₁、F₂、F₃、F₄. 4 feature groups are respectively independent and pass through two pose prediction modules of the TNet module, any sub-module can be appointed to perform forward pose prediction, the other sub-module is used for backward pose prediction, and for the forward module, the features are F₁To F₄The two frames of relative pose prediction results of forward motion can be obtained: t is_0-1，T_1-2，T_2-3，T_3-4For backward modules, the features are as follows₄To F₁To obtain the relative pose of the backward motion, T_4-3，T_3-2，T_2-1，T_1-0；

For example, a video segment V of length 3 frames, two adjacent frames (t)₀And t₁，t₁And t₂) Inputting into network, obtaining 2 groups of characteristics F common to two frames₁、F₂. 2 feature groups are respectively independent and pass through two pose prediction modules of a TNet module, and for a forward module, the features are F₁To F₂The two frames of relative pose prediction results of forward motion can be obtained: t is_0-1，T_1-2For the backward module, the feature is as F₂To F₁To obtain the relative pose of the backward motion, T_2-1，T_1-0。

Step 3, for the video clip V, each frame I_i(i-0, 1,2 …) inputting the depth estimation network separately, obtaining the depth estimation result of single frame by network forward propagation calculation, each image corresponding to depth D_i(i ═ 0,1,2 …). For example, if the length of the video clip V is 5 frames, i is 0,1,2,3, 4.

Step 4, combining the relative poses T between two frames by using the image segment V_n-m，T_m-n(n-0, 1,2 …; m-i +1) and a depth per frame D_iAnd obtaining a distorted image I 'through a visual reconstruction module by adopting formula 1, wherein I' comprises a forward distorted image and a backward distorted image. For example, if the length of the video segment V is 5 frames, n is 0,1,2,3, and m is 1,2,3, 4.

Wherein Pt is the pixel coordinate, K is the camera reference, Dt is the predicted depth map, Tt_→And s is the predicted pose.

Step 5, comparing the image I in the image segment V with the distorted image I 'obtained in the step 4 pixel by pixel, calculating the reconstruction error between the image I in the image segment V and the distorted image I' obtained in the step 3 by adopting a formula 2,

L_reprojection＝α*L_photometric+(1-α)*L_ssim(2)

wherein Lphotometric is the luminosity error and is calculated by formula 3, Lssim is the similarity between images and is calculated by formula 4, α is a weight coefficient, and the value range is 0-1, such as 0.85;

wherein, I_tIs a continuous image, I_sIs a warped image, L is the number of consecutive image images minus 1 (i.e., L ═ i-1), for example, if the length of the video segment V is 5 frames, L ═ 4;

calculating a smoothing error of the predicted depth map;

the twin consistency error is calculated using equation 5,

wherein, I is a unit matrix, L is the number of continuous images minus 1 (i.e., L ═ I-1), T is a pose transformation matrix, and T is a pose transformation matrix_n-m*T_m-nI (n-0, 1,2 …; m-I + 1). For example, if the length of the video segment V is 5 frames, n is 0,1,2,3, m is 1,2,3,4, and L is 4.

And 6, summing the reconstruction error, the smoothing error and the twin consistency error obtained in the step 5 by adopting a formula 6 to obtain a final loss function.

L_Total＝L_reprojection+β*L_Smooth+γ*L_Twin(6)

Wherein lreprjection is the reconstruction error calculated in step 5, Lsoooth is the smoothing error of the depth estimation result, β and γ are weight coefficients, β and γ range from 0 to 1, for example, β value is 0.85, and γ value is 0.5.

And then performing back propagation by using an Adam optimization method, and performing iterative updating on parameter values in all modules in the frame until the loss function is converged, so that the training stage of the method is completed.

And 7, in a testing stage, preparing a testing data set, inputting a pair of source images for a pose estimation task, and calculating the relative pose of the camera between two frames by forward propagation by using the TNet network trained in the steps 1 to 6 to obtain a prediction result. For a depth estimation task, inputting a single frame image to a trained DMNet module, and calculating to obtain a prediction depth map through network forward propagation.

As shown in fig. 5, the pose estimation result of the algorithm is compared with other algorithms, and the result of the algorithm is displayed from the result of the video sequence 09-10, so that the result of the algorithm is most accurate; as shown in fig. 6, comparing the depth estimation result of the present algorithm with other algorithms, the abs rel absolute difference, sq rel square difference, RMSE mean square difference, log R log mean square difference and the highest accuracy of the present algorithm are seen from the error metric and accuracy metric accuracy metrics.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An unsupervised pose and depth calculation method is characterized by comprising a pose network model TNet, a depth network model DNet, an image visual reconstruction model V and a loss function, and comprises the following steps:

s1, preparing a monocular video data set;

2. The unsupervised pose and depth calculation method according to claim 1, wherein the calculation formula of the reconstruction error between the warped image in step S5 and the continuous image in step S2 is:

L_reprojection＝α*L_photometric+(1-α)*L_ssim

3. The unsupervised pose and depth calculation method of claim 2, wherein the Lphotomeric is:

4. The unsupervised pose and depth calculation method of claim 2, wherein the Lssim is:

where It Is a continuous image and Is a warped image.

5. The unsupervised pose and depth calculation method according to claim 1, wherein the twin consistency error in step S6 is:

wherein, I is an identity matrix, L is the number of continuous images minus 1, and T is a pose transformation matrix.

6. The unsupervised pose and depth calculation method according to claim 5, wherein the loss function in step S6 is:

L_Total＝L_{Reconstruction}+β*L_Smooth+γ*L_Twin

wherein Lreconstruction is, Lsoooth is the smoothing error of the depth estimation result, and β and γ are weight coefficients.

7. The unsupervised pose and depth calculation method of claim 1, wherein the loss function in step S6 is trained using Adam optimization.

8. A system for unsupervised pose and depth calculation is characterized by comprising a pose network module TNet, a depth network module DNet, an image vision reconstruction module V and a loss function module; the position and pose network module TNet carries out position and pose estimation, the depth network module DNet carries out depth estimation, the image visual reconstruction module V carries out image projection, and the position and pose network module TNet and the depth network module DNet are restrained through the loss function module.

9. The system of unsupervised pose and depth computation of claim 8, wherein the module TNet comprises an encoder and a twin module, the encoder comprising a convolutional layer and an activation function, the twin module comprising a pose prediction module of identical construction, the pose prediction module comprising ConvLstm and a convolutional layer; the module DNet comprises an encoder comprising a convolutional layer and a Dwise, and a decoder comprising an anti-convolutional layer, a convolutional layer and a Dwise.