CN113570658A - Monocular video depth estimation method based on depth convolutional network - Google Patents

Monocular video depth estimation method based on depth convolutional network

Info

Publication number
CN113570658A
Authority
CN
China
Prior art keywords
network
depth
error
sub
estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110648477.0A
Other languages
Chinese (zh)
Inventor
陈渤
曾泽群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110648477.0A priority Critical patent/CN113570658A/en
Publication of CN113570658A publication Critical patent/CN113570658A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 7/00 Image analysis › G06T 7/70 Determining position or orientation of objects or cameras
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/04 Architecture, e.g. interconnection topology › G06N 3/045 Combinations of networks
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/08 Learning methods › G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 5/00 Image enhancement or restoration › G06T 5/70 Denoising; Smoothing
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/10 Image acquisition modality › G06T 2207/10016 Video; Image sequence
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/20 Special algorithmic details › G06T 2207/20212 Image combination › G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of video processing and discloses a monocular video depth estimation method based on a deep convolutional network, which comprises the following steps: acquiring training data and a monocular video to be tested; constructing a depth estimation network model comprising a depth prediction sub-network and a camera pose estimation sub-network, wherein the decoder comprises an up-sampling module and a dense atrous pyramid module; jointly training the depth prediction sub-network and the camera pose estimation sub-network with the training data, and iteratively updating the network parameters of the two sub-networks with a loss function; and estimating a depth map of the monocular video to be tested. The invention exploits more spatial information of the original image and effectively improves the accuracy of depth prediction.

Description

Monocular video depth estimation method based on a deep convolutional network
Technical Field
The invention belongs to the technical field of video processing and relates to a monocular video depth estimation method based on a deep convolutional network, which can be used for three-dimensional reconstruction, robot navigation and automatic driving.
Background
Depth estimation is indispensable in many important tasks, such as three-dimensional reconstruction, automatic driving and robot navigation. The most common depth estimation algorithm at present is binocular depth estimation, which imitates human vision and estimates depth from the parallax between pictures of different viewing angles taken by a stereo camera or multiple cameras. However, binocular depth estimation suffers from many problems, such as high computational complexity, the difficulty of acquiring binocular pictures and poor matching in low-texture regions. Single-view pictures are generally easier to acquire than multi-view pictures. Monocular depth estimation obtains depth from pictures or video shot by a single camera and can greatly reduce both the cost and the difficulty of data acquisition.
In addition, in the depth estimation problem the cost of acquiring ground-truth depth is very high: images are usually labelled by acquiring depth information with a light/depth sensor (indoors) or a lidar (outdoors). Unsupervised depth estimation methods based on video sequences treat depth prediction as an intermediate step of synthesizing images between adjacent frames, so no ground-truth depth is required for training.
A paper "Unsupervised Learning of Depth and Ego-Motion from Video" (The IEEE Conference on Computer Vision and Pattern Recognition, 2017) published by zhou.t.h, brown.m, snavely.n, lowe.d. discloses an Unsupervised Video Depth estimation algorithm based on Depth Learning. The algorithm does not need a depth true value, predicts the depth based on the multi-angle matching relation between video sequences, provides geometric consistency constraint after considering the problem of the inconsistency of the output scales of previous work, provides a self-discovery mask module on the basis, solves the problem of the inconsistency of the scales between frames of an output depth image, and has higher precision on depth prediction.
However, the method still has the following disadvantages: the network it uses does not fully exploit multi-scale feature fusion to improve the accuracy of depth prediction, and the feature reuse of its backbone network is limited, so image features cannot be extracted sufficiently.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a monocular video depth estimation method and system based on a deep convolutional network, which improve the accuracy of the resulting depth map by exploiting a deep convolutional network structure.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme.
The monocular video depth estimation method based on a deep convolutional network comprises the following steps:
step 1, acquiring training data and a monocular video to be tested;
wherein the training data comprise an RGB optical video sequence I = {I_t | 0 ≤ t ≤ T, t ∈ Z} and a corresponding ground-truth depth image sequence D = {D_t | 0 ≤ t ≤ T, t ∈ Z}, where Z denotes the set of time indices, I_t denotes the RGB image at time t and D_t denotes the ground-truth depth image at time t;
step 2, a depth estimation network model is built, wherein the depth estimation network model comprises a depth prediction sub-network and a camera pose estimation sub-network; the depth prediction sub-network is a self-encoding network comprising an encoder and a decoder, the encoder is a densely connected deep convolutional network, and the decoder comprises an up-sampling module and a dense atrous pyramid module (DenseASPP); the camera pose estimation sub-network is a deep convolutional neural network;
step 3, performing joint training on the depth prediction sub-network and the camera pose estimation sub-network by using training data, and performing iterative updating on network parameters of the two sub-networks by using a loss function to obtain a trained depth prediction sub-network;
wherein the loss function comprises an image reconstruction error L_p, a scale consistency error L_GC and a smoothing term error L_s;
Step 4, inputting the monocular video to be tested into the trained depth prediction sub-network, and outputting a normalized depth prediction image; and calibrating the output normalized depth map according to the actual physical scale to obtain a final predicted depth map.
Compared with the prior art, the invention has the beneficial effects that:
because the constructed depth prediction sub-network has the densely connected deep structure and the multi-scale pyramid feature fusion module, more image information can be extracted, and the defects that the depth prediction is carried out only by using skip level connection and utilizing the multi-scale information and the feature extraction network cannot carry out feature reuse in the prior art are overcome, so that more original image space information is utilized, and the precision of the depth prediction is effectively improved.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram of the deep convolutional network architecture of the present invention;
FIG. 3 is an RGB image of adjacent frames input in an embodiment of the present invention;
FIG. 4 is an output depth map of adjacent frame images obtained using the present invention;
fig. 5 is a schematic diagram of the image reconstruction process of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.
Referring to fig. 1, the monocular video depth estimation method based on the depth convolutional network provided by the present invention includes the following steps:
step 1, acquiring training data and a monocular video to be tested;
wherein the training data comprise an RGB optical video sequence I = {I_t | 0 ≤ t ≤ T, t ∈ Z} and a corresponding ground-truth depth image sequence D = {D_t | 0 ≤ t ≤ T, t ∈ Z}, where Z denotes the set of time indices, I_t denotes the RGB image at time t and D_t denotes the ground-truth depth image at time t;
in the embodiment, the RGB image sequence and the 3D laser radar point cloud data in the KITTI data set are randomly divided into a training set and a testing set. The samples in the test set correspond to the monocular video to be tested.
Random sampling from the training set yields the RGB images I_t and I_{t-1} of two adjacent frames at times t and t-1, and the corresponding ground-truth depth maps D_t and D_{t-1} at times t and t-1 are then recovered from the 3D lidar point-cloud data.
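For reference, a minimal sketch of how a ground-truth depth map can be recovered from lidar returns by projecting the point cloud through the camera model is given below; the function name, the KITTI-style calibration inputs T_cam_lidar and K, and the nearest-point rule are illustrative assumptions rather than details specified by the patent:

```python
import numpy as np

def lidar_to_depth_map(points_lidar, T_cam_lidar, K, img_h, img_w):
    """Project a 3D lidar point cloud into a sparse depth map.

    points_lidar : (N, 3) points in the lidar frame
    T_cam_lidar  : (4, 4) rigid transform from the lidar frame to the camera frame
    K            : (3, 3) camera intrinsic matrix
    """
    # Homogeneous coordinates, then transform into the camera frame.
    pts = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts.T).T[:, :3]
    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-3]
    # Perspective projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v, z = uv[:, 0].astype(int), uv[:, 1].astype(int), pts_cam[:, 2]
    # Discard projections that fall outside the image.
    valid = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    depth = np.zeros((img_h, img_w), dtype=np.float32)
    # Nearest point wins where several lidar returns hit the same pixel.
    for ui, vi, zi in sorted(zip(u[valid], v[valid], z[valid]), key=lambda x: -x[2]):
        depth[vi, ui] = zi
    return depth
```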
Step 2, a depth estimation network model is built, wherein the depth estimation network model comprises a depth prediction sub-network and a camera pose estimation sub-network; the depth prediction sub-network is a self-encoding network comprising an encoder and a decoder, the encoder is a densely connected deep convolutional network, and the decoder comprises an up-sampling module and a dense atrous pyramid module (DenseASPP); the camera pose estimation sub-network is a deep convolutional neural network;
specifically, the structure of the depth estimation network model is shown in fig. 2:
the depth prediction sub-network is a self-encoding network whose encoder is a densely connected deep convolutional network DenseNet; the main body of the decoder is image up-sampling, and a dense atrous pyramid module DenseASPP is additionally introduced to perform multi-scale feature fusion. The RGB images I_t and I_{t-1} of two adjacent frames (shown in FIG. 3) are used as the input of the depth prediction sub-network, and the network outputs the corresponding depth prediction maps D̂_t (shown in FIG. 4), where the subscript t of I_t and D̂_t denotes time t and the hat on D̂_t indicates a prediction of the depth prediction network, distinguishing it from the ground-truth depth D_t obtained by the sensor.
The camera pose prediction sub-network is a deep convolutional network whose input is the RGB images I_t and I_{t-1} of two adjacent frames and whose output is the camera motion matrix T_{t→t-1} from time t to time t-1.
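A minimal PyTorch sketch of this encoder-decoder and pose structure is given below; the layer widths, dilation rates, output activations and the class names DepthNet/PoseNet are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision

class DenseASPP(nn.Module):
    """Densely connected atrous convolutions for multi-scale feature fusion."""
    def __init__(self, in_ch, mid_ch=128, dilations=(3, 6, 12, 18)):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(ch, mid_ch, 3, padding=d, dilation=d),
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True)))
            ch += mid_ch  # each block also sees the outputs of all earlier blocks
        self.out_ch = ch

    def forward(self, x):
        feats = [x]
        for blk in self.blocks:
            feats.append(blk(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class DepthNet(nn.Module):
    """DenseNet encoder + up-sampling decoder with a DenseASPP bottleneck."""
    def __init__(self):
        super().__init__()
        # DenseNet-121 feature extractor (pretrained weights could be loaded if desired).
        self.encoder = torchvision.models.densenet121(weights=None).features
        self.aspp = DenseASPP(1024)
        self.decoder = nn.Sequential(
            nn.Conv2d(self.aspp.out_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())  # normalized depth in (0, 1)

    def forward(self, img):
        return self.decoder(self.aspp(self.encoder(img)))

class PoseNet(nn.Module):
    """Predicts a 6-DoF relative camera motion from two concatenated RGB frames."""
    def __init__(self):
        super().__init__()
        chans = [6, 16, 32, 64, 128, 256]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)
        self.head = nn.Conv2d(256, 6, 1)  # 3 translation + 3 rotation (axis-angle)

    def forward(self, img_t, img_tm1):
        x = self.conv(torch.cat([img_t, img_tm1], dim=1))
        return 0.01 * self.head(x).mean(dim=(2, 3))  # small initial motions
```

For an input pair of, say, 256×832 KITTI crops, DepthNet returns a one-channel normalized depth map of the same size and PoseNet returns one 6-dimensional relative motion vector per image pair.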
Step 3, performing joint training on the depth prediction sub-network and the camera pose estimation sub-network by using training data, and performing iterative updating on network parameters of the two sub-networks by using a loss function to obtain a trained depth prediction sub-network;
wherein the loss function comprises an image reconstruction error L_p, a scale consistency error L_GC and a smoothing term error L_s;
(3.1) random samples are drawn from a Gaussian distribution with mean 0 and variance 0.01, and the sampled array is used as the initialization parameters of the depth estimation network model;
(3.2) respectively inputting the RGB images I_t and I_{t-1} of two adjacent frames into the depth prediction sub-network and the camera pose prediction sub-network, and then calculating the mask weights, the scale consistency error, the image reconstruction error and the smooth regularization term error from their outputs;
(3.3) jointly training the depth prediction sub-network and the camera pose estimation sub-network by minimizing an overall error, so that the depth prediction sub-network can output a high-precision depth map;
and (3.4) performing iterative updating on all parameters in the depth prediction sub-network and the camera pose estimation sub-network obtained in the step (3.3) by using a batch random gradient descent method until the model converges, and finishing the optimization of the network model.
The loss function mainly comprises the image reconstruction error L_p, the scale consistency error L_GC and the smoothing term error L_s. During image reconstruction, moving objects between adjacent frames, occluded regions and other complex pixels that are hard to explain often degrade the reconstruction quality. These pixels therefore need to be detected first and then given lower weights; the step of detecting such complex pixels is called the mask module, and its implementation is given in (3.2a).
(3.2a) From the output D̂_t of the depth prediction sub-network at time t and the camera motion matrix T_{t→t-1} from time t to time t-1 output by the camera pose estimation sub-network, a depth map D̂^t_{t-1} under the camera view at time t-1 can be reconstructed. Taking the normalized difference between D̂^t_{t-1} and the output D̂_{t-1} of the depth prediction sub-network at time t-1 gives the depth prediction error at pixel p:

D_diff(p) = |D̂^t_{t-1}(p) - D̂_{t-1}(p)| / (D̂^t_{t-1}(p) + D̂_{t-1}(p))

In the above formula, p denotes a pixel, and D_diff(p) lies in [0, 1]. For moving objects, occluded regions and other pixels that are hard to explain, D_diff(p) is large and close to 1; for pixels not belonging to these categories, D_diff(p) is small and close to 0. To give pixels with large D_diff(p) a lower weight, the mask weight M(p) at pixel p is calculated as follows:

M(p) = 1 - D_diff(p)
this weight will be applied to the scaled image reconstruction error in (3.3)
(3.2b) Taking the mean of the per-pixel depth prediction error D_diff(p) over the whole image gives the scale consistency error:

L_GC = (1 / num(V)) · Σ_{p∈V} D_diff(p)

where V is the set of valid pixels of the whole image and num(V) denotes the number of valid pixels.
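Assuming the warped depth map D̂^t_{t-1} has already been obtained (see the view-synthesis sketch in step (3.2c) below), the mask weight of step (3.2a) and the scale consistency error of step (3.2b) can be computed as in the following sketch; the valid-pixel mask argument and the eps constant are illustrative assumptions:

```python
import torch

def mask_and_geometry_loss(depth_warped, depth_tm1, valid, eps=1e-7):
    """depth_warped: D̂^t_{t-1}; depth_tm1: D̂_{t-1}; valid: boolean mask; all of shape (B, H, W)."""
    # Normalized depth difference D_diff(p) in [0, 1] (step 3.2a); eps avoids division by zero.
    d_diff = (depth_warped - depth_tm1).abs() / (depth_warped + depth_tm1 + eps)
    mask = 1.0 - d_diff                      # M(p) = 1 - D_diff(p)
    # Scale consistency error L_GC: mean of D_diff over the valid pixels (step 3.2b).
    l_gc = d_diff[valid].mean()
    return mask, l_gc
```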
(3.2c) As shown in Fig. 5, the image reconstruction process is as follows: combining the RGB image I_t at time t, the predicted depth map D̂_t and the camera motion matrix T_{t→t-1}, the RGB image Î_{t-1} at time t-1 can be reconstructed. Besides the grey-value error, the image reconstruction error also introduces a structural similarity error SSIM. Combining the mask weight M(p) obtained in step (3.2a), the image reconstruction error is:

L_p = (1 / num(V)) · Σ_{p∈V} M(p) · ( λ_i · |I_{t-1}(p) - Î_{t-1}(p)| + λ_s · (1 - SSIM(p)) / 2 )

where λ_i = 0.15 and λ_s = 0.85 (the two weights sum to 1). The term to the left of the plus sign is the absolute-value error of the image reconstruction, and the SSIM(p) term on the right is the structural similarity error between the two images at time t-1.
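The patent does not spell out the sampling details of the reconstruction in Fig. 5; a common differentiable realization in SfMLearner-style methods is inverse warping, sketched below under the assumption that the frame at time t-1 is synthesized by sampling I_t with the predicted depth D̂_{t-1}, the relative pose from t-1 to t (the inverse of T_{t→t-1}) and the camera intrinsics K; the SSIM map is assumed to be computed separately (see the sketch in the SSIM paragraph below):

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_t, depth_tm1, T_tm1_to_t, K):
    """Synthesize Î_{t-1} by sampling img_t at the pixels predicted by depth and pose.

    img_t: (B, C, H, W); depth_tm1: (B, 1, H, W); T_tm1_to_t: (B, 4, 4); K: (3, 3).
    """
    b, _, h, w = img_t.shape
    # Pixel grid of frame t-1 in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1)
    # Back-project pixels into 3D camera points using the predicted depth.
    cam = (torch.inverse(K) @ pix).unsqueeze(0) * depth_tm1.view(b, 1, -1)
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)
    # Transform into the frame-t camera and project with the intrinsics.
    proj = K @ (T_tm1_to_t @ cam_h)[:, :3, :]
    u = proj[:, 0] / proj[:, 2].clamp(min=1e-6)
    v = proj[:, 1] / proj[:, 2].clamp(min=1e-6)
    # Normalize to [-1, 1] for grid_sample and bilinearly sample img_t.
    grid = torch.stack([2 * u / (w - 1) - 1, 2 * v / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    return F.grid_sample(img_t, grid, padding_mode='border', align_corners=True)

def photometric_loss(img_tm1, img_recon, mask, ssim_map, valid, lambda_i=0.15, lambda_s=0.85):
    """Masked image reconstruction error L_p of step (3.2c); mask, ssim_map, valid are (B, H, W)."""
    abs_err = (img_tm1 - img_recon).abs().mean(dim=1)      # per-pixel grey-value error
    err = lambda_i * abs_err + lambda_s * (1.0 - ssim_map) / 2.0
    return (mask * err)[valid].mean()
```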
Here SSIM (structural similarity) is an index for measuring the similarity between two images. The index was first proposed by the Laboratory for Image and Video Engineering at the University of Texas at Austin.
Given two images x and y, their structural similarity is computed as follows:

SSIM(x, y) = ( (2·μ_x·μ_y + c_1) · (2·σ_xy + c_2) ) / ( (μ_x² + μ_y² + c_1) · (σ_x² + σ_y² + c_2) )

where μ_x is the mean of x, μ_y is the mean of y, σ_x² is the variance of x, σ_y² is the variance of y and σ_xy is the covariance of x and y. c_1 = (k_1·L)² and c_2 = (k_2·L)² are constants used to maintain numerical stability, L is the dynamic range of the pixel values, k_1 = 0.01 and k_2 = 0.03. The more similar the two images are, the closer the SSIM value is to 1.
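A per-pixel SSIM map can be computed with local averaging, for example with 3×3 mean pooling as below; the window size and the use of average pooling are illustrative choices, and for images normalized to [0, 1] the dynamic range L is 1:

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, k1=0.01, k2=0.03, L=1.0):
    """Per-pixel SSIM between images x and y of shape (B, C, H, W), averaged over channels."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    pool = lambda t: F.avg_pool2d(t, kernel_size=3, stride=1, padding=1)
    mu_x, mu_y = pool(x), pool(y)
    sigma_x = pool(x * x) - mu_x ** 2          # local variances
    sigma_y = pool(y * y) - mu_y ** 2
    sigma_xy = pool(x * y) - mu_x * mu_y       # local covariance
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1).mean(dim=1)  # (B, H, W); closer to 1 means more similar
```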
(3.2d) To address noise and the vanishing of gradients in low-texture regions, a smoothing term error is introduced:

L_s = Σ_p e^(-|∇I_t(p)|) · |∇D̂_t(p)|

where ∇I_t(p) is the gradient of the input RGB image at pixel p and ∇D̂_t(p) is the gradient of the depth map at pixel p.
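An edge-aware implementation of this smoothing term, which uses image-gradient magnitudes to down-weight depth gradients at image edges, might look like the following sketch (forward differences are an illustrative choice):

```python
import torch

def smoothness_loss(depth, img):
    """L_s: depth gradients weighted by exp(-|image gradient|); depth (B, 1, H, W), img (B, 3, H, W)."""
    dzdx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dzdy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    didx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    didy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (dzdx * torch.exp(-didx)).mean() + (dzdy * torch.exp(-didy)).mean()
```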
The overall loss function is the weighted sum of the image reconstruction error L_p, the scale consistency error L_GC and the smoothing term error L_s:

L = α·L_p + β·L_s + γ·L_GC

where α = 1.0, β = 0.1 and γ = 0.5; α, β and γ respectively denote the weights of the corresponding errors and take values in [0, 1].
And training and optimizing the network model by minimizing a loss function, namely the overall error L.
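Putting the three terms together with the weights given above yields the compute_total_loss helper assumed in the training sketch of step 3; it composes the sketches above, the argument list is illustrative, and pose_vec_to_matrix is a hypothetical helper converting the 6-DoF pose vector to a 4×4 matrix:

```python
import torch

def compute_total_loss(img_t, img_tm1, depth_t, depth_tm1, pose_t_to_tm1, intrinsics,
                       alpha=1.0, beta=0.1, gamma=0.5):
    # The relative pose from t-1 to t is the inverse of T_{t→t-1}.
    T = pose_vec_to_matrix(pose_t_to_tm1)                  # hypothetical 6-DoF -> 4x4 helper
    img_recon = inverse_warp(img_t, depth_tm1, torch.inverse(T), intrinsics)
    # Sampling only; a full implementation would also transform the sampled depth values
    # into the t-1 camera frame before comparing them.
    depth_warped = inverse_warp(depth_t, depth_tm1, torch.inverse(T), intrinsics)
    valid = torch.ones_like(depth_tm1.squeeze(1), dtype=torch.bool)   # all pixels assumed valid here
    mask, l_gc = mask_and_geometry_loss(depth_warped.squeeze(1), depth_tm1.squeeze(1), valid)
    s_map = ssim_map(img_recon, img_tm1)
    l_p = photometric_loss(img_tm1, img_recon, mask, s_map, valid)
    l_s = smoothness_loss(depth_t, img_t) + smoothness_loss(depth_tm1, img_tm1)
    return alpha * l_p + beta * l_s + gamma * l_gc         # L = αL_p + βL_s + γL_GC
```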
Step 4, inputting the monocular video to be tested into the trained depth prediction sub-network, and outputting a normalized depth prediction image; and calibrating the output normalized depth map according to the actual physical scale to obtain a final predicted depth map.
And (4.1) inputting the single RGB picture of the test sample into a depth prediction sub-network, and outputting a corresponding normalized depth map.
And (4.2) calibrating the output normalized depth map according to the actual physical scale to obtain a final predicted depth map.
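The patent does not fix a particular calibration procedure; one common convention for scale-ambiguous monocular methods is median scaling against a reference depth, sketched here as an assumption:

```python
import numpy as np

def calibrate_scale(pred_depth, ref_depth):
    """Rescale a normalized predicted depth map to physical units using a reference.

    pred_depth: normalized network output; ref_depth: sparse metric depth (e.g. from lidar),
    with zeros marking pixels without a measurement.
    """
    valid = ref_depth > 0
    scale = np.median(ref_depth[valid]) / np.median(pred_depth[valid])
    return pred_depth * scale
```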
Simulation experiment
The effectiveness of the invention is verified by the following simulation experiments.
1. Simulation conditions are as follows:
the simulation test of the invention is carried out under the linux operating environment with the GPU being Tesla P4. Dividing pictures, training set: 5240 pictures, verification set: 2070 pictures, test set 200 pictures.
2. Simulation content:
simulation 1, depth prediction is performed on the RGB image shown in fig. 3 from the KITTI image set by using the present invention, and a predicted depth map is obtained, as shown in fig. 4. It can be seen from fig. 4 that the present invention can recover a depth map from a single picture.
Simulation 2, a depth prediction experiment is performed on the KITTI image set using the method of the invention and the existing video-based unsupervised monocular depth estimation algorithm SC-SfMLearner. Taking the relative squared error SqRel, the root mean square error RMSE and the root mean square logarithmic error RMSE_log of the prediction results as the comparison criteria, the accuracy of monocular video depth estimation with the two methods is compared; the lower the values of SqRel, RMSE and RMSE_log, the higher the accuracy of depth prediction. The experimental results are shown in Table 1:
Table 1. Comparison of the prediction accuracy of the method of the invention and the existing SC-SfMLearner

Estimation method    SqRel     RMSE      RMSE_log
SC-SfMLearner        0.1834    6.8903    0.2630
The invention        0.1751    6.4451    0.2496
From the results in Table 1 it can be seen that, compared with the existing SC-SfMLearner depth prediction method, the relative squared error SqRel, the root mean square error RMSE and the root mean square logarithmic error RMSE_log of the proposed method are all smaller, which demonstrates the effectiveness of the method provided by the invention.
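For reference, the three error metrics in Table 1 are commonly computed over valid ground-truth pixels as follows (a sketch; the exact evaluation crop and depth caps used in the experiments are not specified here):

```python
import numpy as np

def depth_metrics(pred, gt):
    """SqRel, RMSE and RMSE_log between predicted and ground-truth depth maps."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    sq_rel = np.mean(((p - g) ** 2) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    return sq_rel, rmse, rmse_log
```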
Although the present invention has been described in detail in this specification with reference to specific embodiments and illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the present invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (6)

1. The monocular video depth estimation method based on a deep convolutional network is characterized by comprising the following steps:
step 1, acquiring training data and a monocular video to be tested;
wherein the training data comprise an RGB optical video sequence I = {I_t | 0 ≤ t ≤ T, t ∈ Z} and a corresponding ground-truth depth image sequence D = {D_t | 0 ≤ t ≤ T, t ∈ Z}, where Z denotes the set of time indices, I_t denotes the RGB image at time t and D_t denotes the ground-truth depth image at time t;
step 2, a depth estimation network model is built, wherein the depth estimation network model comprises a depth prediction sub-network and a camera pose estimation sub-network; the depth prediction sub-network is a self-encoding network comprising an encoder and a decoder, the encoder is a densely connected deep convolutional network, and the decoder comprises an up-sampling module and a dense atrous pyramid module (DenseASPP); the camera pose estimation sub-network is a deep convolutional neural network;
step 3, performing joint training on the depth prediction sub-network and the camera pose estimation sub-network by using training data, and performing iterative updating on network parameters of the two sub-networks by using a loss function to obtain a trained depth prediction sub-network;
wherein the loss function comprises an image reconstruction error L_p, a scale consistency error L_GC and a smoothing term error L_s;
Step 4, inputting the monocular video to be tested into the trained depth prediction sub-network, and outputting a normalized depth prediction image; and calibrating the output normalized depth map according to the actual physical scale to obtain a final predicted depth map.
2. The method of claim 1, wherein the encoder is a densely connected deep convolutional network DenseNet; the main body of the decoder is image up-sampling, and an additionally introduced dense atrous pyramid module DenseASPP performs multi-scale feature fusion.
3. The method for monocular video depth estimation based on depth convolutional network of claim 1, wherein the joint training of the depth prediction subnetwork and the camera pose estimation subnetwork is performed by using training data, and the specific process is as follows:
(3.1) randomly initializing network parameters of the depth estimation network model;
(3.2) respectively inputting the RGB images I_t and I_{t-1} of two adjacent frames into the depth prediction sub-network and the camera pose prediction sub-network, and then calculating the mask weights, the scale consistency error, the image reconstruction error and the smooth regularization term error from their outputs;
(3.3) jointly training the depth prediction sub-network and the camera pose estimation sub-network by minimizing an overall error, so that the depth prediction sub-network can output a high-precision depth map;
and (3.4) iteratively updating all network parameters in the depth prediction sub-network and the camera pose estimation sub-network obtained in the step (3.3) by using a batch random gradient descent method until the model converges, and finishing the optimization of the network model.
4. The method as claimed in claim 3, wherein the random sampling is performed from a Gaussian distribution with a mean value of 0 and a variance of 0.01, and the random sampling array is used as an initialization parameter of the depth estimation network model.
5. The method for monocular video depth estimation based on depth convolutional network of claim 3, wherein the mask weight, scale consistency error, image reconstruction error and smooth regularization term error of each sub-network are calculated respectively, and the specific steps are as follows:
(3.2a) from the output D̂_t of the depth prediction sub-network at time t and the camera motion matrix T_{t→t-1} from time t to time t-1 output by the camera pose estimation sub-network, reconstructing the depth map D̂^t_{t-1} under the camera view at time t-1; then taking the normalized difference between D̂^t_{t-1} and the output D̂_{t-1} of the depth prediction sub-network at time t-1 to obtain the depth prediction error at pixel p:

D_diff(p) = |D̂^t_{t-1}(p) - D̂_{t-1}(p)| / (D̂^t_{t-1}(p) + D̂_{t-1}(p))

where D_diff(p) is the depth prediction error at pixel p, whose value lies in [0, 1];

to give pixels with large D_diff(p) a lower weight, calculating the mask weight M(p) at pixel p as follows:

M(p) = 1 - D_diff(p);
(3.2b) taking the mean of the per-pixel depth prediction error D_diff(p) over the whole image to obtain the scale consistency error:

L_GC = (1 / num(V)) · Σ_{p∈V} D_diff(p)

where V is the set of valid pixels of the whole image and num(V) denotes the number of valid pixels;
(3.2c) combining the RGB image I_t at time t, the predicted depth map D̂_t and the camera motion matrix T_{t→t-1}, reconstructing the RGB image Î_{t-1} at time t-1; the error terms in this process include the image reconstruction grey-value error and the structural similarity error SSIM, and the image reconstruction error is then:

L_p = (1 / num(V)) · Σ_{p∈V} M(p) · ( λ_i · |I_{t-1}(p) - Î_{t-1}(p)| + λ_s · (1 - SSIM(p)) / 2 )

where λ_i and λ_s are weight parameters whose sum is 1; Î_{t-1}(p) denotes the grey value of pixel p in the reconstructed RGB image Î_{t-1} at time t-1; and SSIM(p) is the structural similarity error at pixel p between the two images Î_{t-1} and I_{t-1} at time t-1;
(3.2d) calculating the smoothing term error as follows:

L_s = Σ_p e^(-|∇I_t(p)|) · |∇D̂_t(p)|

where ∇I_t(p) is the gradient of the input RGB image at pixel p and ∇D̂_t(p) is the gradient of the depth map at pixel p.
6. The method of claim 5, wherein the overall error is the weighted sum of the image reconstruction error L_p, the scale consistency error L_GC and the smoothing term error L_s:

L = α·L_p + β·L_s + γ·L_GC

where α, β and γ respectively denote the weights of the corresponding errors and take values in [0, 1].
CN202110648477.0A 2021-06-10 2021-06-10 Monocular video depth estimation method based on depth convolutional network Pending CN113570658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110648477.0A CN113570658A (en) 2021-06-10 2021-06-10 Monocular video depth estimation method based on depth convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110648477.0A CN113570658A (en) 2021-06-10 2021-06-10 Monocular video depth estimation method based on depth convolutional network

Publications (1)

Publication Number Publication Date
CN113570658A true CN113570658A (en) 2021-10-29

Family

ID=78161933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110648477.0A Pending CN113570658A (en) 2021-06-10 2021-06-10 Monocular video depth estimation method based on depth convolutional network

Country Status (1)

Country Link
CN (1) CN113570658A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510535A (en) * 2018-03-14 2018-09-07 大连理工大学 A kind of high quality depth estimation method based on depth prediction and enhancing sub-network
WO2019174378A1 (en) * 2018-03-14 2019-09-19 大连理工大学 High-quality depth estimation method based on depth prediction and enhancement sub-networks
WO2019223382A1 (en) * 2018-05-22 2019-11-28 深圳市商汤科技有限公司 Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN109741383A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Picture depth estimating system and method based on empty convolution sum semi-supervised learning
WO2021013334A1 (en) * 2019-07-22 2021-01-28 Toyota Motor Europe Depth maps prediction system and training method for such a system
CN111311685A (en) * 2020-05-12 2020-06-19 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU/monocular image
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
CN111860386A (en) * 2020-07-27 2020-10-30 山东大学 Video semantic segmentation method based on ConvLSTM convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
岑仕杰; 何元烈; 陈小聪: "Monocular depth estimation combining attention and unsupervised deep learning", Journal of Guangdong University of Technology, no. 04 *
王欣盛; 张桂玲: "Monocular depth estimation based on convolutional neural networks", Computer Engineering and Applications, no. 13 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023155043A1 (en) * 2022-02-15 2023-08-24 中国科学院深圳先进技术研究院 Historical information-based scene depth reasoning method and apparatus, and electronic device
CN114627351A (en) * 2022-02-18 2022-06-14 电子科技大学 Fusion depth estimation method based on vision and millimeter wave radar
CN114627351B (en) * 2022-02-18 2023-05-16 电子科技大学 Fusion depth estimation method based on vision and millimeter wave radar
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN114998411B (en) * 2022-04-29 2024-01-09 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss
CN115272438A (en) * 2022-08-19 2022-11-01 中国矿业大学 High-precision monocular depth estimation system and method for three-dimensional scene reconstruction

Similar Documents

Publication Publication Date Title
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN113570658A (en) Monocular video depth estimation method based on depth convolutional network
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
US11715258B2 (en) Method for reconstructing a 3D object based on dynamic graph network
CN107818554B (en) Information processing apparatus and information processing method
CN110084304B (en) Target detection method based on synthetic data set
CN111462206B (en) Monocular structure light depth imaging method based on convolutional neural network
CN108171249B (en) RGBD data-based local descriptor learning method
CN105513033B (en) A kind of super resolution ratio reconstruction method that non local joint sparse indicates
CN112819853B (en) Visual odometer method based on semantic priori
CN113177592B (en) Image segmentation method and device, computer equipment and storage medium
CN113450396A (en) Three-dimensional/two-dimensional image registration method and device based on bone features
Eichhardt et al. Affine correspondences between central cameras for rapid relative pose estimation
CN112288788A (en) Monocular image depth estimation method
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN107392211B (en) Salient target detection method based on visual sparse cognition
CN114332125A (en) Point cloud reconstruction method and device, electronic equipment and storage medium
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN111160362B (en) FAST feature homogenizing extraction and interframe feature mismatching removal method
CN111401209B (en) Action recognition method based on deep learning
Nouduri et al. Deep realistic novel view generation for city-scale aerial images
CN117274515A (en) Visual SLAM method and system based on ORB and NeRF mapping
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN111553954A (en) Direct method monocular SLAM-based online luminosity calibration method
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination