CN113379821B

CN113379821B - Stable monocular video depth estimation method based on deep learning

Info

Publication number: CN113379821B
Application number: CN202110695235.7A
Authority: CN
Inventors: 肖春霞; 罗飞; 魏林
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2021-06-23
Filing date: 2021-06-23
Publication date: 2022-10-11
Anticipated expiration: 2041-06-23
Also published as: CN113379821A

Abstract

The invention provides a monocular video depth estimation method based on deep learning stability. The method completely utilizes the monocular video sequence to train the proposed model, does not need monitoring of a depth map or the GroudTruth of the camera pose in the training process, and is a completely unsupervised method. Compared with the existing monocular video-based depth estimation, the method has the characteristics that the depth estimation result estimated on continuous video frames is stable, and the depth estimation result between frames does not have a large inconsistency. In addition, a solution is also provided for the difficulty in depth estimation, namely depth estimation of a moving object.

Description

Stable monocular video depth estimation method based on deep learning

Technical Field

The invention belongs to the field of computer vision, and particularly relates to a stable monocular video depth estimation method based on deep learning. And returning to a stable depth map of continuous frames without group Truth as supervision.

Background

Depth estimation is a fundamental task of computer vision, aiming at estimating depth from 2D images. The input to this task is the RGB image and the output is the depth map. The depth here refers to the distance from the object to the camera optical center, and the problem to be solved in the depth estimation is to solve the distance from the object to the camera optical center in the shooting scene. The depth estimation has wide application prospect and application value, such as automatic driving for researching more fire ^[1] The field, as well as conventional three-dimensional reconstruction, augmented reality, etc., all require the use of related techniques to depth estimation.

Methods of acquiring depth information include using depth sensors such as LIDAR sensors, toF sensors, etc., or solving using depth estimation algorithms. The problem with using sensors to obtain depth maps is that the equipment is expensive and the acquisition cost is high, especially to obtain high precision depth data. And the depth data collected by the sensors is also range limited and subject to environmental factors. Therefore, if the problem of acquiring the depth data can be solved through the depth estimation algorithm, the data acquisition cost can be greatly saved, and the large-scale popularization and application of related applications are facilitated.

The depth estimation algorithm can be divided into a conventional method and a method based on deep learning. The traditional method relies on accurate extraction and matching of image feature points, which results in that in some low-texture regions or in the presence of occlusion and moving objects in the scene, the acquired depth data is less than ideal, and the estimated depth map is usually sparse. The method based on deep learning can well solve the problems of the traditional method and can estimate a dense depth result. Depth estimation based on deep learning can be classified into supervised depth estimation and unsupervised depth estimation, the supervised depth estimation has the problem that a large amount of real depth data is needed as supervision in the training process, but a large amount of manpower and material resources are consumed for collecting the real depth data, and the real depth data is not paid. Therefore, unsupervised depth estimation is a current trend of research development in this direction, wherein a deep learning method using monocular video sequences as training data is the most promising method among all depth estimation methods based on deep learning due to the convenience of data acquisition. However, one of the existing problems of the unsupervised depth estimation based on monocular video is that the estimated depth between consecutive video frames is not stable, which is a problem to be solved urgently, and the depth estimation between video sequences is stable to bring stable depth-based applications, such as augmented reality, three-dimensional reconstruction, and the like.

Disclosure of Invention

In order to overcome the defects, the invention provides a stable monocular video depth estimation method based on deep learning, which utilizes the strong fitting capacity of a convolutional neural network, and finally learns the depth map corresponding to the two-dimensional RGB picture by learning a large amount of two-dimensional RGB picture data and combining with a designed loss function. The method completely utilizes the monocular video sequence to train the proposed model, does not need the monitoring of the Ground Truth of the depth map or the camera pose in the training process, and is a completely unsupervised method.

The existing method for unsupervised depth estimation based on monocular video has great progress in precision, and the test result on a single picture has a good visualization result. However, after experiments are carried out on the existing monocular video depth estimation method, it is found that although a single picture can have a better test result, the test result on consecutive video frames has an unstable phenomenon, which is also a problem that the present invention mainly intends to solve. The patent makes innovations in that a time sequence smoothing term is proposed for continuous depth maps, and a time sequence smoothing loss is constructed by the smoothing term and used for restraining stability between continuous depth maps. And the other is to provide a self-discovery mask (mask) which is used for solving the depth estimation of a moving object in the scene.

In order to achieve the above object, a stable monocular video depth estimation method based on deep learning is characterized in that:

firstly, inputting a single color picture into a depth estimation network for depth estimation, and then inputting two continuous video frames into a camera pose estimation network for relative camera pose estimation; reconstructing an image by combining depth information output by a depth estimation network and camera pose information output by a camera pose network; the two continuous video frames are both pictures involved in depth estimation;

the method comprises the following steps of constructing a loss function to solve the unstable problem of depth estimation of continuous video frames, and specifically defining the following steps:

L _gs ＝|S _a -S _b |，

S _a ＝median(D _a )，

wherein L is _gs Representing the time-sequence smoothing loss function term for two consecutive video frames I _a 、I _b ，D _a 、D _b Represents I _a And I _b Result of depth estimation of S _a And S _b Then it is the time sequence smoothing term of the consecutive video frames, and the mean represents the median operation.

Further, the view synthesis loss of the dynamic object is constrained to deal with the problem that the depth estimation of the dynamic object violating the static scene assumption is inaccurate when view synthesis is performed, which is specifically performed as follows:

for two front and back adjacent views I _a 、I _b In obtaining a depth estimation networkThe camera pose output by the output depth information and camera pose estimation network may then be viewed from view I _b View synthesis is performed to obtain a view I _a Composite Picture I at View Angle _a′ Then, the gray difference between the original view and the synthesized view, i.e. the view synthesis loss P, is calculated _daa′ P can also be obtained _dbb′ (ii) a Calculating to obtain a mask M after obtaining the view synthesis loss from the front frame to the rear frame and the view synthesis loss from the rear frame to the front frame, wherein the value of M is as small as possible under the condition that the pose of a depth map and a camera output by a network is accurate, the value of M is relatively large for an area with poor reconstruction effect, and the learning weight of the area is relatively set to be small; correspondingly, when calculating the loss, punishment should be carried out on the part of the area, and a smaller weight is set for the inaccurate area.

Preferably, the deep learning framework is a pyroch environment, and the version of the pyroch is more than 1.0.1. The network is built based on ResNet.

Further, after each epoch training is completed, testing the model trained until the current epoch, and evaluating the training effect of the current model by combining evaluation indexes; after the complete training is completed, the complete model training result is evaluated by combining the test set, and then the parameters of the model are adjusted to continue training so as to achieve the best training result. The evaluation index mainly comprises a root mean square error, a logarithmic root mean square error, an absolute relative error, a squared relative error and accuracy.

Preferably, the image is reconstructed with a reconstruction function as follows:

L _p ＝∑ _s ||pe(I _t -I _s-t )|| ₁ ，

I _s-t ＝I _s KT _t-s D _t K ^-1 ，

in which I _t Representing the target video frame to be depth estimated, I _s-t Representing video frames I from adjacent sources _s Camera pose T from source view to target view by combining camera internal parameter K _t-s Depth of target viewDegree diagram D _t A synthesized synthesis target view frame, pe representing a view synthesis loss; the specific view synthesis loss function is as follows:

where α, β are hyper-parameters, set here to 0.15, 0.85, respectively, ssim is a structural similarity function.

The invention has the advantages that: the depth estimation method of the invention can complete the depth estimation of the monocular video, and has the advantage that the result of the depth estimation can be kept stable between continuous video frames.

Drawings

FIG. 1 is a flow chart of a model of the present invention.

Detailed Description

For further understanding of the present invention, the objects, technical solutions and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings and examples. It is to be understood that the present invention is illustrative only and not limiting.

Example 1

The embodiment of the depth estimation method based on the deep learning monocular video can well realize stable depth estimation of continuous video frames.

Fig. 1 is a flowchart of a model for monocular video depth estimation based on deep learning according to this embodiment. The model mainly comprises two parts, namely a depth estimation network and a camera pose estimation network. Firstly, inputting a single color RGB picture into a depth estimation network for depth estimation, and then inputting two continuous video frames into a camera pose estimation network for relative camera pose estimation, wherein the two continuous video frames are pictures contained in the depth estimation. By combining the depth information output by the depth estimation network and the camera pose information output by the camera pose network, view synthesis, namely image reconstruction can be performed, a loss function is constructed to supervise the training of the network, and then the parameters are updated through direction propagation.

As shown in fig. 1, based on the depth estimation method in this embodiment, the specific implementation steps are as follows:

step S1: the deep learning framework adopted by the invention is a pyrrch, so the pyrrch environment needs to be configured before operation, and the version of the pyrrch is more than 1.0.1.

Step S2: preparation of experimental data, namely a KITTI data set is adopted as a data set, data needs to be processed before training, the data is processed into two parts which are respectively training sets for training models, the training sets usually comprise a large number of pictures, the pictures in the training sets are processed to be the same in size, and classification can be carried out according to different scenes; and the test set is used for verifying the model, the difference between the test set and the training set is the number of pictures, the test set only contains relatively small amount of picture data, and the test set also contains corresponding real depth data. At the same time, the still picture frames should be removed during the processing, because the still pictures do not satisfy the prior assumption of view synthesis.

And step S3: the model is trained, the network model is constructed based on ResNet, and the model can be trained by a server after a corresponding software running environment is configured. Different training configurations, such as different network layer numbers, can be performed during training, but it should be noted that the higher the network layer number is, the larger the required computing resources are, and it is necessary to ensure that the video card capacity of the server is sufficiently used. The maximum capacity used in the present invention is 12G. After the image depth information and the camera pose information are obtained, view synthesis or image reconstruction can be carried out according to an image reconstruction function described below, the pixel difference between a reconstructed image and an original image can be used for constructing a loss function, and a model is supervised and trained during a back propagation period;

and step S4: after each epoch training is completed, the model trained until the current epoch is tested, and the effect of the current model training is evaluated by combining the evaluation indexes. One epoch is used for training the model by using all the training set data, and the value of the epoch is set to be 150, namely 150 times of training are required;

step S5: after the complete training is completed, the complete model training result is evaluated by combining the test set, and then the parameters of the model are adjusted to continue training so as to achieve the best training result. The evaluation indexes mainly include 5 items, namely Root Mean Square Error (RMSE), log root mean square error (RMSE log), absolute relative error (abssel), square relative error (SqRel), and accuracy (% correct), namely, the depth data output by the network model and the real depth data are respectively calculated and compared. The adjustment of the model parameters of the network can be adjusted according to the speed of the training process, whether the loss function is reduced or not and the descending trend;

specifically, in order to solve the problem of unstable depth estimation of consecutive video frames, the patent proposes a timing sequence smoothing idea suitable for the situation, and proposes a new loss function term, which is defined as follows:

L _gs ＝|S _a -S _b |，

S _a ＝median(D _a )，

for ensuring the stability of the depth estimation of successive video frames.

Wherein L is _gs Representing the time-sequence smoothing loss function term for two consecutive video frames I _a 、I _b ，D _a 、D _b Represents I _a And I _b Result of depth estimation of S _a And S _b Then it is the time-sequence smoothing term of the consecutive video frames before and after, and mean represents the median-removing operation.

The image reconstruction function is as follows:

L _p ＝∑ _s ||pe(I _t -I _s-t )|| ₁ ，

I _s-t ＝I _s KT _t-s D _t K ^-1 ，

in which I _t Representing the target video frame to be depth estimated, I _s-t Representing video frames I from adjacent sources _s Camera pose T combining camera internal reference K and source view to target view _t-s Target view depth mapD _t The synthesized synthesis target view frame, pe, represents a view synthesis loss. The specific view synthesis loss function is as follows:

where α, β are hyper-parameters, set here to 0.15, 0.85, respectively, and ssim is a structural similarity function.

In order to solve the problem that the depth estimation of a dynamic object violating the static scene assumption is inaccurate during view synthesis, the invention provides a mask idea based on front and back view synthesis inconsistency, and the view synthesis loss of the dynamic object is restrained, and the specific method comprises the following steps:

for two front and back adjacent views I _a 、I _b After obtaining the depth information output by the depth estimation network and the camera pose output by the camera pose estimation network, the depth information and the camera pose can be obtained by a view I _b View synthesis is carried out to obtain a view I _a Composite Picture I at View Angle _a′ Then, the gray difference between the original view and the synthesized view, i.e. the view synthesis loss P, can be calculated _daa′ P can also be obtained _dbb′ . The mask M can be calculated after the view synthesis loss from the front frame to the rear frame and the view synthesis loss from the rear frame to the front frame are obtained, the value of M is as small as possible under the condition that the position and the posture of a depth map and a camera output by a network are accurate, and for an area with poor reconstruction effect, the value of M is relatively large, and the area with poor reconstruction effect is obtainedThe learning weight of the partial region should be set relatively small. Correspondingly, when calculating the loss, punishment should be carried out on the part of the area, and a smaller weight is set for the inaccurate area.

L _gs Two adjacent depth maps in front and back are constrained globally, and the embodiment also constrains the local part of the depth map, and the specific formula is as follows:

D′＝D/S

d ' is the depth map D ' from which the time sequence is smoothed ' _t-s (p) is the depth map D 'smoothed by the de-temporal sequence of the target view' _t Depth map at source view perspective, D ', synthesized in conjunction with camera pose' _s (p) depth map D which is a source view _s And (3) calculating the difference value of each point p in the depth map after sampling, and constraining the depth maps of the adjacent video frames pixel by pixel to ensure that the depth maps of the adjacent front and rear frames are consistent.

The embodiment provides a monocular video depth estimation method based on deep learning, which is characterized in that a depth map corresponding to a two-dimensional RGB picture is finally learned by learning a large amount of two-dimensional RGB picture data and combining a designed loss function by utilizing the strong fitting capacity of a convolutional neural network. The method solves the unstable phenomenon of the depth estimation on continuous video frames based on the monocular video depth estimation method.

Claims

1. A stable monocular video depth estimation method based on deep learning is characterized in that:

the method comprises the following steps of constructing a loss function to solve the problem of unstable depth estimation of continuous video frames, and specifically defining the following steps:

L _gs ＝|S _a -S _b |，

S _a ＝median(D _a )，

wherein L is _gs Representing the time-sequence smoothing loss function term, for two consecutive video frames I _a 、I _b ，D _a 、D _b Represents I _a And I _b Result of depth estimation of S _a And S _b Then, the time sequence smoothing item of the front and back continuous video frames is used, and the mean represents the median operation;

L _gs globally constraining two adjacent depth maps in front and back, and locally constraining the depth maps by using the following loss, wherein a specific formula is as follows:

D′＝D/S

d 'is the depth map after the depth map D has been subjected to time-series smoothing, D' _t-s (p) depth map D after de-temporal smoothing from the target view _t ' depth map under source view perspective, D, combined with Camera pose Synthesis _s ' (p) is a depth map D of the source view _s And (3) calculating the difference value of each point p in the depth map after sampling, and constraining the depth maps of the adjacent video frames pixel by pixel to ensure that the depth maps of the adjacent front and rear frames are consistent.

2. The method of claim 1, wherein the method comprises:

the view synthesis loss of the dynamic object is constrained to deal with the problem that the depth estimation of the dynamic object violating the static scene assumption is inaccurate when view synthesis is performed, and the specific method is as follows:

for two front and back adjacent views I _a 、I _b From view I, after obtaining depth information output by the depth estimation network and camera pose output by the camera pose estimation network _b View synthesis is performed to obtain a view I _a Composite Picture I at View Angle _a′ Then, the gray difference between the original view and the synthesized view, i.e. the view synthesis loss P, is calculated _daa′ P can also be obtained _dbb′ (ii) a And calculating to obtain the mask M after obtaining the view synthesis loss from the front frame to the rear frame and the view synthesis loss from the rear frame to the front frame.

3. The method of claim 2, wherein the method comprises: the deep learning framework is adopted as a pytorch environment, and the version of the torch is more than 1.0.1.

4. The method of claim 2, wherein the method comprises: the network is built based on ResNet.

5. The method of claim 2, wherein the method comprises: after each epoch is trained, testing the model trained until the current epoch, and evaluating the training effect of the current model by combining evaluation indexes; after the complete training is completed, the complete model training result is evaluated by combining the test set, and then the parameters of the model are adjusted to continue training so as to achieve the best training result.

6. The method for estimating depth of a stable monocular video based on deep learning according to claim 5, wherein: the evaluation indexes mainly comprise root mean square error, logarithmic root mean square error, absolute relative error, squared relative error and accuracy.

7. The method of claim 2, wherein the method comprises:

the image is reconstructed with the following reconstruction function:

L _p ＝∑ _s ||pe(I _t -I _s-t )|| ₁ ，

I _s-t ＝I _s KT _t-s D _t K ^-1 ，

wherein I _t Representing the target video frame to be depth estimated, I _s-t Representing video frames I from adjacent sources _s Camera pose T combining camera internal reference K and source view to target view _t-s Target view depth map D _t A synthesized target view frame, pe, representing a view synthesis penalty; the specific view synthesis loss function is as follows: