CN111783582A - Unsupervised monocular depth estimation algorithm based on deep learning - Google Patents

Unsupervised monocular depth estimation algorithm based on deep learning

Info

Publication number
CN111783582A
CN111783582A
Authority
CN
China
Prior art keywords
image
depth
loss
network
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010571133.XA
Other languages
Chinese (zh)
Inventor
王腾
高昊昇
薛磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010571133.XA priority Critical patent/CN111783582A/en
Publication of CN111783582A publication Critical patent/CN111783582A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content

Abstract

The invention discloses an unsupervised monocular depth estimation algorithm based on deep learning, which detects moving objects in a scene by comparing the difference between the optical flow generated by the camera motion and the full optical flow, and thereby improves the depth estimation performance of the algorithm.

Description

Unsupervised monocular depth estimation algorithm based on deep learning
Technical Field
The invention relates to a monocular depth estimation algorithm, in particular to an unsupervised monocular depth estimation algorithm based on deep learning.
Background
Computer vision simulates the human visual function through a computer, enabling the computer to have human-like capabilities of perceiving a real three-dimensional scene from two-dimensional planar images, including understanding and recognizing information in the scene, such as content, motion, and structure. However, techniques based on two-dimensional images suffer from some inherent drawbacks, since planar images lack depth information in three-dimensional space during imaging. Therefore, how to reconstruct three-dimensional information of a scene from a single image or multiple images, i.e. depth estimation, is a very important fundamental subject of current research in the field of computer vision. The depth refers to the distance between a point in a scene and a plane where a camera is located, depth information corresponding to an image can be described by a depth image, and the gray value of each pixel point of the depth image can be used for representing the distance between a certain point in the scene and the camera. With the progress of research, the depth estimation technology is gradually applied to the fields of intelligent robots, intelligent medical treatment, unmanned driving, target detection and tracking, face recognition, 3D video production and the like, and has great social value and economic value.
Depth estimation algorithms can be classified, according to the number of viewpoints of the scene images, into multi-view, binocular and monocular methods. Compared with the former two, a monocular image lacks rich spatial structure information, so monocular depth estimation is the most difficult of the three. However, depth estimation from a monocular image is convenient to use, low in cost and closest to practical application requirements, so it has high research value and is a hotspot in the current depth estimation field.
Most conventional depth estimation methods estimate image depth directly from visual cues. However, traditional methods have strict conditions of use and a relatively large computational burden. In recent years, deep learning has developed rapidly, and image depth estimation methods combined with deep learning have therefore begun to attract the attention of researchers. Monocular depth estimation algorithms based on deep learning can be classified as supervised or unsupervised according to whether real depth labels are used. Supervised methods take a single image as training data, treat depth estimation as a dense regression task, and fit depth values with a convolutional neural network. Their disadvantages are also apparent: they rely on a large amount of labeled data, and obtaining the corresponding depth labels is expensive. Unsupervised methods draw inspiration from traditional motion-based methods, use a continuous image sequence as training data, and infer the three-dimensional structure of the scene from the motion of the camera. However, such methods must assume that the only motion in the scene is that of the camera, i.e. they neglect moving objects such as vehicles and pedestrians. Their prediction accuracy can be greatly affected when many moving objects are present in the scene.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides an algorithm for unsupervised estimation of monocular image depth without depending on labels.
The technical scheme is as follows: an unsupervised monocular depth estimation algorithm based on deep learning comprises the following steps:
Step 1: process a video shot by a monocular camera to obtain an image sequence of length N, taking the intermediate frame of the sequence as the target image I_t and the remaining frames as the source images I_s.
Step 2: input the target image I_t obtained in step 1 into the constructed depth network DepthNet to obtain a depth image D̂_t; input the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension into the constructed camera pose network PoseNet to obtain the camera pose transformation T̂_{t→s}; from the depth image D̂_t and the camera pose transformation T̂_{t→s}, solve for the rigid optical flow F̂_rig caused by the rigid motion of the camera, then reconstruct the image Î_t^rig and calculate the depth smoothing loss L_ds.
Step 3: input the image sequence obtained in step 1 into the constructed optical flow network FlowNet to obtain the full optical flow F̂_full caused by both the camera motion and the motion of objects themselves; based on the full optical flow F̂_full, reconstruct the image Î_t^full and calculate the reconstruction loss L_r^full and the adversarial loss L_adv.
Step 4: compare the rigid optical flow F̂_rig obtained in step 2 with the full optical flow F̂_full obtained in step 3 to obtain a moving-object mask M̂; based on the moving-object mask M̂, calculate the optical flow consistency loss L_fc and the rigid reconstruction loss L_r^rig.
Step 5: construct the loss function L_total from the adversarial loss L_adv, the optical flow consistency loss L_fc, the rigid reconstruction loss L_r^rig, the reconstruction loss L_r^full and the depth smoothing loss L_ds, and iterate until the loss function L_total converges, obtaining the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet.
Step 6: input the images to be estimated into the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet respectively, to obtain unsupervised estimates of the corresponding image depth, camera pose and motion optical flow.
Further, the depth network DepthNet in step 2 is a fully convolutional network comprising an encoder and a decoder connected by cross-layer (skip) connections; the depth image D̂_t is a grayscale image with the same resolution as the input target image I_t.
Further, solving in step 2 for the rigid optical flow F̂_rig caused by the rigid motion of the camera from the depth image D̂_t and the camera pose transformation T̂_{t→s} comprises the following steps:
the projected coordinates p̂_s of a pixel in the source image I_s are calculated according to equation (1):
p̂_s ~ K T̂_{t→s} D̂_t(p_t) K⁻¹ p_t    (1)
where p_t is the homogeneous coordinate of the pixel in the target image I_t and K is the camera intrinsic matrix;
the optical flow at the pixel is calculated according to equation (2):
F̂_rig(p_t) = p̂_s − p_t    (2)
Reconstructing the image Î_t^rig in step 2 comprises: sampling, in the source image I_s, the pixels surrounding the projected coordinates p̂_s and obtaining Î_t^rig(p_t) by bilinear interpolation, thereby reconstructing Î_t^rig.
Further, the depth smoothing loss L_ds in step 2 is calculated according to equation (3):
L_ds = Σ_{p_t} ( |∂_x D̂_t(p_t)| · e^{−|∂_x I_t(p_t)|} + |∂_y D̂_t(p_t)| · e^{−|∂_y I_t(p_t)|} )    (3)
where ∂_x and ∂_y denote the longitudinal and transverse gradients, respectively, and p_t is the homogeneous coordinate of a pixel in the target image I_t.
Further, the optical flow network FlowNet in step 3 is a generative adversarial network comprising a generator and a discriminator, wherein the generator accepts the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension as input and outputs the full optical flow F̂_full; the discriminator receives the target image I_t and the reconstructed image Î_t^full as input, regards the target image I_t as a real image and the reconstructed image Î_t^full as a generated image, and outputs a probability value representing the probability that its input is a real image.
Further, the structure of the generator is consistent with that of the depth network DepthNet.
Further, in step 3 the reconstruction loss L_r^full is calculated according to equation (4):
L_r^full = Σ_{p_t} V̂_full(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^full(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^full(p_t)| ]    (4)
where SSIM denotes the structural similarity index, w is a weighting parameter, and V̂_full is the effective mask corresponding to the full optical flow F̂_full.
Further, in step 3 the adversarial loss L_adv is calculated according to equation (5):
L_adv = E_{I∼X}[ log D(I) ] + E_{Î∼X̂}[ log(1 − D(Î)) ]    (5)
where G and D denote the generator and the discriminator, respectively, I and X denote a real image and the data distribution of real images, and Î and X̂ denote a generated image and the data distribution of generated images.
Further, in step 4 the moving-object mask M̂ is obtained according to equation (6):
M̂(p_t) = 1( |F̂_full(p_t) − F̂_rig(p_t)| > α )    (6)
where 1(·) is the indicator function and α is a threshold;
the optical flow consistency loss L_fc is obtained according to equation (7):
L_fc = Σ_{p_t} (1 − M̂(p_t)) · |F̂_full(p_t) − F̂_rig(p_t)|    (7)
the rigid reconstruction loss L_r^rig is obtained according to equation (8):
L_r^rig = Σ_{p_t} (1 − M̂(p_t)) · V̂_rig(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^rig(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^rig(p_t)| ]    (8)
Further, the loss function L_total in step 5 is expressed as:
L_total = λ_adv·L_adv + λ_ds·L_ds + λ_r·L_r^rig + λ_f·L_r^full + λ_fc·L_fc    (9)
where λ_adv, λ_ds, λ_r, λ_f and λ_fc are the weights of the corresponding loss terms.
Advantageous effects: compared with the prior art, the invention has the following advantages:
1. the method detects moving objects in the scene by comparing the difference between the optical flow generated by the camera motion and the full optical flow, and thereby improves the depth estimation performance of the algorithm;
2. by detecting dynamic objects in the scene, the method effectively enhances the accuracy and robustness of the algorithm;
3. the method uses videos shot by a monocular camera as training data and does not require expensive depth labels; by modeling moving objects, it greatly reduces their influence on the unsupervised method, so that the algorithm achieves good results in the monocular depth estimation, camera pose prediction and optical flow estimation tasks;
4. the adversarial loss introduced by the generative adversarial network structure significantly improves the accuracy of the optical flow prediction.
Drawings
FIG. 1 is a schematic view of a model structure;
FIG. 2 is the generative adversarial network structure of the optical flow network FlowNet;
FIG. 3 is an example of a target image I_t and the corresponding depth image D̂_t;
FIG. 4 is an example showing, from top to bottom, the target image I_t, the source image I_s and the rigid optical flow F̂_rig;
FIG. 5 is an example showing, from top to bottom, the target image I_t, the reconstructed image Î_t^rig and the corresponding effective mask V̂_rig;
FIG. 6 is an example showing, from top to bottom, the target image I_t, the source image I_s and the full optical flow F̂_full;
FIG. 7 is an example showing, from top to bottom, the rigid optical flow F̂_rig, the full optical flow F̂_full and the moving-object mask M̂.
Detailed Description
The technical solution of the present invention will be further explained with reference to the accompanying drawings and examples.
Referring to FIG. 1, the algorithm model of the invention is composed of a depth network DepthNet, a camera pose network PoseNet and an optical flow network FlowNet. The depth network DepthNet outputs a depth image whose resolution equals that of the monocular input image, with the depth value represented by gray scale; the camera pose network PoseNet estimates, between adjacent frames, the pose transformation of the camera in three-dimensional space; and the optical flow network FlowNet estimates the full optical flow between adjacent frames. FIG. 2 shows the generative adversarial network structure of the optical flow network FlowNet.
Based on this model, the unsupervised monocular depth estimation algorithm based on deep learning is designed: it detects moving objects in the scene by comparing the difference between the optical flow generated by the camera motion and the full optical flow, and thereby improves the depth estimation performance of the algorithm. For video from a moving monocular camera and without any training labels, the method achieves unsupervised estimation of the depth image, the camera pose and the motion optical flow, with excellent prediction accuracy on all three tasks.
The method specifically comprises the following steps:
Step 1: videos shot by a monocular camera are used as the training set, and the camera intrinsic matrix K is assumed known. After processing, a series of image sequences of length N = 3 are obtained as the data finally input to the model, where the intermediate frame is taken as the target image I_t and the remaining frames are used as the source images I_s.
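For concreteness, the slicing of a video into such training samples can be sketched as follows (a minimal illustration; the helper name and the list-based interface are assumptions, not taken from the patent):

```python
def make_training_triples(frames):
    """Yield (target, sources) samples from a list of consecutive video frames (N = 3)."""
    for i in range(1, len(frames) - 1):
        I_t = frames[i]                       # intermediate frame: target image
        I_s = [frames[i - 1], frames[i + 1]]  # remaining frames: source images
        yield I_t, I_s
```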
Step 2: construct the depth network DepthNet and the camera pose network PoseNet and, with the input from step 1, output the depth image D̂_t and the camera pose transformation T̂_{t→s}, respectively; solve for the optical flow F̂_rig caused by the rigid motion of the camera and the corresponding effective mask V̂_rig; then reconstruct the image Î_t^rig and calculate the depth smoothing loss L_ds.
The structure of the depth network DepthNet is described as follows: DepthNet is a fully convolutional network with an encoder-decoder architecture and cross-layer (skip) connections between the encoder and decoder. The encoder consists of 7 pairs of convolutional layers with convolution strides of 2 and 1, respectively, and with 32, 64, 128, 256, 512 and 512 convolution kernels; the decoder consists of a series of successive deconvolution and convolution layers and finally, as shown in FIG. 3, outputs a grayscale image D̂_t with the same resolution as the input target image I_t, where the gray value represents the depth at that pixel. The convolution kernel size of all layers is 3, except for the first 2 pairs of convolutional layers of the encoder, whose kernel sizes are 7 and 5. All layers except the final output layer use the LeakyReLU activation function and batch normalization.
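A simplified PyTorch sketch of an encoder-decoder of this kind is given below. It only loosely follows the layer counts listed above (the channel widths, the skip wiring, the sigmoid output scaling and the requirement that the input resolution be divisible by 16 are simplifications of this sketch), so it illustrates the architectural style rather than the exact DepthNet:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k, stride):
    # convolution + batch normalization + LeakyReLU, as described for DepthNet
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class SimpleDepthNet(nn.Module):
    """Illustrative encoder-decoder; channel widths and depth are assumptions."""
    def __init__(self):
        super().__init__()
        chs = [32, 64, 128, 256]              # shortened version of the encoder widths
        self.encoder = nn.ModuleList()
        c_prev = 3
        for i, c in enumerate(chs):
            k = 7 if i == 0 else (5 if i == 1 else 3)
            self.encoder.append(nn.Sequential(conv_block(c_prev, c, k, 2),
                                              conv_block(c, c, 3, 1)))
            c_prev = c
        self.decoder = nn.ModuleList()
        for c_skip, c in zip([128, 64, 32, 0], [128, 64, 32, 16]):
            self.decoder.append(conv_block(c_prev + c_skip, c, 3, 1))
            c_prev = c
        self.out = nn.Conv2d(c_prev, 1, 3, 1, 1)   # single-channel depth map

    def forward(self, x):                          # x: (B, 3, H, W), H and W divisible by 16
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        for i, dec in enumerate(self.decoder):
            x = nn.functional.interpolate(x, scale_factor=2, mode='nearest')
            if i + 2 <= len(skips):
                x = torch.cat([x, skips[-(i + 2)]], dim=1)   # cross-layer (skip) connection
            x = dec(x)
        return torch.sigmoid(self.out(x))          # depth expressed as a grayscale image
```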
The structure of the camera pose network PoseNet is described as follows: PoseNet consists of 7 convolutional layers with 16, 32, 64, 128, 256 and 256 convolution kernels and a convolution stride of 2; except for the first 2 layers, whose kernel sizes are set to 7 and 5, all convolution kernel sizes are 3. PoseNet receives the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension as input and finally outputs the camera pose transformation T̂_{t→s} through a 6-channel 1 × 1 convolutional layer, representing the rigid motion of the camera in space from the target image I_t to the source image I_s, comprising 3 Euler angles and 3 translation components.
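A matching sketch of the pose regressor, reusing the conv_block helper above, is shown below for a single target-source pair (the seventh channel width and the global-mean readout are assumptions):

```python
class SimplePoseNet(nn.Module):
    """Illustrative pose regressor: stride-2 convolutions followed by a 6-channel 1x1 layer."""
    def __init__(self):
        super().__init__()
        chs = [16, 32, 64, 128, 256, 256, 256]   # the 7th width is assumed equal to the 6th
        layers, c_prev = [], 6                   # target and source RGB images concatenated by channel
        for i, c in enumerate(chs):
            k = 7 if i == 0 else (5 if i == 1 else 3)
            layers.append(conv_block(c_prev, c, k, 2))
            c_prev = c
        self.features = nn.Sequential(*layers)
        self.pose = nn.Conv2d(c_prev, 6, 1)      # 3 Euler angles + 3 translations

    def forward(self, x):
        return self.pose(self.features(x)).mean(dim=[2, 3])  # (B, 6) camera pose transformation
```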
The camera-motion optical flow F̂_rig and the reconstructed image Î_t^rig are obtained as follows: let p_t be the homogeneous coordinate of a pixel in the target image I_t. Combining the depth image D̂_t and the camera pose transformation T̂_{t→s}, the projected coordinates p̂_s of this pixel in the source image I_s are obtained:
p̂_s ~ K T̂_{t→s} D̂_t(p_t) K⁻¹ p_t    (1)
The optical flow at the pixel is then:
F̂_rig(p_t) = p̂_s − p_t    (2)
The optical flow represents the change of position of the same pixel between the target image and the source image. FIG. 4 shows, from top to bottom, an example of the target image I_t, the source image I_s and the rigid optical flow F̂_rig.
Since p̂_s may fall outside the image boundary, a corresponding effective mask V̂_rig needs to be established. When reconstructing the image Î_t^rig, because p̂_s takes continuous values, Î_t^rig(p_t) is computed by bilinear interpolation of the 4 pixels surrounding p̂_s sampled on the source image, thereby reconstructing Î_t^rig. FIG. 5 shows, from top to bottom, an example of the target image I_t, the reconstructed image Î_t^rig and the corresponding effective mask V̂_rig.
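The sketch below shows how equations (1)-(2), the bilinear reconstruction and the effective mask can be implemented with grid sampling; it assumes T̂_{t→s} is already given as a 3 × 4 rigid-motion matrix and K as a 3 × 3 intrinsic matrix, and the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def rigid_flow_and_reconstruction(I_s, D_t, T_t2s, K):
    """I_s: (B,3,H,W) source image, D_t: (B,1,H,W) depth, T_t2s: (B,3,4) pose, K: (B,3,3) intrinsics."""
    B, _, H, W = D_t.shape
    device = D_t.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing='ij')
    p_t = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().reshape(1, 3, -1).expand(B, 3, -1)
    # equation (1): back-project with the depth, move by the camera pose, re-project
    cam = D_t.reshape(B, 1, -1) * (torch.inverse(K) @ p_t)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    proj = K @ (T_t2s @ cam_h)
    p_s = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # equation (2): the rigid optical flow is the displacement of the projected pixel
    flow_rig = (p_s - p_t[:, :2]).reshape(B, 2, H, W)
    # bilinear reconstruction of the target image by sampling the source image
    grid = torch.stack([2 * p_s[:, 0] / (W - 1) - 1,
                        2 * p_s[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    I_t_rec = F.grid_sample(I_s, grid, mode='bilinear', align_corners=True)
    # effective mask: projections that stay inside the image boundary
    V_rig = (grid.abs() <= 1).all(dim=-1, keepdim=True).permute(0, 3, 1, 2).float()
    return flow_rig, I_t_rec, V_rig
```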
The depth smoothing loss L_ds is calculated as:
L_ds = Σ_{p_t} ( |∂_x D̂_t(p_t)| · e^{−|∂_x I_t(p_t)|} + |∂_y D̂_t(p_t)| · e^{−|∂_y I_t(p_t)|} )    (3)
where ∂_x and ∂_y denote the longitudinal and transverse gradients, respectively. The depth smoothing loss L_ds allows large depth changes at object contours in the depth image while keeping the depth as smooth as possible elsewhere.
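One common edge-aware implementation of this penalty is sketched below (continuing the imports above); the exponential image-gradient weighting is an assumption about the exact form of equation (3):

```python
def depth_smoothness_loss(depth, image):
    """Penalise depth gradients, down-weighted where the image itself has strong edges."""
    d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    i_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```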
Step 3: construct the optical flow network FlowNet and, with the input from step 1, output the full optical flow F̂_full caused by both the camera motion and the motion of objects themselves; solve for the corresponding effective mask V̂_full; then reconstruct the image Î_t^full and calculate the reconstruction loss L_r^full and the adversarial loss L_adv.
The structure of the optical flow network FlowNet is described as follows: FlowNet takes the form of the generative adversarial network shown in FIG. 2 and is composed of a generator and a discriminator. The generator accepts the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension as input and outputs the full optical flow F̂_full shown in FIG. 6; this optical flow is caused by both the camera motion and the motion of objects themselves. The structure of the generator is identical to that of the depth network DepthNet except that the number of channels of the final output layer is 2. Combining step 2, the image Î_t^full can be reconstructed from the full optical flow F̂_full and the corresponding effective mask V̂_full can be constructed. The reconstruction loss L_r^full is calculated as follows:
L_r^full = Σ_{p_t} V̂_full(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^full(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^full(p_t)| ]    (4)
where SSIM denotes the structural similarity index and the parameter w is set to 0.85. In theory, if the depth estimation and the camera pose estimation were error-free, Î_t^full and I_t would be identical inside the effective mask V̂_full and the reconstruction loss would be zero. The discriminator receives I_t and Î_t^full as input, treating I_t as a real image and Î_t^full as a generated image, and outputs a probability value representing the probability that the corresponding input image is a real image. The structure of the discriminator is similar to PoseNet: it consists of 7 convolutional layers whose output is finally passed through global average pooling and a sigmoid activation function.
The adversarial loss L_adv is formulated as:
L_adv = E_{I∼X}[ log D(I) ] + E_{Î∼X̂}[ log(1 − D(Î)) ]    (5)
where G and D denote the generator and the discriminator parts, respectively, I and X are a real image and the data distribution of real images, and Î and X̂ are a generated image and the data distribution of generated images.
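A hedged sketch of the SSIM-plus-L1 reconstruction loss of equation (4) and the adversarial objective of equation (5) follows (continuing the imports above); the single-scale 3 × 3 SSIM, the mask-normalised averaging and the epsilon stabiliser are assumptions of this sketch:

```python
def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified per-pixel SSIM map computed with 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sxy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sx + sy + C2)
    return (num / den).clamp(0, 1)

def reconstruction_loss(I_t, I_rec, V, w=0.85):
    """Equation (4): SSIM + L1 photometric error accumulated inside the effective mask V."""
    l1 = (I_t - I_rec).abs().mean(1, keepdim=True)
    dssim = (1 - ssim(I_t, I_rec).mean(1, keepdim=True)) / 2
    return ((w * dssim + (1 - w) * l1) * V).sum() / V.sum().clamp(min=1.0)

def adversarial_loss(D_net, I_real, I_fake, eps=1e-8):
    """Equation (5): standard GAN objective; D_net outputs the probability of its input being real."""
    return torch.log(D_net(I_real) + eps).mean() + torch.log(1 - D_net(I_fake) + eps).mean()
```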
Step 4: compare the difference between the rigid optical flow F̂_rig and the full optical flow F̂_full to detect moving objects, output the moving-object mask M̂, and compute the optical flow consistency loss L_fc; at the same time, combining step 2, compute the rigid reconstruction loss L_r^rig.
The moving-object mask M̂ is constructed as:
M̂(p_t) = 1( |F̂_full(p_t) − F̂_rig(p_t)| > α )    (6)
where 1(·) is the indicator function and the threshold α is set to 7. In theory, if F̂_rig and F̂_full are estimated without error, the difference between the two optical flows should be large at moving objects, while the two optical flow values should be exactly equal over the static background. FIG. 7 shows, from top to bottom, an example of the rigid optical flow F̂_rig, the full optical flow F̂_full and the moving-object mask M̂.
The optical flow consistency loss L_fc is:
L_fc = Σ_{p_t} (1 − M̂(p_t)) · |F̂_full(p_t) − F̂_rig(p_t)|    (7)
This loss ensures that F̂_full and F̂_rig are as equal as possible over the static background.
The rigid reconstruction loss L_r^rig is:
L_r^rig = Σ_{p_t} (1 − M̂(p_t)) · V̂_rig(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^rig(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^rig(p_t)| ]    (8)
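The moving-object mask of equation (6) and the losses of equations (7)-(8) can be sketched as follows; the flow-difference norm, the averaging over the static background and the reuse of reconstruction_loss from the previous sketch are assumptions about the exact formulation:

```python
def moving_object_mask(flow_full, flow_rig, alpha=7.0):
    """Equation (6): pixels where the full and rigid optical flows disagree by more than alpha."""
    diff = (flow_full - flow_rig).norm(dim=1, keepdim=True)
    return (diff > alpha).float()

def flow_consistency_loss(flow_full, flow_rig, M):
    """Equation (7): penalise flow disagreement only over the static background (M == 0)."""
    diff = (flow_full - flow_rig).abs().sum(1, keepdim=True)
    static = 1.0 - M
    return (diff * static).sum() / static.sum().clamp(min=1.0)

def rigid_reconstruction_loss(I_t, I_rec_rig, V_rig, M, w=0.85):
    """Equation (8): photometric loss of the rigid reconstruction, restricted to the static background."""
    return reconstruction_loss(I_t, I_rec_rig, V_rig * (1.0 - M), w)
```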
Step 5: construct the loss function L_total from the adversarial loss L_adv, the optical flow consistency loss L_fc, the rigid reconstruction loss L_r^rig, the reconstruction loss L_r^full and the depth smoothing loss L_ds, and minimize L_total with the Adam optimizer until convergence, obtaining the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet.
The final loss function L_total is formulated as:
L_total = λ_adv·L_adv + λ_ds·L_ds + λ_r·L_r^rig + λ_f·L_r^full + λ_fc·L_fc    (9)
where λ_adv, λ_ds, λ_r, λ_f and λ_fc are the weights of the respective loss terms, set to 0.005, 1, 10, 1 and 0.01, respectively. The parameters β_1 and β_2 of the Adam optimizer are 0.9 and 0.999, respectively. During model training, the initial learning rate is 0.0002 and the batch size is set to 8.
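Putting the pieces together, the total objective of equation (9) and the optimizer setup can be sketched as follows; the mapping of λ_r to the rigid reconstruction loss and λ_f to the full-flow reconstruction loss is an assumption, and the network variable names are illustrative:

```python
lambdas = {'adv': 0.005, 'ds': 1.0, 'r': 10.0, 'f': 1.0, 'fc': 0.01}

# depth_net, pose_net and flow_generator are instances of the three trainable networks
params = (list(depth_net.parameters()) + list(pose_net.parameters())
          + list(flow_generator.parameters()))
optimizer = torch.optim.Adam(params, lr=2e-4, betas=(0.9, 0.999))   # initial learning rate 0.0002

def total_loss(L_adv, L_ds, L_r_rig, L_r_full, L_fc):
    """Equation (9): weighted sum of the five loss terms."""
    return (lambdas['adv'] * L_adv + lambdas['ds'] * L_ds
            + lambdas['r'] * L_r_rig + lambdas['f'] * L_r_full + lambdas['fc'] * L_fc)
```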
Step 6: input the images to be estimated into the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet respectively, to obtain unsupervised estimates of the corresponding image depth, camera pose and motion optical flow.

Claims (10)

1. An unsupervised monocular depth estimation algorithm based on deep learning, characterized in that the method comprises the following steps:
Step 1: process a video shot by a monocular camera to obtain an image sequence of length N, taking the intermediate frame of the sequence as the target image I_t and the remaining frames as the source images I_s;
Step 2: input the target image I_t obtained in step 1 into the constructed depth network DepthNet to obtain a depth image D̂_t; input the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension into the constructed camera pose network PoseNet to obtain the camera pose transformation T̂_{t→s}; from the depth image D̂_t and the camera pose transformation T̂_{t→s}, solve for the rigid optical flow F̂_rig caused by the rigid motion of the camera, then reconstruct the image Î_t^rig and calculate the depth smoothing loss L_ds;
Step 3: input the image sequence obtained in step 1 into the constructed optical flow network FlowNet to obtain the full optical flow F̂_full caused by both the camera motion and the motion of objects themselves; based on the full optical flow F̂_full, reconstruct the image Î_t^full and calculate the reconstruction loss L_r^full and the adversarial loss L_adv;
Step 4: compare the rigid optical flow F̂_rig obtained in step 2 with the full optical flow F̂_full obtained in step 3 to obtain a moving-object mask M̂; based on the moving-object mask M̂, calculate the optical flow consistency loss L_fc and the rigid reconstruction loss L_r^rig;
Step 5: construct the loss function L_total from the adversarial loss L_adv, the optical flow consistency loss L_fc, the rigid reconstruction loss L_r^rig, the reconstruction loss L_r^full and the depth smoothing loss L_ds, and iterate until the loss function L_total converges, obtaining the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet;
Step 6: input the images to be estimated into the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet respectively, to obtain unsupervised estimates of the corresponding image depth, camera pose and motion optical flow.
2. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: the depth network DepthNet in step 2 is a fully convolutional network comprising an encoder and a decoder connected by cross-layer (skip) connections; the depth image D̂_t is a grayscale image with the same resolution as the input target image I_t.
3. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: solving in step 2 for the rigid optical flow F̂_rig caused by the rigid motion of the camera from the depth image D̂_t and the camera pose transformation T̂_{t→s} comprises the following steps:
the projected coordinates p̂_s of a pixel in the source image I_s are calculated according to equation (1):
p̂_s ~ K T̂_{t→s} D̂_t(p_t) K⁻¹ p_t    (1)
where p_t is the homogeneous coordinate of the pixel in the target image I_t and K is the camera intrinsic matrix;
the optical flow at the pixel is calculated according to equation (2):
F̂_rig(p_t) = p̂_s − p_t    (2)
reconstructing the image Î_t^rig in step 2 comprises: sampling, in the source image I_s, the pixels surrounding the projected coordinates p̂_s and obtaining Î_t^rig(p_t) by bilinear interpolation, thereby reconstructing Î_t^rig.
4. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1 or 3, wherein: the depth smoothing loss L_ds in step 2 is calculated according to equation (3):
L_ds = Σ_{p_t} ( |∂_x D̂_t(p_t)| · e^{−|∂_x I_t(p_t)|} + |∂_y D̂_t(p_t)| · e^{−|∂_y I_t(p_t)|} )    (3)
where ∂_x and ∂_y denote the longitudinal and transverse gradients, respectively, and p_t is the homogeneous coordinate of a pixel in the target image I_t.
5. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: the optical flow network FlowNet in step 3 is a generative adversarial network comprising a generator and a discriminator, wherein the generator receives the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension as input and outputs the full optical flow F̂_full; the discriminator receives the target image I_t and the reconstructed image Î_t^full as input, regards the target image I_t as a real image and the reconstructed image Î_t^full as a generated image, and outputs a probability value representing the probability that its input is a real image.
6. The unsupervised monocular depth estimation algorithm based on deep learning of claim 5, wherein: the structure of the generator is consistent with that of the depth network DepthNet.
7. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: in step 3, the reconstruction loss L_r^full is calculated according to equation (4):
L_r^full = Σ_{p_t} V̂_full(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^full(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^full(p_t)| ]    (4)
where SSIM denotes the structural similarity index, w is a weighting parameter, and V̂_full is the effective mask corresponding to the full optical flow F̂_full.
8. The unsupervised monocular depth estimation algorithm based on deep learning of claim 5, wherein: in step 3, the adversarial loss L_adv is calculated according to equation (5):
L_adv = E_{I∼X}[ log D(I) ] + E_{Î∼X̂}[ log(1 − D(Î)) ]    (5)
where G and D denote the generator and the discriminator, respectively, I and X denote a real image and the data distribution of real images, and Î and X̂ denote a generated image and the data distribution of generated images.
9. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: in step 4, the moving-object mask M̂ is obtained according to equation (6):
M̂(p_t) = 1( |F̂_full(p_t) − F̂_rig(p_t)| > α )    (6)
where 1(·) is the indicator function and α is a threshold;
the optical flow consistency loss L_fc is obtained according to equation (7):
L_fc = Σ_{p_t} (1 − M̂(p_t)) · |F̂_full(p_t) − F̂_rig(p_t)|    (7)
the rigid reconstruction loss L_r^rig is obtained according to equation (8):
L_r^rig = Σ_{p_t} (1 − M̂(p_t)) · V̂_rig(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^rig(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^rig(p_t)| ]    (8)
10. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: the loss function L_total in step 5 is expressed as:
L_total = λ_adv·L_adv + λ_ds·L_ds + λ_r·L_r^rig + λ_f·L_r^full + λ_fc·L_fc    (9)
where λ_adv, λ_ds, λ_r, λ_f and λ_fc are the weights of the corresponding loss terms.
CN202010571133.XA 2020-06-22 2020-06-22 Unsupervised monocular depth estimation algorithm based on deep learning Pending CN111783582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010571133.XA CN111783582A (en) 2020-06-22 2020-06-22 Unsupervised monocular depth estimation algorithm based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010571133.XA CN111783582A (en) 2020-06-22 2020-06-22 Unsupervised monocular depth estimation algorithm based on deep learning

Publications (1)

Publication Number Publication Date
CN111783582A true CN111783582A (en) 2020-10-16

Family

ID=72756281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010571133.XA Pending CN111783582A (en) 2020-06-22 2020-06-22 Unsupervised monocular depth estimation algorithm based on deep learning

Country Status (1)

Country Link
CN (1) CN111783582A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112396657A (en) * 2020-11-25 2021-02-23 河北工程大学 Neural network-based depth pose estimation method and device and terminal equipment
CN113139990A (en) * 2021-05-08 2021-07-20 电子科技大学 Depth grid stream robust image alignment method based on content perception
CN113160294A (en) * 2021-03-31 2021-07-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN113379821A (en) * 2021-06-23 2021-09-10 武汉大学 Stable monocular video depth estimation method based on deep learning
CN113724155A (en) * 2021-08-05 2021-11-30 中山大学 Self-boosting learning method, device and equipment for self-supervision monocular depth estimation
CN114066987A (en) * 2022-01-12 2022-02-18 深圳佑驾创新科技有限公司 Camera pose estimation method, device, equipment and storage medium
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN116164770A (en) * 2023-04-23 2023-05-26 禾多科技(北京)有限公司 Path planning method, path planning device, electronic equipment and computer readable medium
WO2023178951A1 (en) * 2022-03-25 2023-09-28 上海商汤智能科技有限公司 Image analysis method and apparatus, model training method and apparatus, and device, medium and program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522828A (en) * 2018-11-01 2019-03-26 上海科技大学 A kind of accident detection method and system, storage medium and terminal
CN109977847A (en) * 2019-03-22 2019-07-05 北京市商汤科技开发有限公司 Image generating method and device, electronic equipment and storage medium
CN110705376A (en) * 2019-09-11 2020-01-17 南京邮电大学 Abnormal behavior detection method based on generative countermeasure network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522828A (en) * 2018-11-01 2019-03-26 上海科技大学 A kind of accident detection method and system, storage medium and terminal
CN109977847A (en) * 2019-03-22 2019-07-05 北京市商汤科技开发有限公司 Image generating method and device, electronic equipment and storage medium
CN110705376A (en) * 2019-09-11 2020-01-17 南京邮电大学 Abnormal behavior detection method based on generative countermeasure network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAO HAOSHENG,TENG WANG: "Unsupervised Learning of Monocular Depth from Videos", 《2019 CHINESE AUTOMATION CONGRESS (CAC)》 *
WEI-SHENG LAI等: "Semi-Supervised Learning for Optical Flow with Generative Adversarial Networks", 《《31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2017)》》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112396657A (en) * 2020-11-25 2021-02-23 河北工程大学 Neural network-based depth pose estimation method and device and terminal equipment
CN113160294A (en) * 2021-03-31 2021-07-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN113139990A (en) * 2021-05-08 2021-07-20 电子科技大学 Depth grid stream robust image alignment method based on content perception
CN113379821A (en) * 2021-06-23 2021-09-10 武汉大学 Stable monocular video depth estimation method based on deep learning
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN113724155A (en) * 2021-08-05 2021-11-30 中山大学 Self-boosting learning method, device and equipment for self-supervision monocular depth estimation
CN113724155B (en) * 2021-08-05 2023-09-05 中山大学 Self-lifting learning method, device and equipment for self-supervision monocular depth estimation
CN114066987A (en) * 2022-01-12 2022-02-18 深圳佑驾创新科技有限公司 Camera pose estimation method, device, equipment and storage medium
WO2023178951A1 (en) * 2022-03-25 2023-09-28 上海商汤智能科技有限公司 Image analysis method and apparatus, model training method and apparatus, and device, medium and program
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN114998411B (en) * 2022-04-29 2024-01-09 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss
CN116164770A (en) * 2023-04-23 2023-05-26 禾多科技(北京)有限公司 Path planning method, path planning device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
CN111783582A (en) Unsupervised monocular depth estimation algorithm based on deep learning
Zhai et al. Optical flow and scene flow estimation: A survey
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN109377530B (en) Binocular depth estimation method based on depth neural network
Lv et al. Learning rigidity in dynamic scenes with a moving camera for 3d motion field estimation
Yan et al. Ddrnet: Depth map denoising and refinement for consumer depth cameras using cascaded cnns
JP7177062B2 (en) Depth Prediction from Image Data Using Statistical Model
WO2019174377A1 (en) Monocular camera-based three-dimensional scene dense reconstruction method
CN111105432A (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN115187638B (en) Unsupervised monocular depth estimation method based on optical flow mask
Wang et al. Depth estimation of video sequences with perceptual losses
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
Song et al. Depth estimation from a single image using guided deep network
Liu et al. A survey on deep learning methods for scene flow estimation
Feng et al. Deep depth estimation on 360 images with a double quaternion loss
JP2024510230A (en) Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture
Wang et al. Recurrent neural network for learning densedepth and ego-motion from video
Durasov et al. Double refinement network for efficient monocular depth estimation
Basak et al. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image
Kashyap et al. Sparse representations for object-and ego-motion estimations in dynamic scenes
CN113436254A (en) Cascade decoupling pose estimation method
Hou et al. Joint learning of image deblurring and depth estimation through adversarial multi-task network
Rabby et al. Beyondpixels: A comprehensive review of the evolution of neural radiance fields
Khan et al. Towards monocular neural facial depth estimation: Past, present, and future

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20201016