CN111783582A - Unsupervised monocular depth estimation algorithm based on deep learning - Google Patents

Unsupervised monocular depth estimation algorithm based on deep learning

Info

Publication number
CN111783582A
CN111783582A
Authority
CN
China
Prior art keywords
image
depth
loss
network
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010571133.XA
Other languages
Chinese (zh)
Inventor
王腾
高昊昇
薛磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010571133.XA priority Critical patent/CN111783582A/en
Publication of CN111783582A publication Critical patent/CN111783582A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content

Abstract

The invention discloses an unsupervised monocular depth estimation algorithm based on deep learning, which detects moving objects in a scene by comparing the difference between the optical flow generated by the camera motion and the full optical flow, and thereby improves the depth estimation performance of the algorithm.

Description

Unsupervised monocular depth estimation algorithm based on deep learning
Technical Field
The invention relates to a monocular depth estimation algorithm, in particular to an unsupervised monocular depth estimation algorithm based on deep learning.
Background
Computer vision simulates the human visual function through a computer, enabling the computer to have human-like capabilities of perceiving a real three-dimensional scene from two-dimensional planar images, including understanding and recognizing information in the scene, such as content, motion, and structure. However, techniques based on two-dimensional images suffer from some inherent drawbacks, since planar images lack depth information in three-dimensional space during imaging. Therefore, how to reconstruct three-dimensional information of a scene from a single image or multiple images, i.e. depth estimation, is a very important fundamental subject of current research in the field of computer vision. The depth refers to the distance between a point in a scene and a plane where a camera is located, depth information corresponding to an image can be described by a depth image, and the gray value of each pixel point of the depth image can be used for representing the distance between a certain point in the scene and the camera. With the progress of research, the depth estimation technology is gradually applied to the fields of intelligent robots, intelligent medical treatment, unmanned driving, target detection and tracking, face recognition, 3D video production and the like, and has great social value and economic value.
Depth estimation algorithms can be classified, according to the number of viewpoints of the scene images, into multi-view, binocular and monocular methods. Compared with the former two, a monocular image lacks rich spatial structure information, so monocular depth estimation is the most difficult of the three. However, depth estimation from a monocular image is convenient to use, low in cost and closest to practical application requirements, so it has high research value and is a hotspot in the current depth estimation field.
Most conventional depth estimation methods estimate image depth directly from visual cues. However, traditional methods have strict conditions of use and a relatively large computational burden. In recent years, deep learning has developed rapidly, and image depth estimation methods combined with deep learning have therefore begun to attract the attention of researchers. Monocular depth estimation algorithms based on deep learning can be classified as supervised or unsupervised according to whether real depth labels are used. Supervised methods take a single image as training data, treat depth estimation as a dense regression task, and fit depth values with a convolutional neural network. Their disadvantages are also apparent: they rely on a large amount of labeled data, and obtaining the corresponding depth labels is expensive. Unsupervised methods draw inspiration from traditional motion-based methods, use a continuous image sequence as training data, and infer the three-dimensional structure of the scene from the motion of the camera. However, such methods must assume that the only motion in the scene is that of the camera, i.e. they neglect moving objects such as vehicles and pedestrians. Their prediction accuracy can be greatly affected when many moving objects are present in the scene.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides an algorithm for unsupervised estimation of monocular image depth without depending on labels.
The technical scheme is as follows: an unsupervised monocular depth estimation algorithm based on deep learning comprises the following steps:
Step 1: process a video shot by a monocular camera to obtain an image sequence of length N, taking the intermediate frame of the sequence as the target image I_t and the remaining frames as the source images I_s.
Step 2: input the target image I_t obtained in step 1 into the constructed depth network DepthNet to obtain a depth image D̂_t; input the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension into the constructed camera pose network PoseNet to obtain the camera pose transformation T̂_{t→s}; from the depth image D̂_t and the camera pose transformation T̂_{t→s}, solve for the rigid optical flow F̂_rig caused by the rigid motion of the camera, then reconstruct the image Î_t^rig and calculate the depth smoothing loss L_ds.
Step 3: input the image sequence obtained in step 1 into the constructed optical flow network FlowNet to obtain the full optical flow F̂_full caused by both the camera motion and the motion of objects themselves; based on the full optical flow F̂_full, reconstruct the image Î_t^full and calculate the reconstruction loss L_r^full and the adversarial loss L_adv.
Step 4: compare the rigid optical flow F̂_rig obtained in step 2 with the full optical flow F̂_full obtained in step 3 to obtain a moving-object mask M̂; based on the moving-object mask M̂, calculate the optical flow consistency loss L_fc and the rigid reconstruction loss L_r^rig.
Step 5: construct the loss function L_total from the adversarial loss L_adv, the optical flow consistency loss L_fc, the rigid reconstruction loss L_r^rig, the reconstruction loss L_r^full and the depth smoothing loss L_ds, and iterate until the loss function L_total converges, obtaining the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet.
Step 6: input the images to be estimated into the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet respectively, to obtain unsupervised estimates of the corresponding image depth, camera pose and motion optical flow.
Further, the depth network DepthNet in step 2 is a fully convolutional network comprising an encoder and a decoder connected by cross-layer (skip) connections; the depth image D̂_t is a grayscale image with the same resolution as the input target image I_t.
Further, solving in step 2 for the rigid optical flow F̂_rig caused by the rigid motion of the camera from the depth image D̂_t and the camera pose transformation T̂_{t→s} comprises the following steps:
the projected coordinates p̂_s of a pixel in the source image I_s are calculated according to equation (1):
p̂_s ~ K T̂_{t→s} D̂_t(p_t) K⁻¹ p_t    (1)
where p_t is the homogeneous coordinate of the pixel in the target image I_t and K is the camera intrinsic matrix;
the optical flow at the pixel is calculated according to equation (2):
F̂_rig(p_t) = p̂_s − p_t    (2)
Reconstructing the image Î_t^rig in step 2 comprises: sampling, in the source image I_s, the pixels surrounding the projected coordinates p̂_s and obtaining Î_t^rig(p_t) by bilinear interpolation, thereby reconstructing Î_t^rig.
Further, the depth smoothing loss L_ds in step 2 is calculated according to equation (3):
L_ds = Σ_{p_t} ( |∂_x D̂_t(p_t)| · e^{−|∂_x I_t(p_t)|} + |∂_y D̂_t(p_t)| · e^{−|∂_y I_t(p_t)|} )    (3)
where ∂_x and ∂_y denote the longitudinal and transverse gradients, respectively, and p_t is the homogeneous coordinate of a pixel in the target image I_t.
Further, the optical flow network FlowNet in step 3 is a generative adversarial network comprising a generator and a discriminator, wherein the generator accepts the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension as input and outputs the full optical flow F̂_full; the discriminator receives the target image I_t and the reconstructed image Î_t^full as input, regards the target image I_t as a real image and the reconstructed image Î_t^full as a generated image, and outputs a probability value representing the probability that its input is a real image.
Further, the structure of the generator is consistent with that of the depth network DepthNet.
Further, in step 3 the reconstruction loss L_r^full is calculated according to equation (4):
L_r^full = Σ_{p_t} V̂_full(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^full(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^full(p_t)| ]    (4)
where SSIM denotes the structural similarity index, w is a weighting parameter, and V̂_full is the effective mask corresponding to the full optical flow F̂_full.
Further, in step 3 the adversarial loss L_adv is calculated according to equation (5):
L_adv = E_{I∼X}[ log D(I) ] + E_{Î∼X̂}[ log(1 − D(Î)) ]    (5)
where G and D denote the generator and the discriminator, respectively, I and X denote a real image and the data distribution of real images, and Î and X̂ denote a generated image and the data distribution of generated images.
Further, in step 4 the moving-object mask M̂ is obtained according to equation (6):
M̂(p_t) = 1( |F̂_full(p_t) − F̂_rig(p_t)| > α )    (6)
where 1(·) is the indicator function and α is a threshold;
the optical flow consistency loss L_fc is obtained according to equation (7):
L_fc = Σ_{p_t} (1 − M̂(p_t)) · |F̂_full(p_t) − F̂_rig(p_t)|    (7)
the rigid reconstruction loss L_r^rig is obtained according to equation (8):
L_r^rig = Σ_{p_t} (1 − M̂(p_t)) · V̂_rig(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^rig(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^rig(p_t)| ]    (8)
Further, the loss function L_total in step 5 is expressed as:
L_total = λ_adv·L_adv + λ_ds·L_ds + λ_r·L_r^rig + λ_f·L_r^full + λ_fc·L_fc    (9)
where λ_adv, λ_ds, λ_r, λ_f and λ_fc are the weights of the corresponding loss terms.
Advantageous effects: compared with the prior art, the invention has the following advantages:
1. the method detects moving objects in the scene by comparing the difference between the optical flow generated by the camera motion and the full optical flow, and thereby improves the depth estimation performance of the algorithm;
2. by detecting dynamic objects in the scene, the method effectively enhances the accuracy and robustness of the algorithm;
3. the method uses videos shot by a monocular camera as training data and does not require expensive depth labels; by modeling moving objects, it greatly reduces their influence on the unsupervised method, so that the algorithm achieves good results in the monocular depth estimation, camera pose prediction and optical flow estimation tasks;
4. the adversarial loss introduced by the generative adversarial network structure significantly improves the accuracy of the optical flow prediction.
Drawings
FIG. 1 is a schematic view of a model structure;
FIG. 2 is the generative adversarial network structure of the optical flow network FlowNet;
FIG. 3 is an example of a target image I_t and the corresponding depth image D̂_t;
FIG. 4 is an example showing, from top to bottom, the target image I_t, the source image I_s and the rigid optical flow F̂_rig;
FIG. 5 is an example showing, from top to bottom, the target image I_t, the reconstructed image Î_t^rig and the corresponding effective mask V̂_rig;
FIG. 6 is an example showing, from top to bottom, the target image I_t, the source image I_s and the full optical flow F̂_full;
FIG. 7 is an example showing, from top to bottom, the rigid optical flow F̂_rig, the full optical flow F̂_full and the moving-object mask M̂.
Detailed Description
The technical solution of the present invention will be further explained with reference to the accompanying drawings and examples.
Referring to FIG. 1, the algorithm model of the invention is composed of a depth network DepthNet, a camera pose network PoseNet and an optical flow network FlowNet. The depth network DepthNet outputs a depth image whose resolution equals that of the monocular input image, with the depth value represented by gray scale; the camera pose network PoseNet estimates, between adjacent frames, the pose transformation of the camera in three-dimensional space; and the optical flow network FlowNet estimates the full optical flow between adjacent frames. FIG. 2 shows the generative adversarial network structure of the optical flow network FlowNet.
Based on this model, the unsupervised monocular depth estimation algorithm based on deep learning is designed: it detects moving objects in the scene by comparing the difference between the optical flow generated by the camera motion and the full optical flow, and thereby improves the depth estimation performance of the algorithm. For video from a moving monocular camera and without any training labels, the method achieves unsupervised estimation of the depth image, the camera pose and the motion optical flow, with excellent prediction accuracy on all three tasks.
The method specifically comprises the following steps:
Step 1: videos shot by a monocular camera are used as the training set, and the camera intrinsic matrix K is assumed known. After processing, a series of image sequences of length N = 3 are obtained as the data finally input to the model, where the intermediate frame is taken as the target image I_t and the remaining frames are used as the source images I_s.
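For concreteness, the slicing of a video into such training samples can be sketched as follows (a minimal illustration; the helper name and the list-based interface are assumptions, not taken from the patent):

```python
def make_training_triples(frames):
    """Yield (target, sources) samples from a list of consecutive video frames (N = 3)."""
    for i in range(1, len(frames) - 1):
        I_t = frames[i]                       # intermediate frame: target image
        I_s = [frames[i - 1], frames[i + 1]]  # remaining frames: source images
        yield I_t, I_s
```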
Step 2: construct the depth network DepthNet and the camera pose network PoseNet and, with the input from step 1, output the depth image D̂_t and the camera pose transformation T̂_{t→s}, respectively; solve for the optical flow F̂_rig caused by the rigid motion of the camera and the corresponding effective mask V̂_rig; then reconstruct the image Î_t^rig and calculate the depth smoothing loss L_ds.
The structure of the depth network DepthNet is described as follows: DepthNet is a fully convolutional network with an encoder-decoder architecture and cross-layer (skip) connections between the encoder and decoder. The encoder consists of 7 pairs of convolutional layers with convolution strides of 2 and 1, respectively, and with 32, 64, 128, 256, 512 and 512 convolution kernels; the decoder consists of a series of successive deconvolution and convolution layers and finally, as shown in FIG. 3, outputs a grayscale image D̂_t with the same resolution as the input target image I_t, where the gray value represents the depth at that pixel. The convolution kernel size of all layers is 3, except for the first 2 pairs of convolutional layers of the encoder, whose kernel sizes are 7 and 5. All layers except the final output layer use the LeakyReLU activation function and batch normalization.
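A simplified PyTorch sketch of an encoder-decoder of this kind is given below. It only loosely follows the layer counts listed above (the channel widths, the skip wiring, the sigmoid output scaling and the requirement that the input resolution be divisible by 16 are simplifications of this sketch), so it illustrates the architectural style rather than the exact DepthNet:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k, stride):
    # convolution + batch normalization + LeakyReLU, as described for DepthNet
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class SimpleDepthNet(nn.Module):
    """Illustrative encoder-decoder; channel widths and depth are assumptions."""
    def __init__(self):
        super().__init__()
        chs = [32, 64, 128, 256]              # shortened version of the encoder widths
        self.encoder = nn.ModuleList()
        c_prev = 3
        for i, c in enumerate(chs):
            k = 7 if i == 0 else (5 if i == 1 else 3)
            self.encoder.append(nn.Sequential(conv_block(c_prev, c, k, 2),
                                              conv_block(c, c, 3, 1)))
            c_prev = c
        self.decoder = nn.ModuleList()
        for c_skip, c in zip([128, 64, 32, 0], [128, 64, 32, 16]):
            self.decoder.append(conv_block(c_prev + c_skip, c, 3, 1))
            c_prev = c
        self.out = nn.Conv2d(c_prev, 1, 3, 1, 1)   # single-channel depth map

    def forward(self, x):                          # x: (B, 3, H, W), H and W divisible by 16
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        for i, dec in enumerate(self.decoder):
            x = nn.functional.interpolate(x, scale_factor=2, mode='nearest')
            if i + 2 <= len(skips):
                x = torch.cat([x, skips[-(i + 2)]], dim=1)   # cross-layer (skip) connection
            x = dec(x)
        return torch.sigmoid(self.out(x))          # depth expressed as a grayscale image
```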
The structure of the camera pose network PoseNet is described as follows: PoseNet consists of 7 convolutional layers with 16, 32, 64, 128, 256 and 256 convolution kernels and a convolution stride of 2; except for the first 2 layers, whose kernel sizes are set to 7 and 5, all convolution kernel sizes are 3. PoseNet receives the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension as input and finally outputs the camera pose transformation T̂_{t→s} through a 6-channel 1 × 1 convolutional layer, representing the rigid motion of the camera in space from the target image I_t to the source image I_s, comprising 3 Euler angles and 3 translation components.
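A matching sketch of the pose regressor, reusing the conv_block helper above, is shown below for a single target-source pair (the seventh channel width and the global-mean readout are assumptions):

```python
class SimplePoseNet(nn.Module):
    """Illustrative pose regressor: stride-2 convolutions followed by a 6-channel 1x1 layer."""
    def __init__(self):
        super().__init__()
        chs = [16, 32, 64, 128, 256, 256, 256]   # the 7th width is assumed equal to the 6th
        layers, c_prev = [], 6                   # target and source RGB images concatenated by channel
        for i, c in enumerate(chs):
            k = 7 if i == 0 else (5 if i == 1 else 3)
            layers.append(conv_block(c_prev, c, k, 2))
            c_prev = c
        self.features = nn.Sequential(*layers)
        self.pose = nn.Conv2d(c_prev, 6, 1)      # 3 Euler angles + 3 translations

    def forward(self, x):
        return self.pose(self.features(x)).mean(dim=[2, 3])  # (B, 6) camera pose transformation
```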
The camera-motion optical flow F̂_rig and the reconstructed image Î_t^rig are obtained as follows: let p_t be the homogeneous coordinate of a pixel in the target image I_t. Combining the depth image D̂_t and the camera pose transformation T̂_{t→s}, the projected coordinates p̂_s of this pixel in the source image I_s are obtained:
p̂_s ~ K T̂_{t→s} D̂_t(p_t) K⁻¹ p_t    (1)
The optical flow at the pixel is then:
F̂_rig(p_t) = p̂_s − p_t    (2)
The optical flow represents the change of position of the same pixel between the target image and the source image. FIG. 4 shows, from top to bottom, an example of the target image I_t, the source image I_s and the rigid optical flow F̂_rig.
Since p̂_s may fall outside the image boundary, a corresponding effective mask V̂_rig needs to be established. When reconstructing the image Î_t^rig, because p̂_s takes continuous values, Î_t^rig(p_t) is computed by bilinear interpolation of the 4 pixels surrounding p̂_s sampled on the source image, thereby reconstructing Î_t^rig. FIG. 5 shows, from top to bottom, an example of the target image I_t, the reconstructed image Î_t^rig and the corresponding effective mask V̂_rig.
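The sketch below shows how equations (1)-(2), the bilinear reconstruction and the effective mask can be implemented with grid sampling; it assumes T̂_{t→s} is already given as a 3 × 4 rigid-motion matrix and K as a 3 × 3 intrinsic matrix, and the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def rigid_flow_and_reconstruction(I_s, D_t, T_t2s, K):
    """I_s: (B,3,H,W) source image, D_t: (B,1,H,W) depth, T_t2s: (B,3,4) pose, K: (B,3,3) intrinsics."""
    B, _, H, W = D_t.shape
    device = D_t.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing='ij')
    p_t = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().reshape(1, 3, -1).expand(B, 3, -1)
    # equation (1): back-project with the depth, move by the camera pose, re-project
    cam = D_t.reshape(B, 1, -1) * (torch.inverse(K) @ p_t)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    proj = K @ (T_t2s @ cam_h)
    p_s = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # equation (2): the rigid optical flow is the displacement of the projected pixel
    flow_rig = (p_s - p_t[:, :2]).reshape(B, 2, H, W)
    # bilinear reconstruction of the target image by sampling the source image
    grid = torch.stack([2 * p_s[:, 0] / (W - 1) - 1,
                        2 * p_s[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    I_t_rec = F.grid_sample(I_s, grid, mode='bilinear', align_corners=True)
    # effective mask: projections that stay inside the image boundary
    V_rig = (grid.abs() <= 1).all(dim=-1, keepdim=True).permute(0, 3, 1, 2).float()
    return flow_rig, I_t_rec, V_rig
```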
The depth smoothing loss L_ds is calculated as:
L_ds = Σ_{p_t} ( |∂_x D̂_t(p_t)| · e^{−|∂_x I_t(p_t)|} + |∂_y D̂_t(p_t)| · e^{−|∂_y I_t(p_t)|} )    (3)
where ∂_x and ∂_y denote the longitudinal and transverse gradients, respectively. The depth smoothing loss L_ds allows large depth changes at object contours in the depth image while keeping the depth as smooth as possible elsewhere.
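One common edge-aware implementation of this penalty is sketched below (continuing the imports above); the exponential image-gradient weighting is an assumption about the exact form of equation (3):

```python
def depth_smoothness_loss(depth, image):
    """Penalise depth gradients, down-weighted where the image itself has strong edges."""
    d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    i_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```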
Step 3: construct the optical flow network FlowNet and, with the input from step 1, output the full optical flow F̂_full caused by both the camera motion and the motion of objects themselves; solve for the corresponding effective mask V̂_full; then reconstruct the image Î_t^full and calculate the reconstruction loss L_r^full and the adversarial loss L_adv.
The structure of the optical flow network FlowNet is described as follows: FlowNet takes the form of the generative adversarial network shown in FIG. 2 and is composed of a generator and a discriminator. The generator accepts the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension as input and outputs the full optical flow F̂_full shown in FIG. 6; this optical flow is caused by both the camera motion and the motion of objects themselves. The structure of the generator is identical to that of the depth network DepthNet except that the number of channels of the final output layer is 2. Combining step 2, the image Î_t^full can be reconstructed from the full optical flow F̂_full and the corresponding effective mask V̂_full can be constructed. The reconstruction loss L_r^full is calculated as follows:
L_r^full = Σ_{p_t} V̂_full(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^full(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^full(p_t)| ]    (4)
where SSIM denotes the structural similarity index and the parameter w is set to 0.85. In theory, if the depth estimation and the camera pose estimation were error-free, Î_t^full and I_t would be identical inside the effective mask V̂_full and the reconstruction loss would be zero. The discriminator receives I_t and Î_t^full as input, treating I_t as a real image and Î_t^full as a generated image, and outputs a probability value representing the probability that the corresponding input image is a real image. The structure of the discriminator is similar to PoseNet: it consists of 7 convolutional layers whose output is finally passed through global average pooling and a sigmoid activation function.
The adversarial loss L_adv is formulated as:
L_adv = E_{I∼X}[ log D(I) ] + E_{Î∼X̂}[ log(1 − D(Î)) ]    (5)
where G and D denote the generator and the discriminator parts, respectively, I and X are a real image and the data distribution of real images, and Î and X̂ are a generated image and the data distribution of generated images.
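A hedged sketch of the SSIM-plus-L1 reconstruction loss of equation (4) and the adversarial objective of equation (5) follows (continuing the imports above); the single-scale 3 × 3 SSIM, the mask-normalised averaging and the epsilon stabiliser are assumptions of this sketch:

```python
def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified per-pixel SSIM map computed with 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sxy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sx + sy + C2)
    return (num / den).clamp(0, 1)

def reconstruction_loss(I_t, I_rec, V, w=0.85):
    """Equation (4): SSIM + L1 photometric error accumulated inside the effective mask V."""
    l1 = (I_t - I_rec).abs().mean(1, keepdim=True)
    dssim = (1 - ssim(I_t, I_rec).mean(1, keepdim=True)) / 2
    return ((w * dssim + (1 - w) * l1) * V).sum() / V.sum().clamp(min=1.0)

def adversarial_loss(D_net, I_real, I_fake, eps=1e-8):
    """Equation (5): standard GAN objective; D_net outputs the probability of its input being real."""
    return torch.log(D_net(I_real) + eps).mean() + torch.log(1 - D_net(I_fake) + eps).mean()
```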
Step 4: compare the difference between the rigid optical flow F̂_rig and the full optical flow F̂_full to detect moving objects, output the moving-object mask M̂, and compute the optical flow consistency loss L_fc; at the same time, combining step 2, compute the rigid reconstruction loss L_r^rig.
The moving-object mask M̂ is constructed as:
M̂(p_t) = 1( |F̂_full(p_t) − F̂_rig(p_t)| > α )    (6)
where 1(·) is the indicator function and the threshold α is set to 7. In theory, if F̂_rig and F̂_full are estimated without error, the difference between the two optical flows should be large at moving objects, while the two optical flow values should be exactly equal over the static background. FIG. 7 shows, from top to bottom, an example of the rigid optical flow F̂_rig, the full optical flow F̂_full and the moving-object mask M̂.
The optical flow consistency loss L_fc is:
L_fc = Σ_{p_t} (1 − M̂(p_t)) · |F̂_full(p_t) − F̂_rig(p_t)|    (7)
This loss ensures that F̂_full and F̂_rig are as equal as possible over the static background.
The rigid reconstruction loss L_r^rig is:
L_r^rig = Σ_{p_t} (1 − M̂(p_t)) · V̂_rig(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^rig(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^rig(p_t)| ]    (8)
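The moving-object mask of equation (6) and the losses of equations (7)-(8) can be sketched as follows; the flow-difference norm, the averaging over the static background and the reuse of reconstruction_loss from the previous sketch are assumptions about the exact formulation:

```python
def moving_object_mask(flow_full, flow_rig, alpha=7.0):
    """Equation (6): pixels where the full and rigid optical flows disagree by more than alpha."""
    diff = (flow_full - flow_rig).norm(dim=1, keepdim=True)
    return (diff > alpha).float()

def flow_consistency_loss(flow_full, flow_rig, M):
    """Equation (7): penalise flow disagreement only over the static background (M == 0)."""
    diff = (flow_full - flow_rig).abs().sum(1, keepdim=True)
    static = 1.0 - M
    return (diff * static).sum() / static.sum().clamp(min=1.0)

def rigid_reconstruction_loss(I_t, I_rec_rig, V_rig, M, w=0.85):
    """Equation (8): photometric loss of the rigid reconstruction, restricted to the static background."""
    return reconstruction_loss(I_t, I_rec_rig, V_rig * (1.0 - M), w)
```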
Step 5: construct the loss function L_total from the adversarial loss L_adv, the optical flow consistency loss L_fc, the rigid reconstruction loss L_r^rig, the reconstruction loss L_r^full and the depth smoothing loss L_ds, and minimize L_total with the Adam optimizer until convergence, obtaining the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet.
The final loss function L_total is formulated as:
L_total = λ_adv·L_adv + λ_ds·L_ds + λ_r·L_r^rig + λ_f·L_r^full + λ_fc·L_fc    (9)
where λ_adv, λ_ds, λ_r, λ_f and λ_fc are the weights of the respective loss terms, set to 0.005, 1, 10, 1 and 0.01, respectively. The parameters β_1 and β_2 of the Adam optimizer are 0.9 and 0.999, respectively. During model training, the initial learning rate is 0.0002 and the batch size is set to 8.
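Putting the pieces together, the total objective of equation (9) and the optimizer setup can be sketched as follows; the mapping of λ_r to the rigid reconstruction loss and λ_f to the full-flow reconstruction loss is an assumption, and the network variable names are illustrative:

```python
lambdas = {'adv': 0.005, 'ds': 1.0, 'r': 10.0, 'f': 1.0, 'fc': 0.01}

# depth_net, pose_net and flow_generator are instances of the three trainable networks
params = (list(depth_net.parameters()) + list(pose_net.parameters())
          + list(flow_generator.parameters()))
optimizer = torch.optim.Adam(params, lr=2e-4, betas=(0.9, 0.999))   # initial learning rate 0.0002

def total_loss(L_adv, L_ds, L_r_rig, L_r_full, L_fc):
    """Equation (9): weighted sum of the five loss terms."""
    return (lambdas['adv'] * L_adv + lambdas['ds'] * L_ds
            + lambdas['r'] * L_r_rig + lambdas['f'] * L_r_full + lambdas['fc'] * L_fc)
```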
Step 6: input the images to be estimated into the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet respectively, to obtain unsupervised estimates of the corresponding image depth, camera pose and motion optical flow.

Claims (10)

1. An unsupervised monocular depth estimation algorithm based on deep learning, characterized in that the method comprises the following steps:
Step 1: process a video shot by a monocular camera to obtain an image sequence of length N, taking the intermediate frame of the sequence as the target image I_t and the remaining frames as the source images I_s;
Step 2: input the target image I_t obtained in step 1 into the constructed depth network DepthNet to obtain a depth image D̂_t; input the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension into the constructed camera pose network PoseNet to obtain the camera pose transformation T̂_{t→s}; from the depth image D̂_t and the camera pose transformation T̂_{t→s}, solve for the rigid optical flow F̂_rig caused by the rigid motion of the camera, then reconstruct the image Î_t^rig and calculate the depth smoothing loss L_ds;
Step 3: input the image sequence obtained in step 1 into the constructed optical flow network FlowNet to obtain the full optical flow F̂_full caused by both the camera motion and the motion of objects themselves; based on the full optical flow F̂_full, reconstruct the image Î_t^full and calculate the reconstruction loss L_r^full and the adversarial loss L_adv;
Step 4: compare the rigid optical flow F̂_rig obtained in step 2 with the full optical flow F̂_full obtained in step 3 to obtain a moving-object mask M̂; based on the moving-object mask M̂, calculate the optical flow consistency loss L_fc and the rigid reconstruction loss L_r^rig;
Step 5: construct the loss function L_total from the adversarial loss L_adv, the optical flow consistency loss L_fc, the rigid reconstruction loss L_r^rig, the reconstruction loss L_r^full and the depth smoothing loss L_ds, and iterate until the loss function L_total converges, obtaining the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet;
Step 6: input the images to be estimated into the trained depth network DepthNet, camera pose network PoseNet and optical flow network FlowNet respectively, to obtain unsupervised estimates of the corresponding image depth, camera pose and motion optical flow.
2. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: the depth network DepthNet in step 2 is a fully convolutional network comprising an encoder and a decoder connected by cross-layer (skip) connections; the depth image D̂_t is a grayscale image with the same resolution as the input target image I_t.
3. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: solving in step 2 for the rigid optical flow F̂_rig caused by the rigid motion of the camera from the depth image D̂_t and the camera pose transformation T̂_{t→s} comprises the following steps:
the projected coordinates p̂_s of a pixel in the source image I_s are calculated according to equation (1):
p̂_s ~ K T̂_{t→s} D̂_t(p_t) K⁻¹ p_t    (1)
where p_t is the homogeneous coordinate of the pixel in the target image I_t and K is the camera intrinsic matrix;
the optical flow at the pixel is calculated according to equation (2):
F̂_rig(p_t) = p̂_s − p_t    (2)
reconstructing the image Î_t^rig in step 2 comprises: sampling, in the source image I_s, the pixels surrounding the projected coordinates p̂_s and obtaining Î_t^rig(p_t) by bilinear interpolation, thereby reconstructing Î_t^rig.
4. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1 or 3, wherein: the depth smoothing loss L_ds in step 2 is calculated according to equation (3):
L_ds = Σ_{p_t} ( |∂_x D̂_t(p_t)| · e^{−|∂_x I_t(p_t)|} + |∂_y D̂_t(p_t)| · e^{−|∂_y I_t(p_t)|} )    (3)
where ∂_x and ∂_y denote the longitudinal and transverse gradients, respectively, and p_t is the homogeneous coordinate of a pixel in the target image I_t.
5. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: the optical flow network FlowNet in step 3 is a generative adversarial network comprising a generator and a discriminator, wherein the generator receives the tensor formed by concatenating the target image I_t and the source image I_s along the channel dimension as input and outputs the full optical flow F̂_full; the discriminator receives the target image I_t and the reconstructed image Î_t^full as input, regards the target image I_t as a real image and the reconstructed image Î_t^full as a generated image, and outputs a probability value representing the probability that its input is a real image.
6. The unsupervised monocular depth estimation algorithm based on deep learning of claim 5, wherein: the structure of the generator is consistent with that of the depth network DepthNet.
7. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: in step 3, the reconstruction loss L_r^full is calculated according to equation (4):
L_r^full = Σ_{p_t} V̂_full(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^full(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^full(p_t)| ]    (4)
where SSIM denotes the structural similarity index, w is a weighting parameter, and V̂_full is the effective mask corresponding to the full optical flow F̂_full.
8. The unsupervised monocular depth estimation algorithm based on deep learning of claim 5, wherein: in step 3, the adversarial loss L_adv is calculated according to equation (5):
L_adv = E_{I∼X}[ log D(I) ] + E_{Î∼X̂}[ log(1 − D(Î)) ]    (5)
where G and D denote the generator and the discriminator, respectively, I and X denote a real image and the data distribution of real images, and Î and X̂ denote a generated image and the data distribution of generated images.
9. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: in step 4, the moving-object mask M̂ is obtained according to equation (6):
M̂(p_t) = 1( |F̂_full(p_t) − F̂_rig(p_t)| > α )    (6)
where 1(·) is the indicator function and α is a threshold;
the optical flow consistency loss L_fc is obtained according to equation (7):
L_fc = Σ_{p_t} (1 − M̂(p_t)) · |F̂_full(p_t) − F̂_rig(p_t)|    (7)
the rigid reconstruction loss L_r^rig is obtained according to equation (8):
L_r^rig = Σ_{p_t} (1 − M̂(p_t)) · V̂_rig(p_t) · [ w · (1 − SSIM(I_t(p_t), Î_t^rig(p_t)))/2 + (1 − w) · |I_t(p_t) − Î_t^rig(p_t)| ]    (8)
10. The unsupervised monocular depth estimation algorithm based on deep learning of claim 1, wherein: the loss function L_total in step 5 is expressed as:
L_total = λ_adv·L_adv + λ_ds·L_ds + λ_r·L_r^rig + λ_f·L_r^full + λ_fc·L_fc    (9)
where λ_adv, λ_ds, λ_r, λ_f and λ_fc are the weights of the corresponding loss terms.
CN202010571133.XA 2020-06-22 2020-06-22 Unsupervised monocular depth estimation algorithm based on deep learning Pending CN111783582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010571133.XA CN111783582A (en) 2020-06-22 2020-06-22 Unsupervised monocular depth estimation algorithm based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010571133.XA CN111783582A (en) 2020-06-22 2020-06-22 Unsupervised monocular depth estimation algorithm based on deep learning

Publications (1)

Publication Number Publication Date
CN111783582A true CN111783582A (en) 2020-10-16

Family

ID=72756281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010571133.XA Pending CN111783582A (en) 2020-06-22 2020-06-22 Unsupervised monocular depth estimation algorithm based on deep learning

Country Status (1)

Country Link
CN (1) CN111783582A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112396657A (en) * 2020-11-25 2021-02-23 河北工程大学 Neural network-based depth pose estimation method and device and terminal equipment
CN113139990A (en) * 2021-05-08 2021-07-20 电子科技大学 Depth grid stream robust image alignment method based on content perception
CN113160294A (en) * 2021-03-31 2021-07-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN113379821A (en) * 2021-06-23 2021-09-10 武汉大学 Stable monocular video depth estimation method based on deep learning
CN113724155A (en) * 2021-08-05 2021-11-30 中山大学 Self-boosting learning method, device and equipment for self-supervision monocular depth estimation
CN114066987A (en) * 2022-01-12 2022-02-18 深圳佑驾创新科技有限公司 Camera pose estimation method, device, equipment and storage medium
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN116164770A (en) * 2023-04-23 2023-05-26 禾多科技(北京)有限公司 Path planning method, path planning device, electronic equipment and computer readable medium
WO2023178951A1 (en) * 2022-03-25 2023-09-28 上海商汤智能科技有限公司 Image analysis method and apparatus, model training method and apparatus, and device, medium and program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522828A (en) * 2018-11-01 2019-03-26 上海科技大学 A kind of accident detection method and system, storage medium and terminal
CN109977847A (en) * 2019-03-22 2019-07-05 北京市商汤科技开发有限公司 Image generating method and device, electronic equipment and storage medium
CN110705376A (en) * 2019-09-11 2020-01-17 南京邮电大学 Abnormal behavior detection method based on generative countermeasure network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522828A (en) * 2018-11-01 2019-03-26 上海科技大学 A kind of accident detection method and system, storage medium and terminal
CN109977847A (en) * 2019-03-22 2019-07-05 北京市商汤科技开发有限公司 Image generating method and device, electronic equipment and storage medium
CN110705376A (en) * 2019-09-11 2020-01-17 南京邮电大学 Abnormal behavior detection method based on generative countermeasure network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAO HAOSHENG,TENG WANG: "Unsupervised Learning of Monocular Depth from Videos", 《2019 CHINESE AUTOMATION CONGRESS (CAC)》 *
WEI-SHENG LAI等: "Semi-Supervised Learning for Optical Flow with Generative Adversarial Networks", 《《31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2017)》》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112396657A (en) * 2020-11-25 2021-02-23 河北工程大学 Neural network-based depth pose estimation method and device and terminal equipment
CN113160294A (en) * 2021-03-31 2021-07-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN113139990A (en) * 2021-05-08 2021-07-20 电子科技大学 Depth grid stream robust image alignment method based on content perception
CN113379821A (en) * 2021-06-23 2021-09-10 武汉大学 Stable monocular video depth estimation method based on deep learning
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN113724155A (en) * 2021-08-05 2021-11-30 中山大学 Self-boosting learning method, device and equipment for self-supervision monocular depth estimation
CN113724155B (en) * 2021-08-05 2023-09-05 中山大学 Self-lifting learning method, device and equipment for self-supervision monocular depth estimation
CN114066987A (en) * 2022-01-12 2022-02-18 深圳佑驾创新科技有限公司 Camera pose estimation method, device, equipment and storage medium
WO2023178951A1 (en) * 2022-03-25 2023-09-28 上海商汤智能科技有限公司 Image analysis method and apparatus, model training method and apparatus, and device, medium and program
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN114998411B (en) * 2022-04-29 2024-01-09 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss
CN116164770A (en) * 2023-04-23 2023-05-26 禾多科技(北京)有限公司 Path planning method, path planning device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
CN111783582A (en) Unsupervised monocular depth estimation algorithm based on deep learning
Zhai et al. Optical flow and scene flow estimation: A survey
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN109377530B (en) Binocular depth estimation method based on depth neural network
Lv et al. Learning rigidity in dynamic scenes with a moving camera for 3d motion field estimation
Yan et al. Ddrnet: Depth map denoising and refinement for consumer depth cameras using cascaded cnns
JP7177062B2 (en) Depth Prediction from Image Data Using Statistical Model
WO2019174377A1 (en) Monocular camera-based three-dimensional scene dense reconstruction method
CN111105432A (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN115187638B (en) Unsupervised monocular depth estimation method based on optical flow mask
Wang et al. Depth estimation of video sequences with perceptual losses
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
Song et al. Depth estimation from a single image using guided deep network
Liu et al. A survey on deep learning methods for scene flow estimation
Feng et al. Deep depth estimation on 360 images with a double quaternion loss
JP2024510230A (en) Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture
Wang et al. Recurrent neural network for learning densedepth and ego-motion from video
Durasov et al. Double refinement network for efficient monocular depth estimation
Basak et al. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image
Kashyap et al. Sparse representations for object-and ego-motion estimations in dynamic scenes
CN113436254A (en) Cascade decoupling pose estimation method
Hou et al. Joint learning of image deblurring and depth estimation through adversarial multi-task network
Rabby et al. Beyondpixels: A comprehensive review of the evolution of neural radiance fields
Khan et al. Towards monocular neural facial depth estimation: Past, present, and future

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20201016