CN113570658A - Monocular video depth estimation method based on depth convolutional network - Google Patents

Monocular video depth estimation method based on depth convolutional network

Info

Publication number
CN113570658A
Authority
CN
China
Prior art keywords
network
depth
error
sub
estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110648477.0A
Other languages
Chinese (zh)
Inventor
陈渤
曾泽群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110648477.0A priority Critical patent/CN113570658A/en
Publication of CN113570658A publication Critical patent/CN113570658A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 7/00 Image analysis › G06T 7/70 Determining position or orientation of objects or cameras
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/04 Architecture, e.g. interconnection topology › G06N 3/045 Combinations of networks
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/08 Learning methods › G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 5/00 Image enhancement or restoration › G06T 5/70 Denoising; Smoothing
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/10 Image acquisition modality › G06T 2207/10016 Video; Image sequence
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/20 Special algorithmic details › G06T 2207/20212 Image combination › G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of video processing and discloses a monocular video depth estimation method based on a deep convolutional network, which comprises the following steps: acquiring training data and a monocular video to be tested; constructing a depth estimation network model comprising a depth prediction sub-network and a camera pose estimation sub-network, wherein the decoder comprises an up-sampling module and a dense atrous pyramid module; jointly training the depth prediction sub-network and the camera pose estimation sub-network with the training data, and iteratively updating the network parameters of the two sub-networks with a loss function; and estimating a depth map of the monocular video to be tested. The invention exploits more spatial information of the original image and effectively improves the accuracy of depth prediction.

Description

Monocular video depth estimation method based on a deep convolutional network
Technical Field
The invention belongs to the technical field of video processing and relates to a monocular video depth estimation method based on a deep convolutional network, which can be used for three-dimensional reconstruction, robot navigation and automatic driving.
Background
Depth estimation is indispensable in many important tasks, such as three-dimensional reconstruction, automatic driving and robot navigation. The most common depth estimation algorithm at present is binocular depth estimation, which imitates human vision and estimates depth from the parallax between pictures of different viewing angles taken by a stereo camera or multiple cameras. However, binocular depth estimation suffers from many problems, such as high computational complexity, the difficulty of acquiring binocular pictures and poor matching in low-texture regions. Single-view pictures are generally easier to acquire than multi-view pictures. Monocular depth estimation obtains depth from pictures or video shot by a single camera and can greatly reduce both the cost and the difficulty of data acquisition.
In addition, in the depth estimation problem the cost of acquiring ground-truth depth is very high: images are usually labelled by acquiring depth information with a light/depth sensor (indoors) or a lidar (outdoors). Unsupervised depth estimation methods based on video sequences treat depth prediction as an intermediate step of synthesizing images between adjacent frames, so no ground-truth depth is required for training.
A paper "Unsupervised Learning of Depth and Ego-Motion from Video" (The IEEE Conference on Computer Vision and Pattern Recognition, 2017) published by zhou.t.h, brown.m, snavely.n, lowe.d. discloses an Unsupervised Video Depth estimation algorithm based on Depth Learning. The algorithm does not need a depth true value, predicts the depth based on the multi-angle matching relation between video sequences, provides geometric consistency constraint after considering the problem of the inconsistency of the output scales of previous work, provides a self-discovery mask module on the basis, solves the problem of the inconsistency of the scales between frames of an output depth image, and has higher precision on depth prediction.
However, the method still has the following disadvantages: the network it uses does not fully exploit multi-scale feature fusion to improve the accuracy of depth prediction, and the feature reuse of its backbone network is limited, so image features cannot be extracted sufficiently.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a monocular video depth estimation method and system based on a deep convolutional network, which improve the accuracy of the resulting depth map by exploiting a deep convolutional network structure.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme.
The monocular video depth estimation method based on a deep convolutional network comprises the following steps:
step 1, acquiring training data and a monocular video to be tested;
wherein the training data comprise an RGB optical video sequence I = {I_t | 0 ≤ t ≤ T, t ∈ Z} and a corresponding ground-truth depth image sequence D = {D_t | 0 ≤ t ≤ T, t ∈ Z}, where Z denotes the set of time indices, I_t denotes the RGB image at time t and D_t denotes the ground-truth depth image at time t;
step 2, a depth estimation network model is built, wherein the depth estimation network model comprises a depth prediction sub-network and a camera pose estimation sub-network; the depth prediction sub-network is a self-encoding network comprising an encoder and a decoder, the encoder is a densely connected deep convolutional network, and the decoder comprises an up-sampling module and a dense atrous pyramid module (DenseASPP); the camera pose estimation sub-network is a deep convolutional neural network;
step 3, performing joint training on the depth prediction sub-network and the camera pose estimation sub-network by using training data, and performing iterative updating on network parameters of the two sub-networks by using a loss function to obtain a trained depth prediction sub-network;
wherein the loss function comprises an image reconstruction error L_p, a scale consistency error L_GC and a smoothing term error L_s;
Step 4, inputting the monocular video to be tested into the trained depth prediction sub-network, and outputting a normalized depth prediction image; and calibrating the output normalized depth map according to the actual physical scale to obtain a final predicted depth map.
Compared with the prior art, the invention has the beneficial effects that:
because the constructed depth prediction sub-network has the densely connected deep structure and the multi-scale pyramid feature fusion module, more image information can be extracted, and the defects that the depth prediction is carried out only by using skip level connection and utilizing the multi-scale information and the feature extraction network cannot carry out feature reuse in the prior art are overcome, so that more original image space information is utilized, and the precision of the depth prediction is effectively improved.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram of the deep convolutional network architecture of the present invention;
FIG. 3 is an RGB image of adjacent frames input in an embodiment of the present invention;
FIG. 4 is an output depth map of adjacent frame images obtained using the present invention;
fig. 5 is a schematic diagram of the image reconstruction process of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.
Referring to fig. 1, the monocular video depth estimation method based on the depth convolutional network provided by the present invention includes the following steps:
step 1, acquiring training data and a monocular video to be tested;
wherein the training data comprise an RGB optical video sequence I = {I_t | 0 ≤ t ≤ T, t ∈ Z} and a corresponding ground-truth depth image sequence D = {D_t | 0 ≤ t ≤ T, t ∈ Z}, where Z denotes the set of time indices, I_t denotes the RGB image at time t and D_t denotes the ground-truth depth image at time t;
in the embodiment, the RGB image sequence and the 3D laser radar point cloud data in the KITTI data set are randomly divided into a training set and a testing set. The samples in the test set correspond to the monocular video to be tested.
Random sampling from the training set yields the RGB images I_t and I_{t-1} of two adjacent frames at times t and t-1, and the corresponding ground-truth depth maps D_t and D_{t-1} at times t and t-1 are then recovered from the 3D lidar point-cloud data.
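For reference, a minimal sketch of how a ground-truth depth map can be recovered from lidar returns by projecting the point cloud through the camera model is given below; the function name, the KITTI-style calibration inputs T_cam_lidar and K, and the nearest-point rule are illustrative assumptions rather than details specified by the patent:

```python
import numpy as np

def lidar_to_depth_map(points_lidar, T_cam_lidar, K, img_h, img_w):
    """Project a 3D lidar point cloud into a sparse depth map.

    points_lidar : (N, 3) points in the lidar frame
    T_cam_lidar  : (4, 4) rigid transform from the lidar frame to the camera frame
    K            : (3, 3) camera intrinsic matrix
    """
    # Homogeneous coordinates, then transform into the camera frame.
    pts = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts.T).T[:, :3]
    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-3]
    # Perspective projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v, z = uv[:, 0].astype(int), uv[:, 1].astype(int), pts_cam[:, 2]
    # Discard projections that fall outside the image.
    valid = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    depth = np.zeros((img_h, img_w), dtype=np.float32)
    # Nearest point wins where several lidar returns hit the same pixel.
    for ui, vi, zi in sorted(zip(u[valid], v[valid], z[valid]), key=lambda x: -x[2]):
        depth[vi, ui] = zi
    return depth
```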
Step 2, a depth estimation network model is built, wherein the depth estimation network model comprises a depth prediction sub-network and a camera pose estimation sub-network; the depth prediction sub-network is a self-encoding network comprising an encoder and a decoder, the encoder is a densely connected deep convolutional network, and the decoder comprises an up-sampling module and a dense atrous pyramid module (DenseASPP); the camera pose estimation sub-network is a deep convolutional neural network;
specifically, the structure of the depth estimation network model is shown in fig. 2:
the depth prediction sub-network is a self-encoding network whose encoder is a densely connected deep convolutional network DenseNet; the main body of the decoder is image up-sampling, and a dense atrous pyramid module DenseASPP is additionally introduced to perform multi-scale feature fusion. The RGB images I_t and I_{t-1} of two adjacent frames (shown in FIG. 3) are used as the input of the depth prediction sub-network, and the network outputs the corresponding depth prediction maps D̂_t (shown in FIG. 4), where the subscript t of I_t and D̂_t denotes time t and the hat on D̂_t indicates a prediction of the depth prediction network, distinguishing it from the ground-truth depth D_t obtained by the sensor.
The camera pose prediction sub-network is a deep convolutional network whose input is the RGB images I_t and I_{t-1} of two adjacent frames and whose output is the camera motion matrix T_{t→t-1} from time t to time t-1.
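A minimal PyTorch sketch of this encoder-decoder and pose structure is given below; the layer widths, dilation rates, output activations and the class names DepthNet/PoseNet are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision

class DenseASPP(nn.Module):
    """Densely connected atrous convolutions for multi-scale feature fusion."""
    def __init__(self, in_ch, mid_ch=128, dilations=(3, 6, 12, 18)):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(ch, mid_ch, 3, padding=d, dilation=d),
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True)))
            ch += mid_ch  # each block also sees the outputs of all earlier blocks
        self.out_ch = ch

    def forward(self, x):
        feats = [x]
        for blk in self.blocks:
            feats.append(blk(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class DepthNet(nn.Module):
    """DenseNet encoder + up-sampling decoder with a DenseASPP bottleneck."""
    def __init__(self):
        super().__init__()
        # DenseNet-121 feature extractor (pretrained weights could be loaded if desired).
        self.encoder = torchvision.models.densenet121(weights=None).features
        self.aspp = DenseASPP(1024)
        self.decoder = nn.Sequential(
            nn.Conv2d(self.aspp.out_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())  # normalized depth in (0, 1)

    def forward(self, img):
        return self.decoder(self.aspp(self.encoder(img)))

class PoseNet(nn.Module):
    """Predicts a 6-DoF relative camera motion from two concatenated RGB frames."""
    def __init__(self):
        super().__init__()
        chans = [6, 16, 32, 64, 128, 256]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)
        self.head = nn.Conv2d(256, 6, 1)  # 3 translation + 3 rotation (axis-angle)

    def forward(self, img_t, img_tm1):
        x = self.conv(torch.cat([img_t, img_tm1], dim=1))
        return 0.01 * self.head(x).mean(dim=(2, 3))  # small initial motions
```

For an input pair of, say, 256×832 KITTI crops, DepthNet returns a one-channel normalized depth map of the same size and PoseNet returns one 6-dimensional relative motion vector per image pair.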
Step 3, performing joint training on the depth prediction sub-network and the camera pose estimation sub-network by using training data, and performing iterative updating on network parameters of the two sub-networks by using a loss function to obtain a trained depth prediction sub-network;
wherein the loss function comprises an image reconstruction error L_p, a scale consistency error L_GC and a smoothing term error L_s;
(3.1) random samples are drawn from a Gaussian distribution with mean 0 and variance 0.01, and the sampled array is used as the initialization parameters of the depth estimation network model;
(3.2) respectively inputting the RGB images I_t and I_{t-1} of two adjacent frames into the depth prediction sub-network and the camera pose prediction sub-network, and then calculating the mask weights, the scale consistency error, the image reconstruction error and the smooth regularization term error from their outputs;
(3.3) jointly training the depth prediction sub-network and the camera pose estimation sub-network by minimizing an overall error, so that the depth prediction sub-network can output a high-precision depth map;
and (3.4) performing iterative updating on all parameters in the depth prediction sub-network and the camera pose estimation sub-network obtained in the step (3.3) by using a batch random gradient descent method until the model converges, and finishing the optimization of the network model.
The loss function mainly comprises the image reconstruction error L_p, the scale consistency error L_GC and the smoothing term error L_s. During image reconstruction, moving objects between adjacent frames, occluded regions and other complex pixels that are hard to explain often degrade the reconstruction quality. These pixels therefore need to be detected first and then given lower weights; the step of detecting such complex pixels is called the mask module, and its implementation is given in (3.2a).
(3.2a) From the output D̂_t of the depth prediction sub-network at time t and the camera motion matrix T_{t→t-1} from time t to time t-1 output by the camera pose estimation sub-network, a depth map D̂^t_{t-1} under the camera view at time t-1 can be reconstructed. Taking the normalized difference between D̂^t_{t-1} and the output D̂_{t-1} of the depth prediction sub-network at time t-1 gives the depth prediction error at pixel p:

D_diff(p) = |D̂^t_{t-1}(p) - D̂_{t-1}(p)| / (D̂^t_{t-1}(p) + D̂_{t-1}(p))

In the above formula, p denotes a pixel, and D_diff(p) lies in [0, 1]. For moving objects, occluded regions and other pixels that are hard to explain, D_diff(p) is large and close to 1; for pixels not belonging to these categories, D_diff(p) is small and close to 0. To give pixels with large D_diff(p) a lower weight, the mask weight M(p) at pixel p is calculated as follows:

M(p) = 1 - D_diff(p)
this weight will be applied to the scaled image reconstruction error in (3.3)
(3.2b) Taking the mean of the per-pixel depth prediction error D_diff(p) over the whole image gives the scale consistency error:

L_GC = (1 / num(V)) · Σ_{p∈V} D_diff(p)

where V is the set of valid pixels of the whole image and num(V) denotes the number of valid pixels.
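Assuming the warped depth map D̂^t_{t-1} has already been obtained (see the view-synthesis sketch in step (3.2c) below), the mask weight of step (3.2a) and the scale consistency error of step (3.2b) can be computed as in the following sketch; the valid-pixel mask argument and the eps constant are illustrative assumptions:

```python
import torch

def mask_and_geometry_loss(depth_warped, depth_tm1, valid, eps=1e-7):
    """depth_warped: D̂^t_{t-1}; depth_tm1: D̂_{t-1}; valid: boolean mask; all of shape (B, H, W)."""
    # Normalized depth difference D_diff(p) in [0, 1] (step 3.2a); eps avoids division by zero.
    d_diff = (depth_warped - depth_tm1).abs() / (depth_warped + depth_tm1 + eps)
    mask = 1.0 - d_diff                      # M(p) = 1 - D_diff(p)
    # Scale consistency error L_GC: mean of D_diff over the valid pixels (step 3.2b).
    l_gc = d_diff[valid].mean()
    return mask, l_gc
```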
(3.2c) As shown in Fig. 5, the image reconstruction process is as follows: combining the RGB image I_t at time t, the predicted depth map D̂_t and the camera motion matrix T_{t→t-1}, the RGB image Î_{t-1} at time t-1 can be reconstructed. Besides the grey-value error, the image reconstruction error also introduces a structural similarity error SSIM. Combining the mask weight M(p) obtained in step (3.2a), the image reconstruction error is:

L_p = (1 / num(V)) · Σ_{p∈V} M(p) · ( λ_i · |I_{t-1}(p) - Î_{t-1}(p)| + λ_s · (1 - SSIM(p)) / 2 )

where λ_i = 0.15 and λ_s = 0.85 (the two weights sum to 1). The term to the left of the plus sign is the absolute-value error of the image reconstruction, and the SSIM(p) term on the right is the structural similarity error between the two images at time t-1.
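The patent does not spell out the sampling details of the reconstruction in Fig. 5; a common differentiable realization in SfMLearner-style methods is inverse warping, sketched below under the assumption that the frame at time t-1 is synthesized by sampling I_t with the predicted depth D̂_{t-1}, the relative pose from t-1 to t (the inverse of T_{t→t-1}) and the camera intrinsics K; the SSIM map is assumed to be computed separately (see the sketch in the SSIM paragraph below):

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_t, depth_tm1, T_tm1_to_t, K):
    """Synthesize Î_{t-1} by sampling img_t at the pixels predicted by depth and pose.

    img_t: (B, C, H, W); depth_tm1: (B, 1, H, W); T_tm1_to_t: (B, 4, 4); K: (3, 3).
    """
    b, _, h, w = img_t.shape
    # Pixel grid of frame t-1 in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1)
    # Back-project pixels into 3D camera points using the predicted depth.
    cam = (torch.inverse(K) @ pix).unsqueeze(0) * depth_tm1.view(b, 1, -1)
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)
    # Transform into the frame-t camera and project with the intrinsics.
    proj = K @ (T_tm1_to_t @ cam_h)[:, :3, :]
    u = proj[:, 0] / proj[:, 2].clamp(min=1e-6)
    v = proj[:, 1] / proj[:, 2].clamp(min=1e-6)
    # Normalize to [-1, 1] for grid_sample and bilinearly sample img_t.
    grid = torch.stack([2 * u / (w - 1) - 1, 2 * v / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    return F.grid_sample(img_t, grid, padding_mode='border', align_corners=True)

def photometric_loss(img_tm1, img_recon, mask, ssim_map, valid, lambda_i=0.15, lambda_s=0.85):
    """Masked image reconstruction error L_p of step (3.2c); mask, ssim_map, valid are (B, H, W)."""
    abs_err = (img_tm1 - img_recon).abs().mean(dim=1)      # per-pixel grey-value error
    err = lambda_i * abs_err + lambda_s * (1.0 - ssim_map) / 2.0
    return (mask * err)[valid].mean()
```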
Here SSIM (structural similarity) is an index for measuring the similarity between two images. The index was first proposed by the Laboratory for Image and Video Engineering at the University of Texas at Austin.
Given two images x and y, their structural similarity is computed as follows:

SSIM(x, y) = ( (2·μ_x·μ_y + c_1) · (2·σ_xy + c_2) ) / ( (μ_x² + μ_y² + c_1) · (σ_x² + σ_y² + c_2) )

where μ_x is the mean of x, μ_y is the mean of y, σ_x² is the variance of x, σ_y² is the variance of y and σ_xy is the covariance of x and y. c_1 = (k_1·L)² and c_2 = (k_2·L)² are constants used to maintain numerical stability, L is the dynamic range of the pixel values, k_1 = 0.01 and k_2 = 0.03. The more similar the two images are, the closer the SSIM value is to 1.
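A per-pixel SSIM map can be computed with local averaging, for example with 3×3 mean pooling as below; the window size and the use of average pooling are illustrative choices, and for images normalized to [0, 1] the dynamic range L is 1:

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, k1=0.01, k2=0.03, L=1.0):
    """Per-pixel SSIM between images x and y of shape (B, C, H, W), averaged over channels."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    pool = lambda t: F.avg_pool2d(t, kernel_size=3, stride=1, padding=1)
    mu_x, mu_y = pool(x), pool(y)
    sigma_x = pool(x * x) - mu_x ** 2          # local variances
    sigma_y = pool(y * y) - mu_y ** 2
    sigma_xy = pool(x * y) - mu_x * mu_y       # local covariance
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1).mean(dim=1)  # (B, H, W); closer to 1 means more similar
```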
(3.2d) To address noise and the vanishing of gradients in low-texture regions, a smoothing term error is introduced:

L_s = Σ_p e^(-|∇I_t(p)|) · |∇D̂_t(p)|

where ∇I_t(p) is the gradient of the input RGB image at pixel p and ∇D̂_t(p) is the gradient of the depth map at pixel p.
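An edge-aware implementation of this smoothing term, which uses image-gradient magnitudes to down-weight depth gradients at image edges, might look like the following sketch (forward differences are an illustrative choice):

```python
import torch

def smoothness_loss(depth, img):
    """L_s: depth gradients weighted by exp(-|image gradient|); depth (B, 1, H, W), img (B, 3, H, W)."""
    dzdx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dzdy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    didx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    didy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (dzdx * torch.exp(-didx)).mean() + (dzdy * torch.exp(-didy)).mean()
```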
The overall loss function is the weighted sum of the image reconstruction error L_p, the scale consistency error L_GC and the smoothing term error L_s:

L = α·L_p + β·L_s + γ·L_GC

where α = 1.0, β = 0.1 and γ = 0.5; α, β and γ respectively denote the weights of the corresponding errors and take values in [0, 1].
And training and optimizing the network model by minimizing a loss function, namely the overall error L.
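Putting the three terms together with the weights given above yields the compute_total_loss helper assumed in the training sketch of step 3; it composes the sketches above, the argument list is illustrative, and pose_vec_to_matrix is a hypothetical helper converting the 6-DoF pose vector to a 4×4 matrix:

```python
import torch

def compute_total_loss(img_t, img_tm1, depth_t, depth_tm1, pose_t_to_tm1, intrinsics,
                       alpha=1.0, beta=0.1, gamma=0.5):
    # The relative pose from t-1 to t is the inverse of T_{t→t-1}.
    T = pose_vec_to_matrix(pose_t_to_tm1)                  # hypothetical 6-DoF -> 4x4 helper
    img_recon = inverse_warp(img_t, depth_tm1, torch.inverse(T), intrinsics)
    # Sampling only; a full implementation would also transform the sampled depth values
    # into the t-1 camera frame before comparing them.
    depth_warped = inverse_warp(depth_t, depth_tm1, torch.inverse(T), intrinsics)
    valid = torch.ones_like(depth_tm1.squeeze(1), dtype=torch.bool)   # all pixels assumed valid here
    mask, l_gc = mask_and_geometry_loss(depth_warped.squeeze(1), depth_tm1.squeeze(1), valid)
    s_map = ssim_map(img_recon, img_tm1)
    l_p = photometric_loss(img_tm1, img_recon, mask, s_map, valid)
    l_s = smoothness_loss(depth_t, img_t) + smoothness_loss(depth_tm1, img_tm1)
    return alpha * l_p + beta * l_s + gamma * l_gc         # L = αL_p + βL_s + γL_GC
```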
Step 4, inputting the monocular video to be tested into the trained depth prediction sub-network, and outputting a normalized depth prediction image; and calibrating the output normalized depth map according to the actual physical scale to obtain a final predicted depth map.
And (4.1) inputting the single RGB picture of the test sample into a depth prediction sub-network, and outputting a corresponding normalized depth map.
And (4.2) calibrating the output normalized depth map according to the actual physical scale to obtain a final predicted depth map.
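The patent does not fix a particular calibration procedure; one common convention for scale-ambiguous monocular methods is median scaling against a reference depth, sketched here as an assumption:

```python
import numpy as np

def calibrate_scale(pred_depth, ref_depth):
    """Rescale a normalized predicted depth map to physical units using a reference.

    pred_depth: normalized network output; ref_depth: sparse metric depth (e.g. from lidar),
    with zeros marking pixels without a measurement.
    """
    valid = ref_depth > 0
    scale = np.median(ref_depth[valid]) / np.median(pred_depth[valid])
    return pred_depth * scale
```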
Simulation experiment
The effectiveness of the invention is verified by the following simulation experiments.
1. Simulation conditions are as follows:
the simulation test of the invention is carried out under the linux operating environment with the GPU being Tesla P4. Dividing pictures, training set: 5240 pictures, verification set: 2070 pictures, test set 200 pictures.
2. Simulation content:
simulation 1, depth prediction is performed on the RGB image shown in fig. 3 from the KITTI image set by using the present invention, and a predicted depth map is obtained, as shown in fig. 4. It can be seen from fig. 4 that the present invention can recover a depth map from a single picture.
Simulation 2, a depth prediction experiment is performed on the KITTI image set using the method of the invention and the existing video-based unsupervised monocular depth estimation algorithm SC-SfMLearner. Taking the relative squared error SqRel, the root mean square error RMSE and the root mean square logarithmic error RMSE_log of the prediction results as the comparison criteria, the accuracy of monocular video depth estimation with the two methods is compared; the lower the values of SqRel, RMSE and RMSE_log, the higher the accuracy of depth prediction. The experimental results are shown in Table 1:
Table 1. Comparison of the prediction accuracy of the method of the invention and the existing SC-SfMLearner

Estimation method    SqRel     RMSE      RMSE_log
SC-SfMLearner        0.1834    6.8903    0.2630
The invention        0.1751    6.4451    0.2496
From the results in Table 1 it can be seen that, compared with the existing SC-SfMLearner depth prediction method, the relative squared error SqRel, the root mean square error RMSE and the root mean square logarithmic error RMSE_log of the proposed method are all smaller, which demonstrates the effectiveness of the method provided by the invention.
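For reference, the three error metrics in Table 1 are commonly computed over valid ground-truth pixels as follows (a sketch; the exact evaluation crop and depth caps used in the experiments are not specified here):

```python
import numpy as np

def depth_metrics(pred, gt):
    """SqRel, RMSE and RMSE_log between predicted and ground-truth depth maps."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    sq_rel = np.mean(((p - g) ** 2) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    return sq_rel, rmse, rmse_log
```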
Although the present invention has been described in detail in this specification with reference to specific embodiments and illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the present invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (6)

1. The monocular video depth estimation method based on a deep convolutional network is characterized by comprising the following steps:
step 1, acquiring training data and a monocular video to be tested;
wherein the training data comprise an RGB optical video sequence I = {I_t | 0 ≤ t ≤ T, t ∈ Z} and a corresponding ground-truth depth image sequence D = {D_t | 0 ≤ t ≤ T, t ∈ Z}, where Z denotes the set of time indices, I_t denotes the RGB image at time t and D_t denotes the ground-truth depth image at time t;
step 2, a depth estimation network model is built, wherein the depth estimation network model comprises a depth prediction sub-network and a camera pose estimation sub-network; the depth prediction sub-network is a self-encoding network comprising an encoder and a decoder, the encoder is a densely connected deep convolutional network, and the decoder comprises an up-sampling module and a dense atrous pyramid module (DenseASPP); the camera pose estimation sub-network is a deep convolutional neural network;
step 3, performing joint training on the depth prediction sub-network and the camera pose estimation sub-network by using training data, and performing iterative updating on network parameters of the two sub-networks by using a loss function to obtain a trained depth prediction sub-network;
wherein the loss function comprises an image reconstruction error L_p, a scale consistency error L_GC and a smoothing term error L_s;
Step 4, inputting the monocular video to be tested into the trained depth prediction sub-network, and outputting a normalized depth prediction image; and calibrating the output normalized depth map according to the actual physical scale to obtain a final predicted depth map.
2. The method of claim 1, wherein the encoder is a densely connected deep convolutional network DenseNet; the main body of the decoder is image up-sampling, and an additionally introduced dense atrous pyramid module DenseASPP performs multi-scale feature fusion.
3. The method for monocular video depth estimation based on depth convolutional network of claim 1, wherein the joint training of the depth prediction subnetwork and the camera pose estimation subnetwork is performed by using training data, and the specific process is as follows:
(3.1) randomly initializing network parameters of the depth estimation network model;
(3.2) respectively inputting the RGB images I_t and I_{t-1} of two adjacent frames into the depth prediction sub-network and the camera pose prediction sub-network, and then calculating the mask weights, the scale consistency error, the image reconstruction error and the smooth regularization term error from their outputs;
(3.3) jointly training the depth prediction sub-network and the camera pose estimation sub-network by minimizing an overall error, so that the depth prediction sub-network can output a high-precision depth map;
and (3.4) iteratively updating all network parameters in the depth prediction sub-network and the camera pose estimation sub-network obtained in the step (3.3) by using a batch random gradient descent method until the model converges, and finishing the optimization of the network model.
4. The method as claimed in claim 3, wherein the random sampling is performed from a Gaussian distribution with a mean value of 0 and a variance of 0.01, and the random sampling array is used as an initialization parameter of the depth estimation network model.
5. The method for monocular video depth estimation based on depth convolutional network of claim 3, wherein the mask weight, scale consistency error, image reconstruction error and smooth regularization term error of each sub-network are calculated respectively, and the specific steps are as follows:
(3.2a) from the output D̂_t of the depth prediction sub-network at time t and the camera motion matrix T_{t→t-1} from time t to time t-1 output by the camera pose estimation sub-network, reconstructing the depth map D̂^t_{t-1} under the camera view at time t-1; then taking the normalized difference between D̂^t_{t-1} and the output D̂_{t-1} of the depth prediction sub-network at time t-1 to obtain the depth prediction error at pixel p:

D_diff(p) = |D̂^t_{t-1}(p) - D̂_{t-1}(p)| / (D̂^t_{t-1}(p) + D̂_{t-1}(p))

where D_diff(p) is the depth prediction error at pixel p, whose value lies in [0, 1];

to give pixels with large D_diff(p) a lower weight, calculating the mask weight M(p) at pixel p as follows:

M(p) = 1 - D_diff(p);
(3.2b) taking the mean of the per-pixel depth prediction error D_diff(p) over the whole image to obtain the scale consistency error:

L_GC = (1 / num(V)) · Σ_{p∈V} D_diff(p)

where V is the set of valid pixels of the whole image and num(V) denotes the number of valid pixels;
(3.2c) combining the RGB image I_t at time t, the predicted depth map D̂_t and the camera motion matrix T_{t→t-1}, reconstructing the RGB image Î_{t-1} at time t-1; the error terms in this process include the image reconstruction grey-value error and the structural similarity error SSIM, and the image reconstruction error is then:

L_p = (1 / num(V)) · Σ_{p∈V} M(p) · ( λ_i · |I_{t-1}(p) - Î_{t-1}(p)| + λ_s · (1 - SSIM(p)) / 2 )

where λ_i and λ_s are weight parameters whose sum is 1; Î_{t-1}(p) denotes the grey value of pixel p in the reconstructed RGB image Î_{t-1} at time t-1; and SSIM(p) is the structural similarity error at pixel p between the two images Î_{t-1} and I_{t-1} at time t-1;
(3.2d) calculating the smoothing term error as follows:

L_s = Σ_p e^(-|∇I_t(p)|) · |∇D̂_t(p)|

where ∇I_t(p) is the gradient of the input RGB image at pixel p and ∇D̂_t(p) is the gradient of the depth map at pixel p.
6. The method of claim 5, wherein the overall error is the weighted sum of the image reconstruction error L_p, the scale consistency error L_GC and the smoothing term error L_s:

L = α·L_p + β·L_s + γ·L_GC

where α, β and γ respectively denote the weights of the corresponding errors and take values in [0, 1].
CN202110648477.0A 2021-06-10 2021-06-10 Monocular video depth estimation method based on depth convolutional network Pending CN113570658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110648477.0A CN113570658A (en) 2021-06-10 2021-06-10 Monocular video depth estimation method based on depth convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110648477.0A CN113570658A (en) 2021-06-10 2021-06-10 Monocular video depth estimation method based on depth convolutional network

Publications (1)

Publication Number Publication Date
CN113570658A true CN113570658A (en) 2021-10-29

Family

ID=78161933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110648477.0A Pending CN113570658A (en) 2021-06-10 2021-06-10 Monocular video depth estimation method based on depth convolutional network

Country Status (1)

Country Link
CN (1) CN113570658A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510535A (en) * 2018-03-14 2018-09-07 大连理工大学 A kind of high quality depth estimation method based on depth prediction and enhancing sub-network
WO2019174378A1 (en) * 2018-03-14 2019-09-19 大连理工大学 High-quality depth estimation method based on depth prediction and enhancement sub-networks
WO2019223382A1 (en) * 2018-05-22 2019-11-28 深圳市商汤科技有限公司 Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN109741383A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Picture depth estimating system and method based on empty convolution sum semi-supervised learning
WO2021013334A1 (en) * 2019-07-22 2021-01-28 Toyota Motor Europe Depth maps prediction system and training method for such a system
CN111311685A (en) * 2020-05-12 2020-06-19 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU/monocular image
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
CN111860386A (en) * 2020-07-27 2020-10-30 山东大学 Video semantic segmentation method based on ConvLSTM convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
岑仕杰; 何元烈; 陈小聪: "Monocular depth estimation combining attention and unsupervised deep learning", Journal of Guangdong University of Technology, no. 04 *
王欣盛; 张桂玲: "Monocular depth estimation based on convolutional neural networks", Computer Engineering and Applications, no. 13 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023155043A1 (en) * 2022-02-15 2023-08-24 中国科学院深圳先进技术研究院 Historical information-based scene depth reasoning method and apparatus, and electronic device
CN114627351A (en) * 2022-02-18 2022-06-14 电子科技大学 Fusion depth estimation method based on vision and millimeter wave radar
CN114627351B (en) * 2022-02-18 2023-05-16 电子科技大学 Fusion depth estimation method based on vision and millimeter wave radar
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN114998411B (en) * 2022-04-29 2024-01-09 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss
CN115272438A (en) * 2022-08-19 2022-11-01 中国矿业大学 High-precision monocular depth estimation system and method for three-dimensional scene reconstruction

Similar Documents

Publication Publication Date Title
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN113570658A (en) Monocular video depth estimation method based on depth convolutional network
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
US11715258B2 (en) Method for reconstructing a 3D object based on dynamic graph network
CN107818554B (en) Information processing apparatus and information processing method
CN110084304B (en) Target detection method based on synthetic data set
CN111462206B (en) Monocular structure light depth imaging method based on convolutional neural network
CN108171249B (en) RGBD data-based local descriptor learning method
CN105513033B (en) A kind of super resolution ratio reconstruction method that non local joint sparse indicates
CN112819853B (en) Visual odometer method based on semantic priori
CN113177592B (en) Image segmentation method and device, computer equipment and storage medium
CN113450396A (en) Three-dimensional/two-dimensional image registration method and device based on bone features
Eichhardt et al. Affine correspondences between central cameras for rapid relative pose estimation
CN112288788A (en) Monocular image depth estimation method
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN107392211B (en) Salient target detection method based on visual sparse cognition
CN114332125A (en) Point cloud reconstruction method and device, electronic equipment and storage medium
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN111160362B (en) FAST feature homogenizing extraction and interframe feature mismatching removal method
CN111401209B (en) Action recognition method based on deep learning
Nouduri et al. Deep realistic novel view generation for city-scale aerial images
CN117274515A (en) Visual SLAM method and system based on ORB and NeRF mapping
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN111553954A (en) Direct method monocular SLAM-based online luminosity calibration method
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination