CN111340867A - Depth estimation method and device for image frame, electronic equipment and storage medium - Google Patents

Depth estimation method and device for image frame, electronic equipment and storage medium Download PDF

Info

Publication number
CN111340867A
CN111340867A
Authority
CN
China
Prior art keywords
depth
image
image frames
optical flow
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010121139.7A
Other languages
Chinese (zh)
Other versions
CN111340867B (en)
Inventor
刘永进
赵旺
舒叶芷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Deep Blue Technology Shanghai Co Ltd
Original Assignee
Tsinghua University
Deep Blue Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Deep Blue Technology Shanghai Co Ltd filed Critical Tsinghua University
Priority to CN202010121139.7A priority Critical patent/CN111340867B/en
Publication of CN111340867A publication Critical patent/CN111340867A/en
Application granted granted Critical
Publication of CN111340867B publication Critical patent/CN111340867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a depth estimation method and device for an image frame, an electronic device and a storage medium. Pixel correspondences are extracted with an optical flow prediction network obtained through unsupervised training, replacing matching based on traditional hand-crafted image features such as SIFT, so that the determination of relationships between pixels becomes more accurate, and confidence sampling is introduced to further improve robustness. The camera pose relationship is solved from the established pixel correspondences instead of being estimated end to end, which greatly improves the generalization capability and the application performance of the whole system.

Description

Depth estimation method and device for image frame, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a depth estimation method and device for an image frame, electronic equipment and a storage medium.
Background
Monocular depth estimation is a hot topic in computer vision and robotics and has wide application in fields such as autonomous driving, robot navigation and three-dimensional reconstruction; visual odometry, i.e. camera pose estimation, is likewise an important problem in robotics. In a video sequence, depth information and camera pose information constrain and influence each other, so the joint solution and application of depth estimation and pose estimation is receiving increasing attention.
The development of neural networks and the rise of deep learning have brought new ideas and solutions to traditional computer vision tasks, making it possible to learn depth estimation from large-scale data. Data-driven supervised learning requires raw data and corresponding labels; however, obtaining depth labels for color (RGB) images is difficult, especially in outdoor scenes, where lidar provides only sparse point-cloud depth and an RGB-D camera cannot accurately measure depth at long range. This poses challenges for collecting and using data and for designing algorithmic systems.
The core idea of deep learning systems based on multi-task and unsupervised learning is to exploit the mutual constraints among several tasks to construct a loss function that supervises the neural network, so that no label supervision is needed. The image depth prediction task and the camera pose prediction task can be trained without label information because a constraint can be constructed through back-projection, pose transformation and projection reconstruction of the depth map. Most existing multi-task, unsupervised systems for image depth prediction and pose prediction use two independent end-to-end neural networks, PoseNet and DepthNet, to predict camera pose change and image depth respectively, and then compute a loss function. However, using PoseNet to predict camera pose changes is not robust enough: a trained PoseNet cannot give effective predictions for pose distributions that do not appear in the training set, indicating limited generalization.
Therefore, how to improve the stability and robustness of the system while inheriting the advantages of the unsupervised training neural network is a problem to be solved.
Disclosure of Invention
In view of the problems of the existing methods, embodiments of the present invention provide a method and an apparatus for estimating the depth of an image frame, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present invention provides a method for estimating depth of an image frame, including:
acquiring two adjacent image frames from a training video sequence, and respectively inputting the two image frames into an optical flow prediction network obtained through unsupervised training to obtain the corresponding relation between all pixels of the two image frames output by the optical flow prediction network;
performing confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of a camera pose according to a result of the confidence coefficient sampling to obtain a camera pose change estimated value, and performing triangularization operation according to the camera pose change estimated value and the corresponding relation between partial pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system;
calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map;
and respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
Optionally, the performing confidence level sampling on the corresponding relationship between all pixels of the two image frames, and estimating a relative change value of the camera pose according to a result of the confidence level sampling to obtain a camera pose change estimation value specifically includes:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
Optionally, the depth prediction network is based on an encoder-decoder architecture and adds a skip connection between the encoder and the decoder.
Optionally, the training process of the optical flow prediction network is as follows:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
In a second aspect, an embodiment of the present invention further provides a depth estimation apparatus for an image frame, including:
the optical flow prediction module is used for acquiring two adjacent image frames from a training video sequence, and respectively inputting the two image frames into an optical flow prediction network obtained through unsupervised training to obtain the corresponding relation between all pixels of the two image frames output by the optical flow prediction network;
the point cloud acquisition module is used for carrying out confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of a camera pose according to the result of the confidence coefficient sampling to obtain a camera pose change estimated value, and carrying out triangularization operation according to the corresponding relation between the camera pose change estimated value and part of pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system;
the network training module is used for calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the camera pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map;
and the depth estimation module is used for respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
Optionally, the point cloud obtaining module is specifically configured to:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
Optionally, the depth prediction network is based on an encoder-decoder architecture and adds a skip connection between the encoder and the decoder.
Optionally, the training process of the optical flow prediction network is as follows:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the above-described methods.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program, which causes the computer to execute the above method.
According to the technical scheme, pixel correspondences are extracted with the optical flow prediction network obtained through unsupervised training, replacing matching based on traditional hand-crafted image features such as SIFT, so that the determination of relationships between pixels becomes more accurate, and confidence sampling is introduced to further improve robustness; the camera pose relationship is solved from the established pixel correspondences instead of being estimated end to end, which greatly improves the generalization capability and the application performance of the whole system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a depth estimation method for an image frame according to an embodiment of the present invention;
FIG. 2 is an interaction diagram of a prediction model for depth estimation of an image frame according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an image frame depth estimation apparatus according to an embodiment of the present invention;
fig. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Fig. 1 shows a flowchart of a depth estimation method for an image frame provided in this embodiment, which includes:
s101, two adjacent image frames are obtained from a training video sequence, and the two image frames are respectively input into an optical flow prediction network obtained through unsupervised training, so that the corresponding relation between all pixels of the two image frames output by the optical flow prediction network is obtained.
Wherein the training video sequence is a video sequence used for training an optical flow prediction network and a depth prediction network.
The optical flow prediction network and the depth prediction network are unsupervised and do not need to input annotation information other than image frames.
An unsupervised training process for the optical flow prediction network comprises: inputting two image frames into the optical flow prediction network, reconstructing the second frame image by using the first frame image and the corresponding optical flow prediction result, and training the neural network by minimizing the error between the reconstructed image and the original image; the optical flow prediction network explicitly computes an image occlusion mask from the predicted optical flow values, and occluded regions do not participate in the computation of the reconstruction error; at the same time, the consistency loss between the forward and backward optical flows of the two frames is computed to give a confidence for the pixel correspondences established from the optical flow.
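As an illustration of this warping-based supervision, the following is a minimal sketch of the photometric reconstruction term, assuming PyTorch tensors of shape (B, C, H, W), flow stored as per-pixel (Δx, Δy) displacements, and an occlusion mask equal to 1 at visible pixels; it is not the patent's exact implementation, and the SSIM component of the reconstruction error is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Bilinearly sample `image` at positions shifted by `flow` (B, 2, H, W), in pixels."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2, H, W), (x, y) order
    coords = grid.unsqueeze(0) + flow                               # target pixel coordinates
    # normalize to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((gx, gy), dim=-1)                       # (B, H, W, 2)
    return F.grid_sample(image, norm_grid, mode="bilinear", align_corners=True)

def masked_reconstruction_loss(frame1, frame2, flow_fwd, occlusion_mask):
    """L1 photometric error between frame1 and frame2 warped back by the forward flow,
    evaluated only where occlusion_mask == 1 (pixels visible in both frames)."""
    frame1_rec = warp_with_flow(frame2, flow_fwd)
    diff = occlusion_mask * (frame1 - frame1_rec).abs()
    return diff.sum() / occlusion_mask.sum().clamp(min=1.0)
```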
S102, performing confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of the camera pose according to the result of the confidence coefficient sampling to obtain a camera pose change estimated value, and performing triangularization operation according to the corresponding relation between the camera pose change estimated value and part of the pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system.
The estimation of the camera pose means solving for the change of the camera extrinsic parameters from an image or video sequence and recovering the motion trajectory of the camera. Pose estimation is similar to the localization task in simultaneous localization and mapping (SLAM), where localization is achieved by solving for relative poses. In the traditional SLAM algorithm, correspondences between pixels are established using hand-crafted visual features, such as scale-invariant feature transform (SIFT) descriptors or oriented FAST and rotated BRIEF (ORB) descriptors, and the relative pose change of the camera is then obtained by solving a least-squares problem under the epipolar geometry constraint. By combining noise-suppression algorithms such as random sample consensus (RANSAC), a more accurate and robust camera pose can be recovered. However, the accuracy of the pose obtained in this way depends on the pixel correspondences established from SIFT or ORB, so pose estimation accuracy degrades greatly when the camera moves fast or image quality is low and the correspondences become inaccurate. The depth estimation method for image frames provided by this embodiment establishes pixel correspondences from deep-learning-based optical flow, making the acquisition of pixel correspondences more accurate and robust.
The triangulation operation is differentiable, so end-to-end training can be guaranteed; the triangulation that produces the point cloud is based on the small number of high-confidence pixel correspondences obtained by confidence sampling, i.e., the triangulated point cloud is sparse.
S103, calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map.
Specifically, the obtained sparse point cloud is projected to reconstruct a sparse depth map; the predicted depth map is scale-transformed so as to minimize its error with respect to the sparse reconstructed depth map, and the L2 error between the sparse reconstructed depth map and the scale-transformed predicted depth map is then used to supervise the depth prediction network.
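As a sketch of this scale alignment, the snippet below (an illustrative assumption rather than the patent's code) computes the closed-form least-squares scale factor over the pixels where the triangulated sparse depth is defined, and then the masked L2 supervision term.

```python
import torch

def sparse_depth_loss(pred_depth, sparse_depth, valid_mask):
    """pred_depth, sparse_depth, valid_mask: (B, 1, H, W) tensors;
    valid_mask is 1 where the projected triangulated point cloud gives a depth value."""
    # closed-form scale c minimizing || valid * (sparse - c * pred) ||^2
    num = (valid_mask * sparse_depth * pred_depth).sum()
    den = (valid_mask * pred_depth * pred_depth).sum().clamp(min=1e-8)
    c = num / den
    err = valid_mask * (sparse_depth - c * pred_depth) ** 2
    return err.sum() / valid_mask.sum().clamp(min=1.0)
```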
S104, respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
The depth estimation of the image frame refers to acquiring a scene depth value corresponding to each pixel from a single color image. The acquisition of scene depth values from a sequence of images of a monocular camera or from a single image is an underdetermined problem: the projections of all depth estimation values differing by a constant scale factor are the same on the image plane and conform to the current color picture, so that the true size of the depth value cannot be recovered, and only the relative size of the depth value can be recovered.
The embodiment provides a monocular camera pose and image depth prediction method based on multitask and unsupervised learning, which comprises the following steps: acquiring two adjacent image frames from a training video sequence; inputting the two frames of images into an optical flow prediction network obtained through unsupervised training together, and outputting the corresponding relation between all pixels of the two frames of images; inputting the two frames of images into a depth prediction network respectively, and outputting depth estimation values of the two frames of images; performing confidence coefficient sampling on the corresponding relation between all pixels of the two frames of images, selecting partial pixels with the highest confidence coefficient, inputting the partial pixels into an eight-point method and a random sample consensus (RANSAC) algorithm, and estimating a relative change value of the camera pose; performing triangularization operation according to the estimated value of the pose change of the camera and the corresponding relation of the pixels to obtain point cloud in a coordinate system of the three-dimensional camera; calculating a point cloud projection reconstruction depth map to supervise a depth prediction network, simultaneously performing inverse projection-transformation-projection reconstruction on depth prediction values of two frames of images by using a camera pose relative change estimation value, and realizing the training of the depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map; and applying the trained optical flow network, depth prediction network and pose calculation algorithm to the pose and image depth estimation of the monocular camera.
In this embodiment, pixel correspondences are extracted with the optical flow prediction network obtained through unsupervised training, replacing matching based on traditional hand-crafted image features such as SIFT, so that the determination of relationships between pixels becomes more accurate, and confidence sampling is introduced to further improve robustness; the camera pose relationship is solved from the established pixel correspondences instead of being estimated end to end, which greatly improves the generalization capability and the application performance of the whole system.
Further, on the basis of the above method embodiment, the performing confidence level sampling on the corresponding relationship between all pixels of the two image frames in S102, and estimating a relative change value of the camera pose according to a result of the confidence level sampling to obtain a camera pose change estimation value specifically includes:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
The confidence sampling is carried out according to the occlusion mask output by the optical flow prediction network and the consistency loss of the forward and backward optical flows; after the camera fundamental matrix is obtained by the eight-point method and the RANSAC algorithm, the optimal rotation and translation matrices are determined by verifying the feasibility of the triangulated point cloud, where feasibility means that the rotation and translation solved from the fundamental matrix must place the triangulated points in front of both camera planes.
Further, on the basis of the above method embodiment, the depth prediction network is based on an encoder-decoder structure and adds a skip connection between the encoder and the decoder.
Where the output of the depth prediction network is disparity, i.e. the inverse of depth.
The fineness of the output results is improved by adding a skip connection between the encoder and the decoder.
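A minimal sketch of such an encoder-decoder with a skip connection is shown below; the patent does not specify the network at this level of detail, so the two-stage structure, the layer sizes and the sigmoid disparity output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder predicting disparity (inverse depth) with one skip connection."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # the last decoder stage receives the concatenation of its input and the
        # matching encoder feature map (the skip connection)
        self.dec1 = nn.ConvTranspose2d(32 + 32, 1, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                        # 1/2 resolution
        e2 = self.enc2(e1)                       # 1/4 resolution
        d2 = self.dec2(e2)                       # back to 1/2 resolution
        d1 = self.dec1(torch.cat([d2, e1], 1))   # skip connection from enc1
        return torch.sigmoid(d1)                 # disparity in (0, 1)
```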
Further, on the basis of the above method embodiment, the training process of the optical flow prediction network is as follows:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating an L1 error and an SSIM error between the reconstructed image and the original image as a supervision signal;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
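Since the exact form of L_smooth appears only as an equation image in the original filing, the sketch below uses a commonly used edge-aware smoothness formulation (flow gradients down-weighted across strong image gradients) purely as an assumed stand-in for that term.

```python
import torch

def edge_aware_smoothness(flow, image):
    """flow: (B, 2, H, W) predicted optical flow; image: (B, 3, H, W) first frame."""
    flow_dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs()
    flow_dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs()
    img_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    img_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    # penalize flow gradients less strongly across image edges
    return (flow_dx * torch.exp(-img_dx)).mean() + (flow_dy * torch.exp(-img_dy)).mean()
```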
Specifically, a constraint relation is established from the predicted depth maps of the two image frames and the obtained relative camera pose change, namely: the predicted depth map of the first frame is back-projected into three-dimensional space, the resulting point cloud is transformed with the camera pose change parameters (rotation and translation) and projected onto the image plane to reconstruct the depth map of the second frame, and the L2 error between the reconstructed second-frame depth map and the predicted second-frame depth map is used to supervise the depth prediction network.
And calculating an edge-related image smoothing loss function through the obtained predicted depth maps of the two image frames, so that the predicted depth maps conform to the smooth change characteristic of the image and the sharp change characteristic of the edge.
The obtained prediction depth maps of the two image frames are subjected to inverse projection-rotational translation transformation-projection operation, so that a pixel corresponding relation between the two image frames can be established, the corresponding relation is compared with the corresponding relation established based on the optical flow, and supervision signals for a depth prediction network and the optical flow prediction network are generated.
The method comprises the steps that a deep neural network system based on multi-task learning and unsupervised learning simultaneously predicts an optical flow, a pose and a depth value during training, and obtains a loss function by utilizing inherent constraints of three tasks; during testing, tasks are decoupled, a depth prediction network or an optical flow prediction network can be tested independently, or the pose of a camera can be solved according to the optical flow prediction result.
Specifically, the depth estimation method for image frames provided by this embodiment may include the following specific steps:
step a1, two adjacent image frames are obtained from the training video sequence.
The adjacent two frames of images are not necessarily two adjacent frames in the original video sequence, but may be two images separated by several frames in time.
Step A2, inputting the two frames of images into an optical flow prediction network obtained through unsupervised training, and outputting the corresponding relation between all pixels of the two frames of images;
the optical flow prediction network inputs the stacking of two images and outputs the movement distance (delta x, delta y) from each pixel in the first image to the corresponding pixel in the second image; the corresponding pixels here refer to a pair of pixels in the two frame images that refer to the same region of the real world.
The optical flow prediction network is used to obtain the optical flow f_1 from the first frame to the second frame and the optical flow f_2 from the second frame to the first frame. Ideally, a pixel transformed by f_1 and then by f_2 should return to its original position. The distance ΔD between the pixel position after applying f_1 followed by f_2 and the original position before the transformation gives the consistency confidence C of the pixel correspondences (the exact expression of C is given as an equation image in the original filing).
The confidence is used in subsequent sampling to obtain reliable pixel correspondences for calculating camera pose changes and triangularizing operations.
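A minimal sketch of this forward-backward check is given below, reusing the warp_with_flow helper from the earlier sketch; since the exact confidence expression appears only as an equation image in the original filing, the exponential mapping from ΔD to a confidence is an assumed illustrative choice.

```python
import torch

def flow_consistency_confidence(flow_fwd, flow_bwd):
    """flow_fwd, flow_bwd: (B, 2, H, W). For consistent, unoccluded pixels the
    composition f1 + warp(f2, f1) should be close to zero."""
    flow_bwd_warped = warp_with_flow(flow_bwd, flow_fwd)   # f2 sampled where f1 points
    delta = torch.norm(flow_fwd + flow_bwd_warped, dim=1, keepdim=True)  # ΔD, (B, 1, H, W)
    return torch.exp(-delta)   # assumed mapping: confidence in (0, 1], high where flows agree
```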
And step A3, inputting the two frames of images into a depth prediction network respectively, and outputting depth estimation values of the two frames of images.
The depth prediction network is based on an encoder-decoder structure, and a skip connection between the encoder and the decoder is added to improve the fineness of the output; the output of the depth prediction network is the disparity, i.e. the inverse form of the depth.
And A4, performing confidence coefficient sampling on the corresponding relation between all pixels of the two frames of images, and selecting partial pixels with the highest confidence coefficient to input into an eight-point method and a random sample consensus RANSAC algorithm to estimate the relative change value of the camera pose.
The confidence sampling is based on the occlusion mask M_o output by the optical flow prediction network and the forward-backward optical flow consistency confidence C; after the camera fundamental matrix is obtained by the eight-point method and the random sample consensus (RANSAC) algorithm, the optimal rotation and translation matrices are determined by verifying the feasibility of the triangulated point cloud.
The eight-point method obtains the fundamental matrix by solving a least-squares problem, and the random sample consensus algorithm checks how well the obtained fundamental matrix agrees with the existing pixel correspondences through a loop of sampling, hypothesizing, solving and verification, finally removing noisy points such as dynamic objects in the scene and inaccurately predicted pixel correspondences. From the finally obtained fundamental matrix, the distance from each pair of corresponding pixels to its corresponding epipolar line can be computed; ideally, corresponding pixels should fall exactly on the epipolar line. Binarizing this distance with a suitable threshold yields an inlier mask M_i, in which a value of 0 marks a likely dynamic object or an inaccurate pixel correspondence and a value of 1 marks a reliable pixel correspondence.
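A minimal sketch of this step, using OpenCV as one possible implementation (an assumption for illustration, not the patent's own code), is shown below: the highest-confidence flow correspondences are selected, the relative pose is estimated with RANSAC, and the returned inlier mask plays the role of M_i. The calibrated essential-matrix variant is used here; cv2.findFundamentalMat with cv2.FM_RANSAC would be the uncalibrated eight-point alternative.

```python
import numpy as np
import cv2

def estimate_relative_pose(pts1, pts2, confidence, K, num_samples=2000):
    """pts1, pts2: (N, 2) matched pixel coordinates from the predicted flow;
    confidence: (N,) per-correspondence confidence; K: (3, 3) camera intrinsics."""
    # confidence sampling: keep only the most reliable correspondences
    keep = np.argsort(-confidence)[:num_samples]
    p1 = pts1[keep].astype(np.float64)
    p2 = pts2[keep].astype(np.float64)
    # RANSAC-based estimation; the returned mask plays the role of the inlier mask M_i
    E, inlier_mask = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
    # cheirality check: recoverPose keeps the (R, t) whose triangulated points
    # lie in front of both camera planes
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=inlier_mask)
    return R, t, inlier_mask
```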
And A5, performing triangularization operation according to the camera pose change estimation value and the pixel corresponding relation to obtain point cloud in a three-dimensional camera coordinate system.
The triangulation operation is differentiable, so end-to-end training can be guaranteed; the triangulation that produces the point cloud is based on the small number of high-confidence pixel correspondences obtained by confidence sampling, i.e., the triangulated point cloud is sparse.
The triangulation method adopts the mid-point triangulation algorithm, which has the advantages of being simple, easy to compute and geometrically clear.
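The following NumPy sketch illustrates mid-point triangulation for a single pixel correspondence; the frame convention (camera 1 at the origin, with (R, t) mapping camera-1 coordinates to camera-2 coordinates) and the variable names are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def midpoint_triangulate(p1, p2, K, R, t):
    """p1, p2: (2,) pixel coordinates in frames 1 and 2; K: (3, 3) intrinsics;
    R, t: relative pose mapping camera-1 coordinates to camera-2 coordinates."""
    K_inv = np.linalg.inv(K)
    d1 = K_inv @ np.array([p1[0], p1[1], 1.0])           # ray direction of camera 1
    d2 = R.T @ (K_inv @ np.array([p2[0], p2[1], 1.0]))   # camera-2 ray, expressed in frame 1
    c1 = np.zeros(3)                                     # camera-1 center
    c2 = -R.T @ t                                        # camera-2 center in frame 1
    # closest points on the two rays: solve s1*d1 - s2*d2 = c2 - c1 in least squares
    A = np.stack([d1, -d2], axis=1)                      # (3, 2)
    s, *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    x1 = c1 + s[0] * d1
    x2 = c2 + s[1] * d2
    return 0.5 * (x1 + x2)                               # midpoint of the shortest segment
```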
A6, calculating a point cloud projection reconstruction depth map to supervise a depth prediction network, performing inverse projection-transformation-projection reconstruction on depth prediction values of two frames of images by using a camera pose relative change estimation value, and realizing the training of the depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map; the projection relation of the two frames of depth maps can be used for establishing the corresponding relation of pixels, the corresponding relation of the pixels is compared with the corresponding relation of the pixels established according to the predicted optical flow, and supervision signals for the depth prediction network and the optical flow prediction network are generated.
The loss function L_d supervising the depth prediction network comprises four terms:
L_d = w_1·L_td + w_2·L_pd + w_3·L_sd + w_4·L_fd
where L_td is the loss term supervised by the triangulated depth map, L_pd is the loss term of the mutual projection reconstruction between the two frame depth maps, L_sd is the smoothness loss term of the depth maps, and L_fd is the loss term comparing the pixel correspondences computed from the two depth maps and the camera pose with the pixel correspondences computed from the predicted optical flow. Specifically:
The expression of L_td is given as an equation image in the original filing, where D_tri is the sparse depth map reconstructed by projecting the triangulated sparse point cloud D_t, and c is the calculated optimal scale transformation factor. The expression of L_pd is likewise given as an equation image, with
p_2d = φ(K[T_12·D_1(p_1)·K⁻¹·h(p_1)])
where p_1 denotes a pixel position (x, y), M_o is the occlusion mask, M_i is the inlier mask, and a normalization term (shown only as an image in the original) is used for regularization; D_1 is the predicted depth map of the first frame, and the expression also involves the second-frame depth map reconstructed from the first-frame depth map and the second-frame depth map obtained by interpolation at the corresponding grid points p_2d (both denoted by symbols given only as images in the original); K is the camera intrinsic matrix, T_12 denotes the relative change of the camera pose, φ denotes the transformation from the camera coordinate system to the pixel coordinate system, and h denotes conversion to homogeneous coordinates. The expression of L_fd is given as an equation image, with
p_2f = p_1 + F_12(p_1)
where F_12 denotes the optical flow prediction result. The expression of L_sd is also given as an equation image in the original filing.
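The back-projection, pose transformation and projection expressed by p_2d = φ(K[T_12·D_1(p_1)·K⁻¹·h(p_1)]) can be sketched as follows (a NumPy illustration with assumed variable names, taking T_12 as a rotation R and translation t); the resulting pixel positions can be compared with the flow-based correspondences for L_fd, or used to interpolate a reconstructed second-frame depth map for L_pd.

```python
import numpy as np

def reproject(depth1, K, R, t):
    """depth1: (H, W) predicted depth of frame 1; K: (3, 3) intrinsics;
    R, t: relative pose T_12 from frame 1 to frame 2.
    Returns p2d: (H, W, 2) pixel positions in frame 2, and z2: (H, W) transformed depth."""
    h, w = depth1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix_h = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # h(p1)
    rays = pix_h @ np.linalg.inv(K).T              # K^{-1} h(p1), applied per pixel
    pts1 = rays * depth1[..., None]                # back-projected 3D points in frame 1
    pts2 = pts1 @ R.T + t                          # T_12 applied: points in frame 2
    proj = pts2 @ K.T                              # K [X Y Z]^T, per pixel
    p2d = proj[..., :2] / np.clip(proj[..., 2:3], 1e-8, None)  # phi: perspective division
    return p2d, pts2[..., 2]
```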
and A7, applying the trained optical flow prediction network, depth prediction network and pose calculation algorithm to the pose and image depth estimation of the monocular camera.
Fig. 2 is a frame diagram of a testing process of the method for predicting camera pose, optical flow between images, and pose change according to this embodiment, where the process includes:
and step B1, inputting a video sequence into the trained optical flow prediction network to obtain the optical flow prediction between adjacent image frames.
The occlusion mask and the forward-backward optical flow consistency confidence still need to be calculated for confidence sampling.
And step B2, obtaining the correspondences between pixels from the optical flow prediction, obtaining more reliable pixel correspondences through confidence sampling, solving the fundamental matrix with the eight-point method and the random sample consensus algorithm, and obtaining the optimal rotation and translation matrices through triangulation verification.
The sampling and solving procedure is the same as in training, but the random sample consensus algorithm can be run iteratively many times to obtain more accurate results.
And step B3, inputting the same video sequence into the trained depth prediction network to obtain the depth map prediction result of each frame.
The prediction of the depth map is done frame by frame and does not require the entire piece of video context information.
The deep neural network system based on multi-task learning and unsupervised learning predicts optical flow, pose and depth value simultaneously during training, and obtains a loss function by utilizing the inherent constraints of three tasks; during testing, tasks are decoupled, a depth prediction network or an optical flow prediction network can be tested independently, or the pose of a camera can be solved according to the optical flow prediction result.
In prior-art work on unsupervised learning of image depth and camera pose, the learning of an end-to-end pose prediction neural network depends on the distribution of the training data, its generalization capability is poor, and its performance on data that does not appear in the training set is poor. Therefore, this embodiment provides a camera pose and image depth prediction method based on multi-task and unsupervised learning that combines a neural network, which effectively estimates optical flow, with an epipolar-geometry physical model for solving the pose, realizing a more robust and better-generalizing deep learning system.
Fig. 3 is a schematic structural diagram illustrating a depth estimation apparatus for image frames according to this embodiment, where the apparatus includes: an optical flow prediction module 301, a point cloud acquisition module 302, a network training module 303 and a depth estimation module 304; wherein:
the optical flow prediction module 301 is configured to obtain two adjacent image frames from a training video sequence, and input the two image frames into an optical flow prediction network obtained through unsupervised training, respectively, to obtain a corresponding relationship between all pixels of the two image frames output by the optical flow prediction network;
the point cloud obtaining module 302 is configured to perform confidence level sampling on correspondence between all pixels of the two image frames, estimate a relative change value of a camera pose according to a result of the confidence level sampling to obtain a camera pose change estimation value, and perform triangulation operation according to the camera pose change estimation value and correspondence between some pixels after sampling between the two image frames to obtain a point cloud in a three-dimensional camera coordinate system;
the network training module 303 is configured to calculate a projection of the point cloud, reconstruct a depth map, perform inverse projection-transformation-projection reconstruction on depth prediction values of the two image frames in the depth map according to the pose change estimation value, and implement training on a depth prediction network by minimizing an error between the reconstructed depth map and a predicted depth map;
the depth estimation module 304 is configured to input the two image frames into the depth prediction network, respectively, to obtain depth estimation values of the two image frames output by the depth prediction network.
Specifically, the optical flow prediction module 301 obtains two adjacent image frames from a training video sequence, and respectively inputs the two image frames into an optical flow prediction network obtained through unsupervised training, so as to obtain a corresponding relationship between all pixels of the two image frames output by the optical flow prediction network; the point cloud obtaining module 302 performs confidence coefficient sampling on the corresponding relationship between all pixels of the two image frames, estimates a relative change value of a camera pose according to a result of the confidence coefficient sampling to obtain a camera pose change estimation value, and performs triangulation operation according to the camera pose change estimation value and the corresponding relationship between part of the pixels after sampling between the two image frames to obtain a point cloud in a three-dimensional camera coordinate system; the network training module 303 calculates the projection of the point cloud, reconstructs a depth map, performs inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the machine pose change estimation value, and realizes the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map; the depth estimation module 304 inputs the two image frames into the depth prediction network, respectively, to obtain depth estimation values of the two image frames output by the depth prediction network.
In this embodiment, pixel correspondences are extracted with the optical flow prediction network obtained through unsupervised training, replacing matching based on traditional hand-crafted image features such as SIFT, so that the determination of relationships between pixels becomes more accurate, and confidence sampling is introduced to further improve robustness; the camera pose relationship is solved from the established pixel correspondences instead of being estimated end to end, which greatly improves the generalization capability and the application performance of the whole system.
Further, on the basis of the above apparatus embodiment, the point cloud obtaining module 302 is specifically configured to:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
Further, on the basis of the above apparatus embodiment, the depth prediction network is based on an encoder-decoder structure and adds a skip connection between the encoder and the decoder.
Further, on the basis of the above device embodiment, the training process of the optical flow prediction network is as follows:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
The depth estimation apparatus for image frames described in this embodiment may be used to implement the above method embodiments, and the principle and technical effect are similar, which are not described herein again.
Referring to fig. 4, the electronic device includes: a processor (processor)401, a memory (memory)402, and a bus 403;
wherein,
the processor 401 and the memory 402 complete communication with each other through the bus 403;
the processor 401 is configured to call program instructions in the memory 402 to perform the methods provided by the above-described method embodiments.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the method embodiments described above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of depth estimation of an image frame, comprising:
acquiring two adjacent image frames from a training video sequence, and respectively inputting the two image frames into an optical flow prediction network obtained through unsupervised training to obtain the corresponding relation between all pixels of the two image frames output by the optical flow prediction network;
performing confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of a camera pose according to a result of the confidence coefficient sampling to obtain a camera pose change estimated value, and performing triangularization operation according to the camera pose change estimated value and the corresponding relation between partial pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system;
calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map;
and respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
2. The method for depth estimation of image frames according to claim 1, wherein the confidence sampling is performed on correspondence between all pixels of the two image frames, and a relative change value of a camera pose is estimated according to a result of the confidence sampling to obtain a camera pose change estimation value, specifically comprising:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
3. The method of depth estimation of image frames according to claim 1, characterized in that said depth prediction network is based on an encoder-decoder structure and adds a skip connection between the encoder and the decoder.
4. The method for estimating the depth of the image frame according to claim 1, wherein the training process of the optical flow prediction network is:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
5. An apparatus for depth estimation of an image frame, comprising:
the optical flow prediction module is used for acquiring two adjacent image frames from a training video sequence, and respectively inputting the two image frames into an optical flow prediction network obtained through unsupervised training to obtain the corresponding relation between all pixels of the two image frames output by the optical flow prediction network;
the point cloud acquisition module is used for carrying out confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of a camera pose according to the result of the confidence coefficient sampling to obtain a camera pose change estimated value, and carrying out triangularization operation according to the corresponding relation between the camera pose change estimated value and part of pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system;
the network training module is used for calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the camera pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map;
and the depth estimation module is used for respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
6. The image frame depth estimation device of claim 5, wherein the point cloud acquisition module is specifically configured to:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
7. The apparatus for depth estimation of image frames according to claim 5, wherein said depth prediction network is based on an encoder-decoder architecture and adds a skip connection between encoder and decoder.
8. The image frame depth estimation device of claim 5, wherein the optical flow prediction network is trained by:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements a method of depth estimation of image frames according to any of claims 1 to 4.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, which, when being executed by a processor, implements a method of depth estimation of image frames according to any one of claims 1 to 4.
CN202010121139.7A 2020-02-26 2020-02-26 Depth estimation method and device for image frame, electronic equipment and storage medium Active CN111340867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010121139.7A CN111340867B (en) 2020-02-26 2020-02-26 Depth estimation method and device for image frame, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010121139.7A CN111340867B (en) 2020-02-26 2020-02-26 Depth estimation method and device for image frame, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111340867A true CN111340867A (en) 2020-06-26
CN111340867B CN111340867B (en) 2022-10-18

Family

ID=71187112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010121139.7A Active CN111340867B (en) 2020-02-26 2020-02-26 Depth estimation method and device for image frame, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111340867B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108665496A (en) * 2018-03-21 2018-10-16 浙江大学 A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method
CN109472830A (en) * 2018-09-28 2019-03-15 中山大学 A kind of monocular visual positioning method based on unsupervised learning
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348843A (en) * 2020-10-29 2021-02-09 北京嘀嘀无限科技发展有限公司 Method and device for adjusting depth image prediction model and electronic equipment
CN112184611A (en) * 2020-11-03 2021-01-05 支付宝(杭州)信息技术有限公司 Image generation model training method and device
CN112381868A (en) * 2020-11-13 2021-02-19 北京地平线信息技术有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN112381868B (en) * 2020-11-13 2024-08-02 北京地平线信息技术有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN112116653B (en) * 2020-11-23 2021-03-30 华南理工大学 Object posture estimation method for multiple RGB pictures
CN112116653A (en) * 2020-11-23 2020-12-22 华南理工大学 Object posture estimation method for multiple RGB pictures
CN113298860A (en) * 2020-12-14 2021-08-24 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and storage medium
CN112561978A (en) * 2020-12-18 2021-03-26 北京百度网讯科技有限公司 Training method of depth estimation network, depth estimation method of image and equipment
CN112561978B (en) * 2020-12-18 2023-11-17 北京百度网讯科技有限公司 Training method of depth estimation network, depth estimation method of image and equipment
CN112672150A (en) * 2020-12-22 2021-04-16 福州大学 Video coding method based on video prediction
CN112954293A (en) * 2021-01-27 2021-06-11 北京达佳互联信息技术有限公司 Depth map acquisition method, reference frame generation method, encoding and decoding method and device
CN112954293B (en) * 2021-01-27 2023-03-24 北京达佳互联信息技术有限公司 Depth map acquisition method, reference frame generation method, encoding and decoding method and device
CN112991418A (en) * 2021-03-09 2021-06-18 北京地平线信息技术有限公司 Image depth prediction and neural network training method and device, medium and equipment
CN112991418B (en) * 2021-03-09 2024-03-29 北京地平线信息技术有限公司 Image depth prediction and neural network training method and device, medium and equipment
WO2022241874A1 (en) * 2021-05-18 2022-11-24 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision ranging method and related assembly
CN113781542A (en) * 2021-09-23 2021-12-10 Oppo广东移动通信有限公司 Model generation method, depth estimation device and electronic equipment
CN113899363A (en) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 Vehicle positioning method and device and automatic driving vehicle
US11953609B2 (en) 2021-09-29 2024-04-09 Beijing Baidu Netcom Science Technology Co., Ltd. Vehicle positioning method, apparatus and autonomous driving vehicle
CN114463409A (en) * 2022-02-11 2022-05-10 北京百度网讯科技有限公司 Method and device for determining image depth information, electronic equipment and medium
US11783501B2 (en) 2022-02-11 2023-10-10 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for determining image depth information, electronic device, and media
CN114463409B (en) * 2022-02-11 2023-09-26 北京百度网讯科技有限公司 Image depth information determining method and device, electronic equipment and medium
CN115272423B (en) * 2022-09-19 2022-12-16 深圳比特微电子科技有限公司 Method and device for training optical flow estimation model and readable storage medium
CN115272423A (en) * 2022-09-19 2022-11-01 深圳比特微电子科技有限公司 Method and device for training optical flow estimation model and readable storage medium
CN116758131B (en) * 2023-08-21 2023-11-28 之江实验室 Monocular image depth estimation method and device and computer equipment
CN116758131A (en) * 2023-08-21 2023-09-15 之江实验室 Monocular image depth estimation method and device and computer equipment

Also Published As

Publication number Publication date
CN111340867B (en) 2022-10-18

Similar Documents

Publication Publication Date Title
CN111340867B (en) Depth estimation method and device for image frame, electronic equipment and storage medium
Zhu et al. Unsupervised event-based learning of optical flow, depth, and egomotion
US10553026B2 (en) Dense visual SLAM with probabilistic surfel map
US20170278302A1 (en) Method and device for registering an image to a model
CN106934827A (en) The method for reconstructing and device of three-dimensional scenic
EP3293700B1 (en) 3d reconstruction for vehicle
CN113392584B (en) Visual navigation method based on deep reinforcement learning and direction estimation
CN112639878A (en) Unsupervised depth prediction neural network
CN111598927B (en) Positioning reconstruction method and device
Wang et al. Quadtree-accelerated real-time monocular dense mapping
Jeon et al. Struct-MDC: Mesh-refined unsupervised depth completion leveraging structural regularities from visual SLAM
Karaoglu et al. Dynamon: Motion-aware fast and robust camera localization for dynamic nerf
Degol et al. Feats: Synthetic feature tracks for structure from motion evaluation
Li et al. Unsupervised joint learning of depth, optical flow, ego-motion from video
CN117274446A (en) Scene video processing method, device, equipment and storage medium
Li et al. Dvonet: unsupervised monocular depth estimation and visual odometry
CN115638788A (en) Semantic vector map construction method, computer equipment and storage medium
WO2022087932A1 (en) Non-rigid 3d object modeling using scene flow estimation
Thakur et al. A conditional adversarial network for scene flow estimation
Wang et al. Motion Degeneracy in Self-supervised Learning of Elevation Angle Estimation for 2D Forward-Looking Sonar
Kim et al. Complex-Motion NeRF: Joint Reconstruction and Pose Optimization With Motion and Depth Priors
CN118037965B (en) Human body 3D gesture analysis method based on automatic variation correction under multi-eye vision
Wang et al. Self-supervised learning of depth and camera motion from 360° videos
Nadar et al. Sensor simulation for monocular depth estimation using deep neural networks
CN116465827B (en) Viewpoint path planning method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant