CN110782490B - Video depth map estimation method and device with space-time consistency - Google Patents


Info

Publication number
CN110782490B
CN110782490B
Authority
CN
China
Prior art keywords
frame
depth map
estimation
training
image
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910907522.2A
Other languages
Chinese (zh)
Other versions
CN110782490A (en)
Inventor
肖春霞
胡煜
罗飞
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN201910907522.2A
Publication of CN110782490A
Application granted
Publication of CN110782490B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video depth map estimation method and device with space-time consistency. The method comprises: generating a training set containing a plurality of sequences in which the central frame serves as the target view and the frames before and after it serve as source views; for static objects in the scene, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences, including a depth map estimation network structure, a camera pose estimation network structure and the loss function of this part; for moving objects in the scene, cascading an optical flow network after the created framework to model motion in the scene, including building the optical flow estimation network structure and the loss function of this part; for the space-time consistency check of the depth map, proposing a loss function for the deep neural network; continuously optimizing the model by jointly training monocular depth and camera pose estimation and then training the optical flow network; and using the optimized model to estimate depth maps for consecutive video frames.

Description

Video depth map estimation method and device with space-time consistency
Technical Field
The invention belongs to the field of geometric understanding of video scenes and relates to techniques for estimating depth maps of video frames, in particular to a technical scheme for estimating depth maps of consecutive video frames with space-time consistency.
Background
Understanding 3D scene geometry in video is a fundamental problem of visual perception and includes many basic computer vision tasks such as depth estimation, camera pose estimation and optical flow estimation. A depth map is an image containing information about the distance from object surfaces in a scene to the viewpoint. Estimating depth is an important component of understanding the geometric relationships within a scene, and a general method for extracting depth maps from images is therefore highly necessary. Distance relationships help provide richer representations of objects and environments and further enable functions such as 3D modeling, object recognition and robotics. In computer vision systems, distance information supports various practical applications such as image segmentation, object detection, object tracking and three-dimensional reconstruction.
Existing depth map estimation methods mainly include manual scanning with physical equipment, traditional mathematical methods, supervised deep learning methods and unsupervised deep learning methods. Each of these has drawbacks. Equipment-based scanning relies on manual acquisition with physical devices, but existing three-dimensional scanners (such as the Kinect) are not only expensive but also unsuitable for general application scenarios. Traditional mathematical methods have too low depth estimation accuracy and generally cannot handle complex scenes effectively. Supervised deep learning methods rely on deep learning, a network architecture and a mathematical model to obtain results; they are strongly dependent on data sets, whose acquisition usually consumes a large amount of manpower and material resources, and their generalization is usually poor. Unsupervised deep learning methods and existing video depth estimation methods usually ignore the spatial and temporal discontinuity of the depth map, and often produce large errors in occluded regions or non-Lambertian surface regions.
Disclosure of Invention
To overcome the defects of existing methods, the invention provides a technical scheme for depth estimation of consecutive video frames with space-time consistency, so that the estimated depth maps recover clearer details in certain regions while the temporal continuity between different video frames is enhanced, making the final result more accurate.
The technical scheme of the invention provides a video depth map estimation method with space-time consistency, which comprises the following steps,
step 1, generating a training set, including fixing the length of each image sequence to 3 frames, taking the central frame as the target view and the two frames before and after it as source views, and generating a plurality of sequences;
step 2, for static objects in the scene, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences, including building a depth map estimation network structure, a camera pose estimation network structure and the loss function of this part;
step 3, for moving objects in the scene, cascading an optical flow network after the framework created in step 2 to model motion in the scene, including building the optical flow estimation network structure and the loss function of this part;
step 4, for the space-time consistency check of the depth map, proposing a loss function for the deep neural network;
step 5, optimizing the model, including jointly training monocular depth and camera pose estimation and then training the remaining optical flow network on this basis; and using the optimized model to estimate depth maps for consecutive video frames.
Moreover, in step 2, a depth map estimation network and an optical flow estimation network, each composed of an encoder and a decoder, are adopted, and multi-scale depth prediction is performed using cross-layer connections.
Moreover, in step 2, unlabeled video is used for unsupervised training: the geometric characteristics of the moving three-dimensional scene are combined for training and merged into an image synthesis loss, and image similarity is used as supervision to perform unsupervised learning separately on the static scenes and the dynamic scenes in the images.
Moreover, in step 4, a spatial consistency loss is proposed, constraining the difference between the flow values from the frame-t image to the frame-(t+1) image and from the frame-(t+1) image back to the frame-t image; a temporal consistency loss is proposed, constraining the difference between the flow values from frame t-1 to frame t plus those from frame t to frame t+1 and the flow values directly from frame t-1 to frame t+1.
The invention also provides a corresponding device for realizing the video depth map estimation method with space-time consistency.
The invention has the following advantages: 1. It yields a video depth estimation technical scheme with better generalization. 2. It proposes a space-time consistency check and a new loss function, which increases the correlation between the depth maps of different video frames and alleviates the problem of excessive frame-to-frame error in the depth map results of consecutive video frames. 3. The depth map estimation results in low-texture, three-dimensionally blurred, occluded and similar regions of the scene are improved, so that the accuracy of the overall depth map estimation result is improved.
Drawings
Fig. 1 is an overall flowchart framework diagram of a video depth map estimation method with spatio-temporal consistency according to an embodiment of the present invention.
FIG. 2 is an overall framework diagram for jointly training monocular depth and camera pose estimates from unlabeled video sequences in accordance with an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.
The invention provides a method for estimating video depth maps that combines depth estimation, optical flow estimation and camera pose estimation for training through the geometric characteristics of a moving three-dimensional scene, merges them into an image synthesis loss, uses image similarity as supervision to perform unsupervised learning separately on the static and dynamic parts of the images, and at the same time proposes a new loss function to improve results for the spatial and temporal depth discontinuities that frequently occur in video depth map estimation. Referring to Fig. 1, the video depth map estimation method with space-time consistency according to an embodiment of the present invention includes the following steps:
step 1, a training set is made according to a public data set commonly used in the field of video depth estimation.
The implementation of step 1 in the embodiment is described as follows:
The KITTI data set, commonly used in the field of video depth estimation, is adopted. KITTI is a computer vision image data set for autonomous driving scenes, covering urban, rural, road and other scenes; its images contain up to more than ten vehicles and thirty pedestrians as well as various conditions such as occlusion and motion, and therefore provide rich image information. The specific processing is to fix the length of each image sequence to 3 frames, take the central frame as the target view, and take the ±1 frames (i.e., the frames immediately before and after) as the source views. Using the images in the KITTI data set, a total of 12000 sequences were obtained, of which 10800 were used for training and 1200 for validation.
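As an illustration of this preprocessing, the following minimal Python sketch groups consecutive frames into 3-frame samples with the centre frame as the target view and the ±1 frames as source views; the KITTI directory path is hypothetical, and the 90/10 split merely mirrors the 10800/1200 figures above, so this is not the patent's actual code.

```python
from pathlib import Path

def make_sequences(frame_paths, seq_len=3):
    """Group consecutive frames into (target, sources) samples."""
    half = seq_len // 2
    samples = []
    for i in range(half, len(frame_paths) - half):
        target = frame_paths[i]                                  # centre frame I_t
        sources = frame_paths[i - half:i] + frame_paths[i + 1:i + half + 1]
        samples.append({"target": target, "sources": sources})   # source views I_s
    return samples

frames = sorted(str(p) for p in
                Path("kitti/2011_09_26_drive_0001/image_02/data").glob("*.png"))
samples = make_sequences(frames)
n_train = int(0.9 * len(samples))            # roughly the 10800 / 1200 split
train_set, val_set = samples[:n_train], samples[n_train:]
```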
Step 2, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences.
A framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences is constructed for static objects in the scene. The key supervisory signal for the depth and camera pose prediction convolutional neural networks in this step comes from the task of view synthesis: given an input view of a scene, new images of the scene are synthesized as seen from different camera poses.
The invention preferably adopts a depth map estimation network and an optical flow estimation network, each composed of an encoder and a decoder, and uses cross-layer connections for multi-scale depth prediction, thereby improving both the running efficiency and the accuracy of the result.
The invention proposes to use unlabeled video for unsupervised training: the geometric characteristics of the moving three-dimensional scene are combined for training and merged into an image synthesis loss, and image similarity is used as supervision to perform unsupervised learning separately on the static scenes and the dynamic scenes in the images. A large amount of manpower and material resources are saved, giving the invention greater universality.
Referring to Fig. 2, the implementation of step 2 in the embodiment is described as follows:
(1) Constructing the depth map estimation network structure.
Since the depth map estimation network needs to train and compute geometric relationships at the pixel level, the depth map network mainly consists of two parts, an encoder and a decoder; the specific network structures of the encoder and the decoder are shown in Table 1 and Table 2. The encoder part uses convolutional layers as an efficient means of learning. The decoder consists of deconvolution layers, which upsample the spatial features back to the full scale of the input. In order to retain both global high-level features and local detail information, cross-layer connections are used between the encoder and the decoder, and multi-scale depth prediction is performed.
TABLE 1 encoder network architecture
TABLE 2 decoder network architecture
In the tables, Conv1, Conv1b, Conv2, Conv2b, ..., Conv7, Conv7b are convolutional layers; Disp1, Disp2, ..., Disp4 are connected across layers; Iconv1, ..., Iconv7 and upConv1, ..., upConv7 are deconvolution layers; k is the kernel size, s is the stride, and chns is the number of input and output channels of each layer; in and out are the scaling factors of each layer relative to the input image (i.e., in is the downscaling ratio of the input and out is the size ratio of the output to the original); input lists the input of each layer, where + denotes concatenation and ↑ denotes 2× upsampling of that layer.
The network structure is divided into 6 scales. The largest scale is that of the original image, and each subsequent scale is half the size of the previous one, so the feature map at the smallest scale has only one sixty-fourth of the original resolution but as many as 512 channels. Down-sampling is performed with max pooling in the encoder part, and up-sampling is performed with deconvolution layers in the decoder part. At each scale, the output of the encoder part is passed through a cross-layer connection to the decoder of the corresponding scale; after the encoder feature map is concatenated with that of the corresponding decoder scale, the combined feature map is fed as input into the corresponding deconvolution layer.
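A condensed PyTorch sketch of such an encoder-decoder depth network is given below for illustration: a convolutional encoder with max-pool down-sampling, a deconvolution decoder, cross-layer concatenation, and multi-scale disparity outputs. The channel widths and the disparity-to-depth mapping are assumptions for illustration and do not reproduce Tables 1 and 2, which are provided as images in the original publication.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class DepthNet(nn.Module):
    def __init__(self, enc_chs=(32, 64, 128, 256, 512)):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                      # encoder down-sampling
        self.enc = nn.ModuleList()
        c_prev = 3
        for c in enc_chs:
            self.enc.append(conv_block(c_prev, c))
            c_prev = c
        self.upconv, self.iconv, self.disp = (nn.ModuleList() for _ in range(3))
        for c_skip, c in zip(reversed(enc_chs[:-1]), reversed(enc_chs[1:])):
            self.upconv.append(nn.ConvTranspose2d(c, c_skip, 4, stride=2, padding=1))
            self.iconv.append(conv_block(2 * c_skip, c_skip))   # fuse skip + upsampled
            self.disp.append(nn.Conv2d(c_skip, 1, 3, padding=1))

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = self.pool(x)
        disps = []
        for up, fuse, head, skip in zip(self.upconv, self.iconv, self.disp,
                                        reversed(skips)):
            x = fuse(torch.cat([up(x), skip], dim=1))    # cross-layer connection
            disps.append(torch.sigmoid(head(x)))         # one disparity map per scale
        return disps                                     # coarse-to-fine predictions
```

A per-pixel depth map can then be obtained from a predicted disparity map, for example as depth = 1 / (a·disp + b) with suitable scaling constants.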
(2) Building the camera pose estimation network structure.
The camera pose estimation network regresses the camera pose (the Euler angles of the camera rotation and the translation vector). Its main structure is similar to the encoder of the network in (1): behind 8 convolutional layers, a global average pooling layer (POOL) and finally a Softmax prediction layer are connected; the specific network structure is shown in Table 3. Except for the last prediction layer, batch normalization and ReLU activation functions are used for all layers.
TABLE 3
Here, the 8 convolutional layers are designated Conv1, Conv1b, Conv2, Conv2b, Conv3, Conv3b, Conv4 and Conv4b, respectively, and Fc1 and Fc2 are fully connected layers.
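A minimal PyTorch sketch of such a pose regression network follows. The layer widths and strides are illustrative (Table 3 is provided as an image in the original publication), and a plain linear regression head is used here for the final prediction layer; the network maps the stacked target and source frames to one 6-DoF relative pose (3 Euler angles and 3 translation components) per source view.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self, n_source_views=2):
        super().__init__()
        chs = (16, 32, 64, 128, 256, 256, 256, 256)            # 8 convolutional layers
        layers, c_prev = [], 3 * (1 + n_source_views)           # target + sources stacked
        for i, c in enumerate(chs):
            layers += [nn.Conv2d(c_prev, c, 3, stride=2 if i % 2 == 0 else 1, padding=1),
                       nn.BatchNorm2d(c), nn.ReLU(inplace=True)]
            c_prev = c
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)                     # global average pooling
        self.predict = nn.Linear(chs[-1], 6 * n_source_views)   # no activation here

    def forward(self, frames):              # frames: (B, 3*(1+S), H, W)
        n_src = frames.shape[1] // 3 - 1
        x = self.pool(self.features(frames)).flatten(1)
        pose = 0.01 * self.predict(x)       # keep initial pose predictions small
        return pose.view(-1, n_src, 6)      # (B, S, [rx, ry, rz, tx, ty, tz])
```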
(3) Constructing the loss function of this part.
The depth network only takes the target view $I_t$ as input and outputs a per-pixel depth map $\hat{D}_t$. The camera pose network takes the target view $I_t$ and an adjacent source view (e.g. $I_{t+1}$) as input and outputs the relative camera pose $\hat{T}_{t \to t+1}$. The outputs of the two networks are then used to inverse-warp the source view to reconstruct the target view, and the photometric error is used to train the convolutional neural networks. By using view synthesis as supervision, this framework can be trained from unlabeled video in an unsupervised manner.
The invention uses $\langle I_1, \ldots, I_n \rangle$ to denote the training image sequence, where n is the number of image frames, i.e. there are n pictures $I_1, \ldots, I_n$ in total. Here n is the number of pictures in the whole data set, but each computation is performed on three consecutive frames. In a specific implementation, more than three frames can be processed at a time, but each additional frame increases the amount of computation.
One frame $I_t$ is selected as the target view, and the rest are source views $I_s$ ($1 \le s \le n$, $s \ne t$). The supervisory signal can be expressed as

$L_{rs} = \sum_{s} \sum_{p} \left| I_t(p) - \hat{I}_s^{rig}(p) \right|$

where p indexes pixel coordinates and $\hat{I}_s^{rig}$ denotes the synthesized view of the predicted target frame obtained from the source frame $I_s$ through the rigid flow; the superscript rig indicates that this part considers only static rigid objects. Therefore, the supervisory signal at this stage comes from minimizing the difference between the synthesized view $\hat{I}_s^{rig}$ and the original frame $I_t$. $I_t(p)$ is the value of point p in picture $I_t$, $\hat{I}_s^{rig}(p)$ is the value at the position of point p computed through the rigid flow, and $L_{rs}$ is their difference; the invention requires $L_{rs}$ to be as small as possible during training.
A key component of this framework is a differentiable depth-image-based renderer that reconstructs the target view by sampling pixels from the source view, based on the predicted depth map $\hat{D}_t$ and the relative pose $\hat{T}_{t \to s}$. Let $p_t$ denote the homogeneous coordinates of a pixel in the target view and K denote the camera intrinsic matrix; $p_t$ can be projected onto the source view as $p_s$ by the formula

$p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t$

Note that the projected coordinates $p_s$ are continuous values. Here $\hat{D}_t(p_t)$ is the predicted depth of point p in frame t, $I_s(p_s)$ is the value at position $p_s$ in frame s, and $\hat{I}_s(p_t)$ is the value to be filled at position $p_t$ of the synthesized view. To obtain $I_s(p_s)$, a differentiable bilinear sampling mechanism is used that linearly interpolates the values of the 4 pixels neighbouring $p_s$ (top-left, top-right, bottom-left and bottom-right) to approximate $I_s(p_s)$, i.e.

$\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\},\, j \in \{l,r\}} w^{ij} I_s(p_s^{ij})$

where $w^{ij}$ is linearly proportional to the spatial proximity between $p_s$ and $p_s^{ij}$, and $\sum_{i,j} w^{ij} = 1$. Here t, b, l and r denote top, bottom, left and right respectively, and $w^{ij}$ is the weight of each neighbouring point. The pixel warping coordinates obtained in this way can be decomposed into depth and camera pose through projective geometry.
The differentiable bilinear sampling mechanism is prior art, namely bilinear differentiable interpolation, and its details are omitted in the present invention.
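The following PyTorch sketch illustrates this inverse warping as supervision: target pixels are back-projected with the predicted depth, transformed by the relative pose, projected into the source view with the intrinsics K, and the source image is sampled with differentiable bilinear interpolation. It is a simplified illustration under the formulas above, not the patent's implementation; tensors are assumed to live on the same device.

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_src, depth_t, T_t2s, K):
    """img_src: (B,3,H,W), depth_t: (B,1,H,W), T_t2s: (B,4,4), K: (B,3,3)."""
    B, _, H, W = depth_t.shape
    # Homogeneous pixel grid p_t of the target view
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    p_t = torch.stack([u, v, torch.ones_like(u)]).view(1, 3, -1).expand(B, -1, -1)
    # Back-project: X = D_t(p_t) * K^-1 p_t, then move the points into the source frame
    cam = depth_t.view(B, 1, -1) * (torch.inverse(K) @ p_t)            # (B,3,HW)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)           # (B,4,HW)
    p_s = K @ (T_t2s @ cam_h)[:, :3]                                   # project with K
    p_s = p_s[:, :2] / p_s[:, 2:].clamp(min=1e-6)                      # pixel coordinates
    # Normalize to [-1, 1] and sample with differentiable bilinear interpolation
    grid = torch.stack([2 * p_s[:, 0] / (W - 1) - 1,
                        2 * p_s[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_src, grid, align_corners=True)            # synthesized view

# rigid warp (photometric) loss:  L_rs = (I_t - inverse_warp(I_s, D_t, T, K)).abs().mean()
```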
Step 3, after the framework created in step 2, cascading an optical flow network to model the motion in the scene.
Step 2 ignores moving objects in the scene; after the optical flow estimation network is added, the depth map of moving objects can be effectively compensated and corrected, which improves the accuracy of the result.
The specific implementation process is described as follows:
(1) Constructing the optical flow estimation network structure. The remaining non-rigid flow, whose displacements are caused only by the motion of objects relative to the world scene, is learned with the optical flow network. The framework of the optical flow estimation network is similar to the depth map estimation network in step 2 (the same network structure as in Tables 1 and 2 can be used, because both ultimately produce an output at the same resolution as the input picture) and likewise consists of two parts, an encoder and a decoder. The optical flow network is connected in a cascaded manner after the network of the first stage. For a given pair of image frames, the optical flow network uses the output of the network in step 2, the rigid flow $f^{rig}_{t \to s}$, to predict a corresponding residual flow $f^{res}_{t \to s}$; the final full prediction flow $f^{full}_{t \to s}$ is

$f^{full}_{t \to s} = f^{rig}_{t \to s} + f^{res}_{t \to s}$

The input of the optical flow network is composed of several images concatenated along the channel dimension, including the source and target frame pair $I_s$ and $I_t$, the rigid flow $f^{rig}_{t \to s}$ output by the network in step 2, the synthesized view $\hat{I}_s^{rig}$, and the error between $\hat{I}_s^{rig}$ and the original image.
(2) Constructing the loss function of this part. The supervision in step 2 is extended to this stage with slight modifications (introducing the influence of the optical flow component on the scene flow). Step 2 mainly handles the static scene and ignores the processing of moving objects. In order to improve the robustness of the learning process to these factors, a scheme is proposed for this problem that incorporates an optical flow network to train the residual flow (the optical flow part) beyond the rigid flow. Specifically, with the full prediction flow $f^{full}_{t \to s}$, image warping is performed between any pair of target and source frames using $\hat{I}_s^{full}$ instead of the former $\hat{I}_s^{rig}$, which yields the warp loss $L_{fs}$ of the full flow. The concrete formula is

$L_{fs} = \sum_{s} \sum_{p} \left| I_t(p) - \hat{I}_s^{full}(p) \right|$

where $\hat{I}_s^{full}(p)$ is the value at the position of point p computed through the full flow.
Step 4, proposing a loss function for the deep neural network.
For the space-time consistency check of the depth map, a loss function of the deep neural network is proposed, which prevents excessive frame-to-frame error in the results of consecutive video frames and, at the same time, improves to a certain extent the estimation results in regions of the scene such as low texture, three-dimensional blur and occlusion.
The specific implementation process is described as follows:
the spatial consistency loss provided by the invention is realized by restricting the difference of flow values from a t frame image to a t +1 frame image and from a t +1 frame image to a t frame image, and the temporal consistency loss is realized by adding the difference restriction of the flow values from the t frame to the t +1 frame image and the flow values from the t-1 frame to the t +1 frame directly, wherein the specific formula is shown as the formula:
Figure BDA0002213717400000081
Figure BDA0002213717400000082
wherein,
Figure BDA0002213717400000083
for the position of p-point in t-frame calculated by the global flow of p-point from s-frame to t-frame, Is(p) is the position of the p point in the s frame image,
Figure BDA0002213717400000084
for the entire stream of t-1 frames to t frames,
Figure BDA0002213717400000085
for the entire stream of t frames to t +1 frames,
Figure BDA0002213717400000086
the overall stream from t-1 frame to t +1 frame. L is a radical of an alcoholftFor differences between the stream values from t-frame to s-frame and the stream values from s-frame to t-frame, LfpIs the difference between the stream value from the t-1 frame to the t frame plus the stream value from the t frame to the t +1 frame and the stream value from the t-1 frame directly to the t +1 frame. Ideally these two values should be as small as possible so they are used as a loss function to train the network.
Pixels whose flows are severely contradictory (i.e., whose computed error is too large) are considered possible outliers. Since these regions violate the assumptions of image consistency and geometric consistency, they can only be handled here through smoothness. Therefore, the full-flow warp loss $L_{fs}$ and the space-time consistency losses $L_{ft}$ and $L_{fp}$ are weighted per pixel.
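For illustration, the sketch below computes the two consistency terms with the `flow_warp` helper from the previous sketch; the per-pixel weighting and outlier handling described above are not reproduced, and the simple mean reduction is an assumption.

```python
import torch

def consistency_losses(f_t2s, f_s2t, f_tm1_t, f_t_tp1, f_tm1_tp1):
    """All arguments are full-flow fields of shape (B, 2, H, W)."""
    # Spatial consistency L_ft: the flow t->s and the backward flow s->t,
    # looked up at the warped positions, should cancel out.
    f_s2t_at_p = flow_warp(f_s2t, f_t2s)
    L_ft = (f_t2s + f_s2t_at_p).abs().mean()
    # Temporal consistency L_fp: the flow (t-1 -> t) chained with the flow
    # (t -> t+1) should equal the direct flow (t-1 -> t+1).
    f_t_tp1_at_p = flow_warp(f_t_tp1, f_tm1_t)
    L_fp = (f_tm1_t + f_t_tp1_at_p - f_tm1_tp1).abs().mean()
    return L_ft, L_fp
```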
Step 5, setting the training parameters of the network and continuously optimizing the model according to the error of each epoch. During training, the defined loss functions should keep decreasing over the iterations, so that the model becomes more and more accurate. The optimized model can then be used to estimate depth maps for consecutive video frames.
In a specific implementation, monocular depth and camera pose estimation can be trained jointly first, and then the remaining optical flow network is trained on this basis. Finally, the trained network models for depth map estimation, camera pose estimation and optical flow estimation are obtained.
The specific implementation process is described as follows:
the invention mainly comprises three sub-networks, namely a depth map estimation network and a camera pose estimation network, which form the reconstruction of a static object together, and the optical flow estimation network structure is combined with the output of the previous stage to realize the positioning of a moving object. Although the networks can be trained together in an end-to-end fashion, there is no guarantee that local gradient optimization will bring the network to an optimal point. Therefore, a segmented training strategy is employed while reducing memory and computation consumption. Firstly, training a depth map estimation network and a camera pose estimation network, determining weights, and then training an optical flow estimation network. The resolution of the trained input images are all resize to 128 x 416, while random upscaling, cropping, recoloring, etc. methods are also employed to prevent overfitting. The network optimization function adopts a common neural network optimization method Adam. The initial learning rate was set to 0.0002 and the mini-batch size (minimum batch size) was set to 4. The first and second stages of the training process converge with 30 and 200 epochs (iterations), respectively. Testing on the KITTI data set it should be understood that parts not elaborated on in this specification are prior art.
In the above process, the main characteristics are: the temporal consistency check of the depth map is proposed, the loss function of the deep neural network is improved, a consistency check specifically for video depth maps is constructed within a deep learning model, and the overall loss function is improved to prevent excessive frame-to-frame error in the results of consecutive video frames. At the same time, the estimation results in regions of the scene such as low texture, three-dimensional blur and occlusion are improved to a certain extent.
In a specific implementation, the automatic operation of the above process can be realized by means of software. An apparatus for running the process should also fall within the protection scope of the present invention.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. A video depth map estimation method with space-time consistency, characterized by comprising the following steps:
step 1, generating a training set, including fixing the length of each image sequence to 3 frames, taking the central frame as the target view and the two frames before and after it as source views, and generating a plurality of sequences;
step 2, for static objects in a scene, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences, including building a depth map estimation network structure, a camera pose estimation network structure and the loss function of this part;
step 3, for moving objects in the scene, cascading an optical flow network after the framework established in step 2 to model the motion in the scene, including building the optical flow estimation network structure and the loss function of this part;
step 4, for the space-time consistency check of the depth map, proposing a loss function for the deep neural network, implemented as follows:
a spatial consistency loss is proposed, constraining the difference between the flow values from the frame-t image to the frame-(t+1) image and from the frame-(t+1) image back to the frame-t image; a temporal consistency loss is proposed, constraining the difference between the flow values from frame t-1 to frame t plus those from frame t to frame t+1 and the flow values directly from frame t-1 to frame t+1;
as shown in the following formulas:

$L_{ft} = \sum_{p} \left| f^{full}_{t \to s}(p) + f^{full}_{s \to t}(\hat{p}_s) \right|$

$L_{fp} = \sum_{p} \left| f^{full}_{t-1 \to t}(p) + f^{full}_{t \to t+1}(\hat{p}_t) - f^{full}_{t-1 \to t+1}(p) \right|$

where $\hat{p}_s = p + f^{full}_{t \to s}(p)$ is the position in frame s to which point p of frame t is carried by the full flow, $\hat{p}_t$ is likewise the position in frame t reached from point p of frame t-1 through the full flow, $f^{full}_{t-1 \to t}$ is the full flow from frame t-1 to frame t, $f^{full}_{t \to t+1}$ is the full flow from frame t to frame t+1, and $f^{full}_{t-1 \to t+1}$ is the full flow from frame t-1 to frame t+1; $L_{ft}$ is the difference between the flow values from frame t to frame s and the flow values from frame s back to frame t, and $L_{fp}$ is the difference between the flow value from frame t-1 to frame t plus the flow value from frame t to frame t+1 and the flow value directly from frame t-1 to frame t+1;
step 5, optimizing the model, including performing joint training of monocular depth and camera pose estimation and then training the remaining optical flow network on this basis; and using the optimized model to estimate depth maps for consecutive video frames.
2. The video depth map estimation method with space-time consistency according to claim 1, characterized in that: in step 2, a depth map estimation network and an optical flow estimation network, each composed of an encoder and a decoder, are adopted, and cross-layer connections are adopted for multi-scale depth prediction.
3. The video depth map estimation method with space-time consistency according to claim 1 or 2, characterized in that: in step 2, unsupervised training is performed using unlabeled video, including combining the geometric characteristics of the moving three-dimensional scene for training, merging them into an image synthesis loss, and using image similarity as supervision to perform unsupervised learning training separately on the static scenes and the dynamic scenes in the images.
4. An apparatus for video depth map estimation with space-time consistency, characterized in that: it is used for implementing the video depth map estimation method with space-time consistency according to any one of claims 1 to 3.
CN201910907522.2A 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency Expired - Fee Related CN110782490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910907522.2A CN110782490B (en) 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency


Publications (2)

Publication Number Publication Date
CN110782490A CN110782490A (en) 2020-02-11
CN110782490B (en) 2022-07-05

Family

ID=69383733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910907522.2A Expired - Fee Related CN110782490B (en) 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency

Country Status (1)

Country Link
CN (1) CN110782490B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402310B (en) * 2020-02-29 2023-03-28 同济大学 Monocular image depth estimation method and system based on depth estimation network
CN111311664B (en) * 2020-03-03 2023-04-21 上海交通大学 Combined unsupervised estimation method and system for depth, pose and scene flow
CN111583305B (en) * 2020-05-11 2022-06-21 北京市商汤科技开发有限公司 Neural network training and motion trajectory determination method, device, equipment and medium
CN111709982B (en) * 2020-05-22 2022-08-26 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN112085717B (en) * 2020-09-04 2024-03-19 厦门大学 Video prediction method and system for laparoscopic surgery
CN112270691B (en) * 2020-10-15 2023-04-21 电子科技大学 Monocular video structure and motion prediction method based on dynamic filter network
CN112344922B (en) * 2020-10-26 2022-10-21 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN113160294B (en) * 2021-03-31 2022-12-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN113222895B (en) * 2021-04-10 2023-05-02 优层智能科技(上海)有限公司 Electrode defect detection method and system based on artificial intelligence
CN112801074B (en) * 2021-04-15 2021-07-16 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera
CN113284173B (en) * 2021-04-20 2023-12-19 中国矿业大学 End-to-end scene flow and pose joint learning method based on false laser radar
CN114332380B (en) * 2022-01-04 2024-08-09 吉林大学 Light field video synthesis method based on monocular RGB camera and three-dimensional body representation
CN114359363B (en) * 2022-01-11 2024-06-18 浙江大学 Video consistency depth estimation method and device based on depth learning
CN114663347B (en) * 2022-02-07 2022-09-27 中国科学院自动化研究所 Unsupervised object instance detection method and unsupervised object instance detection device
CN115131404B (en) * 2022-07-01 2024-06-14 上海人工智能创新中心 Monocular 3D detection method based on motion estimation depth
CN114937125B (en) * 2022-07-25 2022-10-25 深圳大学 Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium
CN115187638B (en) * 2022-09-07 2022-12-27 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
CN117115786B (en) * 2023-10-23 2024-01-26 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9123115B2 (en) * 2010-11-23 2015-09-01 Qualcomm Incorporated Depth estimation based on global motion and optical flow

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103002309A (en) * 2012-09-25 2013-03-27 浙江大学 Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera
CN105100771A (en) * 2015-07-14 2015-11-25 山东大学 Single-viewpoint video depth obtaining method based on scene classification and geometric dimension
CN106599805A (en) * 2016-12-01 2017-04-26 华中科技大学 Supervised data driving-based monocular video depth estimating method
CN106612427A (en) * 2016-12-29 2017-05-03 浙江工商大学 Method for generating spatial-temporal consistency depth map sequence based on convolution neural network
CN107481279A (en) * 2017-05-18 2017-12-15 华中科技大学 A kind of monocular video depth map computational methods
CN107274445A (en) * 2017-05-19 2017-10-20 华中科技大学 A kind of image depth estimation method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Antonio W. Vieira et al. STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2012. *
Tak-Wai Hui et al. Dense depth map generation using sparse depth data from normal flow. 2014 IEEE International Conference on Image Processing (ICIP), 2015. *
Jiang Hanqing et al. Spatio-temporally consistent depth recovery of dynamic scenes captured by multiple hand-held cameras. Journal of Computer-Aided Design & Computer Graphics, 2013, Vol. 25, No. 2. *
Ge Liyue et al. 3D scene flow estimation with hierarchical segmentation optimized by depth images. Journal of Nanchang Hangkong University (Natural Sciences), 2018, Vol. 32, No. 2. *

Also Published As

Publication number Publication date
CN110782490A (en) 2020-02-11

Similar Documents

Publication Publication Date Title
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN111739078B (en) Monocular unsupervised depth estimation method based on context attention mechanism
US11210803B2 (en) Method for 3D scene dense reconstruction based on monocular visual slam
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN111354030B (en) Method for generating unsupervised monocular image depth map embedded into SENet unit
CN108491763B (en) Unsupervised training method and device for three-dimensional scene recognition network and storage medium
CN113077505B (en) Monocular depth estimation network optimization method based on contrast learning
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN110910437B (en) Depth prediction method for complex indoor scene
CN111325784A (en) Unsupervised pose and depth calculation method and system
CN115187638B (en) Unsupervised monocular depth estimation method based on optical flow mask
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN110942484B (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN113850900B (en) Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction
CN115294282A (en) Monocular depth estimation system and method for enhancing feature fusion in three-dimensional scene reconstruction
CN112288788A (en) Monocular image depth estimation method
CN117576179A (en) Mine image monocular depth estimation method with multi-scale detail characteristic enhancement
CN117593702B (en) Remote monitoring method, device, equipment and storage medium
CN117788544A (en) Image depth estimation method based on lightweight attention mechanism
CN115272450A (en) Target positioning method based on panoramic segmentation
CN115035171B (en) Self-supervision monocular depth estimation method based on self-attention guide feature fusion
Yang et al. 360Spred: Saliency Prediction for 360-Degree Videos Based on 3D Separable Graph Convolutional Networks
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
Zhao et al. MDSNet: self-supervised monocular depth estimation for video sequences using self-attention and threshold mask
CN117974721A (en) Vehicle motion estimation method and system based on monocular continuous frame images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220705