CN110782490B - Video depth map estimation method and device with space-time consistency - Google Patents


Info

Publication number
CN110782490B
CN110782490B
Authority
CN
China
Prior art keywords
frame
depth map
estimation
training
image
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910907522.2A
Other languages
Chinese (zh)
Other versions
CN110782490A (en)
Inventor
肖春霞
胡煜
罗飞
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN201910907522.2A
Publication of CN110782490A
Application granted
Publication of CN110782490B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video depth map estimation method and device with space-time consistency. The method comprises: generating a training set containing a plurality of sequences in which the central frame serves as the target view and the frames before and after it serve as source views; for static objects in the scene, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences, including a depth map estimation network structure, a camera pose estimation network structure and the loss function of this part; for moving objects in the scene, cascading an optical flow network after the created framework to model motion in the scene, including building the optical flow estimation network structure and the loss function of this part; for the space-time consistency check of the depth map, proposing a loss function for the deep neural network; continuously optimizing the model by jointly training monocular depth and camera pose estimation and then training the optical flow network; and using the optimized model to estimate depth maps for consecutive video frames.

Description

Video depth map estimation method and device with space-time consistency
Technical Field
The invention belongs to the field of geometric understanding of video scenes and relates to techniques for estimating depth maps of video frames, in particular to a technical scheme for estimating depth maps of consecutive video frames with space-time consistency.
Background
Understanding 3D scene geometry in video is a fundamental problem of visual perception and includes many basic computer vision tasks such as depth estimation, camera pose estimation and optical flow estimation. A depth map is an image containing information about the distance from object surfaces in a scene to the viewpoint. Estimating depth is an important component of understanding the geometric relationships within a scene, and a general method for extracting depth maps from images is therefore highly necessary. Distance relationships help provide richer representations of objects and environments and further enable functions such as 3D modeling, object recognition and robotics. In computer vision systems, distance information supports various practical applications such as image segmentation, object detection, object tracking and three-dimensional reconstruction.
Existing depth map estimation methods mainly include manual scanning with physical equipment, traditional mathematical methods, supervised deep learning methods and unsupervised deep learning methods. Each of these has drawbacks. Equipment-based scanning relies on manual acquisition with physical devices, but existing three-dimensional scanners (such as the Kinect) are not only expensive but also unsuitable for general application scenarios. Traditional mathematical methods have too low depth estimation accuracy and generally cannot handle complex scenes effectively. Supervised deep learning methods rely on deep learning, a network architecture and a mathematical model to obtain results; they are strongly dependent on data sets, whose acquisition usually consumes a large amount of manpower and material resources, and their generalization is usually poor. Unsupervised deep learning methods and existing video depth estimation methods usually ignore the spatial and temporal discontinuity of the depth map, and often produce large errors in occluded regions or non-Lambertian surface regions.
Disclosure of Invention
To overcome the defects of existing methods, the invention provides a technical scheme for depth estimation of consecutive video frames with space-time consistency, so that the estimated depth maps recover clearer details in certain regions while the temporal continuity between different video frames is enhanced, making the final result more accurate.
The technical scheme of the invention provides a video depth map estimation method with space-time consistency, which comprises the following steps,
step 1, generating a training set, including fixing the length of each image sequence to 3 frames, taking the central frame as the target view and the two frames before and after it as source views, and generating a plurality of sequences;
step 2, for static objects in the scene, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences, including building a depth map estimation network structure, a camera pose estimation network structure and the loss function of this part;
step 3, for moving objects in the scene, cascading an optical flow network after the framework created in step 2 to model motion in the scene, including building the optical flow estimation network structure and the loss function of this part;
step 4, for the space-time consistency check of the depth map, proposing a loss function for the deep neural network;
step 5, optimizing the model, including jointly training monocular depth and camera pose estimation and then training the remaining optical flow network on this basis; and using the optimized model to estimate depth maps for consecutive video frames.
Moreover, in step 2, a depth map estimation network and an optical flow estimation network, each composed of an encoder and a decoder, are adopted, and multi-scale depth prediction is performed using cross-layer connections.
Moreover, in step 2, unlabeled video is used for unsupervised training: the geometric characteristics of the moving three-dimensional scene are combined for training and merged into an image synthesis loss, and image similarity is used as supervision to perform unsupervised learning separately on the static scenes and the dynamic scenes in the images.
Moreover, in step 4, a spatial consistency loss is proposed, constraining the difference between the flow values from the frame-t image to the frame-(t+1) image and from the frame-(t+1) image back to the frame-t image; a temporal consistency loss is proposed, constraining the difference between the flow values from frame t-1 to frame t plus those from frame t to frame t+1 and the flow values directly from frame t-1 to frame t+1.
The invention also provides a corresponding device for realizing the video depth map estimation method with space-time consistency.
The invention has the following advantages: 1. It yields a video depth estimation technical scheme with better generalization. 2. It proposes a space-time consistency check and a new loss function, which increases the correlation between the depth maps of different video frames and alleviates the problem of excessive frame-to-frame error in the depth map results of consecutive video frames. 3. The depth map estimation results in low-texture, three-dimensionally blurred, occluded and similar regions of the scene are improved, so that the accuracy of the overall depth map estimation result is improved.
Drawings
Fig. 1 is an overall flowchart framework diagram of a video depth map estimation method with spatio-temporal consistency according to an embodiment of the present invention.
FIG. 2 is an overall framework diagram for jointly training monocular depth and camera pose estimates from unlabeled video sequences in accordance with an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.
The invention provides a method for estimating video depth maps that combines depth estimation, optical flow estimation and camera pose estimation for training through the geometric characteristics of a moving three-dimensional scene, merges them into an image synthesis loss, uses image similarity as supervision to perform unsupervised learning separately on the static and dynamic parts of the images, and at the same time proposes a new loss function to improve results for the spatial and temporal depth discontinuities that frequently occur in video depth map estimation. Referring to Fig. 1, the video depth map estimation method with space-time consistency according to an embodiment of the present invention includes the following steps:
step 1, a training set is made according to a public data set commonly used in the field of video depth estimation.
The implementation of step 1 in the embodiment is described as follows:
The KITTI data set, commonly used in the field of video depth estimation, is adopted. KITTI is a computer vision image data set for autonomous driving scenes, covering urban, rural, road and other scenes; its images contain up to more than ten vehicles and thirty pedestrians as well as various conditions such as occlusion and motion, and therefore provide rich image information. The specific processing is to fix the length of each image sequence to 3 frames, take the central frame as the target view, and take the ±1 frames (i.e., the frames immediately before and after) as the source views. Using the images in the KITTI data set, a total of 12000 sequences were obtained, of which 10800 were used for training and 1200 for validation.
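As an illustration of this preprocessing, the following minimal Python sketch groups consecutive frames into 3-frame samples with the centre frame as the target view and the ±1 frames as source views; the KITTI directory path is hypothetical, and the 90/10 split merely mirrors the 10800/1200 figures above, so this is not the patent's actual code.

```python
from pathlib import Path

def make_sequences(frame_paths, seq_len=3):
    """Group consecutive frames into (target, sources) samples."""
    half = seq_len // 2
    samples = []
    for i in range(half, len(frame_paths) - half):
        target = frame_paths[i]                                  # centre frame I_t
        sources = frame_paths[i - half:i] + frame_paths[i + 1:i + half + 1]
        samples.append({"target": target, "sources": sources})   # source views I_s
    return samples

frames = sorted(str(p) for p in
                Path("kitti/2011_09_26_drive_0001/image_02/data").glob("*.png"))
samples = make_sequences(frames)
n_train = int(0.9 * len(samples))            # roughly the 10800 / 1200 split
train_set, val_set = samples[:n_train], samples[n_train:]
```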
Step 2, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences.
A framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences is constructed for static objects in the scene. The key supervisory signal for the depth and camera pose prediction convolutional neural networks in this step comes from the task of view synthesis: given an input view of a scene, new images of the scene are synthesized as seen from different camera poses.
The invention preferably adopts a depth map estimation network and an optical flow estimation network, each composed of an encoder and a decoder, and uses cross-layer connections for multi-scale depth prediction, thereby improving both the running efficiency and the accuracy of the result.
The invention proposes to use unlabeled video for unsupervised training: the geometric characteristics of the moving three-dimensional scene are combined for training and merged into an image synthesis loss, and image similarity is used as supervision to perform unsupervised learning separately on the static scenes and the dynamic scenes in the images. A large amount of manpower and material resources are saved, giving the invention greater universality.
Referring to Fig. 2, the implementation of step 2 in the embodiment is described as follows:
(1) Constructing the depth map estimation network structure.
Since the depth map estimation network needs to train and compute geometric relationships at the pixel level, the depth map network mainly consists of two parts, an encoder and a decoder; the specific network structures of the encoder and the decoder are shown in Table 1 and Table 2. The encoder part uses convolutional layers as an efficient means of learning. The decoder consists of deconvolution layers, which upsample the spatial features back to the full scale of the input. In order to retain both global high-level features and local detail information, cross-layer connections are used between the encoder and the decoder, and multi-scale depth prediction is performed.
TABLE 1 encoder network architecture
TABLE 2 decoder network architecture
In the tables, Conv1, Conv1b, Conv2, Conv2b, ..., Conv7, Conv7b are convolutional layers; Disp1, Disp2, ..., Disp4 are connected across layers; Iconv1, ..., Iconv7 and upConv1, ..., upConv7 are deconvolution layers; k is the kernel size, s is the stride, and chns is the number of input and output channels of each layer; in and out are the scaling factors of each layer relative to the input image (i.e., in is the downscaling ratio of the input and out is the size ratio of the output to the original); input lists the input of each layer, where + denotes concatenation and ↑ denotes 2× upsampling of that layer.
The network structure is divided into 6 scales. The largest scale is that of the original image, and each subsequent scale is half the size of the previous one, so the feature map at the smallest scale has only one sixty-fourth of the original resolution but as many as 512 channels. Down-sampling is performed with max pooling in the encoder part, and up-sampling is performed with deconvolution layers in the decoder part. At each scale, the output of the encoder part is passed through a cross-layer connection to the decoder of the corresponding scale; after the encoder feature map is concatenated with that of the corresponding decoder scale, the combined feature map is fed as input into the corresponding deconvolution layer.
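A condensed PyTorch sketch of such an encoder-decoder depth network is given below for illustration: a convolutional encoder with max-pool down-sampling, a deconvolution decoder, cross-layer concatenation, and multi-scale disparity outputs. The channel widths and the disparity-to-depth mapping are assumptions for illustration and do not reproduce Tables 1 and 2, which are provided as images in the original publication.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class DepthNet(nn.Module):
    def __init__(self, enc_chs=(32, 64, 128, 256, 512)):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                      # encoder down-sampling
        self.enc = nn.ModuleList()
        c_prev = 3
        for c in enc_chs:
            self.enc.append(conv_block(c_prev, c))
            c_prev = c
        self.upconv, self.iconv, self.disp = (nn.ModuleList() for _ in range(3))
        for c_skip, c in zip(reversed(enc_chs[:-1]), reversed(enc_chs[1:])):
            self.upconv.append(nn.ConvTranspose2d(c, c_skip, 4, stride=2, padding=1))
            self.iconv.append(conv_block(2 * c_skip, c_skip))   # fuse skip + upsampled
            self.disp.append(nn.Conv2d(c_skip, 1, 3, padding=1))

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = self.pool(x)
        disps = []
        for up, fuse, head, skip in zip(self.upconv, self.iconv, self.disp,
                                        reversed(skips)):
            x = fuse(torch.cat([up(x), skip], dim=1))    # cross-layer connection
            disps.append(torch.sigmoid(head(x)))         # one disparity map per scale
        return disps                                     # coarse-to-fine predictions
```

A per-pixel depth map can then be obtained from a predicted disparity map, for example as depth = 1 / (a·disp + b) with suitable scaling constants.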
(2) Building the camera pose estimation network structure.
The camera pose estimation network regresses the camera pose (the Euler angles of the camera rotation and the translation vector). Its main structure is similar to the encoder of the network in (1): behind 8 convolutional layers, a global average pooling layer (POOL) and finally a Softmax prediction layer are connected; the specific network structure is shown in Table 3. Except for the last prediction layer, batch normalization and ReLU activation functions are used for all layers.
TABLE 3
Here, the 8 convolutional layers are designated Conv1, Conv1b, Conv2, Conv2b, Conv3, Conv3b, Conv4 and Conv4b, respectively, and Fc1 and Fc2 are fully connected layers.
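A minimal PyTorch sketch of such a pose regression network follows. The layer widths and strides are illustrative (Table 3 is provided as an image in the original publication), and a plain linear regression head is used here for the final prediction layer; the network maps the stacked target and source frames to one 6-DoF relative pose (3 Euler angles and 3 translation components) per source view.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self, n_source_views=2):
        super().__init__()
        chs = (16, 32, 64, 128, 256, 256, 256, 256)            # 8 convolutional layers
        layers, c_prev = [], 3 * (1 + n_source_views)           # target + sources stacked
        for i, c in enumerate(chs):
            layers += [nn.Conv2d(c_prev, c, 3, stride=2 if i % 2 == 0 else 1, padding=1),
                       nn.BatchNorm2d(c), nn.ReLU(inplace=True)]
            c_prev = c
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)                     # global average pooling
        self.predict = nn.Linear(chs[-1], 6 * n_source_views)   # no activation here

    def forward(self, frames):              # frames: (B, 3*(1+S), H, W)
        n_src = frames.shape[1] // 3 - 1
        x = self.pool(self.features(frames)).flatten(1)
        pose = 0.01 * self.predict(x)       # keep initial pose predictions small
        return pose.view(-1, n_src, 6)      # (B, S, [rx, ry, rz, tx, ty, tz])
```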
(3) Constructing the loss function of this part.
The depth network only takes the target view $I_t$ as input and outputs a per-pixel depth map $\hat{D}_t$. The camera pose network takes the target view $I_t$ and an adjacent source view (e.g. $I_{t+1}$) as input and outputs the relative camera pose $\hat{T}_{t \to t+1}$. The outputs of the two networks are then used to inverse-warp the source view to reconstruct the target view, and the photometric error is used to train the convolutional neural networks. By using view synthesis as supervision, this framework can be trained from unlabeled video in an unsupervised manner.
The invention uses $\langle I_1, \ldots, I_n \rangle$ to denote the training image sequence, where n is the number of image frames, i.e. there are n pictures $I_1, \ldots, I_n$ in total. Here n is the number of pictures in the whole data set, but each computation is performed on three consecutive frames. In a specific implementation, more than three frames can be processed at a time, but each additional frame increases the amount of computation.
One frame $I_t$ is selected as the target view, and the rest are source views $I_s$ ($1 \le s \le n$, $s \ne t$). The supervisory signal can be expressed as

$L_{rs} = \sum_{s} \sum_{p} \left| I_t(p) - \hat{I}_s^{rig}(p) \right|$

where p indexes pixel coordinates and $\hat{I}_s^{rig}$ denotes the synthesized view of the predicted target frame obtained from the source frame $I_s$ through the rigid flow; the superscript rig indicates that this part considers only static rigid objects. Therefore, the supervisory signal at this stage comes from minimizing the difference between the synthesized view $\hat{I}_s^{rig}$ and the original frame $I_t$. $I_t(p)$ is the value of point p in picture $I_t$, $\hat{I}_s^{rig}(p)$ is the value at the position of point p computed through the rigid flow, and $L_{rs}$ is their difference; the invention requires $L_{rs}$ to be as small as possible during training.
A key component of this framework is a differentiable depth-image-based renderer that reconstructs the target view by sampling pixels from the source view, based on the predicted depth map $\hat{D}_t$ and the relative pose $\hat{T}_{t \to s}$. Let $p_t$ denote the homogeneous coordinates of a pixel in the target view and K denote the camera intrinsic matrix; $p_t$ can be projected onto the source view as $p_s$ by the formula

$p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t$

Note that the projected coordinates $p_s$ are continuous values. Here $\hat{D}_t(p_t)$ is the predicted depth of point p in frame t, $I_s(p_s)$ is the value at position $p_s$ in frame s, and $\hat{I}_s(p_t)$ is the value to be filled at position $p_t$ of the synthesized view. To obtain $I_s(p_s)$, a differentiable bilinear sampling mechanism is used that linearly interpolates the values of the 4 pixels neighbouring $p_s$ (top-left, top-right, bottom-left and bottom-right) to approximate $I_s(p_s)$, i.e.

$\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\},\, j \in \{l,r\}} w^{ij} I_s(p_s^{ij})$

where $w^{ij}$ is linearly proportional to the spatial proximity between $p_s$ and $p_s^{ij}$, and $\sum_{i,j} w^{ij} = 1$. Here t, b, l and r denote top, bottom, left and right respectively, and $w^{ij}$ is the weight of each neighbouring point. The pixel warping coordinates obtained in this way can be decomposed into depth and camera pose through projective geometry.
The differentiable bilinear sampling mechanism is prior art, namely bilinear differentiable interpolation, and its details are omitted in the present invention.
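The following PyTorch sketch illustrates this inverse warping as supervision: target pixels are back-projected with the predicted depth, transformed by the relative pose, projected into the source view with the intrinsics K, and the source image is sampled with differentiable bilinear interpolation. It is a simplified illustration under the formulas above, not the patent's implementation; tensors are assumed to live on the same device.

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_src, depth_t, T_t2s, K):
    """img_src: (B,3,H,W), depth_t: (B,1,H,W), T_t2s: (B,4,4), K: (B,3,3)."""
    B, _, H, W = depth_t.shape
    # Homogeneous pixel grid p_t of the target view
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    p_t = torch.stack([u, v, torch.ones_like(u)]).view(1, 3, -1).expand(B, -1, -1)
    # Back-project: X = D_t(p_t) * K^-1 p_t, then move the points into the source frame
    cam = depth_t.view(B, 1, -1) * (torch.inverse(K) @ p_t)            # (B,3,HW)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)           # (B,4,HW)
    p_s = K @ (T_t2s @ cam_h)[:, :3]                                   # project with K
    p_s = p_s[:, :2] / p_s[:, 2:].clamp(min=1e-6)                      # pixel coordinates
    # Normalize to [-1, 1] and sample with differentiable bilinear interpolation
    grid = torch.stack([2 * p_s[:, 0] / (W - 1) - 1,
                        2 * p_s[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_src, grid, align_corners=True)            # synthesized view

# rigid warp (photometric) loss:  L_rs = (I_t - inverse_warp(I_s, D_t, T, K)).abs().mean()
```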
Step 3, after the framework created in step 2, cascading an optical flow network to model the motion in the scene.
Step 2 ignores moving objects in the scene; after the optical flow estimation network is added, the depth map of moving objects can be effectively compensated and corrected, which improves the accuracy of the result.
The specific implementation process is described as follows:
(1) Constructing the optical flow estimation network structure. The remaining non-rigid flow, whose displacements are caused only by the motion of objects relative to the world scene, is learned with the optical flow network. The framework of the optical flow estimation network is similar to the depth map estimation network in step 2 (the same network structure as in Tables 1 and 2 can be used, because both ultimately produce an output at the same resolution as the input picture) and likewise consists of two parts, an encoder and a decoder. The optical flow network is connected in a cascaded manner after the network of the first stage. For a given pair of image frames, the optical flow network uses the output of the network in step 2, the rigid flow $f^{rig}_{t \to s}$, to predict a corresponding residual flow $f^{res}_{t \to s}$; the final full prediction flow $f^{full}_{t \to s}$ is

$f^{full}_{t \to s} = f^{rig}_{t \to s} + f^{res}_{t \to s}$

The input of the optical flow network is composed of several images concatenated along the channel dimension, including the source and target frame pair $I_s$ and $I_t$, the rigid flow $f^{rig}_{t \to s}$ output by the network in step 2, the synthesized view $\hat{I}_s^{rig}$, and the error between $\hat{I}_s^{rig}$ and the original image.
(2) Constructing the loss function of this part. The supervision in step 2 is extended to this stage with slight modifications (introducing the influence of the optical flow component on the scene flow). Step 2 mainly handles the static scene and ignores the processing of moving objects. In order to improve the robustness of the learning process to these factors, a scheme is proposed for this problem that incorporates an optical flow network to train the residual flow (the optical flow part) beyond the rigid flow. Specifically, with the full prediction flow $f^{full}_{t \to s}$, image warping is performed between any pair of target and source frames using $\hat{I}_s^{full}$ instead of the former $\hat{I}_s^{rig}$, which yields the warp loss $L_{fs}$ of the full flow. The concrete formula is

$L_{fs} = \sum_{s} \sum_{p} \left| I_t(p) - \hat{I}_s^{full}(p) \right|$

where $\hat{I}_s^{full}(p)$ is the value at the position of point p computed through the full flow.
Step 4, proposing a loss function for the deep neural network.
For the space-time consistency check of the depth map, a loss function of the deep neural network is proposed, which prevents excessive frame-to-frame error in the results of consecutive video frames and, at the same time, improves to a certain extent the estimation results in regions of the scene such as low texture, three-dimensional blur and occlusion.
The specific implementation process is described as follows:
the spatial consistency loss provided by the invention is realized by restricting the difference of flow values from a t frame image to a t +1 frame image and from a t +1 frame image to a t frame image, and the temporal consistency loss is realized by adding the difference restriction of the flow values from the t frame to the t +1 frame image and the flow values from the t-1 frame to the t +1 frame directly, wherein the specific formula is shown as the formula:
Figure BDA0002213717400000081
Figure BDA0002213717400000082
wherein,
Figure BDA0002213717400000083
for the position of p-point in t-frame calculated by the global flow of p-point from s-frame to t-frame, Is(p) is the position of the p point in the s frame image,
Figure BDA0002213717400000084
for the entire stream of t-1 frames to t frames,
Figure BDA0002213717400000085
for the entire stream of t frames to t +1 frames,
Figure BDA0002213717400000086
the overall stream from t-1 frame to t +1 frame. L is a radical of an alcoholftFor differences between the stream values from t-frame to s-frame and the stream values from s-frame to t-frame, LfpIs the difference between the stream value from the t-1 frame to the t frame plus the stream value from the t frame to the t +1 frame and the stream value from the t-1 frame directly to the t +1 frame. Ideally these two values should be as small as possible so they are used as a loss function to train the network.
Pixels whose flows are severely contradictory (i.e., whose computed error is too large) are considered possible outliers. Since these regions violate the assumptions of image consistency and geometric consistency, they can only be handled here through smoothness. Therefore, the full-flow warp loss $L_{fs}$ and the space-time consistency losses $L_{ft}$ and $L_{fp}$ are weighted per pixel.
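For illustration, the sketch below computes the two consistency terms with the `flow_warp` helper from the previous sketch; the per-pixel weighting and outlier handling described above are not reproduced, and the simple mean reduction is an assumption.

```python
import torch

def consistency_losses(f_t2s, f_s2t, f_tm1_t, f_t_tp1, f_tm1_tp1):
    """All arguments are full-flow fields of shape (B, 2, H, W)."""
    # Spatial consistency L_ft: the flow t->s and the backward flow s->t,
    # looked up at the warped positions, should cancel out.
    f_s2t_at_p = flow_warp(f_s2t, f_t2s)
    L_ft = (f_t2s + f_s2t_at_p).abs().mean()
    # Temporal consistency L_fp: the flow (t-1 -> t) chained with the flow
    # (t -> t+1) should equal the direct flow (t-1 -> t+1).
    f_t_tp1_at_p = flow_warp(f_t_tp1, f_tm1_t)
    L_fp = (f_tm1_t + f_t_tp1_at_p - f_tm1_tp1).abs().mean()
    return L_ft, L_fp
```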
Step 5, setting the training parameters of the network and continuously optimizing the model according to the error of each epoch. During training, the defined loss functions should keep decreasing over the iterations, so that the model becomes more and more accurate. The optimized model can then be used to estimate depth maps for consecutive video frames.
In a specific implementation, monocular depth and camera pose estimation can be trained jointly first, and then the remaining optical flow network is trained on this basis. Finally, the trained network models for depth map estimation, camera pose estimation and optical flow estimation are obtained.
The specific implementation process is described as follows:
the invention mainly comprises three sub-networks, namely a depth map estimation network and a camera pose estimation network, which form the reconstruction of a static object together, and the optical flow estimation network structure is combined with the output of the previous stage to realize the positioning of a moving object. Although the networks can be trained together in an end-to-end fashion, there is no guarantee that local gradient optimization will bring the network to an optimal point. Therefore, a segmented training strategy is employed while reducing memory and computation consumption. Firstly, training a depth map estimation network and a camera pose estimation network, determining weights, and then training an optical flow estimation network. The resolution of the trained input images are all resize to 128 x 416, while random upscaling, cropping, recoloring, etc. methods are also employed to prevent overfitting. The network optimization function adopts a common neural network optimization method Adam. The initial learning rate was set to 0.0002 and the mini-batch size (minimum batch size) was set to 4. The first and second stages of the training process converge with 30 and 200 epochs (iterations), respectively. Testing on the KITTI data set it should be understood that parts not elaborated on in this specification are prior art.
In the above process, the main characteristics are: the temporal consistency check of the depth map is proposed, the loss function of the deep neural network is improved, a consistency check specifically for video depth maps is constructed within a deep learning model, and the overall loss function is improved to prevent excessive frame-to-frame error in the results of consecutive video frames. At the same time, the estimation results in regions of the scene such as low texture, three-dimensional blur and occlusion are improved to a certain extent.
In a specific implementation, the automatic operation of the above process can be realized by means of software. An apparatus for running the process should also fall within the protection scope of the present invention.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. A video depth map estimation method with space-time consistency, characterized by comprising the following steps:
step 1, generating a training set, including fixing the length of each image sequence to 3 frames, taking the central frame as the target view and the two frames before and after it as source views, and generating a plurality of sequences;
step 2, for static objects in a scene, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences, including building a depth map estimation network structure, a camera pose estimation network structure and the loss function of this part;
step 3, for moving objects in the scene, cascading an optical flow network after the framework established in step 2 to model the motion in the scene, including building the optical flow estimation network structure and the loss function of this part;
step 4, for the space-time consistency check of the depth map, proposing a loss function for the deep neural network, implemented as follows:
a spatial consistency loss is proposed, constraining the difference between the flow values from the frame-t image to the frame-(t+1) image and from the frame-(t+1) image back to the frame-t image; a temporal consistency loss is proposed, constraining the difference between the flow values from frame t-1 to frame t plus those from frame t to frame t+1 and the flow values directly from frame t-1 to frame t+1;
as shown in the following formulas:

$L_{ft} = \sum_{p} \left| f^{full}_{t \to s}(p) + f^{full}_{s \to t}(\hat{p}_s) \right|$

$L_{fp} = \sum_{p} \left| f^{full}_{t-1 \to t}(p) + f^{full}_{t \to t+1}(\hat{p}_t) - f^{full}_{t-1 \to t+1}(p) \right|$

where $\hat{p}_s = p + f^{full}_{t \to s}(p)$ is the position in frame s to which point p of frame t is carried by the full flow, $\hat{p}_t$ is likewise the position in frame t reached from point p of frame t-1 through the full flow, $f^{full}_{t-1 \to t}$ is the full flow from frame t-1 to frame t, $f^{full}_{t \to t+1}$ is the full flow from frame t to frame t+1, and $f^{full}_{t-1 \to t+1}$ is the full flow from frame t-1 to frame t+1; $L_{ft}$ is the difference between the flow values from frame t to frame s and the flow values from frame s back to frame t, and $L_{fp}$ is the difference between the flow value from frame t-1 to frame t plus the flow value from frame t to frame t+1 and the flow value directly from frame t-1 to frame t+1;
step 5, optimizing the model, including performing joint training of monocular depth and camera pose estimation and then training the remaining optical flow network on this basis; and using the optimized model to estimate depth maps for consecutive video frames.
2. The video depth map estimation method with space-time consistency according to claim 1, characterized in that: in step 2, a depth map estimation network and an optical flow estimation network, each composed of an encoder and a decoder, are adopted, and cross-layer connections are adopted for multi-scale depth prediction.
3. The video depth map estimation method with space-time consistency according to claim 1 or 2, characterized in that: in step 2, unsupervised training is performed using unlabeled video, including combining the geometric characteristics of the moving three-dimensional scene for training, merging them into an image synthesis loss, and using image similarity as supervision to perform unsupervised learning training separately on the static scenes and the dynamic scenes in the images.
4. An apparatus for video depth map estimation with space-time consistency, characterized in that: it is used for implementing the video depth map estimation method with space-time consistency according to any one of claims 1 to 3.
CN201910907522.2A 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency Expired - Fee Related CN110782490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910907522.2A CN110782490B (en) 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency


Publications (2)

Publication Number Publication Date
CN110782490A CN110782490A (en) 2020-02-11
CN110782490B (en) 2022-07-05

Family

ID=69383733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910907522.2A Expired - Fee Related CN110782490B (en) 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency

Country Status (1)

Country Link
CN (1) CN110782490B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402310B (en) * 2020-02-29 2023-03-28 同济大学 Monocular image depth estimation method and system based on depth estimation network
CN111311664B (en) * 2020-03-03 2023-04-21 上海交通大学 Combined unsupervised estimation method and system for depth, pose and scene flow
CN111583305B (en) * 2020-05-11 2022-06-21 北京市商汤科技开发有限公司 Neural network training and motion trajectory determination method, device, equipment and medium
CN111709982B (en) * 2020-05-22 2022-08-26 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN112085717B (en) * 2020-09-04 2024-03-19 厦门大学 Video prediction method and system for laparoscopic surgery
CN112270691B (en) * 2020-10-15 2023-04-21 电子科技大学 Monocular video structure and motion prediction method based on dynamic filter network
CN112344922B (en) * 2020-10-26 2022-10-21 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN113160294B (en) * 2021-03-31 2022-12-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN113222895B (en) * 2021-04-10 2023-05-02 优层智能科技(上海)有限公司 Electrode defect detection method and system based on artificial intelligence
CN112801074B (en) * 2021-04-15 2021-07-16 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera
CN113284173B (en) * 2021-04-20 2023-12-19 中国矿业大学 End-to-end scene flow and pose joint learning method based on false laser radar
CN114332380B (en) * 2022-01-04 2024-08-09 吉林大学 Light field video synthesis method based on monocular RGB camera and three-dimensional body representation
CN114359363B (en) * 2022-01-11 2024-06-18 浙江大学 Video consistency depth estimation method and device based on depth learning
CN114663347B (en) * 2022-02-07 2022-09-27 中国科学院自动化研究所 Unsupervised object instance detection method and unsupervised object instance detection device
CN115131404B (en) * 2022-07-01 2024-06-14 上海人工智能创新中心 Monocular 3D detection method based on motion estimation depth
CN114937125B (en) * 2022-07-25 2022-10-25 深圳大学 Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium
CN115187638B (en) * 2022-09-07 2022-12-27 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
CN117115786B (en) * 2023-10-23 2024-01-26 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9123115B2 (en) * 2010-11-23 2015-09-01 Qualcomm Incorporated Depth estimation based on global motion and optical flow

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103002309A (en) * 2012-09-25 2013-03-27 浙江大学 Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera
CN105100771A (en) * 2015-07-14 2015-11-25 山东大学 Single-viewpoint video depth obtaining method based on scene classification and geometric dimension
CN106599805A (en) * 2016-12-01 2017-04-26 华中科技大学 Supervised data driving-based monocular video depth estimating method
CN106612427A (en) * 2016-12-29 2017-05-03 浙江工商大学 Method for generating spatial-temporal consistency depth map sequence based on convolution neural network
CN107481279A (en) * 2017-05-18 2017-12-15 华中科技大学 A kind of monocular video depth map computational methods
CN107274445A (en) * 2017-05-19 2017-10-20 华中科技大学 A kind of image depth estimation method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Antonio W. Vieira et al. STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2012. *
Tak-Wai Hui et al. Dense depth map generation using sparse depth data from normal flow. 2014 IEEE International Conference on Image Processing (ICIP), 2015. *
Jiang Hanqing et al. Spatio-temporally consistent depth recovery of dynamic scenes captured by multiple hand-held cameras. Journal of Computer-Aided Design & Computer Graphics, 2013, Vol. 25, No. 2. *
Ge Liyue et al. 3D scene flow estimation with hierarchical segmentation optimized by depth images. Journal of Nanchang Hangkong University (Natural Sciences), 2018, Vol. 32, No. 2. *

Also Published As

Publication number Publication date
CN110782490A (en) 2020-02-11

Similar Documents

Publication Publication Date Title
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN111739078B (en) Monocular unsupervised depth estimation method based on context attention mechanism
US11210803B2 (en) Method for 3D scene dense reconstruction based on monocular visual slam
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN111354030B (en) Method for generating unsupervised monocular image depth map embedded into SENet unit
CN108491763B (en) Unsupervised training method and device for three-dimensional scene recognition network and storage medium
CN113077505B (en) Monocular depth estimation network optimization method based on contrast learning
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN110910437B (en) Depth prediction method for complex indoor scene
CN111325784A (en) Unsupervised pose and depth calculation method and system
CN115187638B (en) Unsupervised monocular depth estimation method based on optical flow mask
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN110942484B (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN113850900B (en) Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction
CN115294282A (en) Monocular depth estimation system and method for enhancing feature fusion in three-dimensional scene reconstruction
CN112288788A (en) Monocular image depth estimation method
CN117576179A (en) Mine image monocular depth estimation method with multi-scale detail characteristic enhancement
CN117593702B (en) Remote monitoring method, device, equipment and storage medium
CN117788544A (en) Image depth estimation method based on lightweight attention mechanism
CN115272450A (en) Target positioning method based on panoramic segmentation
CN115035171B (en) Self-supervision monocular depth estimation method based on self-attention guide feature fusion
Yang et al. 360Spred: Saliency Prediction for 360-Degree Videos Based on 3D Separable Graph Convolutional Networks
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
Zhao et al. MDSNet: self-supervised monocular depth estimation for video sequences using self-attention and threshold mask
CN117974721A (en) Vehicle motion estimation method and system based on monocular continuous frame images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220705