CN106612427A - Method for generating spatial-temporal consistency depth map sequence based on convolution neural network - Google Patents

Method for generating spatial-temporal consistency depth map sequence based on convolution neural network

Info

Publication number
CN106612427A
Authority
CN
China
Prior art keywords
super pixel, depth, space, matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611244732.0A
Other languages
Chinese (zh)
Other versions
CN106612427B (en)
Inventor
王勋
赵绪然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN201611244732.0A priority Critical patent/CN106612427B/en
Publication of CN106612427A publication Critical patent/CN106612427A/en
Application granted granted Critical
Publication of CN106612427B publication Critical patent/CN106612427B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/122 Improving the 3D impression of stereoscopic images by modifying image signal contents, e.g. by filtering or adding monoscopic depth cues
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/128 Adjusting depth or disparity

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for generating a spatio-temporally consistent depth map sequence based on a convolutional neural network, which can be used in the 2D-to-3D conversion of film and television works. The method comprises the following steps: (1) collecting a training set, wherein each training sample is composed of a continuous RGB image sequence and the corresponding depth map sequence; (2) carrying out spatio-temporally consistent superpixel segmentation on each image sequence in the training set, and constructing a spatial similarity matrix and a temporal similarity matrix; (3) constructing a convolutional neural network composed of a single-superpixel depth regression network and a spatio-temporal consistency conditional random field loss layer; (4) training the convolutional neural network; and (5) for an RGB image sequence of unknown depth, using the trained network to recover the corresponding depth map sequence through forward propagation. The method avoids both the over-reliance on scene assumptions of cue-based depth recovery methods and the inter-frame discontinuity of depth maps produced by existing depth recovery methods based on convolutional neural networks.

Description

Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks
Technical field
The present invention relates to the field of stereoscopic video in computer vision, and in particular to a method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks.
Background technology
The basic principle of stereoscopic video is to superimpose and play two images with horizontal parallax; the viewer sees the left-eye and right-eye pictures separately through stereoscopic glasses and thereby perceives depth. Stereoscopic video offers an immersive three-dimensional viewing experience and is very popular. However, as 3D display hardware becomes more widespread, the shortage of 3D content becomes increasingly apparent. Shooting directly with 3D cameras is expensive and post-production is difficult, so it is generally feasible only for big-budget films. The 2D-to-3D conversion of film and television works is therefore an effective way to alleviate the shortage of 3D sources: it can greatly expand the range and quantity of stereoscopic productions and can also bring classic works back to the screen.
Because the horizontal parallax in stereoscopic video is directly related to the depth of each pixel, obtaining the depth map corresponding to each video frame is the key to 2D-to-3D conversion. Depth maps can be produced by manually rotoscoping each frame and assigning depth values, but this is very expensive. Semi-automatic depth map generation methods also exist, in which the depth maps of a few key frames are drawn manually and then propagated by computer to the other frames. Although such methods save some time, large-scale 2D-to-3D conversion of film and television works still requires a heavy amount of manual work.
In comparison, fully automatic depth recovery methods save the most labor. Some algorithms recover depth maps from depth cues such as motion, focus, occlusion or shading using specific rules, but they are usually effective only for particular scenes. For example, structure-from-motion methods can recover the depth of a static scene shot by a moving camera from the cue that distant objects have small relative displacement between adjacent frames while nearby objects have large relative displacement, but they fail when the subject moves or the camera is static; focus-based depth recovery works for shallow depth-of-field images but performs poorly when the depth of field is large. Film and television works usually contain a wide variety of scenes, so cue-based depth recovery methods are difficult to apply universally.
Convolutional neural networks are deep neural networks particularly suited to images. They are built by stacking elementary units such as convolutional layers, activation layers, pooling layers and fully connected layers, and can model complicated functions from an image input x to a specific output y; they now dominate machine vision problems such as image classification and image segmentation. In the last couple of years, several methods have applied convolutional neural networks to depth recovery, learning the mapping from an RGB image input to a depth map output from large amounts of data. Depth recovery based on convolutional neural networks does not rely on particular scene assumptions, generalizes well, and achieves high accuracy, so it has great application potential in the 2D-to-3D conversion of film and television works. However, existing methods optimize over single images when training the convolutional neural network and ignore the continuity between frames. When they are applied to recover the depth of an image sequence, the depth maps recovered for adjacent frames show obvious jumps; these jumps make the synthesized virtual views flicker and seriously degrade the viewing experience. Moreover, inter-frame continuity also provides important cues for depth recovery, which existing methods simply ignore.
Summary of the invention
The object of the present invention is to address the deficiencies of the prior art by providing a method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks. The temporal continuity of the RGB images and depth maps is introduced into the convolutional neural network, and multiple frames are jointly optimized during training, so as to generate depth maps that are continuous in the time domain and to improve the accuracy of depth recovery.
The object of the present invention is achieved through the following technical solution: a method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks, comprising the following steps:
1) Collect a training set. Each training sample of the training set consists of a continuous RGB image sequence of m frames and its corresponding depth map sequence (a minimal data-structure sketch of such a sample is given after this list);
2) Perform spatio-temporally consistent superpixel segmentation on each image sequence in the training set, and build a spatial similarity matrix S^(s) and a temporal similarity matrix S^(t);
3) Build a convolutional neural network composed of a single-superpixel depth regression network with parameters W and a spatio-temporal consistency conditional random field loss layer with parameters α. The single-superpixel depth regression network regresses one depth value for each superpixel without considering the spatio-temporal consistency constraints; the conditional random field loss layer uses the temporal and spatial similarity matrices built in step 2) to constrain the output of the single-superpixel regression network, so that the final output is an estimated depth map that is smooth in both the time domain and the spatial domain.
4) Train the convolutional neural network built in step 3) with the RGB image sequences and depth map sequences in the training set, obtaining the network parameters W and α.
5) For an RGB image sequence of unknown depth, recover the depth map sequence by forward propagation through the trained network.
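As an illustration of the data layout assumed by steps 1) and 2), the following minimal Python sketch shows one training sample: an m-frame RGB sequence, its depth maps, and the superpixel bookkeeping produced by the segmentation. The field names are hypothetical and not part of the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """One training sample: m consecutive RGB frames with their depth maps."""
    rgb: np.ndarray     # (m, H, W, 3) RGB frames
    depth: np.ndarray   # (m, H, W) ground-truth depth per pixel
    # produced by the spatio-temporally consistent superpixel segmentation:
    labels: np.ndarray  # (m, H, W) superpixel index of every pixel,
                        # running over all n = n_1 + ... + n_m superpixels
    links: list         # (p, q) pairs of superpixels in adjacent frames
                        # that belong to the same object

    @property
    def n_superpixels(self) -> int:
        return int(self.labels.max()) + 1
```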
Further, step 2) is specifically:
(2.1) Perform spatio-temporally consistent superpixel segmentation on each continuous RGB image sequence in the training set. Denote the input sequence by I = [I_1, …, I_m], where I_t is the t-th RGB frame and there are m frames in total. The spatio-temporally consistent superpixel segmentation divides the m frames into n_1, …, n_m superpixels respectively, and produces the correspondence between each superpixel in a frame and the superpixel in the previous frame that belongs to the same object. The whole image sequence contains n = n_1 + … + n_m superpixels. For each superpixel p, the ground-truth depth value at its centroid is denoted d_p, and the ground-truth depth vector of the n superpixels is defined as d = [d_1; …; d_n].
(2.2) Build the spatial-consistency similarity matrix S^(s) of these n superpixels as follows. S^(s) is an n × n matrix whose entry S^(s)_pq describes the within-frame similarity between the p-th and q-th superpixels; it is computed from the color-histogram features c_p and c_q of superpixels p and q, where γ is a manually set parameter that may be set to the median of ||c_p − c_q||_2 over all pairs of adjacent superpixels.
(2.3) Build the temporal-consistency similarity matrix S^(t) of these n superpixels as follows. S^(t) is an n × n matrix whose entry S^(t)_pq describes the between-frame similarity between the p-th and q-th superpixels; the correspondences between superpixels in adjacent frames are given by the spatio-temporally consistent superpixel segmentation of step (2.1).
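The exact similarity formulas referred to in (2.2) and (2.3) appear only in the patent drawings and are not reproduced above. The sketch below is one plausible reading, assuming a Gaussian kernel on the color-histogram distance for spatially adjacent superpixels and binary entries for temporally corresponding superpixels; both forms are assumptions of this sketch, not the patent's verbatim definitions.

```python
import numpy as np

def spatial_similarity(hist, adjacent_pairs, gamma=None):
    """S^(s): assumed Gaussian kernel on the color-histogram distance of
    superpixel pairs that are adjacent within the same frame."""
    n = hist.shape[0]
    dists = np.array([np.linalg.norm(hist[p] - hist[q]) for p, q in adjacent_pairs])
    if gamma is None:
        gamma = np.median(dists)   # median of ||c_p - c_q||_2, as in step (2.2)
    S = np.zeros((n, n))
    for (p, q), dist in zip(adjacent_pairs, dists):
        S[p, q] = S[q, p] = np.exp(-dist / gamma)
    return S

def temporal_similarity(n, links):
    """S^(t): assumed binary matrix, 1 for superpixels linked to the same
    object in adjacent frames, 0 otherwise."""
    S = np.zeros((n, n))
    for p, q in links:
        S[p, q] = S[q, p] = 1.0
    return S
```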
Further, the convolutional neural network built in step 3) consists of two parts, a single-superpixel depth regression network and a spatio-temporal consistency conditional random field loss layer:
(3.1) The single-superpixel depth regression network consists of the first 31 layers of the VGG16 network, one superpixel pooling layer, and three fully connected layers. The superpixel pooling layer performs average pooling of the features within the spatial extent of each superpixel. The input of the network is the m consecutive RGB frames, and the output is an n-dimensional vector z = [z_1, …, z_n], whose p-th element z_p is the depth estimate, without any consistency constraint, of the p-th superpixel obtained from the spatio-temporally consistent superpixel segmentation of the sequence. The parameters of this convolutional neural network to be learned are denoted W.
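A minimal sketch of the superpixel average pooling described in (3.1), assuming the per-pixel feature map of a frame has already been brought to the frame resolution; function and variable names are illustrative, not from the patent.

```python
import numpy as np

def superpixel_average_pool(features, labels, n_superpixels):
    """Average per-pixel features over each superpixel of one frame.

    features: (H, W, C) feature map; labels: (H, W) superpixel index per pixel.
    Returns an (n_superpixels, C) matrix with one pooled vector per superpixel."""
    H, W, C = features.shape
    flat_feat = features.reshape(-1, C)
    flat_lab = labels.reshape(-1)
    counts = np.bincount(flat_lab, minlength=n_superpixels)
    pooled = np.zeros((n_superpixels, C))
    for c in range(C):
        pooled[:, c] = np.bincount(flat_lab, weights=flat_feat[:, c],
                                   minlength=n_superpixels)
    return pooled / np.maximum(counts, 1)[:, None]
```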
(3.2) The spatio-temporal consistency conditional random field loss layer takes as input the output z = [z_1, …, z_n] of the single-superpixel regression network of step (3.1), the ground-truth superpixel depth vector d = [d_1; …; d_n] defined in step (2.1), and the spatial-consistency similarity matrix S^(s) and temporal-consistency similarity matrix S^(t) obtained in steps (2.2) and (2.3). The conditional probability function of the spatio-temporal consistency conditional random field is

P(d | I) = exp(−E(d, I)) / Z(I), with partition function Z(I) = ∫ exp(−E(d, I)) dd,

where the energy function E(d, I) is the sum of three terms:
The first term, Σ_{p∈N} (d_p − z_p)^2, is the gap between the per-superpixel prediction and the ground-truth value. The second term is the spatial-consistency constraint: if superpixels p and q are adjacent within the same frame and their colors are similar (S^(s)_pq is relatively large), their depths should be similar. The third term is the temporal-consistency constraint: if superpixels p and q correspond to the same object in two adjacent frames (S^(t)_pq = 1), their depths should be similar. In matrix form the energy function can be written as:
E(d, I) = d^T L d − 2 z^T d + z^T z
where L = I_n + D − M and M = α^(s) S^(s) + α^(t) S^(t); S^(s) and S^(t) are the spatial and temporal similarity matrices obtained in steps (2.2) and (2.3), α^(s) and α^(t) are two parameters to be learned, I_n is the n × n identity matrix, and D is the diagonal matrix with D_pp = Σ_q M_pq.
Since the energy is quadratic in d, the partition function can be evaluated in closed form, and the loss function is defined as the negative logarithm of the conditional probability function:

J = −log P(d | I) = d^T L d − 2 z^T d + z^T L^(−1) z − (1/2) log(|L|) + (n/2) log(π),

where L^(−1) denotes the inverse matrix of L and |L| denotes the determinant of L.
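A numerical sketch of the loss layer of (3.2), using the matrix form above and the reconstruction L = I_n + D − M; this is a sketch under those stated definitions, not an authoritative implementation of the patent.

```python
import numpy as np

def build_L(S_s, S_t, alpha_s, alpha_t):
    """L = I_n + D - M, with M = alpha^(s) S^(s) + alpha^(t) S^(t)
    and D the diagonal matrix of row sums of M."""
    n = S_s.shape[0]
    M = alpha_s * S_s + alpha_t * S_t
    D = np.diag(M.sum(axis=1))
    return np.eye(n) + D - M

def crf_negative_log_likelihood(d, z, L):
    """J = d'Ld - 2 z'd + z'L^{-1}z - 0.5 log|L| + 0.5 n log(pi)."""
    n = d.shape[0]
    L_inv_z = np.linalg.solve(L, z)
    _, logdet = np.linalg.slogdet(L)
    return (d @ L @ d - 2 * z @ d + z @ L_inv_z
            - 0.5 * logdet + 0.5 * n * np.log(np.pi))
```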
Further, the training process of the convolutional neural network in step 4) is specifically:
(4.1) The network parameters W, α^(s) and α^(t) are optimized by stochastic gradient descent. In each iteration the parameters are updated as

W = W − lr · ∂J/∂W, α^(s) = α^(s) − lr · ∂J/∂α^(s), α^(t) = α^(t) − lr · ∂J/∂α^(t),

where lr is the learning rate.
(4.2) The partial derivative of the loss function J with respect to the parameters W in step (4.1) is computed as

∂J/∂W = 2 (L^(−1) z − d)^T ∂z/∂W,

where ∂z/∂W is computed layer by layer through the back-propagation of the convolutional neural network.
(4.3) The partial derivatives of the loss function J with respect to the parameters α^(s) and α^(t) are computed as

∂J/∂α^(s) = d^T A^(s) d − z^T L^(−1) A^(s) L^(−1) z − (1/2) Tr(L^(−1) A^(s)),
∂J/∂α^(t) = d^T A^(t) d − z^T L^(−1) A^(t) L^(−1) z − (1/2) Tr(L^(−1) A^(t)),

where Tr(·) denotes the trace of a matrix, and A^(s) and A^(t) are the partial derivatives of the matrix L with respect to α^(s) and α^(t), computed as

A^(s)_pq = −S^(s)_pq + δ(p = q) Σ_k S^(s)_pk,
A^(t)_pq = −S^(t)_pq + δ(p = q) Σ_k S^(t)_pk,

where δ(p = q) equals 1 when p = q and 0 otherwise.
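The gradient formulas of (4.2) and (4.3) translate directly into code. The following sketch (illustrative only, reusing build_L from the previous sketch) returns ∂J/∂z, which is back-propagated through the CNN to give ∂J/∂W, together with ∂J/∂α^(s) and ∂J/∂α^(t):

```python
import numpy as np

def crf_gradients(d, z, S_s, S_t, alpha_s, alpha_t):
    """Gradients of J w.r.t. the network output z and the CRF weights."""
    L = build_L(S_s, S_t, alpha_s, alpha_t)   # from the previous sketch
    L_inv = np.linalg.inv(L)
    L_inv_z = L_inv @ z

    dJ_dz = 2.0 * (L_inv_z - d)               # chained into dJ/dW by backprop

    def dJ_dalpha(S):
        A = np.diag(S.sum(axis=1)) - S        # A = dL/dalpha
        return (d @ A @ d
                - L_inv_z @ (A @ L_inv_z)     # = z' L^{-1} A L^{-1} z
                - 0.5 * np.trace(L_inv @ A))
    return dJ_dz, dJ_dalpha(S_s), dJ_dalpha(S_t)
```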
Further, in step 5), the method of recovering the depth of an RGB image sequence of unknown depth is specifically:
(5.1) Perform spatio-temporally consistent superpixel segmentation on the RGB image sequence according to the method of step 2), and compute the spatial similarity matrix S^(s) and the temporal similarity matrix S^(t).
(5.2) Carry out forward propagation of the RGB image sequence through the trained convolutional neural network to obtain the single-superpixel network output z.
(5.3) The depth output under the spatio-temporal consistency constraint, d̂, is computed as

d̂ = L^(−1) z,

where the matrix L is computed by the method described in step (3.2), and d̂_p represents the estimated depth of the p-th superpixel of the RGB image sequence.
(5.4) Assign each d̂_p to the corresponding position of its superpixel in the corresponding frame, thereby obtaining the depth maps of the m frames.
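A minimal inference sketch for step 5), combining the pieces above; build_L is reused from the earlier sketch, and z is assumed to come from the trained single-superpixel network.

```python
import numpy as np

def recover_depth_maps(z, labels, S_s, S_t, alpha_s, alpha_t):
    """Steps (5.3)-(5.4): solve d_hat = L^{-1} z and paint each superpixel's
    depth back onto its pixels.

    z: (n,) per-superpixel network output; labels: (m, H, W) superpixel
    index of every pixel, with indices 0..n-1 over the whole sequence."""
    L = build_L(S_s, S_t, alpha_s, alpha_t)
    d_hat = np.linalg.solve(L, z)     # d_hat = L^{-1} z
    return d_hat[labels]              # (m, H, W) depth map sequence
```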
The beneficial effects of the present invention are as follows:
First, compared with depth recovery methods based on depth cues, the present invention uses a convolutional neural network to learn the mapping from RGB images to depth maps and does not rely on specific assumptions about the scene.
Second, compared with existing depth recovery methods based on convolutional neural networks, which optimize single frames only, the present invention adds spatio-temporal consistency constraints and jointly optimizes multiple frames through the spatio-temporal consistency conditional random field loss layer, so it can output spatio-temporally consistent depth maps and avoid inter-frame jumps.
Third, compared with existing depth recovery methods based on convolutional neural networks, the spatio-temporal consistency constraints added by the present invention improve the accuracy of depth recovery.
The present invention has been compared, on the public dataset NYU Depth v2 and on LYB 3D-TV, a dataset collected by the inventors, with Eigen, David, Christian Puhrsch, and Rob Fergus, "Depth map prediction from a single image using a multi-scale deep network", Advances in Neural Information Processing Systems, 2014, and other existing methods. The results show that the proposed method significantly improves the temporal continuity of the recovered depth maps as well as the accuracy of depth estimation.
Description of the drawings
Fig. 1 is a flow chart of an embodiment of the present invention;
Fig. 2 is a structural diagram of the convolutional neural network proposed by the present invention;
Fig. 3 is a structural diagram of the single-superpixel depth regression network;
Fig. 4 is a schematic diagram of the single-superpixel network acting on multiple frames.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
As shown in the flow chart of Fig. 1, the method of the present invention comprises the following steps:
1) Collect a training set. Each training sample of the training set consists of a continuous RGB image sequence of m frames and its corresponding depth map sequence;
2) Perform spatio-temporally consistent superpixel segmentation on each image sequence in the training set using the method proposed in Chang, Jason, et al., "A video representation using temporal superpixels", CVPR 2013, and build the spatial similarity matrix S^(s) and the temporal similarity matrix S^(t);
3) Build a convolutional neural network composed of a single-superpixel depth regression network with parameters W and a spatio-temporal consistency conditional random field loss layer with parameters α. The single-superpixel depth regression network regresses one depth value for each superpixel without considering the spatio-temporal consistency constraints; the conditional random field loss layer uses the temporal and spatial similarity matrices built in step 2) to constrain the output of the single-superpixel regression network, so that the final output is an estimated depth map that is smooth in both the time domain and the spatial domain.
4) Train the convolutional neural network built in step 3) with the RGB image sequences and depth map sequences in the training set, obtaining the network parameters W and α.
5) For an RGB image sequence of unknown depth, recover the depth map sequence by forward propagation through the trained network.
The implementation of step 2) is described as follows:
(2.1) Perform spatio-temporally consistent superpixel segmentation on each continuous RGB image sequence in the training set using the method proposed in Chang, Jason, et al., "A video representation using temporal superpixels", CVPR 2013. Denote the input sequence by I = [I_1, …, I_m], where I_t is the t-th RGB frame and there are m frames in total. The spatio-temporally consistent superpixel segmentation divides the m frames into n_1, …, n_m superpixels respectively, and produces the correspondence between each superpixel in a frame and the superpixel in the previous frame that belongs to the same object. The whole image sequence contains n = n_1 + … + n_m superpixels. For each superpixel p, the ground-truth depth value at its centroid is denoted d_p, and the ground-truth depth vector of the n superpixels is defined as d = [d_1; …; d_n].
(2.2) Build the spatial-consistency similarity matrix S^(s) of these n superpixels as follows. S^(s) is an n × n matrix whose entry S^(s)_pq describes the within-frame similarity between the p-th and q-th superpixels; it is computed from the color-histogram features c_p and c_q of superpixels p and q, where γ is a manually set parameter that may be set to the median of ||c_p − c_q||_2 over all pairs of adjacent superpixels.
(2.3) Build the temporal-consistency similarity matrix S^(t) of these n superpixels as follows. S^(t) is an n × n matrix whose entry S^(t)_pq describes the between-frame similarity between the p-th and q-th superpixels; the correspondences between superpixels in adjacent frames are given by the spatio-temporally consistent superpixel segmentation of step (2.1).
The implementation of step 3) is described as follows:
(3.1) The convolutional neural network built by this method consists of two parts: a single-superpixel depth regression network and a spatio-temporal consistency conditional random field loss layer; the overall network structure is shown in Fig. 2.
(3.2) The single-superpixel depth regression network described in step (3.1) consists of the first 31 layers of the VGG16 network proposed in Simonyan, Karen, and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition", arXiv preprint arXiv:1409.1556 (2014), two convolutional layers, one superpixel pooling layer, and three fully connected layers; the network structure is shown in Fig. 3. The superpixel pooling layer performs average pooling of the features within the spatial extent of each superpixel, while the other convolutional, pooling and activation layers are conventional layers of convolutional neural networks. For an input of m consecutive RGB frames, the network first acts on each frame separately: for the t-th frame, which contains n_t superpixels, the network outputs an n_t-dimensional vector z_t representing the unconstrained depth regression output of each superpixel in that frame. The outputs of the m frames are then concatenated into an n = n_1 + … + n_m dimensional vector z = [z_1; …; z_n], representing the estimated depth regression values of the n superpixels of the image sequence, as shown in Fig. 4. The parameters of this convolutional neural network to be learned are denoted W.
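A sketch of how the per-frame outputs described in (3.2) could be assembled into the full vector z; single_frame_regression stands for the VGG16-based per-frame network and is a hypothetical name.

```python
import numpy as np

def sequence_regression(frames, labels_per_frame, single_frame_regression):
    """Run the single-superpixel regression network on each frame and
    concatenate the per-frame outputs into one vector of length
    n = n_1 + ... + n_m, in frame order."""
    outputs = []
    for frame, labels in zip(frames, labels_per_frame):
        n_t = int(labels.max()) + 1                         # superpixels in frame t
        z_t = single_frame_regression(frame, labels, n_t)   # shape (n_t,)
        outputs.append(np.asarray(z_t))
    return np.concatenate(outputs)                          # shape (n,)
```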
(3.3) The spatio-temporal consistency conditional random field loss layer described in step (3.1) takes as input the output z = [z_1, …, z_n] of the single-superpixel regression network described in step (3.2), the ground-truth superpixel depth vector d = [d_1; …; d_n] defined in step (2.1), and the spatial-consistency similarity matrix S^(s) and temporal-consistency similarity matrix S^(t) obtained in steps (2.2) and (2.3). The conditional probability function of the spatio-temporal consistency conditional random field is

P(d | I) = exp(−E(d, I)) / Z(I), with partition function Z(I) = ∫ exp(−E(d, I)) dd,

where the energy function E(d, I) is the sum of three terms:
The first term, Σ_{p∈N} (d_p − z_p)^2, is the gap between the per-superpixel prediction and the ground-truth value. The second term is the spatial-consistency constraint: if superpixels p and q are adjacent within the same frame and their colors are similar (S^(s)_pq is relatively large), their depths should be similar. The third term is the temporal-consistency constraint: if superpixels p and q correspond to the same object in two adjacent frames (S^(t)_pq = 1), their depths should be similar. In matrix form the energy function can be written as:
E(d, I) = d^T L d − 2 z^T d + z^T z
where L = I_n + D − M and M = α^(s) S^(s) + α^(t) S^(t); S^(s) and S^(t) are the spatial and temporal similarity matrices obtained in steps (2.2) and (2.3), α^(s) and α^(t) are two parameters to be learned, I_n is the n × n identity matrix, and D is the diagonal matrix with D_pp = Σ_q M_pq.
Since the energy is quadratic in d, the partition function can be evaluated in closed form, and the loss function is defined as the negative logarithm of the conditional probability function:

J = −log P(d | I) = d^T L d − 2 z^T d + z^T L^(−1) z − (1/2) log(|L|) + (n/2) log(π),

where L^(−1) denotes the inverse matrix of L and |L| denotes the determinant of L.
The training process of the convolutional neural network in step 4) is specifically:
(4.1) The network parameters W, α^(s) and α^(t) are optimized by stochastic gradient descent. In each iteration the parameters are updated as

W = W − lr · ∂J/∂W, α^(s) = α^(s) − lr · ∂J/∂α^(s), α^(t) = α^(t) − lr · ∂J/∂α^(t),

where lr is the learning rate.
(4.2) The partial derivative of the loss function J with respect to the parameters W in step (4.1) is computed as

∂J/∂W = 2 (L^(−1) z − d)^T ∂z/∂W,

where ∂z/∂W is computed layer by layer through the back-propagation of the convolutional neural network.
(4.3) The partial derivatives of the loss function J with respect to the parameters α^(s) and α^(t) are computed as

∂J/∂α^(s) = d^T A^(s) d − z^T L^(−1) A^(s) L^(−1) z − (1/2) Tr(L^(−1) A^(s)),
∂J/∂α^(t) = d^T A^(t) d − z^T L^(−1) A^(t) L^(−1) z − (1/2) Tr(L^(−1) A^(t)),

where Tr(·) denotes the trace of a matrix, and A^(s) and A^(t) are the partial derivatives of the matrix L with respect to α^(s) and α^(t), computed as

A^(s)_pq = −S^(s)_pq + δ(p = q) Σ_k S^(s)_pk,
A^(t)_pq = −S^(t)_pq + δ(p = q) Σ_k S^(t)_pk,

where δ(p = q) equals 1 when p = q and 0 otherwise.
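Putting (4.1)-(4.3) together, the outer stochastic gradient descent loop could look like the sketch below. cnn_forward and cnn_backward stand for the single-superpixel network's forward pass and for back-propagating ∂J/∂z into ∂J/∂W; these names, and the sample fields carrying the per-superpixel ground-truth depths and similarity matrices, are assumptions of the sketch (crf_gradients is reused from the earlier sketch).

```python
def train(samples, W, alpha_s, alpha_t, lr, epochs, cnn_forward, cnn_backward):
    """SGD over training sequences, updating W, alpha^(s) and alpha^(t)."""
    for _ in range(epochs):
        for s in samples:
            z = cnn_forward(W, s.rgb, s.labels)        # per-superpixel output
            dJ_dz, g_s, g_t = crf_gradients(s.superpixel_depth, z,
                                            s.S_s, s.S_t, alpha_s, alpha_t)
            dJ_dW = cnn_backward(W, dJ_dz)             # chain rule through the CNN
            W = W - lr * dJ_dW                         # updates of step (4.1)
            alpha_s = alpha_s - lr * g_s
            alpha_t = alpha_t - lr * g_t
    return W, alpha_s, alpha_t
```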
In step 5), the method of recovering the depth of an RGB image sequence of unknown depth is specifically:
(5.1) Perform spatio-temporally consistent superpixel segmentation on the RGB image sequence according to the method of step 2), and compute the spatial similarity matrix S^(s) and the temporal similarity matrix S^(t).
(5.2) Carry out forward propagation of the RGB image sequence through the trained convolutional neural network to obtain the single-superpixel network output z.
(5.3) The depth output under the spatio-temporal consistency constraint, d̂, is computed as

d̂ = L^(−1) z,

where the matrix L is computed by the method described in step (3.3), and d̂_p represents the estimated depth of the p-th superpixel of the RGB image sequence.
(5.4) Assign each d̂_p to the corresponding position of its superpixel in the corresponding frame, thereby obtaining the depth maps of the m frames.
Specific embodiment: the present invention was compared with other existing methods on the public dataset NYU Depth v2 and on LYB 3D-TV, a dataset collected by the inventors. The NYU Depth v2 dataset consists of 795 training scenes and 654 test scenes, each scene containing 30 consecutive RGB frames and the corresponding depth maps. The LYB 3D-TV database is taken from scenes of the television series Nirvana in Fire (琅琊榜); 5124 frames from 60 scenes with manually annotated depth maps were chosen as the training set, and 1278 frames from 20 scenes with manually annotated depth maps as the test set. The proposed method was compared in depth recovery accuracy with the following methods:
1. Depth Transfer: Karsch, Kevin, Ce Liu, and Sing Bing Kang, "Depth transfer: Depth extraction from video using non-parametric sampling", IEEE Transactions on Pattern Analysis and Machine Intelligence 36.11 (2014): 2144-2158.
2. Discrete-continuous CRF: Liu, Miaomiao, Mathieu Salzmann, and Xuming He, "Discrete-continuous depth estimation from a single image", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
3. Multi-scale CNN: Eigen, David, Christian Puhrsch, and Rob Fergus, "Depth map prediction from a single image using a multi-scale deep network", Advances in Neural Information Processing Systems, 2014.
4. 2D-DCNF: Liu, Fayao, et al., "Learning depth from single monocular images using deep convolutional neural fields", IEEE Transactions on Pattern Analysis and Machine Intelligence.
The results show that the accuracy of the proposed method is improved relative to the compared methods, and that the inter-frame jitter of the recovered depth maps is significantly reduced.
Table 1: Depth recovery accuracy comparison on the NYU Depth v2 database
Table 2: Depth recovery accuracy comparison on the LYB 3D-TV database

Claims (5)

1. A method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks, characterized in that it comprises the following steps:
1) Collecting a training set, each training sample of which consists of a continuous RGB image sequence of m frames and its corresponding depth map sequence;
2) Performing spatio-temporally consistent superpixel segmentation on each image sequence in the training set, and building a spatial similarity matrix S^(s) and a temporal similarity matrix S^(t);
3) Building a convolutional neural network composed of a single-superpixel depth regression network with parameters W and a spatio-temporal consistency conditional random field loss layer with parameters α;
4) Training the convolutional neural network built in step 3) with the RGB image sequences and depth map sequences in the training set, obtaining the network parameters W and α;
5) For an RGB image sequence of unknown depth, recovering the depth map sequence by forward propagation through the trained network.
2. The method for generating spatio-temporally consistent depth map sequences according to claim 1, characterized in that step 2) is specifically:
(2.1) Performing spatio-temporally consistent superpixel segmentation on each continuous RGB image sequence in the training set. The input sequence is denoted I = [I_1, …, I_m], where I_t is the t-th RGB frame and there are m frames in total. The spatio-temporally consistent superpixel segmentation divides the m frames into n_1, …, n_m superpixels respectively, and produces the correspondence between each superpixel in a frame and the superpixel in the previous frame that belongs to the same object. The whole image sequence contains n = n_1 + … + n_m superpixels. For each superpixel p, the ground-truth depth value at its centroid is denoted d_p, and the ground-truth depth vector of the n superpixels is defined as d = [d_1; …; d_n].
(2.2) Building the spatial-consistency similarity matrix S^(s) of these n superpixels as follows: S^(s) is an n × n matrix whose entry S^(s)_pq describes the within-frame similarity between the p-th and q-th superpixels; it is computed from the color-histogram features c_p and c_q of superpixels p and q, where γ is a manually set parameter that may be set to the median of ||c_p − c_q||_2 over all pairs of adjacent superpixels.
(2.3) Building the temporal-consistency similarity matrix S^(t) of these n superpixels as follows: S^(t) is an n × n matrix whose entry S^(t)_pq describes the between-frame similarity between the p-th and q-th superpixels; the correspondences between superpixels in adjacent frames are given by the spatio-temporally consistent superpixel segmentation of step (2.1).
3. The method for generating spatio-temporally consistent depth map sequences according to claim 2, characterized in that the convolutional neural network built in step 3) consists of two parts, a single-superpixel depth regression network and a spatio-temporal consistency conditional random field loss layer:
(3.1) The single-superpixel depth regression network consists of the first 31 layers of the VGG16 network, one superpixel pooling layer, and three fully connected layers. The superpixel pooling layer performs average pooling of the features within the spatial extent of each superpixel. The input of the network is the m consecutive RGB frames, and the output is an n-dimensional vector z = [z_1, …, z_n], whose p-th element z_p is the depth estimate, without any consistency constraint, of the p-th superpixel obtained from the spatio-temporally consistent superpixel segmentation of the sequence. The parameters of this convolutional neural network to be learned are denoted W.
(3.2) The input of the spatio-temporal consistency conditional random field loss layer is the output z = [z_1, …, z_n] of the single-superpixel regression network of step (3.1), the ground-truth superpixel depth vector d = [d_1; …; d_n] defined in step (2.1), and the spatial-consistency similarity matrix S^(s) and temporal-consistency similarity matrix S^(t) obtained in steps (2.2) and (2.3). The loss function is defined as

J = −log P(d | I) = d^T L d − 2 z^T d + z^T L^(−1) z − (1/2) log(|L|) + (n/2) log(π),

where L^(−1) denotes the inverse matrix of L, and

L = I_n + D − M, M = α^(s) S^(s) + α^(t) S^(t),

where S^(s) and S^(t) are the spatial and temporal similarity matrices obtained in steps (2.2) and (2.3), α^(s) and α^(t) are two parameters to be learned, I_n is the n × n identity matrix, and D is the diagonal matrix with D_pp = Σ_q M_pq.
4. The method for generating spatio-temporally consistent depth map sequences according to claim 3, characterized in that the training process of the convolutional neural network in step 4) is specifically:
(4.1) The network parameters W, α^(s) and α^(t) are optimized by stochastic gradient descent; in each iteration the parameters are updated as

W = W − lr · ∂J/∂W,
α^(s) = α^(s) − lr · ∂J/∂α^(s),
α^(t) = α^(t) − lr · ∂J/∂α^(t),

where lr is the learning rate.
(4.2) The partial derivative of the loss function J with respect to the parameters W is computed as

∂J/∂W = 2 (L^(−1) z − d)^T ∂z/∂W,

where ∂z/∂W is computed layer by layer through the back-propagation of the convolutional neural network.
(4.3) The partial derivatives of the loss function J with respect to the parameters α^(s) and α^(t) are computed as

∂J/∂α^(s) = d^T A^(s) d − z^T L^(−1) A^(s) L^(−1) z − (1/2) Tr(L^(−1) A^(s)),
∂J/∂α^(t) = d^T A^(t) d − z^T L^(−1) A^(t) L^(−1) z − (1/2) Tr(L^(−1) A^(t)),

where Tr(·) denotes the trace of a matrix, and A^(s) and A^(t) are the partial derivatives of the matrix L with respect to α^(s) and α^(t), computed as

A^(s)_pq = −S^(s)_pq + δ(p = q) Σ_k S^(s)_pk,
A^(t)_pq = −S^(t)_pq + δ(p = q) Σ_k S^(t)_pk,

where δ(p = q) equals 1 when p = q and 0 otherwise.
5. The method for generating spatio-temporally consistent depth map sequences according to claim 4, characterized in that, in step 5), the method of recovering the depth of an RGB image sequence of unknown depth is specifically:
(5.1) Performing spatio-temporally consistent superpixel segmentation on the RGB image sequence, and computing the spatial similarity matrix S^(s) and the temporal similarity matrix S^(t);
(5.2) Carrying out forward propagation of the RGB image sequence through the trained convolutional neural network to obtain the single-superpixel network output z;
(5.3) The depth output under the spatio-temporal consistency constraint, d̂, is computed as

d̂ = L^(−1) z,

where the matrix L is computed by the method described in step (3.2), and d̂_p represents the estimated depth of the p-th superpixel of the RGB image sequence;

(5.4) Assigning each d̂_p to the corresponding position of its superpixel in the corresponding frame, thereby obtaining the depth maps of the m frames.
CN201611244732.0A 2016-12-29 2016-12-29 Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks Active CN106612427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611244732.0A CN106612427B (en) 2016-12-29 2016-12-29 Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611244732.0A CN106612427B (en) 2016-12-29 2016-12-29 Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks

Publications (2)

Publication Number Publication Date
CN106612427A true CN106612427A (en) 2017-05-03
CN106612427B CN106612427B (en) 2018-07-06

Family

ID=58636373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611244732.0A Active CN106612427B (en) 2016-12-29 2016-12-29 Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks

Country Status (1)

Country Link
CN (1) CN106612427B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292846A (en) * 2017-06-27 2017-10-24 南方医科大学 The restoration methods of incomplete CT data for projection under a kind of circular orbit
CN107992848A (en) * 2017-12-19 2018-05-04 北京小米移动软件有限公司 Obtain the method, apparatus and computer-readable recording medium of depth image
CN108335322A (en) * 2018-02-01 2018-07-27 深圳市商汤科技有限公司 Depth estimation method and device, electronic equipment, program and medium
CN108389226A (en) * 2018-02-12 2018-08-10 北京工业大学 A kind of unsupervised depth prediction approach based on convolutional neural networks and binocular parallax
CN108596102A (en) * 2018-04-26 2018-09-28 北京航空航天大学青岛研究院 Indoor scene object segmentation grader building method based on RGB-D
CN109215067A (en) * 2017-07-03 2019-01-15 百度(美国)有限责任公司 High-resolution 3-D point cloud is generated based on CNN and CRF model
CN109657839A (en) * 2018-11-22 2019-04-19 天津大学 A kind of wind power forecasting method based on depth convolutional neural networks
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110782490A (en) * 2019-09-24 2020-02-11 武汉大学 Video depth map estimation method and device with space-time consistency
CN111259782A (en) * 2020-01-14 2020-06-09 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation
CN114596637A (en) * 2022-03-23 2022-06-07 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment
US11423615B1 (en) * 2018-05-29 2022-08-23 HL Acquisition, Inc. Techniques for producing three-dimensional models from one or more two-dimensional images

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102196292A (en) * 2011-06-24 2011-09-21 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
US20130177236A1 (en) * 2012-01-10 2013-07-11 Samsung Electronics Co., Ltd. Method and apparatus for processing depth image
CN103955942A (en) * 2014-05-22 2014-07-30 哈尔滨工业大学 SVM-based depth map extraction method of 2D image
CN105359190A (en) * 2013-09-05 2016-02-24 电子湾有限公司 Estimating depth from a single image
CN105657402A (en) * 2016-01-18 2016-06-08 深圳市未来媒体技术研究院 Depth map recovery method
CN105979244A (en) * 2016-05-31 2016-09-28 十二维度(北京)科技有限公司 Method and system used for converting 2D image to 3D image based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102196292A (en) * 2011-06-24 2011-09-21 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
US20130177236A1 (en) * 2012-01-10 2013-07-11 Samsung Electronics Co., Ltd. Method and apparatus for processing depth image
CN105359190A (en) * 2013-09-05 2016-02-24 电子湾有限公司 Estimating depth from a single image
CN103955942A (en) * 2014-05-22 2014-07-30 哈尔滨工业大学 SVM-based depth map extraction method of 2D image
CN105657402A (en) * 2016-01-18 2016-06-08 深圳市未来媒体技术研究院 Depth map recovery method
CN105979244A (en) * 2016-05-31 2016-09-28 十二维度(北京)科技有限公司 Method and system used for converting 2D image to 3D image based on deep learning

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292846B (en) * 2017-06-27 2020-11-10 南方医科大学 Recovery method of incomplete CT projection data under circular orbit
CN107292846A (en) * 2017-06-27 2017-10-24 南方医科大学 The restoration methods of incomplete CT data for projection under a kind of circular orbit
CN109215067A (en) * 2017-07-03 2019-01-15 百度(美国)有限责任公司 High-resolution 3-D point cloud is generated based on CNN and CRF model
CN109215067B (en) * 2017-07-03 2023-03-10 百度(美国)有限责任公司 High-resolution 3-D point cloud generation based on CNN and CRF models
CN107992848B (en) * 2017-12-19 2020-09-25 北京小米移动软件有限公司 Method and device for acquiring depth image and computer readable storage medium
CN107992848A (en) * 2017-12-19 2018-05-04 北京小米移动软件有限公司 Obtain the method, apparatus and computer-readable recording medium of depth image
US11308638B2 (en) 2018-02-01 2022-04-19 Shenzhen Sensetime Technology Co., Ltd. Depth estimation method and apparatus, electronic device, program, and medium
CN108335322A (en) * 2018-02-01 2018-07-27 深圳市商汤科技有限公司 Depth estimation method and device, electronic equipment, program and medium
CN108335322B (en) * 2018-02-01 2021-02-12 深圳市商汤科技有限公司 Depth estimation method and apparatus, electronic device, program, and medium
CN108389226A (en) * 2018-02-12 2018-08-10 北京工业大学 A kind of unsupervised depth prediction approach based on convolutional neural networks and binocular parallax
CN108596102A (en) * 2018-04-26 2018-09-28 北京航空航天大学青岛研究院 Indoor scene object segmentation grader building method based on RGB-D
US11423615B1 (en) * 2018-05-29 2022-08-23 HL Acquisition, Inc. Techniques for producing three-dimensional models from one or more two-dimensional images
CN109657839A (en) * 2018-11-22 2019-04-19 天津大学 A kind of wind power forecasting method based on depth convolutional neural networks
CN110163246B (en) * 2019-04-08 2021-03-30 杭州电子科技大学 Monocular light field image unsupervised depth estimation method based on convolutional neural network
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110782490B (en) * 2019-09-24 2022-07-05 武汉大学 Video depth map estimation method and device with space-time consistency
CN110782490A (en) * 2019-09-24 2020-02-11 武汉大学 Video depth map estimation method and device with space-time consistency
CN111259782A (en) * 2020-01-14 2020-06-09 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation
CN114596637A (en) * 2022-03-23 2022-06-07 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment
CN114596637B (en) * 2022-03-23 2024-02-06 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment

Also Published As

Publication number Publication date
CN106612427B (en) 2018-07-06

Similar Documents

Publication Publication Date Title
CN106612427B (en) Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks
US10540590B2 (en) Method for generating spatial-temporally consistent depth map sequences based on convolution neural networks
Sun et al. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume
Zou et al. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency
Zhou et al. Moving indoor: Unsupervised video depth learning in challenging environments
Yue et al. Image denoising by exploring external and internal correlations
CN109360156A (en) Single image rain removing method based on the image block for generating confrontation network
CN102026013A (en) Stereo video matching method based on affine transformation
Peng et al. LVE-S2D: Low-light video enhancement from static to dynamic
Zhang et al. Multiscale-vr: Multiscale gigapixel 3d panoramic videography for virtual reality
Li et al. Enforcing temporal consistency in video depth estimation
CN107018400B (en) It is a kind of by 2D Video Quality Metrics into the method for 3D videos
Meng et al. Perception inspired deep neural networks for spectral snapshot compressive imaging
Cho et al. Event-image fusion stereo using cross-modality feature propagation
Dong et al. Cycle-CNN for colorization towards real monochrome-color camera systems
Guo et al. Adaptive estimation of depth map for two-dimensional to three-dimensional stereoscopic conversion
Yeh et al. An approach to automatic creation of cinemagraphs
Li et al. Graph-based saliency fusion with superpixel-level belief propagation for 3D fixation prediction
Dong et al. Pyramid convolutional network for colorization in monochrome-color multi-lens camera system
WO2022257184A1 (en) Method for acquiring image generation apparatus, and image generation apparatus
CN106028018B (en) Real scene shooting double vision point 3D method for optimizing video and system towards naked eye 3D display
Kim et al. Light field angular super-resolution using convolutional neural network with residual network
Zhou et al. 1st Place Solution of Egocentric 3D Hand Pose Estimation Challenge 2023 Technical Report: A Concise Pipeline for Egocentric Hand Pose Reconstruction
Lee et al. Efficient Low Light Video Enhancement Based on Improved Retinex Algorithms
CN112200756A (en) Intelligent bullet special effect short video generation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant