CN106612427A - Method for generating spatial-temporal consistency depth map sequence based on convolution neural network - Google Patents

Method for generating spatial-temporal consistency depth map sequence based on convolution neural network

Info

Publication number
CN106612427A
Authority
CN
China
Prior art keywords
super pixel, depth, space, matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611244732.0A
Other languages
Chinese (zh)
Other versions
CN106612427B (en)
Inventor
王勋
赵绪然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN201611244732.0A priority Critical patent/CN106612427B/en
Publication of CN106612427A publication Critical patent/CN106612427A/en
Application granted granted Critical
Publication of CN106612427B publication Critical patent/CN106612427B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/122 Improving the 3D impression of stereoscopic images by modifying image signal contents, e.g. by filtering or adding monoscopic depth cues
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/128 Adjusting depth or disparity

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for generating a spatio-temporally consistent depth map sequence based on a convolutional neural network, which can be used in the 2D-to-3D conversion of film and television works. The method comprises the following steps: (1) collecting a training set, wherein each training sample is composed of a continuous RGB image sequence and the corresponding depth map sequence; (2) carrying out spatio-temporally consistent superpixel segmentation on each image sequence in the training set, and constructing a spatial similarity matrix and a temporal similarity matrix; (3) constructing a convolutional neural network composed of a single-superpixel depth regression network and a spatio-temporal consistency conditional random field loss layer; (4) training the convolutional neural network; and (5) for an RGB image sequence of unknown depth, using the trained network to recover the corresponding depth map sequence through forward propagation. The method avoids both the over-reliance on scene assumptions of cue-based depth recovery methods and the inter-frame discontinuity of depth maps produced by existing depth recovery methods based on convolutional neural networks.

Description

Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks
Technical field
The present invention relates to the field of stereoscopic video in computer vision, and in particular to a method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks.
Background technology
The basic principle of stereoscopic video is to superimpose and play two images with horizontal parallax; the viewer sees the left-eye and right-eye pictures separately through stereoscopic glasses and thereby perceives depth. Stereoscopic video offers an immersive three-dimensional viewing experience and is very popular. However, as 3D display hardware becomes more widespread, the shortage of 3D content becomes increasingly apparent. Shooting directly with 3D cameras is expensive and post-production is difficult, so it is generally feasible only for big-budget films. The 2D-to-3D conversion of film and television works is therefore an effective way to alleviate the shortage of 3D sources: it can greatly expand the range and quantity of stereoscopic productions and can also bring classic works back to the screen.
Because the horizontal parallax in stereoscopic video is directly related to the depth of each pixel, obtaining the depth map corresponding to each video frame is the key to 2D-to-3D conversion. Depth maps can be produced by manually rotoscoping each frame and assigning depth values, but this is very expensive. Semi-automatic depth map generation methods also exist, in which the depth maps of a few key frames are drawn manually and then propagated by computer to the other frames. Although such methods save some time, large-scale 2D-to-3D conversion of film and television works still requires a heavy amount of manual work.
In comparison, fully automatic depth recovery methods save the most labor. Some algorithms recover depth maps from depth cues such as motion, focus, occlusion or shading using specific rules, but they are usually effective only for particular scenes. For example, structure-from-motion methods can recover the depth of a static scene shot by a moving camera from the cue that distant objects have small relative displacement between adjacent frames while nearby objects have large relative displacement, but they fail when the subject moves or the camera is static; focus-based depth recovery works for shallow depth-of-field images but performs poorly when the depth of field is large. Film and television works usually contain a wide variety of scenes, so cue-based depth recovery methods are difficult to apply universally.
Convolutional neural networks are deep neural networks particularly suited to images. They are built by stacking elementary units such as convolutional layers, activation layers, pooling layers and fully connected layers, and can model complicated functions from an image input x to a specific output y; they now dominate machine vision problems such as image classification and image segmentation. In the last couple of years, several methods have applied convolutional neural networks to depth recovery, learning the mapping from an RGB image input to a depth map output from large amounts of data. Depth recovery based on convolutional neural networks does not rely on particular scene assumptions, generalizes well, and achieves high accuracy, so it has great application potential in the 2D-to-3D conversion of film and television works. However, existing methods optimize over single images when training the convolutional neural network and ignore the continuity between frames. When they are applied to recover the depth of an image sequence, the depth maps recovered for adjacent frames show obvious jumps; these jumps make the synthesized virtual views flicker and seriously degrade the viewing experience. Moreover, inter-frame continuity also provides important cues for depth recovery, which existing methods simply ignore.
Summary of the invention
The object of the present invention is to address the deficiencies of the prior art by providing a method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks. The temporal continuity of the RGB images and depth maps is introduced into the convolutional neural network, and multiple frames are jointly optimized during training, so as to generate depth maps that are continuous in the time domain and to improve the accuracy of depth recovery.
The object of the present invention is achieved through the following technical solution: a method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks, comprising the following steps:
1) Collect a training set. Each training sample of the training set consists of a continuous RGB image sequence of m frames and its corresponding depth map sequence (a minimal data-structure sketch of such a sample is given after this list);
2) Perform spatio-temporally consistent superpixel segmentation on each image sequence in the training set, and build a spatial similarity matrix S^(s) and a temporal similarity matrix S^(t);
3) Build a convolutional neural network composed of a single-superpixel depth regression network with parameters W and a spatio-temporal consistency conditional random field loss layer with parameters α. The single-superpixel depth regression network regresses one depth value for each superpixel without considering the spatio-temporal consistency constraints; the conditional random field loss layer uses the temporal and spatial similarity matrices built in step 2) to constrain the output of the single-superpixel regression network, so that the final output is an estimated depth map that is smooth in both the time domain and the spatial domain.
4) Train the convolutional neural network built in step 3) with the RGB image sequences and depth map sequences in the training set, obtaining the network parameters W and α.
5) For an RGB image sequence of unknown depth, recover the depth map sequence by forward propagation through the trained network.
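As an illustration of the data layout assumed by steps 1) and 2), the following minimal Python sketch shows one training sample: an m-frame RGB sequence, its depth maps, and the superpixel bookkeeping produced by the segmentation. The field names are hypothetical and not part of the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """One training sample: m consecutive RGB frames with their depth maps."""
    rgb: np.ndarray     # (m, H, W, 3) RGB frames
    depth: np.ndarray   # (m, H, W) ground-truth depth per pixel
    # produced by the spatio-temporally consistent superpixel segmentation:
    labels: np.ndarray  # (m, H, W) superpixel index of every pixel,
                        # running over all n = n_1 + ... + n_m superpixels
    links: list         # (p, q) pairs of superpixels in adjacent frames
                        # that belong to the same object

    @property
    def n_superpixels(self) -> int:
        return int(self.labels.max()) + 1
```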
Further, step 2) is specifically:
(2.1) Perform spatio-temporally consistent superpixel segmentation on each continuous RGB image sequence in the training set. Denote the input sequence by I = [I_1, …, I_m], where I_t is the t-th RGB frame and there are m frames in total. The spatio-temporally consistent superpixel segmentation divides the m frames into n_1, …, n_m superpixels respectively, and produces the correspondence between each superpixel in a frame and the superpixel in the previous frame that belongs to the same object. The whole image sequence contains n = n_1 + … + n_m superpixels. For each superpixel p, the ground-truth depth value at its centroid is denoted d_p, and the ground-truth depth vector of the n superpixels is defined as d = [d_1; …; d_n].
(2.2) Build the spatial-consistency similarity matrix S^(s) of these n superpixels as follows. S^(s) is an n × n matrix whose entry S^(s)_pq describes the within-frame similarity between the p-th and q-th superpixels; it is computed from the color-histogram features c_p and c_q of superpixels p and q, where γ is a manually set parameter that may be set to the median of ||c_p − c_q||_2 over all pairs of adjacent superpixels.
(2.3) Build the temporal-consistency similarity matrix S^(t) of these n superpixels as follows. S^(t) is an n × n matrix whose entry S^(t)_pq describes the between-frame similarity between the p-th and q-th superpixels; the correspondences between superpixels in adjacent frames are given by the spatio-temporally consistent superpixel segmentation of step (2.1).
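The exact similarity formulas referred to in (2.2) and (2.3) appear only in the patent drawings and are not reproduced above. The sketch below is one plausible reading, assuming a Gaussian kernel on the color-histogram distance for spatially adjacent superpixels and binary entries for temporally corresponding superpixels; both forms are assumptions of this sketch, not the patent's verbatim definitions.

```python
import numpy as np

def spatial_similarity(hist, adjacent_pairs, gamma=None):
    """S^(s): assumed Gaussian kernel on the color-histogram distance of
    superpixel pairs that are adjacent within the same frame."""
    n = hist.shape[0]
    dists = np.array([np.linalg.norm(hist[p] - hist[q]) for p, q in adjacent_pairs])
    if gamma is None:
        gamma = np.median(dists)   # median of ||c_p - c_q||_2, as in step (2.2)
    S = np.zeros((n, n))
    for (p, q), dist in zip(adjacent_pairs, dists):
        S[p, q] = S[q, p] = np.exp(-dist / gamma)
    return S

def temporal_similarity(n, links):
    """S^(t): assumed binary matrix, 1 for superpixels linked to the same
    object in adjacent frames, 0 otherwise."""
    S = np.zeros((n, n))
    for p, q in links:
        S[p, q] = S[q, p] = 1.0
    return S
```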
Further, the convolutional neural network built in step 3) consists of two parts, a single-superpixel depth regression network and a spatio-temporal consistency conditional random field loss layer:
(3.1) The single-superpixel depth regression network consists of the first 31 layers of the VGG16 network, one superpixel pooling layer, and three fully connected layers. The superpixel pooling layer performs average pooling of the features within the spatial extent of each superpixel. The input of the network is the m consecutive RGB frames, and the output is an n-dimensional vector z = [z_1, …, z_n], whose p-th element z_p is the depth estimate, without any consistency constraint, of the p-th superpixel obtained from the spatio-temporally consistent superpixel segmentation of the sequence. The parameters of this convolutional neural network to be learned are denoted W.
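A minimal sketch of the superpixel average pooling described in (3.1), assuming the per-pixel feature map of a frame has already been brought to the frame resolution; function and variable names are illustrative, not from the patent.

```python
import numpy as np

def superpixel_average_pool(features, labels, n_superpixels):
    """Average per-pixel features over each superpixel of one frame.

    features: (H, W, C) feature map; labels: (H, W) superpixel index per pixel.
    Returns an (n_superpixels, C) matrix with one pooled vector per superpixel."""
    H, W, C = features.shape
    flat_feat = features.reshape(-1, C)
    flat_lab = labels.reshape(-1)
    counts = np.bincount(flat_lab, minlength=n_superpixels)
    pooled = np.zeros((n_superpixels, C))
    for c in range(C):
        pooled[:, c] = np.bincount(flat_lab, weights=flat_feat[:, c],
                                   minlength=n_superpixels)
    return pooled / np.maximum(counts, 1)[:, None]
```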
(3.2) The spatio-temporal consistency conditional random field loss layer takes as input the output z = [z_1, …, z_n] of the single-superpixel regression network of step (3.1), the ground-truth superpixel depth vector d = [d_1; …; d_n] defined in step (2.1), and the spatial-consistency similarity matrix S^(s) and temporal-consistency similarity matrix S^(t) obtained in steps (2.2) and (2.3). The conditional probability function of the spatio-temporal consistency conditional random field is

P(d | I) = exp(−E(d, I)) / Z(I), with partition function Z(I) = ∫ exp(−E(d, I)) dd,

where the energy function E(d, I) is the sum of three terms:
The first term, Σ_{p∈N} (d_p − z_p)^2, is the gap between the per-superpixel prediction and the ground-truth value. The second term is the spatial-consistency constraint: if superpixels p and q are adjacent within the same frame and their colors are similar (S^(s)_pq is relatively large), their depths should be similar. The third term is the temporal-consistency constraint: if superpixels p and q correspond to the same object in two adjacent frames (S^(t)_pq = 1), their depths should be similar. In matrix form the energy function can be written as:
E(d, I) = d^T L d − 2 z^T d + z^T z
where L = I_n + D − M and M = α^(s) S^(s) + α^(t) S^(t); S^(s) and S^(t) are the spatial and temporal similarity matrices obtained in steps (2.2) and (2.3), α^(s) and α^(t) are two parameters to be learned, I_n is the n × n identity matrix, and D is the diagonal matrix with D_pp = Σ_q M_pq.
Since the energy is quadratic in d, the partition function can be evaluated in closed form, and the loss function is defined as the negative logarithm of the conditional probability function:

J = −log P(d | I) = d^T L d − 2 z^T d + z^T L^(−1) z − (1/2) log(|L|) + (n/2) log(π),

where L^(−1) denotes the inverse matrix of L and |L| denotes the determinant of L.
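A numerical sketch of the loss layer of (3.2), using the matrix form above and the reconstruction L = I_n + D − M; this is a sketch under those stated definitions, not an authoritative implementation of the patent.

```python
import numpy as np

def build_L(S_s, S_t, alpha_s, alpha_t):
    """L = I_n + D - M, with M = alpha^(s) S^(s) + alpha^(t) S^(t)
    and D the diagonal matrix of row sums of M."""
    n = S_s.shape[0]
    M = alpha_s * S_s + alpha_t * S_t
    D = np.diag(M.sum(axis=1))
    return np.eye(n) + D - M

def crf_negative_log_likelihood(d, z, L):
    """J = d'Ld - 2 z'd + z'L^{-1}z - 0.5 log|L| + 0.5 n log(pi)."""
    n = d.shape[0]
    L_inv_z = np.linalg.solve(L, z)
    _, logdet = np.linalg.slogdet(L)
    return (d @ L @ d - 2 * z @ d + z @ L_inv_z
            - 0.5 * logdet + 0.5 * n * np.log(np.pi))
```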
Further, the training process of the convolutional neural network in step 4) is specifically:
(4.1) The network parameters W, α^(s) and α^(t) are optimized by stochastic gradient descent. In each iteration the parameters are updated as

W = W − lr · ∂J/∂W, α^(s) = α^(s) − lr · ∂J/∂α^(s), α^(t) = α^(t) − lr · ∂J/∂α^(t),

where lr is the learning rate.
(4.2) The partial derivative of the loss function J with respect to the parameters W in step (4.1) is computed as

∂J/∂W = 2 (L^(−1) z − d)^T ∂z/∂W,

where ∂z/∂W is computed layer by layer through the back-propagation of the convolutional neural network.
(4.3) The partial derivatives of the loss function J with respect to the parameters α^(s) and α^(t) are computed as

∂J/∂α^(s) = d^T A^(s) d − z^T L^(−1) A^(s) L^(−1) z − (1/2) Tr(L^(−1) A^(s)),
∂J/∂α^(t) = d^T A^(t) d − z^T L^(−1) A^(t) L^(−1) z − (1/2) Tr(L^(−1) A^(t)),

where Tr(·) denotes the trace of a matrix, and A^(s) and A^(t) are the partial derivatives of the matrix L with respect to α^(s) and α^(t), computed as

A^(s)_pq = −S^(s)_pq + δ(p = q) Σ_k S^(s)_pk,
A^(t)_pq = −S^(t)_pq + δ(p = q) Σ_k S^(t)_pk,

where δ(p = q) equals 1 when p = q and 0 otherwise.
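The gradient formulas of (4.2) and (4.3) translate directly into code. The following sketch (illustrative only, reusing build_L from the previous sketch) returns ∂J/∂z, which is back-propagated through the CNN to give ∂J/∂W, together with ∂J/∂α^(s) and ∂J/∂α^(t):

```python
import numpy as np

def crf_gradients(d, z, S_s, S_t, alpha_s, alpha_t):
    """Gradients of J w.r.t. the network output z and the CRF weights."""
    L = build_L(S_s, S_t, alpha_s, alpha_t)   # from the previous sketch
    L_inv = np.linalg.inv(L)
    L_inv_z = L_inv @ z

    dJ_dz = 2.0 * (L_inv_z - d)               # chained into dJ/dW by backprop

    def dJ_dalpha(S):
        A = np.diag(S.sum(axis=1)) - S        # A = dL/dalpha
        return (d @ A @ d
                - L_inv_z @ (A @ L_inv_z)     # = z' L^{-1} A L^{-1} z
                - 0.5 * np.trace(L_inv @ A))
    return dJ_dz, dJ_dalpha(S_s), dJ_dalpha(S_t)
```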
Further, in step 5), the method of recovering the depth of an RGB image sequence of unknown depth is specifically:
(5.1) Perform spatio-temporally consistent superpixel segmentation on the RGB image sequence according to the method of step 2), and compute the spatial similarity matrix S^(s) and the temporal similarity matrix S^(t).
(5.2) Carry out forward propagation of the RGB image sequence through the trained convolutional neural network to obtain the single-superpixel network output z.
(5.3) The depth output under the spatio-temporal consistency constraint, d̂, is computed as

d̂ = L^(−1) z,

where the matrix L is computed by the method described in step (3.2), and d̂_p represents the estimated depth of the p-th superpixel of the RGB image sequence.
(5.4) Assign each d̂_p to the corresponding position of its superpixel in the corresponding frame, thereby obtaining the depth maps of the m frames.
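A minimal inference sketch for step 5), combining the pieces above; build_L is reused from the earlier sketch, and z is assumed to come from the trained single-superpixel network.

```python
import numpy as np

def recover_depth_maps(z, labels, S_s, S_t, alpha_s, alpha_t):
    """Steps (5.3)-(5.4): solve d_hat = L^{-1} z and paint each superpixel's
    depth back onto its pixels.

    z: (n,) per-superpixel network output; labels: (m, H, W) superpixel
    index of every pixel, with indices 0..n-1 over the whole sequence."""
    L = build_L(S_s, S_t, alpha_s, alpha_t)
    d_hat = np.linalg.solve(L, z)     # d_hat = L^{-1} z
    return d_hat[labels]              # (m, H, W) depth map sequence
```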
The beneficial effects of the present invention are as follows:
First, compared with depth recovery methods based on depth cues, the present invention uses a convolutional neural network to learn the mapping from RGB images to depth maps and does not rely on specific assumptions about the scene.
Second, compared with existing depth recovery methods based on convolutional neural networks, which optimize single frames only, the present invention adds spatio-temporal consistency constraints and jointly optimizes multiple frames through the spatio-temporal consistency conditional random field loss layer, so it can output spatio-temporally consistent depth maps and avoid inter-frame jumps.
Third, compared with existing depth recovery methods based on convolutional neural networks, the spatio-temporal consistency constraints added by the present invention improve the accuracy of depth recovery.
The present invention has been compared, on the public dataset NYU Depth v2 and on LYB 3D-TV, a dataset collected by the inventors, with Eigen, David, Christian Puhrsch, and Rob Fergus, "Depth map prediction from a single image using a multi-scale deep network", Advances in Neural Information Processing Systems, 2014, and other existing methods. The results show that the proposed method significantly improves the temporal continuity of the recovered depth maps as well as the accuracy of depth estimation.
Description of the drawings
Fig. 1 is a flow chart of an embodiment of the present invention;
Fig. 2 is a structural diagram of the convolutional neural network proposed by the present invention;
Fig. 3 is a structural diagram of the single-superpixel depth regression network;
Fig. 4 is a schematic diagram of the single-superpixel network acting on multiple frames.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
As shown in the flow chart of Fig. 1, the method of the present invention comprises the following steps:
1) Collect a training set. Each training sample of the training set consists of a continuous RGB image sequence of m frames and its corresponding depth map sequence;
2) Perform spatio-temporally consistent superpixel segmentation on each image sequence in the training set using the method proposed in Chang, Jason, et al., "A video representation using temporal superpixels", CVPR 2013, and build the spatial similarity matrix S^(s) and the temporal similarity matrix S^(t);
3) Build a convolutional neural network composed of a single-superpixel depth regression network with parameters W and a spatio-temporal consistency conditional random field loss layer with parameters α. The single-superpixel depth regression network regresses one depth value for each superpixel without considering the spatio-temporal consistency constraints; the conditional random field loss layer uses the temporal and spatial similarity matrices built in step 2) to constrain the output of the single-superpixel regression network, so that the final output is an estimated depth map that is smooth in both the time domain and the spatial domain.
4) Train the convolutional neural network built in step 3) with the RGB image sequences and depth map sequences in the training set, obtaining the network parameters W and α.
5) For an RGB image sequence of unknown depth, recover the depth map sequence by forward propagation through the trained network.
The implementation of step 2) is described as follows:
(2.1) Perform spatio-temporally consistent superpixel segmentation on each continuous RGB image sequence in the training set using the method proposed in Chang, Jason, et al., "A video representation using temporal superpixels", CVPR 2013. Denote the input sequence by I = [I_1, …, I_m], where I_t is the t-th RGB frame and there are m frames in total. The spatio-temporally consistent superpixel segmentation divides the m frames into n_1, …, n_m superpixels respectively, and produces the correspondence between each superpixel in a frame and the superpixel in the previous frame that belongs to the same object. The whole image sequence contains n = n_1 + … + n_m superpixels. For each superpixel p, the ground-truth depth value at its centroid is denoted d_p, and the ground-truth depth vector of the n superpixels is defined as d = [d_1; …; d_n].
(2.2) Build the spatial-consistency similarity matrix S^(s) of these n superpixels as follows. S^(s) is an n × n matrix whose entry S^(s)_pq describes the within-frame similarity between the p-th and q-th superpixels; it is computed from the color-histogram features c_p and c_q of superpixels p and q, where γ is a manually set parameter that may be set to the median of ||c_p − c_q||_2 over all pairs of adjacent superpixels.
(2.3) Build the temporal-consistency similarity matrix S^(t) of these n superpixels as follows. S^(t) is an n × n matrix whose entry S^(t)_pq describes the between-frame similarity between the p-th and q-th superpixels; the correspondences between superpixels in adjacent frames are given by the spatio-temporally consistent superpixel segmentation of step (2.1).
The implementation of step 3) is described as follows:
(3.1) The convolutional neural network built by this method consists of two parts: a single-superpixel depth regression network and a spatio-temporal consistency conditional random field loss layer; the overall network structure is shown in Fig. 2.
(3.2) The single-superpixel depth regression network described in step (3.1) consists of the first 31 layers of the VGG16 network proposed in Simonyan, Karen, and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition", arXiv preprint arXiv:1409.1556 (2014), two convolutional layers, one superpixel pooling layer, and three fully connected layers; the network structure is shown in Fig. 3. The superpixel pooling layer performs average pooling of the features within the spatial extent of each superpixel, while the other convolutional, pooling and activation layers are conventional layers of convolutional neural networks. For an input of m consecutive RGB frames, the network first acts on each frame separately: for the t-th frame, which contains n_t superpixels, the network outputs an n_t-dimensional vector z_t representing the unconstrained depth regression output of each superpixel in that frame. The outputs of the m frames are then concatenated into an n = n_1 + … + n_m dimensional vector z = [z_1; …; z_n], representing the estimated depth regression values of the n superpixels of the image sequence, as shown in Fig. 4. The parameters of this convolutional neural network to be learned are denoted W.
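A sketch of how the per-frame outputs described in (3.2) could be assembled into the full vector z; single_frame_regression stands for the VGG16-based per-frame network and is a hypothetical name.

```python
import numpy as np

def sequence_regression(frames, labels_per_frame, single_frame_regression):
    """Run the single-superpixel regression network on each frame and
    concatenate the per-frame outputs into one vector of length
    n = n_1 + ... + n_m, in frame order."""
    outputs = []
    for frame, labels in zip(frames, labels_per_frame):
        n_t = int(labels.max()) + 1                         # superpixels in frame t
        z_t = single_frame_regression(frame, labels, n_t)   # shape (n_t,)
        outputs.append(np.asarray(z_t))
    return np.concatenate(outputs)                          # shape (n,)
```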
(3.3) The spatio-temporal consistency conditional random field loss layer described in step (3.1) takes as input the output z = [z_1, …, z_n] of the single-superpixel regression network described in step (3.2), the ground-truth superpixel depth vector d = [d_1; …; d_n] defined in step (2.1), and the spatial-consistency similarity matrix S^(s) and temporal-consistency similarity matrix S^(t) obtained in steps (2.2) and (2.3). The conditional probability function of the spatio-temporal consistency conditional random field is

P(d | I) = exp(−E(d, I)) / Z(I), with partition function Z(I) = ∫ exp(−E(d, I)) dd,

where the energy function E(d, I) is the sum of three terms:
The first term, Σ_{p∈N} (d_p − z_p)^2, is the gap between the per-superpixel prediction and the ground-truth value. The second term is the spatial-consistency constraint: if superpixels p and q are adjacent within the same frame and their colors are similar (S^(s)_pq is relatively large), their depths should be similar. The third term is the temporal-consistency constraint: if superpixels p and q correspond to the same object in two adjacent frames (S^(t)_pq = 1), their depths should be similar. In matrix form the energy function can be written as:
E(d, I) = d^T L d − 2 z^T d + z^T z
where L = I_n + D − M and M = α^(s) S^(s) + α^(t) S^(t); S^(s) and S^(t) are the spatial and temporal similarity matrices obtained in steps (2.2) and (2.3), α^(s) and α^(t) are two parameters to be learned, I_n is the n × n identity matrix, and D is the diagonal matrix with D_pp = Σ_q M_pq.
Since the energy is quadratic in d, the partition function can be evaluated in closed form, and the loss function is defined as the negative logarithm of the conditional probability function:

J = −log P(d | I) = d^T L d − 2 z^T d + z^T L^(−1) z − (1/2) log(|L|) + (n/2) log(π),

where L^(−1) denotes the inverse matrix of L and |L| denotes the determinant of L.
The training process of the convolutional neural network in step 4) is specifically:
(4.1) The network parameters W, α^(s) and α^(t) are optimized by stochastic gradient descent. In each iteration the parameters are updated as

W = W − lr · ∂J/∂W, α^(s) = α^(s) − lr · ∂J/∂α^(s), α^(t) = α^(t) − lr · ∂J/∂α^(t),

where lr is the learning rate.
(4.2) The partial derivative of the loss function J with respect to the parameters W in step (4.1) is computed as

∂J/∂W = 2 (L^(−1) z − d)^T ∂z/∂W,

where ∂z/∂W is computed layer by layer through the back-propagation of the convolutional neural network.
(4.3) The partial derivatives of the loss function J with respect to the parameters α^(s) and α^(t) are computed as

∂J/∂α^(s) = d^T A^(s) d − z^T L^(−1) A^(s) L^(−1) z − (1/2) Tr(L^(−1) A^(s)),
∂J/∂α^(t) = d^T A^(t) d − z^T L^(−1) A^(t) L^(−1) z − (1/2) Tr(L^(−1) A^(t)),

where Tr(·) denotes the trace of a matrix, and A^(s) and A^(t) are the partial derivatives of the matrix L with respect to α^(s) and α^(t), computed as

A^(s)_pq = −S^(s)_pq + δ(p = q) Σ_k S^(s)_pk,
A^(t)_pq = −S^(t)_pq + δ(p = q) Σ_k S^(t)_pk,

where δ(p = q) equals 1 when p = q and 0 otherwise.
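Putting (4.1)-(4.3) together, the outer stochastic gradient descent loop could look like the sketch below. cnn_forward and cnn_backward stand for the single-superpixel network's forward pass and for back-propagating ∂J/∂z into ∂J/∂W; these names, and the sample fields carrying the per-superpixel ground-truth depths and similarity matrices, are assumptions of the sketch (crf_gradients is reused from the earlier sketch).

```python
def train(samples, W, alpha_s, alpha_t, lr, epochs, cnn_forward, cnn_backward):
    """SGD over training sequences, updating W, alpha^(s) and alpha^(t)."""
    for _ in range(epochs):
        for s in samples:
            z = cnn_forward(W, s.rgb, s.labels)        # per-superpixel output
            dJ_dz, g_s, g_t = crf_gradients(s.superpixel_depth, z,
                                            s.S_s, s.S_t, alpha_s, alpha_t)
            dJ_dW = cnn_backward(W, dJ_dz)             # chain rule through the CNN
            W = W - lr * dJ_dW                         # updates of step (4.1)
            alpha_s = alpha_s - lr * g_s
            alpha_t = alpha_t - lr * g_t
    return W, alpha_s, alpha_t
```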
In step 5), the method of recovering the depth of an RGB image sequence of unknown depth is specifically:
(5.1) Perform spatio-temporally consistent superpixel segmentation on the RGB image sequence according to the method of step 2), and compute the spatial similarity matrix S^(s) and the temporal similarity matrix S^(t).
(5.2) Carry out forward propagation of the RGB image sequence through the trained convolutional neural network to obtain the single-superpixel network output z.
(5.3) The depth output under the spatio-temporal consistency constraint, d̂, is computed as

d̂ = L^(−1) z,

where the matrix L is computed by the method described in step (3.3), and d̂_p represents the estimated depth of the p-th superpixel of the RGB image sequence.
(5.4) Assign each d̂_p to the corresponding position of its superpixel in the corresponding frame, thereby obtaining the depth maps of the m frames.
Specific embodiment: the present invention was compared with other existing methods on the public dataset NYU Depth v2 and on LYB 3D-TV, a dataset collected by the inventors. The NYU Depth v2 dataset consists of 795 training scenes and 654 test scenes, each scene containing 30 consecutive RGB frames and the corresponding depth maps. The LYB 3D-TV database is taken from scenes of the television series Nirvana in Fire (琅琊榜); 5124 frames from 60 scenes with manually annotated depth maps were chosen as the training set, and 1278 frames from 20 scenes with manually annotated depth maps as the test set. The proposed method was compared in depth recovery accuracy with the following methods:
1. Depth Transfer: Karsch, Kevin, Ce Liu, and Sing Bing Kang, "Depth transfer: Depth extraction from video using non-parametric sampling", IEEE Transactions on Pattern Analysis and Machine Intelligence 36.11 (2014): 2144-2158.
2. Discrete-continuous CRF: Liu, Miaomiao, Mathieu Salzmann, and Xuming He, "Discrete-continuous depth estimation from a single image", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
3. Multi-scale CNN: Eigen, David, Christian Puhrsch, and Rob Fergus, "Depth map prediction from a single image using a multi-scale deep network", Advances in Neural Information Processing Systems, 2014.
4. 2D-DCNF: Liu, Fayao, et al., "Learning depth from single monocular images using deep convolutional neural fields", IEEE Transactions on Pattern Analysis and Machine Intelligence.
The results show that the accuracy of the proposed method is improved relative to the compared methods, and that the inter-frame jitter of the recovered depth maps is significantly reduced.
Table 1: Depth recovery accuracy comparison on the NYU Depth v2 database
Table 2: Depth recovery accuracy comparison on the LYB 3D-TV database

Claims (5)

1. A method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks, characterized in that it comprises the following steps:
1) Collecting a training set, each training sample of which consists of a continuous RGB image sequence of m frames and its corresponding depth map sequence;
2) Performing spatio-temporally consistent superpixel segmentation on each image sequence in the training set, and building a spatial similarity matrix S^(s) and a temporal similarity matrix S^(t);
3) Building a convolutional neural network composed of a single-superpixel depth regression network with parameters W and a spatio-temporal consistency conditional random field loss layer with parameters α;
4) Training the convolutional neural network built in step 3) with the RGB image sequences and depth map sequences in the training set, obtaining the network parameters W and α;
5) For an RGB image sequence of unknown depth, recovering the depth map sequence by forward propagation through the trained network.
2. The method for generating spatio-temporally consistent depth map sequences according to claim 1, characterized in that step 2) is specifically:
(2.1) Performing spatio-temporally consistent superpixel segmentation on each continuous RGB image sequence in the training set. The input sequence is denoted I = [I_1, …, I_m], where I_t is the t-th RGB frame and there are m frames in total. The spatio-temporally consistent superpixel segmentation divides the m frames into n_1, …, n_m superpixels respectively, and produces the correspondence between each superpixel in a frame and the superpixel in the previous frame that belongs to the same object. The whole image sequence contains n = n_1 + … + n_m superpixels. For each superpixel p, the ground-truth depth value at its centroid is denoted d_p, and the ground-truth depth vector of the n superpixels is defined as d = [d_1; …; d_n].
(2.2) Building the spatial-consistency similarity matrix S^(s) of these n superpixels as follows: S^(s) is an n × n matrix whose entry S^(s)_pq describes the within-frame similarity between the p-th and q-th superpixels; it is computed from the color-histogram features c_p and c_q of superpixels p and q, where γ is a manually set parameter that may be set to the median of ||c_p − c_q||_2 over all pairs of adjacent superpixels.
(2.3) Building the temporal-consistency similarity matrix S^(t) of these n superpixels as follows: S^(t) is an n × n matrix whose entry S^(t)_pq describes the between-frame similarity between the p-th and q-th superpixels; the correspondences between superpixels in adjacent frames are given by the spatio-temporally consistent superpixel segmentation of step (2.1).
3. The method for generating spatio-temporally consistent depth map sequences according to claim 2, characterized in that the convolutional neural network built in step 3) consists of two parts, a single-superpixel depth regression network and a spatio-temporal consistency conditional random field loss layer:
(3.1) The single-superpixel depth regression network consists of the first 31 layers of the VGG16 network, one superpixel pooling layer, and three fully connected layers. The superpixel pooling layer performs average pooling of the features within the spatial extent of each superpixel. The input of the network is the m consecutive RGB frames, and the output is an n-dimensional vector z = [z_1, …, z_n], whose p-th element z_p is the depth estimate, without any consistency constraint, of the p-th superpixel obtained from the spatio-temporally consistent superpixel segmentation of the sequence. The parameters of this convolutional neural network to be learned are denoted W.
(3.2) The input of the spatio-temporal consistency conditional random field loss layer is the output z = [z_1, …, z_n] of the single-superpixel regression network of step (3.1), the ground-truth superpixel depth vector d = [d_1; …; d_n] defined in step (2.1), and the spatial-consistency similarity matrix S^(s) and temporal-consistency similarity matrix S^(t) obtained in steps (2.2) and (2.3). The loss function is defined as

J = −log P(d | I) = d^T L d − 2 z^T d + z^T L^(−1) z − (1/2) log(|L|) + (n/2) log(π),

where L^(−1) denotes the inverse matrix of L, and

L = I_n + D − M, M = α^(s) S^(s) + α^(t) S^(t),

where S^(s) and S^(t) are the spatial and temporal similarity matrices obtained in steps (2.2) and (2.3), α^(s) and α^(t) are two parameters to be learned, I_n is the n × n identity matrix, and D is the diagonal matrix with D_pp = Σ_q M_pq.
4. The method for generating spatio-temporally consistent depth map sequences according to claim 3, characterized in that the training process of the convolutional neural network in step 4) is specifically:
(4.1) The network parameters W, α^(s) and α^(t) are optimized by stochastic gradient descent; in each iteration the parameters are updated as

W = W − lr · ∂J/∂W,
α^(s) = α^(s) − lr · ∂J/∂α^(s),
α^(t) = α^(t) − lr · ∂J/∂α^(t),

where lr is the learning rate.
(4.2) The partial derivative of the loss function J with respect to the parameters W is computed as

∂J/∂W = 2 (L^(−1) z − d)^T ∂z/∂W,

where ∂z/∂W is computed layer by layer through the back-propagation of the convolutional neural network.
(4.3) The partial derivatives of the loss function J with respect to the parameters α^(s) and α^(t) are computed as

∂J/∂α^(s) = d^T A^(s) d − z^T L^(−1) A^(s) L^(−1) z − (1/2) Tr(L^(−1) A^(s)),
∂J/∂α^(t) = d^T A^(t) d − z^T L^(−1) A^(t) L^(−1) z − (1/2) Tr(L^(−1) A^(t)),

where Tr(·) denotes the trace of a matrix, and A^(s) and A^(t) are the partial derivatives of the matrix L with respect to α^(s) and α^(t), computed as

A^(s)_pq = −S^(s)_pq + δ(p = q) Σ_k S^(s)_pk,
A^(t)_pq = −S^(t)_pq + δ(p = q) Σ_k S^(t)_pk,

where δ(p = q) equals 1 when p = q and 0 otherwise.
5. The method for generating spatio-temporally consistent depth map sequences according to claim 4, characterized in that, in step 5), the method of recovering the depth of an RGB image sequence of unknown depth is specifically:
(5.1) Performing spatio-temporally consistent superpixel segmentation on the RGB image sequence, and computing the spatial similarity matrix S^(s) and the temporal similarity matrix S^(t);
(5.2) Carrying out forward propagation of the RGB image sequence through the trained convolutional neural network to obtain the single-superpixel network output z;
(5.3) The depth output under the spatio-temporal consistency constraint, d̂, is computed as

d̂ = L^(−1) z,

where the matrix L is computed by the method described in step (3.2), and d̂_p represents the estimated depth of the p-th superpixel of the RGB image sequence;

(5.4) Assigning each d̂_p to the corresponding position of its superpixel in the corresponding frame, thereby obtaining the depth maps of the m frames.
CN201611244732.0A 2016-12-29 2016-12-29 Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks Active CN106612427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611244732.0A CN106612427B (en) 2016-12-29 2016-12-29 Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611244732.0A CN106612427B (en) 2016-12-29 2016-12-29 Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks

Publications (2)

Publication Number Publication Date
CN106612427A true CN106612427A (en) 2017-05-03
CN106612427B CN106612427B (en) 2018-07-06

Family

ID=58636373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611244732.0A Active CN106612427B (en) 2016-12-29 2016-12-29 Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks

Country Status (1)

Country Link
CN (1) CN106612427B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292846A (en) * 2017-06-27 2017-10-24 南方医科大学 The restoration methods of incomplete CT data for projection under a kind of circular orbit
CN107992848A (en) * 2017-12-19 2018-05-04 北京小米移动软件有限公司 Obtain the method, apparatus and computer-readable recording medium of depth image
CN108335322A (en) * 2018-02-01 2018-07-27 深圳市商汤科技有限公司 Depth estimation method and device, electronic equipment, program and medium
CN108389226A (en) * 2018-02-12 2018-08-10 北京工业大学 A kind of unsupervised depth prediction approach based on convolutional neural networks and binocular parallax
CN108596102A (en) * 2018-04-26 2018-09-28 北京航空航天大学青岛研究院 Indoor scene object segmentation grader building method based on RGB-D
CN109215067A (en) * 2017-07-03 2019-01-15 百度(美国)有限责任公司 High-resolution 3-D point cloud is generated based on CNN and CRF model
CN109657839A (en) * 2018-11-22 2019-04-19 天津大学 A kind of wind power forecasting method based on depth convolutional neural networks
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110782490A (en) * 2019-09-24 2020-02-11 武汉大学 Video depth map estimation method and device with space-time consistency
CN111259782A (en) * 2020-01-14 2020-06-09 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation
CN114596637A (en) * 2022-03-23 2022-06-07 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment
US11423615B1 (en) * 2018-05-29 2022-08-23 HL Acquisition, Inc. Techniques for producing three-dimensional models from one or more two-dimensional images

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102196292A (en) * 2011-06-24 2011-09-21 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
US20130177236A1 (en) * 2012-01-10 2013-07-11 Samsung Electronics Co., Ltd. Method and apparatus for processing depth image
CN103955942A (en) * 2014-05-22 2014-07-30 哈尔滨工业大学 SVM-based depth map extraction method of 2D image
CN105359190A (en) * 2013-09-05 2016-02-24 电子湾有限公司 Estimating depth from a single image
CN105657402A (en) * 2016-01-18 2016-06-08 深圳市未来媒体技术研究院 Depth map recovery method
CN105979244A (en) * 2016-05-31 2016-09-28 十二维度(北京)科技有限公司 Method and system used for converting 2D image to 3D image based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102196292A (en) * 2011-06-24 2011-09-21 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
US20130177236A1 (en) * 2012-01-10 2013-07-11 Samsung Electronics Co., Ltd. Method and apparatus for processing depth image
CN105359190A (en) * 2013-09-05 2016-02-24 电子湾有限公司 Estimating depth from a single image
CN103955942A (en) * 2014-05-22 2014-07-30 哈尔滨工业大学 SVM-based depth map extraction method of 2D image
CN105657402A (en) * 2016-01-18 2016-06-08 深圳市未来媒体技术研究院 Depth map recovery method
CN105979244A (en) * 2016-05-31 2016-09-28 十二维度(北京)科技有限公司 Method and system used for converting 2D image to 3D image based on deep learning

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292846B (en) * 2017-06-27 2020-11-10 南方医科大学 Recovery method of incomplete CT projection data under circular orbit
CN107292846A (en) * 2017-06-27 2017-10-24 南方医科大学 The restoration methods of incomplete CT data for projection under a kind of circular orbit
CN109215067A (en) * 2017-07-03 2019-01-15 百度(美国)有限责任公司 High-resolution 3-D point cloud is generated based on CNN and CRF model
CN109215067B (en) * 2017-07-03 2023-03-10 百度(美国)有限责任公司 High-resolution 3-D point cloud generation based on CNN and CRF models
CN107992848B (en) * 2017-12-19 2020-09-25 北京小米移动软件有限公司 Method and device for acquiring depth image and computer readable storage medium
CN107992848A (en) * 2017-12-19 2018-05-04 北京小米移动软件有限公司 Obtain the method, apparatus and computer-readable recording medium of depth image
US11308638B2 (en) 2018-02-01 2022-04-19 Shenzhen Sensetime Technology Co., Ltd. Depth estimation method and apparatus, electronic device, program, and medium
CN108335322A (en) * 2018-02-01 2018-07-27 深圳市商汤科技有限公司 Depth estimation method and device, electronic equipment, program and medium
CN108335322B (en) * 2018-02-01 2021-02-12 深圳市商汤科技有限公司 Depth estimation method and apparatus, electronic device, program, and medium
CN108389226A (en) * 2018-02-12 2018-08-10 北京工业大学 A kind of unsupervised depth prediction approach based on convolutional neural networks and binocular parallax
CN108596102A (en) * 2018-04-26 2018-09-28 北京航空航天大学青岛研究院 Indoor scene object segmentation grader building method based on RGB-D
US11423615B1 (en) * 2018-05-29 2022-08-23 HL Acquisition, Inc. Techniques for producing three-dimensional models from one or more two-dimensional images
CN109657839A (en) * 2018-11-22 2019-04-19 天津大学 A kind of wind power forecasting method based on depth convolutional neural networks
CN110163246B (en) * 2019-04-08 2021-03-30 杭州电子科技大学 Monocular light field image unsupervised depth estimation method based on convolutional neural network
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110782490B (en) * 2019-09-24 2022-07-05 武汉大学 Video depth map estimation method and device with space-time consistency
CN110782490A (en) * 2019-09-24 2020-02-11 武汉大学 Video depth map estimation method and device with space-time consistency
CN111259782A (en) * 2020-01-14 2020-06-09 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation
CN114596637A (en) * 2022-03-23 2022-06-07 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment
CN114596637B (en) * 2022-03-23 2024-02-06 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment

Also Published As

Publication number Publication date
CN106612427B (en) 2018-07-06

Similar Documents

Publication Publication Date Title
CN106612427B (en) Method for generating spatio-temporally consistent depth map sequences based on convolutional neural networks
US10540590B2 (en) Method for generating spatial-temporally consistent depth map sequences based on convolution neural networks
Sun et al. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume
Zou et al. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency
Zhou et al. Moving indoor: Unsupervised video depth learning in challenging environments
Yue et al. Image denoising by exploring external and internal correlations
CN109360156A (en) Single image rain removing method based on the image block for generating confrontation network
CN102026013A (en) Stereo video matching method based on affine transformation
Peng et al. LVE-S2D: Low-light video enhancement from static to dynamic
Zhang et al. Multiscale-vr: Multiscale gigapixel 3d panoramic videography for virtual reality
Li et al. Enforcing temporal consistency in video depth estimation
CN107018400B (en) It is a kind of by 2D Video Quality Metrics into the method for 3D videos
Meng et al. Perception inspired deep neural networks for spectral snapshot compressive imaging
Cho et al. Event-image fusion stereo using cross-modality feature propagation
Dong et al. Cycle-CNN for colorization towards real monochrome-color camera systems
Guo et al. Adaptive estimation of depth map for two-dimensional to three-dimensional stereoscopic conversion
Yeh et al. An approach to automatic creation of cinemagraphs
Li et al. Graph-based saliency fusion with superpixel-level belief propagation for 3D fixation prediction
Dong et al. Pyramid convolutional network for colorization in monochrome-color multi-lens camera system
WO2022257184A1 (en) Method for acquiring image generation apparatus, and image generation apparatus
CN106028018B (en) Real scene shooting double vision point 3D method for optimizing video and system towards naked eye 3D display
Kim et al. Light field angular super-resolution using convolutional neural network with residual network
Zhou et al. 1st Place Solution of Egocentric 3D Hand Pose Estimation Challenge 2023 Technical Report: A Concise Pipeline for Egocentric Hand Pose Reconstruction
Lee et al. Efficient Low Light Video Enhancement Based on Improved Retinex Algorithms
CN112200756A (en) Intelligent bullet special effect short video generation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant