CN102074020B - Method for performing multi-body depth recovery and segmentation on video
Abstract
The invention discloses a method for performing multi-body depth recovery and segmentation on a video, which comprises the following steps: (1) performing energy minimization on the video by an iterative method to obtain an initial label for each frame of the video, the initial label consisting of the depth and segmentation information of the pixels; (2) after performing image segmentation on each frame, optimizing the initial labels of each frame by a multi-body plane fitting method to obtain the optimized labels of all segments of each frame; (3) using the optimized labels, selecting for each pixel of each frame a group of visible frames and a group of invisible frames from the adjacent frames; and (4) performing energy minimization on each frame of the video by the iterative method to obtain the iterated labels of each frame, and further expanding the number of depth levels of the iterated labels by a hierarchical belief propagation algorithm. The method can perform depth recovery and segmentation on videos in which multiple rigid objects move.
Description
Technical Field
The invention relates to a depth recovery and segmentation method for performing depth recovery and segmentation on videos in which multiple rigid objects move.
Background
Depth-based three-dimensional reconstruction and image (or video) segmentation have long been fundamental problems in computer vision, because the computed depth maps and segmented images can be used separately or together in many important applications, such as object recognition, image-based rendering, and image (or video) editing. However, research on these two types of problems has largely been independent, and only recently have some begun to study them jointly, e.g. L. Quan, J. Wang, P. Tan, and L. Yuan. Image-Based Modeling by Joint Segmentation. International Journal of Computer Vision (IJCV '07).
Multi-view stereo (MVS) techniques can be used to compute depth and three-dimensional geometric information from a set of images. Given the importance of three-dimensional reconstruction, research has also addressed the reconstruction of dynamic scenes containing moving objects. Three-dimensional motion segmentation distinguishes the feature tracks of multiple moving objects so as to recover their actual positions together with the corresponding camera motion. For simplicity, most of these methods use affine camera models (e.g., J. P. Costeira and T. Kanade. A Multi-Body Factorization Method for Motion Analysis. IEEE International Conference on Computer Vision (ICCV '95)), and a few methods have been proposed to handle the three-dimensional segmentation problem under the perspective camera model, such as K. Schindler, J. U, and H. Wang. Perspective n-View Multibody Structure-and-Motion Through Model Selection (ECCV '06). However, none of these methods can be used directly for high-quality three-dimensional reconstruction, especially when image segmentation is also required.
If the moving rigid objects were segmented out individually, MVS could be applied to each object independently. However, classical image segmentation methods such as mean shift, normalized cuts and segmentation by weighted aggregation (SWA) simply process two-dimensional images without considering the overall geometric information available to MVS.
To extract a moving foreground object and its potentially visible boundaries, several bilayer segmentation methods have been proposed, such as A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov. Bilayer Segmentation of Live Video. CVPR '06. These methods assume that the camera is stationary and that the background color can easily be estimated or modeled. Note, however, that these methods are likewise not applicable to MVS, since MVS requires the camera to move.
Recently, Zhang et al. used motion and depth information in G. Zhang, J. Jia, W. Hua, and H. Bao. Robust Bilayer Segmentation and Motion/Depth Estimation with a Handheld Camera. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI 2010) to model the background environment and extract a high-quality foreground layer, iteratively optimizing the computed depth, motion field and bilayer segmentation result. However, this method is limited to bilayer segmentation. In addition, only the motion information of the foreground layer is computed, not its depth, which is insufficient for three-dimensional reconstruction.
In two-dimensional motion segmentation, pixels with the same motion trend are roughly grouped together and finally divided into several layers. This approach relies heavily on the accuracy of motion estimation, and it is difficult to obtain high-quality segmentation results, especially under severe occlusion.
In addition, two-dimensional motion segmentation requires computing both motion and segmentation, which poses a chicken-and-egg problem: inaccurate motion estimation causes inaccurate segmentation, which in turn causes inaccurate motion estimation. Their joint optimization therefore often ends in a local optimum.
Disclosure of Invention
The invention aims to provide a method for multi-body depth recovery and segmentation of video, capable of performing depth recovery and segmentation on videos in which multiple rigid objects move.
To achieve this purpose, the technical scheme adopted by the invention is as follows. The method for multi-body depth recovery and segmentation of video comprises the following steps:
(1) performing energy minimization on the video with the energy equation of Eq. (1) by an iterative method to obtain an initial label of each frame of the video, the initial label consisting of the depth information and segmentation information of the pixels,
wherein,

$$E(L_t) = \sum_{x_t \in I_t} \big(1 - P_{init}(x_t, L_t(x_t))\big) + \lambda_s \sum_{x_t \in I_t} \sum_{y_t \in N(x_t)} \rho\big(L_t(x_t), L_t(y_t)\big) \qquad (1)$$

$$P_{init}(x_t, l) = \frac{1}{|\phi'(x_t)|} \sum_{t' \in \phi'(x_t)} p_c(x_t, l, I_t, I_{t'}) \qquad (2)$$

$$p_c(x_t, l, I_t, I_{t'}) = \frac{\sigma_c}{\sigma_c + \lVert I_t(x_t) - I_{t'}(x') \rVert} \qquad (3)$$
in Eqs. (1), (2) and (3), $I_t$ denotes the $t$-th frame image, $t = 1 \ldots n$, where $n$ is the total number of frames of the video; $x_t$ denotes a pixel on $I_t$; $L_t(x_t)$ denotes the label of $x_t$; $N(x_t)$ denotes the set of all pixels neighboring $x_t$; $\rho(L_t(x_t), L_t(y_t)) = \min\{|L_t(x_t) - L_t(y_t)|, \eta\}$ denotes the label difference between adjacent pixels; $\eta$ denotes the truncation parameter; $\lambda_s$ denotes the weight of the smoothness term; $\phi'(x_t)$ denotes the set of frames in which pixel $x_t$ is visible, i.e. frames in which the pixel corresponding to $x_t$ re-projects onto $x_t$ in the $t$-th frame; $p_c$ denotes the color similarity between pixel $x_t$ and $x'$; $l$ denotes the label of $x_t$; $\sigma_c$ denotes the parameter controlling the shape of the difference function of Eq. (3); $x'$ denotes the pixel in the $t'$-th frame corresponding to $x_t$, the $t'$-th frame being a frame of $\phi'(x_t)$; $I_t(x_t)$ denotes the color value of pixel $x_t$; $I_{t'}(x')$ denotes the color value of pixel $x'$; the coordinates of $x'$ are obtained by converting the homogeneous coordinates $x'_h$ given by Eq. (4) into two-dimensional coordinates:

$$x'_h = K_{t'} R_{t'}^{\top} \left( R_t \frac{K_t^{-1} x_t^h}{D(l)} + T_t - T_{t'} \right) \qquad (4)$$
in Eq. (4), the superscript $h$ denotes homogeneous coordinates; $D(l)$ denotes the depth information in the label of pixel $x_t$; $K_{t'}$, $R_{t'}$ and $T_{t'}$ denote the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the $t'$-th frame, respectively; $K_t$, $R_t$ and $T_t$ denote the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the $t$-th frame, respectively;
(2) after performing image segmentation on each frame, optimizing the initial labels of each frame by the multi-body plane fitting method to obtain the optimized labels of all segments of each frame;
(3) using the optimized labels finally obtained in step (2), selecting for each pixel $x_t$ on the $t$-th frame a set of visible frames $\phi_v(x_t)$ and a set of invisible frames $\phi_o(x_t)$ from the neighboring frames, wherein at least one pixel of a visible frame coincides with $x_t$ when transformed to the $t$-th frame, and no pixel of an invisible frame coincides with $x_t$ when transformed to the $t$-th frame;
(4) performing energy minimization on each frame of the video with the energy equation of Eq. (5) by the iterative method to obtain the iterated label of each frame of the video, and further expanding the number of depth levels in the iterated labels by a hierarchical belief propagation algorithm,
$E_t(L_t) = E_d(L_t) + E_s(L_t)$ (5)
wherein,

$$E_d(L_t) = \sum_{x_t \in I_t} \big(1 - P(x_t, L_t(x_t))\big) \qquad (6)$$

$$P(x_t, l) = \frac{1}{|\phi_v(x_t)| + |\phi_o(x_t)|} \left( \sum_{t' \in \phi_o(x_t)} p_o(x_t, l, L_{t',t}) + \sum_{t' \in \phi_v(x_t)} p_c(x_t, l, I_t, I_{t'}) \cdot p_v(x_t, l, L_{t'}) \right) \qquad (7)$$

$$p_c(x_t, l, I_t, I_{t'}) = \frac{\sigma_c}{\sigma_c + \lVert I_t(x_t) - I_{t'}(x') \rVert} \qquad (8)$$

$$p_v(x_t, l, L_{t'}) = \begin{cases} p_g(x_t, x'_{\to t}), & S(l) = S(l') \\ 0, & S(l) \neq S(l') \end{cases} \qquad (9)$$

$$p_g(x_t, x'_{\to t}) = \frac{\sigma_d}{\sigma_d + \lVert x_t - x'_{\to t} \rVert} \qquad (10)$$

$$E_s(L_t) = \lambda_s \sum_{x_t \in I_t} \sum_{y_t \in N(x_t)} \rho\big(L_t(x_t), L_t(y_t)\big) \qquad (11)$$
in Eqs. (5) to (11), $E_d(L_t)$ and $E_s(L_t)$ denote the data term and the smoothness term of the energy equation, respectively; $I_t$ denotes the $t$-th frame image, $t = 1 \ldots n$, where $n$ is the total number of frames of the video; $x_t$ denotes a pixel on $I_t$; $L_t(x_t)$ denotes the label of $x_t$; $N(x_t)$ denotes the set of all pixels neighboring $x_t$; $\rho(L_t(x_t), L_t(y_t)) = \min\{|L_t(x_t) - L_t(y_t)|, \eta\}$ denotes the label difference between adjacent pixels; $\eta$ denotes the truncation parameter; $\lambda_s$ denotes the weight of the smoothness term; $x'$ denotes the pixel in the $t'$-th frame corresponding to pixel $x_t$; $I_t(x_t)$ denotes the color value of pixel $x_t$; $I_{t'}(x')$ denotes the color value of pixel $x'$; the coordinates of $x'$ are obtained by converting the homogeneous coordinates $x'_h$ given by Eq. (4) into two-dimensional coordinates; $p_c$ denotes the color similarity between pixel $x_t$ and $x'$; $l$ denotes the label of $x_t$; $l'$ denotes the label of pixel $x'$; $S(l)$ and $S(l')$ denote the segmentation labels in label $l$ and label $l'$, respectively; $p_g$ denotes the measure of geometric consistency between two pixels; $D(l)$ and $D(l')$ denote the depth labels in label $l$ and label $l'$, respectively; $x'_{\to t}$ is the pixel obtained by re-projecting pixel $x'$ onto the $t$-th frame according to $D(l')$; $p_v$ denotes the geometric and segmentation consistency between pixel $x_t$ and the pixel at coordinates $x'$; $p_o$ denotes the label prior constraint computed from the label map $L_{t',t}$ warped from an invisible frame.
Further, the method of optimizing the initial labels by multi-body plane fitting in step (2) of the invention is as follows:
after each frame is subjected to image segmentation, each segment is assigned each object label in turn, the object labels successively assigned to the same segment differing from one another; then, for each assignment of each segment, the energy equation of Eq. (1) is used to obtain the corresponding minimum energy value and the parameters of the plane on which the segment lies; the smallest of these minimum energy values of the segment is compared with the minimum energy value corresponding to the initial labels: if it is smaller, the object label corresponding to it is assigned, as the segmentation label, to the pixels of the segment, yielding the optimized label of the segment; otherwise the initial labels are taken as the optimized label of the segment.
Compared with the prior art, the invention has the following beneficial effects:
(1) A brand-new multi-body stereo vision model is provided, in which depth and segmentation labels are represented uniformly by a single label and solved as a global labeling optimization problem. The multi-view stereo matching method is thereby extended, for the first time, to scenes in which multiple rigid objects move independently, and is solved by a global optimization method.
(2) A strategy for adaptive selection of matched frames is provided: the recovered depth and segmentation information of neighboring frames is projected onto the current frame to judge the visibility of pixels, fill missing pixels, and obtain a label prior constraint that handles the occlusion problem.
(3) A brand-new multi-body plane fitting method is provided, which effectively solves the difficulty of computing depth and segmentation in featureless regions.
Drawings
FIG. 1 is a basic flow diagram of the present invention;
FIG. 2 illustrates one of the principles of the invention, depicting projection and re-projection in multi-view geometry;
FIG. 3 illustrates the method of filling missing pixels in label maps according to the invention: (a) is frame 61; (b) is frame 76; (c) is $L_{76,61}$, the label information of all pixels after pre-labeling and re-projection, with occluded pixels shown in red; (d) is an example of the label-map missing-pixel filling procedure;
FIG. 4 compares the results obtained with and without the adaptive selection of matched frames and the label prior constraint of the invention: (a) is the first frame of the experimental data sequence; (b) and (c) are the label maps of this frame (label values shown in grayscale) computed with and without the pre-labeled adaptive frame selection method, respectively; (d) and (e) are enlarged views of the rectangular regions in (b) and (c), respectively, for easier observation;
FIG. 5 is a pipeline diagram of one example of the invention: (a) is a frame of the image sequence; (b) is the label map obtained from the initial solution, before multi-body plane fitting; (c) is the label map after multi-body plane fitting; (d) is the label map after two optimization iterations; (e) is the segmentation map of a person and the background; (f) is the three-dimensional reconstructed patch result without depth-level expansion; (g) is the three-dimensional reconstructed patch result after depth-level expansion;
FIG. 6 shows two three-body examples of the invention: (a) and (d) are two selected frames; (b) is the label map of (a); (c) is the segmentation map of (a); (e) is the label map of (d); (f) is the segmentation map of (d);
FIG. 7 shows the reconstructed patch results of one example of the invention: (a) is the geometric information of the background; (b) and (c) are the reconstructed patch results of the two persons in the frame, respectively;
FIG. 8 shows example results on a box sequence: (a) is a frame of the image sequence; (b) is the label map of that frame computed by the invention;
FIG. 9 shows example results on a toy sequence: (a) is a frame of the image sequence; (b) is the label map of that frame computed by the invention; (c) is the segmentation map of that frame computed by the invention.
Detailed Description
The invention provides a stable and efficient multi-body depth recovery and segmentation method. Its basic flow is shown in FIG. 1 and mainly comprises the following steps:
Step 1) Initialization stage: energy minimization is performed on the video by an iterative method according to the energy equation provided by the invention, obtaining the initial label of each frame of the video, where the initial label consists of the depth label and the segmentation label of each pixel. The specific steps are as follows:
The goal of the invention is to find both the depth value of every pixel and the object to which the pixel belongs, so each pixel carries two labels, a depth label and a segmentation label. The invention represents these two label values uniformly by a single extended label:
it is assumed here that there are K objects (including the background) in the scene. The pixel disparity values (actually the inverse of the depth, referred to herein as the depth index) of the kth object on the image captured by the camera range fromIn between, this interval is divided evenly so that each disparity value is defined as follows:
this is the meaning of all elements in the L-set. Therein are provided with
The object with the symbol l is represented by S (l), i.e. the segmentation symbol. D (l) represents the disparity value of the pixel labeled l, i.e., the depth label. Any element L of the L setiS (l) and D (l) are readily available, and index term h is first found so that the following inequality is satisfied:
then s (l) ═ h, <math>
<mrow>
<mi>D</mi>
<mrow>
<mo>(</mo>
<mi>l</mi>
<mo>)</mo>
</mrow>
<mo>=</mo>
<msubsup>
<mi>d</mi>
<mrow>
<mi>j</mi>
<mo>-</mo>
<msubsup>
<mi>Σ</mi>
<mrow>
<mi>j</mi>
<mo>=</mo>
<mn>1</mn>
</mrow>
<mrow>
<mi>h</mi>
<mo>-</mo>
<mn>1</mn>
</mrow>
</msubsup>
<msub>
<mi>m</mi>
<mi>j</mi>
</msub>
</mrow>
<mi>h</mi>
</msubsup>
<mo>.</mo>
</mrow>
</math>
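By way of illustration, this label encoding can be sketched as follows (a minimal Python sketch; the function names and the endpoint arrays d_min, d_max are illustrative and not part of the patent, while the 1-based index convention follows the text):

```python
import numpy as np

def build_disparity_levels(m, d_min, d_max):
    # m[k]: number of disparity levels of object k; d_min[k], d_max[k] bound
    # its disparity range (endpoint names are illustrative).
    return [np.linspace(d_min[k], d_max[k], m[k]) for k in range(len(m))]

def decode_label(i, m, d):
    # Recover (S(l), D(l)) from the flat 1-based label index i, following
    # sum_{j<h} m_j < i <= sum_{j<=h} m_j.
    cum = 0
    for h, m_h in enumerate(m, start=1):
        if i <= cum + m_h:
            return h, d[h - 1][i - cum - 1]  # S(l) = h, D(l) = d^h_{i - cum}
        cum += m_h
    raise ValueError("label index out of range")
```

For example, with m = [3, 2], the flat index i = 4 decodes to S(l) = 2 and the first disparity level of object 2.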
the following energy equations were optimized:
wherein P isinitIs defined as follows:
where x istIs the t-th frame image ItOne pixel above, t is 1 … n, and n is the total frame number of the video; n (x)t) Is xtAll neighboring pixels of (g), p (L)t(xt),Lt(yt))=min{|Lt(xt)-Lt(yt) And | η }, which represents the difference in labels between neighboring pixels. η is a truncation parameter so that discontinuity of the object boundary is not lost when the energy function is minimized because the energy value of the smoothing term is too large, i.e. is prevented from being excessively smoothed.
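As an illustration of this energy, a brute-force evaluation might look as follows (a sketch assuming the per-pixel data term has already been precomputed into a cost volume; the names are hypothetical):

```python
import numpy as np

def smoothness_energy(L, eta, lambda_s):
    # E_s: truncated-linear label differences over horizontal and vertical
    # neighbor pairs, rho = min(|L(x) - L(y)|, eta).
    dx = np.minimum(np.abs(np.diff(L.astype(int), axis=1)), eta)
    dy = np.minimum(np.abs(np.diff(L.astype(int), axis=0)), eta)
    return lambda_s * (dx.sum() + dy.sum())

def total_energy(L, data_cost, eta, lambda_s):
    # data_cost is an H x W x |L| volume holding the per-pixel data term of
    # every candidate label (how it is filled follows Eq. (2)).
    ys, xs = np.indices(L.shape)
    return data_cost[ys, xs, L].sum() + smoothness_energy(L, eta, lambda_s)
```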
Here $\phi'(x_t)$ is the set of selected frames in which the pixel $x_t$ is visible. These frames are selected by the method disclosed in S. B. Kang and R. Szeliski. Extracting View-Dependent Depth Maps from a Collection of Images. International Journal of Computer Vision (IJCV 2004).
$p_c$ is defined as in Eq. (3):

$$p_c(x_t, l, I_t, I_{t'}) = \frac{\sigma_c}{\sigma_c + \lVert I_t(x_t) - I_{t'}(x') \rVert},$$

where $\sigma_c$ is a parameter controlling the shape of this difference function and $I_{t'}(x')$ is the color value of the pixel $x'$ of $\phi'(x_t)$ corresponding to $x_t$. The homogeneous coordinates $x'_h$ of $x'$ are obtained from Eq. (4):

$$x'_h = K_{t'} R_{t'}^{\top} \left( R_t \frac{K_t^{-1} x_t^h}{D(l)} + T_t - T_{t'} \right),$$

and converting $x'_h$ from homogeneous to two-dimensional coordinates gives the coordinates of $x'$. $D(l)$ above is the disparity in the label $l$ of pixel $x_t$; $K_{t'}$, $R_{t'}$ and $T_{t'}$ are the intrinsic parameter matrix, extrinsic rotation matrix and extrinsic translation matrix of the camera for the $t'$-th frame, and $K_t$, $R_t$ and $T_t$ those for the $t$-th frame. Note that the camera intrinsics and extrinsics of all frames are known before any step of the invention is performed.
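A direct transcription of this re-projection can be sketched as follows (the world-coordinate convention X_world = R X_cam + T is an assumption consistent with the formula as reconstructed here):

```python
import numpy as np

def reproject(x_t, disparity, K_t, R_t, T_t, K_tp, R_tp, T_tp):
    # Eq. (4): pixel x_t = (u, v) on frame t with disparity D(l) mapped to
    # the corresponding pixel x' on frame t'.
    xh = np.array([x_t[0], x_t[1], 1.0])                    # homogeneous pixel
    X = R_t @ (np.linalg.inv(K_t) @ xh) / disparity + T_t   # back-project to world
    xph = K_tp @ (R_tp.T @ (X - T_tp))                      # project into frame t'
    return xph[:2] / xph[2]                                 # homogeneous -> 2D
```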
Step 2) After each frame is subjected to image segmentation, the initial labels of each frame are optimized by the multi-body plane fitting method to obtain the optimized labels of all segments of each frame. The specific steps are as follows:
First, mean shift is used to compute the segmentation of $I_t$ into segments $\{s_i\}$.
Assume there are K objects (including the background) in the scene. Each segment $s_i$ is assumed to belong to some object $k$ and to lie on a plane with three parameters. Optimizing energy equation (1) under this assumption yields the values of the three plane parameters and a minimum energy. Enumerating the object $k$ to which segment $s_i$ may belong gives K minimum energies $E'_0, E'_1, \dots, E'_{K-1}$; the smallest of them, denoted $E'_j$, together with its object $j$ and the corresponding three parameters, is taken as the minimum energy of the segment, its object, and its three plane parameters $[a_i, b_i, c_i]$.
The initial per-pixel labels were computed in step 1) of the initialization stage. Substituting all pixels of $s_i$ with their initial labels into energy equation (1) gives an energy $E'_t$. If $E'_j < E'_t$, the labels of all pixels $x_t$ in $s_i$ are updated so that their object is $j$, i.e. $S(x_t) = j$, with plane parameters $[a_i, b_i, c_i]$.
A comparison of the effect before and after optimization with the multi-body plane fitting algorithm is shown in FIG. 5.
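The per-segment search can be sketched as follows (fit_plane_and_energy stands for minimizing energy equation (1) over segment s_i under the planar assumption for object k, and encode_label for writing back a plane-consistent label; both are hypothetical placeholders):

```python
def refine_segment(seg_pixels, K, E_init, fit_plane_and_energy, encode_label, labels):
    # Try every object k for the segment; fit_plane_and_energy returns
    # (energy, plane). Keep the winner only if it beats E_init, the energy
    # of the segment's initial labels.
    E_j, plane, j = min(
        (fit_plane_and_energy(seg_pixels, k) + (k,) for k in range(K)),
        key=lambda r: r[0])
    if E_j < E_init:
        for x in seg_pixels:
            labels[x] = encode_label(j, plane, x)  # hypothetical label encoder
    return labels
```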
Step 3) Adaptive selection of matched frames and label prior constraint. According to the labels obtained in the initialization stage or the last iteration of the optimization, two sets of frames are selected from the neighboring frames for each pixel $x_t$ on the $t$-th frame: one set consisting of frames in which $x_t$ is visible, denoted $\phi_v(x_t)$, and another set consisting of frames in which $x_t$ is not visible, denoted $\phi_o(x_t)$. The specific steps are as follows:
1) After the optimized labels are obtained in step 2), the label map of the $t'$-th frame is warped to the $t$-th frame, yielding $L_{t',t}$, by the method of W. R. Mark, L. McMillan, and G. Bishop. Post-Rendering 3D Warping (SI3D 1997). If, under this re-projection, no pixel of the label map of the $t'$-th frame projects onto $x_t$, as shown in FIG. 2, the $t'$-th frame is assigned to $\phi_o(x_t)$; otherwise it belongs to $\phi_v(x_t)$.
2) In practice the matching need not be computed over all frames; at most $N_1$ frames are selected for matching ($N_1$ is typically 16 to 20). If $|\phi_v(x_t)|$ is found to be smaller than a lower limit $N_2$ (typically 5), some neighboring frames with no pixel re-projecting onto $x_t$ are added to $\phi_o(x_t)$ so that $|\phi_v(x_t)| + |\phi_o(x_t)| = N_2$.
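This selection rule can be sketched as follows (is_visible encapsulates the re-projection test on the warped label map $L_{t',t}$ and is a hypothetical helper):

```python
def select_matched_frames(x_t, candidate_frames, is_visible, N1=16, N2=5):
    # candidate_frames: neighboring frame indices ordered by proximity.
    # is_visible(tp): True if some pixel of frame tp re-projects onto x_t.
    phi_v, occluded = [], []
    for tp in candidate_frames[:N1]:     # match at most N1 frames
        (phi_v if is_visible(tp) else occluded).append(tp)
    # keep just enough occluded frames so that |phi_v| + |phi_o| = N2
    phi_o = occluded[:max(0, N2 - len(phi_v))]
    return phi_v, phi_o
```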
3) Note that an occluded pixel cannot be matched reliably, so if a pixel is invisible in all neighboring frames, its depth value cannot be obtained directly. Even in this case, however, the depth value of the pixel and the object it belongs to can still be approximated; this is the purpose of $\phi_o(x_t)$. The method is as follows:
for the label map projected from the adjacent frame, as shown in FIG. 3, for each missing pixel xtSearching in horizontal and vertical directions respectively to find two nearest effective projection pixels, and selecting the one with the smallest label among the four pixels, and recording the one as x*Its reference numeral is xtReference numerals of (a). Can use x*At Lt′,tIn which the reference number replaces xtAt Lt′,tReference symbol in (1) is Lt′,t(x)=Lt′,t(x*) Is determined by the distance between the two pixels, this confidence level is defined as follows:
wherein the constant σωSet to 10.
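The four-direction search can be sketched as follows (the Gaussian falloff of the confidence is an assumed form; the text only states that the confidence is determined by the distance between the two pixels, with sigma_w = 10):

```python
import numpy as np

def fill_missing_label(L_proj, valid, x, sigma_w=10.0):
    # L_proj: label map warped from frame t' to frame t; valid: mask of pixels
    # that actually received a projection; x = (row, col) is a missing pixel.
    h, w = L_proj.shape
    r, c = x
    candidates = []
    for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):  # four search directions
        rr, cc = r + dr, c + dc
        while 0 <= rr < h and 0 <= cc < w:
            if valid[rr, cc]:
                candidates.append((rr, cc, L_proj[rr, cc]))
                break
            rr, cc = rr + dr, cc + dc
    if not candidates:
        return None, 0.0
    rs, cs, lab = min(candidates, key=lambda p: p[2])      # smallest label wins
    dist = np.hypot(rs - r, cs - c)
    return lab, float(np.exp(-dist ** 2 / sigma_w ** 2))   # assumed falloff form
```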
Although this missing-label inference is not very accurate, it improves the data term that must be computed where occlusion matters. The label prior constraint is defined with a weight $\lambda_o$ and a parameter $\beta$ controlling the shape of its difference function; it requires that when the confidence $\omega_o(x)$ is high, $L_t(x_t)$ stays close to $L_{t',t}(x_t)$.
FIG. 4 shows label maps computed with the adaptive selection of matched frames and the label prior constraint, compared with label maps computed without them. The label map computed with them is significantly improved at discontinuity boundaries.
Step 4) Iterative optimization stage: all frames of the monocular video sequence are iteratively optimized according to the energy equation provided by the invention to obtain the label maps of all frames, and the fineness of the depth recovery is then improved with a hierarchical belief propagation algorithm. The specific steps are as follows:
1) According to the energy equation

$$E_t(L_t) = E_d(L_t) + E_s(L_t),$$

optimization yields the minimum energy value, i.e. the depth values and segmentation of all pixels. The energy equation is optimized in two passes with a belief propagation algorithm to obtain a comparatively accurate result.
Here the smoothness term is

$$E_s(L_t) = \lambda_s \sum_{x_t \in I_t} \sum_{y_t \in N(x_t)} \rho\big(L_t(x_t), L_t(y_t)\big),$$

which measures the label difference between adjacent pixels of the image, keeping that difference as small as possible. $\rho$ is defined as $\rho(L_t(x_t), L_t(y_t)) = \min\{|L_t(x_t) - L_t(y_t)|, \eta\}$, the label difference between neighboring pixels truncated at $\eta$; the cutoff ensures that, when the energy function is minimized, discontinuities at object boundaries are not lost because the smoothness energy grows too large, i.e. they are not over-smoothed.
The data term $E_d(L_t)$ is defined as

$$E_d(L_t) = \sum_{x_t \in I_t} \big(1 - P(x_t, L_t(x_t))\big),$$

where the per-pixel matching confidence $P(x_t, l)$ is
$$P(x_t, l) = \frac{1}{|\phi_v(x_t)| + |\phi_o(x_t)|} \left( \sum_{t' \in \phi_o(x_t)} p_o(x_t, l, L_{t',t}) + \sum_{t' \in \phi_v(x_t)} p_c(x_t, l, I_t, I_{t'}) \cdot p_v(x_t, l, L_{t'}) \right).$$
Here

$$p_c(x_t, l, I_t, I_{t'}) = \frac{\sigma_c}{\sigma_c + \lVert I_t(x_t) - I_{t'}(x') \rVert}$$

measures the color similarity between pixel $x_t$ and $x'$, where $x_t$ and $x'$ are pixels on the $t$-th and $t'$-th frames, respectively; $p_v$ measures the geometric and segmentation consistency of pixels $x_t$ and $x'$, i.e. whether $x_t$ and $x'$ lie on the same object with consistent depths. It is specifically defined as:

$$p_v(x_t, l, L_{t'}) = \begin{cases} p_g(x_t, x'_{\to t}), & S(l) = S(l') \\ 0, & S(l) \neq S(l') \end{cases}$$
where $l'$ is the label of $x'$. If $l$ and $l'$ designate different objects, i.e. $S(l) \neq S(l')$, the two pixels are not corresponding pixels of the two frames and must be separated. Otherwise the geometric consistency of the two pixels is measured by $p_g$, defined in the same form as the color term over the re-projection distance:

$$p_g(x_t, x'_{\to t}) = \frac{\sigma_d}{\sigma_d + \lVert x_t - x'_{\to t} \rVert},$$

where $x'_{\to t}$ is the point obtained by re-projecting $x'$ of the $t'$-th frame onto the $t$-th frame according to the computed $D(l')$.
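Combining these terms, the per-pixel data cost can be sketched as follows (a minimal sketch; p_c, p_v and p_o are assumed to be supplied as callables implementing the color, visibility-consistency and label-prior terms above):

```python
def data_cost(x_t, l, phi_v, phi_o, p_c, p_v, p_o):
    # Matching confidence P(x_t, l): label-prior terms over the occluded
    # frames phi_o plus color similarity weighted by visibility consistency
    # over the visible frames phi_v, normalized by the total frame count.
    total = sum(p_o(x_t, l, tp) for tp in phi_o)
    total += sum(p_c(x_t, l, tp) * p_v(x_t, l, tp) for tp in phi_v)
    return total / (len(phi_v) + len(phi_o))
```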
2) In the above process, the invention uses a belief propagation (BP) algorithm to optimize the objective function. Since memory consumption is proportional to the number of labels, processing high-resolution images can easily exceed the memory of the machine. A hierarchical solving strategy reduces memory usage, but also degrades the quality of object segmentation and depth recovery to some extent, particularly in discontinuous boundary regions. To obtain high-quality object segmentation and depth recovery results as far as possible, the invention adopts a simple region-based solving strategy to overcome the memory bottleneck: the image is cut uniformly into M × M regions and the energy is optimized over each region in turn. If a color segment spans multiple regions, it is split accordingly. This strategy is simple and effective, overcomes the memory bottleneck, and has little influence on the processing result.
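A sketch of the region-based strategy (optimize_bp stands for one belief-propagation pass over a region and is a placeholder; whether "M × M areas" counts regions or pixels per side is ambiguous in the text, so this sketch assumes an M × M grid of regions):

```python
def optimize_by_regions(image, labels, M, optimize_bp):
    # Bound BP memory by optimizing the frame region by region.
    h, w = image.shape[:2]
    rh, rw = (h + M - 1) // M, (w + M - 1) // M
    for i in range(0, h, rh):
        for j in range(0, w, rw):
            win = (slice(i, min(i + rh, h)), slice(j, min(j + rw, w)))
            labels[win] = optimize_bp(image[win], labels[win])
    return labels
```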
This completes all steps of the multi-body depth recovery and segmentation of the invention.
During initialization and the two iterative optimizations, the number of depth levels $m_k$ of each object is usually set between 51 and 101. After two optimization iterations, the segmentation result is usually already very accurate. To further improve the depth accuracy, the invention then fixes the segmentation label, at which point the label is effectively equivalent to the depth level. A coarse-to-fine hierarchical belief propagation algorithm is adopted, which effectively expands the number of depth levels in the global optimization without adding much computational cost, thereby improving the precision of the depth recovery.
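One possible refinement step consistent with this description (the schedule below is illustrative; the patent does not spell it out):

```python
import numpy as np

def refine_depth_levels(D, step, shrink=2, halfwidth=2):
    # With segmentation fixed, build per-pixel candidate depth values around
    # the current estimate D at a finer quantization step / shrink; a BP pass
    # would then select among these candidates.
    fine = step / shrink
    offsets = np.arange(-halfwidth, halfwidth + 1) * fine
    return D[None, :, :] + offsets[:, None, None]   # (2*halfwidth+1, H, W)
```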
In the experiments, the image resolution of the video sequences is 960 × 540. Most parameters of the system can keep their default values and need not be adjusted during processing, e.g. $\lambda_s = 5/|L|$, $\eta = 0.03|L|$, $\lambda_o = 0.3$, $\sigma_c = 10$, $\sigma_d = 0.02|L|$ and $\beta = 2$, which yields the results shown in FIGS. 8 and 9.
One set of experiments of the invention is on a box sequence, as shown in FIG. 8: FIG. 8(a) is a frame of the video and FIG. 8(b) is the label map of that frame obtained with the invention. Another set of experiments is on a toy sequence, as shown in FIG. 9: FIG. 9(a) is a frame of the video, FIG. 9(b) is the label map of that frame obtained with the invention, and FIG. 9(c) is its segmentation map. Comparing the label maps of FIGS. 8 and 9 with the original frames shows that the label maps are cleanly structured and accurately recovered at the boundaries; comparing the segmentation map of FIG. 9 with the original frame shows that the segmentation is very accurate, with different objects precisely separated. This demonstrates that the results of the proposed algorithm are highly accurate in the multi-body setting.
Claims (2)
1. A method for multi-body depth recovery and segmentation of video, characterized by comprising the following steps:
(1) performing energy minimization on the video with the energy equation of Eq. (1) by an iterative method to obtain an initial label of each frame of the video, the initial label consisting of the depth information and segmentation information of the pixels,
wherein,

$$E(L_t) = \sum_{x_t \in I_t} \big(1 - P_{init}(x_t, L_t(x_t))\big) + \lambda_s \sum_{x_t \in I_t} \sum_{y_t \in N(x_t)} \rho\big(L_t(x_t), L_t(y_t)\big) \qquad (1)$$

$$P_{init}(x_t, l) = \frac{1}{|\phi'(x_t)|} \sum_{t' \in \phi'(x_t)} p_c(x_t, l, I_t, I_{t'}) \qquad (2)$$

$$p_c(x_t, l, I_t, I_{t'}) = \frac{\sigma_c}{\sigma_c + \lVert I_t(x_t) - I_{t'}(x') \rVert} \qquad (3)$$
in Eqs. (1), (2) and (3), $I_t$ denotes the $t$-th frame image, $t = 1 \ldots n$, where $n$ is the total number of frames of the video; $x_t$ denotes a pixel on $I_t$; $L_t(x_t)$ denotes the label of $x_t$; $N(x_t)$ denotes the set of all pixels neighboring $x_t$; $\rho(L_t(x_t), L_t(y_t)) = \min\{|L_t(x_t) - L_t(y_t)|, \eta\}$ denotes the label difference between adjacent pixels; $\eta$ denotes the truncation parameter; $\lambda_s$ denotes the weight of the smoothness term; $\phi'(x_t)$ denotes the set of frames in which pixel $x_t$ is visible, i.e. frames in which the pixel corresponding to $x_t$ re-projects onto $x_t$ in the $t$-th frame; $p_c$ denotes the color similarity between pixel $x_t$ and $x'$; $l$ denotes the label of $x_t$; $\sigma_c$ denotes the parameter controlling the shape of the difference function of Eq. (3); $x'$ denotes the pixel in the $t'$-th frame corresponding to $x_t$, the $t'$-th frame being a frame of $\phi'(x_t)$; $I_t(x_t)$ denotes the color value of pixel $x_t$; $I_{t'}(x')$ denotes the color value of pixel $x'$; the coordinates of $x'$ are obtained by converting the homogeneous coordinates $x'_h$ given by Eq. (4) into two-dimensional coordinates:

$$x'_h = K_{t'} R_{t'}^{\top} \left( R_t \frac{K_t^{-1} x_t^h}{D(l)} + T_t - T_{t'} \right) \qquad (4)$$
in Eq. (4), the superscript $h$ denotes homogeneous coordinates; $D(l)$ denotes the depth information in the label of pixel $x_t$; $K_{t'}$, $R_{t'}$ and $T_{t'}$ denote the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the $t'$-th frame, respectively; $K_t$, $R_t$ and $T_t$ denote the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the $t$-th frame, respectively;
(2) after performing image segmentation on each frame, optimizing the initial labels of each frame by the multi-body plane fitting method to obtain the optimized labels of all segments of each frame;
(3) using the optimized labels finally obtained in step (2), selecting for each pixel $x_t$ on the $t$-th frame a set of visible frames $\phi_v(x_t)$ and a set of invisible frames $\phi_o(x_t)$ from the neighboring frames, wherein at least one pixel of a visible frame coincides with $x_t$ when transformed to the $t$-th frame, and no pixel of an invisible frame coincides with $x_t$ when transformed to the $t$-th frame;
(4) performing energy minimization on each frame of the video with the energy equation of Eq. (5) by the iterative method to obtain the iterated label of each frame of the video, and further expanding the number of depth levels in the iterated labels by a hierarchical belief propagation algorithm,
$E_t(L_t) = E_d(L_t) + E_s(L_t)$ (5)
wherein,

$$E_d(L_t) = \sum_{x_t \in I_t} \big(1 - P(x_t, L_t(x_t))\big) \qquad (6)$$

$$P(x_t, l) = \frac{1}{|\phi_v(x_t)| + |\phi_o(x_t)|} \left( \sum_{t' \in \phi_o(x_t)} p_o(x_t, l, L_{t',t}) + \sum_{t' \in \phi_v(x_t)} p_c(x_t, l, I_t, I_{t'}) \cdot p_v(x_t, l, L_{t'}) \right) \qquad (7)$$

$$p_c(x_t, l, I_t, I_{t'}) = \frac{\sigma_c}{\sigma_c + \lVert I_t(x_t) - I_{t'}(x') \rVert} \qquad (8)$$

$$p_v(x_t, l, L_{t'}) = \begin{cases} p_g(x_t, x'_{\to t}), & S(l) = S(l') \\ 0, & S(l) \neq S(l') \end{cases} \qquad (9)$$

$$p_g(x_t, x'_{\to t}) = \frac{\sigma_d}{\sigma_d + \lVert x_t - x'_{\to t} \rVert} \qquad (10)$$

$$E_s(L_t) = \lambda_s \sum_{x_t \in I_t} \sum_{y_t \in N(x_t)} \rho\big(L_t(x_t), L_t(y_t)\big) \qquad (11)$$
in Eqs. (5) to (11), $E_d(L_t)$ and $E_s(L_t)$ denote the data term and the smoothness term of the energy equation, respectively; $I_t$ denotes the $t$-th frame image, $t = 1 \ldots n$, where $n$ is the total number of frames of the video; $x_t$ denotes a pixel on $I_t$; $L_t(x_t)$ denotes the label of $x_t$; $N(x_t)$ denotes the set of all pixels neighboring $x_t$; $\rho(L_t(x_t), L_t(y_t)) = \min\{|L_t(x_t) - L_t(y_t)|, \eta\}$ denotes the label difference between adjacent pixels; $\eta$ denotes the truncation parameter; $\lambda_s$ denotes the weight of the smoothness term; $x'$ denotes the pixel in the $t'$-th frame corresponding to pixel $x_t$; $I_t(x_t)$ denotes the color value of pixel $x_t$; $I_{t'}(x')$ denotes the color value of pixel $x'$; the coordinates of $x'$ are obtained by converting the homogeneous coordinates $x'_h$ given by Eq. (4) into two-dimensional coordinates; $p_c$ denotes the color similarity between pixel $x_t$ and $x'$; $l$ denotes the label of $x_t$; $l'$ denotes the label of pixel $x'$; $S(l)$ and $S(l')$ denote the segmentation labels in label $l$ and label $l'$, respectively; $p_g$ denotes the measure of geometric consistency between two pixels; $D(l)$ and $D(l')$ denote the depth labels in label $l$ and label $l'$, respectively; $x'_{\to t}$ is the pixel obtained by re-projecting pixel $x'$ onto the $t$-th frame according to $D(l')$; $p_v$ denotes the geometric and segmentation consistency between pixel $x_t$ and the pixel at coordinates $x'$; $p_o$ denotes the label prior constraint computed from the label map $L_{t',t}$ warped from an invisible frame.
2. The method for multi-body depth recovery and segmentation of video according to claim 1, characterized in that the method of optimizing the initial labels by multi-body plane fitting in step (2) is as follows:
after each frame is subjected to image segmentation, each segment is assigned each object label in turn, the object labels successively assigned to the same segment differing from one another; then, for each assignment of each segment, the energy equation of Eq. (1) is used to obtain the corresponding minimum energy value and the parameters of the plane on which the segment lies; the smallest of these minimum energy values of the segment is compared with the minimum energy value corresponding to the initial labels: if it is smaller, the object label corresponding to it is assigned, as the segmentation label, to the pixels of the segment, yielding the optimized label of the segment; otherwise the initial labels are taken as the optimized label of the segment.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN2010106169405A | 2010-12-31 | 2010-12-31 | Method for performing multi-body depth recovery and segmentation on video
Publications (2)

Publication Number | Publication Date
---|---
CN102074020A | 2011-05-25
CN102074020B | 2012-08-15
Non-Patent Citations (2)

Title
---
Long Quan et al. Image-Based Modeling by Joint Segmentation. International Journal of Computer Vision, vol. 75, no. 1, 2007.
Sing Bing Kang et al. Extracting View-Dependent Depth Maps from a Collection of Images. International Journal of Computer Vision, vol. 58, no. 2, 2004.