CN102074020B - Method for performing multi-body depth recovery and segmentation on video


Info

Publication number
CN102074020B
Authority
CN
China
Prior art date
Legal status
Active
Application number
CN2010106169405A
Other languages
Chinese (zh)
Other versions
CN102074020A (en)
Inventor
鲍虎军 (Hujun Bao)
章国锋 (Guofeng Zhang)
Current Assignee
Zhejiang Shangtang Technology Development Co Ltd
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN2010106169405A
Publication of CN102074020A
Application granted
Publication of CN102074020B

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for performing multi-body depth recovery and segmentation on a video, which comprises the following steps: (1) performing energy minimization on the video by an iterative method to obtain an initial label of each frame of the video, wherein the initial label consists of the depth and segmentation information of the pixels; (2) after performing image segmentation on each frame, optimizing the initial label of each frame by a multi-body plane fitting method to obtain the optimized labels of all segmented blocks in each frame; (3) selecting a group of visible frames and a group of invisible frames from the adjacent frames for each pixel on each frame by using the optimized labels; and (4) performing energy minimization on each frame of the video by the iterative method to obtain the iterated labels of each frame of the video, and further expanding the number of depth levels of the iterated labels by using a hierarchical belief propagation algorithm. With this method, depth recovery and segmentation can be performed on videos in which multiple rigid objects move.

Description

Method for multi-body depth recovery and segmentation of video
Technical Field
The invention relates to a depth recovery and segmentation method for performing depth recovery and segmentation on videos that contain multiple moving rigid objects.
Background
Depth-based three-dimensional reconstruction and image (or video) segmentation have long been fundamental problems in computer vision, because the computed depth maps and the segmented images can be used separately or together in many important applications, such as object recognition, image-based rendering, and image (or video) editing. However, research on these two kinds of problems has mostly been carried out independently, and only recently have some works begun to study them jointly, e.g. L. Quan, J. Wang, P. Tan, and L. Yuan. Image-based modeling by joint segmentation. International Journal of Computer Vision (IJCV'07).
Multi-view stereo (MVS, depth recovery) techniques can be used to compute depth and three-dimensional geometric information from a set of images. Given the importance of three-dimensional reconstruction, research has also been conducted on reconstructing dynamic scenes that contain moving objects. Three-dimensional motion segmentation distinguishes the feature tracks of multiple moving objects so as to recover the actual positions of the moving objects and the corresponding camera motion. For simplicity, most of these methods use affine camera models (e.g. J. P. Costeira and T. Kanade. A multi-body factorization method for motion analysis. IEEE International Conference on Computer Vision (ICCV'95)), and a few methods have been proposed to deal with the three-dimensional segmentation problem under the perspective camera model, such as K. Schindler, J. U, and H. Wang. Perspective-view multi-structure-and-motion region model selection (ECCV'06). However, none of these methods can be used directly for high-quality three-dimensional reconstruction, especially when image segmentation is also required.
If each moving rigid object is masked out individually, MVS can be applied to each object independently. Classical image segmentation methods such as mean shift, normalized cuts and segmentation by weighted aggregation (SWA) simply process two-dimensional images, without considering the overall geometric information used in MVS.
In order to extract a moving object in the foreground together with its possibly visible boundaries, several two-layer segmentation methods have been proposed, such as A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov. Bilayer segmentation of live video. CVPR'06. These methods assume that the camera is stationary and that the background color is easy to estimate or model. Note, however, that these methods are not applicable to MVS either, since the camera needs to move in MVS.
Recently, Zhang et al. used motion information and depth information in G. Zhang, J. Jia, W. Hua, and H. Bao. Robust bilayer segmentation and motion/depth estimation with a handheld camera. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI'2010) to model the background environment and extract a high-quality foreground layer. The computed depth/motion field and the two-layer segmentation result are iteratively optimized against each other. However, this method is limited to two-layer segmentation. In addition, only the motion information of the foreground layer is computed and its depth is not, which is insufficient for three-dimensional reconstruction.
In two-dimensional motion segmentation, pixels with the same motion trend are roughly divided into a group and finally divided into a plurality of different layers. This method relies heavily on the accuracy of motion estimation and it is difficult to obtain high quality segmentation results, especially when severe occlusion occurs.
In addition, two-dimensional motion segmentation requires both motion and segmentation to be computed, which is a chicken-and-egg problem: inaccurate motion estimation causes inaccurate segmentation, which in turn causes inaccurate motion estimation. The joint optimization of the two therefore often ends in a local optimum.
Disclosure of Invention
The invention aims to provide a multi-body depth recovery and segmentation method for video, which can perform depth recovery and segmentation on videos containing multiple moving rigid objects.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows. The multi-body depth recovery and segmentation method for video comprises the following steps:
(1) performing energy minimization on the video with the energy equation of formula (1) by an iterative method to obtain an initial label of each frame of the video, wherein the initial label consists of depth information and segmentation information of the pixels,
E'(L;\hat{I}) = \sum_{t=1}^{n}\sum_{x_t\in I_t}\Big(1 - P_{init}(x_t,L_t(x_t)) + \lambda_s\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))\Big)    (1)
wherein,
P_{init}(x_t,L_t(x_t)) = \frac{1}{|\varphi'(x_t)|}\sum_{t'\in\varphi'(x_t)} p_c(x_t,L_t(x_t),I_t,I_{t'})    (2)
p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}    (3)
In formulae (1), (2) and (3), I_t denotes the t-th frame image, t = 1…n, where n is the total number of frames of the video; x_t denotes a pixel on I_t; L_t(x_t) denotes the label of x_t; N(x_t) denotes all neighboring pixels of pixel x_t; ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} denotes the label difference between adjacent pixels; η denotes the truncation parameter; φ'(x_t) denotes the set of frames in which pixel x_t is visible, i.e. the pixel corresponding to x_t in each of these frames re-projects onto x_t in the t-th frame; p_c denotes the color similarity between pixel x_t and x'; l denotes the label of x_t; σ_c is a parameter controlling the shape of the difference function in formula (3); x' denotes the pixel in the t'-th frame corresponding to pixel x_t, the t'-th frame being a frame of φ'(x_t); I_t(x_t) denotes the color value of pixel x_t; I_{t'}(x') is the color value of pixel x'; the coordinates of x' are obtained by converting the homogeneous coordinate x'^h given by formula (4) into two-dimensional coordinates:
x'^h \sim K_{t'}R_{t'}^{T}R_{t}K_{t}^{-1}x_{t}^{h} + D(l)\,K_{t'}R_{t'}^{T}(T_{t}-T_{t'})    (4)
In formula (4), h denotes homogeneous coordinates; D(l) denotes the depth information in the label of pixel x_t; K_{t'}, R_{t'} and T_{t'} are respectively the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the t'-th frame; K_t, R_t and T_t are respectively the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the t-th frame;
(2) after each frame is subjected to image segmentation, optimizing the initial labels of each frame of image by using a multi-body plane fitting method to obtain optimized labels of all segmented blocks of each frame of image;
(3) using the optimized labels finally obtained in step (2), selecting for each pixel x_t on the t-th frame a set of visible frames φ_v(x_t) and a set of invisible frames φ_o(x_t) from the adjacent frames: in a visible frame at least one pixel coincides with x_t when transformed to the t-th frame, while in an invisible frame no pixel coincides with x_t when transformed to the t-th frame;
(4) performing energy minimization on each frame of the video with the energy equation shown in formula (5) by an iterative method to obtain the iterated label of each frame of the video, and further expanding the number of depth levels in the iterated labels by a hierarchical belief propagation algorithm,
Et(Lt)=Ed(Lt)+Es(Lt) (5)
wherein,
E_s(L_t) = \lambda_s\sum_{x_t}\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))    (6)

E_d(L_t) = \sum_{x_t\in I_t}\big(1 - P(x_t,L_t(x_t))\big)    (7)

P(x_t,l) = \frac{1}{|\varphi_v(x_t)|+|\varphi_o(x_t)|}\Big(\sum_{t'\in\varphi_o(x_t)}p_o(x_t,l,L_{t',t}) + \sum_{t'\in\varphi_v(x_t)}p_c(x_t,l,I_t,I_{t'})\cdot p_v(x_t,l,L_{t'})\Big)    (8)

p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}    (9)

p_v(x_t,l,L_{t'}) = \begin{cases} 0, & S(l)\neq S(l') \\ p_g(x_t,D(l),D(l')), & S(l)=S(l') \end{cases}    (10)

p_g(x_t,D(l),D(l')) = \exp\Big(-\frac{\|x_t-x_t^{t'\to t}\|^2}{2\sigma_d^2}\Big)    (11)
In formulae (5) to (11), E_d(L_t) and E_s(L_t) denote respectively the data term and the smoothness term of the energy equation; I_t denotes the t-th frame image, t = 1…n, where n is the total number of frames of the video; x_t denotes a pixel on I_t; L_t(x_t) denotes the label of x_t; N(x_t) denotes all neighboring pixels of pixel x_t; ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} denotes the label difference between adjacent pixels; η denotes the truncation parameter; x' denotes the pixel in the t'-th frame corresponding to pixel x_t; I_t(x_t) denotes the color value of pixel x_t; I_{t'}(x') is the color value of pixel x'; the coordinates of x' are obtained by converting the homogeneous coordinate x'^h given by formula (4) into two-dimensional coordinates; p_c denotes the color similarity between pixel x_t and x'; l denotes the label of x_t; l' denotes the label of pixel x'; S(l) and S(l') denote the segmentation labels contained in label l and label l', respectively; p_g denotes the measure of geometric consistency between two pixels; D(l) and D(l') denote the depth labels contained in label l and label l', respectively; x_t^{t'→t} is the pixel obtained by re-projecting pixel x' onto the t-th frame according to D(l'); p_v denotes the geometric consistency and segmentation consistency between pixel x_t and the pixel at coordinates x'.
Further, the method of "optimizing the initial label by using a multi-body plane fitting method" in step (2) of the present invention is as follows:
After each frame has been image-segmented, every segmented block is successively assigned each possible object label (the object labels assigned to the same block in successive trials differ from each other); for each such assignment of each block, the energy equation shown in formula (1) is minimized to obtain the corresponding minimum energy value and the parameters of the plane on which the block lies. The smallest of these minimum energy values of a block is then compared with the minimum energy value corresponding to the initial labels: if the smallest of the minimum energy values of the block is lower than the minimum energy value corresponding to the initial labels, the object label corresponding to that smallest value is assigned as the segmentation label to the pixels of the block, giving the optimized label of the block; otherwise, the initial label is kept as the optimized label of the block.
Compared with the prior art, the invention has the beneficial effects that:
(1) a brand-new multi-body stereo vision model is provided, in which the depth and segmentation labels are represented uniformly by a single label and solved as a global labeling optimization problem, so that multi-view stereo matching is extended for the first time to scenes with several independently moving rigid objects and is solved with a global optimization method;
(2) a strategy for adaptive selection of matched frames is provided: the recovered depth and segmentation information of adjacent frames are projected onto the current frame to judge the visibility of pixels, fill in missing pixels, and obtain a label prior constraint that handles the occlusion problem;
(3) a brand-new multi-body plane fitting method is provided, which effectively solves the difficulty of computing depth and segmentation in featureless regions.
Drawings
FIG. 1 is a basic flow diagram of the present invention;
FIG. 2 is one of the principles of the present invention, depicting projection and re-projection in a multi-view geometry;
FIG. 3 illustrates the method for filling missing pixels in label maps according to the present invention: (a) is frame 61; (b) is frame 76; (c) is the label map L_{76,61} of all pixels obtained by re-projecting the pre-computed labels, where red marks the occluded pixels; (d) is a process example of the missing-pixel filling method for label maps;
FIG. 4 is a comparison of the effect of the adaptive selection of matched frames and the label prior constraint with and without the present invention: (a) is the first frame of the experimental data sequence; (b) and (c) are the two label-map results of this frame (label values shown in grayscale) processed with and without the pre-labeling adaptive frame selection method, respectively; (d) and (e) are enlarged views of the rectangular regions in (b) and (c), respectively, for easier comparison;
FIG. 5 is a pipeline diagram of one example of the invention: (a) is a frame in a sequence of pictures; (b) is a label graph obtained before multi-body plane fitting after initial solution; (c) is a label diagram after the plane fitting of the multi-body plane; (d) is a label diagram after two times of optimization iteration; (e) is a segmentation map of a person and a background; (f) is the result of the three-dimensional reconstruction patch without depth level expansion; (g) the result of the three-dimensional reconstruction patch after the depth level expansion is obtained;
FIG. 6 shows two three-body examples of the invention: (a) and (d) are two selected frames; (b) is the label map of (a); (c) is the segmentation map of (a); (e) is the label map of (d), and (f) is the segmentation map of (d);
FIG. 7 is a reconstructed patch result of one example of the invention: (a) is the geometric information of the background; (b) and (c) the reconstructed patch results of two persons in the frame, respectively;
FIG. 8 is an example result of a box sequence: (a) is a frame in a sequence of pictures; (b) is the label graph of the frame calculated by the invention;
FIG. 9 is an example result of a toy sequence: (a) is a frame in a sequence of pictures; (b) is the label graph of the frame calculated by the invention; (c) is the segmentation result graph of the frame calculated by the invention.
Detailed Description
The invention provides a stable and efficient multi-body depth recovery and segmentation method, and a basic flow chart of the invention is shown in figure 1, which mainly comprises the following steps:
step 1) an initialization processing stage, wherein energy minimization is performed on a video by an iterative method according to an energy equation provided by the invention to obtain an initial label of each frame of the video, wherein the initial label consists of a depth label and a segmentation label of a pixel, and the initialization processing stage specifically comprises the following steps:
the object of the invention is to find both the depth values of all pixels and to which object the pixel belongs, so that there are two labels, one depth label and one segmentation label. The invention uniformly represents the two label values of the pixel by an expanded label:
L = \{d_1^1, d_2^1, \dots, d_{m_1}^1, \dots, d_1^K, d_2^K, \dots, d_{m_K}^K\}
it is assumed here that there are K objects (including the background) in the scene. The pixel disparity values (actually the inverse of the depth, referred to herein as the depth index) of the kth object on the image captured by the camera range from
Figure GDA0000151171500000062
In between, this interval is divided evenly so that each disparity value is defined as follows:
d_i^k = (i-1)\,\Delta d + d_{\min}^k
This is the meaning of all the elements of the set L. In total there are

|L| = \sum_{k=1}^{K} m_k

labels.
The object to which a label l belongs is denoted by S(l), i.e. the segmentation label, and D(l) denotes the disparity value of a pixel with label l, i.e. the depth label. For any element l_i of the set L, S(l_i) and D(l_i) are readily obtained: first find the index h such that the following inequality is satisfied:

1 \le i - \sum_{j=1}^{h-1} m_j \le m_h

then S(l_i) = h and D(l_i) = d^{h}_{\,i-\sum_{j=1}^{h-1} m_j}.
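By way of illustration only (not part of the claimed method), the following Python sketch decodes a combined label index into its object index S(l) and disparity D(l); the names decode_label, m, d_min and delta_d are hypothetical, a 0-based label index is assumed, and a single level spacing Δd is used as in the formula above:

import math

def decode_label(l, m, d_min, delta_d):
    """Return (object index S(l), disparity D(l)) for a combined 0-based label l,
    given per-object level counts m[k] and minimum disparities d_min[k]."""
    offset = 0
    for k, mk in enumerate(m):
        if l < offset + mk:          # label falls inside object k's range of levels
            i = l - offset           # level index within object k (0-based)
            return k, d_min[k] + i * delta_d
        offset += mk
    raise ValueError("label out of range")

# Example: three objects with 51 depth levels each
m = [51, 51, 51]
d_min = [0.001, 0.002, 0.0015]
print(decode_label(60, m, d_min, delta_d=0.0001))  # -> object 1, its 10th level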
the following energy equations were optimized:
E'(L;\hat{I}) = \sum_{t=1}^{n}\sum_{x_t\in I_t}\Big(1 - P_{init}(x_t,L_t(x_t)) + \lambda_s\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))\Big)
wherein P isinitIs defined as follows:
P_{init}(x_t,L_t(x_t)) = \frac{1}{|\varphi'(x_t)|}\sum_{t'\in\varphi'(x_t)} p_c(x_t,L_t(x_t),I_t,I_{t'})
where x_t is a pixel on the t-th frame image I_t, t = 1…n, n being the total number of frames of the video; N(x_t) is the set of all neighboring pixels of x_t, and ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} represents the label difference between neighboring pixels. η is a truncation parameter, so that when the energy function is minimized the discontinuities at object boundaries are not lost because the energy value of the smoothness term becomes too large, i.e. over-smoothing is prevented.
Here φ'(x_t) refers to the selected frames in which the pixel x_t is visible. These frames are selected by the method disclosed in S. B. Kang and R. Szeliski. Extracting view-dependent depth maps from a collection of images. International Journal of Computer Vision (IJCV'2004).
Here p_c is defined as follows:
p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}
where σ_c is a parameter controlling the shape of the difference function of the above formula, and I_{t'}(x') is the color value of the pixel x' corresponding to x_t in a frame of φ'(x_t); the homogeneous coordinate x'^h of x' is obtained as follows:
x'^h \sim K_{t'}R_{t'}^{T}R_{t}K_{t}^{-1}x_{t}^{h} + D(l)\,K_{t'}R_{t'}^{T}(T_{t}-T_{t'})
Converting the homogeneous coordinate x'^h above into two-dimensional coordinates gives the coordinates of x'. Here D(l) is the disparity in the label l of pixel x_t; K_{t'}, R_{t'} and T_{t'} are respectively the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the t'-th frame, and K_t, R_t and T_t are those of the camera for the t-th frame. Note that the camera intrinsic and extrinsic parameters of all frames are known before any step of the invention is performed.
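As a concrete illustration of the re-projection in formula (4) and the color similarity p_c in formula (3), the following numpy sketch reprojects a pixel of frame t into frame t' under a hypothesized disparity D(l); the function names and the assumption that K is a 3x3 intrinsic matrix, R a 3x3 rotation and T a 3-vector translation per frame are illustrative only, not the patented implementation:

import numpy as np

def reproject(x_t, d, K_t, R_t, T_t, K_tp, R_tp, T_tp):
    """Reproject pixel x_t = (u, v) of frame t into frame t' for disparity d = D(l),
    following formula (4); returns the 2-D coordinates x' in frame t'."""
    xh = np.array([x_t[0], x_t[1], 1.0])                     # homogeneous pixel of frame t
    xph = K_tp @ R_tp.T @ R_t @ np.linalg.inv(K_t) @ xh \
          + d * K_tp @ R_tp.T @ (T_t - T_tp)
    return xph[:2] / xph[2]                                   # back to 2-D coordinates

def p_c(color_t, color_tp, sigma_c=10.0):
    """Color similarity of formula (3) between the two corresponding pixels."""
    return sigma_c / (sigma_c + np.linalg.norm(np.asarray(color_t) - np.asarray(color_tp)))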
Step 2) after each frame is subjected to image segmentation, optimizing the initial label of each frame of image by using a multi-body plane fitting method to obtain the optimized labels of all segmented blocks of each frame of image, wherein the specific steps are as follows:
First, mean shift is used to compute the segmentation of I_t, i.e. I_t is partitioned into segmented blocks s_i.
It is assumed that there are K objects (including the background) in the environment. For each segmented block s_i obtained above, assume that it belongs to object k and lies on a plane described by three parameters [a_i, b_i, c_i]. Minimizing the energy equation (1) then yields the values of these three parameters together with a minimum energy. Enumerating the object k to which the block s_i may belong gives K minimum energies E'_0, E'_1, …, E'_{K-1}; the smallest of them (denoted E'_j) determines the corresponding object j and the corresponding three parameters, which are taken as the minimum energy of the block, its object, and its plane parameters [a_i, b_i, c_i].
The initial per-pixel labels were computed in step 1) of the initialization stage; substituting all pixels of s_i with their initial labels into the energy equation (1) gives an energy E'_t. If E'_j < E'_t, the labels of all pixels x_t in s_i are updated so that their object is j, i.e. S(x_t) = j, and their depth labels are given by the fitted plane with parameters [a_i, b_i, c_i].
A comparison of the results before and after optimization with the multi-body plane fitting algorithm is shown in FIG. 5.
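The per-block decision of this step can be summarized by the following Python sketch; fit_plane_and_energy (minimizing formula (1) over one block for a fixed object hypothesis) and energy_of_initial_labels (evaluating formula (1) with the block's initial labels) are assumed helper routines that stand in for the computations described above:

def optimize_block_labels(blocks, K, fit_plane_and_energy, energy_of_initial_labels):
    """For every segmented block, enumerate the K object hypotheses, keep the best
    plane fit, and adopt it only if it lowers the energy of the initial labels."""
    optimized = {}
    for s in blocks:
        # fit_plane_and_energy(s, k) -> (E'_k, plane parameters [a, b, c])
        energies = [fit_plane_and_energy(s, k) for k in range(K)]
        j = min(range(K), key=lambda k: energies[k][0])   # best object hypothesis
        e_j, plane_j = energies[j]
        e_init = energy_of_initial_labels(s)              # E'_t from the initial labels
        if e_j < e_init:
            optimized[s] = (j, plane_j)    # S(x_t) = j for all pixels of s, plane [a, b, c]
        else:
            optimized[s] = None            # keep the initial labels of the block
    return optimized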
Step 3) adaptive selection of matched frames and label prior constraint: for each pixel x_t on the t-th frame, according to the labels obtained in the initialization stage or in the last iteration of optimization, two sets of frames are selected from the adjacent frames, one set consisting of frames in which the pixel x_t is visible, denoted φ_v(x_t), and the other set consisting of frames in which the pixel x_t is not visible, denoted φ_o(x_t). The specific steps are as follows:
1) After the optimized labels have been obtained in step 2), the label map of the t'-th frame is transformed (forward-warped) to the t-th frame by the method of W. R. Mark, L. McMillan, and G. Bishop. Post-rendering 3D warping (SI3D'1997), which gives L_{t',t}. If, by such re-projection, none of the pixels on the label map of the t'-th frame is projected onto x_t, as shown in FIG. 2, then the t'-th frame is assigned to φ_o(x_t); otherwise it belongs to φ_v(x_t).
2) In practice the matching computation need not be performed over all frames; at most N_1 frames are selected for matching (N_1 is generally 16 to 20). If |φ_v(x_t)| turns out to be smaller than a lower limit N_2 (generally 5), additional adjacent frames in which no pixel re-projects onto x_t are added to φ_o(x_t) so that |φ_v(x_t)| + |φ_o(x_t)| = N_2 (a sketch of this selection rule is given at the end of this step).
3) Note that the matching cost of an occluded pixel cannot be computed, so if a pixel is not visible in any of the neighboring frames its depth value cannot be obtained directly. Even in this case, however, the depth value of the pixel and the object it belongs to can still be approximated, which is the purpose of φ_o(x_t). The method is as follows: in the label map projected from an adjacent frame, as shown in FIG. 3, for each missing pixel x_t, search along the horizontal and vertical directions to find the two nearest valid projected pixels in each direction; among these four pixels, select the one with the smallest label, denoted x*, and take its label as the label of x_t. That is, the label of x* in L_{t',t} replaces the label of x_t in L_{t',t}, L_{t',t}(x_t) = L_{t',t}(x*), and the confidence of this inference is determined by the distance between the two pixels; this confidence is defined as follows:
\omega_o(x) = \exp\Big(-\frac{\|x - x^*\|^2}{2\sigma_\omega^2}\Big)

where the constant σ_ω is set to 10.
Although this missing-label inference is not very accurate, it improves the data term in regions where occlusion matters. The label prior constraint is defined as follows:
p_o(x_t, l, L_{t',t}) = \lambda_o \cdot \omega_o(x_t) \cdot \frac{\beta}{\beta + |l - L_{t',t}(x_t)|}

where λ_o is a weight and β controls the shape of the difference function of the above formula. The formula requires that when ω_o(x) is very high, L_t(x_t) must stay close to L_{t',t}(x_t).
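A minimal numpy sketch of sub-step 3) and of the label prior above follows; nearest_valid (returning the nearest valid projected pixels found along the horizontal and vertical searches), the dictionary-like label map, and the default parameter values taken from the experiments later in the text are all assumptions made for illustration:

import numpy as np

def fill_missing_label(x, L_proj, nearest_valid, sigma_w=10.0):
    """Fill a missing pixel x of the warped label map L_{t',t}: take the label of
    the nearest valid neighbour with the smallest label, and return it together
    with the confidence w_o(x)."""
    candidates = nearest_valid(L_proj, x)                  # up to four valid pixels
    x_star = min(candidates, key=lambda p: L_proj[p])      # neighbour with smallest label
    dist2 = float(np.sum((np.asarray(x) - np.asarray(x_star)) ** 2))
    w_o = np.exp(-dist2 / (2.0 * sigma_w ** 2))
    return L_proj[x_star], w_o

def label_prior(l, l_proj, w_o, num_labels, lambda_o=0.3):
    """Label prior p_o: keeps l close to the projected label when the confidence
    w_o is high (a sketch of the formula above, read as a product of the three factors)."""
    beta = 0.02 * num_labels                               # beta = 0.02 |L| (default in the text)
    return lambda_o * w_o * beta / (beta + abs(l - l_proj))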
FIG. 4 shows label maps computed with the adaptive selection of matched frames and the label prior constraint, compared with label maps computed without them. It can be seen that the label map computed using the adaptive selection of matched frames and the label prior constraint is significantly improved at discontinuity boundaries.
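Returning to sub-steps 1) and 2) above, the adaptive frame-selection rule can be sketched as follows (Python); warp_label_map and has_projection are assumed helpers, and treating the occluded set purely as a fallback when there are enough visible frames is one possible reading of the text:

def select_matched_frames(x_t, t, candidate_frames, warp_label_map, N1=16, N2=5):
    """Split at most N1 neighbouring frames into a visible set phi_v and an
    occluded set phi_o for pixel x_t of frame t."""
    phi_v, phi_o = [], []
    for tp in candidate_frames[:N1]:              # examine at most N1 neighbours
        L_proj = warp_label_map(tp, t)            # forward-warp labels: L_{t',t}
        if L_proj.has_projection(x_t):            # some pixel of frame t' lands on x_t
            phi_v.append(tp)
        else:
            phi_o.append(tp)
    if len(phi_v) < N2:                           # too few visible frames:
        phi_o = phi_o[:N2 - len(phi_v)]           # top up so |phi_v| + |phi_o| = N2
    else:
        phi_o = []                                # occluded set used only as a fallback
    return phi_v, phi_o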
Step 4) the iterative optimization stage: iterative optimization is performed on all frames of the monocular video sequence according to the energy equation provided by the invention to obtain the label maps of all frames, and the precision of depth recovery is then improved with a hierarchical belief propagation algorithm. The specific steps are as follows:
1) according to the energy equation
Et(Lt)=Ed(Lt)+Es(Lt)
Minimizing this energy yields the depth values and segmented blocks of all pixels. The energy equation is optimized in two passes with a belief propagation algorithm to obtain a relatively accurate result.
Here the smoothness term

E_s(L_t) = \lambda_s\sum_{x_t}\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))

measures the label difference between adjacent pixels in the image, so that the label difference between adjacent pixels is as small as possible. ρ is defined as ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η}, representing the label difference between neighboring pixels. η is a truncation value, so that when the energy function is minimized, discontinuities at object boundaries are not lost because the energy value of the smoothness term becomes too large, i.e. over-smoothing is prevented.
The data term E_d(L_t) is defined as follows:

E_d(L_t) = \sum_{x_t\in I_t}\big(1 - P(x_t,L_t(x_t))\big)

where

P(x_t,l) = \frac{1}{|\varphi_v(x_t)|+|\varphi_o(x_t)|}\Big(\sum_{t'\in\varphi_o(x_t)}p_o(x_t,l,L_{t',t}) + \sum_{t'\in\varphi_v(x_t)}p_c(x_t,l,I_t,I_{t'})\cdot p_v(x_t,l,L_{t'})\Big)

in which

p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}

describes the color similarity between pixel x_t and x'; x_t and x' are pixels on the t-th and t'-th frames, respectively. p_v describes the geometric and segmentation consistency between pixel x_t and x', i.e. whether x_t and x' lie on the same object and have consistent depths. It is defined as follows:
p_v(x_t,l,L_{t'}) = \begin{cases} 0, & S(l)\neq S(l') \\ p_g(x_t,D(l),D(l')), & S(l)=S(l') \end{cases}

where l' is the label of x'. If l and l' belong to different objects, i.e. S(l) ≠ S(l'), the two pixels are not corresponding pixels in the two frames and must be separated. Otherwise p_g measures the geometric consistency between the two pixels, defined as follows:

p_g(x_t,D(l),D(l')) = \exp\Big(-\frac{\|x_t-x_t^{t'\to t}\|^2}{2\sigma_d^2}\Big)

Here x_t^{t'→t} is the point obtained by projecting x' of the t'-th frame onto the corresponding point in the t-th frame according to the computed D(l').
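Before turning to the memory issue in sub-step 2), the following Python sketch assembles the smoothness and data terms defined above; rho_trunc, the per-frame helpers p_c, p_v and p_o, and the container types (dictionaries keyed by pixel) are assumptions used only for illustration:

def rho_trunc(l_x, l_y, eta):
    """Truncated label difference between neighbouring pixels."""
    return min(abs(l_x - l_y), eta)

def smoothness_energy(labels, neighbours, lambda_s, eta):
    """E_s(L_t): weighted sum of truncated label differences over neighbour pairs."""
    return lambda_s * sum(rho_trunc(labels[x], labels[y], eta)
                          for x in labels for y in neighbours(x))

def data_prob(x_t, l, phi_v, phi_o, p_c, p_v, p_o):
    """P(x_t, l): occlusion priors over phi_o averaged with color * visibility terms over phi_v."""
    total = sum(p_o(x_t, l, tp) for tp in phi_o) \
          + sum(p_c(x_t, l, tp) * p_v(x_t, l, tp) for tp in phi_v)
    return total / (len(phi_v) + len(phi_o))

def data_energy(labels, phi_v, phi_o, p_c, p_v, p_o):
    """E_d(L_t): sum of 1 - P(x_t, L_t(x_t)) over all pixels (phi_v/phi_o keyed by pixel)."""
    return sum(1.0 - data_prob(x, labels[x], phi_v[x], phi_o[x], p_c, p_v, p_o)
               for x in labels)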
2) In the above process, the invention adopts a belief propagation (BP) algorithm to optimize the objective function. Since memory usage is proportional to the number of labels, the memory requirement can easily exceed the memory of the machine when processing high-resolution images. A hierarchical solving strategy can reduce memory usage, but it degrades the quality of object segmentation and depth recovery to some extent, especially in discontinuous boundary regions. In order to obtain object segmentation and depth recovery results of as high quality as possible, the invention adopts a simple region-partition-based solving strategy to overcome the memory bottleneck. The procedure is simple: the image is cut uniformly into M × M regions and energy optimization is performed on each region; if a color segment spans several regions, it is split accordingly. The strategy is simple and effective, overcomes the memory bottleneck, and has little influence on the processing result.
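The region-partition strategy of sub-step 2) can be sketched as follows (Python); optimize_region, which would run BP on one rectangular sub-region, is an assumed placeholder:

def optimize_by_regions(image_shape, M, optimize_region):
    """Cut the image uniformly into an M x M grid of regions and run the energy
    optimization on each region separately to bound BP memory usage."""
    h, w = image_shape
    rh, rw = (h + M - 1) // M, (w + M - 1) // M          # region size, rounded up
    results = []
    for i in range(M):
        for j in range(M):
            r0, r1 = i * rh, min((i + 1) * rh, h)        # row range of the region
            c0, c1 = j * rw, min((j + 1) * rw, w)        # column range of the region
            if r0 >= r1 or c0 >= c1:
                continue                                 # skip empty trailing regions
            # color segments spanning several regions are split along the cut
            results.append(optimize_region((r0, r1, c0, c1)))
    return results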
To this end, all steps of the inventive multi-volume depth recovery and segmentation are completed.
At initialization and in the two iterative optimizations, the number of depth levels m_k per object is usually set between 51 and 101. After two iterations of optimization, the segmentation result is usually already very accurate. Then, in order to further improve the accuracy of the depth, the invention fixes the segmentation label, at which point the label is effectively equivalent to a depth level. The invention adopts a coarse-to-fine hierarchical belief propagation algorithm, which can effectively expand the number of depth levels in the global optimization without adding much computational cost, thereby improving the precision of depth recovery.
In the experiments, the image resolution of the video sequences was 960 × 540. Most parameters of the system can keep their default values and need not be adjusted during processing, e.g. λ_s = 5/|L|, η = 0.03|L|, λ_o = 0.3, σ_c = 10, σ_d = 2, β = 0.02|L|; with these settings the results shown in FIG. 8 and FIG. 9 were obtained.
One set of experiments of the invention is on a box sequence, as shown in FIG. 8, where FIG. 8(a) is a frame of the video and FIG. 8(b) is the label map of that frame obtained with the invention. Another set of experiments is on a toy sequence, as shown in FIG. 9, where FIG. 9(a) is a frame of the video, FIG. 9(b) is the label map of that frame obtained with the invention, and FIG. 9(c) is the segmentation map of that frame obtained with the invention. Comparing the label maps of FIG. 8 and FIG. 9 with the original frames shows that the label maps are well defined overall and accurately recovered at the boundaries; comparing the segmentation map of FIG. 9 with the original frame shows that the segmentation result is very accurate and different objects are segmented precisely, which illustrates that the results obtained by the proposed algorithm are highly accurate in the multi-body case.

Claims (2)

1. A method for multi-volume depth recovery and segmentation of video, characterized by comprising the steps of:
(1) performing energy minimization on the video with the energy equation of formula (1) by an iterative method to obtain an initial label of each frame of the video, wherein the initial label consists of depth information and segmentation information of the pixels,
E'(L;\hat{I}) = \sum_{t=1}^{n}\sum_{x_t\in I_t}\Big(1 - P_{init}(x_t,L_t(x_t)) + \lambda_s\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))\Big)    (1)

wherein,

P_{init}(x_t,L_t(x_t)) = \frac{1}{|\varphi'(x_t)|}\sum_{t'\in\varphi'(x_t)} p_c(x_t,L_t(x_t),I_t,I_{t'})    (2)

p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}    (3)
In formulae (1), (2) and (3), I_t denotes the t-th frame image, t = 1…n, where n is the total number of frames of the video; x_t denotes a pixel on I_t; L_t(x_t) denotes the label of x_t; N(x_t) denotes all neighboring pixels of pixel x_t; ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} denotes the label difference between adjacent pixels; η denotes the truncation parameter; φ'(x_t) denotes the set of frames in which pixel x_t is visible, i.e. the pixel corresponding to x_t in each of these frames re-projects onto x_t in the t-th frame; p_c denotes the color similarity between pixel x_t and x'; l denotes the label of x_t; σ_c is a parameter controlling the shape of the difference function in formula (3); x' denotes the pixel in the t'-th frame corresponding to pixel x_t, the t'-th frame being a frame of φ'(x_t); I_t(x_t) denotes the color value of pixel x_t; I_{t'}(x') is the color value of pixel x'; the coordinates of x' are obtained by converting the homogeneous coordinate x'^h given by formula (4) into two-dimensional coordinates:
x'^{h} \sim K_{t'} R_{t'}^{T} R_t K_t^{-1} x_t^{h} + D(l)\, K_{t'} R_{t'}^{T} (T_t - T_{t'})    (4)
in formula (4), the superscript h denotes a homogeneous coordinate; D(l) denotes the depth information in the label of pixel x_t; K_{t'}, R_{t'} and T_{t'} are respectively the intrinsic parameter matrix, the rotation matrix of the extrinsic parameters and the translation matrix of the extrinsic parameters of the camera for the t'-th frame; K_t, R_t and T_t are respectively the intrinsic parameter matrix, the rotation matrix of the extrinsic parameters and the translation matrix of the extrinsic parameters of the camera for the t-th frame;
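To make the geometry of formulas (3) and (4) concrete, the following Python sketch evaluates the warp of formula (4) for one pixel and one candidate depth value D(l), and then the color-similarity term of formula (3). It is only an illustration of the two formulas as written: the function names, the nearest-neighbour color sampling and the value of σ_c are assumptions, not specified by the patent.

import numpy as np

def reproject(x_t, depth_l, K_t, R_t, T_t, K_tp, R_tp, T_tp):
    # Formula (4): x'^h ~ K_t' R_t'^T R_t K_t^-1 x_t^h + D(l) K_t' R_t'^T (T_t - T_t')
    x_h = np.array([x_t[0], x_t[1], 1.0])                       # homogeneous coordinate of x_t
    rot_part = K_tp @ R_tp.T @ R_t @ np.linalg.inv(K_t) @ x_h   # rotation / intrinsics part
    trans_part = depth_l * (K_tp @ R_tp.T @ (T_t - T_tp))       # translation part scaled by D(l)
    x_ph = rot_part + trans_part
    return x_ph[:2] / x_ph[2]                                    # back to 2-D pixel coordinates

def color_similarity(I_t, I_tp, x_t, x_p, sigma_c=10.0):
    # Formula (3): p_c = sigma_c / (sigma_c + ||I_t(x_t) - I_t'(x')||),
    # with nearest-neighbour sampling of the color images (arrays of shape H x W x 3).
    c_t  = I_t[int(round(x_t[1])),  int(round(x_t[0]))].astype(float)
    c_tp = I_tp[int(round(x_p[1])), int(round(x_p[0]))].astype(float)
    return sigma_c / (sigma_c + np.linalg.norm(c_t - c_tp))

P_init of formula (2) would then simply be the average of such color-similarity values over the frames in φ'(x_t).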
(2) after performing image segmentation on each frame, optimizing the initial labels of each frame image by a multi-body plane fitting method to obtain the optimized labels of all the segmentation blocks in each frame image;
(3) using the optimized labels finally obtained in step (2), selecting, for each pixel x_t on the t-th frame, a set of visible frames φ_v(x_t) and a set of invisible frames φ_o(x_t) from the neighboring frames: none of the pixels of a visible frame coincides with x_t when transformed to the t-th frame, whereas at least one pixel of an invisible frame coincides with x_t when transformed to the t-th frame;
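Read literally, the selection of φ_v(x_t) and φ_o(x_t) in step (3) is a per-pixel classification of the neighboring frames. The Python sketch below follows that literal reading only; the warp_to_frame_t helper (warping every pixel of a neighboring frame into the t-th frame using its current optimized label) and the exact coincidence test are assumptions made for illustration.

def classify_neighbor_frames(x_t, t, neighbor_frames, warp_to_frame_t):
    # For each neighboring frame t', warp all of its pixels into frame t; if some warped
    # pixel coincides with x_t the frame goes into the invisible set phi_o(x_t),
    # otherwise it goes into the visible set phi_v(x_t).
    phi_v, phi_o = [], []
    for t_prime in neighbor_frames:
        warped = warp_to_frame_t(t_prime, t)   # iterable of integer pixel positions in frame t
        if any(tuple(p) == tuple(x_t) for p in warped):
            phi_o.append(t_prime)
        else:
            phi_v.append(t_prime)
    return phi_v, phi_o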
(4) performing energy minimization on each frame of the video by the iterative method with the energy equation shown in formula (5) to obtain the iterated labels of each frame of the video, and further expanding the number of depth levels in the iterated labels by a hierarchical belief propagation algorithm,
E_t(L_t) = E_d(L_t) + E_s(L_t)    (5)
wherein,
E_s(L_t) = \lambda_s \sum_{x_t} \sum_{y_t \in N(x_t)} \rho\big(L_t(x_t), L_t(y_t)\big)    (6)
E_d(L_t) = \sum_{x_t \in I_t} \big(1 - P(x_t, L_t(x_t))\big)    (7)
P(x_t, l) = \frac{1}{|\varphi_v(x_t)| + |\varphi_o(x_t)|} \Big( \sum_{t' \in \varphi_o(x_t)} p_o(x_t, l, L_{t'}, t) + \sum_{t' \in \varphi_v(x_t)} p_c(x_t, l, I_t, I_{t'}) \cdot p_v(x_t, l, L_{t'}) \Big)    (8)
p_c(x_t, l, I_t, I_{t'}) = \frac{\sigma_c}{\sigma_c + \lVert I_t(x_t) - I_{t'}(x') \rVert}    (9)
p_v(x_t, l, L_{t'}) = \begin{cases} 0, & S(l) \neq S(l') \\ p_g(x_t, D(l), D(l')), & S(l) = S(l') \end{cases}    (10)
p_g(x_t, D(l), D(l')) = \exp\Big( -\frac{\lVert x_t - x_t^{\,t' \to t} \rVert^2}{2\sigma_d^2} \Big)    (11)
in formulas (5) to (11): E_d(L_t) and E_s(L_t) denote the data term and the smoothness term of the energy equation respectively; I_t denotes the t-th frame image, t = 1…n, where n is the total number of frames of the video; x_t denotes a pixel on I_t; L_t(x_t) denotes the label of x_t; N(x_t) denotes all the neighboring pixels of pixel x_t; ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} represents the label difference between adjacent pixels; η denotes the truncation parameter; x' denotes the pixel in the t'-th frame corresponding to pixel x_t; I_t(x_t) denotes the color value of pixel x_t; I_{t'}(x') is the color value of pixel x'; the coordinates of x' are obtained by converting the homogeneous coordinate x'^h given by formula (4) into two-dimensional coordinates; p_c measures the color similarity between pixel x_t and x'; l denotes the label of x_t; l' denotes the label of pixel x'; S(l) and S(l') denote the segmentation components of label l and label l' respectively; p_g measures the geometric consistency between two pixels; D(l) and D(l') denote the depth components of label l and label l' respectively; x_t^{t'→t} is the pixel obtained by re-projecting pixel x' onto the t-th frame according to D(l'); p_v measures the geometric consistency and segmentation consistency between pixel x_t and the pixel corresponding to coordinates x'.
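As a sketch of how formulas (8)-(11) combine, the following Python fragment evaluates the geometric-consistency term p_g, the visibility term p_v and the averaged data probability P(x_t, l). The p_c values are as sketched after formula (4); the occlusion term p_o is assumed to be defined elsewhere in the patent and is passed in pre-computed; seg_of, the container layout and the value of σ_d are illustrative assumptions.

import numpy as np

def geometric_consistency(x_t, x_back, sigma_d=1.0):
    # Formula (11): p_g = exp(-||x_t - x_t^{t'->t}||^2 / (2 sigma_d^2)),
    # where x_back is the pixel obtained by re-projecting x' back onto frame t using D(l').
    err = np.linalg.norm(np.asarray(x_t, float) - np.asarray(x_back, float))
    return np.exp(-err ** 2 / (2.0 * sigma_d ** 2))

def visibility_consistency(l, l_prime, x_t, x_back, seg_of, sigma_d=1.0):
    # Formula (10): zero when the two labels belong to different objects, p_g otherwise.
    if seg_of(l) != seg_of(l_prime):
        return 0.0
    return geometric_consistency(x_t, x_back, sigma_d)

def data_probability(pc_terms, pv_terms, po_terms, phi_v, phi_o):
    # Formula (8): average of p_o over the invisible frames and of p_c * p_v over the
    # visible frames; pc_terms, pv_terms and po_terms are dicts keyed by frame index.
    total = sum(po_terms[tp] for tp in phi_o) + \
            sum(pc_terms[tp] * pv_terms[tp] for tp in phi_v)
    return total / (len(phi_v) + len(phi_o))

The data term of formula (7) is then the sum of 1 − P(x_t, L_t(x_t)) over the pixels of the frame, and formula (6) adds the truncated label-difference penalty between neighboring pixels.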
2. The method for performing multi-body depth recovery and segmentation on a video according to claim 1, wherein the method of "optimizing the initial labels by using multi-body plane fitting" in step (2) is as follows:
after image segmentation is performed on each frame, each segmentation block is assigned the object labels one by one, a different object label being used in each trial; for each assignment of each segmentation block, the energy equation shown in formula (1) is used to obtain the corresponding minimum energy value and the parameters of the plane on which the segmentation block lies. The smallest of the minimum energy values of a segmentation block is then compared with the minimum energy value corresponding to its initial label: if the smallest of the minimum energy values of the segmentation block is smaller than the minimum energy value corresponding to the initial label, the object label corresponding to that smallest energy value is taken as the segmentation label and assigned to the pixels of the segmentation block, giving the optimized label of the segmentation block; otherwise, the initial label is kept as the optimized label of the segmentation block.
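The control flow of this refinement amounts to an exhaustive trial of object labels per segmentation block, as in the Python sketch below. fit_plane_and_energy (minimising the energy of formula (1) restricted to one block under one trial object label and returning the fitted plane) and depth_from_plane are hypothetical helpers standing in for details given elsewhere in the patent; only the comparison logic of this claim is shown.

def refine_segment_labels(blocks, object_ids, init_energy, init_labels,
                          fit_plane_and_energy, depth_from_plane):
    refined = dict(init_labels)
    for block in blocks:
        best_energy, best_plane, best_obj = float('inf'), None, None
        for obj in object_ids:
            # Minimum of formula (1) restricted to this block under object label obj,
            # together with the parameters of the fitted plane.
            energy, plane = fit_plane_and_energy(block, obj)
            if energy < best_energy:
                best_energy, best_plane, best_obj = energy, plane, obj
        if best_energy < init_energy[block]:
            # The winning object label becomes the segmentation label of every pixel in
            # the block, and the depth component is read off the fitted plane.
            for px in block:
                refined[px] = (best_obj, depth_from_plane(best_plane, px))
        # Otherwise the initial label is kept as the optimized label of the block.
    return refined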
CN2010106169405A 2010-12-31 2010-12-31 Method for performing multi-body depth recovery and segmentation on video Active CN102074020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010106169405A CN102074020B (en) 2010-12-31 2010-12-31 Method for performing multi-body depth recovery and segmentation on video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010106169405A CN102074020B (en) 2010-12-31 2010-12-31 Method for performing multi-body depth recovery and segmentation on video

Publications (2)

Publication Number Publication Date
CN102074020A CN102074020A (en) 2011-05-25
CN102074020B true CN102074020B (en) 2012-08-15

Family

ID=44032549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010106169405A Active CN102074020B (en) 2010-12-31 2010-12-31 Method for performing multi-body depth recovery and segmentation on video

Country Status (1)

Country Link
CN (1) CN102074020B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017108413A1 (en) * 2015-12-21 2017-06-29 Koninklijke Philips N.V. Processing a depth map for an image

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013026826A (en) * 2011-07-21 2013-02-04 Sony Corp Image processing method, image processing device and display device
US8938114B2 (en) * 2012-01-11 2015-01-20 Sony Corporation Imaging device and method for imaging hidden objects
US9621869B2 (en) * 2012-05-24 2017-04-11 Sony Corporation System and method for rendering affected pixels
CN102903096B (en) * 2012-07-04 2015-06-17 北京航空航天大学 Monocular video based object depth extraction method
CN103002309B (en) * 2012-09-25 2014-12-24 浙江大学 Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera
CN103198486B (en) * 2013-04-10 2015-09-09 浙江大学 A kind of depth image enhancement method based on anisotropy parameter
CN103500447B (en) * 2013-09-18 2015-03-18 中国石油大学(华东) Video foreground and background partition method based on incremental high-order Boolean energy minimization
US20150381972A1 (en) * 2014-06-30 2015-12-31 Microsoft Corporation Depth estimation using multi-view stereo and a calibrated projector
CN104616286B (en) * 2014-12-17 2017-10-31 浙江大学 Quick semi-automatic multi views depth restorative procedure
CN104574379B (en) * 2014-12-24 2017-08-25 中国科学院自动化研究所 A kind of methods of video segmentation learnt based on target multi-part
CN106056622B (en) * 2016-08-17 2018-11-06 大连理工大学 A kind of multi-view depth video restored method based on Kinect cameras
US11361508B2 (en) * 2020-08-20 2022-06-14 Qualcomm Incorporated Object scanning using planar segmentation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101142593A (en) * 2005-03-17 2008-03-12 英国电讯有限公司 Method of tracking objects in a video sequence
CN101271578A (en) * 2008-04-10 2008-09-24 清华大学 Depth sequence generation method of technology for converting plane video into stereo video
CN101789124A (en) * 2010-02-02 2010-07-28 浙江大学 Segmentation method for space-time consistency of video sequence of parameter and depth information of known video camera

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101142593A (en) * 2005-03-17 2008-03-12 英国电讯有限公司 Method of tracking objects in a video sequence
CN101271578A (en) * 2008-04-10 2008-09-24 清华大学 Depth sequence generation method of technology for converting plane video into stereo video
CN101789124A (en) * 2010-02-02 2010-07-28 浙江大学 Segmentation method for space-time consistency of video sequence of parameter and depth information of known video camera

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Long Quan et al. Image-Based Modeling by Joint Segmentation. International Journal of Computer Vision, 2007, Vol. 75, No. 1. *
Sing Bing Kang et al. Extracting View-Dependent Depth Maps from a Collection of Images. International Journal of Computer Vision, 2004, Vol. 58, No. 2. *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017108413A1 (en) * 2015-12-21 2017-06-29 Koninklijke Philips N.V. Processing a depth map for an image

Also Published As

Publication number Publication date
CN102074020A (en) 2011-05-25

Similar Documents

Publication Publication Date Title
CN102074020B (en) Method for performing multi-body depth recovery and segmentation on video
Roussos et al. Dense multibody motion estimation and reconstruction from a handheld camera
EP2595116A1 (en) Method for generating depth maps for converting moving 2d images to 3d
CN106910242A (en) The method and system of indoor full scene three-dimensional reconstruction are carried out based on depth camera
CN111882668B (en) Multi-view three-dimensional object reconstruction method and system
CN109242873A (en) A method of 360 degree of real-time three-dimensionals are carried out to object based on consumer level color depth camera and are rebuild
Zhang et al. Recovering consistent video depth maps via bundle optimization
Lee et al. Silhouette segmentation in multiple views
US20090285544A1 (en) Video Processing
CN103002309B (en) Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera
Zhang et al. Simultaneous multi-body stereo and segmentation
Bebeselea-Sterp et al. A comparative study of stereovision algorithms
WO2018133119A1 (en) Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera
CN103049929A (en) Multi-camera dynamic scene 3D (three-dimensional) rebuilding method based on joint optimization
Kahl et al. Multiview reconstruction of space curves
Wang et al. Vid2Curve: simultaneous camera motion estimation and thin structure reconstruction from an RGB video
Lee et al. Automatic 2d-to-3d conversion using multi-scale deep neural network
Mahmoud et al. Fast 3d structure from motion with missing points from registration of partial reconstructions
Kim et al. Multi-view object extraction with fractional boundaries
Fan et al. Collaborative three-dimensional completion of color and depth in a specified area with superpixels
Klose et al. Reconstructing Shape and Motion from Asynchronous Cameras.
Engels et al. Automatic occlusion removal from façades for 3D urban reconstruction
Guo et al. Mesh-guided optimized retexturing for image and video
Ruhl et al. Interactive scene flow editing for improved image-based rendering and virtual spacetime navigation
Gupta et al. 3dfs: Deformable dense depth fusion and segmentation for object reconstruction from a handheld camera

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210709

Address after: Room 288-8, 857 Shixin North Road, ningwei street, Xiaoshan District, Hangzhou City, Zhejiang Province

Patentee after: ZHEJIANG SHANGTANG TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 310027 No. 38, Zhejiang Road, Hangzhou, Zhejiang, Xihu District

Patentee before: ZHEJIANG University

TR01 Transfer of patent right