CN102074020B - Method for performing multi-body depth recovery and segmentation on video


Info

Publication number
CN102074020B
Authority
CN
China
Prior art date
Legal status
Active
Application number
CN2010106169405A
Other languages
Chinese (zh)
Other versions
CN102074020A (en)
Inventor
鲍虎军 (Hujun Bao)
章国锋 (Guofeng Zhang)
Current Assignee
Zhejiang Shangtang Technology Development Co Ltd
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN2010106169405A
Publication of CN102074020A
Application granted
Publication of CN102074020B

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for performing multi-body depth recovery and segmentation on a video, which comprises the following steps: (1) performing energy minimization on the video by an iterative method to obtain an initial label of each frame of the video, wherein the initial label consists of the depth and segmentation information of the pixels; (2) after performing image segmentation on each frame, optimizing the initial label of each frame by a multi-body plane fitting method to obtain the optimized labels of all segmented blocks in each frame; (3) selecting a group of visible frames and a group of invisible frames from the adjacent frames for each pixel on each frame by using the optimized labels; and (4) performing energy minimization on each frame of the video by the iterative method to obtain the iterated labels of each frame of the video, and further expanding the number of depth levels of the iterated labels by using a hierarchical belief propagation algorithm. With this method, depth recovery and segmentation can be performed on videos in which multiple rigid objects move.

Description

Method for multi-body depth recovery and segmentation of video
Technical Field
The invention relates to a depth recovery and segmentation method for performing depth recovery and segmentation on videos that contain multiple moving rigid objects.
Background
Depth-based three-dimensional reconstruction and image (or video) segmentation have long been fundamental problems in computer vision, because the computed depth maps and the segmented images can be used separately or together in many important applications, such as object recognition, image-based rendering, and image (or video) editing. However, research on these two kinds of problems has mostly been carried out independently, and only recently have some works begun to study them jointly, e.g. L. Quan, J. Wang, P. Tan, and L. Yuan. Image-based modeling by joint segmentation. International Journal of Computer Vision (IJCV'07).
Multi-view stereo (MVS, depth recovery) techniques can be used to compute depth and three-dimensional geometric information from a set of images. Given the importance of three-dimensional reconstruction, research has also been conducted on reconstructing dynamic scenes that contain moving objects. Three-dimensional motion segmentation distinguishes the feature tracks of multiple moving objects so as to recover the actual positions of the moving objects and the corresponding camera motion. For simplicity, most of these methods use affine camera models (e.g. J. P. Costeira and T. Kanade. A multi-body factorization method for motion analysis. IEEE International Conference on Computer Vision (ICCV'95)), and a few methods have been proposed to deal with the three-dimensional segmentation problem under the perspective camera model, such as K. Schindler, J. U, and H. Wang. Perspective-view multi-structure-and-motion region model selection (ECCV'06). However, none of these methods can be used directly for high-quality three-dimensional reconstruction, especially when image segmentation is also required.
If each moving rigid object is masked out individually, MVS can be applied to each object independently. Classical image segmentation methods such as mean shift, normalized cuts and segmentation by weighted aggregation (SWA) simply process two-dimensional images, without considering the overall geometric information used in MVS.
In order to extract a moving object in the foreground together with its possibly visible boundaries, several two-layer segmentation methods have been proposed, such as A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov. Bilayer segmentation of live video. CVPR'06. These methods assume that the camera is stationary and that the background color is easy to estimate or model. Note, however, that these methods are not applicable to MVS either, since the camera needs to move in MVS.
Recently, Zhang et al. used motion information and depth information in G. Zhang, J. Jia, W. Hua, and H. Bao. Robust bilayer segmentation and motion/depth estimation with a handheld camera. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI'2010) to model the background environment and extract a high-quality foreground layer. The computed depth/motion field and the two-layer segmentation result are iteratively optimized against each other. However, this method is limited to two-layer segmentation. In addition, only the motion information of the foreground layer is computed and its depth is not, which is insufficient for three-dimensional reconstruction.
In two-dimensional motion segmentation, pixels with the same motion trend are roughly divided into a group and finally divided into a plurality of different layers. This method relies heavily on the accuracy of motion estimation and it is difficult to obtain high quality segmentation results, especially when severe occlusion occurs.
In addition, two-dimensional motion segmentation requires both motion and segmentation to be computed, which is a chicken-and-egg problem: inaccurate motion estimation causes inaccurate segmentation, which in turn causes inaccurate motion estimation. The joint optimization of the two therefore often ends in a local optimum.
Disclosure of Invention
The invention aims to provide a multi-body depth recovery and segmentation method for video, which can perform depth recovery and segmentation on videos containing multiple moving rigid objects.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows. The multi-body depth recovery and segmentation method for video comprises the following steps:
(1) performing energy minimization on the video with the energy equation of formula (1) by an iterative method to obtain an initial label of each frame of the video, wherein the initial label consists of depth information and segmentation information of the pixels,
E'(L;\hat{I}) = \sum_{t=1}^{n}\sum_{x_t\in I_t}\Big(1 - P_{init}(x_t,L_t(x_t)) + \lambda_s\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))\Big)    (1)
wherein,
P_{init}(x_t,L_t(x_t)) = \frac{1}{|\varphi'(x_t)|}\sum_{t'\in\varphi'(x_t)} p_c(x_t,L_t(x_t),I_t,I_{t'})    (2)
p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}    (3)
In formulae (1), (2) and (3), I_t denotes the t-th frame image, t = 1…n, where n is the total number of frames of the video; x_t denotes a pixel on I_t; L_t(x_t) denotes the label of x_t; N(x_t) denotes all neighboring pixels of pixel x_t; ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} denotes the label difference between adjacent pixels; η denotes the truncation parameter; φ'(x_t) denotes the set of frames in which pixel x_t is visible, i.e. the pixel corresponding to x_t in each of these frames re-projects onto x_t in the t-th frame; p_c denotes the color similarity between pixel x_t and x'; l denotes the label of x_t; σ_c is a parameter controlling the shape of the difference function in formula (3); x' denotes the pixel in the t'-th frame corresponding to pixel x_t, the t'-th frame being a frame of φ'(x_t); I_t(x_t) denotes the color value of pixel x_t; I_{t'}(x') is the color value of pixel x'; the coordinates of x' are obtained by converting the homogeneous coordinate x'^h given by formula (4) into two-dimensional coordinates:
x'^h \sim K_{t'}R_{t'}^{T}R_{t}K_{t}^{-1}x_{t}^{h} + D(l)\,K_{t'}R_{t'}^{T}(T_{t}-T_{t'})    (4)
In formula (4), h denotes homogeneous coordinates; D(l) denotes the depth information in the label of pixel x_t; K_{t'}, R_{t'} and T_{t'} are respectively the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the t'-th frame; K_t, R_t and T_t are respectively the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the t-th frame;
(2) after each frame is subjected to image segmentation, optimizing the initial labels of each frame of image by using a multi-body plane fitting method to obtain optimized labels of all segmented blocks of each frame of image;
(3) using the optimized labels finally obtained in step (2), selecting for each pixel x_t on the t-th frame a set of visible frames φ_v(x_t) and a set of invisible frames φ_o(x_t) from the adjacent frames: in a visible frame at least one pixel coincides with x_t when transformed to the t-th frame, while in an invisible frame no pixel coincides with x_t when transformed to the t-th frame;
(4) performing energy minimization on each frame of the video with the energy equation shown in formula (5) by an iterative method to obtain the iterated label of each frame of the video, and further expanding the number of depth levels in the iterated labels by a hierarchical belief propagation algorithm,
Et(Lt)=Ed(Lt)+Es(Lt) (5)
wherein,
E_s(L_t) = \lambda_s\sum_{x_t}\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))    (6)

E_d(L_t) = \sum_{x_t\in I_t}\big(1 - P(x_t,L_t(x_t))\big)    (7)

P(x_t,l) = \frac{1}{|\varphi_v(x_t)|+|\varphi_o(x_t)|}\Big(\sum_{t'\in\varphi_o(x_t)}p_o(x_t,l,L_{t',t}) + \sum_{t'\in\varphi_v(x_t)}p_c(x_t,l,I_t,I_{t'})\cdot p_v(x_t,l,L_{t'})\Big)    (8)

p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}    (9)

p_v(x_t,l,L_{t'}) = \begin{cases} 0, & S(l)\neq S(l') \\ p_g(x_t,D(l),D(l')), & S(l)=S(l') \end{cases}    (10)

p_g(x_t,D(l),D(l')) = \exp\Big(-\frac{\|x_t-x_t^{t'\to t}\|^2}{2\sigma_d^2}\Big)    (11)
In formulae (5) to (11), E_d(L_t) and E_s(L_t) denote respectively the data term and the smoothness term of the energy equation; I_t denotes the t-th frame image, t = 1…n, where n is the total number of frames of the video; x_t denotes a pixel on I_t; L_t(x_t) denotes the label of x_t; N(x_t) denotes all neighboring pixels of pixel x_t; ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} denotes the label difference between adjacent pixels; η denotes the truncation parameter; x' denotes the pixel in the t'-th frame corresponding to pixel x_t; I_t(x_t) denotes the color value of pixel x_t; I_{t'}(x') is the color value of pixel x'; the coordinates of x' are obtained by converting the homogeneous coordinate x'^h given by formula (4) into two-dimensional coordinates; p_c denotes the color similarity between pixel x_t and x'; l denotes the label of x_t; l' denotes the label of pixel x'; S(l) and S(l') denote the segmentation labels contained in label l and label l', respectively; p_g denotes the measure of geometric consistency between two pixels; D(l) and D(l') denote the depth labels contained in label l and label l', respectively; x_t^{t'→t} is the pixel obtained by re-projecting pixel x' onto the t-th frame according to D(l'); p_v denotes the geometric consistency and segmentation consistency between pixel x_t and the pixel at coordinates x'.
Further, the method of "optimizing the initial label by using a multi-body plane fitting method" in step (2) of the present invention is as follows:
After each frame has been image-segmented, every segmented block is successively assigned each possible object label (the object labels assigned to the same block in successive trials differ from each other); for each such assignment of each block, the energy equation shown in formula (1) is minimized to obtain the corresponding minimum energy value and the parameters of the plane on which the block lies. The smallest of these minimum energy values of a block is then compared with the minimum energy value corresponding to the initial labels: if the smallest of the minimum energy values of the block is lower than the minimum energy value corresponding to the initial labels, the object label corresponding to that smallest value is assigned as the segmentation label to the pixels of the block, giving the optimized label of the block; otherwise, the initial label is kept as the optimized label of the block.
Compared with the prior art, the invention has the beneficial effects that:
(1) a brand-new multi-body stereo vision model is provided, in which the depth and segmentation labels are represented uniformly by a single label and solved as a global labeling optimization problem, so that multi-view stereo matching is extended for the first time to scenes with several independently moving rigid objects and is solved with a global optimization method;
(2) a strategy for adaptive selection of matched frames is provided: the recovered depth and segmentation information of adjacent frames are projected onto the current frame to judge the visibility of pixels, fill in missing pixels, and obtain a label prior constraint that handles the occlusion problem;
(3) a brand-new multi-body plane fitting method is provided, which effectively solves the difficulty of computing depth and segmentation in featureless regions.
Drawings
FIG. 1 is a basic flow diagram of the present invention;
FIG. 2 is one of the principles of the present invention, depicting projection and re-projection in a multi-view geometry;
FIG. 3 illustrates the method for filling missing pixels in label maps according to the present invention: (a) is frame 61; (b) is frame 76; (c) is the label map L_{76,61} of all pixels obtained by re-projecting the pre-computed labels, where red marks the occluded pixels; (d) is a process example of the missing-pixel filling method for label maps;
FIG. 4 is a comparison of the effect of the adaptive selection of matched frames and the label prior constraint with and without the present invention: (a) is the first frame of the experimental data sequence; (b) and (c) are the two label-map results of this frame (label values shown in grayscale) processed with and without the pre-labeling adaptive frame selection method, respectively; (d) and (e) are enlarged views of the rectangular regions in (b) and (c), respectively, for easier comparison;
FIG. 5 is a pipeline diagram of one example of the invention: (a) is a frame in a sequence of pictures; (b) is a label graph obtained before multi-body plane fitting after initial solution; (c) is a label diagram after the plane fitting of the multi-body plane; (d) is a label diagram after two times of optimization iteration; (e) is a segmentation map of a person and a background; (f) is the result of the three-dimensional reconstruction patch without depth level expansion; (g) the result of the three-dimensional reconstruction patch after the depth level expansion is obtained;
FIG. 6 shows two three-body examples of the invention: (a) and (d) are two selected frames; (b) is the label map of (a); (c) is the segmentation map of (a); (e) is the label map of (d), and (f) is the segmentation map of (d);
FIG. 7 is a reconstructed patch result of one example of the invention: (a) is the geometric information of the background; (b) and (c) the reconstructed patch results of two persons in the frame, respectively;
FIG. 8 is an example result of a box sequence: (a) is a frame in a sequence of pictures; (b) is the label graph of the frame calculated by the invention;
FIG. 9 is an example result of a toy sequence: (a) is a frame in a sequence of pictures; (b) is the label graph of the frame calculated by the invention; (c) is the segmentation result graph of the frame calculated by the invention.
Detailed Description
The invention provides a stable and efficient multi-body depth recovery and segmentation method, and a basic flow chart of the invention is shown in figure 1, which mainly comprises the following steps:
step 1) an initialization processing stage, wherein energy minimization is performed on a video by an iterative method according to an energy equation provided by the invention to obtain an initial label of each frame of the video, wherein the initial label consists of a depth label and a segmentation label of a pixel, and the initialization processing stage specifically comprises the following steps:
the object of the invention is to find both the depth values of all pixels and to which object the pixel belongs, so that there are two labels, one depth label and one segmentation label. The invention uniformly represents the two label values of the pixel by an expanded label:
L = \{d_1^1, d_2^1, \dots, d_{m_1}^1, \dots, d_1^K, d_2^K, \dots, d_{m_K}^K\}
it is assumed here that there are K objects (including the background) in the scene. The pixel disparity values (actually the inverse of the depth, referred to herein as the depth index) of the kth object on the image captured by the camera range from
Figure GDA0000151171500000062
In between, this interval is divided evenly so that each disparity value is defined as follows:
d_i^k = (i-1)\,\Delta d + d_{\min}^k
This is the meaning of all the elements of the set L. In total there are

|L| = \sum_{k=1}^{K} m_k

labels.
The object to which a label l belongs is denoted by S(l), i.e. the segmentation label, and D(l) denotes the disparity value of a pixel with label l, i.e. the depth label. For any element l_i of the set L, S(l_i) and D(l_i) are readily obtained: first find the index h such that the following inequality is satisfied:

1 \le i - \sum_{j=1}^{h-1} m_j \le m_h

then S(l_i) = h and D(l_i) = d^{h}_{\,i-\sum_{j=1}^{h-1} m_j}.
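By way of illustration only (not part of the claimed method), the following Python sketch decodes a combined label index into its object index S(l) and disparity D(l); the names decode_label, m, d_min and delta_d are hypothetical, a 0-based label index is assumed, and a single level spacing Δd is used as in the formula above:

import math

def decode_label(l, m, d_min, delta_d):
    """Return (object index S(l), disparity D(l)) for a combined 0-based label l,
    given per-object level counts m[k] and minimum disparities d_min[k]."""
    offset = 0
    for k, mk in enumerate(m):
        if l < offset + mk:          # label falls inside object k's range of levels
            i = l - offset           # level index within object k (0-based)
            return k, d_min[k] + i * delta_d
        offset += mk
    raise ValueError("label out of range")

# Example: three objects with 51 depth levels each
m = [51, 51, 51]
d_min = [0.001, 0.002, 0.0015]
print(decode_label(60, m, d_min, delta_d=0.0001))  # -> object 1, its 10th level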
the following energy equations were optimized:
E'(L;\hat{I}) = \sum_{t=1}^{n}\sum_{x_t\in I_t}\Big(1 - P_{init}(x_t,L_t(x_t)) + \lambda_s\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))\Big)
wherein P isinitIs defined as follows:
P_{init}(x_t,L_t(x_t)) = \frac{1}{|\varphi'(x_t)|}\sum_{t'\in\varphi'(x_t)} p_c(x_t,L_t(x_t),I_t,I_{t'})
where x_t is a pixel on the t-th frame image I_t, t = 1…n, n being the total number of frames of the video; N(x_t) is the set of all neighboring pixels of x_t, and ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} represents the label difference between neighboring pixels. η is a truncation parameter, so that when the energy function is minimized the discontinuities at object boundaries are not lost because the energy value of the smoothness term becomes too large, i.e. over-smoothing is prevented.
Here φ'(x_t) refers to the selected frames in which the pixel x_t is visible. These frames are selected by the method disclosed in S. B. Kang and R. Szeliski. Extracting view-dependent depth maps from a collection of images. International Journal of Computer Vision (IJCV'2004).
Here p_c is defined as follows:
p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}
where σ_c is a parameter controlling the shape of the difference function of the above formula, and I_{t'}(x') is the color value of the pixel x' corresponding to x_t in a frame of φ'(x_t); the homogeneous coordinate x'^h of x' is obtained as follows:
x'^h \sim K_{t'}R_{t'}^{T}R_{t}K_{t}^{-1}x_{t}^{h} + D(l)\,K_{t'}R_{t'}^{T}(T_{t}-T_{t'})
Converting the homogeneous coordinate x'^h above into two-dimensional coordinates gives the coordinates of x'. Here D(l) is the disparity in the label l of pixel x_t; K_{t'}, R_{t'} and T_{t'} are respectively the intrinsic parameter matrix, the extrinsic rotation matrix and the extrinsic translation matrix of the camera for the t'-th frame, and K_t, R_t and T_t are those of the camera for the t-th frame. Note that the camera intrinsic and extrinsic parameters of all frames are known before any step of the invention is performed.
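As a concrete illustration of the re-projection in formula (4) and the color similarity p_c in formula (3), the following numpy sketch reprojects a pixel of frame t into frame t' under a hypothesized disparity D(l); the function names and the assumption that K is a 3x3 intrinsic matrix, R a 3x3 rotation and T a 3-vector translation per frame are illustrative only, not the patented implementation:

import numpy as np

def reproject(x_t, d, K_t, R_t, T_t, K_tp, R_tp, T_tp):
    """Reproject pixel x_t = (u, v) of frame t into frame t' for disparity d = D(l),
    following formula (4); returns the 2-D coordinates x' in frame t'."""
    xh = np.array([x_t[0], x_t[1], 1.0])                     # homogeneous pixel of frame t
    xph = K_tp @ R_tp.T @ R_t @ np.linalg.inv(K_t) @ xh \
          + d * K_tp @ R_tp.T @ (T_t - T_tp)
    return xph[:2] / xph[2]                                   # back to 2-D coordinates

def p_c(color_t, color_tp, sigma_c=10.0):
    """Color similarity of formula (3) between the two corresponding pixels."""
    return sigma_c / (sigma_c + np.linalg.norm(np.asarray(color_t) - np.asarray(color_tp)))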
Step 2) after each frame is subjected to image segmentation, optimizing the initial label of each frame of image by using a multi-body plane fitting method to obtain the optimized labels of all segmented blocks of each frame of image, wherein the specific steps are as follows:
First, mean shift is used to compute the segmentation of I_t, i.e. I_t is partitioned into segmented blocks s_i.
It is assumed that there are K objects (including the background) in the environment. For each segmented block s_i obtained above, assume that it belongs to object k and lies on a plane described by three parameters [a_i, b_i, c_i]. Minimizing the energy equation (1) then yields the values of these three parameters together with a minimum energy. Enumerating the object k to which the block s_i may belong gives K minimum energies E'_0, E'_1, …, E'_{K-1}; the smallest of them (denoted E'_j) determines the corresponding object j and the corresponding three parameters, which are taken as the minimum energy of the block, its object, and its plane parameters [a_i, b_i, c_i].
The initial per-pixel labels were computed in step 1) of the initialization stage; substituting all pixels of s_i with their initial labels into the energy equation (1) gives an energy E'_t. If E'_j < E'_t, the labels of all pixels x_t in s_i are updated so that their object is j, i.e. S(x_t) = j, and their depth labels are given by the fitted plane with parameters [a_i, b_i, c_i].
A comparison of the results before and after optimization with the multi-body plane fitting algorithm is shown in FIG. 5.
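The per-block decision of this step can be summarized by the following Python sketch; fit_plane_and_energy (minimizing formula (1) over one block for a fixed object hypothesis) and energy_of_initial_labels (evaluating formula (1) with the block's initial labels) are assumed helper routines that stand in for the computations described above:

def optimize_block_labels(blocks, K, fit_plane_and_energy, energy_of_initial_labels):
    """For every segmented block, enumerate the K object hypotheses, keep the best
    plane fit, and adopt it only if it lowers the energy of the initial labels."""
    optimized = {}
    for s in blocks:
        # fit_plane_and_energy(s, k) -> (E'_k, plane parameters [a, b, c])
        energies = [fit_plane_and_energy(s, k) for k in range(K)]
        j = min(range(K), key=lambda k: energies[k][0])   # best object hypothesis
        e_j, plane_j = energies[j]
        e_init = energy_of_initial_labels(s)              # E'_t from the initial labels
        if e_j < e_init:
            optimized[s] = (j, plane_j)    # S(x_t) = j for all pixels of s, plane [a, b, c]
        else:
            optimized[s] = None            # keep the initial labels of the block
    return optimized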
Step 3) adaptive selection of matched frames and label prior constraint: for each pixel x_t on the t-th frame, according to the labels obtained in the initialization stage or in the last iteration of optimization, two sets of frames are selected from the adjacent frames, one set consisting of frames in which the pixel x_t is visible, denoted φ_v(x_t), and the other set consisting of frames in which the pixel x_t is not visible, denoted φ_o(x_t). The specific steps are as follows:
1) After the optimized labels have been obtained in step 2), the label map of the t'-th frame is transformed (forward-warped) to the t-th frame by the method of W. R. Mark, L. McMillan, and G. Bishop. Post-rendering 3D warping (SI3D'1997), which gives L_{t',t}. If, by such re-projection, none of the pixels on the label map of the t'-th frame is projected onto x_t, as shown in FIG. 2, then the t'-th frame is assigned to φ_o(x_t); otherwise it belongs to φ_v(x_t).
2) In practice the matching computation need not be performed over all frames; at most N_1 frames are selected for matching (N_1 is generally 16 to 20). If |φ_v(x_t)| turns out to be smaller than a lower limit N_2 (generally 5), additional adjacent frames in which no pixel re-projects onto x_t are added to φ_o(x_t) so that |φ_v(x_t)| + |φ_o(x_t)| = N_2 (a sketch of this selection rule is given at the end of this step).
3) Note that the matching cost of an occluded pixel cannot be computed, so if a pixel is not visible in any of the neighboring frames its depth value cannot be obtained directly. Even in this case, however, the depth value of the pixel and the object it belongs to can still be approximated, which is the purpose of φ_o(x_t). The method is as follows: in the label map projected from an adjacent frame, as shown in FIG. 3, for each missing pixel x_t, search along the horizontal and vertical directions to find the two nearest valid projected pixels in each direction; among these four pixels, select the one with the smallest label, denoted x*, and take its label as the label of x_t. That is, the label of x* in L_{t',t} replaces the label of x_t in L_{t',t}, L_{t',t}(x_t) = L_{t',t}(x*), and the confidence of this inference is determined by the distance between the two pixels; this confidence is defined as follows:
\omega_o(x) = \exp\Big(-\frac{\|x - x^*\|^2}{2\sigma_\omega^2}\Big)

where the constant σ_ω is set to 10.
Although this missing-label inference is not very accurate, it improves the data term in regions where occlusion matters. The label prior constraint is defined as follows:
p_o(x_t, l, L_{t',t}) = \lambda_o \cdot \omega_o(x_t) \cdot \frac{\beta}{\beta + |l - L_{t',t}(x_t)|}

where λ_o is a weight and β controls the shape of the difference function of the above formula. The formula requires that when ω_o(x) is very high, L_t(x_t) must stay close to L_{t',t}(x_t).
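A minimal numpy sketch of sub-step 3) and of the label prior above follows; nearest_valid (returning the nearest valid projected pixels found along the horizontal and vertical searches), the dictionary-like label map, and the default parameter values taken from the experiments later in the text are all assumptions made for illustration:

import numpy as np

def fill_missing_label(x, L_proj, nearest_valid, sigma_w=10.0):
    """Fill a missing pixel x of the warped label map L_{t',t}: take the label of
    the nearest valid neighbour with the smallest label, and return it together
    with the confidence w_o(x)."""
    candidates = nearest_valid(L_proj, x)                  # up to four valid pixels
    x_star = min(candidates, key=lambda p: L_proj[p])      # neighbour with smallest label
    dist2 = float(np.sum((np.asarray(x) - np.asarray(x_star)) ** 2))
    w_o = np.exp(-dist2 / (2.0 * sigma_w ** 2))
    return L_proj[x_star], w_o

def label_prior(l, l_proj, w_o, num_labels, lambda_o=0.3):
    """Label prior p_o: keeps l close to the projected label when the confidence
    w_o is high (a sketch of the formula above, read as a product of the three factors)."""
    beta = 0.02 * num_labels                               # beta = 0.02 |L| (default in the text)
    return lambda_o * w_o * beta / (beta + abs(l - l_proj))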
FIG. 4 shows label maps computed with the adaptive selection of matched frames and the label prior constraint, compared with label maps computed without them. It can be seen that the label map computed using the adaptive selection of matched frames and the label prior constraint is significantly improved at discontinuity boundaries.
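Returning to sub-steps 1) and 2) above, the adaptive frame-selection rule can be sketched as follows (Python); warp_label_map and has_projection are assumed helpers, and treating the occluded set purely as a fallback when there are enough visible frames is one possible reading of the text:

def select_matched_frames(x_t, t, candidate_frames, warp_label_map, N1=16, N2=5):
    """Split at most N1 neighbouring frames into a visible set phi_v and an
    occluded set phi_o for pixel x_t of frame t."""
    phi_v, phi_o = [], []
    for tp in candidate_frames[:N1]:              # examine at most N1 neighbours
        L_proj = warp_label_map(tp, t)            # forward-warp labels: L_{t',t}
        if L_proj.has_projection(x_t):            # some pixel of frame t' lands on x_t
            phi_v.append(tp)
        else:
            phi_o.append(tp)
    if len(phi_v) < N2:                           # too few visible frames:
        phi_o = phi_o[:N2 - len(phi_v)]           # top up so |phi_v| + |phi_o| = N2
    else:
        phi_o = []                                # occluded set used only as a fallback
    return phi_v, phi_o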
Step 4) the iterative optimization stage: iterative optimization is performed on all frames of the monocular video sequence according to the energy equation provided by the invention to obtain the label maps of all frames, and the precision of depth recovery is then improved with a hierarchical belief propagation algorithm. The specific steps are as follows:
1) according to the energy equation
Et(Lt)=Ed(Lt)+Es(Lt)
Minimizing this energy yields the depth values and segmented blocks of all pixels. The energy equation is optimized in two passes with a belief propagation algorithm to obtain a relatively accurate result.
Here the smoothness term

E_s(L_t) = \lambda_s\sum_{x_t}\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))

measures the label difference between adjacent pixels in the image, so that the label difference between adjacent pixels is as small as possible. ρ is defined as ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η}, representing the label difference between neighboring pixels. η is a truncation value, so that when the energy function is minimized, discontinuities at object boundaries are not lost because the energy value of the smoothness term becomes too large, i.e. over-smoothing is prevented.
The data term E_d(L_t) is defined as follows:

E_d(L_t) = \sum_{x_t\in I_t}\big(1 - P(x_t,L_t(x_t))\big)

where

P(x_t,l) = \frac{1}{|\varphi_v(x_t)|+|\varphi_o(x_t)|}\Big(\sum_{t'\in\varphi_o(x_t)}p_o(x_t,l,L_{t',t}) + \sum_{t'\in\varphi_v(x_t)}p_c(x_t,l,I_t,I_{t'})\cdot p_v(x_t,l,L_{t'})\Big)

in which

p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}

describes the color similarity between pixel x_t and x'; x_t and x' are pixels on the t-th and t'-th frames, respectively. p_v describes the geometric and segmentation consistency between pixel x_t and x', i.e. whether x_t and x' lie on the same object and have consistent depths. It is defined as follows:
p_v(x_t,l,L_{t'}) = \begin{cases} 0, & S(l)\neq S(l') \\ p_g(x_t,D(l),D(l')), & S(l)=S(l') \end{cases}

where l' is the label of x'. If l and l' belong to different objects, i.e. S(l) ≠ S(l'), the two pixels are not corresponding pixels in the two frames and must be separated. Otherwise p_g measures the geometric consistency between the two pixels, defined as follows:

p_g(x_t,D(l),D(l')) = \exp\Big(-\frac{\|x_t-x_t^{t'\to t}\|^2}{2\sigma_d^2}\Big)

Here x_t^{t'→t} is the point obtained by projecting x' of the t'-th frame onto the corresponding point in the t-th frame according to the computed D(l').
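Before turning to the memory issue in sub-step 2), the following Python sketch assembles the smoothness and data terms defined above; rho_trunc, the per-frame helpers p_c, p_v and p_o, and the container types (dictionaries keyed by pixel) are assumptions used only for illustration:

def rho_trunc(l_x, l_y, eta):
    """Truncated label difference between neighbouring pixels."""
    return min(abs(l_x - l_y), eta)

def smoothness_energy(labels, neighbours, lambda_s, eta):
    """E_s(L_t): weighted sum of truncated label differences over neighbour pairs."""
    return lambda_s * sum(rho_trunc(labels[x], labels[y], eta)
                          for x in labels for y in neighbours(x))

def data_prob(x_t, l, phi_v, phi_o, p_c, p_v, p_o):
    """P(x_t, l): occlusion priors over phi_o averaged with color * visibility terms over phi_v."""
    total = sum(p_o(x_t, l, tp) for tp in phi_o) \
          + sum(p_c(x_t, l, tp) * p_v(x_t, l, tp) for tp in phi_v)
    return total / (len(phi_v) + len(phi_o))

def data_energy(labels, phi_v, phi_o, p_c, p_v, p_o):
    """E_d(L_t): sum of 1 - P(x_t, L_t(x_t)) over all pixels (phi_v/phi_o keyed by pixel)."""
    return sum(1.0 - data_prob(x, labels[x], phi_v[x], phi_o[x], p_c, p_v, p_o)
               for x in labels)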
2) In the above process, the invention adopts a belief propagation (BP) algorithm to optimize the objective function. Since memory usage is proportional to the number of labels, the memory requirement can easily exceed the memory of the machine when processing high-resolution images. A hierarchical solving strategy can reduce memory usage, but it degrades the quality of object segmentation and depth recovery to some extent, especially in discontinuous boundary regions. In order to obtain object segmentation and depth recovery results of as high quality as possible, the invention adopts a simple region-partition-based solving strategy to overcome the memory bottleneck. The procedure is simple: the image is cut uniformly into M × M regions and energy optimization is performed on each region; if a color segment spans several regions, it is split accordingly. The strategy is simple and effective, overcomes the memory bottleneck, and has little influence on the processing result.
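The region-partition strategy of sub-step 2) can be sketched as follows (Python); optimize_region, which would run BP on one rectangular sub-region, is an assumed placeholder:

def optimize_by_regions(image_shape, M, optimize_region):
    """Cut the image uniformly into an M x M grid of regions and run the energy
    optimization on each region separately to bound BP memory usage."""
    h, w = image_shape
    rh, rw = (h + M - 1) // M, (w + M - 1) // M          # region size, rounded up
    results = []
    for i in range(M):
        for j in range(M):
            r0, r1 = i * rh, min((i + 1) * rh, h)        # row range of the region
            c0, c1 = j * rw, min((j + 1) * rw, w)        # column range of the region
            if r0 >= r1 or c0 >= c1:
                continue                                 # skip empty trailing regions
            # color segments spanning several regions are split along the cut
            results.append(optimize_region((r0, r1, c0, c1)))
    return results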
To this end, all steps of the inventive multi-volume depth recovery and segmentation are completed.
At initialization and in the two iterative optimizations, the number of depth levels m_k per object is usually set between 51 and 101. After two iterations of optimization, the segmentation result is usually already very accurate. Then, in order to further improve the accuracy of the depth, the invention fixes the segmentation label, at which point the label is effectively equivalent to a depth level. The invention adopts a coarse-to-fine hierarchical belief propagation algorithm, which can effectively expand the number of depth levels in the global optimization without adding much computational cost, thereby improving the precision of depth recovery.
In the experiments, the image resolution of the video sequences was 960 × 540. Most parameters of the system can keep their default values and need not be adjusted during processing, e.g. λ_s = 5/|L|, η = 0.03|L|, λ_o = 0.3, σ_c = 10, σ_d = 2, β = 0.02|L|; with these settings the results shown in FIG. 8 and FIG. 9 were obtained.
One set of experiments of the invention is on a box sequence, as shown in FIG. 8, where FIG. 8(a) is a frame of the video and FIG. 8(b) is the label map of that frame obtained with the invention. Another set of experiments is on a toy sequence, as shown in FIG. 9, where FIG. 9(a) is a frame of the video, FIG. 9(b) is the label map of that frame obtained with the invention, and FIG. 9(c) is the segmentation map of that frame obtained with the invention. Comparing the label maps of FIG. 8 and FIG. 9 with the original frames shows that the label maps are well defined overall and accurately recovered at the boundaries; comparing the segmentation map of FIG. 9 with the original frame shows that the segmentation result is very accurate and different objects are segmented precisely, which illustrates that the results obtained by the proposed algorithm are highly accurate in the multi-body case.

Claims (2)

1. A method for multi-volume depth recovery and segmentation of video, characterized by comprising the steps of:
(1) performing energy minimization on the video with the energy equation of formula (1) by an iterative method to obtain an initial label of each frame of the video, wherein the initial label consists of depth information and segmentation information of the pixels,
E'(L;\hat{I}) = \sum_{t=1}^{n}\sum_{x_t\in I_t}\Big(1 - P_{init}(x_t,L_t(x_t)) + \lambda_s\sum_{y_t\in N(x_t)}\rho(L_t(x_t),L_t(y_t))\Big)    (1)

wherein,

P_{init}(x_t,L_t(x_t)) = \frac{1}{|\varphi'(x_t)|}\sum_{t'\in\varphi'(x_t)} p_c(x_t,L_t(x_t),I_t,I_{t'})    (2)

p_c(x_t,l,I_t,I_{t'}) = \frac{\sigma_c}{\sigma_c+\|I_t(x_t)-I_{t'}(x')\|}    (3)
In formulae (1), (2) and (3), I_t denotes the t-th frame image, t = 1…n, where n is the total number of frames of the video; x_t denotes a pixel on I_t; L_t(x_t) denotes the label of x_t; N(x_t) denotes all neighboring pixels of pixel x_t; ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} denotes the label difference between adjacent pixels; η denotes the truncation parameter; φ'(x_t) denotes the set of frames in which pixel x_t is visible, i.e. the pixel corresponding to x_t in each of these frames re-projects onto x_t in the t-th frame; p_c denotes the color similarity between pixel x_t and x'; l denotes the label of x_t; σ_c is a parameter controlling the shape of the difference function in formula (3); x' denotes the pixel in the t'-th frame corresponding to pixel x_t, the t'-th frame being a frame of φ'(x_t); I_t(x_t) denotes the color value of pixel x_t; I_{t'}(x') is the color value of pixel x'; the coordinates of x' are obtained by converting the homogeneous coordinate x'^h given by formula (4) into two-dimensional coordinates:
x'^{h} \sim K_{t'} R_{t'}^{T} R_t K_t^{-1} x_t^{h} + D(l)\, K_{t'} R_{t'}^{T} (T_t - T_{t'})    (4)
in formula (4), the superscript h denotes a homogeneous coordinate; D(l) denotes the depth information in the label of pixel x_t; K_{t'}, R_{t'} and T_{t'} are respectively the intrinsic parameter matrix, the rotation matrix of the extrinsic parameters and the translation matrix of the extrinsic parameters of the camera for the t'-th frame; K_t, R_t and T_t are respectively the intrinsic parameter matrix, the rotation matrix of the extrinsic parameters and the translation matrix of the extrinsic parameters of the camera for the t-th frame;
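To make the geometry of formulas (3) and (4) concrete, the following Python sketch evaluates the warp of formula (4) for one pixel and one candidate depth value D(l), and then the color-similarity term of formula (3). It is only an illustration of the two formulas as written: the function names, the nearest-neighbour color sampling and the value of σ_c are assumptions, not specified by the patent.

import numpy as np

def reproject(x_t, depth_l, K_t, R_t, T_t, K_tp, R_tp, T_tp):
    # Formula (4): x'^h ~ K_t' R_t'^T R_t K_t^-1 x_t^h + D(l) K_t' R_t'^T (T_t - T_t')
    x_h = np.array([x_t[0], x_t[1], 1.0])                       # homogeneous coordinate of x_t
    rot_part = K_tp @ R_tp.T @ R_t @ np.linalg.inv(K_t) @ x_h   # rotation / intrinsics part
    trans_part = depth_l * (K_tp @ R_tp.T @ (T_t - T_tp))       # translation part scaled by D(l)
    x_ph = rot_part + trans_part
    return x_ph[:2] / x_ph[2]                                    # back to 2-D pixel coordinates

def color_similarity(I_t, I_tp, x_t, x_p, sigma_c=10.0):
    # Formula (3): p_c = sigma_c / (sigma_c + ||I_t(x_t) - I_t'(x')||),
    # with nearest-neighbour sampling of the color images (arrays of shape H x W x 3).
    c_t  = I_t[int(round(x_t[1])),  int(round(x_t[0]))].astype(float)
    c_tp = I_tp[int(round(x_p[1])), int(round(x_p[0]))].astype(float)
    return sigma_c / (sigma_c + np.linalg.norm(c_t - c_tp))

P_init of formula (2) would then simply be the average of such color-similarity values over the frames in φ'(x_t).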
(2) after performing image segmentation on each frame, optimizing the initial labels of each frame image by a multi-body plane fitting method to obtain the optimized labels of all the segmentation blocks in each frame image;
(3) using the optimized labels finally obtained in step (2), selecting, for each pixel x_t on the t-th frame, a set of visible frames φ_v(x_t) and a set of invisible frames φ_o(x_t) from the neighboring frames: none of the pixels of a visible frame coincides with x_t when transformed to the t-th frame, whereas at least one pixel of an invisible frame coincides with x_t when transformed to the t-th frame;
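Read literally, the selection of φ_v(x_t) and φ_o(x_t) in step (3) is a per-pixel classification of the neighboring frames. The Python sketch below follows that literal reading only; the warp_to_frame_t helper (warping every pixel of a neighboring frame into the t-th frame using its current optimized label) and the exact coincidence test are assumptions made for illustration.

def classify_neighbor_frames(x_t, t, neighbor_frames, warp_to_frame_t):
    # For each neighboring frame t', warp all of its pixels into frame t; if some warped
    # pixel coincides with x_t the frame goes into the invisible set phi_o(x_t),
    # otherwise it goes into the visible set phi_v(x_t).
    phi_v, phi_o = [], []
    for t_prime in neighbor_frames:
        warped = warp_to_frame_t(t_prime, t)   # iterable of integer pixel positions in frame t
        if any(tuple(p) == tuple(x_t) for p in warped):
            phi_o.append(t_prime)
        else:
            phi_v.append(t_prime)
    return phi_v, phi_o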
(4) performing energy minimization on each frame of the video by the iterative method with the energy equation shown in formula (5) to obtain the iterated labels of each frame of the video, and further expanding the number of depth levels in the iterated labels by a hierarchical belief propagation algorithm,
E_t(L_t) = E_d(L_t) + E_s(L_t)    (5)
wherein,
E_s(L_t) = \lambda_s \sum_{x_t} \sum_{y_t \in N(x_t)} \rho\big(L_t(x_t), L_t(y_t)\big)    (6)
E_d(L_t) = \sum_{x_t \in I_t} \big(1 - P(x_t, L_t(x_t))\big)    (7)
P(x_t, l) = \frac{1}{|\varphi_v(x_t)| + |\varphi_o(x_t)|} \Big( \sum_{t' \in \varphi_o(x_t)} p_o(x_t, l, L_{t'}, t) + \sum_{t' \in \varphi_v(x_t)} p_c(x_t, l, I_t, I_{t'}) \cdot p_v(x_t, l, L_{t'}) \Big)    (8)
p_c(x_t, l, I_t, I_{t'}) = \frac{\sigma_c}{\sigma_c + \lVert I_t(x_t) - I_{t'}(x') \rVert}    (9)
p_v(x_t, l, L_{t'}) = \begin{cases} 0, & S(l) \neq S(l') \\ p_g(x_t, D(l), D(l')), & S(l) = S(l') \end{cases}    (10)
p_g(x_t, D(l), D(l')) = \exp\Big( -\frac{\lVert x_t - x_t^{\,t' \to t} \rVert^2}{2\sigma_d^2} \Big)    (11)
in formulas (5) to (11): E_d(L_t) and E_s(L_t) denote the data term and the smoothness term of the energy equation respectively; I_t denotes the t-th frame image, t = 1…n, where n is the total number of frames of the video; x_t denotes a pixel on I_t; L_t(x_t) denotes the label of x_t; N(x_t) denotes all the neighboring pixels of pixel x_t; ρ(L_t(x_t), L_t(y_t)) = min{|L_t(x_t) − L_t(y_t)|, η} represents the label difference between adjacent pixels; η denotes the truncation parameter; x' denotes the pixel in the t'-th frame corresponding to pixel x_t; I_t(x_t) denotes the color value of pixel x_t; I_{t'}(x') is the color value of pixel x'; the coordinates of x' are obtained by converting the homogeneous coordinate x'^h given by formula (4) into two-dimensional coordinates; p_c measures the color similarity between pixel x_t and x'; l denotes the label of x_t; l' denotes the label of pixel x'; S(l) and S(l') denote the segmentation components of label l and label l' respectively; p_g measures the geometric consistency between two pixels; D(l) and D(l') denote the depth components of label l and label l' respectively; x_t^{t'→t} is the pixel obtained by re-projecting pixel x' onto the t-th frame according to D(l'); p_v measures the geometric consistency and segmentation consistency between pixel x_t and the pixel corresponding to coordinates x'.
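As a sketch of how formulas (8)-(11) combine, the following Python fragment evaluates the geometric-consistency term p_g, the visibility term p_v and the averaged data probability P(x_t, l). The p_c values are as sketched after formula (4); the occlusion term p_o is assumed to be defined elsewhere in the patent and is passed in pre-computed; seg_of, the container layout and the value of σ_d are illustrative assumptions.

import numpy as np

def geometric_consistency(x_t, x_back, sigma_d=1.0):
    # Formula (11): p_g = exp(-||x_t - x_t^{t'->t}||^2 / (2 sigma_d^2)),
    # where x_back is the pixel obtained by re-projecting x' back onto frame t using D(l').
    err = np.linalg.norm(np.asarray(x_t, float) - np.asarray(x_back, float))
    return np.exp(-err ** 2 / (2.0 * sigma_d ** 2))

def visibility_consistency(l, l_prime, x_t, x_back, seg_of, sigma_d=1.0):
    # Formula (10): zero when the two labels belong to different objects, p_g otherwise.
    if seg_of(l) != seg_of(l_prime):
        return 0.0
    return geometric_consistency(x_t, x_back, sigma_d)

def data_probability(pc_terms, pv_terms, po_terms, phi_v, phi_o):
    # Formula (8): average of p_o over the invisible frames and of p_c * p_v over the
    # visible frames; pc_terms, pv_terms and po_terms are dicts keyed by frame index.
    total = sum(po_terms[tp] for tp in phi_o) + \
            sum(pc_terms[tp] * pv_terms[tp] for tp in phi_v)
    return total / (len(phi_v) + len(phi_o))

The data term of formula (7) is then the sum of 1 − P(x_t, L_t(x_t)) over the pixels of the frame, and formula (6) adds the truncated label-difference penalty between neighboring pixels.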
2. The method for performing multi-body depth recovery and segmentation on a video according to claim 1, wherein the method of "optimizing the initial labels by using multi-body plane fitting" in step (2) is as follows:
after image segmentation is performed on each frame, each segmentation block is assigned the object labels one by one, a different object label being used in each trial; for each assignment of each segmentation block, the energy equation shown in formula (1) is used to obtain the corresponding minimum energy value and the parameters of the plane on which the segmentation block lies. The smallest of the minimum energy values of a segmentation block is then compared with the minimum energy value corresponding to its initial label: if the smallest of the minimum energy values of the segmentation block is smaller than the minimum energy value corresponding to the initial label, the object label corresponding to that smallest energy value is taken as the segmentation label and assigned to the pixels of the segmentation block, giving the optimized label of the segmentation block; otherwise, the initial label is kept as the optimized label of the segmentation block.
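The control flow of this refinement amounts to an exhaustive trial of object labels per segmentation block, as in the Python sketch below. fit_plane_and_energy (minimising the energy of formula (1) restricted to one block under one trial object label and returning the fitted plane) and depth_from_plane are hypothetical helpers standing in for details given elsewhere in the patent; only the comparison logic of this claim is shown.

def refine_segment_labels(blocks, object_ids, init_energy, init_labels,
                          fit_plane_and_energy, depth_from_plane):
    refined = dict(init_labels)
    for block in blocks:
        best_energy, best_plane, best_obj = float('inf'), None, None
        for obj in object_ids:
            # Minimum of formula (1) restricted to this block under object label obj,
            # together with the parameters of the fitted plane.
            energy, plane = fit_plane_and_energy(block, obj)
            if energy < best_energy:
                best_energy, best_plane, best_obj = energy, plane, obj
        if best_energy < init_energy[block]:
            # The winning object label becomes the segmentation label of every pixel in
            # the block, and the depth component is read off the fitted plane.
            for px in block:
                refined[px] = (best_obj, depth_from_plane(best_plane, px))
        # Otherwise the initial label is kept as the optimized label of the block.
    return refined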
CN2010106169405A 2010-12-31 2010-12-31 Method for performing multi-body depth recovery and segmentation on video Active CN102074020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010106169405A CN102074020B (en) 2010-12-31 2010-12-31 Method for performing multi-body depth recovery and segmentation on video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010106169405A CN102074020B (en) 2010-12-31 2010-12-31 Method for performing multi-body depth recovery and segmentation on video

Publications (2)

Publication Number Publication Date
CN102074020A CN102074020A (en) 2011-05-25
CN102074020B true CN102074020B (en) 2012-08-15

Family

ID=44032549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010106169405A Active CN102074020B (en) 2010-12-31 2010-12-31 Method for performing multi-body depth recovery and segmentation on video

Country Status (1)

Country Link
CN (1) CN102074020B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017108413A1 (en) * 2015-12-21 2017-06-29 Koninklijke Philips N.V. Processing a depth map for an image

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013026826A (en) * 2011-07-21 2013-02-04 Sony Corp Image processing method, image processing device and display device
US8938114B2 (en) * 2012-01-11 2015-01-20 Sony Corporation Imaging device and method for imaging hidden objects
US9621869B2 (en) * 2012-05-24 2017-04-11 Sony Corporation System and method for rendering affected pixels
CN102903096B (en) * 2012-07-04 2015-06-17 北京航空航天大学 Monocular video based object depth extraction method
CN103002309B (en) * 2012-09-25 2014-12-24 浙江大学 Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera
CN103198486B (en) * 2013-04-10 2015-09-09 浙江大学 A kind of depth image enhancement method based on anisotropy parameter
CN103500447B (en) * 2013-09-18 2015-03-18 中国石油大学(华东) Video foreground and background partition method based on incremental high-order Boolean energy minimization
US20150381972A1 (en) * 2014-06-30 2015-12-31 Microsoft Corporation Depth estimation using multi-view stereo and a calibrated projector
CN104616286B (en) * 2014-12-17 2017-10-31 浙江大学 Quick semi-automatic multi views depth restorative procedure
CN104574379B (en) * 2014-12-24 2017-08-25 中国科学院自动化研究所 A kind of methods of video segmentation learnt based on target multi-part
CN106056622B (en) * 2016-08-17 2018-11-06 大连理工大学 A kind of multi-view depth video restored method based on Kinect cameras
US11361508B2 (en) * 2020-08-20 2022-06-14 Qualcomm Incorporated Object scanning using planar segmentation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101142593A (en) * 2005-03-17 2008-03-12 英国电讯有限公司 Method of tracking objects in a video sequence
CN101271578A (en) * 2008-04-10 2008-09-24 清华大学 Depth sequence generation method of technology for converting plane video into stereo video
CN101789124A (en) * 2010-02-02 2010-07-28 浙江大学 Segmentation method for space-time consistency of video sequence of parameter and depth information of known video camera

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101142593A (en) * 2005-03-17 2008-03-12 英国电讯有限公司 Method of tracking objects in a video sequence
CN101271578A (en) * 2008-04-10 2008-09-24 清华大学 Depth sequence generation method of technology for converting plane video into stereo video
CN101789124A (en) * 2010-02-02 2010-07-28 浙江大学 Segmentation method for space-time consistency of video sequence of parameter and depth information of known video camera

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Long Quan et al. Image-Based Modeling by Joint Segmentation. International Journal of Computer Vision, 2007, Vol. 75, No. 1. *
Sing Bing Kang et al. Extracting View-Dependent Depth Maps from a Collection of Images. International Journal of Computer Vision, 2004, Vol. 58, No. 2. *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017108413A1 (en) * 2015-12-21 2017-06-29 Koninklijke Philips N.V. Processing a depth map for an image

Also Published As

Publication number Publication date
CN102074020A (en) 2011-05-25

Similar Documents

Publication Publication Date Title
CN102074020B (en) Method for performing multi-body depth recovery and segmentation on video
Roussos et al. Dense multibody motion estimation and reconstruction from a handheld camera
EP2595116A1 (en) Method for generating depth maps for converting moving 2d images to 3d
CN106910242A (en) The method and system of indoor full scene three-dimensional reconstruction are carried out based on depth camera
CN111882668B (en) Multi-view three-dimensional object reconstruction method and system
CN109242873A (en) A method of 360 degree of real-time three-dimensionals are carried out to object based on consumer level color depth camera and are rebuild
Zhang et al. Recovering consistent video depth maps via bundle optimization
Lee et al. Silhouette segmentation in multiple views
US20090285544A1 (en) Video Processing
CN103002309B (en) Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera
Zhang et al. Simultaneous multi-body stereo and segmentation
Bebeselea-Sterp et al. A comparative study of stereovision algorithms
WO2018133119A1 (en) Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera
CN103049929A (en) Multi-camera dynamic scene 3D (three-dimensional) rebuilding method based on joint optimization
Kahl et al. Multiview reconstruction of space curves
Wang et al. Vid2Curve: simultaneous camera motion estimation and thin structure reconstruction from an RGB video
Lee et al. Automatic 2d-to-3d conversion using multi-scale deep neural network
Mahmoud et al. Fast 3d structure from motion with missing points from registration of partial reconstructions
Kim et al. Multi-view object extraction with fractional boundaries
Fan et al. Collaborative three-dimensional completion of color and depth in a specified area with superpixels
Klose et al. Reconstructing Shape and Motion from Asynchronous Cameras.
Engels et al. Automatic occlusion removal from façades for 3D urban reconstruction
Guo et al. Mesh-guided optimized retexturing for image and video
Ruhl et al. Interactive scene flow editing for improved image-based rendering and virtual spacetime navigation
Gupta et al. 3dfs: Deformable dense depth fusion and segmentation for object reconstruction from a handheld camera

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210709

Address after: Room 288-8, 857 Shixin North Road, ningwei street, Xiaoshan District, Hangzhou City, Zhejiang Province

Patentee after: ZHEJIANG SHANGTANG TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 310027 No. 38, Zhejiang Road, Hangzhou, Zhejiang, Xihu District

Patentee before: ZHEJIANG University

TR01 Transfer of patent right