WO2014075224A1

WO2014075224A1 - Video object segmentation with LLC modeling

Info

Publication number: WO2014075224A1
Application number: PCT/CN2012/084536
Authority: WO
Inventors: Gang Cheng; Lin Du; Zhenglong Li
Original assignee: Thomson Licensing
Priority date: 2012-11-13
Filing date: 2012-11-13
Publication date: 2014-05-22

Abstract

Video object segmentation is accomplished utilizing Locality-constrained Linear Coding (LLC) modeling and adaptive model learning. A video object isolator utilizes an LLC segmentor to segment a video object from a video based on the LLC modeling. Additionally, the video object isolator can use an LLC model learner to further adapt the model.

Description

VIDEO OBJECT SEGMENTATION WITH LLC MODELING

BACKGROUND

[0001] With the development of capture and storage devices, video data has increased tremendously in the last few years. It is commonly believed that it will continue to increase in the future. However, only a few objects in the video are useful for content understanding and analyzing. Therefore, video object segmentation is necessary for video processing.

[0002] Video object segmentation can be regarded as a labeling problem, where each pixel in every frame is assigned a unique label: foreground or background. Intuitively, this can be done by image segmentation if the video is decoded into a sequence of image frames. Image segmentation methods include, for example, Magic Wand (see generally, Li, Y., Sun, J., and Shum, H.-Y.: 'Video object cut and paste'. Proc. ACM SIGGRAPH 2005 Papers, Los Angeles, California, 2005) and graph cuts (see generally, Y. Y. Boykov and M. P. Jolly, "Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images," in Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), 2001, pp. 105-112, vol. 1).

[0003] However, most segmentation methods require user interaction, and it is tedious for a user to provide it for every frame. To solve this problem, Li et al. (Y. Li, J. Sun, and H.-Y. Shum, "Video object cut and paste," presented at the ACM SIGGRAPH 2005 Papers, Los Angeles, California, 2005) proposed constructing a 3D graph on video frames, which can be viewed as a spatial-temporal volume. They used watershed to pre-segment the image and optimized the energy function with graph cuts. However, all of these approaches are either based on superpixels or are not very robust. As a result, temporal coherency has been difficult to maintain, and post-processing such as feature tracking or constrained 2D graph cut is needed to clean up noisy segmentation results.

SUMMARY

[0004] Video object segmentation is accomplished utilizing locality-constrained linear coding (LLC) modeling and adaptive model learning. The video sequence is processed frame by frame. In each iteration, a three-dimensional (3D) graph based on two successive frames is constructed and then graph cuts are used to determine a video object. In one instance, LLC is utilized to model the foreground and background, and online model learning combined with LLC is used to adapt to the variation of an object in a video. The techniques permit better constructions, local smooth sparsity and analytical solutions.

[0005] The above presents a simplified summary of the subject matter in order to provide a basic understanding of some aspects of subject matter embodiments. This summary is not an extensive overview of the subject matter. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the subject matter. Its sole purpose is to present some concepts of the subject matter in a simplified form as a prelude to the more detailed description that is presented later.

[0006] To the accomplishment of the foregoing and related ends, certain illustrative aspects of embodiments are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the subject matter can be employed, and the subject matter is intended to include all such aspects and their equivalents. Other advantages and novel features of the subject matter can become apparent from the following detailed description when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] FIG. 1 is a flow diagram of a method of video segmentation with LLC.

[0008] FIG. 2 is an example of results of a sample video frame.

[0009] FIG. 3 is a flow diagram of a method of model learning.

[0010] FIG. 4 is an example system employing an embodiment.

DETAILED DESCRIPTION

[0011] The subject matter is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the subject matter. It can be evident, however, that subject matter embodiments can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the embodiments.

[0012] Video object segmentation utilizing LLC to model the foreground and background model provides better construction, local smooth sparsity and an analytical solution. In addition, online model learning adapts to the variations of an object in a video that is combined with LLC. These techniques solve many of the difficulties faced with other methods.

[0013] Video object segmentation by minimal cuts of the graph can be viewed as equivalent to an energy minimization problem. Although many energy functions have been proposed in recent works, likelihood energy is one of the most widely used. It evaluates the conformity of each node to the foreground or background model. In Boykov and Jolly (supra), a histogram of the intensity distribution is used to model the foreground and background. Likelihood energy can then be calculated as the negative log-likelihood of the probability density function. While this is efficient for gray images, it is not tractable for color images because of the massive number of histogram bins (256x256x256). In order to solve this problem, Li et al. (see generally, Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, "Lazy snapping," presented at the ACM SIGGRAPH 2004 Papers, Los Angeles, California, 2004) introduce K-means to cluster the colors and then use the cluster centers to calculate the likelihood energy based on the distance between each pixel and its nearest center. In addition, Gaussian Mixture Models (GMM) can be introduced to replace the histogram of intensity distribution, and with GMM, iterative optimization can also be used to refine the segmentation via user interaction (see generally, C. Rother, V. Kolmogorov, and A. Blake, ""GrabCut": interactive foreground extraction using iterated graph cuts," presented at the ACM SIGGRAPH 2004 Papers, Los Angeles, California, 2004).
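The nearest-center likelihood energy described above can be illustrated with a minimal Python sketch. The normalization of the two distances into a pair of energies that sum to one is an assumption made for illustration (the text only states that the energy is based on the distance to the nearest center), and the function names are hypothetical.

```python
import math


def nearest_center_distance(pixel, centers):
    """Distance from an RGB pixel to the closest cluster center."""
    return min(math.dist(pixel, center) for center in centers)


def likelihood_energy(pixel, fg_centers, bg_centers):
    """Return (E1 if labelled foreground, E1 if labelled background).

    Assumption: each energy is the distance to the matching model divided
    by the sum of both distances, so a pixel close to a foreground center
    is cheap to label foreground and expensive to label background.
    """
    d_fg = nearest_center_distance(pixel, fg_centers)
    d_bg = nearest_center_distance(pixel, bg_centers)
    total = d_fg + d_bg
    if total == 0:
        return 0.5, 0.5  # pixel matches both models exactly: no preference
    return d_fg / total, d_bg / total
```

With only a handful of centers per model, every pixel is scored against a very coarse color description, which is the limitation the following paragraph addresses.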

[0014] However, while these methods are practicable for image segmentation, they are insufficient for video object segmentation. Because of computational complexity, only a few cluster centers or GMM components are used to calculate the likelihood energy, which is not enough for a cluttered foreground or background. More importantly, an object in a video varies in almost every frame, for example by moving, rotating or scaling. Therefore, it is necessary to design a scheme that adapts to the variation of an object in the video. In image categorization, each image is commonly modeled by a histogram of its local features. If the model in image categorization is viewed as the counterpart of the histogram or GMM components in video object segmentation, a coding scheme can be used to calculate the likelihood energy. Recently, LLC has achieved excellent performance in image categorization (see generally, Kai Yu, Tong Zhang, and Yihong Gong. Nonlinear learning using local coordinate coding. NIPS'09). Compared with sparse coding, LLC presents some attractive properties such as better construction, local smooth sparsity and an analytical solution.

[0015] FIG. 1 is an example method 100 of one instance of a technique to provide video object segmentation. Given a video sequence 102, a first frame is grabbed for the initialization 104. The initialization includes two parts: getting an object mask and initializing the foreground and background models. The object mask can be calculated by any image segmentation method. The foreground or background model is represented by color centers, which can be clustered by K-means. Then the video sequence is processed frame by frame. The likelihood energy is calculated with LLC 106, and the smoothness energies E₂ and E₃ are calculated as well 108. In each iteration, a 3D graph G = (V, ε) is constructed 110. V is the set of all nodes, which are divided into two parts: terminals {s, t} (denoting foreground and background) and non-terminals (denoting pixels in both frames). ε is the set of all edges, which are of two types: intra-frame edges ε_T (connecting adjacent pixels in the same frame) and inter-frame edges ε_I (connecting adjacent pixels in adjacent frames). It has been proven that the minimal cut problem of such a graph is identical to the following energy minimization:

(Eq. 1) E(X) = \sum_{i \in V} E_1(x_i) + \alpha \sum_{(i,j) \in \varepsilon_T} E_2(x_i, x_j) + \beta \sum_{(i,j) \in \varepsilon_I} E_3(x_i, x_j)

where x_i is the label of each node p_i and X = {x_i : ∀i}. The first term E₁ evaluates the conformity of each node to the foreground or background model, so it is also referred to as likelihood energy. The last two terms E₂ and E₃ measure the differences of adjacent nodes: E₂ for the ones in the same frame, E₃ for the ones between two adjacent frames. Therefore, they are commonly viewed as the representation for smoothness and can be defined as follows:

(Eq. 2) … dist(i, j)

where … {p_i, p_j} … and E is the expectation of color contrast.
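To make the structure of Eq. 1 concrete, the following Python sketch evaluates the total energy of one candidate labeling. It assumes that each smoothness term reduces to a precomputed edge weight that is charged only when the two connected pixels receive different labels; the weights themselves, the function name and the argument layout are hypothetical and are not taken from the source.

```python
def total_energy(labels, e1, intra_edges, inter_edges, alpha=1.0, beta=1.0):
    """Evaluate Eq. 1 for one labeling of the pixels in two successive frames.

    labels[i]   : 0 (background) or 1 (foreground) for pixel i
    e1[i]       : pair (E1(x_i = 0), E1(x_i = 1)), the likelihood energies
    intra_edges : {(i, j): weight} for adjacent pixels in the same frame
    inter_edges : {(i, j): weight} for adjacent pixels in adjacent frames
    """
    likelihood = sum(e1[i][x] for i, x in enumerate(labels))

    def smoothness(edges):
        # Assumed form: an edge costs its weight only if the labels differ.
        return sum(w for (i, j), w in edges.items() if labels[i] != labels[j])

    return likelihood + alpha * smoothness(intra_edges) + beta * smoothness(inter_edges)


# Pixels 0 and 1 belong to one frame, pixels 2 and 3 to the next frame.
e1 = [(9, 0), (0, 9), (1, 2), (0, 9)]
intra = {(0, 1): 1, (2, 3): 1}
inter = {(0, 2): 5, (1, 3): 5}
print(total_energy([1, 0, 1, 0], e1, intra, inter))
```

A graph cut finds the labeling that minimizes this quantity without enumerating all labelings.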

[0016] With LLC coding, for each pixel p_i ∈ R^3 (RGB), the following criterion should be satisfied:

(Eq. 3) \min_{c} \sum_i \|p_i - D c_i\|^2 + \lambda \|d_i \odot c_i\|^2, \quad s.t. \ \mathbf{1}^T c_i = 1, \ \forall i

where \odot denotes element-wise multiplication, d_i ∈ R^M is the locality adaptor, D ∈ R^{3×M} represents the model and c_i ∈ R^M is the coefficient. Although (Eq. 3) has an analytical solution and is fast to calculate, approximated LLC is used to speed up the optimization. With approximated LLC, the K nearest neighbors of p_i are first found and taken as the local bases D_i, and then a much smaller linear system is solved to get the optimized coefficient c_i:

(Eq. 4) c_i = \arg\min_{c_i} \|p_i - D_i c_i\|^2, \quad s.t. \ \mathbf{1}^T c_i = 1, \ \forall i
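The approximated coding step of Eq. 4 can be sketched in plain Python as follows. The sketch uses the usual closed-form route for a sum-to-one constrained least-squares problem: shift the K nearest words by the pixel, solve a K x K linear system against a vector of ones, and normalize. The small diagonal regularizer `reg`, the squared-error form of the returned residual, and all function names are assumptions for illustration; the source does not spell them out.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def llc_code(pixel, words, k=3, reg=1e-4):
    """Return (indices of the K local bases, coefficients, residual)."""
    # The K nearest words of the model act as the local bases D_i.
    order = sorted(range(len(words)), key=lambda j: sq_dist(pixel, words[j]))[:k]
    shifted = [[w - p for w, p in zip(words[j], pixel)] for j in order]
    gram = [[sum(u * v for u, v in zip(a, b)) for b in shifted] for a in shifted]
    # Assumed regularizer: keeps the system solvable when bases are collinear.
    trace = sum(gram[i][i] for i in range(len(order)))
    eps = reg * trace if trace > 0 else reg
    for i in range(len(order)):
        gram[i][i] += eps
    raw = solve_linear(gram, [1.0] * len(order))
    total = sum(raw)
    coeffs = [v / total for v in raw]  # enforces the sum-to-one constraint
    recon = [sum(c * words[j][d] for c, j in zip(coeffs, order))
             for d in range(len(pixel))]
    return order, coeffs, sq_dist(pixel, recon)
```

Because only K words enter the linear system, the cost per pixel does not grow with the total number of words in the model.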

Residual is then computed according to: (Eq. 5)

As two models are kept, one for the foreground and the other for the background, the likelihood energy is defined as follows:

E_1(x_i = 1) = 0, \forall i \in …

where U is the uncertain region and σ is the maximum value of E₂ and E₃. Thus, the max-flow of the graph is solved to get the object 112. The model is then updated with LLC-based learning (discussed infra) 114, and then the iteration continues to the next frame 116 or ends 118 if completed. In FIG. 2, a comparison result 200 is shown with a sample video frame 202. Here, it can be seen that processing with LLC 208 yields a superior isolated object from the sample frame 202 compared to GMM 204 and K-means 206.
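The max-flow step can be illustrated on a toy graph with a short Python sketch. It uses a plain breadth-first augmenting-path (Edmonds-Karp) solver, which is chosen here only for brevity and is not named in the source. The graph layout (source-side links carry the cost of labeling a pixel background, sink-side links carry the cost of labeling it foreground) and the numeric weights are assumptions for illustration.

```python
from collections import deque


def min_cut_labels(e1, pairwise):
    """Label pixels by a minimum s-t cut.

    e1[i]            : pair (cost if pixel i is background, cost if foreground)
    pairwise[(i, j)] : penalty paid when pixels i and j get different labels
    Returns (labels, cut_value) with label 1 = foreground (source side).
    """
    n = len(e1)
    s, t = n, n + 1
    cap = [dict() for _ in range(n + 2)]

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)  # residual edge

    for i, (cost_bg, cost_fg) in enumerate(e1):
        add_edge(s, i, cost_bg)  # cut when pixel i ends up background
        add_edge(i, t, cost_fg)  # cut when pixel i ends up foreground
    for (i, j), w in pairwise.items():
        add_edge(i, j, w)
        add_edge(j, i, w)

    flow = 0.0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break  # no path left: the flow is maximal
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Pixels still reachable from the source form the foreground.
    labels = [1 if i in parent else 0 for i in range(n)]
    return labels, flow
```

In the toy input used for checking this sketch, a pixel whose own likelihood slightly prefers background is still labeled foreground because a strong inter-frame edge ties it to a confident foreground pixel in the previous frame.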

[0017] Generally, a foreground or background model is generated by a clustering method such as K-means 206 or Gaussian Mixture Models 204. In most cases, these methods are sufficient to model the foreground and background for image segmentation. In order to tackle foreground or background clutter, iterative optimization based on interactive user input is used to update the model. While this is acceptable for image segmentation, it is insufficient for video object segmentation, because it is tedious for users to do interactive labeling on every frame of the video.

[0018] As video object segmentation is done frame by frame, once a segmentation is given, the current segmentation result can be used to learn the model for the next frame. In addition, when a motion estimation method is used to propagate the labels of specified pixels, model learning can be reinforced with these labeled pixels. Segmentation and model learning can be optimized iteratively, which can be solved with the Coordinate Descent method. Given a video and a first manually labeled frame, an initialized model D_init can be trained by K-means or other clustering methods. As illustrated in FIG. 3, a method 300 begins to learn a model 302 by initializing parameters 304. Then an outer loop 324 can be implemented for each frame of the video to update the model.

[0019] The segmentation is first done based on the previously updated model, and the summation of the coefficients c_sum is also initialized in order to check the validity of each word of the model 306. Next, the labeled pixels are looped through 322 to update the corresponding words of the model, which is solved in a gradient descent manner 308, 310, 312, 314. In each inner loop, only the related words (in practice, the nearest ones) are updated, not the whole model. As a result, a larger model can be used in an instance of the present methods for accuracy of modeling, which is an advantage over GMM or K-means based methods, whose model size is limited by efficiency. At the end of the iteration, c_sum is checked so that unused words of the model are replaced by randomly sampled words, in order to adapt to the variation of the foreground or background 316. When the frames 318 are completed, the learning ends 320.
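One pass of the per-frame model update can be sketched in Python as follows. Several details are assumptions made only to keep the sketch short and self-contained: each pixel is coded over its two nearest words using the exact sum-to-one least-squares solution for K = 2, the gradient is that of the squared reconstruction error with a fixed step size, word usage is tracked as a sum of absolute coefficients, and replacement words are sampled from the labeled pixels of the current frame. The individual steps 308 to 314 of FIG. 3 are not detailed in the text, so this is not a reproduction of them.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def code_two_nearest(pixel, words):
    """Sum-to-one least-squares code over the two nearest words."""
    i0, i1 = sorted(range(len(words)), key=lambda j: sq_dist(pixel, words[j]))[:2]
    diff = [b - a for a, b in zip(words[i0], words[i1])]
    denom = sum(d * d for d in diff)
    a = 0.0
    if denom > 0:
        a = sum((p - w) * d for p, w, d in zip(pixel, words[i0], diff)) / denom
    return (i0, i1), (1.0 - a, a)


def update_model(words, labeled_pixels, step=0.1, min_usage=1e-3, rng=None):
    """Return (updated words, per-word coefficient sums) for one frame."""
    rng = rng or random.Random(0)
    words = [list(w) for w in words]
    c_sum = [0.0] * len(words)  # usage of each word over this frame
    for pixel in labeled_pixels:
        idx, coeffs = code_two_nearest(pixel, words)
        recon = [sum(c * words[j][d] for c, j in zip(coeffs, idx))
                 for d in range(len(pixel))]
        err = [p - r for p, r in zip(pixel, recon)]
        # Only the words that took part in the code are touched.
        for c, j in zip(coeffs, idx):
            c_sum[j] += abs(c)
            for d in range(len(pixel)):
                # Descent step on ||p - D c||^2 with respect to word j.
                words[j][d] += step * 2.0 * c * err[d]
    # Words that were never used are replaced by randomly sampled pixels.
    for j, usage in enumerate(c_sum):
        if usage <= min_usage:
            words[j] = list(rng.choice(labeled_pixels))
    return words, c_sum
```

Since each pixel touches only two words, the cost of a pass depends on the number of labeled pixels rather than on the model size, which is what makes a larger model affordable.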

[0020] The above methods and processes can be employed in whole or in part in an example system 400 shown in FIG. 4. The video object isolator 402 segments objects found in a video 404 such that it can output an isolated video object 406. The video object isolator 402 can reside on a processor that has been configured to perform the steps and methods described herein. The processor can utilize hardware, firmware and/or software as required to properly perform these functions. The video object isolator 402 segments objects from the video 404 by utilizing an LLC object segmentor 408 that employs the techniques described above to segment an object from the video 404. An LLC model learner 410 can optionally be employed to help adapt the model to better segment an object from the video 404.
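The composition of system 400 can be sketched as a small Python class. The class name mirrors the component names in the text, but the callable signatures for the segmentor and the learner are hypothetical and chosen only to show how an optional learner plugs into the frame-by-frame loop.

```python
class VideoObjectIsolator:
    """Wires an LLC object segmentor to an optional LLC model learner."""

    def __init__(self, segmentor, learner=None):
        # segmentor(prev_frame, frame, prev_mask, model) -> mask   (assumed)
        # learner(model, frame, mask) -> updated model             (assumed)
        self.segmentor = segmentor
        self.learner = learner

    def isolate(self, frames, initial_mask, model):
        """Process successive frame pairs and return (masks, final model)."""
        masks = [initial_mask]
        for prev_frame, frame in zip(frames, frames[1:]):
            mask = self.segmentor(prev_frame, frame, masks[-1], model)
            if self.learner is not None:
                model = self.learner(model, frame, mask)
            masks.append(mask)
        return masks, model
```

Keeping the learner optional matches the description: the isolator can run with a fixed model or adapt it after every segmented frame.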

[0021] What has been described above includes examples of the embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the embodiments, but one of ordinary skill in the art can recognize that many further combinations and permutations of the embodiments are possible. Accordingly, the subject matter is intended to embrace all such alterations, modifications and variations that fall within scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim.

Claims

1. A system that provides video object segmentation, comprising:

a video object isolator that performs video object segmentation based on a Locality-constrained Linear Coding (LLC) model.

2. The system of claim 1, wherein the video object isolator uses adaptive model learning.

3. The system of claim 1, wherein the video object isolator uses an iterative frame by frame process to segment an object.

4. The system of claim 1, wherein the video object isolator uses the LLC model on a foreground and background of a video.

5. The system of claim 4, wherein the video object isolator determines the energy likelihood based on the LLC model.

6. The system of claim 1, wherein the video object isolator constructs a three-dimensional graph based on two successive frames and uses graph cuts to segment a video object.

7. A method for video object segmentation, comprising:

segmenting a video object using a Locality-constrained Linear Coding (LLC) model.

8. The method of claim 7 further comprising:

using adaptive model learning to adapt the LLC model to improve video object segmentation.

9. The method of claim 7 further comprising:

processing a video frame by frame to build a three-dimensional graph based on two successive frames; and

using graph cuts to segment the video object.

10. The method of claim 7 further comprising:

using LLC to model a foreground and a background of a video to segment a video object.

11. The method of claim 7 further comprising:

using online model learning to adapt to a variation of an object in a video combined with LLC.

12. A system that isolates an object in a video, comprising:

a means for isolating a video object using a Locality-constrained Linear Coding (LLC) model; and

a means for using adaptive model learning to adapt to a variation of an object in a video combined with LLC.

13. The system of claim 12 further comprising:

a means for processing a video frame by frame to build a three-dimensional graph based on two successive frames; and

a means for using graph cuts to isolate the video object.