CA2780710A1 - Video segmentation method - Google Patents

Video segmentation method

Info

Publication number
CA2780710A1
Authority
CA
Canada
Prior art keywords
pixel
foreground
background
computer implemented
implemented method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA2780710A
Other languages
French (fr)
Inventor
Minglun Gong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genesis Group Inc
Original Assignee
Genesis Group Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genesis Group Inc filed Critical Genesis Group Inc
Priority to CA2780710A priority Critical patent/CA2780710A1/en
Publication of CA2780710A1 publication Critical patent/CA2780710A1/en
Abandoned legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/143 - Segmentation; Edge detection involving probabilistic approaches, e.g. Markov random field [MRF] modelling
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/194 - Segmentation; Edge detection involving foreground-background segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A system and method implemented as a software tool for foreground segmentation of video sequences in real-time, which uses two Competing 1-class Support Vector Machines (C-1SVMs) operating to separately identify background and foreground. A globalized, weighted optimizer may resolve unknown or boundary conditions following convergence of the C-1SVMs. The objective of foreground segmentation is to extract the desired foreground object from live input videos, with fuzzy boundaries captured by freely moving cameras. The present disclosure proposes the method of training and maintaining two competing classifiers, based on Competing 1-class Support Vector Machines (C-1SVMs), at each pixel location, which model local color distributions for foreground and background, respectively. By introducing novel acceleration techniques and exploiting the parallel structure of the algorithm (including reweighting and max-pooling of data), real-time processing speed is achieved for VGA-sized videos.

Description

VIDEO SEGMENTATION METHOD
DESCRIPTION
FIELD
[001] The present application relates to methods and systems for image and video processing, and in particular, methods and systems for extracting foreground objects from video sequences.
BACKGROUND
[002] Real-time foreground segmentation for live video relates to methods and systems for image and video processing, and in particular, methods and systems for extracting foreground objects from live video sequences, even where the boundaries between the foreground objects and the background regions are complicated by multiple closed regions, partially covered pixels, similar background colour, similar background textures, etc., in a computationally efficient manner so as to permit so-called "real-time" operation.
[003] Foreground segmentation, also referred to as video cutout, is the extraction of objects of interest from input videos. It is a fundamental problem in computer vision and often serves as a pre-processing step for other video analysis tasks such as surveillance, teleconferencing, action recognition and retrieval. A
significant number of techniques have been proposed in both the computer vision and graphics communities. However, some of them are limited to sequences captured by stationary cameras, whereas others require large training datasets or cumbersome user interactions. Furthermore, most existing algorithms are rather complicated and computationally demanding. As a result, there is still a lack of an efficient yet powerful algorithm that can process challenging live video scenes with minimal user interaction.
[004] There is a need for a system and method for foreground segmentation which is both robust, and computationally efficient.
[005] Existing approaches to foreground segmentation may be categorized as unsupervised and supervised.
[006] Unsupervised approaches try to generate background models automatically and detect outliers of the models as foreground. Most of them, referred to as background subtraction approaches, assume that the input video is captured by a stationary camera and model background colors at each pixel location using either generative methods (e.g. J. Zhong and S. Sclaroff, Segmenting foreground objects from a dynamic textured background via a robust Kalman filter, ICCV, 2003 [Zhong 1]; or J. Sun, W. Zhang, X. Tang, and H. Shum, Background cut, ECCV, 2006 [Sun 2]) or nonparametric methods (for example: Y. Sheikh and M. Shah, Bayesian object detection in dynamic scenes, CVPR, 2005 [Sheikh 3]; or J. Wang, P. Bhat, A. Colburn, M. Agrawala, and M. Cohen, Interactive video cutout, SIGGRAPH, 2005 [Wang 4]). Some of these techniques can handle repetitive background motion, such as rippling water and waving trees, but are unsuitable for a camera in motion.
[007] Considering existing unsupervised methods where camera motion does not change the viewing position, such as PTZ security cameras, the background motion has been described by a homography, which has been used to align different frames before applying conventional background subtraction methods (e.g. E. Hayman and J. Eklundh, Statistical background subtraction for a mobile observer, ICCV, 2003 [Hayman 5]). The method of Y. Sheikh, O. Javed, and T. Kanade, Background subtraction for freely moving cameras, ICCV, 2009 [Sheikh 6], proposed to deal with freely moving cameras by tracking the trajectories of salient features across the whole video, where the trajectories are used for estimating the background trajectory space, based on which foreground feature points can be detected accordingly. While this method automatically detects moving objects, it tends to classify background with repetitive motion as foreground, as well as to confuse large rigidly moving foreground objects with background.
[008] Supervised methods allow users to provide training examples to train the segmentation method being employed. Certain existing methods (for example: V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother, Bilayer segmentation of binocular stereo video, CVPR, 2005 [Kolmogorov 7]; A.
Criminisi, G. Cross, A. Blake, and V. Kolmogorov, Bilayer segmentation of live video, CVPR, 2006 [Criminisi 8]; and P. Yin, A. Criminisi, J. Winn, and I. Essa, Tree-based classifiers for bilayer video segmentation, CVPR, 2007 [Yin 9]) integrate multiple visual cues such as color, contrast, motion, and stereo with the help of structured prediction methods such as conditional random fields. Although operational for video conferencing applications, these algorithms require a large set of fully annotated images and considerable offline training, which bring up many issues when attempting to apply them in different scene setups.
[009] Some existing matting algorithms also provide supervised foreground segmentation by modelling the video sequence as a 3D volume of voxels. Users are required to label fore/background on multiple frames or directly on the 3D
volume. To enforce temporal coherence, these algorithms usually segment over the entire volume at one time, which restricts their capacity toward live video processing.
[010] The Video SnapCut by X. Bai, J. Wang, D. Simons, and G. Sapiro, Video SnapCut: robust video object cutout using localized classifiers, SIGGRAPH, 2009 [Bai 10], is one existing technique: starting from a segmentation of the first frame, both global and local classifiers are trained using color and shape cues, then labelling information is propagated to the rest of the video frame by frame. Video SnapCut expects users to provide a fine annotation of the entire first frame, which can be challenging for fuzzy objects, and runs at about 1 FPS for VGA-sized videos (excluding the time for matting).
[011] There is a need for a robust, minimally supervised video segmentation technique which is able to operate in real time, and which is able to handle freely moving cameras and/or background images.
[012] There is a need for a video segmentation technique designed for parallel computing which is both easy to implement and has low computation cost, that is capable of dealing with challenging video segmentation scenarios with minimal user interaction.
SUMMARY
[013] This present application relates to a foreground/background segmentation approach that is designed for parallel computing, is both easy to implement and has low computation cost, and is capable of dealing with challenging video segmentation scenarios with minimal user interaction. As shown in Figure 1, with only a few strokes from the user on the first frame of the video, the preferred embodiment of the present invention is able to propagate labelling information to neighbouring pixels through a simple train-relabel procedure, resulting in a proper segmentation of the frame. This same procedure is used to further propagate labeling information across adjacent frames, regardless of the fore/background motion.
Several techniques are also proposed in order to reduce computational costs.
Furthermore, by exploiting the parallel structure of the proposed algorithm, real-time processing speed of 14 frames per second (FPS) is achieved for VGA-sized videos.
[014] A number of improvements are proposed. First, the segmentation method maintains two Competing 1-class Support Vector Machines (C-1SVMs) at each pixel location, rather than operating a single classifier. A first competing 1-class Support Vector Machine (C-1SVM) captures the local foreground color densities and a second competing 1-class Support Vector Machine (C-1SVM) captures the local background color densities separately from the first C-1SVM. However, the two C-1SVMs determine the proper foreground/background label for the pixel jointly.

Through iterations between training local C-1SVMs and applying them to label the pixels, the algorithm can effectively propagate initial user labeling to the whole image and to consecutive frames. The frame/image is partitioned into known foreground (if the foreground-1SVM says foreground and the background-1SVM says not background), known background (if the foreground-1SVM says not foreground and the background-1SVM says background) and unknown (if the foreground-1SVM and background-1SVM disagree as to classification). Then, optionally, the unknown pixels are forced to either foreground or background by a smoothing function. The smoothing function disclosed in Algorithm 2 below is a globally optimized thresholded costing function biased towards foreground identification. On a step-wise basis as frames advance, the edges of the foreground and background are eroded, and the impact of older frames on the support vectors for individual pixels is attenuated over time.
[015] Choice of grid sizes, and novel approaches to structuring the grid for computational purposes, provide optional advantages. In general, each pixel may be trained using its own neighbourhood of sufficient size, which may be augmented by training based on the centre points of one or more neighbourhoods elsewhere within the image. In the different approaches shown: (i) a pixel may be trained with reference to all pixels within a shape about the pixel; (ii) a pixel may be trained with reference to all the pixels within its neighbourhood and then with the middle pixels in adjacent neighbourhoods of similar size; or even (iii) a pixel may be trained with reference to its neighbourhood and the centre points of neighbourhoods of similar size that are not fully adjacent to the neighbourhood of the pixel, but separated by some known distance in a pseudo-adjacency. Furthermore, by exploiting the parallel structure of the proposed algorithm, and appropriate grid spacing and sizes, real-time processing speed of 14 frames per second (FPS) is achieved for VGA-sized videos.
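For illustration only, the subgroup layout of Figure 5 (a 33x33 neighbourhood tiled by twenty-five 5x5 subgroups separated by 2-pixel gaps) can be generated as offsets from the centre pixel. This is a hypothetical helper, not code from the patent:

```python
import numpy as np

def subgroup_centres(subgroup=5, per_side=5, gap=2):
    """Return (row, col) offsets of subgroup centres relative to the centre pixel.

    With subgroup=5, per_side=5, gap=2 the outermost subgroup edge lies at
    offset 16, i.e. the 33x33 neighbourhood of Figure 5.
    """
    pitch = subgroup + gap                     # distance between subgroup centres
    half = (per_side - 1) // 2
    offsets = [(r * pitch, c * pitch)
               for r in range(-half, half + 1)
               for c in range(-half, half + 1)]
    return np.array(offsets)

centres = subgroup_centres()
print(len(centres))                   # 25 subgroups
print(centres.min(), centres.max())   # centres at -14..14; 5x5 subgroups span -16..16
```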
[016] The steps of the segmentation method disclosed in this application can be summarized with reference to Figure 11 as follows:
[017] Step 1: the design parameters of the 1-class support vector machines (1SVMs) to be used to separately classify foreground and background at each pixel are established. The design parameters include: the choice of kernel function k(·,·); whether the C-1SVMs will train based on batch, online or modified online learning, or some combination; the size and shape of the neighbourhoods about each pixel upon which each of the C-1SVMs will be trained; the score function to be used; and the margin γ. Optionally, the initialization step also includes a choice of whether to classify based on the entire neighbourhood, or only on subgroups within the neighbourhood, in which case max-pooling and spatial decay (discussed below) would be used to classify each pixel according to a train-relabel loop.
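As a minimal sketch, the design parameters named in Step 1 could be collected in a single structure. The structure and its numeric defaults are illustrative assumptions: the values are borrowed from the example parameters recited in claim 37, and the Gaussian RBF kernel is one of the kernel choices listed in claim 7; the kernel bandwidth is not specified in this document.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class C1SVMParams:
    """Illustrative design parameters for the competing 1-class SVMs."""
    gamma_margin: float = 1.0     # margin gamma
    cutoff_C: float = 0.5         # cut-off value C
    tau_spatial: float = 0.05     # spatial decay
    T_F_low: float = 0.1          # foreground-loss low threshold
    T_F_high: float = 0.3         # foreground-loss high threshold
    T_B_low: float = 0.4          # background-loss low threshold
    T_B_high: float = 0.4         # background-loss high threshold
    neighbourhood: int = 33       # 33x33 window about each pixel (Figure 5)
    subgroup: int = 5             # 5x5 subgroups
    gap: int = 2                  # 2-pixel gap between subgroups
    kernel_sigma: float = 0.1     # assumed bandwidth of the Gaussian RBF kernel

def gaussian_rbf(x, y, sigma=0.1):
    """Gaussian radial basis kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma * sigma)))
```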
[018] Step 2: Obtaining an image or video stream for processing. The present method operates either on a single digital image or on a video stream of digital images. A single image may be referred to as I, while frame t in a video stream may be referred to as I_t.
[019] Step 3: obtain an initial background sample set (BSS) and an initial foreground sample set (FSS). The sample sets of known background and foreground may be provided by user instructions (e.g. swiping a cursor over the image, identifying particular colours, etc.) in a supervised method, or through an automated unsupervised learning method. In the video stream, at a given pixel p in frame t, the sample sets BSS and FSS are referred to jointly as the label L_t(p).
[020] Step 4: Training of the C-1SVMs occurs as follows. For each pixel, train the background-1SVM (B-1SVM) using the BSS and train the foreground-1SVM (F-1SVM) using the FSS.
[021] Step 5: Classification of each pixel is performed independently by each 1SVM. The classification routine may be run on the entire neighbourhood, or by max-pooling over specified subgroups within the neighbourhood, as discussed below.
[022] Step 6: Relabeling of the BSS and FSS occurs on a pixel-wise basis if the C-1SVMs agree as to the classification of the pixel as foreground or background; otherwise, the pixel is not relabelled. Steps 4 through 6 are repeated in a Train-Relabel loop until no new pixels are labelled. Four categories of pixels result: those labelled foreground by both classifiers, those labelled background by both classifiers, those labelled background by the F-1SVM and foreground by the B-1SVM, and those labelled foreground by the F-1SVM and background by the B-1SVM. This is a sufficient output of the segmentation method, but additional steps are also possible.
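A high-level sketch of the Train-Relabel loop of Steps 4 through 6 follows. The per-pixel training and scoring routines (train_fn, score_fn) are placeholders and the default thresholds are the example values recited in claim 37; only the joint labelling logic mirrors the description, and keeping previously known labels fixed is an assumption of this sketch.

```python
import numpy as np

FG, BG, UNKNOWN = 1, 0, -1

def train_relabel(image, labels, train_fn, score_fn,
                  T_F_low=0.1, T_F_high=0.3, T_B_low=0.4, T_B_high=0.4,
                  max_iters=50):
    """Iterate Steps 4-6 until the label map stops changing.

    train_fn(image, labels) -> (F_models, B_models)   # per-pixel 1SVMs (placeholder)
    score_fn(models, image) -> per-pixel hinge-loss map (placeholder)
    """
    for _ in range(max_iters):
        F_models, B_models = train_fn(image, labels)             # Step 4: train both C-1SVMs
        l_F = score_fn(F_models, image)                          # Step 5: classify
        l_B = score_fn(B_models, image)
        new_labels = np.full_like(labels, UNKNOWN)
        new_labels[(l_F < T_F_low) & (l_B > T_B_high)] = FG      # classifiers agree: foreground
        new_labels[(l_F > T_F_high) & (l_B < T_B_low)] = BG      # classifiers agree: background
        known = labels != UNKNOWN
        new_labels[known] = labels[known]                        # Step 6: keep already-known labels
        if np.array_equal(new_labels, labels):                   # no new pixels labelled
            return labels
        labels = new_labels
    return labels
```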
[023] Step 7: optionally, a binarization step further segments the non-binary output of Step 6 by forcing the unlabeled pixels into foreground or background according to some coherence rule. A global optimizing function (discussed below) has been shown to be useful in this regard.
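The global optimizer itself is only summarized in this extraction (claim 14 describes a Markov random field energy with a data term and a contrast-sensitive smoothness term, conventionally minimized with graph cuts). As a stand-in, the sketch below resolves the unknown pixels with a few sweeps of iterated conditional modes (ICM) over that kind of energy; the λ and β values and the choice of ICM instead of graph cuts are assumptions, not the patent's optimizer. Ties are broken toward foreground, echoing the foreground bias mentioned in paragraph [014].

```python
import numpy as np

FG, BG, UNKNOWN = 1, 0, -1

def binarize_unknown(image, labels, data_cost_fg, data_cost_bg,
                     lam=2.0, beta=10.0, sweeps=5):
    """Force UNKNOWN pixels to FG/BG using an MRF-style data + contrast energy.

    data_cost_fg / data_cost_bg: per-pixel costs of assigning foreground / background
    (e.g. derived from the 1SVM losses).  The pairwise penalty for a label change
    across similar colours is lam * exp(-beta * ||I_p - I_q||^2).
    """
    H, W = labels.shape
    out = labels.copy()
    out[out == UNKNOWN] = BG                        # arbitrary initialization of unknowns
    for _ in range(sweeps):
        for y in range(H):
            for x in range(W):
                if labels[y, x] != UNKNOWN:         # only unknown pixels are re-decided
                    continue
                cost = {FG: float(data_cost_fg[y, x]), BG: float(data_cost_bg[y, x])}
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        diff = image[y, x].astype(float) - image[ny, nx].astype(float)
                        w = lam * np.exp(-beta * float(np.dot(diff, diff)))
                        for lbl in (FG, BG):
                            if lbl != out[ny, nx]:
                                cost[lbl] += w      # contrast-sensitive smoothness penalty
                out[y, x] = FG if cost[FG] <= cost[BG] else BG   # tie favours foreground
    return out
```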
[024] Step 8: optionally, a smooth/erode function may be used to further prepare the output of Step 6 for insertion back into the algorithm at Step 4 as a proxy for additional supervised user labelling. Using the global optimizer to smooth the data and/or eroding the boundary by a fixed number of pixels prepares new BSS and FSS data for the Train-Relabel loop in the following frame.
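A minimal sketch of the Step 8 erode operation, using SciPy's binary erosion to pull both the foreground and background seed regions back from the segmentation boundary before they seed the next frame's Train-Relabel loop. The 3-pixel erosion radius and the square structuring element are assumptions; the patent only says the boundary is eroded by a fixed number of pixels.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def next_frame_seeds(fg_mask, radius=3):
    """Erode both sides of the boundary so only confident pixels seed the next frame.

    fg_mask: boolean foreground mask from the binarized segmentation.
    Returns (fss, bss): boolean masks of foreground / background seed pixels.
    """
    structure = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
    fss = binary_erosion(fg_mask, structure=structure)      # shrink the foreground region
    bss = binary_erosion(~fg_mask, structure=structure)     # shrink the background region
    return fss, bss
```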
[025] Step 9: (not shown) Nothing prohibits additional user relabeling of the BSS and FSS either in respect of a given frame, or prior to segmentation of a future frame.
BRIEF DESCRIPTION OF THE DRAWINGS
[026] Figure 1 is a series of images showing the initialization of the segmentation method of the present invention on input frame 0 of the "walk" sequence in Y.-Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin, and R. Szeliski, Video matting of complex scenes, SIGGRAPH, 2002, pages 243-248 [Chuang 11] (images (a) through (f)), followed by the segmentation steps at input frame 1 (images (g) through (i)) and input frame 99 (images (j) through (l)).
[027] Figure 2 shows Venn diagrams representing how prior methods of binary SVM (diagram (a)) differ from the method of the present invention using dual C-1SVMs (diagram (b)).
[028] Figure 3 is a graphical depiction of the neighbourhood system used for 1SVM training on a 9x9 pixel local neighbourhood (Ωp) of the pixel shown.
[029] Figure 4 is a graphical depiction of the neighbourhood system used for 1SVM training on a 21x21 local neighbourhood (Ωp) of the pixel shown, using nine adjacent 7x7 pixel subgroups (Ω̃q) in a fully adjacent square formation.
[030] Figure 5 is a graphical depiction of the neighbourhood system used for 1SVM training on a 33x33 local neighbourhood (Ωp) of the pixel shown, using twenty-five non-adjacent 5x5 pixel subgroups (Ω̃q) which are in a regular square formation, but with 2-pixel wide strips separating the 5x5 pixel subgroups.
[031] Figures 6(a), (b) and (c) show foreground assignment (in black) by the foreground-1SVM for image frame 0 from the sequence used in Figure 1, after 1 iteration, 2 iterations, and convergence, respectively; while Figures 6(d), (e) and (f) show background assignment (in black) for the same image, and respective number of iterations; in each case using the grid initiation groupings and subgroupings of Figure 5.
[032] Figure 7 shows results on testbed video input sequences referred to as (from top to bottom) "talk", "jug", "hand", and "car".
[033] Figures 8(a) and 8(b) compare the segmentation accuracy of a preferred embodiment of the method of the present invention for two sequences, where a ground truth segmentation is available for every 5 or 10 frames.
[034] Figure 9 is a series of segmentation frames showing that one implementation of the C-1SVM segmentation method auto-corrects for the background bias shown in the final segmentation of the first input frame (images (a), (b) and (c)) over the 10 frames following the initial frame (see examples in the second row of images) without any user intervention.
[035] Figure 10 shows comparisons of preferred embodiments of present invention to existing prior art methods for the segmentation of the "jug" in the "jug"
sequence, where: (a) is the input frame; (b) is the result using the segmentation method in [Zhong 1]; (c) is the result using the segmentation method in L.
Cheng and M. Gong, Realtime background subtraction from dynamic scenes, ICCV, 2009 [Cheng 12]; (d) is the pure background training frame; (e) is the segmentation result of the present invention using (d) but without any user labelling; and (f) is the segmentation result of the present invention using user labelling.
[036] Figure 11 is a flow chart for a generalized video segmentation method according to the present application.
DETAILED DESCRIPTION OF THE INVENTION
[037] Certain implementations of the present invention will now be described in greater detail with reference to the accompanying drawings.
[038] Figure 1 shows how a preferred embodiment of the present invention implements foreground segmentation of the "walk" sequence from [Chuang 11], which is challenging due to the fuzzy object boundary and camera motion. The user is only required to label the first frame (a) using strokes (b). Local classifiers are trained at each pixel location and then used to relabel the center pixel (c). Repeating training and relabeling leads to convergence (d-e), even though ambiguous (grey) areas still exist. Final segmentation is obtained using graph cuts (f). When new frames (g & j) arrive, they are first labeled (h & k) using the classifiers trained by previous frames, before the same train-relabel procedure is used to produce the segmentation results (i) & (l). Note that the proposed algorithm is able to extract the details of the hair without resorting to matting techniques.
[039] Figure 2 is a graphical representation of how prior methods of binary SVM differ from the method of the present invention in using dual C-1SVMs. The boundaries 3 and 5 represent the results of the foreground C-1SVMs; the boundaries 4 and 6 represent the results of the background C-1SVMs; and the lines 7 and 8 represent binary SVM with above the line foreground and below the line background.
White circles and black dots represent the foreground and background training instances, respectively, while dots 1 and 2 each denote an example unlabelled pixel being labelled using the method. In scenario (a), binary SVM classifies the test example 1 as foreground, whereas the C-1SVMs label it as unknown, since neither of the 1SVMs accepts it as an "inlier". In the second case (b), binary SVM cannot confidently classify the test example since the margin is too small, whereas the C-1SVMs are able to correctly label it as background.
[040] The proposed method of foreground segmentation forgoes classifying the problem as a binary matter and instead creates two competing classifiers which operate independently. Where the competing classifiers disagree, the pixels are labelled unknown and are ultimately resolved through a final globalized costing function step. Improved performance is predicted for two reasons:
[041] First, foreground and background may not be well separable in the color feature space. For example, the black sweater and the dark background shown in Figure 1(a) share a similar appearance. As a result, it is not appropriate to deal with this scenario by training a global binary SVM and using it to classify the entire image. Furthermore, trying to train local binary SVMs at each pixel location is problematic as well, since in most cases only one of the two (fore/background) types of observations is locally available.
[042] Second, even in areas where both fore/background examples are available, modeling the two sets separately using the C-1SVMs gives two hyperplanes that enclose the training examples more tightly. As illustrated in Figure 2, this helps toward better detection and handling of ambiguous cases.
[043] In the proposed method, the training can be based either on batch learning or online learning. Training an SVM using a large set of examples is a classical batch learning problem, the solution of which can be found by minimizing a quadratic objective function. Those of skill in the art will appreciate that similar or even better generalization performance can be achieved using online learning at much lower computational cost, by showing all examples repetitively to an online learner, when compared to batch learning. A less noticed but distinct advantage of online learning is that it produces a partially trained model immediately, which is then gradually refined toward the final solution.
However, either option may be practised within the scope of the method disclosed in this application.
[044] In one example, the online learner of a preferred embodiment of the foreground segmentation method proceeds as follows. Let f_t(·) be the score function at time t, let k(·,·) be a kernel function, let α_t denote the non-negative weight of the example received at time t, and let clamp(·, A, B) be the identity function of its first argument bounded from both sides by A and B. When a new example x_t arrives, the score function becomes:

f_t(x) = Σ_{i=1}^{t−1} α_i k(x_i, x)    (1)

In this example, the update rule for the weights is:

α_t = clamp( (γ − (1 − τ) f_t(x_t)) / k(x_t, x_t), 0, (1 − τ)C ),
α_i ← (1 − τ) α_i,  for all i = 1, ..., t − 1    (2)

where γ := 1 is the margin, τ ∈ (0,1) the decay parameter, and C > 0 the cut-off value.
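For illustration only, the update of Equations (1) and (2) can be transcribed directly into code. This is a hypothetical sketch: the Gaussian RBF kernel and its bandwidth, and the default γ, τ, C values (taken from this description and claim 37), are assumptions rather than a definitive implementation.

```python
import numpy as np

def rbf(x, y, sigma=0.1):
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma * sigma)))

class Online1SVM:
    """Online 1-class SVM following Equations (1) and (2)."""

    def __init__(self, gamma=1.0, tau=0.05, C=0.5, kernel=rbf):
        self.gamma, self.tau, self.C, self.kernel = gamma, tau, C, kernel
        self.sv = []        # support vectors x_i
        self.alpha = []     # their weights alpha_i

    def score(self, x):
        """Equation (1): f_t(x) = sum_i alpha_i k(x_i, x)."""
        return sum(a * self.kernel(xi, x) for xi, a in zip(self.sv, self.alpha))

    def update(self, x_t):
        """Equation (2): add x_t with a clamped weight and decay the old weights."""
        f = self.score(x_t)
        a_t = float(np.clip((self.gamma - (1.0 - self.tau) * f) / self.kernel(x_t, x_t),
                            0.0, (1.0 - self.tau) * self.C))
        self.alpha = [(1.0 - self.tau) * a for a in self.alpha]   # decay existing weights
        if a_t > 0.0:
            self.sv.append(np.asarray(x_t, float))
            self.alpha.append(a_t)
```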
[045] Directly applying Eq. (2) adds multiple support vectors to the model; all would come from the same sample and would have different weights. Also, as shown in Eq. (2), once a support vector (x_i, α_i) is added to the applicable 1SVM, over time its weight α_i is only affected by the decay rate (1 − τ). Hence, to ensure the support vectors converge to their proper weights, the decay parameter should be carefully adjusted using e.g. cross-validation results.
[046] In a modified online learning example, and to avoid the complexity of monitoring/performing cross-validation of results, the C-1SVM segmentation method may not rely on decay at all; instead it may execute an explicit reweighting scheme: if a training example x_t arrives and turns out to be identical to an existing support vector (x_i, α_i) inside the model, this support vector is first taken out when computing the score function, and it is then included with its newly obtained weight α_t to substitute for the original weight α_i. To summarize:

f_t(x_t) = Σ_{i=1}^{t−1} α_i χ(x_i ≠ x_t) k(x_i, x_t),    (3)

α_t ← clamp( (γ − f_t(x_t)) / k(x_t, x_t), 0, (1 − τ)C ),    (4)

where χ(·) is an indicator function: χ(true) = 1 and χ(false) = 0.
[047] Intuitively, this modified online learning method resets the weight of a particular support vector (x_t, α_t) based on how well the separating hyperplane defined by the remaining support vectors is able to classify example x_t. This reweighting process can either increase or decrease α_t, and hence an implementation of the C-1SVM using modified online learning does not rely on decay as some prior art methods do. With fewer operations, this leads to a method with shorter training time (i.e. fewer computations).
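The reweighting variant of Equations (3) and (4) changes only the update step: if the incoming example already exists as a support vector, its weight is recomputed from the score of the remaining support vectors instead of being decayed. The sketch below is hypothetical; the kernel and default parameter values are the same assumptions as in the previous sketch.

```python
import numpy as np

def rbf(x, y, sigma=0.1):
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma * sigma)))

class Reweighting1SVM:
    """Modified online learning (Equations (3) and (4)); no decay of old weights."""

    def __init__(self, gamma=1.0, tau=0.05, C=0.5, kernel=rbf):
        self.gamma, self.tau, self.C, self.kernel = gamma, tau, C, kernel
        self.sv, self.alpha = [], []

    def score_excluding(self, x_t):
        """Equation (3): score of x_t, leaving out any support vector identical to it."""
        return sum(a * self.kernel(xi, x_t)
                   for xi, a in zip(self.sv, self.alpha)
                   if not np.array_equal(xi, x_t))

    def update(self, x_t):
        """Equation (4): (re)set the weight of x_t from the remaining support vectors."""
        x_t = np.asarray(x_t, float)
        f = self.score_excluding(x_t)
        a_t = float(np.clip((self.gamma - f) / self.kernel(x_t, x_t),
                            0.0, (1.0 - self.tau) * self.C))
        for i, xi in enumerate(self.sv):
            if np.array_equal(xi, x_t):           # reweight the existing support vector
                self.alpha[i] = a_t
                return
        if a_t > 0.0:                             # otherwise add it as a new one
            self.sv.append(x_t)
            self.alpha.append(a_t)
```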

Max-pooling of subgroups
[048] Training 1SVMs with large-scale example sets is known to be computationally expensive, which becomes a serious issue in a real-time processing scenario. In addition to online learning, in one example the present segmentation method proposes "max-pooling" of subgroups, as follows: the whole example set is divided into N non-intersecting groups ψ_i (0 ≤ i < N) and a 1SVM is trained on each group. Then the original 1SVM score function is approximated by the maximum of the 1SVM score functions from the subgroups. That is:

f(x) = max_{0 ≤ i < N} f^{ψ_i}(x),    (5)

where f^{ψ_i}(·) is the score function trained using examples in subgroup ψ_i.
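Equation (5) simply takes, at each test point, the best score among the subgroup models. A hypothetical one-liner, assuming each subgroup model exposes a score(x) method like the classes sketched above:

```python
def maxpool_score(subgroup_models, x):
    """Equation (5): f(x) = max over i of f^{psi_i}(x), for a non-empty list of subgroup 1SVMs."""
    return max(model.score(x) for model in subgroup_models)
```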
[049] Different options are proposed for dividing examples into subgroups, and thereby exploit the spatial coherence of images so that the 1SVM trained on each subgroup models local appearance density.
[050] In addition to the idea of using competing, separately initialized and trained 1SVM classifiers for foreground segmentation, another improvement exists in the train-relabel procedure between video frames (i.e. in time). Two competing 1SVMs, Fp for foreground and Bp for background, are trained locally for each pixel p using pixels with known labels within the local window/neighbourhood Ωp. Once trained, Fp and Bp are used to jointly label p as either foreground, background, or unknown. Since the knowledge learned from neighbouring pixels in the neighbourhood Ωp is used for labelling pixel p, the above procedure effectively propagates known foreground and background information to its neighbourhood. As a result, and as shown in Figures 1(a) through 1(f), the algorithm can segment the whole image based on only a few initial strokes.
[051] For inter-frame training, a similar train-relabel procedure to the one used for iteratively propagating foreground and background information within a single frame is applied across adjacent frames.

Claims (38)

1. A computer implemented method for segmenting a digital image into foreground and background comprising the steps of:
(a) Initializing design parameters for a background 1-class support vector machine (B-1SVM) and for a foreground 1-class support vector machine (F-1SVM) as computer implemented functions within a computer system;
(b) Inputting the digital image to the computer system;
(c) Inputting a background sample set of known background pixels in the image and a foreground sample set of known foreground pixels in the image, to the computer system to define a current label of the image;
(d) Until no further changes occur in the current label of the image, perform the following computer implemented steps of:
(i) Training a B-1SVM based on the design parameters at each pixel using pixels labelled as background within the current label of the image, and training a F-1SVM based on the design parameters at each pixel using pixels labelled as foreground within the current label of the image;
(ii) Classifying each pixel using the B-1SVM and the F-1SVM to obtain a competing classification for each pixel; and (iii) Relabeling the current label of the image to identify the pixels which the competing classification agrees to be background and to identify the pixels which the competing classification agrees to be foreground.
2. The computer implemented method of claim 1 further comprising the step of:
(e) Applying a global optimizing function to relabel as either foreground or background within the current label of the image, pixels in the image which have not yet been labelled as foreground or background by the B-1SVM and the F-1SVM.
3. The computer implemented method of claim 1 wherein the background sample set is obtained through unsupervised means.
4. The computer implemented method of claim 1 wherein the background sample set is obtained through supervised means.
5. The computer implemented method of claim 1 wherein the foreground sample set is obtained through unsupervised means.
6. The computer implemented method of claim 1 wherein the foreground sample set is obtained through supervised means.
7. The computer implemented method of claim 1 wherein the design parameters include a kernel function k(·,·) from the group of kernel functions consisting of: homogeneous polynomial basis function, unhomogeneous polynomial basis function, Gaussian radial basis function, and hyperbolic tangent basis function.
8. The computer implemented method of claim 7 wherein the kernel function k(·,·) is a Gaussian radial basis function.
9. The computer implemented method of claim 1 wherein the design parameters include a neighbourhood, and the training step and classifying step at each pixel occur over the entire neighbourhood about such pixel.
10. The computer implemented method of claim 1 wherein the design parameters include a neighbourhood further divided into subgroups, the training step at each pixel uses only pixels in the subgroup centred on such pixel and the classifying step uses only centre pixels at the subgroups in the neighbourhood.
11. The computer implemented method of claim 10 wherein the neighbourhood about each pixel is 3n by 3n pixels centred on such pixel (n an odd integer greater than 1), and the subgroups are 9 non-intersecting n by n squares within the neighbourhood.
12. The computer implemented method of claim 10 wherein the neighbourhood about each pixel is a square of k times n plus (k-1) times g pixels on each side, where n is an odd integer greater than 1 being the width of each subgroup, k is an odd integer greater than 1 with k2 being the number of subgroups, and g is the width in pixels of a gap between subgroups such that g times 2 plus 1 is not greater than n.
13. The computer implemented method of claim 10 wherein the neighbourhood about each pixel is 33 by 33 pixels centred on such pixel, the neighbourhood is further divided into twenty-five non-intersecting subgroups of 5 by 5 pixel squares with adjacent subgroups all separated by a 2 pixel wide gap.
14. The computer implemented method of claim 2 wherein the global optimizing function solves, for each pixel, a Markov random field based energy function having a data term determined by costs of assigning such pixel to foreground and background and a contrast term which adaptively penalizes segmentation boundaries based on local color differences.
15. The computer implemented method of claim 1 wherein the training step is performed by a learning method from the group of learning methods consisting of batch learning methods, online learning methods, and modified online learning methods.
16. The computer implemented method of claim 15 wherein the modified online learning method is performed according to Equation (3) and Equation (4).
17. The computer implemented method of claim 10 wherein the training step is performed according to the following algorithm, for a centre subgroup Ω̃p within the neighbourhood of each pixel p, for the current label of the image L_t:
for each pixel p do
    for each pixel q in Ω̃p do
        if L_t(q) equals foreground then
            Use L_t(q) to train F̃p based on Equations (3) & (4);
        else if L_t(q) equals background then
            Use L_t(q) to train B̃p based on Equations (3) & (4);
        end if
    end for
end for
18. The computer implemented method of claim 17 wherein the classifying step is performed according to the following algorithm, with design parameters T_F_low, T_F_high, T_B_low and T_B_high set, foreground score function at pixel p being f_F(p), background score function at pixel p being f_B(p), and subgroups Ω̃q being the subgroups in neighbourhood Ωp of pixel p which do not intersect pixel p:
for each pixel p do
    Initialize approximate scores f_F(p) and f_B(p) to 0;
    for each subgroup Ω̃q in Ωp do
        Update the approximate scores f_F(p) and f_B(p) by max-pooling over the subgroup score functions;
    end for
    Set foreground loss l_F(p) = max(0, γ − f_F(p));
    Set background loss l_B(p) = max(0, γ − f_B(p));
    if (l_F(p) < T_F_low) && (l_B(p) > T_B_high) then
        Set L_t(p) to foreground;
    else if (l_F(p) > T_F_high) && (l_B(p) < T_B_low) then
        Set L_t(p) to background;
    else
        Set L_t(p) to unknown;
    end if
end for.
19. The computer implemented method of claim 18 wherein the following design parameters are set: margin γ = 1, T_F_low = 0.1, T_F_high = 0.3, T_B_low = T_B_high = 0.4, cut-off value C = 0.5, and spatial decay τ_spatial = 0.05.
20. A computer implemented method for segmenting a video stream of digital images into foreground and background comprising the steps of:
(a) Initializing design parameters for a background 1-class support vector machine (B-1SVM) and for a foreground 1-class support vector machine (F-1SVM) as computer implemented functions within a computer system;
(b) Inputting the digital images of the video stream to the computer system;
(c) Inputting to the computer system a background sample set of known background pixels in a current image and a foreground sample set of known foreground pixels in such current image, to define a current label of the current image;
(d) Until no further changes occur in the current label of the current image, perform on pixels of the current image the train-relabel steps of:
(i) Training a B-1SVM based on the design parameters at each pixel within the current image using pixels labelled as background within the current label of the current image, and training a F-1SVM based on the design parameters at each pixel using pixels labelled as foreground within the current label of the current image;
(ii) Classifying each pixel using the B-1SVM and the F-1SVM to obtain a competing classification for each pixel; and (iii) Relabeling the current label of the current image to identify the pixels which the competing classification agrees to be background and to identify the pixels which the competing classification agrees to be foreground;
(e) While images remain to be processed in the video stream, set the next image in the video stream as the current image and return to step (d).
21. The computer implemented method of claim 20 further comprising the step after step (d) and before step (e) of:

(d.1) Applying a global optimizing function to relabel as either foreground or background within the current label of the current image, pixels in the image which have not yet been labelled as foreground or background by the B-1SVM and the F-1SVM.
22. The computer implemented method of claim 21 further comprising the step after step (d.1):
(d.2) relabeling the current label for the current image to the output of the global optimizing function with morphological erosion on a boundary where pixels identified as foreground are otherwise adjacent to pixels identified as background.
23. The computer implemented method of claim 20 wherein the background sample set is obtained through unsupervised means.
24. The computer implemented method of claim 20 wherein the background sample set is obtained through supervised means.
25. The computer implemented method of claim 20 wherein the foreground sample set is obtained through unsupervised means.
26. The computer implemented method of claim 20 wherein the foreground sample set is obtained through supervised means.
27. The computer implemented method of claim 20 wherein the design parameters include a kernel function k(·,·) from the group of kernel functions consisting of: homogeneous polynomial basis function, unhomogeneous polynomial basis function, Gaussian radial basis function, and hyperbolic tangent basis function.
28. The computer implemented method of claim 27 wherein the kernel function k(·,·) is a Gaussian radial basis function.
29. The computer implemented method of claim 20 wherein the design parameters include a neighbourhood, and the training step and classifying step at each pixel occur over the entire neighbourhood about such pixel.
30. The computer implemented method of claim 20 wherein the design parameters include a neighbourhood further divided into subgroups, the training step at each pixel uses only pixels in the subgroup centred on such pixel and the classifying step uses only centre pixels at the subgroups in the neighbourhood.
31. The computer implemented method of claim 30 wherein the neighbourhood about each pixel is 33 by 33 pixels centred on such pixel, the neighbourhood is further divided into twenty-five non-intersecting subgroups of 5 by 5 pixel squares with adjacent subgroups all separated by a 2 pixel wide gap.
32. The computer implemented method of claim 21 wherein the global optimizing function solves, for each pixel, a Markov random field based energy function having a data term determined by costs of assigning such pixel to foreground and background and a contrast term which adaptively penalizes segmentation boundaries based on local color differences.
33. The computer implemented method of claim 20 wherein the training step is performed by a learning method from the group of learning methods consisting of batch learning methods, online learning methods, and modified online learning methods.
34. The computer implemented method of claim 33 wherein the modified online learning method is performed according to Equation (3) and Equation (4).
35. The computer implemented method of claim 30 wherein the training step is performed according to the following algorithm, for a centre subgroup Ω̃p within the neighbourhood of each pixel p, for the current label of the image L_t:
for each pixel p do
    for each pixel q in Ω̃p do
        if L_t(q) equals foreground then
            Use L_t(q) to train F̃p based on Equations (3) & (4);
        else if L_t(q) equals background then
            Use L_t(q) to train B̃p based on Equations (3) & (4);
        end if
    end for
end for
36. The computer implemented method of claim 35 wherein the classifying step is performed according to the following algorithm, with design parameters T_F_low, T_F_high, T_B_low and T_B_high set, foreground score function at pixel p being f_F(p), background score function at pixel p being f_B(p), and subgroups Ω̃q being the subgroups in neighbourhood Ωp of pixel p which do not intersect pixel p:
for each pixel p do
    Initialize approximate scores f_F(p) and f_B(p) to 0;
    for each subgroup Ω̃q in Ωp do
        Update the approximate scores f_F(p) and f_B(p) by max-pooling over the subgroup score functions;
    end for
    Set foreground loss l_F(p) = max(0, γ − f_F(p));
    Set background loss l_B(p) = max(0, γ − f_B(p));
    if (l_F(p) < T_F_low) && (l_B(p) > T_B_high) then
        Set L_t(p) to foreground;
    else if (l_F(p) > T_F_high) && (l_B(p) < T_B_low) then
        Set L_t(p) to background;
    else
        Set L_t(p) to unknown;
    end if
end for.
37. The computer implemented method of claim 36 wherein the following design parameters are set: margin γ = 1, T_F_low = 0.1, T_F_high = 0.3, T_B_low = T_B_high = 0.4, cut-off value C = 0.5, and spatial decay τ_spatial = 0.05.
38. A method for real-time segmentation of a foreground object from a video stream comprising the steps of:
(a) Inputting the video stream to a computer system;

(b) Applying computer implemented instructions on the computer system to establish a background 1-class support vector machine (B-1SVM) and a foreground 1-class support vector machine (F-1SVM) to analyse pixels in frames of the video stream;
(c) Obtaining user selected criteria on a location of the foreground object within one or more of the frames;
(d) Applying the background C-1SVM and the foreground C-1SVM to the video image initialized by the user selected criteria on the location of the foreground object;
(e) Applying computer implemented instructions to implement the following initialization algorithm on desired subgroups of pixels:
for each pixel p do
    for each pixel q in Ω̃p do
        if L_t(q) equals foreground then
            Use L_t(q) to train F̃p based on Equations (3) & (4);
        else if L_t(q) equals background then
            Use L_t(q) to train B̃p based on Equations (3) & (4);
        end if
    end for
end for
(f) applying computer implemented instructions for foreground, background and boundary segmentation using the following algorithm:
Require: Threshold parameters T_F_low, T_F_high, T_B_low, T_B_high:
for each pixel p do
    Initialize approximate scores f_F(p) and f_B(p) to 0;
    for each subgroup Ω̃q in Ωp do
        Update the approximate scores f_F(p) and f_B(p) by max-pooling over the subgroup score functions;
    end for
    Set foreground loss l_F(p) = max(0, γ − f_F(p));
    Set background loss l_B(p) = max(0, γ − f_B(p));
    if (l_F(p) < T_F_low) && (l_B(p) > T_B_high) then
        Set L_t(p) to foreground;
    else if (l_F(p) > T_F_high) && (l_B(p) < T_B_low) then
        Set L_t(p) to background;
    else
        Set L_t(p) to unknown;
    end if
end for
(g) wherein the thresholds and equations are more particularly set out in the specification hereto.
CA2780710A 2012-06-11 2012-06-11 Video segmentation method Abandoned CA2780710A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA2780710A CA2780710A1 (en) 2012-06-11 2012-06-11 Video segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA2780710A CA2780710A1 (en) 2012-06-11 2012-06-11 Video segmentation method

Publications (1)

Publication Number Publication Date
CA2780710A1 true CA2780710A1 (en) 2013-12-11

Family

ID=49753581

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2780710A Abandoned CA2780710A1 (en) 2012-06-11 2012-06-11 Video segmentation method

Country Status (1)

Country Link
CA (1) CA2780710A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110036409A (en) * 2016-12-15 2019-07-19 通用电气公司 The system and method for carrying out image segmentation using combined depth learning model
CN110036409B (en) * 2016-12-15 2023-06-30 通用电气公司 System and method for image segmentation using joint deep learning model
CN113923430A (en) * 2020-04-15 2022-01-11 深圳市瑞立视多媒体科技有限公司 Real-time image matting method, device, equipment and storage medium based on high-definition video
CN112862843A (en) * 2021-01-07 2021-05-28 北京惠朗时代科技有限公司 Official seal image segmentation method and system based on neighborhood pixel multi-weight decision

Similar Documents

Publication Publication Date Title
US20130329987A1 (en) Video segmentation method
Kulchandani et al. Moving object detection: Review of recent research trends
US9158985B2 (en) Method and apparatus for processing image of scene of interest
Han et al. Density-based multifeature background subtraction with support vector machine
Zhang et al. New object detection, tracking, and recognition approaches for video surveillance over camera network
Gong et al. Integrated foreground segmentation and boundary matting for live videos
Elguebaly et al. Background subtraction using finite mixtures of asymmetric gaussian distributions and shadow detection
CN109033972A (en) A kind of object detection method, device, equipment and storage medium
Gong et al. Foreground segmentation of live videos using locally competing 1SVMs
Setitra et al. Background subtraction algorithms with post-processing: A review
Kumar et al. Multiple cameras using real time object tracking for surveillance and security system
EP2980754A1 (en) Method and apparatus for generating temporally consistent superpixels
EP2958077B1 (en) Method and apparatus for generating temporally consistent superpixels
Klare et al. Background subtraction in varying illuminations using an ensemble based on an enlarged feature set
Ghosh et al. Moving object detection using Markov random field and distributed differential evolution
Zhou et al. Cascaded multi-task learning of head segmentation and density regression for RGBD crowd counting
Arif et al. A Comprehensive Review of Vehicle Detection Techniques Under Varying Moving Cast Shadow Conditions Using Computer Vision and Deep Learning
CA2780710A1 (en) Video segmentation method
CN111460916A (en) Airport scene target segmentation method and system based on hidden Markov model
KR20120054381A (en) Method and apparatus for detecting objects in motion through background image analysis by objects
Ghahremannezhad et al. Real-time hysteresis foreground detection in video captured by moving cameras
Patil Techniques and methods for detection and tracking of moving object in a video
Zeng et al. Probability-based framework to fuse temporal consistency and semantic information for background segmentation
Malavika et al. Moving object detection and velocity estimation using MATLAB
El Baf et al. Fuzzy foreground detection for infrared videos

Legal Events

Date Code Title Description
FZDE Discontinued

Effective date: 20150611