WO2008008045A1 - Method and system for context-controlled background updating - Google Patents
Method and system for context-controlled background updating
- Publication number
- WO2008008045A1 (PCT/SG2007/000205)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- background
- pixels
- region
- image
- regions
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/174—Segmentation; Edge detection involving the use of two or more images
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/254—Analysis of motion involving subtraction of images
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/28—Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
Definitions
- the present invention relates broadly to a method and system for background updating, and to a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating.
- Adaptive background subtraction is typically the first fundamental step for video surveillance.
- a typical surveillance system consists of a stationary camera directed at the scene of interest.
- a pixel-level background model is then generated and maintained to keep track of the time-evolving background. Background maintenance is the crucial part that may affect the performance of background subtraction in time-varying situations.
- the methods of basic background subtraction employ a single reference image corresponding to the empty scene as the background model.
- a Kalman filter is usually used to follow the slow illumination changes. However, it has been realized that such a simple model is not suitable for surveillance in real-world situations.
- Adaptive background subtraction (ABS) techniques based on statistical models to characterize the background appearances at each pixel were developed for various complex backgrounds.
- Wren [9] employed a single Gaussian to model the color distribution for each pixel.
- the mixture of Gaussians (MoG) was introduced to model multiple background states, e.g., normal and shadow appearances, and complex variations, e.g., bushes waving in the wind.
- Many enhanced variants of MoG have been proposed in recent years.
- Some of the enhancements integrated gradients, depths, or local features into the Gaussians, and others employed non-parametric models, e.g. kernels, to replace the Gaussians.
- a method of background updating for adaptive background subtraction in a video signal comprising the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principal colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
- a first learning rate for the pixels that are occluded may be lower than a second learning rate for the pixels that are exposed.
- the method may further comprise the steps of determining whether said respective pixels that are exposed are detected as a background point or as a foreground point in a current background subtraction for the current image; and setting different learning rates for the adaptive background subtraction for exposed pixels that are detected as foreground points and for exposed pixels that are detected as background points respectively.
- a third learning rate for the exposed pixels that are detected as foreground points may be higher than the second learning rate for the exposed pixels that are detected as background points.
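As an illustration of the learning-rate tiers above, the sketch below (Python/NumPy) maps the occluded/exposed/foreground decisions to per-pixel learning rates. The concrete rate values and the mask names are illustrative assumptions; the patent only fixes the ordering of the rates.

```python
import numpy as np

# Illustrative rate values; only their ordering follows the text
# (occluded < exposed-background < exposed-foreground).
LOW, NORMAL, HIGH = 0.0, 0.01, 0.02

def select_learning_rates(occluded, foreground):
    """occluded, foreground: boolean masks of shape (H, W) for the current frame."""
    rates = np.full(occluded.shape, NORMAL, dtype=np.float32)
    rates[occluded] = LOW                     # first rate: occluded pixels learn slowly (or not at all)
    rates[~occluded & foreground] = HIGH      # third rate: exposed pixels detected as foreground
    # exposed pixels detected as background keep the second (normal) rate
    return rates
```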
- One contextual background representation type A may comprise a facility for the public such as a counter or a bench, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed may comprise the steps of evaluating, for each image region spatially corresponding to a type A background region, whether said each image region is occluded based on matching OHRs of the type A background region and of said each image region respectively and based on matching PCRs of the type A background region and of said each image region respectively; and determining all pixels of said each image region as either occluded or exposed depending on said evaluation.
- All pixels may be determined as exposed if a match likelihood in said evaluation is above a threshold value, and are determined as occluded otherwise.
- One contextual background representation type B may comprise a large homogeneous region such as a ground plane or a wall surface, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed may comprise the steps of evaluating, for each image region spatially corresponding to a type B background region, whether neighborhood regions around respective pixels in said each image region are occluded based on matching PCRs of the type B background region and of the respective neighborhood regions; and determining pixels of said each image region as either occluded or exposed depending on the respective evaluations.
- Each pixel may be determined as occluded if a majority of the neighborhood pixels in the neighborhood region of said each pixel are within said type B background region and few of those neighborhood pixels are evaluated as exposed based on a match likelihood being above a threshold value; the pixel is determined as exposed otherwise.
- the method may further comprise setting a zero learning rate for pixels belonging to foreground regions.
- the method may further comprise the step of performing adaptive background subtraction using said set rates for the respective pixels.
- the adaptive background subtraction may be based on a Mixture of Gaussians or Principal Feature Representation.
- the method may further comprise maintaining a model base for the contextual background representation types, the model base including models for different illumination conditions.
- the method may further comprise adjusting an appearance, a spatial characteristic, or both, of the models in the model base over a long duration compared with a frame duration in the video signal.
- a system for background updating for adaptive background subtraction in a video signal comprising means for defining one or more contextual background representation types; means for segmenting an image of a scene in the video signal into contextual background regions; means for classifying each contextual background region as belonging to one of the contextual background representation types; means for determining an orientation histogram representation (OHR), a principal colour representation (PCR), or both, of each background region; means for receiving a current image of the scene in the video signal; means for determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and means for setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
- a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principal colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
- Figure 1 shows a series of images illustrating adaptive background subtraction using the background updating method and system of the example embodiments.
- Figure 2 shows a flow chart illustrating a method of context-based background updating for adaptive background subtraction in the example embodiment.
- Figure 3 shows a series of images and histograms illustrating principal colour representation (PCR) in the example embodiment.
- Figure 4 shows a schematic drawing illustrating directed acyclic graphs (DAGs) for regions in consecutive frames in the example embodiment.
- Figure 5 shows a flow chart illustrating a method of multi-object tracking in a video signal in the example embodiment.
- Figure 6 shows a flow chart illustrating a method of stationary object tracking in a video signal in the example embodiment.
- Figure 7 shows a schematic drawing of an event detection system implementation using the example embodiment.
- Figure 8 shows a graph illustrating a finite state machine (FSM) representation for event detection in the system implementation of Figure 7.
- Figure 9 shows a schematic drawing of a computer system for implementing the example embodiment.
- the described embodiment provides a novel 2.5D method of multi-object tracking for real-time video surveillance.
- An appearance model, the principal color representation (PCR), is introduced; the PCR model characterizes the appearance of an object or a region with a few most significant colors.
- the likelihood of observing a tracked object in a foreground region is derived according to their PCRs.
- multi-object tracking is formulated as a Maximum A Posterior (MAP) problem over all the tracked objects. With the foreground regions provided by background subtraction, the problem of multi-object tracking is decomposed into two subproblems: assignment and location.
- each tracked object is assigned to a foreground region in the coming frame.
- its visual information will be excluded from the PCR of the region.
- multiple objects assigned to one region are located one-by-one according to their depth order.
- a two-phase mean-shift algorithm based on PCR is derived for locating objects.
- When an object is located, its visual information is excluded from the new position in the region. The operation of exclusion at the end of each iteration for assignment and location in the example embodiment can avoid multiple objects being trapped into the same region or position.
- the present specification also discloses apparatus for performing the operations of the methods.
- Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer.
- the algorithms and displays presented herein are not inherently related to any particular computer or other apparatus.
- Various general purpose machines may be used with programs in accordance with the teachings herein.
- the construction of more specialized apparatus to perform the required method steps may be appropriate.
- the structure of a conventional general purpose computer will appear from the description below.
- the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code.
- the computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.
- the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
- Such a computer program may be stored on any computer readable medium.
- the computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer.
- the computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system.
- the computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
- the invention may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the system can also be implemented as a combination of hardware and software modules.
- the distinctive background objects (regions) in the example embodiment are classified into two categories:
- Type-1 CBR: a facility for the public in the scene.
- Type-2 CBR: a large homogeneous region.
- Contextual descriptors are developed to characterize the distinctive appearances of CBRs and evaluate the likelihoods of observing them. Different contextual background regions may have different appearance features. Some manifest significant structural features, while others may have homogeneous color distributions.
- the example embodiment employs Orientation Histogram Representation (OHR) to describe the structural features of a region and Principal Color Representation (PCR) to describe the distribution of dominant colors.
- the OHR H_b is a simple and efficient variant of the robust local descriptor SIFT [1] for real-time processes. It is less sensitive to illumination changes and slight shifts of object position.
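A minimal sketch of such an orientation histogram is given below, assuming the OHR is a magnitude-weighted histogram of gradient orientations over the region; the bin count and normalization are illustrative choices rather than the patent's exact definition.

```python
import numpy as np

def orientation_histogram(gray_region, n_bins=8):
    """gray_region: 2D float/uint8 array covering the background region."""
    gy, gx = np.gradient(gray_region.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # orientations folded into [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    s = hist.sum()
    return hist / s if s > 0 else hist                # normalized OHR
```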
- the PCR for R'_b is defined as T_b = { s_b, { E_{b,k} = (c_{b,k}, p_{b,k}) }_{k=1}^{N} }, where s_b is the region size, c_{b,k} is the k-th principal color and p_{b,k} is its significance.
- δ(c_1, c_2) is a delta function. It equals 1 when the color distance d(c_1, c_2) is smaller than a small threshold, and 0 otherwise.
- the color distance used here is d(c_1, c_2) = 1 − 2⟨c_1, c_2⟩ / (‖c_1‖² + ‖c_2‖²)  (5a), where ⟨·,·⟩ denotes the dot product [2, 3].
- the principal color components E_{b,k} are sorted in descending order according to their significance values p_{b,k}.
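The sketch below implements the delta function and the color distance in the reconstructed form above; the threshold value is an illustrative assumption.

```python
import numpy as np

def color_distance(c1, c2):
    """Normalized color distance, assumed form d = 1 - 2<c1,c2>/(||c1||^2 + ||c2||^2)."""
    c1 = np.asarray(c1, dtype=np.float32)
    c2 = np.asarray(c2, dtype=np.float32)
    denom = np.dot(c1, c1) + np.dot(c2, c2)
    return 1.0 - 2.0 * np.dot(c1, c2) / denom if denom > 0 else 0.0

def delta(c1, c2, eps=0.02):
    # 1 when the two colors are close enough to count as the same principal color
    return 1 if color_distance(c1, c2) < eps else 0
```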
- a type-1 CBR in the example embodiment is associated with a facility which has a distinctive structure and colors in the image. Both OHR and PCR are used to characterize the type-1 CBR.
- let R'_{b1,i} be the i-th type-1 CBR in the scene. Its contextual descriptors are H_{b1,i} and T_{b1,i}.
- a type-1 CBR has just two states: occluded (occupied) or not.
- the likelihood of observing a type-1 CBR is evaluated on the whole region.
- the contextual descriptors of the region R_t(x), taken from the position corresponding to R'_{b1,i} in the current frame I_t(x), are H_t and T_t.
- the likelihood of R'_{b1,i} being exposed can be evaluated by matching R_t(x) to R'_{b1,i}.
- if R_t(x) and R'_{b1,i} are similar, P_L(H_t | R'_{b1,i}) is close to 1; otherwise, it is close to 0.
- the first term is the likelihood based on the partition evidence of the principal color c^k_{b1,i}. It is evaluated from the PCRs of R'_{b1,i} and R_t(x).
- the type-2 CBRs in the example embodiment are large homogeneous regions. Only the PCR descriptor is used for each of them. Usually only part of a type-2 CBR is occluded when a foreground object overlaps it. The likelihood of observing a type-2 CBR is evaluated locally.
- the appearance model of a type-1 CBR in the example embodiment consists of its OHR and PCR.
- the spatial model is M_s(R'_{b1,i}) = (s'_{b1,i}, x'_{b1,i}).
- a model base which contains up to K_b appearance models of R'_{b1,i} is used.
- the models in the base are learned incrementally.
- the active appearance model is the one from the model base which best fits the current appearance of the CBR.
- let D be a time duration of 3 to 5 minutes (not limiting) in the example embodiment, i.e. a long duration compared with the frame duration in the video signal.
- the times of observing the i-th type-1 CBR during the period are accumulated as
- the new appearance model M_c^a(R'_{b1,i}) is compared with the ones in the model base according to (11a). If one is sufficiently close to the new model (i.e. the similarity exceeds the threshold T_u by a margin), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
- a model base is employed to deal with the appearance variations of the type-2 CBRs from day to night.
- the models in the model base are learned incrementally through the time durations.
- the overlapping ratio between the exposed parts and the spatial model for R'_{b2} at time t is computed from their spatial overlap.
- a new appearance model M_c^a(R'_{b2}) is generated from the current duration. If the average similarity values are low in two consecutive durations, the active appearance model will be replaced. If there is a model in the base which is close enough to the new appearance model M_c^a(R'_{b2}) (i.e. the similarity exceeds the threshold T_u by a margin), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
- the prior probability P(R_t(x)) is the same for every pixel in an image. Then the log posterior probability of R_t(x) belonging to R'_b in the current frame I_t(x) is defined as
- the position of a type-1 CBR is already determined by its spatial model.
- this spatial term is 1 for the position, and 0 otherwise.
- a rate of occluded times over recent frames for each type-1 CBR is used. For R'_{b1,i}, the rate is computed as
- a rate of occluded times over recent frames at each pixel for each type-2 CBR is used.
- an occluded pixel of a type-2 CBR is confirmed on the local neighborhood R_t(x).
- let r_1 be the proportion of pixels belonging to R'_{b2} in the neighborhood region, and let r_2 be the proportion of exposed pixels of R'_{b2} in R_t(x) according to the posterior estimates.
- a control code C_t(x) is used, where the value of C_t(x) is 0, 1, 2, or 3; the values 1, 2 and 3 indicate that the low, normal, or high learning rate respectively is applied at the pixel, while 0 denotes the normal learning rate for non-context pixels (used for display).
- C_t(x) = 0 is set.
- four control code images are used.
- the first two are the previous and current control code images described above, i.e., C_{t-1}(x) and C_t(x), and the second two are the control codes actually applied for pixel-level background maintenance, i.e., C*_{t-1}(x) and C*_t(x).
- to evaluate the example embodiment, two existing methods of ABS to which it was applied were implemented. They are the methods based on Mixture of Gaussians (MoG) [4] and Principal Feature Representation (PFR) [2]. Hence, four methods, MoG, Context-Controlled MoG (CC MoG), PFR, and Context-Controlled PFR (CC PFR), were compared.
- the normal learning rate of the example embodiment as described above was set to the constant learning rate used for the existing methods of ABS. The high learning rate was set to double the normal learning rate and the low learning rate was set to zero.
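A minimal sketch of this context-controlled update is shown below as a running-average background model. The control-code-to-rate mapping follows the description above (1 = low, 2 = normal, 3 = high, 0 = normal for non-context pixels); the base rate value and the running-average form are assumptions, not the patented ABS methods (MoG, PFR) themselves.

```python
import numpy as np

ALPHA_NORMAL = 0.01                                    # assumed base learning rate
# control code -> learning rate: 0 non-context (normal), 1 low (zero), 2 normal, 3 high (2x normal)
RATE_LUT = np.array([ALPHA_NORMAL, 0.0, ALPHA_NORMAL, 2 * ALPHA_NORMAL], dtype=np.float32)

def update_background(bg, frame, control_code):
    """bg, frame: float arrays (H, W, 3); control_code: int array (H, W) with values 0..3."""
    alpha = RATE_LUT[control_code][..., None]           # per-pixel learning rate
    return (1.0 - alpha) * bg + alpha * frame.astype(np.float32)
```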
- the leftmost image 102 is a snapshot with manually cropped-out contextual background regions.
- the second column 108 shows a sample frame from the sequence 110 and the corresponding ground truth 112 of the foreground.
- the rest of the images in the upper row 114 are: the segmented results by MoG 116, CC MoG (Context-Controlled MoG) 118, and the corresponding control image 120.
- the three images in the lower row 122 are the segmented results of PFR 124 and CC PFR (Context-Controlled PFR) 126, and the corresponding control image 128.
- the black regions, e.g. 130, do not belong to any CBR
- the gray regions e.g. 132 are exposed parts of the CBRs with no significant appearance changes
- the white regions e.g. 134 are occluded parts of the CBRs.
- for pixels in exposed parts of the CBRs, the normal learning rate is applied; for pixels in regions of occluded parts of the CBRs, the low learning rate is used.
- the high learning rate would be used as described above.
- the scene in the image 102 is a meeting room with four marked type-2 CBRs for the table surface, the ground surface, wall surfaces, and the chair. In this sequence of 5250 frames, there were no overstaying objects or overcrowding. However, several people (e.g. 138) kept moving around, staying somewhere for a while, and performing various activities.
- the contextual features of the example embodiment capture the global information. Such global information may not always lead to a precise segmentation in position, especially along boundary regions of objects. However, if fed with correct samples continuously, the pixel-level statistical models can be tuned to characterize the background appearance accurately at each pixel. Then the pixel-level background models can preferably be used to achieve a precise segmentation of foreground objects.
- the example embodiment exploits contextual interpretation to control the pixel-level background maintenance for adaptive background subtraction.
- Experimental results show that the example embodiment can improve the performance of adaptive background subtraction for at least situations of high foreground complexities.
- FIG. 2 shows a flow chart 200 illustrating a method of background updating for adaptive background subtraction in a video signal according to the example embodiment.
- one or more contextual background representation types are defined.
- an image of a scene in the video signal is segmented into foreground and background regions.
- each background region is classified as belonging to one of the contextual background representation types.
- an orientation histogram representation (OHR), a principal colour representation (PCR), or both, are determined for each background region.
- a current image of the scene in the video signal is received.
- it is determined whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed.
- different learning rates are set for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
- object tracking may be applied to a sequence of segmented images generated by background subtraction.
- each segmented image may contain one or several isolated foreground regions.
- each region may consist of one target object (e.g., a walking person) or a group of target objects (when objects overlap from the camera view point).
- the example embodiment uses the principal color representation (PCR) for modeling and characterizing the appearance of target objects as well as the segmented regions.
- each image may contain one or several objects. These objects in the image may overlap on some occasions. Further, the poses, scales, and motion modes of objects can change significantly during the overlap. It has been recognized by the inventors that these issues make shape-based object tracking a rather challenging task. However, the inventors have recognized that it is much less likely that a target object changes colors in a sequence from a surveillance camera. Hence, using global color features of an individual object can provide a relatively stable and constant way of describing object appearance. This can also lead to a better discrimination of multiple target objects in the scene.
- an object of interest, e.g. a person, vehicle, or luggage, may render a few dominant colors which only span a small portion of the entire color space.
- s_n is the size of the region (or the total number of the pixels within the region)
- c_n^i = (r_n^i, g_n^i, b_n^i) is the RGB value of the i-th most significant color under the original color resolution (i.e., 256 levels for each channel)
- s_n^i is the significance of c_n^i for the region.
- the components E_n^i are sorted in descending order according to the significance values of the principal colors. Let the current frame of input color images be I_t(x); then the significance of the i-th principal color can be defined as s_n^i = Σ_{x∈R_t^n} w(x)·δ(I_t(x), c_n^i)
- w(x) is a weight function and δ(·,·) is a delta function.
- other weight functions can be used, e.g. a Gaussian kernel to suppress the noise around the object's boundary [5].
- a color distance is used which is not sensitive to noise and illumination changes
- the PCR T_t^n contains the first N significant colors and their statistics for the region R_t^n(x). Since a region of one or a few objects manifests only a few dominant colors, it is possible to find a small number N to approximate the color features of the region, i.e.,
- Fig. 3 shows two examples of PCRs where one image 300 contains two isolated individuals and another image 302 contains a group of 5 persons.
- the PCRs for the foreground regions are generated through scanning the respective regions, and are shown in the histograms 312, 314 respectively. Details of the algorithm for the foreground region R_t^n(x) (see white areas e.g. 304, 306 in the segmented images 308, 310 respectively) are summarized in Table 2.
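A simplified sketch of building such a PCR for a segmented foreground region follows. It quantizes colors into bins instead of using the delta/distance test described earlier, and the number of principal colors is an illustrative choice; it is not the exact algorithm of Table 2.

```python
import numpy as np
from collections import Counter

def build_pcr(frame, mask, n_colors=10, quant=32):
    """frame: uint8 (H, W, 3); mask: bool (H, W) foreground region.
    Returns (region_size, [(color, significance), ...]) sorted by significance."""
    pixels = frame[mask]
    s_n = len(pixels)
    if s_n == 0:
        return 0, []
    quantized = (pixels // quant) * quant + quant // 2    # crude stand-in for principal-color matching
    counts = Counter(map(tuple, quantized))
    principal = counts.most_common(n_colors)              # N most significant colors, descending
    return s_n, [(np.array(c, dtype=np.uint8), cnt) for c, cnt in principal]
```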
- the aim of object tracking is to locate a tracked object in the coming frame according to its previous appearance.
- the likelihood, or the conditional probability of observing the tracked object in a region of the current frame, has to be evaluated.
- the likelihoods are first defined based on the original and normalized PCRs of the tracked object and a region. This is then extended to the scale-invariant likelihood.
- each P(R_t^n | E_m^i) is the likelihood of the object O_{m,t-1} appearing in the region R_t^n based on the partition evidence E_m^i,
- P(E_m^i | O_{m,t-1}) is the conditional probability of the evidence E_m^i given the object O_{m,t-1}.
- the likelihood based on the normalized PCRs is more accurate than that based on the original PCRs.
- the likelihood on original PCRs is better.
- the scale-invariant likelihood of observing a given object O_{m,t-1} in the region R_t^n is defined as
- Eq. (11) can provide a suitable measurement for these two cases in the example embodiment.
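The sketch below illustrates the idea of a PCR-based likelihood as the fraction of the tracked object's principal-color mass that can be matched in the region, normalized by the object's own size. It reuses color_distance and the PCR layout from the earlier sketches and is an illustration, not the exact expression (11).

```python
def pcr_likelihood(obj_pcr, region_pcr, eps=0.02):
    """obj_pcr, region_pcr: (size, [(color, significance), ...]) as produced by build_pcr."""
    obj_size, obj_colors = obj_pcr
    _, region_colors = region_pcr
    if obj_size == 0:
        return 0.0
    matched = 0.0
    for c_o, s_o in obj_colors:
        for c_r, s_r in region_colors:
            if color_distance(c_o, c_r) < eps:    # treated as the same principal color
                matched += min(s_o, s_r)
                break
    return matched / obj_size                     # normalizing by object size keeps it scale-aware
```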
- Object tracking in video surveillance aims at maintaining a unique identification (id) for each target object and providing its time-dependent positions in the scene.
- Multi-object tracking can be formulated as a global Maximum A Posterior (MAP) problem for all the tracked objects.
- the global MAP problem can be approximately decomposed as two subproblems: assignment and location.
- using the PCR-based likelihood function, the example embodiment applies sequential solutions to these two subproblems, as detailed below.
- the objects in a group region, e.g. R_{t-1}.
- the inter-frame movements of target objects are usually small. This implies that there is always an overlap between the regions of the same object in the consecutive frames. Exploiting such a relation, the problem (13) can be further decomposed by using directed acyclic graphs (DAGs).
- the directed acyclic graphs (DAGs) for the regions detected in the consecutive frames I_{t-1}(x) and I_t(x) are constructed in the following way.
- let the regions from the previous and current frames be denoted as nodes and be laid out in two layers: the parent layer and the child layer.
- the parent layer consists of nodes representing the regions {R_{t-1}^j}_j in the previous frame I_{t-1}(x).
- the child layer consists of nodes denoting the regions {R_t^j}_j in the current frame I_t(x).
- a directed acyclic graph (DAG) is formed by a set of nodes in which every node connects to one or more nodes in the same group.
- a set of DAGs (graphs) can be generated.
- An example of graphs for two consecutive frames is illustrated in Figure 4.
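A minimal sketch of this graph construction is given below: parent and child regions are linked by bounding-box overlap (following the small inter-frame motion assumption stated later), and connected nodes are grouped into DAGs with a simple union-find. The box format is an assumption.

```python
def boxes_overlap(a, b):
    """Boxes as (x0, y0, x1, y1)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def build_dags(parent_boxes, child_boxes):
    """Returns (edges, groups): parent->child edges and connected components (the DAGs)."""
    n, m = len(parent_boxes), len(child_boxes)
    group = list(range(n + m))                      # union-find over parent+child nodes
    def find(i):
        while group[i] != i:
            group[i] = group[group[i]]
            i = group[i]
        return i
    edges = []
    for i, pb in enumerate(parent_boxes):
        for j, cb in enumerate(child_boxes):
            if boxes_overlap(pb, cb):
                edges.append((i, j))
                group[find(i)] = find(n + j)        # merge the two nodes into one graph
    dags = {}
    for i in range(n + m):
        dags.setdefault(find(i), []).append(i)
    return edges, list(dags.values())
```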
- the notations for the DAGs are defined as follows. For the i-th graph, the parent nodes are denoted as {n_0^p}_p, where each node n_0^p represents one of the regions from the previous frame.
- the i-th DAG can thus be denoted as
- if a node has only one associated object it represents a single object; otherwise it represents a group of objects.
- the object o_{t-1}^m is one of the objects {o_{t-1}^m}_m associated with a parent node.
- the objects in a child node n_t^q, which may be newly generated objects or objects tracked from its parent nodes, are denoted as {o_t^{q,k}}_k.
- the objects in the child nodes are reordered as {o_t^m}_m. They are the set of tracked target objects in the current frame.
- the example embodiment solves the problem in two sequential steps from coarse to fine.
- the problem is decomposed approximately as two sub- problems: assignment and location.
- the coarse assignment process assigns each object in a parent node to one of its child nodes while the fine location process determines the new states of the objects assigned to each child node.
- the assignment parameters can be considered as the coarse tracking parameters indicating in which child nodes (regions) the objects are observed, without concern for the exact new positions of the tracked objects in the child regions.
- the new states of the tracked objects assigned to each child node are determined.
- let {o_t^{q,k}}_k be the objects assigned to the child node n_t^q from its parent nodes. That is, O_t^q is a subset of O_{t-1} according to the assignment parameters.
- objects in each child node can be tracked independently of objects in the other child nodes.
- the posterior probability of the new states for the tracked objects in the graph G_i can be evaluated as
- Multi-object tracking thus becomes finding the solutions for Eqs. (18) and (20) in the example embodiment. Further sequential solutions for (18) and (20) based on PCR are used and described below. Assuming that {R_t^k}_k are the foreground regions and {B_t^k}_k are their bounding boxes detected at time t, their PCRs can be obtained accordingly. Let the set of directed acyclic graphs (DAGs) for the foreground regions between the consecutive frames be given. If there is only one object in a graph G_i, then the object will be tracked as an isolated object. Otherwise, multi-object tracking will be performed according to Eqs. (18) and (20).
- the posterior probability of the new state for each object is determined on both spatial position and depth relationship.
- a 2.5D state is used for each object.
- if the i-th graph consists of only one child node (i.e., G_i = {n_t^1}), a new object appears and is initialized in G_i with a new id number.
- the node n_t^1 represents the region R_t^1.
- its 2.5D state parameter vector is (b_t^1, ℓ_t^1).
- the graph represents the simple case of isolated object tracking.
- let the graph be G_i = {n_0^p, n_t^1}, with one parent node and one child node.
- let the object in the parent node be o_{t-1}^1, and let the child node n_t^1 represent the region R_t^1.
- the object o_{t-1}^1 is updated as o_t^1 in n_t^1 (i.e., o_{t-1}^1 and o_t^1 have the same id number).
- its PCR is updated from the current frame to follow the gradual variation of the object.
- the previous objects in the parent node are assumed to have disappeared in the current frame. Tracking is terminated for these objects.
- if the i-th graph G_i contains multiple parent nodes or child nodes, the operations of assignment and location will be performed.
- the index i for the graph G_i is omitted below.
- let n_0^p be a parent node in the graph G, let {o_{t-1}^m}_m be its associated objects, and let {n_t^q}_q be its child nodes. If the parent node has more than one child node, the assignment of the objects {o_{t-1}^m}_m is determined by Eq. (18). However, with varying numbers of objects and child nodes, Eq. (18) is a nontrivial problem of optimal configuration. To make the problem tractable, a sequential solution is proposed based on their PCRs and the depth relations among the objects.
- the close and non-occluded objects have richer visible information than the distant or occluded objects. This means that an occluded object has less effect on the tracking of the objects occluding it.
- the assignment can be solved sequentially from the most visible one to the least visible one. Let the objects {o_{t-1}^m}_m in the parent node n_0^p be ordered according to their visible sizes.
- the assignments of the objects are not independent. The assignment of one object is affected by the previous objects with higher ranks. This means the assignment of each object can be performed one-by-one sequentially from the most to the least visible ones. For each object, the posterior probability of assignment can be evaluated using Bayes' rule,
- (22) can be evaluated on PCR. Assume that the PCRs of the child-node region B_t^q and of the object o_{t-1}^m are given, respectively. Using Eqs. (21) and (22), the best assignment of the objects can be achieved one-by-one sequentially according to their depth order by (23).
- Locating the objects in a region is not an independent problem for each object, but the front ones with richer visible information are less affected by the occluded ones.
- objects in the node are located one by one from the most visible to the least visible ones based on their visible parts.
- the posterior probability of new states for all the objects in the node can be expressed as
- the sequential solution to the problem Eq. (20) and Eq. (26) contains two steps.
- in the first step, the visible parts of the objects in the node are estimated, and the objects are sorted according to their visible sizes.
- in the second step, an iterative process is applied to locate the objects one-by-one in the region with a mean-shift algorithm based on PCR. When an object is located, its visual evidence is excluded from its position in the region. The details are described in the following.
- a smoothing weight, set to 0.5 in this study, is used to smooth the visible-size estimates from consecutive frames. The objects are then sorted in descending order according to their estimated visible sizes and placed in a list.
- a weight image w_{n-1}(x) is used over the region R_t^q. If the pixel x is likely to belong to one of the previously located objects (o_t^1, ..., o_t^{n-1}), w_{n-1}(x) is low (≈ 0); otherwise, it is high (≈ 1).
- locating the object o_t^n in the region R_t^q according to Eq. (26) is equivalent to finding a position where the maximum value of the probability density for observing the object occurs.
- This density maximum can be found by employing a mean-shift procedure with a weight mode which can reveal the probability density of observing the object in the neighborhood [5], [6], [7].
- the weight of evidence from the principal color c_n^k is defined as the backprojection
- the weight implies that if the color is more significant in the region than in the object model, only a proportion of the pixels with color c_n^k in the bounding box B_t^n belong to the object o_t^n; otherwise all the pixels of color c_n^k belong to the object.
- the mean-shift procedure is terminated once the convergence condition is satisfied, or the maximum number of iterations is reached (6 in the example embodiment).
- the new location of the object o_t^n is the bounding box b_t^n centered at the converged position. Let T_t^n be the PCR of the part of the region within the bounding box b_t^n, where s_t^n is the size of the part and the significance is computed as in the PCR definition.
- the likelihood of the object o_t^n in the group can be estimated according to Eq. (11).
- the new state parameter of the object o_t^n is then obtained.
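The sketch below illustrates one PCR-driven mean-shift location step as described above: pixels in the current box are weighted by a backprojection of the object's principal colors, and the box is shifted to the weighted centroid for at most 6 iterations. The per-pixel color-index image and the weight dictionary are assumed inputs, and boundary clamping is omitted.

```python
import numpy as np

def mean_shift_locate(color_index, backproj_weight, box, max_iter=6, tol=1.0):
    """color_index: (H, W) int array of principal-color indices per pixel;
    backproj_weight: dict index -> weight in [0, 1]; box: (x0, y0, x1, y1)."""
    lookup = np.vectorize(lambda c: backproj_weight.get(int(c), 0.0), otypes=[np.float32])
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    for _ in range(max_iter):
        weights = lookup(color_index[y0:y1, x0:x1])    # backprojection of the object's colors
        total = weights.sum()
        if total == 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        cx, cy = (weights * xs).sum() / total, (weights * ys).sum() / total
        nx0, ny0 = int(round(cx - w / 2)), int(round(cy - h / 2))
        if abs(nx0 - x0) < tol and abs(ny0 - y0) < tol:
            break                                      # converged: shift below tolerance
        x0, y0, x1, y1 = nx0, ny0, nx0 + w, ny0 + h
    return (x0, y0, x1, y1)
```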
- a.1: if the parent node has no child node, the objects in it are deleted; a.2: if it has only one child node, all the objects in it are assigned to that child node; a.3: if it has multiple child nodes; a.3.1: sort the objects {o_{t-1}^m}_m in the parent node, and then assign them one-by-one from the first to the last as follows: a.3.1.1: assign o_t^m to the child node n_t^{q_m} according to (23); a.3.1.2: exclude the visual information of o_t^m from the PCR of n_t^{q_m}.
- the algorithm in the example embodiment includes two phases of processing for each DAG (Directed Acyclic Graph): assignment and location.
- in the assignment phase, each parent node in the DAG is processed.
- in the location phase, the assigned objects in each child node are tracked.
- small objects in a group with likelihood values less than 0.1 are set as disappeared.
- the records of disappeared objects are kept for 50 frames.
- when a new object is detected, it is compared with disappeared objects according to their PCRs, sizes and distances. If it matches a disappeared object, the tracking will be restored; otherwise a new object is created.
- segmenting individual persons in a group with domain knowledge will be preferred. For example, in the example embodiment knowledge about the sizes and aspect ratios of persons in the scene is used to adapt to segmentation errors.
- Figure 5 shows a flow chart 500 illustrating a method of multi-object tracking in a video signal in the example embodiment.
- first and second segmented images of two consecutive frames of the video signal respectively are received, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked.
- one or more directed acrylic graphs (DAGs) are generated for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node.
- DAGs directed acrylic graphs
- step 506 for each parent node having two or more child nodes, a) the corresponding objects of the foreground region contributing to said each parent node are sorted according to estimated depth in said first image; b) the corresponding object having the lowest depth is assigned to one of the child nodes of said each parent node; c) a visual content of the assigned corresponding object is removed from the visual data associated with said one child node; and steps b) to c) are iterated in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
- step 508 for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image.
- step 510 for each child node having two or more corresponding objects assigned thereto, d) the corresponding objects are sorted according to estimated depth in said each child node in said second image; e) a mean-shift calculation is applied to locate the corresponding object having the lowest depth in said each child node; f) the state and the visual content of the located corresponding object are updated based on the second image; g) the updated visual content of the located corresponding object is removed from the visual data associated with said each child node; and steps e) to g) are iterated in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
- a layer tracking algorithm is designed to track stationary objects through even frequent occlusions.
- the object is identified as a moving object and tracked by a moving object tracking algorithm.
- the stationary objects include not only static non-living objects but also include motionless living objects, e.g. a standing or sitting person. Since the living objects may move again, the switching between moving object tracking and stationary object tracking for the target object is preferably smooth with no change of identity in the example embodiment.
- a template image of the object is used to represent such a stationary object in the example embodiment.
- let {B_j}_{j=1}^{τ_b} be a sequence of bounding boxes of the i-th tracked object in the τ_b most current frames as tracked by a moving object tracking algorithm. If the object has stopped moving, the bounding boxes will overlap each other.
- if the spatial intersection of all the boxes is not empty, the object is detected as a stationary object in the example embodiment. In the example embodiment, but not limiting, τ_b is set as 10 frames, corresponding to about 1 second.
- a layer representation based on the object's template image is built. The layer representation of the detected stationary object is defined as
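A minimal sketch of the stationarity test just described: the object is declared stationary when the intersection of its last τ_b bounding boxes is non-empty (τ_b = 10 per the text). The box format is an assumption.

```python
def is_stationary(boxes, tau_b=10):
    """boxes: list of (x0, y0, x1, y1) for the most recent frames, newest last."""
    if len(boxes) < tau_b:
        return False
    recent = boxes[-tau_b:]
    x0 = max(b[0] for b in recent)
    y0 = max(b[1] for b in recent)
    x1 = min(b[2] for b in recent)
    y1 = min(b[3] for b in recent)
    return x0 < x1 and y0 < y1          # non-empty common intersection of all boxes
```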
- T^i is the Principal Colour Representation (PCR) of the object.
- the template image is based at least on the last frame of the sequence used in detecting the object as a stationary object.
- d_A^j is the difference measure between the template A^i and the frame I_j(s) for the corresponding region of A^i
- d_c^j is the difference measure between the consecutive frames I_{j-1}(s) and I_j(s) for the region of the template
- d_p^j is the visibility measure of the object from the corresponding region in the frame I_j(s)
- s_k is an estimated state of the stationary object at time k. Measures in the τ_d most current frames and states in the τ_s most current frames are recorded. The details of calculating these measures and estimating states from them for each layer object will be described below.
- the example embodiment can greatly enhance object tracking.
- let c_t(s) be the color of a foreground point in the region of the i-th template image. According to Bayes' rule, the probability of the point belonging to the background is
- the background probability p(c | b) can be obtained from the Principal Feature Representation (PFR) of the background.
- let s = (x, y) be a pixel of the image.
- p_v^s(b) is the learned probability of s belonging to the background (P_s(b)) based on the observation of the feature v
- S_v^s(i) records the statistics of the M_v most frequent feature vectors at s
- each S_v^s(i) contains three components
- the first N_v elements are used as principal features.
- Three types of features are used in the example embodiment. They are a spectral feature (color), a spatial feature (gradient), and a temporal feature (color co-occurrence), respectively. Among them, color and gradient features are stable for static background parts and color co-occurrence features are suitable for dynamic background parts.
- N_s(b) is the number of background points in a small window W_s centered at s in the previous frame
- M_s is the number of points within the window.
- the probabilities p(c | l) and p(c | f) can be calculated with Gaussian kernels. Let c'_x be the color of point x in the template A^i within the window W_s. Then p(c | l) can be calculated as
- p(c | l) = max_{x ∈ W_s} { k_c(c'_x − c) · k_s(x − s) }  (7b)
- let c'_x be the color of a point x in the window W_s and in the region of moving foreground objects from the last frame I_{t-1}(s).
- the probability p(c | f) can be calculated analogously.
- the priors can be calculated as
- N_s(l) and N_s(f) are the numbers of points belonging to the layer object and to moving objects within the window W_s in the previous frame.
- the pixel s would be assigned according to the greatest likelihood value.
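The following sketch illustrates this three-way per-pixel decision: each point in the template region is assigned to the background, the layer object, or moving foreground by the largest posterior, with priors taken from local neighborhood counts. The likelihood arguments stand in for the PFR-based and kernel-based estimates in the text.

```python
import numpy as np

def classify_pixel(lik_background, lik_layer, lik_moving, n_b, n_l, n_f):
    """lik_*: likelihoods of the observed color under each hypothesis;
    n_b, n_l, n_f: counts of background, layer-object and moving-object points
    in the window W_s of the previous frame."""
    total = float(n_b + n_l + n_f)
    if total == 0:
        return "moving"                              # no local evidence: keep as foreground
    priors = np.array([n_b, n_l, n_f], dtype=np.float64) / total
    posts = np.array([lik_background, lik_layer, lik_moving]) * priors
    return ("background", "layer", "moving")[int(np.argmax(posts))]
```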
- the mask for the moving objects is used as the input for moving object tracking.
- Stationary objects may also be involved in several changes and interactions with other objects through the sequence.
- a non-living object it may e.g. undergo illumination changes, be occluded and removed by other objects.
- the object may change pose or move body parts, or start moving again.
- the object's states are estimated and the template image updated correspondingly in the example embodiment.
- five states are used to describe the layer object: motionless, occluded, removed, inner-motion, and start-moving.
- the state is estimated according to various change measures from a short sequence of most recent frames. Let s be a point in the template A^i(s) of the i-th layer object. The difference between the template and a current frame at s can be evaluated as
- Th_d is the threshold according to image noise.
- S_A^i is the size of the template.
- the difference measure between consecutive frames for the layer object is defined as
- the difference measures are calculated on color vectors.
- the visibility (visibility measure d_p^j) of the object in the current frame based on PCR would still be high, since the PCR is a global representation not related to spatial information.
- the visibility of the layer object in the current frame would be low.
- let T^i be the PCR of the layer object that was stored when the object was detected as a stationary object
- let T_t^i be the PCR from the region overlapped by the template A^i in the current frame. Then the visibility measure of the layer object in the current frame can be evaluated as
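A sketch of the three measures used for state estimation is given below: template-vs-frame difference d_A, frame-vs-frame difference d_c, and PCR-based visibility d_p. The concrete expressions (per-pixel difference norm and noise threshold) are assumptions, and pcr_likelihood is the helper sketched earlier.

```python
import numpy as np

def layer_measures(template, prev_frame, frame, mask, template_pcr, frame_pcr,
                   noise_thresh=20.0):
    """template, prev_frame, frame: float (H, W, 3); mask: bool (H, W) template support."""
    diff_a = np.linalg.norm(frame - template, axis=2)[mask]
    d_a = float((diff_a > noise_thresh).mean())        # fraction of changed template pixels
    diff_c = np.linalg.norm(frame - prev_frame, axis=2)[mask]
    d_c = float((diff_c > noise_thresh).mean())        # inter-frame change within the template
    d_p = pcr_likelihood(template_pcr, frame_pcr)      # visibility from the PCR sketch above
    return d_a, d_c, d_p
```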
- let O_{t-1}^i be an object in I_{t-1}(s)
- let O_t^n be a region in I_t(s).
- the probability of observing O_{t-1}^i in O_t^n can be computed as
- the states of the tracked layer object are estimated by heuristic rules in the example embodiment:
- the parameters for the rules are determined according to a knowledge base of human perceived semantic meanings and an evaluation from real-world videos in the example embodiment.
- the difference measures d_A^j and d_c^j are low if they are less than 0.25, and moderate if they fall within an intermediate range.
- the visibility measure d_p^j is low if it is less than 0.6; otherwise, it is high.
- the measure of shape shift is calculated by checking the expanding foreground pixels along the boundary of the template A^i. If the number of expanded pixels is larger than 50% of the template size, the "shift" of the object is detected. It will be appreciated that for some videos from specific cameras, e.g. cameras with unstable signals, adjustment of the thresholds may be required in different embodiments and as based on the relevant knowledge base.
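The heuristic rules can be sketched as below, consistent with the stated thresholds (0.25 for the difference measures, 0.6 for visibility, 50% boundary expansion for shape shift). The exact rule set in the example embodiment, in particular how occluded and removed are separated over time, is richer than this illustration.

```python
def estimate_layer_state(d_a, d_c, d_p, shape_shift, low=0.25, vis=0.6):
    """Returns one of the five layer states; decision boundaries are illustrative."""
    if shape_shift:                        # expanding foreground along the template boundary
        return "start-moving"
    if d_a < low and d_c < low and d_p >= vis:
        return "motionless"                # little change, object still visible
    if d_p >= vis:
        return "inner-motion"              # visible but its appearance is changing
    if d_c < low:
        return "removed"                   # not visible and the region has settled again
    return "occluded"                      # not visible while other objects move over it
```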
- the layer model is maintained to adapt to real variations of the object without being affected by other objects in the scene.
- if the layer object is confirmed as being motionless, a smoothing operation is performed on the template image. If the object is recognized as being in the inner-motion state, the new image of the object in the current frame will replace the template. If the object is occluded, no updating will be performed. If the object is classified as start-moving, the object will be transformed into a moving object with the same ID and corresponding PCR, mask, and position for tracking by a moving object tracking algorithm. The layer representation of the object will be deleted. If the object is detected as removed, the object will be transformed into a disappeared object and its layer representation will be destroyed. With these operations, a target object moving around, staying somewhere for a while, and moving again can be tracked continuously and seamlessly by combining the example embodiment with the moving object tracking algorithm described for the example embodiment.
- Figure 6 shows a flow chart 600 illustrating a method of object tracking in a video signal according to the example embodiment.
- step 602 it is detected that a tracked moving object has become stationary over a sequence of frames.
- a template image of the stationary object is generated based on at least one of the frames in the sequence.
- step 606 a state of the stationary object is tracked based on a comparison of the template image with a current frame of the video signal.
- the structure diagram of an event detection system 700 implementation incorporating the described example embodiment is shown in Figure 7. It contains four fundamental modules: a foreground segmentation module 701, a moving object tracking module 702, a stationary object tracking module 704, and an event detection module 706.
- the foreground segmentation module 701 performs the background subtraction and learning and includes the method and system for background updating of the example embodiment described above, applied to e.g. the adaptive background subtraction method proposed in [8].
- the background model used in the example implementations employs Principal Feature Representation (PFR) at each pixel to characterize background appearance.
- the moving objects are tracked with the deterministic 2.5D multi-object tracking algorithm of the described example embodiment in the moving object tracking module 702.
- moving objects are represented by the models of principal color representation which exploits a few most significant colors and their statistics to characterize the appearance of each tracked object.
- a layer representation, or a template for the object is established and will be tracked by the stationary object tracking module 704 using the method and system of the described example embodiment.
- the states of templates for the objects are estimated with fuzzy reasoning.
- the template for one object may shift between five states: motionless, interior motion, occluded, starting moving, and removed.
- semantic models based on Finite State Machines are designed to detect suspected scenarios.
- An event is an abstract symbolic concept of what has happened in the scene. It is the semantic-level description of the spatio-temporal concatenation of movements and actions of interesting objects in the scene.
- Event detection in video understanding is a high-level procedure which identifies specific events by interpreting the sequences of observed perceptual features from intermediate-level processing. It is a step that bridges the numerical level and the symbolic level.
- the fundamental part of event detection is event modeling. For an event, the model is determined by the task and the different instantiations. There are generally two issues for event modeling. One is to select an appropriate representation model, or formal language, and the other is to derive the descriptors for the interesting events with the model.
- unusual events are described by the spatio-temporal evolution of object's states, movements, and actions.
- each event can be defined as a sequential succession of a few well-defined states.
- An event could be started at one or more initial states, and then one state can transit to the next state when new conditions are met as the scene evolves in time.
- State transition may also happen from an intermediate state back to a previous state if some conditions no longer hold for the state.
- the semantic representation can be modelled based on Finite State Machines (FSM).
- FSM Finite State Machines
- the FSM has at least two advantages: (1) it is explicit and natural for semantic description; (2) FSM can readily and flexibly incorporate a variety of context information from intermediate-level processing.
- each specific event can be represented by a directed graph FSM 800 = {S_e, E_e}, where S_e is the set of nodes representing the states and E_e is the set of edges representing the transitions.
- the FSM 800 could have a self-loop transition for each state. Although the FSM 800 could remain at the same state, some or all properties of the object may have changed. At the least, a time counter is incremented for each frame. The more complicated an event, the bigger N is, i.e. the number of intermediate states in the FSM 800, and the greater the chance of delivering an unreliable detection result. Therefore, an important task in event modeling is to trim any unnecessary states by careful analysis and to identify the simplest event model.
- the input of an FSM is the numerical perceptual features generated by moving and stationary object tracking modules (compare 702 and 704 in Figure 7).
- the visual cues of each tracked object can include shape, position, motion, and relations with others.
- the visual cues in the example implementation are:
- Status: indicates whether the tracked object is moving around or stationary
- InGroup: indicates whether the object is an isolated one or merged with others
- Occlusion: a measure within [0,1] indicating the degree of occlusion when overlapping with others
- Motion: a measure within [0,1] indicating the degree of interior motion of a stationary object.
- An advantage of the tracking modules is the capability to resume tracking of some objects that are lost for a few frames.
- the two events, UNATTENDED OBJECT and THEFT, are directly concerned with object disappearance in the example implementation.
- a first-in-first-out (FIFO) stack is built to contain the track records of N frames.
- O_Tracked are the track records of the N-th previous frame, and the triggered event is delayed by N frames.
- N = 30 in the example implementation.
- Loitering as defined in the example implementation involves one object. It is defined as a person wandering in the observed scene for a duration t > T_Loitering.
- the FSM is initialized for each new object.
- the FSM has one intermediate state: "Stay" which indicates that the tracked person is staying in the scene, whether moving around or stationary. There are two conditions for the transition from state "INIT" to state "Stay”:
- the object is classified as human
- the object moves in the scene (moving around or staying somewhere with frequent interior motion).
- this event also involves one object, a person. It is defined as an object becoming completely static for a duration t > T_Static.
- the FSM is initialized for each new object. When the tracked object is recognised as a person, the FSM transits to state "M", which indicates a person who is moving around or has significant interior motion.
- the second intermediate state of the FSM is "S”, which indicates a person becoming and staying static, or complete motionless. There are two conditions for the transition from state "M" to state "S”:
- a time counter t is continuously incremented as new frames are coming in.
- when t > T_Static, the FSM transits from state "S" to state "UP", indicating that an unconscious person is detected. Examples of an unconscious person include a sleeping or fainted person.
- similar conditions can be used to detect e.g. a vehicle overstaying in a zone designated for short stops, in which case the object of interest is changed to a vehicle instead of a person.
- This event as defined in the example implementation involves two objects.
- the FSM is initialized for each new object.
- when the new small object is identified as being separated from another large moving object and it stays static, a deposited object is detected and ownership is established between the two objects.
- the FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. If the owner leaves the scene covered by the camera, the FSM transits from state "Station" to state "UO" and an 'Unattended Object' is declared.
- This event as defined in the example implementation involves three objects.
- the FSM is initialized for each new object. Similar to the event of unattended object, when the new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects.
- the FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. However, when the object disappears because another object has taken it while the owner still stays in the scene, the FSM transits from state "Station" to state "Theft" and a 'Theft' event is declared; meanwhile, the second person is identified as the potential thief.
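The event FSMs above can be sketched as small state machines driven, once per frame, by the tracker's visual cues. Below is a minimal Python sketch of the unconscious-person FSM (states INIT, M, S, UP); the cue names, the threshold values and the helper structure are illustrative assumptions, not the patent's actual implementation.

```python
# Minimal sketch of an event FSM (unconscious-person event: INIT -> M -> S -> UP).
# Cue names (is_person, status, motion) and the constants below are assumptions.

T_STATIC = 300          # frames a person must stay motionless before "UP" is declared (assumed)
MOTION_EPS = 0.05       # interior-motion level below which the object counts as static (assumed)

class UnconsciousPersonFSM:
    def __init__(self):
        self.state = "INIT"
        self.t = 0                         # time counter, incremented every frame

    def update(self, cues):
        """Advance one frame using the tracker's visual cues for this object."""
        if self.state == "INIT":
            if cues["is_person"]:
                self.state = "M"           # person moving around or with interior motion
        elif self.state == "M":
            if cues["status"] == "stationary" and cues["motion"] < MOTION_EPS:
                self.state, self.t = "S", 0   # person became completely static
        elif self.state == "S":
            if cues["status"] == "stationary" and cues["motion"] < MOTION_EPS:
                self.t += 1                # self-loop: same state, counter keeps running
                if self.t > T_STATIC:
                    self.state = "UP"      # unconscious person detected
            else:
                self.state, self.t = "M", 0   # transition back to a previous state
        return self.state
```

The other events (loitering, unattended object, theft) follow the same pattern with different states and transition conditions.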
- the method and system of the example embodiment can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiment.
- the computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.
- the computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
- LAN Local Area Network
- WAN Wide Area Network
- the computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922.
- the computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.
- I/O Input/Output
- the components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.
- the application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilising a corresponding data storage medium drive of a data storage device 930.
- the application program is read and controlled in its execution by the processor 918.
- Intermediate storage of program data may be accomplished using RAM 920.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
Abstract
A method and system for background updating for adaptive background subtraction in a video signal. The method comprises the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
Description
Method And System For Context-Controlled Background Updating
FIELD OF INVENTION
The present invention relates broadly to a method and system for background updating, and to a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating.
BACKGROUND
Adaptive background subtraction is typically the first fundamental step in video surveillance. A typical surveillance system consists of a stationary camera directed at the scene of interest. A pixel-level background model is then generated and maintained to keep track of the time-evolving background. Background maintenance is the crucial part that may affect the performance of background subtraction in time-varying situations. Basic background subtraction methods employ a single reference image corresponding to the empty scene as the background model. A Kalman filter is usually used to follow slow illumination changes. However, it has been realized that such a simple model is not suitable for surveillance in real-world situations.
Adaptive background subtraction (ABS) techniques based on statistical models that characterize the background appearance at each pixel were developed for various complex backgrounds. Wren [9] employed a single Gaussian to model the color distribution at each pixel. In [4], a mixture of Gaussians (MoG) is proposed to model a background of multiple states, e.g., normal and shadow appearance, and complex variations, e.g., bushes in the wind. Many enhanced variants of MoG have been proposed in recent years. Some of the enhancements integrate gradients, depths, or local features into the Gaussians, and others employ non-parametric models, e.g. kernels, to replace the Gaussians.
In [2], a model of principal feature representation (PFR) to characterize each background pixel was proposed. Using PFR, multiple features of the background, such as color, gradient, and color co-occurrence, can be learned automatically and integrated in the classification of background and foreground. Employing various statistical models and multiple features for background modelling, the ABS methods become more and more robust with respect to a variety of complex backgrounds. Most of the existing methods of adaptive background subtraction
employ a constant learning rate for background updating. Some existing methods update the background model in a constant period of time.
With a constant learning rate or a constant periodic update, existing methods gradually forget the old background and absorb the new background appearance into the background model. The foremost assumption behind this is that the most frequently observed features at a pixel should come from the background. This assumption is valid for situations of simple foreground activities even though the background is highly complex, e.g., a scene of various dynamic properties. However, when some background pixels are frequently occluded by foreground objects, e.g., by a person staying motionless or by frequent heavy crowds, this assumption is violated.
Some proposed approaches tried to control the learning rate according to the results of segmentation or tracking. However, this control is based on positive feedback since it depends on the results of background subtraction. It may not be able to correct the errors caused by background subtraction itself.
A need therefore exists to provide a method and system for background updating that seek to address at least one of the above disadvantages.
SUMMARY
In accordance with a first aspect of the present invention there is provided a method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
A first learning rate for the pixels that are occluded may be lower than a second learning rate for the pixels that are exposed.
The method may further comprise the steps of determining whether said respective pixels that are exposed are detected as a background point or as a foreground point in a current background subtraction for the current image; and setting different learning rates for the adaptive background subtraction for exposed pixels that are detected as foreground points and for exposed pixels that are detected as background points respectively.
A third learning rate for the exposed pixels that are detected as foreground points may be higher than the second learning rate for the exposed pixels that are detected as background points.
One contextual background representation type A may comprise a facility for the public such as a counter or a bench, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed may comprise the steps of evaluating, for each image region spatially corresponding to a type A background region, whether said each image region is occluded based on matching OHRs of the type A background region and of said each image region respectively and based on matching PCRs of the type A background region and of said each image region respectively; and determining all pixels of said each image region as either occluded or exposed depending on said evaluation.
All pixels may be determined as exposed if a match likelihood in said evaluation is above a threshold value, and are determined as occluded otherwise.
One contextual background representation type B may comprise a large homogeneous region such as a ground plane or a wall surface, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed may comprise the steps of evaluating, for each image region spatially corresponding to a type B background region, whether neighborhood regions around respective pixels in said each image region are occluded based on matching PCRs of the type B background region and of the respective neighborhood regions; and determining pixels of said each image region as either occluded or exposed depending on the respective evaluations.
Each pixel may be determined as occluded if a majority of neighborhood pixels in the neighborhood region of said each pixel are within said type B background region and less of the neighborhood pixels themselves are evaluated as exposed based on a match likelihood being above a threshold value, and is determined as exposed otherwise.
The method may further comprise setting a zero learning rate for pixels belonging to foreground regions.
The method may further comprise the step of performing adaptive background subtraction using said set rates for the respective pixels.
The adaptive background subtraction may be based on a Mixture of Gaussian or Principle Feature Representation.
The method may further comprise maintaining a model base for the contextual background representation types, the model base including models for different illumination conditions.
The method may further comprise adjusting an appearance, a spatial characteristic, or both, of the models in the model base over a long duration compared with a frame duration in the video signal.
In accordance with a second aspect of the present invention there is provided a system for background updating for adaptive background subtraction in a video signal, the system comprising means for defining one or more contextual background representation types; means for segmenting an image of a scene in the video signal into contextual background regions; means for classifying each contextual background region as belonging to one of the contextual background representation types; means for determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; means for receiving a current image of the scene in the video signal; means for determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and means for setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
In accordance with a third aspect of the present invention there is provided a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
Figure 1 shows a series of images illustrating adaptive background subtraction using the background updating method and system of the example embodiments.
Figure 2 shows a flow chart illustrating a method of context-based background updating for adaptive background subtraction in the example embodiment.
Figure 3 shows a series of images and histograms illustrating principle colour representation (PCR) in the example embodiment.
Figure 4 shows a schematic drawing illustrating directed acyclic graphs (DAGs) for regions in consecutive frames in the example embodiment.
Figure 5 shows a flow chart illustrating a method of multi-object tracking in a video signal in the example embodiment.
Figure 6 shows a flow chart illustrating a method of stationary object tracking in a video signal in the example embodiment.
Figure 7 shows a schematic drawing of an event detection system implementation using the example embodiment.
Figure 8 shows a graph illustrating a finite state machine (FSM) representation for event detection in the system implementation of Figure 7.
Figure 9 shows a schematic drawing of a computer system for implementing the example embodiment.
DETAILED DESCRIPTION
The described embodiment provides a novel 2½D method of multi-object tracking for real-time video surveillance. An appearance model, principal color representation (PCR), is applied to multi-object tracking. The PCR model characterizes the appearance of an object or a region with a few most significant colors. The likelihood of observing a tracked object in a foreground region is derived according to their PCRs. Based on the Bayesian estimation theory, multi-object tracking is formulated as a Maximum A Posterior (MAP) problem over all the tracked objects. With the foreground regions provided by background subtraction, the problem of multi-object tracking is decomposed into two subproblems: assignment and location.
By exploiting that the close and unoccluded objects have richer visual information than the distant or occluded ones, sequential solutions to the subproblems which process the objects in a group from the most visible to the least visible ones are derived according to the likelihoods estimated based on PCR. In the assignment step, each tracked object is assigned to a foreground region in the coming frame. When an object is assigned, its visual information will be excluded from the PCR of the region.
In the location step, multiple objects assigned to one region are located one-by-one according to their depth order. A two-phase mean-shift algorithm based on PCR is derived for locating objects. When an object is located, its visual information is excluded from the new position in the region. The operation of exclusion at the end of each iteration for assignment and location in the example embodiment can avoid multiple objects being trapped into the same region or position.
Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring
physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "calculating", "determining", "excluding", "generating", "assigning", "locating", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.
In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile
telephone system. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
The invention may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the system can also be implemented as a combination of hardware and software modules.
The distinctive background objects (regions) in the example embodiment are classified into two categories:
Type-1 CBR: a facility for the public in the scene;
Type-2 CBR: a large homogeneous region.
Contextual descriptors are developed to characterize the distinctive appearances of CBRs and evaluate the likelihoods of observing them. Different contextual background regions may have different appearance features. Some manifest significant structural features, while others may have homogeneous color distributions. The example embodiment employs Orientation Histogram Representation (OHR) to describe the structural features of a region and Principal Color Representation (PCR) to describe the distribution of dominant colors. Let R_b^i be the i-th CBR in the empty scene I(x), and G(x) and O(x) be the gradient and orientation images of I(x), respectively. If the orientation values are quantized into 12 bins each covering 30°, the orientation histogram for R_b^i is defined as

H_b^i = {h_k^i}_{k=1}^{12},  h_k^i = Σ_{x ∈ R_b^i} μ_T(G(x)) · δ_k(O(x))    (1a)

where μ_T(·) is a binary function on the threshold T and δ_k(·) is a delta function defined as

δ_k(θ) = 1 if θ falls into the k-th orientation bin, and 0 otherwise.    (2a)
The OHR H_b^i is a simple and efficient variant of the robust local descriptor SIFT [1] suitable for real-time processing. It is less sensitive to illumination changes and slight shifts of object position.
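As a concrete illustration, an orientation histogram of this kind can be computed roughly as follows. This is a minimal NumPy sketch under the assumptions stated in the comments (Sobel gradients, a fixed gradient threshold), not the patent's exact formulation.

```python
import numpy as np
import cv2  # used only for the Sobel gradients; any gradient operator would do

def orientation_histogram(gray, region_mask, grad_thresh=30.0, n_bins=12):
    """12-bin orientation histogram (OHR) of a background region.

    gray        : 2-D grayscale image of the empty scene
    region_mask : boolean mask of the CBR R_b^i
    grad_thresh : threshold T on the gradient magnitude (assumed value)
    """
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = np.hypot(gx, gy)                      # gradient image G(x)
    ori = np.degrees(np.arctan2(gy, gx)) % 360  # orientation image O(x)

    # keep only region pixels whose gradient exceeds the threshold (mu_T)
    sel = region_mask & (mag > grad_thresh)
    bins = (ori[sel] // (360 // n_bins)).astype(int)   # delta_k: 30-degree bins
    return np.bincount(bins, minlength=n_bins)
```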
By scanning the region R_b^i, a table of the PCR for the region can be obtained. The PCR for R_b^i is defined as

T_b^i = { p_i, { E_i^k = (c_i^k, p_i^k) }_{k=1}^{N_i} }    (3a)

where p_i is the size of R_b^i, c_i^k is the k-th most significant color of R_b^i and p_i^k is its significance value. The significance value is computed by

p_i^k = Σ_{x ∈ R_b^i} δ(I(x), c_i^k)    (4a)

δ(c1, c2) is a delta function. It equals 1 when the color distance d(c1, c2) is smaller than a small threshold ε; otherwise, it is 0. The color distance used here is

d(c1, c2) = 1 − 2⟨c1, c2⟩ / (‖c1‖² + ‖c2‖²)    (5a)

where ⟨·,·⟩ denotes the dot product [2, 3]. The principal color components E_i^k are sorted in descending order according to their significance values p_i^k. The first N_i components which satisfy Σ_{k=1}^{N_i} p_i^k ≥ 0.95·p_i are used as the PCR of the region R_b^i, which means the principal colors in the PCR cover more than 95% of the colors in R_b^i. PCR is thus efficient for describing large regions of distinctive colors.
A type-1 CBR in the example embodiment is associated with a facility which has a distinctive structure and colors in the image. Both OHR and PCR are used to characterize the type-1 CBR. Let R_b1^i be the i-th type-1 CBR in the scene. Its contextual descriptors are H_b1^i and T_b1^i. A type-1 CBR has just two states: occluded (occupied) or not. The likelihood of observing a type-1 CBR is evaluated on the whole region. Suppose the contextual descriptors of the region R_t(x) at the corresponding position of R_b1^i in the current frame I_t(x) are H_t and T_t. The likelihood of R_b1^i being exposed can be evaluated by matching R_t(x) to R_b1^i. Based on OHR, the matching of R_t(x) and R_b1^i is defined as a normalized histogram matching score P_L(H_t|H_b1^i) (6a). If R_b1^i and R_t(x) are similar, P_L(H_t|H_b1^i) is close to 1; otherwise, it is close to 0.

Based on PCR, the matching is expanded over the principal colors of R_b1^i as

P(T_t|T_b1^i) = Σ_k P(T_t|E_b1^{k,i}) P(E_b1^{k,i}|T_b1^i)    (7a)

The second term in the sum is the weight of the principal color c_b1^{k,i} in the PCR of R_b1^i, i.e., P(E_b1^{k,i}|T_b1^i) = p_b1^{k,i}/p_b1^i. The first term is the likelihood based on the partition evidence of principal color c_b1^{k,i}. It is evaluated from the PCRs of R_b1^i and R_t(x) as

P(T_t|E_b1^{k,i}) = (1/p_b1^{k,i}) · min( p_b1^{k,i}, Σ_j δ(c_b1^{k,i}, c_t^j) p_t^j )    (8a)

Then there is

P(T_t|T_b1^i) = (1/p_b1^i) Σ_k min( p_b1^{k,i}, Σ_j δ(c_b1^{k,i}, c_t^j) p_t^j )    (9a)

P(T_b1^i|T_t) can be obtained in a similar way. Now the matching of R_b1^i and R_t(x) based on PCR is defined as

P_L(T_t|T_b1^i) = min{ P(T_t|T_b1^i), P(T_b1^i|T_t) }    (10a)

Assuming that the colors and the gradients are independent and different weights are used, the log likelihood of observing R_b1^i at time t is

L_b1^{i,t} = ω_s log P_L(H_t|H_b1^i) + (1 − ω_s) log P_L(T_t|T_b1^i)    (11a)

where ω_s = 0.6 is chosen empirically.
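For illustration, the type-1 CBR matching of (8a)–(11a) could be sketched as follows. The PCRs are represented here as simple lists of (color, count) pairs, the OHR similarity is taken as histogram intersection (an assumption, since (6a) is not reproduced above), and ε is borrowed from the tracking section; this is only an approximate sketch, not the patent's implementation.

```python
import numpy as np

OMEGA_S = 0.6   # weight between the structure (OHR) and color (PCR) terms, per (11a)
EPS = 0.005     # color-distance threshold (assumed, as in the tracking section)

def color_dist(c1, c2):
    """Normalized color distance of (5a)."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    return 1.0 - 2.0 * np.dot(c1, c2) / (np.dot(c1, c1) + np.dot(c2, c2) + 1e-12)

def ohr_match(h_obs, h_model):
    """OHR similarity; histogram intersection is an assumption, the text only requires
    a score close to 1 for similar histograms and close to 0 otherwise."""
    h_obs, h_model = np.asarray(h_obs, float), np.asarray(h_model, float)
    return np.minimum(h_obs, h_model).sum() / max(h_model.sum(), 1.0)

def pcr_match(pcr_model, pcr_obs):
    """One-directional PCR match of (9a): fraction of the model's principal color mass
    found in the observed region. PCRs are lists of (color, count) pairs."""
    total = sum(cnt for _, cnt in pcr_model)
    score = sum(min(cnt_m, sum(cnt_o for c_o, cnt_o in pcr_obs
                               if color_dist(c_m, c_o) < EPS))
                for c_m, cnt_m in pcr_model)
    return score / max(total, 1)

def type1_log_likelihood(h_obs, h_model, pcr_obs, pcr_model, tiny=1e-6):
    """Log likelihood (11a) of observing a type-1 CBR in the current frame."""
    p_ohr = ohr_match(h_obs, h_model)
    p_pcr = min(pcr_match(pcr_model, pcr_obs), pcr_match(pcr_obs, pcr_model))  # (10a)
    return OMEGA_S * np.log(p_ohr + tiny) + (1.0 - OMEGA_S) * np.log(p_pcr + tiny)
```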
The type-2 CBRs in the example embodiment are large homogeneous regions. Only the PCR descriptor is used for each of them. Usually only part of a type-2 CBR is occluded when a foreground object overlaps it. The likelihood of observing a type-2 CBR is therefore evaluated locally, at each pixel: the probability P(I_t(x)|T_b2^i) of observing the pixel color I_t(x) given the PCR T_b2^i of the i-th type-2 CBR takes the significance weight of the principal color c_b2^{k,i} that matches I_t(x) (i.e., d(I_t(x), c_b2^{k,i}) < ε), and 0 if no principal color matches.

The log likelihood that the pixel x in the current frame belongs to R_b2^i is

L_b2^{i,t}(x) = log P(I_t(x)|T_b2^i)    (14a)
The appearance model of a type-1 CBR in the example embodiment consists of its OHR and PCR. For the i-th type-1 CBR R_b1^i, the appearance model is defined as M_a(R_b1^i) = (H_b1^i, T_b1^i). The spatial model of R_b1^i is defined as its bounding box and center point, i.e., M_s(R_b1^i) = (B_b1^i, x_b1^i).
To adapt to lighting changes from day to night, besides the active appearance model M_a(R_b1^i), a model base which contains up to K_b appearance models of R_b1^i is used. The models in the base are learned incrementally. The active appearance model is the one from the model base which best fits the current appearance of the CBR. The model base of the i-th type-1 CBR R_b1^i is MB(R_b1^i) = {M_a^k(R_b1^i)}, k ≤ K_b.
Natural lighting changes slowly and smoothly. Let D be a time duration of 3 to 5 minutes (not limiting) in the example embodiment, i.e. a long duration compared with the frame duration in the video signal. The times of observing the i-th type-1 CBR during the period are accumulated as

z_b1^{i,p} = Σ_{t ∈ D} μ_{T_L1}(L_b1^{i,t})    (15a)

and the average of the likelihood values is

L̄_b1^{i,p} = (1/z_b1^{i,p}) Σ_{t ∈ D} μ_{T_L1}(L_b1^{i,t}) · L_b1^{i,t}    (16a)

where L_b1^{i,t} > T_L1 means R_b1^i is visible at time t. If sufficient samples of R_b1^i have been observed during the previous (last) duration (e.g., z_b1^{i,p}/D > 25%) and the average likelihood value is approaching the threshold T_L1 (e.g., L̄_b1^{i,p} < 0.8·T_L1), a new appearance of R_b1^i may be observed. In the coming duration, a new appearance model M_a^c(R_b1^i) = (H_b1^{i,tc}, T_b1^{i,tc}) is obtained from a frame in which R_t(x) looks most like R_b1^i, i.e., tc = arg min_{t ∈ D} |L_b1^{i,t} − L̄_b1^{i,p}|. If the average likelihood values are low in two consecutive durations, the active appearance model is replaced. First, the new appearance model M_a^c(R_b1^i) is compared with the ones in the model base according to (11a). If one is sufficiently close to the new model (i.e. the similarity is larger than T_L1 + ε), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
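The duration-based model-base maintenance described above might be sketched roughly as follows. The data structure, thresholds and the single-duration simplification (the patent waits for two consecutive low-average durations) are assumptions made only for illustration.

```python
from collections import deque

K_B = 5              # maximum number of appearance models kept per CBR (assumed)
T_L1 = -2.0          # visibility threshold on the log likelihood (assumed value)
MIN_OBS_RATIO = 0.25 # minimum fraction of exposed frames per duration

class AppearanceModelBase:
    """Keeps an active appearance model plus a small base of older ones."""

    def __init__(self, initial_model):
        self.active = initial_model
        self.base = deque([initial_model], maxlen=K_B)   # oldest model dropped when full

    def end_of_duration(self, likelihoods, candidate_model, similarity):
        """Called once per duration D with the per-frame likelihoods L_b1^{i,t},
        a candidate model built from that duration, and a similarity function."""
        visible = [l for l in likelihoods if l > T_L1]
        if len(visible) / max(len(likelihoods), 1) < MIN_OBS_RATIO:
            return                                   # not enough exposed samples
        if sum(visible) / len(visible) >= 0.8 * T_L1:
            return                                   # appearance still fits well
        # appearance drifted: reuse a close model from the base, else adopt the new one
        close = [m for m in self.base if similarity(m, candidate_model) > 0.9]  # assumed
        self.active = close[0] if close else candidate_model
        if not close:
            self.base.append(candidate_model)
```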
Let T_b2^i be the PCR descriptor of the i-th type-2 CBR R_b2^i; the appearance model of R_b2^i is then defined as M_a(R_b2^i) = (T_b2^i). The spatial model of R_b2^i describes the range of the homogeneous region in the image. A binary mask I_b2^i(x) is used for it, i.e., M_s(R_b2^i) = (I_b2^i(x)). The spatial model may have to be adjusted in an initialization duration when sufficient samples have been observed according to the likelihood values.
Again, a model base is employed to deal with the appearance variations of the type-2 CBRs from day to night. The model base of the i-th type-2 CBR R_b2^i is MB(R_b2^i) = {M_a^k(R_b2^i)}, k ≤ K_b. The models in the model base are learned incrementally through the time durations. First, at each time step t, the binary image of observed parts for R_b2^i is generated as V_b2^{i,t}(x) = μ_{T_L2}(L_b2^{i,t}(x)). The overlapping ratio between the exposed parts and the spatial model for R_b2^i at time t is

r_b2^{i,t} = |V_b2^{i,t} ∩ I_b2^i| / |V_b2^{i,t} ∪ I_b2^i|    (17a)

where '∩' means intersection and '∪' means union. The larger the ratio is, the more parts of R_b2^i are exposed and the fewer pixels of other objects would be involved. At the end of each duration, the times of observing a large part of R_b2^i during the period is

z_b2^{i,p} = Σ_{t ∈ D} μ_{T_H}(r_b2^{i,t})    (18a)

and the average similarity value between the observed parts and its active model can be computed as

S̄_b2^{i,p} = (1/z_b2^{i,p}) Σ_{t ∈ D} μ_{T_H}(r_b2^{i,t}) · P_L(T_b2^{i,t} | T_b2^i)    (19a)

where P_L(T_b2^{i,t} | T_b2^i) is calculated according to (10a) with normalized PCRs and T_H = 75% is used. Like the operation for type-1 CBRs, if sufficient samples have been observed during the last duration (i.e., z_b2^{i,p}/D > 25%) and the average similarity value is approaching the threshold T_L2 (e.g., S̄_b2^{i,p} < 0.8·T_L2), a new appearance model M_a^c(R_b2^i) is generated from the current duration. If the average similarity values are low in two consecutive durations, the active appearance model will be replaced. If there is a model in the base which is close enough to the new appearance model M_a^c(R_b2^i) (i.e. the similarity is larger than T_L2 + ε), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
Let {R_b^i}_{i=1}^{N_b} be the CBRs of a scene. Given a coming frame I_t(x) and a local region R_t(x) centered at x in I_t(x), the posterior probability of R_t(x) belonging to a CBR R_b^i is

P(R_b^i | R_t(x)) = P(R_t(x) | R_b^i) P(R_b^i) / P(R_t(x))    (20a)

The prior probability P(R_t(x)) is the same for every pixel in an image. Then the log posterior probability of R_t(x) belonging to R_b^i in the current frame I_t(x) is defined as

Q_b^{i,t}(x) = log P(R_t(x) | R_b^i) + log P(R_b^i | x)    (21a)

The position of a type-1 CBR is already determined by its spatial model. The prior probability P(R_b1^i | x) is 1 for the position and 0 otherwise. Then, the log posterior probability is equivalent to the log likelihood at the position, i.e., Q_b1^{i,t} = L_b1^{i,t} for R_b1^i. A rate of occluded times over recent frames for each type-1 CBR is used. For R_b1^i, the rate is computed as

r_b1^{i,t} = β·r_b1^{i,t−1} + (1 − β)·(1 − μ_{T_L1}(Q_b1^{i,t}))    (22a)

where β is a smoothing factor and β = 0.5 is chosen. A high rate value (close to 1) indicates that R_b1^i has been occluded in recent frames.
From the spatial model I_b2^i(x) of the i-th type-2 CBR R_b2^i, the prior probability P(R_b2^i | x) of a pixel x belonging to the region R_b2^i can be defined (23a). Combining (21a), (14a), and (23a), the log posterior probability that x is an exposed point of R_b2^i is

Q_b2^{i,t}(x) = L_b2^{i,t}(x) + log P(R_b2^i | x)    (24a)

A rate of occluded times over recent frames at each pixel for each type-2 CBR is used. First, to be robust to noise and boundary effects, an occluded pixel of a type-2 CBR is confirmed on the local neighborhood R_t(x). Let r_1 be the proportion of pixels belonging to R_b2^i in the neighborhood region, i.e.,

r_1 = (1/|R_t(x)|) Σ_{y ∈ R_t(x)} I_b2^i(y)    (25a)

and r_2 be the proportion of exposed pixels of R_b2^i in R_t(x) according to the posterior estimates, i.e.,

r_2 = (1/|R_t(x)|) Σ_{y ∈ R_t(x)} μ_{T_Q}(Q_b2^{i,t}(y))    (26a)

where the threshold T_Q is chosen as slightly lower than T_L2. Then, an occluded pixel of R_b2^i is confirmed if the majority of the pixels within its neighborhood belong to R_b2^i and fewer of them are observed (exposed) in the current frame. Now the rate is computed as

r_b2^{i,t}(x) = β·r_b2^{i,t−1}(x) + (1 − β)·[ μ_{T_r}(r_1) · μ_{T_r}(1 − r_2) ]    (27a)

where T_r = 75% is chosen in the example embodiment.
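As a rough illustration, the neighborhood proportions and the smoothed per-pixel occlusion rate of (25a)–(27a) could be computed with box filters as sketched below; the window size and the posterior threshold are assumptions, and the posterior image Q_b2^{i,t}(x) is taken as given.

```python
import numpy as np
from scipy.ndimage import uniform_filter

BETA, T_R, T_Q = 0.5, 0.75, -1.0   # smoothing factor, majority threshold, posterior threshold (T_Q assumed)

def update_occlusion_rate(rate_prev, region_mask, log_posterior, win=9):
    """Per-pixel occluded-times rate for a type-2 CBR, one frame update.

    rate_prev     : previous rate image r_b2^{i,t-1}(x)
    region_mask   : binary spatial mask I_b2^i(x) of the CBR
    log_posterior : Q_b2^{i,t}(x), log posterior of being an exposed CBR point
    win           : side of the square neighborhood R_t(x) (assumed)
    """
    r1 = uniform_filter(region_mask.astype(float), size=win)        # proportion in the CBR
    exposed = (log_posterior > T_Q).astype(float)                   # thresholded posterior
    r2 = uniform_filter(exposed, size=win)                          # proportion exposed
    occluded_now = ((r1 > T_R) & ((1.0 - r2) > T_R)).astype(float)  # majority in CBR, few exposed
    return BETA * rate_prev + (1.0 - BETA) * occluded_now
```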
According to the result of the contextual interpretation, three learning rates can be applied at each pixel for different situations in the example embodiment:
Normal learning rate to exposed background pixels with small variations;
Low learning rate to occluded background pixels;
High learning rate to exposed background pixels with significant changes.
An image of control codes C_t(x) is used, where the value of C_t(x) is 0, 1, 2, or 3: 0 for the normal learning rate at non-context pixels (used for display), and 1, 2, and 3 where the low, normal, and high learning rates are applied respectively at the pixel. First, for the pixels not associated with any contextual background region, C_t(x) = 0 is set. The rest of C_t(x) is determined according to the results of the contextual interpretation. For a pixel x within the i-th type-1 CBR R_b1^i, if r_b1^{i,t} ≥ 0.7, which means the CBR is being blocked by a foreground object, C_t(x) = 1 is set. Otherwise, if I_t(x) is detected as a background point by background subtraction, C_t(x) = 2 is set since the CBR is exposed and no significant appearance change is found; but if I_t(x) is detected as a foreground point by background subtraction, the high rate should be applied since the pixel is an exposed CBR point with a significant appearance change, i.e., C_t(x) = 3. For a pixel of the i-th type-2 CBR R_b2^i, if r_b2^{i,t}(x) ≥ 0.7, which means the patch of the CBR is being occluded by a foreground object, C_t(x) = 1 is set. Otherwise, if I_t(x) is detected as a background point by background subtraction, C_t(x) = 2 is set for an exposed part of the type-2 CBR with no significant appearance change; but if I_t(x) is detected as a foreground point by background subtraction, C_t(x) = 3 is set for an exposed neighborhood of the type-2 CBR with a significant appearance change.
To smooth the control code temporally at each pixel, four control code images are used. The first two are the previous and current control code images described above, i.e., C_{t−1}(x) and C_t(x), and the other two are the control codes actually applied for pixel-level background maintenance, i.e., C*_{t−1}(x) and C*_t(x). The control code applied to the current frame at pixel x is determined by the votes from the three other control codes C_{t−1}(x), C_t(x), and C*_{t−1}(x). If at least two of the three codes are the same, the control code with the most votes is selected. If the three codes all differ from one another, the normal learning rate is used, i.e., C*_t(x) = 2.
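A per-pixel sketch of this control-code logic, including the three-way vote and the rate mapping used in the test below (high = double the normal rate, low = zero), might look as follows in Python; the function arguments are illustrative assumptions.

```python
LOW, NORMAL, HIGH = 1, 2, 3   # control codes (0 = non-context pixel, normal rate)

def control_code(in_cbr, occlusion_rate, is_foreground):
    """Control code for one pixel from the contextual interpretation."""
    if not in_cbr:
        return 0
    if occlusion_rate >= 0.7:                   # CBR patch blocked by a foreground object
        return LOW
    return HIGH if is_foreground else NORMAL    # exposed, with / without appearance change

def applied_code(c_prev, c_curr, c_applied_prev):
    """Temporal vote among the previous, current, and previously applied codes."""
    votes = [c_prev, c_curr, c_applied_prev]
    for v in set(votes):
        if votes.count(v) >= 2:
            return v
    return NORMAL                               # all three differ: fall back to the normal rate

def learning_rate(code, normal_rate):
    """Map a control code to a learning rate (high = 2x normal, low = 0, as in the test)."""
    return {0: normal_rate, LOW: 0.0, NORMAL: normal_rate, HIGH: 2.0 * normal_rate}[code]
```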
To evaluate the effect of context-controlled background maintenance on adaptive background subtraction, two existing methods of ABS, to which the example embodiment was applied, were implemented. They are the methods based on Mixture of Gaussians (MoG) [4] and Principal Feature Representation (PFR) [2]. Hence, four methods, MoG, Context-Controlled MoG (CC
MoG), PFR, and Context-Controlled PFR (CC PFR) were compared. In the test, the normal learning rate of the example embodiment as described above was set to the constant learning rate used for the existing methods of ABS. The high learning rate was set to double the normal learning rate and the low learning rate was set to zero. In Figure 1, the leftmost image 102 is a snapshot with manually cropped out contextual background regions e.g. 104, which are type-2 CBRs in this example. In the snapshot image 102, the type-2 CBRs are surrounded by polygon boundaries e.g. 106 of different colors. The second column 108 shows a sample frame from the sequence 110 and the corresponding ground truth 112 of the foreground. The rest of the images in the upper row 114 are: the segmented results by MoG 116, CC MoG (Context-Controlled MoG) 118, and the corresponding control image 120. The three images in the lower row 122 are the segmented results of PFR 124 and CC PFR (Context-Controlled PFR) 126, and the corresponding control image 128. In the control images 120, 128, the black regions e.g. 130 do not belong to any CBR, the gray regions e.g. 132 are exposed parts of the CBRs with no significant appearance changes, and the white regions e.g. 134 are occluded parts of the CBRs.
According to the example embodiment, for pixels in the regions of exposed parts of the CBRs with no significant appearance changes, the normal learning rate is applied; for pixels in regions of occluded parts of the CBRs, the low learning rate is used. For pixels in regions of exposed parts of CBRs with significant changes (not applicable in the scene shown in Figure 1), the high learning rate would be used as described above. The scene in the image 102 is a meeting room with four marked type-2 CBRs for the table surface, the ground surface, wall surfaces, and the chair. In this sequence of 5250 frames, there were no overstaying objects or overcrowding. However, several people (e.g. 138) kept moving around, staying somewhere for a while, and performing various activities. Therefore, the center parts of the scene were frequently occluded by persons. Using a constant learning rate in the unmodified ABS methods, some appearance features of the persons were learned into the background models, and then the background subtraction failed to extract the complete figures of the persons in the incoming frames (see images 116, 124). One example frame, Frame #102810, is displayed in Fig. 1.
By using context-controlled background maintenance of the example embodiment applied to the ABS methods, the persons were segmented satisfactorily (see images 118, 126). A quantitative evaluation on 12 frames sampled from the sequence every 200 frames started from Frame #101410 (empty frames were skipped) is listed in Table 1, where the metric value is defined as the ratio between the intersection and union of the ground truth and the segmented regions. According to [2], the performance is good if the metric value is larger than 0.5 and nearly perfect if the metric value is larger than 0.8. From Table 1, it can be seen that, by using the
context-controlled background maintenance of the example embodiment applied to the existing ABS methods, the performance of adaptive background subtraction on situations of complex foreground activities can be improved significantly.
Table 1
The contextual features of the example embodiment capture global information. Such global information may not always lead to a precise segmentation in position, especially along boundary regions of objects. However, if fed with correct samples continuously, the pixel-level statistical models can be tuned to characterize the background appearance accurately at each pixel. Then the pixel-level background models can preferably be used to achieve a precise segmentation of foreground objects.
The example embodiment exploits contextual interpretation to control the pixel-level background maintenance for adaptive background subtraction. Experimental results show that the example embodiment can improve the performance of adaptive background subtraction for at least situations of high foreground complexities.
Figure 2 shows a flow chart 200 illustrating a method of background updating for adaptive background subtraction in a video signal according to the example embodiment. At step 202, one or more contextual background representation types are defined. At step 204, an image of a scene in the video signal is segmented into foreground and background regions. At step 206, each background region is classified as belonging to one of the contextual background representation types. At step 208, an orientation histogram representation (OHR), a principle colour representation (PCR), or both, are determined of each background region. At step 210, a current image of the scene in the video signal is received. At step 212, it is determined whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed. At step 214, different learning rates are set for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
While the described example embodiment started from manually cropped out contextual background regions in a snapshot, image segmentation and background object recognition for automatic initialization of contextual models may be performed in different embodiments.
In the example embodiment, principal color representation (PCR) is applied for efficient appearance-based multi-object tracking. In a video surveillance system, object tracking may be applied to a sequence of segmented images generated by background subtraction. In such a case, each segmented image may contain one or several isolated foreground regions. Further, each region may consist of one target object (e.g., a walking person) or a group of target objects (when objects overlap from the camera viewpoint). The example embodiment uses the principal color representation (PCR) for modeling and characterizing the appearance of target objects as well as the segmented regions. For an image sequence captured from a natural public site, each image may contain one or several objects. These objects in the image may overlap on some occasions. Further, the poses, scales, and motion modes of objects can change significantly during the overlap. It has been recognized by the inventors that these issues make shape-based object tracking a rather challenging task. However, the inventors have recognized that it is much less likely that a target object changes colors in a sequence from a surveillance camera. Hence, using global color features of an individual object can provide a relatively stable and constant way for object appearance description. This can also lead to a better discrimination of multiple target objects in the scene.
In video surveillance, an object of interest (e.g., a person, vehicle, luggage, etc.) may render a few dominant colors which only span a small portion of the entire color space. Let the n-th foreground region detected from the frame at time t be R_t^n(x), where x = (x, y)^T denotes the position of a pixel in the region. Then the corresponding principal color representation (PCR) can be defined as

T_t^n = { s_n, { E_n^i = (c_n^i, s_n^i) }_{i=1}^{N} }    (1)

where s_n is the size of the region (or the total number of pixels within the region), c_n^i = (r_n^i, g_n^i, b_n^i)^T is the RGB value of the i-th most significant color under the original color resolution (i.e., 256 levels for each channel), and s_n^i is the significance of c_n^i for the region. The components E_n^i are sorted in descending order according to the significance values of the principal colors. Let the current frame of input color images be I_t(x); then the significance of the i-th principal color can be defined as

s_n^i = Σ_{x ∈ R_t^n} ω(x) δ(I_t(x), c_n^i)    (2)

where ω(x) is a weight function and δ(·,·) is a delta function. In the example embodiment, ω(x) = 1 is chosen for isolated objects or regions. When locating an object in a group, ω(x) may not be equal to 1. If necessary, other weight functions can be used, e.g. a Gaussian kernel to suppress the noise around the object's boundary [5]. δ(c1, c2) equals 1 when c1 = c2; otherwise it equals 0. However, in the example embodiment a color distance is used which is not sensitive to noise and illumination changes:

d(c1, c2) = 1 − 2⟨c1, c2⟩ / (‖c1‖² + ‖c2‖²)    (3)

where ⟨·,·⟩ denotes the dot product. The color distance in (3) is then applied to compute the delta function in (2) as

δ(c1, c2) = 1 if d(c1, c2) < ε, and 0 otherwise    (4)

ε = 0.005 is chosen in the example embodiment.

The PCR T_t^n contains the first N significant colors and their statistics for the region R_t^n(x). Since a region of one or a few objects manifests only a few dominant colors, it is possible to find a small number N to approximate the color features of the region, i.e.,

Σ_{i=1}^{N} s_n^i ≥ η·s_n with η = 90%    (5)

In the example embodiments, using N = 50 in (5) leads to satisfactory results for almost all regions containing one or a group of objects.
Fig. 3 shows two examples of PCRs where one image 300 contains two isolated individuals and another image 302 contains a group of 5 persons. The PCRs for the foreground regions are generated through scanning the respective regions, and are shown in the histograms 312, 314 respectively. Details of the algorithm for the foreground region R_t^n(x) (see white areas e.g. 304, 306 in the segmented images 308, 310 respectively) are summarized in Table 2.
TABLE 2: THE ALGORITHM TO GENERATE PCR FOR REGION R_t^n(x)
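The algorithm summarized in Table 2 is not reproduced in this text; a rough Python sketch of PCR generation along the lines of (1)–(5) is given below. The data layout is an assumption, and colors are merged with the distance of (3) rather than by exact equality.

```python
import numpy as np

def color_dist(c1, c2):
    """Normalized color distance of (3)."""
    return 1.0 - 2.0 * np.dot(c1, c2) / (np.dot(c1, c1) + np.dot(c2, c2) + 1e-12)

def build_pcr(image, region_mask, eps=0.005, coverage=0.90, max_colors=50):
    """Scan a foreground region and return its principal color representation.

    Returns (region_size, [(color, significance), ...]) sorted by significance,
    keeping the fewest colors whose significances cover `coverage` of the region.
    """
    pixels = image[region_mask].astype(float)     # Mx3 array of RGB values in the region
    principal = []                                # list of [color, count]
    for p in pixels:
        for entry in principal:
            if color_dist(p, entry[0]) < eps:     # delta function of (4)
                entry[1] += 1
                break
        else:
            principal.append([p.copy(), 1])
    principal.sort(key=lambda e: e[1], reverse=True)
    size, acc, kept = len(pixels), 0, []
    for color, count in principal[:max_colors]:
        kept.append((color, count))
        acc += count
        if acc >= coverage * size:                # condition (5)
            break
    return size, kept
```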
The aim of object tracking is to locate a tracked object in the coming frame according to its previous appearance. To achieve this, the likelihood, or the conditional probability of observing the tracked object in a region of the current frame, has to be evaluated. In the following, the likelihoods are first defined based on the original and normalized PCRs of the tracked object and a region. This is then extended to the scale-invariant likelihood.
Let O_{t−1}^m be the m-th tracked object described by its PCR

T_{t−1}^m = { s_m, { E_m^i = (c_m^i, s_m^i) }_{i=1}^{N_m} }

obtained previously when the tracked object was an isolated object, and R_t^n be the n-th foreground region detected at time t. According to the law of total probability, the likelihood of the object O_{t−1}^m in the region R_t^n can be defined as

P(R_t^n | O_{t−1}^m) = Σ_i P(R_t^n | E_m^i) P(E_m^i | O_{t−1}^m)    (6)

where each P(R_t^n | E_m^i) is the likelihood of the object O_{t−1}^m appearing in the region R_t^n based on the partition evidence E_m^i, and P(E_m^i | O_{t−1}^m) is the conditional probability of the evidence E_m^i given the object O_{t−1}^m. Using the PCR T_{t−1}^m of the object O_{t−1}^m, the conditional probability P(E_m^i | O_{t−1}^m) can be defined as the weight of the principal color c_m^i for its appearance,

P(E_m^i | O_{t−1}^m) = s_m^i / s_m    (7)

Using the PCRs T_{t−1}^m of the object O_{t−1}^m and T_t^n of the region R_t^n, the likelihood P(R_t^n | E_m^i) can be computed according to their significance values for the same color component c_m^i,

P(R_t^n | E_m^i) = (1/s_m^i) · min( s_m^i, Σ_j δ(c_m^i, c_n^j) s_n^j )    (8)

Substituting (7) and (8) into (6) yields

P(R_t^n | O_{t−1}^m) = (1/s_m) Σ_i min( s_m^i, Σ_j δ(c_m^i, c_n^j) s_n^j )    (9)

It is noted that the above likelihood (9) is evaluated under the assumption that the size variation of the object is small. However, if the size change is large, the likelihood value will be affected. Therefore, a definition of likelihood based on the normalized PCRs is used in the example embodiment. Let

T̂_{t−1}^m = { 1, { Ê_m^i = (c_m^i, ŝ_m^i) }_{i=1}^{N_m} }, where ŝ_m^i = s_m^i / s_m,

and

T̂_t^n = { 1, { Ê_n^j = (c_n^j, p̂_n^j) }_{j=1}^{N_n} }, with p̂_n^j = s_n^j / s_n,

be the normalized PCRs of the object and of the region R_t^n. The likelihood based on the normalized PCRs becomes

P̂(R_t^n | O_{t−1}^m) = Σ_i min( ŝ_m^i, Σ_j δ(c_m^i, c_n^j) p̂_n^j )    (10)

If the region R_t^n only contains a single object, the likelihood based on the normalized PCRs is more accurate than that based on the original PCRs. However, if O_{t−1}^m is one of the objects in the group R_t^n, the likelihood on the original PCRs is better. Hence, the scale-invariant likelihood of observing a given object O_{t−1}^m in the region R_t^n is defined as the larger of the likelihoods based on the original and the normalized PCRs, i.e.,

P*(R_t^n | O_{t−1}^m) = max( P(R_t^n | O_{t−1}^m), P̂(R_t^n | O_{t−1}^m) )    (11)

Heuristically, if the region R_t^n is a single object, the likelihood computed from the normalized PCRs appears more reliable. However, if O_{t−1}^m is one of the objects in a group R_t^n, the likelihood from the un-normalized PCRs appears better since the object is smaller than the group. Eq. (11) can provide a suitable measurement for these two cases in the example embodiment.
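A sketch of this likelihood computation from two PCRs (as produced by the build_pcr sketch above, and reusing its color_dist helper) might look as follows; the max combination of (11) reflects the reading of the text above and should be treated as an assumption.

```python
def pcr_likelihood_original(pcr_obj, pcr_region, eps=0.005):
    """Likelihood (9): how much of the object's principal color mass is found in the region."""
    size_obj, colors_obj = pcr_obj
    _, colors_region = pcr_region
    total = 0.0
    for c_m, s_m in colors_obj:
        matched = sum(s_n for c_n, s_n in colors_region if color_dist(c_m, c_n) < eps)
        total += min(s_m, matched)
    return total / max(size_obj, 1)

def pcr_likelihood_normalized(pcr_obj, pcr_region, eps=0.005):
    """Likelihood (10): same computation on significances normalized by object/region size."""
    size_obj, colors_obj = pcr_obj
    size_region, colors_region = pcr_region
    total = 0.0
    for c_m, s_m in colors_obj:
        matched = sum(s_n / size_region for c_n, s_n in colors_region
                      if color_dist(c_m, c_n) < eps)
        total += min(s_m / size_obj, matched)
    return total

def scale_invariant_likelihood(pcr_obj, pcr_region):
    """Scale-invariant likelihood (11), taken here as the larger of (9) and (10)."""
    return max(pcr_likelihood_original(pcr_obj, pcr_region),
               pcr_likelihood_normalized(pcr_obj, pcr_region))
```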
Object tracking in video surveillance aims at maintaining a unique identification (id) for each target object and providing its time-dependent positions in the scene. When multiple target objects frequently merge and separate from one another in a public site, tracking one individual object is no longer an independent process. Multi-object tracking can be formulated as a global Maximum A Posterior (MAP) problem for all the tracked objects. With the segmented foreground regions provided by background subtraction, in the example embodiment the global MAP problem can be approximately decomposed into two subproblems: assignment and location. Using the principal color representation (PCR) and the associated likelihood function, the example embodiment uses sequential solutions to these two subproblems, as detailed below.
Let O_{t−1} = {O_{t−1}^m}_{m=1}^{M_{t−1}} be the set of tracked target objects in the previous frame I_{t−1}(x), and θ_{t−1} be the set of state parameters describing their positions at time t − 1. The task of multi-object tracking is to estimate the states θ_t of the tracked objects in the current frame I_t(x) given their previous appearance models O_{t−1} and states θ_{t−1}. This can be formulated as the Maximum A Posterior (MAP) estimation for the state parameters θ_t,

θ_t* = arg max_{θ_t} P(θ_t | I_t(x), O_{t−1}, θ_{t−1})    (12)

When several objects overlap one another, they cannot be tracked as independent objects. With the foreground regions R_t = {R_t^k}_{k=1}^{K_t} provided by background subtraction, (12) can be simplified as

θ_t* = arg max_{θ_t} P(θ_t | R_t, O_{t−1}, θ_{t−1})    (13)

where the tracked objects {O_{t−1}^m}_{m=1}^{M_{t−1}} come from the previous frame. If a region (e.g., R_{t−1}^j) only contains one object (e.g., O_{t−1}^m), it is an isolated object; otherwise the region is a group. Objects belonging to different regions in the previous frame may merge into a new group region (e.g., R_t^k) in the current frame. Also, the objects in a group region (e.g., R_{t−1}^j) in the previous frame may separate into several regions in the current frame. For real-time processing with a moderate or high frame rate of image acquisition, the inter-frame movements of target objects are usually small. This implies that there is always an overlap between the regions of the same object in consecutive frames. Exploiting such a relation, the problem (13) can be further decomposed by using directed acyclic graphs (DAGs).
The directed acyclic graphs (DAGs) for the regions detected in the consecutive frames I_{t−1}(x) and I_t(x) are constructed in the following way. Let the regions from the previous and current frames be denoted as nodes and be laid out in two layers: the parent layer and the child layer. The parent layer consists of nodes representing the regions {R_{t−1}^j}_{j=1}^{K_{t−1}} in the previous frame I_{t−1}(x), and the child layer consists of nodes denoting the regions {R_t^k}_{k=1}^{K_t} in the current frame I_t(x). Suppose R_{t−1}^j and R_t^k are the j-th and k-th regions in the previous and current frames, respectively; then the directional link from R_{t−1}^j to R_t^k can be defined as

l_{jk} = 1 if (R_{t−1}^j ∩ R_t^k) ≠ ∅, and 0 otherwise    (14)

This implies that there is a link only when the two regions have some overlap. A directed acyclic graph (DAG) is formed by a set of nodes in which every node connects to one or more nodes in the same group. A set of DAGs (graphs) can be generated. An example of graphs for two consecutive frames is illustrated in Figure 4. The notations for the DAGs are defined as follows. For the i-th graph, the parent nodes are denoted as {n_i^p}_{p=1}^{P_i}, where each node n_i^p represents one of the regions {R_{t−1}^j}, and the child nodes are denoted as {n_i^q}_{q=1}^{Q_i}, where each node n_i^q represents one of the regions {R_t^k}. The i-th DAG can thus be denoted as the set of its parent nodes, child nodes, and the links between them.

If a parent node is associated with M_i^p = 1 object, then the node is a single object; otherwise it is a group of M_i^p objects. Each object o_i^{p,m} in a parent node is one of the tracked objects {O_{t−1}^m}_{m=1}^{M_{t−1}}. The objects in a child node n_i^q, which may be newly generated objects or objects tracked from its parent nodes, are denoted as {o_i^{q,m}}. After processing all graphs, the objects in the child nodes are reordered as {O_t^m}_{m=1}^{M_t}. They are the set of tracked target objects in the current frame.
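A rough sketch of building these links and grouping the regions into graphs follows; the region representation (boolean masks) and the union-find grouping are illustrative assumptions.

```python
import numpy as np

def build_graphs(prev_masks, curr_masks):
    """Group previous-frame and current-frame regions into DAGs via mask overlap (14).

    prev_masks, curr_masks : lists of boolean region masks of equal image size.
    Returns a list of graphs, each as (parent_indices, child_indices, links),
    where links is the set of (j, k) pairs whose regions overlap.
    """
    links = {(j, k)
             for j, pm in enumerate(prev_masks)
             for k, cm in enumerate(curr_masks)
             if np.any(pm & cm)}                      # l_jk = 1 iff the regions overlap

    # Union-find grouping of connected parent/child nodes into one graph each.
    parent_of = {("p", j): ("p", j) for j in range(len(prev_masks))}
    parent_of.update({("c", k): ("c", k) for k in range(len(curr_masks))})

    def find(a):
        while parent_of[a] != a:
            parent_of[a] = parent_of[parent_of[a]]    # path compression
            a = parent_of[a]
        return a

    for j, k in links:
        parent_of[find(("p", j))] = find(("c", k))

    graphs = {}
    for node in list(parent_of):
        graphs.setdefault(find(node), ([], [], set()))
        kind, idx = node
        graphs[find(node)][0 if kind == "p" else 1].append(idx)
    for j, k in links:
        graphs[find(("p", j))][2].add((j, k))
    return list(graphs.values())
```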
Since there is no link between different DAGs, the objects in the parent nodes of one graph can be tracked independently of the other graphs. Let O_{t−1}^i represent the set of objects in the parent nodes of graph G_i, i.e., O_{t−1}^i = {{o_i^{p,m}}_{m=1}^{M_i^p}}_{p=1}^{P_i}. Then the probability of the states for all the tracked objects in image I_t(x) becomes

P(θ_t | R_t, O_{t−1}, θ_{t−1}) = Π_{i=1}^{L_G} P(θ_t^i | R_t, O_{t−1}^i, θ_{t−1}^i)    (15)

where θ_t = (θ_t^1, ..., θ_t^{L_G}) and L_G is the number of DAGs. According to (15), (13) can be decomposed as finding a (θ_t^i)* for each graph such that

(θ_t^i)* = arg max_{θ_t^i} P(θ_t^i | R_t, O_{t−1}^i, θ_{t−1}^i)    (16)

If there are several parent and child nodes in a graph, and some parent nodes represent groups, (16) is still a nontrivial problem. The example embodiment solves the problem in two sequential steps from coarse to fine. The problem is decomposed approximately into two sub-problems: assignment and location. The coarse assignment process assigns each object in a parent node to one of its child nodes, while the fine location process determines the new states of the objects assigned to each child node.
Assignment:
In this step, the tracked objects in each parent node are assigned to its child nodes based on the largest posterior probability. Let θ_a^{i,p} = (θ_a^{i,p,1}, ..., θ_a^{i,p,M_i^p}) be the parameter vector describing the assignment of the tracked objects in the parent node n_i^p, where θ_a^{i,p,m} = q means that the object o_i^{p,m} is assigned to the child node n_i^q. The posterior probability of the assignment for graph G_i can be expressed as

P(θ_a^i | R_t, O_{t−1}^i, θ_{t−1}^i) = Π_{p=1}^{P_i} P(θ_a^{i,p} | R_t, O_{t−1}^{i,p}, θ_{t−1}^{i,p})    (17)

where O_{t−1}^{i,p} = {o_i^{p,m}}_{m=1}^{M_i^p} are the tracked objects in node n_i^p, and N_i^p = {n_i^q : l_{pq} = 1} are the child nodes of n_i^p. The parameter vector is θ_a^i = (θ_a^{i,1}, ..., θ_a^{i,P_i}). The best assignment for the tracked objects {o_i^{p,m}}_{m=1}^{M_i^p} in n_i^p is chosen such that it results in the best observation of the objects in the corresponding child nodes, that is

(θ_a^{i,p})* = arg max_{θ_a^{i,p}} P(θ_a^{i,p} | R_t, O_{t−1}^{i,p}, θ_{t−1}^{i,p})    (18)

Here θ_a^{i,p} can be considered as the coarse tracking parameters indicating in which child nodes (regions) the objects are observed, without concern for the exact new positions of the tracked objects in the child regions.
Location:
In this step, the new states of the tracked objects assigned to each child node (e.g., region R_t^k) are determined. Let O_t^{i,q} = {o_i^{q,m}}_{m=1}^{M_i^q} be the objects assigned to the child node n_i^q from its parent nodes. That is, O_t^{i,q} is a subset of O_{t−1}^i according to the assignment parameters (θ_a^i)*. After the assignment, objects in each child node can be tracked independently of objects in the other child nodes. Hence, the posterior probability of the new states for the tracked objects in the graph G_i can be evaluated as

P(θ_l^i | R_t, O_t^i, θ_{t−1}^i) = Π_{q=1}^{Q_i} P(θ_l^{i,q} | R_t, O_t^{i,q}, θ_{t−1}^{i,q})    (19)

and the new states of the objects in each child node are determined by

(θ_l^{i,q})* = arg max_{θ_l^{i,q}} P(θ_l^{i,q} | R_t, O_t^{i,q}, θ_{t−1}^{i,q})    (20)
Multi-object tracking thus becomes finding the solutions for Eqs. (18) and (20) in the example embodiment. Further sequential solutions for (18) and (20) based on PCR are used and described below.
Assuming that {R_t^k}_{k=1}^{K_t} are the foreground regions and {B_t^k}_{k=1}^{K_t} are their bounding boxes detected at time t, their PCRs can be obtained as {T_t^k}_{k=1}^{K_t}. Let {G_i}_{i=1}^{L_G} be the set of directed acyclic graphs (DAGs) for the foreground regions between the consecutive frames. If there is only one object in a graph G_i, then the object will be tracked as an isolated object. Otherwise, multi-object tracking will be performed according to Eqs. (18) and (20). For tracking multiple objects in a group, the posterior probability of the new state for each object is determined on both the spatial position and the depth relationship. Hence, a 2½D state is used for each object. The state vector for an object O_t^m is θ_t^m = (b_t^m, v_t^m), where b_t^m is the bounding box describing its spatial position and v_t^m is the likelihood value describing its depth position in the group.
Tracking Isolated Objects
If the i-th graph consists of only one child node (i.e., G_i = (n_i^1)), a new object appears and is initialized as o_t^l in G_i with a new id number. Suppose the node n_i^1 represents the region R_t^k; then the PCR and bounding box of o_t^l are set as T_t^l = T_t^k and b_t^l = B_t^k. Since o_t^l is an isolated object, it is not occluded by any other objects. The depth state is set as v_t^l = 1. Its 2½D state parameter vector is θ_t^l = (b_t^l, v_t^l).
If the i-th graph contains one parent node and one child node, and the parent node is associated with one object, the graph represents the simple case of isolated object tracking. Let the graph be G_i = (n_i^p, n_i^q, l_{pq} = 1), the object in the parent node be O_{t−1}^m, and the child node n_i^q represent the region R_t^k; then the object O_{t−1}^m is updated as o_t^m in n_i^q (i.e., O_{t−1}^m and o_t^m have the same id number). Its state becomes θ_t^m = (b_t^m, v_t^m) with b_t^m = B_t^k and v_t^m = 1. In addition, its PCR is updated as T_t^m = T_t^k to follow the gradual variation of the object.
If the ith graph only contains one parent node which has no child nodes, then the previous objects in the parent node are assumed to have disappeared in the current frame. Tracking is terminated for these objects.
Tracking Multiple Objects in a Graph
If the ith graph G_i contains multiple parent nodes or child nodes, the operations of assignment and location will be performed. In the following description of the operations for one graph, the index i for the graph G_i is omitted for notational convenience.
Assignment:
Let n_0^p be a parent node in the graph G, let O_{t-1}^p = {o_{t-1}^m}_{m=1}^{M_p} be the associated objects, and let N_1^p be its child nodes. If the parent node has more than one child node, the assignment of the objects O_{t-1}^p is determined by Eq. (18). However, with varying numbers of objects and child nodes, Eq. (18) is a nontrivial problem of optimal configuration. To make the problem tractable, a sequential solution is proposed based on the objects' PCRs and the depth relations among the objects.
In each group, the close and non-occluded objects have richer visible information than the distant or occluded objects. This means that an occluded object has less effect on the tracking of the objects occluding it. Hence, the assignment can be solved sequentially from the most visible object to the least visible one. Let the objects {o_{t-1}^m}_{m=1}^{M_p} in the parent node n_0^p be ordered according to their visible sizes. Assuming that the correct assignment of the object o_{t-1}^m is θ^m = q_m, which assigns o_{t-1}^m to the child node n_1^{q_m} (n_1^{q_m} ∈ N_1^p), and that the child node n_1^{q_m} represents the region R_t^{q_m}, then the posterior probability of the assignment for the objects O_{t-1}^p = {o_{t-1}^m}_{m=1}^{M_p} is computed as in (21), where R_t^{q_m}(m-1) represents the region after excluding the objects previously assigned to it before o_{t-1}^m. Note that the assignments of the objects are not independent: the assignment of one object is affected by the previously assigned objects with higher ranks. This means the assignment of each object can be performed one-by-one sequentially from the most to the least visible one. For each object, the posterior probability of assignment can be evaluated using Bayes' rule, as in (22).
The first term on the right-hand side of (22) is the likelihood of observing the object o_{t-1}^m in the region R_t^{q_m} with the exclusion of previously assigned objects, while the second term is the prior probability of θ^m = q_m given the previous state θ_{t-1}^m. For assignment, (22) can be evaluated on PCR. Assuming that a child node corresponds to the bounding box B_t^{q_m}, let T_t^{q_m} and T_{t-1}^m be the PCRs of B_t^{q_m} and o_{t-1}^m, respectively. Using Eqs. (21) and (22), the best assignment of the objects can be achieved one-by-one sequentially according to their depth order by Eq. (23).
The sequential solution to Eq. (18) using Eq. (23) is computed in two steps. First, the objects in the parent node n_0^p are sorted in a list according to their visible parts. An iterative process is then performed from the most visible to the least visible object. In each iteration, the object at the top of the list is assigned to one of the child nodes according to (23). Once an object is assigned to a child node, it is removed from the list and its visual evidence is excluded from the PCR of the child region. Details are described below.
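As an illustration only, the following Python sketch outlines this sequential assignment step. The helper functions pcr_match_likelihood (standing in for the PCR-based evaluation of Eq. (23)) and exclude_from_pcr (removing an assigned object's colour evidence from a child region's PCR) are assumed, hypothetical names and not part of the example embodiment itself.

```python
def assign_objects_to_children(parent_objects, child_regions,
                               pcr_match_likelihood, exclude_from_pcr):
    """Sequentially assign the tracked objects of one parent node to its
    child regions, from the most visible object to the least visible one.

    parent_objects : list of dicts with keys 'id', 'visible_size', 'pcr'
    child_regions  : dict {child_id: region_pcr}
    The two callables are assumed helpers (see lead-in above).
    """
    # Most visible (least occluded) objects are assigned first.
    ordered = sorted(parent_objects, key=lambda o: o['visible_size'], reverse=True)
    assignment = {}
    for obj in ordered:
        # Pick the child region whose remaining colour evidence best matches
        # the object's principal colours.
        best_child = max(child_regions,
                         key=lambda cid: pcr_match_likelihood(obj['pcr'],
                                                              child_regions[cid]))
        assignment[obj['id']] = best_child
        # Exclude the assigned object's evidence so that later (more occluded)
        # objects are matched against what is left of the region.
        child_regions[best_child] = exclude_from_pcr(child_regions[best_child],
                                                     obj['pcr'])
    return assignment
```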
Location:
The locations of the objects in a region are not independent of each other, but the front objects with richer visible information are less affected by the occluded ones. Hence, in the example embodiment, objects in the node are located one by one from the most visible to the least visible based on their visible parts. The posterior probability of the new states for all the objects in the node can be expressed as
(25), where Θ_t^q = (θ_t^1, ..., θ_t^{N_q}), R_t^q is the region associated with the node n_1^q, and R_t^q(n-1) represents the region in which the visual evidence of the first n-1 objects (o_t^1, ..., o_t^{n-1}) has been excluded at the located positions. According to (25), locating the objects O_t^q according to (20) is equivalent to locating them one by one sequentially according to

(θ_t^n)* = argmax P(θ_t^n | R_t^q(n-1), o_{t-1}^n, θ_{t-1}^n)    (26)

where {o_t^n}_{n=1}^{N_q} are sorted in descending order according to their visible sizes.
The sequential solution to the problem of Eqs. (20) and (26) contains two steps. In the first step, the visible parts of the objects in the node are estimated, and the objects are sorted according to their visible sizes. In the second step, an iterative process is applied to locate the objects one-by-one in the region with a mean-shift algorithm based on PCR. When an object is located, its visual evidence is excluded from its position in the region. The details are described in the following.
Assume that an object o_t^n in the child node n_1^q is assigned from the parent node n_0^p, so that it corresponds to an object o_{t-1}^m. Let {v_{t-1}^n}_{n=1}^{N_q} be the likelihoods of {o_{t-1}^n}_{n=1}^{N_q} computed from the previous frame. The likelihood of observing object o_t^n in the child node n_1^q (or region R_t^q) in the current frame can be evaluated according to Eq. (11). Since the motion of an object between two consecutive frames is assumed small, the visible part ζ_t^n of the object o_t^n in region R_t^q can be estimated as in (27), where S_n and S_k are the sizes of the object o_t^n and the region R_t^q, respectively. In Eq. (27), η is a weight to smooth the estimates from consecutive frames (η = 0.5 is chosen in this study). The objects {o_t^n}_{n=1}^{N_q} are then sorted in descending order according to the values of {ζ_t^n}_{n=1}^{N_q} and placed in a list. To perform the exclusion for R_t^q(n-1), a weight image ω_{n-1}(x) is used. If the pixel x is likely to belong to one of the previously located objects (o_t^1, ..., o_t^{n-1}), ω_{n-1}(x) is low (≈ 0); otherwise, it is high (≈ 1). For initialization, set ω_0(x) = 1 for all the pixels belonging to the region R_t^q, and ω_0(x) = 0 otherwise.
In each iteration, the top object in the list is popped. Assume that it is the nth object o_t^n, with its initial position represented by the previous bounding box B^{(0)} = b_{t-1}^n centered at x^{(0)}, and its PCR T_{t-1}^n.
Locating the object o_t^n in the region R_t^q according to Eq. (26) is equivalent to finding the position where the maximum probability density for observing the object occurs. This density maximum can be found by employing a mean-shift procedure with a weight mode which can reveal the probability density of observing the object in the neighborhood [5], [6], [7]. A two-stage mean-shift procedure is proposed based on the evidence of the object's principal colors. In the first stage, the gravity center of the pixels of each principal color component is computed as in (28), with j = 1, ..., N, where r indicates the current step of the mean-shift iteration. In the second stage, the new position of the object o_t^n is generated as the weighted average of the gravity centers,
as given by (29), where the weight of evidence from a principal color c_u^n is defined as its backprojection. This weight implies that if the candidate region contains more pixels of color c_u^n than the object can account for, then only the corresponding proportion of the pixels with color c_u^n in the bounding box belong to the object o_t^n; otherwise, all the pixels of that color belong to the object. The mean-shift procedure is terminated once the shift of the position between successive iterations becomes negligible, or the maximum number of iterations is reached (6 in the example). The new location of the object o_t^n is the bounding box b_t^n = B^{(r+1)} centered at x^{(r+1)}. Let T_t^n be the PCR of the part of the region within the bounding box b_t^n, where s_n is the size of that part. The likelihood v_t^n of the object o_t^n in the group can be estimated according to Eq. (11). The new state parameter of the object o_t^n is then obtained as θ_t^n = (b_t^n, v_t^n).
The last operation of the iteration is exclusion, which removes the visual information of the tracked object o_t^n from the region R_t^q at its position, i.e., obtains R_t^q(n). This operation is done by updating the weight image ω_n(x). For a pixel x within the located bounding box b_t^n with color c_x, the significance of c_x can be obtained in both the PCR of the object and the PCR of the part of the region within b_t^n, denoted s_n(c_x) and ŝ(c_x) respectively. Let us define Δω(x) = min(1, s_n(c_x)/ŝ(c_x)). The value of Δω(x) can be considered as the probability of x belonging to the object o_t^n. Hence, the updating of the weight image can be performed as in (31).
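The following Python sketch illustrates one plausible reading of this location step. The box-based colour counting, the backprojection cap min(1, expected/found), and the down-weighting ω_n(x) = ω_{n-1}(x)(1 - Δω(x)) are assumptions made for illustration; they are not the literal equations (28), (29) and (31) of the example embodiment.

```python
import numpy as np

def locate_objects(objects, label_img, weight, max_iter=6, eps=1.0):
    """Locate the objects assigned to one child node one-by-one, most visible
    first, with a two-stage mean-shift on principal colours and exclusion.

    objects   : list of dicts with 'visible', 'center' (x, y), 'half' (hw, hh)
                and 'pcr' = {colour_label: expected_pixel_count}
    label_img : 2-D int array of per-pixel principal-colour labels
    weight    : 2-D float array in [0, 1]; ~1 where evidence is unexplained
    """
    h, w = label_img.shape
    for obj in sorted(objects, key=lambda o: o['visible'], reverse=True):
        cx, cy = obj['center']
        hw, hh = obj['half']
        for _ in range(max_iter):
            x0, x1 = max(0, int(cx - hw)), min(w, int(cx + hw) + 1)
            y0, y1 = max(0, int(cy - hh)), min(h, int(cy + hh) + 1)
            box_lab = label_img[y0:y1, x0:x1]
            box_wgt = weight[y0:y1, x0:x1]
            ys, xs = np.mgrid[y0:y1, x0:x1]
            num_x = num_y = den = 0.0
            for col, expected in obj['pcr'].items():
                mask = (box_lab == col) * box_wgt   # unexplained evidence of this colour
                found = mask.sum()
                if found <= 0:
                    continue
                # Stage 1: gravity centre of the pixels of this principal colour.
                gx, gy = (xs * mask).sum() / found, (ys * mask).sum() / found
                # Backprojection-style cap: the box cannot contribute more
                # evidence for this colour than the object can account for.
                contrib = min(1.0, expected / found) * found
                num_x += contrib * gx
                num_y += contrib * gy
                den += contrib
            if den == 0:
                break
            # Stage 2: new centre = weighted average of the gravity centres.
            nx, ny = num_x / den, num_y / den
            shift = np.hypot(nx - cx, ny - cy)
            cx, cy = nx, ny
            if shift < eps:
                break
        obj['center'] = (cx, cy)
        # Exclusion: down-weight the located object's evidence so that the
        # remaining (more occluded) objects are located against what is left.
        x0, x1 = max(0, int(cx - hw)), min(w, int(cx + hw) + 1)
        y0, y1 = max(0, int(cy - hh)), min(h, int(cy + hh) + 1)
        for col, expected in obj['pcr'].items():
            mask = label_img[y0:y1, x0:x1] == col
            found = int(mask.sum())
            if found > 0:
                d_omega = min(1.0, expected / float(found))   # Δω as defined above
                weight[y0:y1, x0:x1][mask] *= (1.0 - d_omega)
    return objects
```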
The complete algorithm of multi-object tracking based on PCR in the example embodiment is summarized in Table 3.
TABLE 3
THE SUMMARY OF THE MULTI-OBJECT TRACKING ALGORITHM
Input: color image I_t(x) and segmented image S_t(x);
Preprocessing: generate graphs (G_i, i = 1, ..., L_G);
For G_i, i = 1, ..., L_G, do:
Assignment: for each parent node n_0^p in G_i, p = 1, ..., M_p, do:
a.1: if n_0^p has no child node, the objects in it are deleted;
a.2: if n_0^p has only one child node n_1^q, all the objects in it are assigned to n_1^q;
a.3: if n_0^p has multiple child nodes:
a.3.1: sort the objects {o_{t-1}^m} in n_0^p, and then assign them one-by-one from the first to the last as follows:
a.3.1.1: assign o_{t-1}^m to the child node n_1^{q_m} according to (23);
a.3.1.2: exclude the visual information of o_{t-1}^m from the PCR of n_1^{q_m};
Location: for each child node n_1^q in G_i, q = 1, ..., M_q, do:
1.1: if no object is assigned to the node, check if it is a disappeared object; if not, set it as a new object;
1.2: if only one object is assigned to the node, update the state and PCR of the object;
1.3: if multiple objects are assigned to the node:
1.3.1: sort the objects {o_t^n} in the node using (27), and then locate the objects one-by-one as follows:
1.3.1.1: apply mean-shift to locate o_t^n using (28) and (29);
1.3.1.2: exclude the visual evidence of o_t^n at its location in R_t^q using (31);
1.3.2: if the likelihood of observing an object in the region is less than 0.1, set the object as disappeared.
Clearance: if an object has disappeared for more than 50 frames, delete the object.
End
The algorithm in the example embodiment includes two phases of processing for each DAG (Directed Acyclic Graph): assignment and location. In the assignment phase, each parent node in the DAG is processed. In the location phase, the assigned objects in each child node are tracked. To be robust to the separation of small parts from a tracked object due to segmentation errors, small objects in a group with likelihood values less than 0.1 are set as disappeared. To prevent losing small or heavily occluded objects in a group, the records of disappeared objects are kept for 50 frames. When a new object is detected, it is compared with the disappeared objects according to their PCRs, sizes and distances. If it matches a disappeared object, the tracking will be restored; otherwise a new object is created.
In the example embodiment, segmenting individual persons in a group with domain knowledge is preferred. For example, in the example embodiment knowledge about the sizes and aspect ratios of persons in the scene is used to adapt to segmentation errors.
Figure 5 shows a flow chart 500 illustrating a method of multi-object tracking in a video signal in the example embodiment. At step 502, first and second segmented images of two consecutive frames of the video signal respectively are received, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked. At step 504, one or more directed acyclic graphs (DAGs) are generated for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node. At step 506, for each parent node having two or more child nodes, a) the corresponding objects of the foreground region contributing to said each parent node are sorted according to estimated depth in said first image; b) the corresponding object having the lowest depth is assigned to one of the child nodes of said each parent node; c) a visual content of the assigned corresponding object is removed from the visual data associated with said one child node; and steps b) to c) are iterated in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
At step 508, for each child node having only one corresponding object assigned thereto, a state and the visual content of said one object are updated based on the second image. At step 510, for each child node having two or more corresponding objects assigned thereto, d) the corresponding objects are sorted according to estimated depth in said each child node in said second image; e) a mean-shift calculation is applied to locate the corresponding object having the lowest depth in said each child node; f) the state and the visual content of the located corresponding object are updated based on the second image; g) the updated visual content of the located corresponding object is removed from the visual data associated with said each child node; and steps e) to g) are iterated in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
When an object stops moving and stays in the same position in the scene for a while, the object would gradually be absorbed into the background with existing background updating techniques. That means the object would be lost in the segmented foreground images. On the other hand, in e.g. crowded scenes, if one can separate the moving objects from the stationary objects in the scene, one can reduce the overlapping of multiple foreground objects. This would make the tracking of each individual easier and more robust. In the described example embodiment, a layer tracking algorithm is designed to track stationary objects through even frequent occlusions. When the object starts moving, the object is identified as a moving object and tracked by a moving object tracking algorithm. In the example embodiment, the stationary objects include not only static non-living objects but also motionless living objects, e.g. a standing or sitting person. Since the living objects may move again, the switching between moving object tracking and stationary object tracking for the target object is preferably smooth with no change of identity in the example embodiment.
When an object stops moving and stays in the scene over a number of frames of a video signal, the appearance variation of the object is typically small through the sequence of frames. A template image of the object is used to represent such a stationary object in the example embodiment.
Let {B_j^i} be the sequence of bounding boxes of the ith tracked object in the τ_b most recent frames, as tracked by a moving object tracking algorithm. If the object has stopped moving, the bounding boxes will overlap each other. For a selected length parameter τ_b, if the spatial intersection of all the boxes is not empty, the object is detected as a stationary object in the example embodiment. In the example embodiment, but not limiting, τ_b is set as 10 frames, corresponding to about 1 second. To track the stationary object in e.g. a busy site in which the object may be occluded frequently by moving foreground objects, a layer representation based on the object's template image is built. The layer representation of the detected stationary object comprises the template image A^i(s) of the object and the Principal Colour Representation (PCR) of the object stored when the object was detected as a stationary object. That is, the template image is based at least on the last frame of the sequence used in detecting the object as a stationary object. The layer representation further records d_t^j, the difference measure between the template A^i and the frame I_j(s) for the corresponding region of A^i; d_c^j, the difference measure between the consecutive frames I_{j-1}(s) and I_j(s) for the region of the template; d_p^j, the visibility measure of the object from the corresponding region in the frame I_j(s); and s_k, an estimated state of the stationary object at time k. Measures in the τ_d most recent frames and states in the τ_s most recent frames are recorded. The details of the calculation of these measures and of estimating states from these measures for each layer object are described below.
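As a minimal sketch of the stationarity test above (assuming axis-aligned boxes stored as (x1, y1, x2, y2) tuples, which is an assumption of this illustration):

```python
def is_stationary(boxes, tau_b=10):
    """Return True if the last tau_b bounding boxes of a track share a
    non-empty spatial intersection, i.e. the object has stopped moving."""
    if len(boxes) < tau_b:
        return False                          # not enough history yet
    recent = boxes[-tau_b:]
    # Intersection of axis-aligned boxes: max of the mins, min of the maxes.
    x1 = max(b[0] for b in recent)
    y1 = max(b[1] for b in recent)
    x2 = min(b[2] for b in recent)
    y2 = min(b[3] for b in recent)
    return x2 > x1 and y2 > y1
```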
In e.g. a busy public site, if there are objects staying in the scene, they will often merge with moving objects, resulting in high complexity for object tracking. By separating the layer (stationary) objects from moving objects and tracking the stationary and moving objects separately, the example embodiment can greatly enhance object tracking. Let c = I_t(s) be the color of a foreground point in the region of the ith template image. According to Bayes' rule, the probability of the point belonging to the background is

p(b | c) = p(c | b) p(b) / p(c)    (2b)
where p(c | b) can be obtained from the Principal Feature Representation (PFR) of the background. The PFR at each pixel is used to characterize the background. Let s = (x, y) be a pixel of the image. For each type of feature, a table which records the principal feature vectors and their statistics at s is built, where S_v(i) records the statistics of the M_v most frequent feature vectors at s. Each S_v(i) contains three components:

S_v(i) = { p_{v_i}, p_{v_i|b} = P_s(v_i | b), v_i = (v_{i1}, ..., v_{iD_v}) }    (4b)

where p_{v_i} is the learned frequency of the feature vector v_i at s, p_{v_i|b} is its learned probability conditioned on s belonging to the background, and D_v is the dimension of the vector v. The elements S_v(i) in the table are sorted in descending order with respect to the value p_{v_i}. Hence, the first N_v elements are used as the principal features. Three types of features are used in the example embodiment: a spectral feature (color), a spatial feature (gradient), and a temporal feature (color co-occurrence). Among them, the color and gradient features are stable for static background parts, and the color co-occurrence features are suitable for dynamic background parts. Three tables are used to learn the possible principal features of the three types for the background: T_c(s), T_e(s), and T_cc(s). The color vector is c = (R_t, G_t, B_t) from the input color frame. The gradient vector is e = (g_x, g_y) obtained by the Sobel operator. The color co-occurrence vector is cc = (R_{t-1}, G_{t-1}, B_{t-1}, R_t, G_t, B_t) with 32 levels for each color component.
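A rough sketch of such a per-pixel PFR table is given below; the field names, the capacity value, and the exact meaning of the stored probabilities are illustrative assumptions rather than the definitive data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FeatureStat:
    p_v: float               # learned frequency of this feature vector at pixel s
    p_vb: float              # learned probability associated with the background
    vec: Tuple[float, ...]   # the feature vector itself (colour, gradient, ...)

@dataclass
class PixelPFR:
    """One table per feature type (colour, gradient, colour co-occurrence)
    kept at each pixel; entries are sorted by p_v so that the first N_v
    elements act as the principal features of the background."""
    max_entries: int = 5                              # M_v (illustrative value)
    stats: List[FeatureStat] = field(default_factory=list)

    def principal(self, n_v: int) -> List[FeatureStat]:
        self.stats.sort(key=lambda e: e.p_v, reverse=True)
        return self.stats[:n_v]
```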
The probability of the pixel becoming a background point at the current frame can be calculated as
p(b) = N_s(b) / M_s    (5b)
where N_s(b) is the number of background points within a small window W_s centered at s in the previous frame, and M_s is the number of points within the window. Similarly, the probabilities of s belonging to the layer (stationary) object or to a moving foreground object are
p(l | c) = p(c | l) p(l) / p(c)  and  p(f | c) = p(c | f) p(f) / p(c)    (6b)
respectively. The probabilities p(c | l) and p(c | f) can be calculated with Gaussian kernels. Let c_x^l be the color of a point x in the template A^i within the window W_s. Then p(c | l) can be calculated as
p(c | l) = max_{x ∈ W_s} { k_c(c_x^l - c) k_s(x - s) }    (7b)

where k_c and k_s are Gaussian kernels for the color and spatial vectors, respectively. Again, let c_x^f be the color of a point x in the window W_s and in the region of moving foreground objects from the last frame I_{t-1}(s). The probability p(c | f) can be calculated as
p(c | f) = max_{x ∈ W_s} { k_c(c_x^f - c) k_s(x - s) }    (8b)
The priors can be calculated as
where N_s(l) and N_s(f) are the numbers of points belonging to the layer object and to the moving objects within the window W_s in the previous frame.
Comparing Eqs. (2b) and (6b), it can be seen that p(c) is a common normalization factor. Hence, the likelihoods of s belonging to the background, the layer object, or the moving object can be defined as
p'(b | c) = p(c | b) p(b),  p'(l | c) = p(c | l) p(l),  and  p'(f | c) = p(c | f) p(f)    (10b)
respectively. The pixel s would be assigned according to the greatest likelihood value. The mask for the moving objects is used as the input for moving object tracking.
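A minimal sketch of this per-pixel decision, directly following Eq. (10b) (the argument names are illustrative):

```python
def classify_pixel(p_c_given_b, p_c_given_l, p_c_given_f, p_b, p_l, p_f):
    """Assign a foreground pixel to the background, the layer (stationary)
    object, or a moving object by the largest likelihood of Eq. (10b)."""
    scores = {
        'background': p_c_given_b * p_b,   # p'(b | c)
        'layer':      p_c_given_l * p_l,   # p'(l | c)
        'moving':     p_c_given_f * p_f,   # p'(f | c)
    }
    return max(scores, key=scores.get)
```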
Stationary objects may also be involved in several changes and interactions with other objects through the sequence. A non-living object may e.g. undergo illumination changes, or be occluded or removed by other objects. A living object may change pose or move body parts, or start moving again. During tracking of the stationary object, the object's states are estimated and the template image is updated correspondingly in the example embodiment. In the example embodiment, five states are used to describe the layer object: motionless, occluded, removed, inner-motion, and start-moving. The state is estimated according to various change measures from a short sequence of the most recent frames.
Let s be a point in the template A^i(s) of the ith layer object. The difference between the template and a current frame at s can be evaluated as
where Th_d is a threshold set according to the image noise. Then, the difference measure between the template and the current frame for the layer object is defined as
where S_{A^i} is the size of the template.
Similarly, for a point s in the template A^i(s), the difference between consecutive frames at the point is evaluated as
The difference measures are calculated on color vectors.
If the changes over the region of the template are caused by motion of the object itself, then even if the differences d_t^j and d_c^j are large, the visibility measure d_p^j of the object in the current frame based on PCR would still be high, since the PCR is a global representation not related to spatial information. On the other hand, if the changes are caused by occlusion by other objects, the visibility of the layer object in the current frame would be low. Let T^i be the PCR of the layer object that was stored when the object was detected as a stationary object, and let T_t^i be the PCR from the region overlapped by the template A^i in the current frame. Then the visibility measure of the layer object in the current frame can be evaluated as
d_p^j = P(T_t^i | T^i). More particularly, let O^{t-1} be an object in I_{t-1}(s), and O_n^t be a region in I_t(s). According to Bayes' law, the probability of observing O^{t-1} in O_n^t can be computed as
P(O_n^t | O^{t-1}) = Σ_{m=1}^{N} P(O_n^t | E_m^l) P(E_m^l | O^{t-1})    (15b)
From the definition of PCR, the significance of c_m^l for O^{t-1} is P(E_m^l | O^{t-1}) = p_m^l / p^l, and the likelihood of observing O^{t-1} in O_n^t according to the evidence of c_m^l is

P(O_n^t | E_m^l) = min{ p_m^l, p_{c,m} } / p_m^l    (16b)

where p_{c,m} is the significance of c_m^l from the region O_n^t, and C(c_m^l) denotes the subset of the pixels of O_n^t whose colors match c_m^l.
With the change measures evaluated above over a short sequence of the τ_d most recent frames (i.e. image frames from I_{t-τ_d}(x) to I_t(x)), with τ_d normally set to 10 frames in the example embodiment, the states of the tracked layer object are estimated by heuristic rules in the example embodiment:
Rule 1: motionless: If both d_t^j and d_c^j are low through the sequence, the layer object is motionless;
Rule 2: occluded: If both d_t^j and d_c^j turn moderate or high and d_p^j turns low through the sequence, and there are moving objects overlapping the region of the template A^i, as determined from the bounding boxes of such moving objects in the moving object tracking algorithm applied, the layer object is occluded;
Rule 3: removed: If both d_t^j and d_c^j turn high and d_p^j turns low, and then d_c^j turns low through the sequence with no moving object overlapping the region of the template, the layer object is removed;
Rule 4: inner-motion: If both d_t^j and d_c^j turn moderate and then d_c^j turns low through the sequence, while d_p^j remains high, this means the layer object has changed its pose or moved part of its body but still stays there;
Rule 5: start-moving: If both d_t^j and d_c^j turn and remain moderate, and d_p^j remains high through the sequence, and there is a shift of the layer object, this means the layer object has started moving again.
The parameters for the rules are determined according to a knowledge base of human-perceived semantic meanings and an evaluation on real-world videos in the example embodiment. In the example embodiment, but not limiting, for the above rules the difference measures d_t^j and d_c^j are low if they are less than 0.25, moderate if they are within (0.25, 0.75), and high if they are larger than 0.75. The visibility measure d_p^j is low if it is less than 0.6; otherwise, it is high. The measure of shape shift is calculated by checking the expanding foreground pixels along the boundary of the template A^i. If the number of expanded pixels is larger than 50% of the template size, a "shift" of the object is detected. It will be appreciated that for some videos from specific cameras, e.g. cameras with unstable signals, adjustment of the thresholds may be required in different embodiments, based on the relevant knowledge base.
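The sketch below collapses Rules 1-5 into a single-frame check using the example thresholds; the real rules are evaluated over a short sequence of frames, so this is only an illustrative simplification.

```python
def estimate_layer_state(d_t, d_c, d_p, overlapped_by_mover, shape_shift):
    """Classify a layer (stationary) object from its change measures.

    d_t, d_c : template-vs-frame and frame-vs-frame differences in [0, 1]
    d_p      : PCR-based visibility measure in [0, 1]
    overlapped_by_mover : a tracked moving object overlaps the template
    shape_shift         : expanded boundary pixels exceed 50% of template size
    """
    def level(d):
        return 'low' if d < 0.25 else ('high' if d > 0.75 else 'moderate')

    visible = d_p >= 0.6
    if level(d_t) == 'low' and level(d_c) == 'low':
        return 'motionless'                                 # Rule 1
    if not visible and overlapped_by_mover:
        return 'occluded'                                   # Rule 2
    if level(d_t) == 'high' and not visible:
        return 'removed'                                    # Rule 3 (no overlapping mover)
    if visible and shape_shift:
        return 'start-moving'                               # Rule 5
    if visible:
        return 'inner-motion'                               # Rule 4
    return 'occluded'                                       # conservative fallback
```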
To track the layer object more robustly in the example embodiment, the layer model is maintained to adapt to real variations of the object without being affected by other objects in the scene. The five most recent states for each layer object (τs = 5 ) are recorded. However, it will be appreciated that other values may be used in different embodiments. If one state has more than 3 supports, the state is confirmed. For the corresponding state, the following updating is performed.
If the layer object is confirmed as being motionless, a smoothing operation is performed on the template image. If the object is recognized as being in the inner-motion state, the new image of the object in the current frame will replace the template. If the object is occluded, no updating will be performed. If the object is classified as start-moving, the object will be transformed into a moving object with the same ID and corresponding PCR, mask, and position for tracking by a moving object tracking algorithm, and the layer representation of the object will be deleted. If the object is detected as removed, the object will be transformed into a disappeared object and its layer representation will be destroyed. With these operations, a target object that moves around, stays somewhere for a while, and moves again can be tracked continuously and seamlessly by combining the example embodiment with the moving object tracking algorithm described for the example embodiment.
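A compact sketch of this maintenance logic is given below; the dictionary fields, the smoothing rate, and the flag names are assumptions for illustration only.

```python
from collections import Counter

def confirm_state(recent_states):
    """recent_states: the last tau_s = 5 state estimates of one layer object.
    A state is confirmed only if it has more than 3 supports."""
    state, support = Counter(recent_states).most_common(1)[0]
    return state if support > 3 else None

def update_layer_model(layer, confirmed_state, current_patch):
    """Apply the per-state maintenance described above (layer is a dict holding
    the template image and book-keeping flags; current_patch is the image
    region under the template in the current frame, as a numpy array)."""
    if confirmed_state == 'motionless':
        # Smooth the template towards the current appearance (rate illustrative).
        layer['template'] = 0.9 * layer['template'] + 0.1 * current_patch
    elif confirmed_state == 'inner-motion':
        layer['template'] = current_patch              # replace with the new pose
    elif confirmed_state == 'occluded':
        pass                                           # keep the template untouched
    elif confirmed_state == 'start-moving':
        layer['hand_over_to_moving_tracker'] = True    # same ID, PCR, mask, position
        layer['delete'] = True
    elif confirmed_state == 'removed':
        layer['disappeared'] = True
        layer['delete'] = True
    return layer
```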
Figure 6 shows a flow chart 600 illustrating a method of object tracking in a video signal according to the example embodiment. At step 602, it is detected that a tracked moving object has become stationary over a sequence of frames. At step 604, a template image of the stationary object is generated based on at least one of the frames in the sequence. At step 606, a state of the stationary object is tracked based on a comparison of the template image with a current frame of the video signal.
Event detection:
The structure diagram of an event detection system 700 implementation incorporating the described example embodiment is shown in Figure 7. It contains four fundamental modules: a foreground segmentation module 701, a moving object tracking module 702, a stationary object tracking module 704, and an event detection module 706.
The foreground segmentation module 701 performs the background subtraction and learning and includes the method and system for background updating of the example embodiment described above, applied to e.g. the adaptive background subtraction method proposed in [8]. The background model used in the example implementations employs Principal Feature Representation (PFR) at each pixel to characterize background appearance.
The moving objects are tracked with the deterministic 2.5D multi-object tracking algorithm of the described example embodiment in the moving object tracking module 702. As described above, to deal with large variations of target objects in shape and scale as well as complex occlusions, moving objects are represented by principal color representation models, which exploit a few most significant colors and their statistics to characterize the appearance of each tracked object. When a tracked object has been detected as having stopped moving, a layer representation, or template, for the object is established, and the object will be tracked by the stationary object tracking module 704 using the method and system of the described example embodiment. At each time step, the states of the templates for the objects are estimated with fuzzy reasoning. The template for one object may shift between five states: motionless, interior motion, occluded, starting to move, and removed. When a template for an object is detected as starting to move, the template for the object will be deleted and the object will be converted to a moving object and then tracked by the moving object tracking module 702.
In the event detection module 706, semantic models based on Finite State Machines (FSM) are designed to detect suspected scenarios. In the system 700 of the example
implementation, four types of unusual events are detected: unattended objects, theft, loitering persons, and unattended vehicles or unconscious persons.
An "event" is an abstract symbolic concept of what has happened in the scene. It is the semantic level description of the spatio-temporal concatenation of movements and actions of interesting objects in the scene. Event detection in video understanding is a high level procedure which identifies specific events by interpreting the sequences of observed perceptual features from inteπnediate level processing. It is a step that bridges the numerical level and the symbolic level. The fundamental part of event detection is event modeling. For an event, the model is determined by the task and the different instantiations. There are generally two issues for event modeling. One is to select an appropriate representation model, or formal language, and the other is to derive the descriptors for the interesting events with the model.
In implementations based on the described example embodiment, unusual events are described by the spatio-temporal evolution of objects' states, movements, and actions. On a semantic level, each event can be defined as a sequential succession of a few well-defined states. An event can be started at one or more initial states, and one state can then transit to the next state when new conditions are met as the scene evolves in time. When a specific state is reached, the event is declared. A state transition may also happen from an intermediate state back to a previous state if some conditions no longer hold for the state. The semantic representation can be modelled based on Finite State Machines (FSM). The FSM has at least two advantages: (1) it is explicit and natural for semantic description; (2) an FSM can readily and flexibly incorporate a variety of context information from intermediate-level processing.
Using a Finite State Machine, each specific event can be represented by a directed graph G_f = (S_f, E_f), where S_f is the set of nodes representing the states and E_f is the set of edges representing the transitions. One example of an FSM 800 is described in Figure 8. Any new object is initiated to state "0" 802 for all the events defined. This is the initial state. The FSM 800 is truly started only when some conditions are met and the active node transits to the next intermediate state, i.e., state "1" 804. There can be more than one intermediate state for the FSM 800 of an event, depending on the complexity of the event. The FSM 800 reaches the final state "End" 806 when all the conditions are met, and then the corresponding specific event is triggered. The FSM 800 is updated at each new frame. The FSM 800 can have a self-loop transition for each state. Although the FSM 800 may remain at the same state, some or all properties of the object may have changed; at the least, a time counter is incremented for each frame.
The more complicated an event, the larger N (the number of intermediate states in the FSM 800) becomes, and the greater the chance of delivering an unreliable detection result. Therefore, an important task in event modeling is to trim any unnecessary states by careful analysis and to identify the simplest event model.
The input to an FSM is the numerical perceptual features generated by the moving and stationary object tracking modules (compare 702 and 704 in Figure 7). The visual cues of each tracked object can include shape, position, motion, and relations with other objects. The visual cues in the example implementation are listed below (a sketch of a corresponding track-record structure follows the list):
- Object ID: the identity number of each tracked foreground object;
- Box: bounding box of the tracked object in current frame;
- Size: the area of the object in current frame;
- Status: indicates whether the tracked object is moving around or stationary;
- StayTime: indicates how long the object has stayed in the scene;
- InGroup: indicates whether the object is an isolated one or merged with others;
- Visibility: a measure within [0,1] indicates the degree of occlusion when overlapping with others;
- Motion: a measure within [0,1] indicates the degree of interior motion of a stationary object.
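The sketch below collects these cues into a single track-record structure as promised above; the field names and types are illustrative, not the literal interface of the modules.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TrackRecord:
    """Per-object visual cues handed from the tracking modules to the event FSMs."""
    object_id: int                      # identity number of the tracked object
    box: Tuple[int, int, int, int]      # bounding box in the current frame
    size: int                           # area of the object in the current frame
    status: str                         # 'moving' or 'stationary'
    stay_time: int                      # frames the object has stayed in the scene
    in_group: bool                      # merged with other objects or isolated
    visibility: float                   # [0, 1] degree of occlusion
    motion: float                       # [0, 1] interior motion when stationary
```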
The general processing flow for event detection in the example implementation is shown in Table III.
An advantage of the tracking modules (compare 702, 704 in Figure 7) is the capability to resume tracking of objects that are lost for a few frames. The two events, UNATTENDED OBJECT and THEFT, are directly concerned with object disappearance in the example implementation. Thus, when an active object does not appear in the track records of the current frame, one preferably determines whether it is temporarily lost or whether there is a genuine disappearance. To achieve this, a first-in-first-out (FIFO) buffer is built to contain the track records of N frames. OTracked are the track records of the previous N-th frame, and the triggered event is delayed by N frames. As such, in the example implementation it is possible to 'look forward' to check the case of disappearance of an object, with N = 30 in the example implementation. With a processing rate of 8 frames/sec or above, this represents a delay of less than 4 seconds. It will be appreciated that the delay can be balanced against the accuracy of detection in different implementations.
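A minimal sketch of this delayed look-up follows, assuming per-frame lists of track records; the buffer length handling and the function name are illustrative.

```python
from collections import deque

N_DELAY = 30                                # frames of look-ahead before deciding

history = deque(maxlen=N_DELAY + 1)         # FIFO of per-frame track records

def delayed_records(current_records):
    """Push the current frame's track records and return OTracked, the records
    from N_DELAY frames earlier, so an apparent disappearance can be verified
    against N_DELAY frames of later observations before an event is raised."""
    history.append(current_records)
    if len(history) <= N_DELAY:
        return None                         # not enough look-ahead yet
    return history[0]
```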
Loitering Detection
Loitering as defined in the example implementation involves one object. It is defined as a person wandering in the observed scene with a duration t > T_Loitering. The FSM is initialized for each new object. The FSM has one intermediate state, "Stay", which indicates that the tracked person is staying in the scene, whether moving around or stationary. There are two conditions for the transition from state "INIT" to state "Stay":
- The object is classified as human;
- The object moves in the scene (moving around or staying somewhere with frequent interior motion).
In state "Stay", a time counter t is continuously incremented as new frames are coming in. When t > TLoilering , the FSM transits from state "Stay" to state "Loiter" and a loitering event is triggered.
Unconscious Person Detection
As defined in the example implementation, this event also involves one object, a person. It is defined as an object becoming completely static with a duration t > T_Static. The FSM is initialized for each new object. When the tracked object is recognised as a person, the FSM transits to state "M", which indicates a person who is moving around or has significant interior motion. The second intermediate state of the FSM is "S", which indicates a person becoming and staying static, or completely motionless. There are two conditions for the transition from state "M" to state "S":
- The position of the person does not change compared to the previous frame;
- The interior motion of the person m < T_IntMotion.
In state "S", a time counter t is continuously incremented as new frames are coming in. When t > TSmic , the FSM transits from state "S" to state "UP", indicating that an unconscious person is detected. Examples of unconscious person include a sleeping or faint person. It will be appreciated that similar condictions can be used to detetc e.g. a vehicle staying overtime in a zone for short stopping, in which case the object of interest is changed to vehicle instead of person.
Unattended Object Detection
This event as defined in the example implementation involves two objects. The FSM is initialized for each new object. When a new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects. The FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. If the owner leaves the scene covered by the
camera, the FSM transits from state "Station" to state "UO" and the 'Unattended Object' is declared.
Theft Detection
This event as defined in the example implementation involves three objects. The FSM is initialized for each new object. Similar to the unattended object event, when a new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects. The FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. However, when the object disappears because another object has taken it while the owner still stays in the scene, the FSM transits from the state "Station" to the state "Theft" and a 'Theft' event is declared; meanwhile, the second person is identified as the potential thief.
The method and system of the example embodiment can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiment.
The computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.
The computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
The computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922. The computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.
The components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.
The application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilising a corresponding data storage medium drive of a data storage device 930. The application program is read and controlled in its execution by the processor 918. Intermediate storage of program data may be accomplished using RAM 920.
It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.
References
[1] D. Lowe. Distinctive image features from scale-invariant key-points. Int'l J. Computer Vision, 60(2):91-110, 2004.
[2] L. Li, W. Huang, I. Y. H. Gu, and Q. Tian. Statistical modeling of complex background for foreground object detection. IEEE Trans. Image Processing, 13(11):1459-1472, 2004.
[3] L. Li and M. K. H. Leung. Integrating intensity and texture differences for robust change detection. IEEE Trans. Image Processing, 11 (2): 105-112, 2002.
[4] C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):747-757, August 2000.
[5] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-Based Object Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564-577, 2003.
[6] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603- 619, 2002.
[7] Y. Cheng, "Mean Shift, Mode Seeking, and Clustering," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790-799, 1995.
[8] L. Li et al. IEEE Trans. Image Processing, vol. 13, no. 11, pp. 1459-1472, 2004.
[9] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):780-785, 1997.
Claims
1. A method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of: defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
2. The method as claimed in claim 1, wherein a first learning rate for the pixels that are occluded is lower than a second learning rate for the pixels that are exposed.
3. The method as claimed in claim 2, further comprising the steps of: determining whether said respective pixels that are exposed are detected as a background point or as a foreground point in a current background subtraction for the current image; and setting different learning rates for the adaptive background subtraction for exposed pixels that are detected as foreground points and for exposed pixels that are detected as background points respectively.
4. The method as claimed in claim 3, wherein a third learning rate for the exposed pixels that are detected as foreground points is higher than the second learning rate for the exposed pixels that are detected as background points.
5. The method as claimed in claim 1, wherein one contextual background representation type A comprises a facility for the public such as a counter or a bench, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed comprises the steps of: evaluating, for each image region spatially corresponding to a type A background region, whether said each image region is occluded based on matching OHRs of the type A background region and of said each image region respectively and based on matching PCRs of the type A background region and of said each image region respectively; and determining all pixels of said each image region as either occluded or exposed depending on said evaluation.
6. The method as claimed in claim 5, wherein all pixels are determined as exposed if a match likelihood in said evaluation is above a threshold value, and are determined as occluded otherwise.
7. The method as claimed in claim 1, wherein one contextual background representation type B comprises a large homogeneous region such as a ground plane or a wall surface, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed comprises the steps of: evaluating, for each image region spatially corresponding to a type B background region, whether neighborhood regions around respective pixels in said each image region are occluded based on matching PCRs of the type B background region and of the respective neighborhood regions; and determining pixels of said each image region as either occluded or exposed depending on the respective evaluations.
8. The method as claimed in claim 7, wherein each pixel is determined as occluded if a majority of neighborhood pixels in the neighborhood region of said each pixel are within said type B background region and less of the neighborhood pixels themselves are evaluated as exposed based on a match likelihood being above a threshold value, and is determined as exposed otherwise.
9. The method as claimed in claim 1, further comprising setting a zero learning rate for pixels belonging to foreground regions.
10. The method as claimed in claim 1, further comprising the step of performing adaptive background subtraction using said set rates for the respective pixels.
11. The method as claimed in claim 10, wherein the adaptive background subtraction is based on, in one example embodiment, Mixture of Gaussian or Principal Feature Representation.
12. The method as claimed in claim 1, further comprising maintaining a model base for the contextual background representation types, the model base including models for different illumination conditions.
13. The method as claimed in claim 12, further comprising adjusting an appearance, a spatial characteristic, or both, of the models in the model base over a long duration compared with a frame duration in the video signal.
14. A system for background updating for adaptive background subtraction in a video signal, the system comprising: means for defining one or more contextual background representation types; means for segmenting an image of a scene in the video signal into contextual background regions; means for classifying each contextual background region as belonging to one of the contextual background representation types; means for determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; means for receiving a current image of the scene in the video signal; means for determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and means for setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
15. A data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of: defining one or more contextual background representation types;
segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US80696406P | 2006-07-11 | 2006-07-11 | |
US60/806,964 | 2006-07-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008008045A1 true WO2008008045A1 (en) | 2008-01-17 |
Family
ID=38923513
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SG2007/000206 WO2008008046A1 (en) | 2006-07-11 | 2007-07-11 | Method and system for multi-object tracking |
PCT/SG2007/000205 WO2008008045A1 (en) | 2006-07-11 | 2007-07-11 | Method and system for context-controlled background updating |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SG2007/000206 WO2008008046A1 (en) | 2006-07-11 | 2007-07-11 | Method and system for multi-object tracking |
Country Status (2)
Country | Link |
---|---|
SG (1) | SG150527A1 (en) |
WO (2) | WO2008008046A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009133377A1 (en) * | 2008-05-01 | 2009-11-05 | Pips Technology Limited | A video camera system |
US8483481B2 (en) | 2010-07-27 | 2013-07-09 | International Business Machines Corporation | Foreground analysis based on tracking information |
WO2014038924A3 (en) * | 2012-09-06 | 2014-06-26 | Mimos Berhad | A method for producing a background model |
US8934670B2 (en) | 2008-03-25 | 2015-01-13 | International Business Machines Corporation | Real time processing of video frames for triggering an alert |
US20170330050A1 (en) * | 2016-05-16 | 2017-11-16 | Axis Ab | Method and apparatus for updating a background model used for background subtraction of an image |
CN107368784A (en) * | 2017-06-15 | 2017-11-21 | 西安理工大学 | A kind of novel background subtraction moving target detecting method based on wavelet blocks |
US20210368112A1 (en) * | 2019-05-09 | 2021-11-25 | Tencent Technology (Shenzhen) Company Limited | Method for implanting information into video, computer device and storage medium |
CN117953015A (en) * | 2024-03-26 | 2024-04-30 | 武汉工程大学 | Multi-pedestrian tracking method, system, device and medium based on video super-resolution |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8572740B2 (en) | 2009-10-01 | 2013-10-29 | Kaspersky Lab, Zao | Method and system for detection of previously unknown malware |
AU2013242830B2 (en) | 2013-10-10 | 2016-11-24 | Canon Kabushiki Kaisha | A method for improving tracking in crowded situations using rival compensation |
CN103729861B (en) * | 2014-01-03 | 2016-06-22 | 天津大学 | A kind of multi-object tracking method |
KR101631955B1 (en) | 2014-12-10 | 2016-06-20 | 삼성전자주식회사 | Target object tracking apparatus and method of operations thereof |
GB2550858A (en) | 2016-05-26 | 2017-12-06 | Nokia Technologies Oy | A method, an apparatus and a computer program product for video object segmentation |
US10360456B2 (en) * | 2016-08-12 | 2019-07-23 | Qualcomm Incorporated | Methods and systems of maintaining lost object trackers in video analytics |
US10304207B2 (en) * | 2017-07-07 | 2019-05-28 | Samsung Electronics Co., Ltd. | System and method for optical tracking |
CN108399411B (en) * | 2018-02-26 | 2019-07-05 | 北京三快在线科技有限公司 | A kind of multi-cam recognition methods and device |
CN109143222B (en) * | 2018-07-27 | 2023-04-25 | 中国科学院半导体研究所 | 3D Maneuvering Target Tracking Method Based on Divide and Conquer Sampling Particle Filter |
CN111179304B (en) * | 2018-11-09 | 2024-04-05 | 北京京东尚科信息技术有限公司 | Target association method, apparatus and computer readable storage medium |
CN113168503B (en) * | 2018-12-03 | 2024-11-08 | 瑞典爱立信有限公司 | Distributed computing for real-time object detection and tracking |
CN112395920B (en) | 2019-08-16 | 2024-03-19 | 富士通株式会社 | Radar-based attitude recognition device, method and electronic equipment |
CN110889864B (en) * | 2019-09-03 | 2023-04-18 | 河南理工大学 | Target tracking method based on double-layer depth feature perception |
CN112991382B (en) * | 2019-12-02 | 2024-04-09 | 中国科学院国家空间科学中心 | Heterogeneous visual target tracking system and method based on PYNQ framework |
CN111178218B (en) * | 2019-12-23 | 2023-07-04 | 北京中广上洋科技股份有限公司 | Multi-feature joint video tracking method and system based on face recognition |
CN111340846B (en) * | 2020-02-25 | 2023-02-17 | 重庆邮电大学 | Multi-feature fusion anti-occlusion target tracking method |
CN111726264B (en) * | 2020-06-18 | 2021-11-19 | 中国电子科技集团公司第三十六研究所 | Network protocol variation detection method, device, electronic equipment and storage medium |
CN113870311B (en) * | 2021-09-27 | 2024-11-08 | 安徽清新互联信息科技有限公司 | A single target tracking method based on deep learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040239762A1 (en) * | 2003-05-21 | 2004-12-02 | Porikli Fatih M. | Adaptive background image updating |
WO2005036456A2 (en) * | 2003-05-12 | 2005-04-21 | Princeton University | Method and apparatus for foreground segmentation of video sequences |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6542621B1 (en) * | 1998-08-31 | 2003-04-01 | Texas Instruments Incorporated | Method of dealing with occlusion when tracking multiple objects and people in video sequences |
US6879705B1 (en) * | 1999-07-14 | 2005-04-12 | Sarnoff Corporation | Method and apparatus for tracking multiple objects in a video sequence |
US6826292B1 (en) * | 2000-06-23 | 2004-11-30 | Sarnoff Corporation | Method and apparatus for tracking moving objects in a sequence of two-dimensional images using a dynamic layered representation |
IL141650A (en) * | 2001-02-26 | 2005-12-18 | Elop Electrooptics Ind Ltd | Method and system for tracking an object |
JP4444583B2 (en) * | 2003-05-21 | 2010-03-31 | 富士通株式会社 | Object detection apparatus and program |
-
2007
- 2007-07-11 WO PCT/SG2007/000206 patent/WO2008008046A1/en active Application Filing
- 2007-07-11 SG SG200901121-4A patent/SG150527A1/en unknown
- 2007-07-11 WO PCT/SG2007/000205 patent/WO2008008045A1/en active Application Filing
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005036456A2 (en) * | 2003-05-12 | 2005-04-21 | Princeton University | Method and apparatus for foreground segmentation of video sequences |
US20040239762A1 (en) * | 2003-05-21 | 2004-12-02 | Porikli Fatih M. | Adaptive background image updating |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9129402B2 (en) | 2008-03-25 | 2015-09-08 | International Business Machines Corporation | Real time processing of video frames |
US9424659B2 (en) | 2008-03-25 | 2016-08-23 | International Business Machines Corporation | Real time processing of video frames |
US9418444B2 (en) | 2008-03-25 | 2016-08-16 | International Business Machines Corporation | Real time processing of video frames |
US9418445B2 (en) | 2008-03-25 | 2016-08-16 | International Business Machines Corporation | Real time processing of video frames |
US8934670B2 (en) | 2008-03-25 | 2015-01-13 | International Business Machines Corporation | Real time processing of video frames for triggering an alert |
US9142033B2 (en) | 2008-03-25 | 2015-09-22 | International Business Machines Corporation | Real time processing of video frames |
US9123136B2 (en) | 2008-03-25 | 2015-09-01 | International Business Machines Corporation | Real time processing of video frames |
US8934013B2 (en) | 2008-05-01 | 2015-01-13 | 3M Innovative Properties Company | Video camera and event detection system |
WO2009133377A1 (en) * | 2008-05-01 | 2009-11-05 | Pips Technology Limited | A video camera system |
US8934714B2 (en) | 2010-07-27 | 2015-01-13 | International Business Machines Corporation | Foreground analysis based on tracking information |
US8483481B2 (en) | 2010-07-27 | 2013-07-09 | International Business Machines Corporation | Foreground analysis based on tracking information |
US9460361B2 (en) | 2010-07-27 | 2016-10-04 | International Business Machines Corporation | Foreground analysis based on tracking information |
WO2014038924A3 (en) * | 2012-09-06 | 2014-06-26 | Mimos Berhad | A method for producing a background model |
US10152645B2 (en) * | 2016-05-16 | 2018-12-11 | Axis Ab | Method and apparatus for updating a background model used for background subtraction of an image |
US20170330050A1 (en) * | 2016-05-16 | 2017-11-16 | Axis Ab | Method and apparatus for updating a background model used for background subtraction of an image |
CN107368784A (en) * | 2017-06-15 | 2017-11-21 | 西安理工大学 | A kind of novel background subtraction moving target detecting method based on wavelet blocks |
US20210368112A1 (en) * | 2019-05-09 | 2021-11-25 | Tencent Technology (Shenzhen) Company Limited | Method for implanting information into video, computer device and storage medium |
EP3968627A4 (en) * | 2019-05-09 | 2022-06-29 | Tencent Technology (Shenzhen) Company Limited | Method for implanting information into video, computer device and storage medium |
US11785174B2 (en) | 2019-05-09 | 2023-10-10 | Tencent Technology (Shenzhen) Company Limited | Method for implanting information into video, computer device and storage medium |
CN117953015A (en) * | 2024-03-26 | 2024-04-30 | 武汉工程大学 | Multi-pedestrian tracking method, system, device and medium based on video super-resolution |
CN117953015B (en) * | 2024-03-26 | 2024-07-09 | 武汉工程大学 | Multi-row person tracking method, system, equipment and medium based on video super-resolution |
Also Published As
Publication number | Publication date |
---|---|
WO2008008046A1 (en) | 2008-01-17 |
SG150527A1 (en) | 2009-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2008008045A1 (en) | Method and system for context-controlled background updating | |
Sheikh et al. | Bayesian modeling of dynamic scenes for object detection | |
US9230175B2 (en) | System and method for motion detection in a surveillance video | |
Hongeng et al. | Video-based event recognition: activity representation and probabilistic recognition methods | |
Godbehere et al. | Visual tracking of human visitors under variable-lighting conditions for a responsive audio art installation | |
Zhou et al. | Real time robust human detection and tracking system | |
Mittal et al. | Motion-based background subtraction using adaptive kernel density estimation | |
US9846810B2 (en) | Method, system and apparatus for tracking objects of a scene | |
Choudhury et al. | An evaluation of background subtraction for object detection vis-a-vis mitigating challenging scenarios | |
EP1836683B1 (en) | Method for tracking moving object in video acquired of scene with camera | |
Herrero-Jaraba et al. | Detected motion classification with a double-background and a neighborhood-based difference | |
Ekinci et al. | Silhouette based human motion detection and analysis for real-time automated video surveillance | |
Vu et al. | Audio-video event recognition system for public transport security | |
Cheng et al. | Segmentation of aerial surveillance video using a mixture of experts | |
Chen et al. | Vision-based traffic surveys in urban environments | |
Kim et al. | Unsupervised moving object segmentation and recognition using clustering and a neural network | |
Shahbaz et al. | Probabilistic foreground detector with camouflage detection for sterile zone monitoring | |
Thakoor et al. | Automatic video object shape extraction and its classification with camera in motion | |
Al Najjar et al. | A hybrid adaptive scheme based on selective Gaussian modeling for real-time object detection | |
Al Najjar et al. | Robust object tracking using correspondence voting for smart surveillance visual sensing nodes | |
Al Najjar et al. | Object detection | |
Ali | Feature-based tracking of multiple people for intelligent video surveillance. | |
Cuevas et al. | Tracking-based non-parametric background-foreground classification in a chromaticity-gradient space | |
Tavakkoli et al. | Background Learning with Support Vectors: Efficient Foreground Detection and Tracking for Automated Visual Surveillance | |
Huang et al. | Region-level motion-based foreground detection with shadow removal using MRFs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07794224 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
NENP | Non-entry into the national phase |
Ref country code: RU |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 07794224 Country of ref document: EP Kind code of ref document: A1 |