WO2008008045A1 - Method and system for context-controlled background updating - Google Patents
Method and system for context-controlled background updating
- Publication number
- WO2008008045A1 (PCT/SG2007/000205)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- background
- pixels
- region
- image
- regions
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/174—Segmentation; Edge detection involving the use of two or more images
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/254—Analysis of motion involving subtraction of images
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/28—Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
Definitions
- the present invention relates broadly to a method and system for background updating, and to a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating.
- Adaptive background subtraction is typically the first fundamental step for video surveillance.
- a typical surveillance system consists of a stationary camera directed at the scene of interest.
- a pixel-level background model is then generated and maintained to keep track of the time-evolving background. Background maintenance is the crucial part that may affect the performance of background subtraction in time-varying situations.
- the methods of basic background subtraction employ a single reference image corresponding to the empty scene as the background model.
- a Kalman filter is usually used to follow the slow illumination changes. However, it has been realized that such a simple model is not suitable for surveillance in real-world situations.
- Adaptive background subtraction (ABS) techniques based on statistical models to characterize the background appearances at each pixel were developed for various complex backgrounds.
- Wren [9] employed a single Gaussian to model the color distribution for each pixel.
- the mixture of Gaussians (MoG) was introduced to model multiple background states, e.g., normal and shadow appearances, and complex variations, e.g., bushes waving in the wind.
- Many enhanced variants of MoG have been proposed in recent years.
- Some of the enhancements integrated gradients, depths, or local features into the Gaussians, and others employed non-parametric models, e.g. kernels, to replace the Gaussians.
- a method of background updating for adaptive background subtraction in a video signal comprising the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principal colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
- a first learning rate for the pixels that are occluded may be lower than a second learning rate for the pixels that are exposed.
- the method may further comprise the steps of determining whether said respective pixels that are exposed are detected as a background point or as a foreground point in a current background subtraction for the current image; and setting different learning rates for the adaptive background subtraction for exposed pixels that are detected as foreground points and for exposed pixels that are detected as background points respectively.
- a third learning rate for the exposed pixels that are detected as foreground points may be higher than the second learning rate for the exposed pixels that are detected as background points.
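As an illustration of the learning-rate tiers above, the sketch below (Python/NumPy) maps the occluded/exposed/foreground decisions to per-pixel learning rates. The concrete rate values and the mask names are illustrative assumptions; the patent only fixes the ordering of the rates.

```python
import numpy as np

# Illustrative rate values; only their ordering follows the text
# (occluded < exposed-background < exposed-foreground).
LOW, NORMAL, HIGH = 0.0, 0.01, 0.02

def select_learning_rates(occluded, foreground):
    """occluded, foreground: boolean masks of shape (H, W) for the current frame."""
    rates = np.full(occluded.shape, NORMAL, dtype=np.float32)
    rates[occluded] = LOW                     # first rate: occluded pixels learn slowly (or not at all)
    rates[~occluded & foreground] = HIGH      # third rate: exposed pixels detected as foreground
    # exposed pixels detected as background keep the second (normal) rate
    return rates
```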
- One contextual background representation type A may comprise a facility for the public such as a counter or a bench, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed may comprise the steps of evaluating, for each image region spatially corresponding to a type A background region, whether said each image region is occluded based on matching OHRs of the type A background region and of said each image region respectively and based on matching PCRs of the type A background region and of said each image region respectively; and determining all pixels of said each image region as either occluded or exposed depending on said evaluation.
- All pixels may be determined as exposed if a match likelihood in said evaluation is above a threshold value, and are determined as occluded otherwise.
- One contextual background representation type B may comprise a large homogeneous region such as a ground plane or a wall surface, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed may comprise the steps of evaluating, for each image region spatially corresponding to a type B background region, whether neighborhood regions around respective pixels in said each image region are occluded based on matching PCRs of the type B background region and of the respective neighborhood regions; and determining pixels of said each image region as either occluded or exposed depending on the respective evaluations.
- Each pixel may be determined as occluded if a majority of the neighborhood pixels in the neighborhood region of said each pixel are within said type B background region and few of those neighborhood pixels are evaluated as exposed based on a match likelihood being above a threshold value; the pixel is determined as exposed otherwise.
- the method may further comprise setting a zero learning rate for pixels belonging to foreground regions.
- the method may further comprise the step of performing adaptive background subtraction using said set rates for the respective pixels.
- the adaptive background subtraction may be based on a Mixture of Gaussians or Principal Feature Representation.
- the method may further comprise maintaining a model base for the contextual background representation types, the model base including models for different illumination conditions.
- the method may further comprise adjusting an appearance, a spatial characteristic, or both, of the models in the model base over a long duration compared with a frame duration in the video signal.
- a system for background updating for adaptive background subtraction in a video signal comprising means for defining one or more contextual background representation types; means for segmenting an image of a scene in the video signal into contextual background regions; means for classifying each contextual background region as belonging to one of the contextual background representation types; means for determining an orientation histogram representation (OHR), a principal colour representation (PCR), or both, of each background region; means for receiving a current image of the scene in the video signal; means for determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and means for setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
- a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principal colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
- Figure 1 shows a series of images illustrating adaptive background subtraction using the background updating method and system of the example embodiments.
- Figure 2 shows a flow chart illustrating a method of context-based background updating for adaptive background subtraction in the example embodiment.
- Figure 3 shows a series of images and histograms illustrating principal colour representation (PCR) in the example embodiment.
- Figure 4 shows a schematic drawing illustrating directed acyclic graphs (DAGs) for regions in consecutive frames in the example embodiment.
- Figure 5 shows a flow chart illustrating a method of multi-object tracking in a video signal in the example embodiment.
- Figure 6 shows a flow chart illustrating a method of stationary object tracking in a video signal in the example embodiment.
- Figure 7 shows a schematic drawing of an event detection system implementation using the example embodiment.
- Figure 8 shows a graph illustrating a finite state machine (FSM) representation for event detection in the system implementation of Figure 7.
- Figure 9 shows a schematic drawing of a computer system for implementing the example embodiment.
- the described embodiment provides a novel 2.5D method of multi-object tracking for real-time video surveillance.
- An appearance model, the principal color representation (PCR), is introduced; the PCR model characterizes the appearance of an object or a region with a few most significant colors.
- the likelihood of observing a tracked object in a foreground region is derived according to their PCRs.
- multi-object tracking is formulated as a Maximum A Posterior (MAP) problem over all the tracked objects. With the foreground regions provided by background subtraction, the problem of multi-object tracking is decomposed into two subproblems: assignment and location.
- each tracked object is assigned to a foreground region in the coming frame.
- its visual information will be excluded from the PCR of the region.
- multiple objects assigned to one region are located one-by-one according to their depth order.
- a two-phase mean-shift algorithm based on PCR is derived for locating objects.
- When an object is located, its visual information is excluded from the new position in the region. The operation of exclusion at the end of each iteration for assignment and location in the example embodiment can avoid multiple objects being trapped into the same region or position.
- the present specification also discloses apparatus for performing the operations of the methods.
- Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer.
- the algorithms and displays presented herein are not inherently related to any particular computer or other apparatus.
- Various general purpose machines may be used with programs in accordance with the teachings herein.
- the construction of more specialized apparatus to perform the required method steps may be appropriate.
- the structure of a conventional general purpose computer will appear from the description below.
- the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code.
- the computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.
- the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
- Such a computer program may be stored on any computer readable medium.
- the computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer.
- the computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system.
- the computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
- the invention may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the system can also be implemented as a combination of hardware and software modules.
- the distinctive background objects (regions) in the example embodiment are classified into two categories:
- Type-1 CBR: a facility for the public in the scene.
- Type-2 CBR: a large homogeneous region.
- Contextual descriptors are developed to characterize the distinctive appearances of CBRs and evaluate the likelihoods of observing them. Different contextual background regions may have different appearance features. Some manifest significant structural features, while others may have homogeneous color distributions.
- the example embodiment employs Orientation Histogram Representation (OHR) to describe the structural features of a region and Principal Color Representation (PCR) to describe the distribution of dominant colors.
- the OHR H_b is a simple and efficient variant of the robust local descriptor SIFT [1] for real-time processes. It is less sensitive to illumination changes and slight shifts of object position.
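A minimal sketch of such an orientation histogram is given below, assuming the OHR is a magnitude-weighted histogram of gradient orientations over the region; the bin count and normalization are illustrative choices rather than the patent's exact definition.

```python
import numpy as np

def orientation_histogram(gray_region, n_bins=8):
    """gray_region: 2D float/uint8 array covering the background region."""
    gy, gx = np.gradient(gray_region.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # orientations folded into [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    s = hist.sum()
    return hist / s if s > 0 else hist                # normalized OHR
```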
- the PCR for R'_b is defined as T_b = { s_b, { E_{b,k} = (c_{b,k}, p_{b,k}) }_{k=1}^{N} }, where s_b is the region size, c_{b,k} is the k-th principal color and p_{b,k} is its significance.
- δ(c_1, c_2) is a delta function. It equals 1 when the color distance d(c_1, c_2) is smaller than a small threshold, and 0 otherwise.
- the color distance used here is d(c_1, c_2) = 1 − 2⟨c_1, c_2⟩ / (‖c_1‖² + ‖c_2‖²)  (5a), where ⟨·,·⟩ denotes the dot product [2, 3].
- the principal color components E_{b,k} are sorted in descending order according to their significance values p_{b,k}.
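The sketch below implements the delta function and the color distance in the reconstructed form above; the threshold value is an illustrative assumption.

```python
import numpy as np

def color_distance(c1, c2):
    """Normalized color distance, assumed form d = 1 - 2<c1,c2>/(||c1||^2 + ||c2||^2)."""
    c1 = np.asarray(c1, dtype=np.float32)
    c2 = np.asarray(c2, dtype=np.float32)
    denom = np.dot(c1, c1) + np.dot(c2, c2)
    return 1.0 - 2.0 * np.dot(c1, c2) / denom if denom > 0 else 0.0

def delta(c1, c2, eps=0.02):
    # 1 when the two colors are close enough to count as the same principal color
    return 1 if color_distance(c1, c2) < eps else 0
```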
- a type-1 CBR in the example embodiment is associated with a facility which has a distinctive structure and colors in the image. Both OHR and PCR are used to characterize the type-1 CBR.
- let R'_{b1,i} be the i-th type-1 CBR in the scene. Its contextual descriptors are H_{b1,i} and T_{b1,i}.
- a type-1 CBR has just two states: occluded (occupied) or not.
- the likelihood of observing a type-1 CBR is evaluated on the whole region.
- the contextual descriptors of the region R_t(x), taken from the position corresponding to R'_{b1,i} in the current frame I_t(x), are H_t and T_t.
- the likelihood of R'_{b1,i} being exposed can be evaluated by matching R_t(x) to R'_{b1,i}.
- if R_t(x) and R'_{b1,i} are similar, P_L(H_t | R'_{b1,i}) is close to 1; otherwise, it is close to 0.
- the first term is the likelihood based on the partition evidence of the principal color c^k_{b1,i}. It is evaluated from the PCRs of R'_{b1,i} and R_t(x).
- the type-2 CBRs in the example embodiment are large homogeneous regions. Only the PCR descriptor is used for each of them. Usually only part of a type-2 CBR is occluded when a foreground object overlaps it. The likelihood of observing a type-2 CBR is evaluated locally.
- the appearance model of a type-1 CBR in the example embodiment consists of its OHR and PCR.
- the spatial model is M_s(R'_{b1,i}) = (s'_{b1,i}, x'_{b1,i}).
- a model base which contains up to K_b appearance models of R'_{b1,i} is used.
- the models in the base are learned incrementally.
- the active appearance model is the one from the model base which best fits the current appearance of the CBR.
- let D be a time duration of 3 to 5 minutes (not limiting) in the example embodiment, i.e. a long duration compared with the frame duration in the video signal.
- the times of observing the i-th type-1 CBR during the period are accumulated as
- the new appearance model M_c^a(R'_{b1,i}) is compared with the ones in the model base according to (11a). If one is sufficiently close to the new model (i.e. the similarity exceeds the threshold T_u by a margin), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
- a model base is employed to deal with the appearance variations of the type-2 CBRs from day to night.
- the models in the model base are learned incrementally through the time durations.
- the overlapping ratio between the exposed parts and the spatial model for R'_{b2} at time t is computed from their spatial overlap.
- a new appearance model M_c^a(R'_{b2}) is generated from the current duration. If the average similarity values are low in two consecutive durations, the active appearance model will be replaced. If there is a model in the base which is close enough to the new appearance model M_c^a(R'_{b2}) (i.e. the similarity exceeds the threshold T_u by a margin), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
- the prior probability P(R_t(x)) is the same for every pixel in an image. Then the log posterior probability of R_t(x) belonging to R'_b in the current frame I_t(x) is defined as
- the position of a type-1 CBR is already determined by its spatial model.
- this spatial term is 1 for the position, and 0 otherwise.
- a rate of occluded times over recent frames for each type-1 CBR is used. For R'_{b1,i}, the rate is computed as
- a rate of occluded times over recent frames at each pixel for each type-2 CBR is used.
- an occluded pixel of a type-2 CBR is confirmed on the local neighborhood R_t(x).
- let r_1 be the proportion of pixels belonging to R'_{b2} in the neighborhood region, and let r_2 be the proportion of exposed pixels of R'_{b2} in R_t(x) according to the posterior estimates.
- a control code C_t(x) is used, where the value of C_t(x) is 0, 1, 2, or 3; the values 1, 2 and 3 indicate that the low, normal, or high learning rate respectively is applied at the pixel, while 0 denotes the normal learning rate for non-context pixels (used for display).
- C_t(x) = 0 is set.
- four control code images are used.
- the first two are the previous and current control code images described above, i.e., C_{t-1}(x) and C_t(x), and the second two are the control codes actually applied for pixel-level background maintenance, i.e., C*_{t-1}(x) and C*_t(x).
- to evaluate the example embodiment, two existing methods of ABS to which it was applied were implemented. They are the methods based on Mixture of Gaussians (MoG) [4] and Principal Feature Representation (PFR) [2]. Hence, four methods, MoG, Context-Controlled MoG (CC MoG), PFR, and Context-Controlled PFR (CC PFR), were compared.
- the normal learning rate of the example embodiment as described above was set to the constant learning rate used for the existing methods of ABS. The high learning rate was set to double the normal learning rate and the low learning rate was set to zero.
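A minimal sketch of this context-controlled update is shown below as a running-average background model. The control-code-to-rate mapping follows the description above (1 = low, 2 = normal, 3 = high, 0 = normal for non-context pixels); the base rate value and the running-average form are assumptions, not the patented ABS methods (MoG, PFR) themselves.

```python
import numpy as np

ALPHA_NORMAL = 0.01                                    # assumed base learning rate
# control code -> learning rate: 0 non-context (normal), 1 low (zero), 2 normal, 3 high (2x normal)
RATE_LUT = np.array([ALPHA_NORMAL, 0.0, ALPHA_NORMAL, 2 * ALPHA_NORMAL], dtype=np.float32)

def update_background(bg, frame, control_code):
    """bg, frame: float arrays (H, W, 3); control_code: int array (H, W) with values 0..3."""
    alpha = RATE_LUT[control_code][..., None]           # per-pixel learning rate
    return (1.0 - alpha) * bg + alpha * frame.astype(np.float32)
```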
- the leftmost image 102 is a snapshot with manually cropped-out contextual background regions.
- the second column 108 shows a sample frame from the sequence 110 and the corresponding ground truth 112 of the foreground.
- the rest of the images in the upper row 114 are: the segmented results by MoG 116, CC MoG (Context-Controlled MoG) 118, and the corresponding control image 120.
- the three images in the lower row 122 are the segmented results of PFR 124 and CC PFR (Context-Controlled PFR) 126, and the corresponding control image 128.
- the black regions, e.g. 130, do not belong to any CBR
- the gray regions e.g. 132 are exposed parts of the CBRs with no significant appearance changes
- the white regions e.g. 134 are occluded parts of the CBRs.
- for pixels in exposed parts of the CBRs, the normal learning rate is applied; for pixels in regions of occluded parts of the CBRs, the low learning rate is used.
- the high learning rate would be used as described above.
- the scene in the image 102 is a meeting room with four marked type-2 CBRs for the table surface, the ground surface, wall surfaces, and the chair. In this sequence of 5250 frames, there were no overstaying objects or overcrowding. However, several people (e.g. 138) kept moving around, staying somewhere for a while, and performing various activities.
- the contextual features of the example embodiment capture the global information. Such global information may not always lead to a precise segmentation in position, especially along boundary regions of objects. However, if fed with correct samples continuously, the pixel-level statistical models can be tuned to characterize the background appearance accurately at each pixel. Then the pixel-level background models can preferably be used to achieve a precise segmentation of foreground objects.
- the example embodiment exploits contextual interpretation to control the pixel-level background maintenance for adaptive background subtraction.
- Experimental results show that the example embodiment can improve the performance of adaptive background subtraction for at least situations of high foreground complexities.
- FIG. 2 shows a flow chart 200 illustrating a method of background updating for adaptive background subtraction in a video signal according to the example embodiment.
- one or more contextual background representation types are defined.
- an image of a scene in the video signal is segmented into foreground and background regions.
- each background region is classified as belonging to one of the contextual background representation types.
- an orientation histogram representation (OHR), a principal colour representation (PCR), or both, are determined for each background region.
- a current image of the scene in the video signal is received.
- it is determined whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed.
- different learning rates are set for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
- object tracking may be applied to a sequence of segmented images generated by background subtraction.
- each segmented image may contain one or several isolated foreground regions.
- each region may consist of one target object (e.g., a walking person) or a group of target objects (when objects overlap from the camera view point).
- the example embodiment uses the principal color representation (PCR) for modeling and characterizing the appearance of target objects as well as the segmented regions.
- each image may contain one or several objects. These objects in the image may overlap on some occasions. Further, the poses, scales, and motion modes of objects can change significantly during the overlap. It has been recognized by the inventors that these issues make shape-based object tracking a rather challenging task. However, the inventors have recognized that it is much less likely that a target object changes colors in a sequence from a surveillance camera. Hence, using global color features of an individual object can provide a relatively stable and constant way of describing object appearance. This can also lead to a better discrimination of multiple target objects in the scene.
- an object of interest, e.g. a person, vehicle, or luggage, may render a few dominant colors which only span a small portion of the entire color space.
- s_n is the size of the region (or the total number of the pixels within the region)
- c_n^i = (r_n^i, g_n^i, b_n^i) is the RGB value of the i-th most significant color under the original color resolution (i.e., 256 levels for each channel)
- s_n^i is the significance of c_n^i for the region.
- the components E_n^i are sorted in descending order according to the significance values of the principal colors. Let the current frame of input color images be I_t(x); then the significance of the i-th principal color can be defined as s_n^i = Σ_{x∈R_t^n} w(x)·δ(I_t(x), c_n^i)
- w(x) is a weight function and δ(·,·) is a delta function.
- other weight functions can be used, e.g. a Gaussian kernel to suppress the noise around the object's boundary [5].
- a color distance is used which is not sensitive to noise and illumination changes
- the PCR T_t^n contains the first N significant colors and their statistics for the region R_t^n(x). Since a region of one or a few objects manifests only a few dominant colors, it is possible to find a small number N to approximate the color features of the region, i.e.,
- Fig. 3 shows two examples of PCRs where one image 300 contains two isolated individuals and another image 302 contains a group of 5 persons.
- the PCRs for the foreground regions are generated through scanning the respective regions, and are shown in the histograms 312, 314 respectively. Details of the algorithm for the foreground region R_t^n(x) (see white areas e.g. 304, 306 in the segmented images 308, 310 respectively) are summarized in Table 2.
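A simplified sketch of building such a PCR for a segmented foreground region follows. It quantizes colors into bins instead of using the delta/distance test described earlier, and the number of principal colors is an illustrative choice; it is not the exact algorithm of Table 2.

```python
import numpy as np
from collections import Counter

def build_pcr(frame, mask, n_colors=10, quant=32):
    """frame: uint8 (H, W, 3); mask: bool (H, W) foreground region.
    Returns (region_size, [(color, significance), ...]) sorted by significance."""
    pixels = frame[mask]
    s_n = len(pixels)
    if s_n == 0:
        return 0, []
    quantized = (pixels // quant) * quant + quant // 2    # crude stand-in for principal-color matching
    counts = Counter(map(tuple, quantized))
    principal = counts.most_common(n_colors)              # N most significant colors, descending
    return s_n, [(np.array(c, dtype=np.uint8), cnt) for c, cnt in principal]
```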
- the aim of object tracking is to locate a tracked object in the coming frame according to its previous appearance.
- the likelihood, or the conditional probability of observing the tracked object in a region of the current frame, has to be evaluated.
- the likelihoods are first defined based on the original and normalized PCRs of the tracked object and a region. This is then extended to the scale-invariant likelihood.
- each P(R_t^n | E_m^i) is the likelihood of the object O_{m,t-1} appearing in the region R_t^n based on the partition evidence E_m^i,
- P(E_m^i | O_{m,t-1}) is the conditional probability of the evidence E_m^i given the object O_{m,t-1}.
- the likelihood based on the normalized PCRs is more accurate than that based on the original PCRs.
- the likelihood on original PCRs is better.
- the scale-invariant likelihood of observing a given object O_{m,t-1} in the region R_t^n is defined as
- Eq. (11) can provide a suitable measurement for these two cases in the example embodiment.
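The sketch below illustrates the idea of a PCR-based likelihood as the fraction of the tracked object's principal-color mass that can be matched in the region, normalized by the object's own size. It reuses color_distance and the PCR layout from the earlier sketches and is an illustration, not the exact expression (11).

```python
def pcr_likelihood(obj_pcr, region_pcr, eps=0.02):
    """obj_pcr, region_pcr: (size, [(color, significance), ...]) as produced by build_pcr."""
    obj_size, obj_colors = obj_pcr
    _, region_colors = region_pcr
    if obj_size == 0:
        return 0.0
    matched = 0.0
    for c_o, s_o in obj_colors:
        for c_r, s_r in region_colors:
            if color_distance(c_o, c_r) < eps:    # treated as the same principal color
                matched += min(s_o, s_r)
                break
    return matched / obj_size                     # normalizing by object size keeps it scale-aware
```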
- Object tracking in video surveillance aims at maintaining a unique identification (id) for each target object and providing its time-dependent positions in the scene.
- Multi-object tracking can be formulated as a global Maximum A Posterior (MAP) problem for all the tracked objects.
- the global MAP problem can be approximately decomposed as two subproblems: assignment and location.
- using the PCR-based likelihood function, the example embodiment applies sequential solutions to these two subproblems, as detailed below.
- the objects in a group region, e.g. R_{t-1}.
- the inter-frame movements of target objects are usually small. This implies that there is always an overlap between the regions of the same object in the consecutive frames. Exploiting such a relation, the problem (13) can be further decomposed by using directed acyclic graphs (DAGs).
- the directed acyclic graphs (DAGs) for the regions detected in the consecutive frames I_{t-1}(x) and I_t(x) are constructed in the following way.
- let the regions from the previous and current frames be denoted as nodes and be laid out in two layers: the parent layer and the child layer.
- the parent layer consists of nodes representing the regions {R_{t-1}^j}_j in the previous frame I_{t-1}(x).
- the child layer consists of nodes denoting the regions {R_t^j}_j in the current frame I_t(x).
- a directed acyclic graph (DAG) is formed by a set of nodes in which every node connects to one or more nodes in the same group.
- a set of DAGs (graphs) can be generated.
- An example of graphs for two consecutive frames is illustrated in Figure 4.
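A minimal sketch of this graph construction is given below: parent and child regions are linked by bounding-box overlap (following the small inter-frame motion assumption stated later), and connected nodes are grouped into DAGs with a simple union-find. The box format is an assumption.

```python
def boxes_overlap(a, b):
    """Boxes as (x0, y0, x1, y1)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def build_dags(parent_boxes, child_boxes):
    """Returns (edges, groups): parent->child edges and connected components (the DAGs)."""
    n, m = len(parent_boxes), len(child_boxes)
    group = list(range(n + m))                      # union-find over parent+child nodes
    def find(i):
        while group[i] != i:
            group[i] = group[group[i]]
            i = group[i]
        return i
    edges = []
    for i, pb in enumerate(parent_boxes):
        for j, cb in enumerate(child_boxes):
            if boxes_overlap(pb, cb):
                edges.append((i, j))
                group[find(i)] = find(n + j)        # merge the two nodes into one graph
    dags = {}
    for i in range(n + m):
        dags.setdefault(find(i), []).append(i)
    return edges, list(dags.values())
```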
- the notations for the DAGs are defined as follows. For the i-th graph, the parent nodes are denoted as {n_0^p}_p, where each node n_0^p represents one of the regions from the previous frame.
- the i-th DAG can thus be denoted as
- if a node has only one associated object it represents a single object; otherwise it represents a group of objects.
- the object o_{t-1}^m is one of the objects {o_{t-1}^m}_m associated with a parent node.
- the objects in a child node n_t^q, which may be newly generated objects or objects tracked from its parent nodes, are denoted as {o_t^{q,k}}_k.
- the objects in the child nodes are reordered as {o_t^m}_m. They are the set of tracked target objects in the current frame.
- the example embodiment solves the problem in two sequential steps from coarse to fine.
- the problem is decomposed approximately as two sub- problems: assignment and location.
- the coarse assignment process assigns each object in a parent node to one of its child nodes while the fine location process determines the new states of the objects assigned to each child node.
- the assignment parameters can be considered as the coarse tracking parameters indicating in which child nodes (regions) the objects are observed, without concern for the exact new positions of the tracked objects in the child regions.
- the new states of the tracked objects assigned to each child node are determined.
- let {o_t^{q,k}}_k be the objects assigned to the child node n_t^q from its parent nodes. That is, O_t^q is a subset of O_{t-1} according to the assignment parameters.
- objects in each child node can be tracked independently of objects in the other child nodes.
- the posterior probability of the new states for the tracked objects in the graph G_i can be evaluated as
- Multi-object tracking thus becomes finding the solutions for Eqs. (18) and (20) in the example embodiment. Further sequential solutions for (18) and (20) based on PCR are used and described below. Assuming that {R_t^k}_k are the foreground regions and {B_t^k}_k are their bounding boxes detected at time t, their PCRs can be obtained accordingly. Let the set of directed acyclic graphs (DAGs) for the foreground regions between the consecutive frames be given. If there is only one object in a graph G_i, then the object will be tracked as an isolated object. Otherwise, multi-object tracking will be performed according to Eqs. (18) and (20).
- the posterior probability of the new state for each object is determined on both spatial position and depth relationship.
- a 2.5D state is used for each object.
- if the i-th graph consists of only one child node (i.e., G_i = {n_t^1}), a new object appears and is initialized in G_i with a new id number.
- the node n_t^1 represents the region R_t^1.
- its 2.5D state parameter vector is (b_t^1, ℓ_t^1).
- the graph represents the simple case of isolated object tracking.
- let the graph be G_i = {n_0^p, n_t^1}, with one parent node and one child node.
- let the object in the parent node be o_{t-1}^1, and let the child node n_t^1 represent the region R_t^1.
- the object o_{t-1}^1 is updated as o_t^1 in n_t^1 (i.e., o_{t-1}^1 and o_t^1 have the same id number).
- its PCR is updated from the current frame to follow the gradual variation of the object.
- the previous objects in the parent node are assumed to have disappeared in the current frame. Tracking is terminated for these objects.
- if the i-th graph G_i contains multiple parent nodes or child nodes, the operations of assignment and location will be performed.
- the index i for the graph G_i is omitted below.
- let n_0^p be a parent node in the graph G, let {o_{t-1}^m}_m be its associated objects, and let {n_t^q}_q be its child nodes. If the parent node has more than one child node, the assignment of the objects {o_{t-1}^m}_m is determined by Eq. (18). However, with varying numbers of objects and child nodes, Eq. (18) is a nontrivial problem of optimal configuration. To make the problem tractable, a sequential solution is proposed based on their PCRs and the depth relations among the objects.
- the close and non-occluded objects have richer visible information than the distant or occluded objects. This means that an occluded object has less effect on the tracking of the objects occluding it.
- the assignment can be solved sequentially from the most visible one to the least visible one. Let the objects {o_{t-1}^m}_m in the parent node n_0^p be ordered according to their visible sizes.
- the assignments of the objects are not independent. The assignment of one object is affected by the previous objects with higher ranks. This means the assignment of each object can be performed one-by-one sequentially from the most to the least visible ones. For each object, the posterior probability of assignment can be evaluated using Bayes' rule,
- (22) can be evaluated on PCR. Assume that the PCRs of the child-node region B_t^q and of the object o_{t-1}^m are given, respectively. Using Eqs. (21) and (22), the best assignment of the objects can be achieved one-by-one sequentially according to their depth order by (23).
- Locating the objects in a region is not an independent problem for each object, but the front ones with richer visible information are less affected by the occluded ones.
- objects in the node are located one by one from the most visible to the least visible ones based on their visible parts.
- the posterior probability of new states for all the objects in the node can be expressed as
- the sequential solution to the problem Eq. (20) and Eq. (26) contains two steps.
- in the first step, the visible parts of the objects in the node are estimated, and the objects are sorted according to their visible sizes.
- in the second step, an iterative process is applied to locate the objects one-by-one in the region with a mean-shift algorithm based on PCR. When an object is located, its visual evidence is excluded from its position in the region. The details are described in the following.
- a smoothing weight, set to 0.5 in this study, is used to smooth the visible-size estimates from consecutive frames. The objects are then sorted in descending order according to their estimated visible sizes and placed in a list.
- a weight image w_{n-1}(x) is used over the region R_t^q. If the pixel x is likely to belong to one of the previously located objects (o_t^1, ..., o_t^{n-1}), w_{n-1}(x) is low (≈ 0); otherwise, it is high (≈ 1).
- locating the object o_t^n in the region R_t^q according to Eq. (26) is equivalent to finding a position where the maximum value of the probability density for observing the object occurs.
- This density maximum can be found by employing a mean-shift procedure with a weight mode which can reveal the probability density of observing the object in the neighborhood [5], [6], [7].
- the weight of evidence from the principal color c_n^k is defined as the backprojection
- the weight implies that if the color is more significant in the region than in the object model, only a proportion of the pixels with color c_n^k in the bounding box B_t^n belong to the object o_t^n; otherwise all the pixels of color c_n^k belong to the object.
- the mean-shift procedure is terminated once the convergence condition is satisfied, or the maximum number of iterations is reached (6 in the example embodiment).
- the new location of the object o_t^n is the bounding box b_t^n centered at the converged position. Let T_t^n be the PCR of the part of the region within the bounding box b_t^n, where s_t^n is the size of the part and the significance is computed as in the PCR definition.
- the likelihood of the object o_t^n in the group can be estimated according to Eq. (11).
- the new state parameter of the object o_t^n is then obtained.
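The sketch below illustrates one PCR-driven mean-shift location step as described above: pixels in the current box are weighted by a backprojection of the object's principal colors, and the box is shifted to the weighted centroid for at most 6 iterations. The per-pixel color-index image and the weight dictionary are assumed inputs, and boundary clamping is omitted.

```python
import numpy as np

def mean_shift_locate(color_index, backproj_weight, box, max_iter=6, tol=1.0):
    """color_index: (H, W) int array of principal-color indices per pixel;
    backproj_weight: dict index -> weight in [0, 1]; box: (x0, y0, x1, y1)."""
    lookup = np.vectorize(lambda c: backproj_weight.get(int(c), 0.0), otypes=[np.float32])
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    for _ in range(max_iter):
        weights = lookup(color_index[y0:y1, x0:x1])    # backprojection of the object's colors
        total = weights.sum()
        if total == 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        cx, cy = (weights * xs).sum() / total, (weights * ys).sum() / total
        nx0, ny0 = int(round(cx - w / 2)), int(round(cy - h / 2))
        if abs(nx0 - x0) < tol and abs(ny0 - y0) < tol:
            break                                      # converged: shift below tolerance
        x0, y0, x1, y1 = nx0, ny0, nx0 + w, ny0 + h
    return (x0, y0, x1, y1)
```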
- a.1: if the parent node has no child node, the objects in it are deleted; a.2: if it has only one child node, all the objects in it are assigned to that child node; a.3: if it has multiple child nodes; a.3.1: sort the objects {o_{t-1}^m}_m in the parent node, and then assign them one-by-one from the first to the last as follows: a.3.1.1: assign o_t^m to the child node n_t^{q_m} according to (23); a.3.1.2: exclude the visual information of o_t^m from the PCR of n_t^{q_m}.
- the algorithm in the example embodiment includes two phases of processing for each DAG (Directed Acyclic Graph): assignment and location.
- in the assignment phase, each parent node in the DAG is processed.
- in the location phase, the assigned objects in each child node are tracked.
- small objects in a group with likelihood values less than 0.1 are set as disappeared.
- the records of disappeared objects are kept for 50 frames.
- when a new object is detected, it is compared with disappeared objects according to their PCRs, sizes and distances. If it matches a disappeared object, the tracking will be restored; otherwise a new object is created.
- segmenting individual persons in a group with domain knowledge will be preferred. For example, in the example embodiment knowledge about the sizes and aspect ratios of persons in the scene is used to adapt to segmentation errors.
- Figure 5 shows a flow chart 500 illustrating a method of multi-object tracking in a video signal in the example embodiment.
- first and second segmented images of two consecutive frames of the video signal respectively are received, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked.
- one or more directed acrylic graphs (DAGs) are generated for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node.
- DAGs directed acrylic graphs
- step 506 for each parent node having two or more child nodes, a) the corresponding objects of the foreground region contributing to said each parent node are sorted according to estimated depth in said first image; b) the corresponding object having the lowest depth is assigned to one of the child nodes of said each parent node; c) a visual content of the assigned corresponding object is removed from the visual data associated with said one child node; and steps b) to c) are iterated in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
- step 508 for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image.
- step 510 for each child node having two or more corresponding objects assigned thereto, d) the corresponding objects are sorted according to estimated depth in said each child node in said second image; e) a mean-shift calculation is applied to locate the corresponding object having the lowest depth in said each child node; f) the state and the visual content of the located corresponding object are updated based on the second image; g) the updated visual content of the located corresponding object is removed from the visual data associated with said each child node; and steps e) to g) are iterated in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
- a layer tracking algorithm is designed to track stationary objects through even frequent occlusions.
- the object is identified as a moving object and tracked by a moving object tracking algorithm.
- the stationary objects include not only static non-living objects but also include motionless living objects, e.g. a standing or sitting person. Since the living objects may move again, the switching between moving object tracking and stationary object tracking for the target object is preferably smooth with no change of identity in the example embodiment.
- a template image of the object is used to represent such a stationary object in the example embodiment.
- let {B_j}_{j=1}^{τ_b} be a sequence of bounding boxes of the i-th tracked object in the τ_b most current frames as tracked by a moving object tracking algorithm. If the object has stopped moving, the bounding boxes will overlap each other.
- if the spatial intersection of all the boxes is not empty, the object is detected as a stationary object in the example embodiment. In the example embodiment, but not limiting, τ_b is set as 10 frames, corresponding to about 1 second.
- a layer representation based on the object's template image is built. The layer representation of the detected stationary object is defined as
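A minimal sketch of the stationarity test just described: the object is declared stationary when the intersection of its last τ_b bounding boxes is non-empty (τ_b = 10 per the text). The box format is an assumption.

```python
def is_stationary(boxes, tau_b=10):
    """boxes: list of (x0, y0, x1, y1) for the most recent frames, newest last."""
    if len(boxes) < tau_b:
        return False
    recent = boxes[-tau_b:]
    x0 = max(b[0] for b in recent)
    y0 = max(b[1] for b in recent)
    x1 = min(b[2] for b in recent)
    y1 = min(b[3] for b in recent)
    return x0 < x1 and y0 < y1          # non-empty common intersection of all boxes
```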
- T^i is the Principal Colour Representation (PCR) of the object.
- the template image is based at least on the last frame of the sequence used in detecting the object as a stationary object.
- d_A^j is the difference measure between the template A^i and the frame I_j(s) for the corresponding region of A^i
- d_c^j is the difference measure between the consecutive frames I_{j-1}(s) and I_j(s) for the region of the template
- d_p^j is the visibility measure of the object from the corresponding region in the frame I_j(s)
- s_k is an estimated state of the stationary object at time k. Measures in the τ_d most current frames and states in the τ_s most current frames are recorded. The details of calculating these measures and estimating states from them for each layer object will be described below.
- the example embodiment can greatly enhance object tracking.
- let c_t(s) be the color of a foreground point in the region of the i-th template image. According to Bayes' rule, the probability of the point belonging to the background is
- the background probability p(c | b) can be obtained from the Principal Feature Representation (PFR) of the background.
- let s = (x, y) be a pixel of the image.
- p_v^s(b) is the learned probability of s belonging to the background (P_s(b)) based on the observation of the feature v
- S_v^s(i) records the statistics of the M_v most frequent feature vectors at s
- each S_v^s(i) contains three components
- the first N_v elements are used as principal features.
- Three types of features are used in the example embodiment. They are a spectral feature (color), a spatial feature (gradient), and a temporal feature (color co-occurrence), respectively. Among them, color and gradient features are stable for static background parts and color co-occurrence features are suitable for dynamic background parts.
- N_s(b) is the number of background points in a small window W_s centered at s in the previous frame
- M_s is the number of points within the window.
- the probabilities p(c | l) and p(c | f) can be calculated with Gaussian kernels. Let c'_x be the color of point x in the template A^i within the window W_s. Then p(c | l) can be calculated as
- p(c | l) = max_{x ∈ W_s} { k_c(c'_x − c) · k_s(x − s) }  (7b)
- let c'_x be the color of a point x in the window W_s and in the region of moving foreground objects from the last frame I_{t-1}(s).
- the probability p(c | f) can be calculated analogously.
- the priors can be calculated as
- N_s(l) and N_s(f) are the numbers of points belonging to the layer object and to moving objects within the window W_s in the previous frame.
- the pixel s would be assigned according to the greatest likelihood value.
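The following sketch illustrates this three-way per-pixel decision: each point in the template region is assigned to the background, the layer object, or moving foreground by the largest posterior, with priors taken from local neighborhood counts. The likelihood arguments stand in for the PFR-based and kernel-based estimates in the text.

```python
import numpy as np

def classify_pixel(lik_background, lik_layer, lik_moving, n_b, n_l, n_f):
    """lik_*: likelihoods of the observed color under each hypothesis;
    n_b, n_l, n_f: counts of background, layer-object and moving-object points
    in the window W_s of the previous frame."""
    total = float(n_b + n_l + n_f)
    if total == 0:
        return "moving"                              # no local evidence: keep as foreground
    priors = np.array([n_b, n_l, n_f], dtype=np.float64) / total
    posts = np.array([lik_background, lik_layer, lik_moving]) * priors
    return ("background", "layer", "moving")[int(np.argmax(posts))]
```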
- the mask for the moving objects is used as the input for moving object tracking.
- Stationary objects may also be involved in several changes and interactions with other objects through the sequence.
- a non-living object it may e.g. undergo illumination changes, be occluded and removed by other objects.
- the object may change pose or move body parts, or start moving again.
- the object's states are estimated and the template image updated correspondingly in the example embodiment.
- five states are used to describe the layer object: motionless, occluded, removed, inner-motion, and start-moving.
- the state is estimated according to various change measures from a short sequence of most recent frames. Let s be a point in the template A^i(s) of the i-th layer object. The difference between the template and a current frame at s can be evaluated as
- Th_d is the threshold according to image noise.
- S_A^i is the size of the template.
- the difference measure between consecutive frames for the layer object is defined as
- the difference measures are calculated on color vectors.
- the visibility (visibility measure d_p^j) of the object in the current frame based on PCR would still be high, since the PCR is a global representation not related to spatial information.
- the visibility of the layer object in the current frame would be low.
- let T^i be the PCR of the layer object that was stored when the object was detected as a stationary object
- let T_t^i be the PCR from the region overlapped by the template A^i in the current frame. Then the visibility measure of the layer object in the current frame can be evaluated as
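A sketch of the three measures used for state estimation is given below: template-vs-frame difference d_A, frame-vs-frame difference d_c, and PCR-based visibility d_p. The concrete expressions (per-pixel difference norm and noise threshold) are assumptions, and pcr_likelihood is the helper sketched earlier.

```python
import numpy as np

def layer_measures(template, prev_frame, frame, mask, template_pcr, frame_pcr,
                   noise_thresh=20.0):
    """template, prev_frame, frame: float (H, W, 3); mask: bool (H, W) template support."""
    diff_a = np.linalg.norm(frame - template, axis=2)[mask]
    d_a = float((diff_a > noise_thresh).mean())        # fraction of changed template pixels
    diff_c = np.linalg.norm(frame - prev_frame, axis=2)[mask]
    d_c = float((diff_c > noise_thresh).mean())        # inter-frame change within the template
    d_p = pcr_likelihood(template_pcr, frame_pcr)      # visibility from the PCR sketch above
    return d_a, d_c, d_p
```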
- let O_{t-1}^i be an object in I_{t-1}(s)
- let O_t^n be a region in I_t(s).
- the probability of observing O_{t-1}^i in O_t^n can be computed as
- the states of the tracked layer object are estimated by heuristic rules in the example embodiment:
- the parameters for the rules are determined according to a knowledge base of human perceived semantic meanings and an evaluation from real-world videos in the example embodiment.
- the difference measures d_A^j and d_c^j are low if they are less than 0.25, and moderate if they fall within an intermediate range.
- the visibility measure d_p^j is low if it is less than 0.6; otherwise, it is high.
- the measure of shape shift is calculated by checking the expanding foreground pixels along the boundary of the template A^i. If the number of expanded pixels is larger than 50% of the template size, the "shift" of the object is detected. It will be appreciated that for some videos from specific cameras, e.g. cameras with unstable signals, adjustment of the thresholds may be required in different embodiments and as based on the relevant knowledge base.
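The heuristic rules can be sketched as below, consistent with the stated thresholds (0.25 for the difference measures, 0.6 for visibility, 50% boundary expansion for shape shift). The exact rule set in the example embodiment, in particular how occluded and removed are separated over time, is richer than this illustration.

```python
def estimate_layer_state(d_a, d_c, d_p, shape_shift, low=0.25, vis=0.6):
    """Returns one of the five layer states; decision boundaries are illustrative."""
    if shape_shift:                        # expanding foreground along the template boundary
        return "start-moving"
    if d_a < low and d_c < low and d_p >= vis:
        return "motionless"                # little change, object still visible
    if d_p >= vis:
        return "inner-motion"              # visible but its appearance is changing
    if d_c < low:
        return "removed"                   # not visible and the region has settled again
    return "occluded"                      # not visible while other objects move over it
```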
- the layer model is maintained to adapt to real variations of the object without being affected by other objects in the scene.
- if the layer object is confirmed as being motionless, a smoothing operation is performed on the template image. If the object is recognized as being in the inner-motion state, the new image of the object in the current frame will replace the template. If the object is occluded, no updating will be performed. If the object is classified as start-moving, the object will be transformed into a moving object with the same ID and corresponding PCR, mask, and position for tracking by a moving object tracking algorithm. The layer representation of the object will be deleted. If the object is detected as removed, the object will be transformed into a disappeared object and its layer representation will be destroyed. With these operations, a target object moving around, staying somewhere for a while, and moving again can be tracked continuously and seamlessly by combining the example embodiment with the moving object tracking algorithm described for the example embodiment.
- Figure 6 shows a flow chart 600 illustrating a method of object tracking in a video signal according to the example embodiment.
- step 602 it is detected that a tracked moving object has become stationary over a sequence of frames.
- a template image of the stationary object is generated based on at least one of the frames in the sequence.
- step 606 a state of the stationary object is tracked based on a comparison of the template image with a current frame of the video signal.
- the structure diagram of an event detection system 700 implementation incorporating the described example embodiment is shown in Figure 7. It contains four fundamental modules: a foreground segmentation module 701, a moving object tracking module 702, a stationary object tracking module 704, and an event detection module 706.
- the foreground segmentation module 701 performs the background subtraction and learning and includes the method and system for background updating of the example embodiment described above, applied to e.g. the adaptive background subtraction method proposed in [8].
- the background model used in the example implementations employs Principal Feature Representation (PFR) at each pixel to characterize background appearance.
- the moving objects are tracked with the deterministic 2.5D multi-object tracking algorithm of the described example embodiment in the moving object tracking module 702.
- moving objects are represented by the models of principal color representation which exploits a few most significant colors and their statistics to characterize the appearance of each tracked object.
- a layer representation, or a template for the object is established and will be tracked by the stationary object tracking module 704 using the method and system of the described example embodiment.
- the states of templates for the objects are estimated with fuzzy reasoning.
- the template for one object may shift between five states: motionless, interior motion, occluded, starting moving, and removed.
- semantic models based on Finite State Machines are designed to detect suspected scenarios.
- An event is an abstract symbolic concept of what has happened in the scene. It is the semantic-level description of the spatio-temporal concatenation of movements and actions of interesting objects in the scene.
- Event detection in video understanding is a high-level procedure which identifies specific events by interpreting the sequences of observed perceptual features from intermediate-level processing. It is a step that bridges the numerical level and the symbolic level.
- the fundamental part of event detection is event modeling. For an event, the model is determined by the task and the different instantiations. There are generally two issues for event modeling. One is to select an appropriate representation model, or formal language, and the other is to derive the descriptors for the interesting events with the model.
- unusual events are described by the spatio-temporal evolution of object's states, movements, and actions.
- each event can be defined as a sequential succession of a few well-defined states.
- An event could be started at one or more initial states, and then one state can transit to the next state when new conditions are met as the scene evolves in time.
- State transition may also happen from an intermediate state back to a previous state if some conditions no longer hold for the state.
- the semantic representation can be modelled based on Finite State Machines (FSM).
- FSM Finite State Machines
- the FSM has at least two advantages: (1) it is explicit and natural for semantic description; (2) FSM can readily and flexibly incorporate a variety of context information from intermediate-level processing.
- each specific event can be represented by a directed graph FSM 800 = {S_e, E_e}, where S_e is the set of nodes representing the states and E_e is the set of edges representing the transitions.
- the FSM 800 could have a self-loop transition for each state. Although the FSM 800 could remain at the same state, some or all properties of the object may have changed. At the least, a time counter is incremented for each frame. The more complicated an event, the bigger N is, i.e. the number of intermediate states in the FSM 800, and the greater the chance of delivering an unreliable detection result. Therefore, an important task in event modeling is to trim any unnecessary states by careful analysis and to identify the simplest event model.
- the input of an FSM is the numerical perceptual features generated by moving and stationary object tracking modules (compare 702 and 704 in Figure 7).
- the visual cues of each tracked object can include shape, position, motion, and relations with others.
- the visual cues in the example implementation are:
- Status: indicates whether the tracked object is moving around or stationary
- InGroup: indicates whether the object is an isolated one or merged with others
- Occlusion: a measure within [0,1] indicating the degree of occlusion when overlapping with others
- Motion: a measure within [0,1] indicating the degree of interior motion of a stationary object.
- An advantage of the tracking modules is the capability to resume tracking of some objects that are lost for a few frames.
- the two events, UNATTENDED OBJECT and THEFT, are directly concerned with object disappearance in the example implementation.
- a first-in-first-out (FIFO) stack is built to contain the track records of N frames.
- O_Tracked are the track records of the N-th previous frame, and the triggered event is delayed by N frames.
- N = 30 in the example implementation.
- Loitering as defined in the example implementation involves one object. It is defined as a person wandering in the observed scene for a duration t > T_Loitering.
- the FSM is initialized for each new object.
- the FSM has one intermediate state: "Stay" which indicates that the tracked person is staying in the scene, whether moving around or stationary. There are two conditions for the transition from state "INIT" to state "Stay”:
- the object is classified as human
- the object moves in the scene (moving around or staying somewhere with frequent interior motion).
- this event also involves one object, a person. It is defined as an object becoming completely static for a duration t > T_Static.
- the FSM is initialized for each new object. When the tracked object is recognised as a person, the FSM transits to state "M", which indicates a person who is moving around or has significant interior motion.
- the second intermediate state of the FSM is "S”, which indicates a person becoming and staying static, or complete motionless. There are two conditions for the transition from state "M" to state "S”:
- a time counter t is continuously incremented as new frames are coming in.
- when t > T_Static, the FSM transits from state "S" to state "UP", indicating that an unconscious person is detected. Examples of an unconscious person include a sleeping or fainted person.
- similar conditions can be used to detect e.g. a vehicle overstaying in a zone designated for short stops, in which case the object of interest is changed to a vehicle instead of a person.
- This event as defined in the example implementation involves two objects.
- the FSM is initialized for each new object.
- when the new small object is identified as being separated from another large moving object and it stays static, a deposited object is detected and ownership is established between the two objects.
- the FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. If the owner leaves the scene covered by the camera, the FSM transits from state "Station" to state "UO" and an 'Unattended Object' is declared.
- This event as defined in the example implementation involves three objects.
- the FSM is initialized for each new object. Similar to the event of unattended object, when the new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects.
- the FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. However, when the object disappears because another object has taken it while the owner still stays in the scene, the FSM transits from state "Station" to state "Theft" and a 'Theft' event is declared; meanwhile, the second person is identified as the potential thief.
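The event FSMs above can be sketched as small state machines driven, once per frame, by the tracker's visual cues. Below is a minimal Python sketch of the unconscious-person FSM (states INIT, M, S, UP); the cue names, the threshold values and the helper structure are illustrative assumptions, not the patent's actual implementation.

```python
# Minimal sketch of an event FSM (unconscious-person event: INIT -> M -> S -> UP).
# Cue names (is_person, status, motion) and the constants below are assumptions.

T_STATIC = 300          # frames a person must stay motionless before "UP" is declared (assumed)
MOTION_EPS = 0.05       # interior-motion level below which the object counts as static (assumed)

class UnconsciousPersonFSM:
    def __init__(self):
        self.state = "INIT"
        self.t = 0                         # time counter, incremented every frame

    def update(self, cues):
        """Advance one frame using the tracker's visual cues for this object."""
        if self.state == "INIT":
            if cues["is_person"]:
                self.state = "M"           # person moving around or with interior motion
        elif self.state == "M":
            if cues["status"] == "stationary" and cues["motion"] < MOTION_EPS:
                self.state, self.t = "S", 0   # person became completely static
        elif self.state == "S":
            if cues["status"] == "stationary" and cues["motion"] < MOTION_EPS:
                self.t += 1                # self-loop: same state, counter keeps running
                if self.t > T_STATIC:
                    self.state = "UP"      # unconscious person detected
            else:
                self.state, self.t = "M", 0   # transition back to a previous state
        return self.state
```

The other events (loitering, unattended object, theft) follow the same pattern with different states and transition conditions.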
- the method and system of the example embodiment can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiment.
- the computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.
- the computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
- LAN Local Area Network
- WAN Wide Area Network
- the computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922.
- the computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.
- I/O Input/Output
- the components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.
- the application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilising a corresponding data storage medium drive of a data storage device 930.
- the application program is read and controlled in its execution by the processor 918.
- Intermediate storage of program data may be accomplished using RAM 920.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
Abstract
A method and system for background updating for adaptive background subtraction in a video signal. The method comprises the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
Description
Method And System For Context-Controlled Background Updating
FIELD OF INVENTION
The present invention relates broadly to a method and system for background updating, and to a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating.
BACKGROUND
Adaptive background subtraction is typically the first fundamental step in video surveillance. A typical surveillance system consists of a stationary camera directed at the scene of interest. A pixel-level background model is then generated and maintained to keep track of the time-evolving background. Background maintenance is the crucial part that may affect the performance of background subtraction in time-varying situations. Basic background subtraction methods employ a single reference image corresponding to the empty scene as the background model. A Kalman filter is usually used to follow slow illumination changes. However, it has been realized that such a simple model is not suitable for surveillance in real-world situations.
Adaptive background subtraction (ABS) techniques based on statistical models that characterize the background appearance at each pixel were developed for various complex backgrounds. Wren [9] employed a single Gaussian to model the color distribution at each pixel. In [4], a mixture of Gaussians (MoG) is proposed to model a background of multiple states, e.g., normal and shadow appearance, and complex variations, e.g., bushes in the wind. Many enhanced variants of MoG have been proposed in recent years. Some of the enhancements integrate gradients, depths, or local features into the Gaussians, and others employ non-parametric models, e.g. kernels, to replace the Gaussians.
In [2], a model of principal feature representation (PFR) to characterize each background pixel was proposed. Using PFR, multiple features of the background, such as color, gradient, and color co-occurrence, can be learned automatically and integrated in the classification of background and foreground. Employing various statistical models and multiple features for background modelling, the ABS methods become more and more robust with respect to a variety of complex backgrounds. Most of the existing methods of adaptive background subtraction
employ a constant learning rate for background updating. Some existing methods update the background model in a constant period of time.
With a constant learning rate or a constant periodic update, existing methods gradually forget the old background and absorb the new background appearance into the background model. The foremost assumption behind this is that the most frequently observed features at a pixel should come from the background. This assumption is valid for situations of simple foreground activities even though the background is highly complex, e.g., a scene of various dynamic properties. However, when some background pixels are frequently occluded by foreground objects, e.g., by a person staying motionless or by frequent heavy crowds, this assumption is violated.
Some proposed approaches tried to control the learning rate according to the results of segmentation or tracking. However, this control is based on positive feedback since it depends on the results of background subtraction. It may not be able to correct the errors caused by background subtraction itself.
A need therefore exists to provide a method and system for background updating that seek to address at least one of the above disadvantages.
SUMMARY
In accordance with a first aspect of the present invention there is provided a method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
A first learning rate for the pixels that are occluded may be lower than a second learning rate for the pixels that are exposed.
The method may further comprise the steps of determining whether said respective pixels that are exposed are detected as a background point or as a foreground point in a current background subtraction for the current image; and setting different learning rates for the adaptive background subtraction for exposed pixels that are detected as foreground points and for exposed pixels that are detected as background points respectively.
A third learning rate for the exposed pixels that are detected as foreground points may be higher than the second learning rate for the exposed pixels that are detected as background points.
One contextual background representation type A may comprise a facility for the public such as a counter or a bench, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed may comprise the steps of evaluating, for each image region spatially corresponding to a type A background region, whether said each image region is occluded based on matching OHRs of the type A background region and of said each image region respectively and based on matching PCRs of the type A background region and of said each image region respectively; and determining all pixels of said each image region as either occluded or exposed depending on said evaluation.
All pixels may be determined as exposed if a match likelihood in said evaluation is above a threshold value, and are determined as occluded otherwise.
One contextual background representation type B may comprise a large homogeneous region such as a ground plane or a wall surface, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed may comprise the steps of evaluating, for each image region spatially corresponding to a type B background region, whether neighborhood regions around respective pixels in said each image region are occluded based on matching PCRs of the type B background region and of the respective neighborhood regions; and determining pixels of said each image region as either occluded or exposed depending on the respective evaluations.
Each pixel may be determined as occluded if a majority of neighborhood pixels in the neighborhood region of said each pixel are within said type B background region and less of the neighborhood pixels themselves are evaluated as exposed based on a match likelihood being above a threshold value, and is determined as exposed otherwise.
The method may further comprise setting a zero learning rate for pixels belonging to foreground regions.
The method may further comprise the step of performing adaptive background subtraction using said set rates for the respective pixels.
The adaptive background subtraction may be based on a Mixture of Gaussian or Principle Feature Representation.
The method may further comprise maintaining a model base for the contextual background representation types, the model base including models for different illumination conditions.
The method may further comprise adjusting an appearance, a spatial characteristic, or both, of the models in the model base over a long duration compared with a frame duration in the video signal.
In accordance with a second aspect of the present invention there is provided a system for background updating for adaptive background subtraction in a video signal, the system comprising means for defining one or more contextual background representation types; means for segmenting an image of a scene in the video signal into contextual background regions; means for classifying each contextual background region as belonging to one of the contextual background representation types; means for determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; means for receiving a current image of the scene in the video signal; means for determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and means for setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
In accordance with a third aspect of the present invention there is provided a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
Figure 1 shows a series of images illustrating adaptive background subtraction using the background updating method and system of the example embodiments.
Figure 2 shows a flow chart illustrating a method of context-based background updating for adaptive background subtraction in the example embodiment.
Figure 3 shows a series of images and histograms illustrating principle colour representation (PCR) in the example embodiment.
Figure 4 shows a schematic drawing illustrating directed acyclic graphs (DAGs) for regions in consecutive frames in the example embodiment.
Figure 5 shows a flow chart illustrating a method of multi-object tracking in a video signal in the example embodiment.
Figure 6 shows a flow chart illustrating a method of stationary object tracking in a video signal in the example embodiment.
Figure 7 shows a schematic drawing of an event detection system implementation using the example embodiment.
Figure 8 shows a graph illustrating a finite state machine (FSM) representation for event detection in the system implementation of Figure 7.
Figure 9 shows a schematic drawing of a computer system for implementing the example embodiment.
DETAILED DESCRIPTION
The described embodiment provides a novel 2½D method of multi-object tracking for real-time video surveillance. An appearance model, principal color representation (PCR), is applied to multi-object tracking. The PCR model characterizes the appearance of an object or a region with a few most significant colors. The likelihood of observing a tracked object in a foreground region is derived according to their PCRs. Based on the Bayesian estimation theory, multi-object tracking is formulated as a Maximum A Posterior (MAP) problem over all the tracked objects. With the foreground regions provided by background subtraction, the problem of multi-object tracking is decomposed into two subproblems: assignment and location.
By exploiting that the close and unoccluded objects have richer visual information than the distant or occluded ones, sequential solutions to the subproblems which process the objects in a group from the most visible to the least visible ones are derived according to the likelihoods estimated based on PCR. In the assignment step, each tracked object is assigned to a foreground region in the coming frame. When an object is assigned, its visual information will be excluded from the PCR of the region.
In the location step, multiple objects assigned to one region are located one-by-one according to their depth order. A two-phase mean-shift algorithm based on PCR is derived for locating objects. When an object is located, its visual information is excluded from the new position in the region. The operation of exclusion at the end of each iteration for assignment and location in the example embodiment can avoid multiple objects being trapped into the same region or position.
Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring
physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "calculating", "determining", "excluding", "generating", "assigning", "locating", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.
In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile
telephone system. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
The invention may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the system can also be implemented as a combination of hardware and software modules.
The distinctive background objects (regions) in the example embodiment are classified into two categories:
Type-1 CBR: a facility for the public in the scene;
Type-2 CBR: a large homogeneous region.
Contextual descriptors are developed to characterize the distinctive appearances of CBRs and evaluate the likelihoods of observing them. Different contextual background regions may have different appearance features. Some manifest significant structural features, while others may have homogeneous color distributions. The example embodiment employs Orientation Histogram Representation (OHR) to describe the structural features of a region and Principal Color Representation (PCR) to describe the distribution of dominant colors. Let R_b^i be the i-th CBR in the empty scene I(x), and G(x) and O(x) be the gradient and orientation images of I(x), respectively. If the orientation values are quantized into 12 bins each covering 30°, the orientation histogram for R_b^i is defined as

H_b^i = {h_k^i}_{k=1}^{12},  h_k^i = Σ_{x ∈ R_b^i} μ_T(G(x)) · δ_k(O(x))    (1a)

where μ_T(·) is a binary function on the threshold T and δ_k(·) is a delta function defined as

δ_k(θ) = 1 if θ falls into the k-th orientation bin, and 0 otherwise.    (2a)
The OHR H_b^i is a simple and efficient variant of the robust local descriptor SIFT [1] suitable for real-time processing. It is less sensitive to illumination changes and slight shifts of object position.
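As a concrete illustration, an orientation histogram of this kind can be computed roughly as follows. This is a minimal NumPy sketch under the assumptions stated in the comments (Sobel gradients, a fixed gradient threshold), not the patent's exact formulation.

```python
import numpy as np
import cv2  # used only for the Sobel gradients; any gradient operator would do

def orientation_histogram(gray, region_mask, grad_thresh=30.0, n_bins=12):
    """12-bin orientation histogram (OHR) of a background region.

    gray        : 2-D grayscale image of the empty scene
    region_mask : boolean mask of the CBR R_b^i
    grad_thresh : threshold T on the gradient magnitude (assumed value)
    """
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = np.hypot(gx, gy)                      # gradient image G(x)
    ori = np.degrees(np.arctan2(gy, gx)) % 360  # orientation image O(x)

    # keep only region pixels whose gradient exceeds the threshold (mu_T)
    sel = region_mask & (mag > grad_thresh)
    bins = (ori[sel] // (360 // n_bins)).astype(int)   # delta_k: 30-degree bins
    return np.bincount(bins, minlength=n_bins)
```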
By scanning the region R_b^i, a table of the PCR for the region can be obtained. The PCR for R_b^i is defined as

T_b^i = { p_i, { E_i^k = (c_i^k, p_i^k) }_{k=1}^{N_i} }    (3a)

where p_i is the size of R_b^i, c_i^k is the k-th most significant color of R_b^i and p_i^k is its significance value. The significance value is computed by

p_i^k = Σ_{x ∈ R_b^i} δ(I(x), c_i^k)    (4a)

δ(c1, c2) is a delta function. It equals 1 when the color distance d(c1, c2) is smaller than a small threshold ε; otherwise, it is 0. The color distance used here is

d(c1, c2) = 1 − 2⟨c1, c2⟩ / (‖c1‖² + ‖c2‖²)    (5a)

where ⟨·,·⟩ denotes the dot product [2, 3]. The principal color components E_i^k are sorted in descending order according to their significance values p_i^k. The first N_i components which satisfy Σ_{k=1}^{N_i} p_i^k ≥ 0.95·p_i are used as the PCR of the region R_b^i, which means the principal colors in the PCR cover more than 95% of the colors in R_b^i. PCR is thus efficient for describing large regions of distinctive colors.
A type-1 CBR in the example embodiment is associated with a facility which has a distinctive structure and colors in the image. Both OHR and PCR are used to characterize the type-1 CBR. Let R_b1^i be the i-th type-1 CBR in the scene. Its contextual descriptors are H_b1^i and T_b1^i. A type-1 CBR has just two states: occluded (occupied) or not. The likelihood of observing a type-1 CBR is evaluated on the whole region. Suppose the contextual descriptors of the region R_t(x) at the corresponding position of R_b1^i in the current frame I_t(x) are H_t and T_t. The likelihood of R_b1^i being exposed can be evaluated by matching R_t(x) to R_b1^i. Based on OHR, the matching of R_t(x) and R_b1^i is defined as a normalized histogram matching score P_L(H_t|H_b1^i) (6a). If R_b1^i and R_t(x) are similar, P_L(H_t|H_b1^i) is close to 1; otherwise, it is close to 0.

Based on PCR, the matching is expanded over the principal colors of R_b1^i as

P(T_t|T_b1^i) = Σ_k P(T_t|E_b1^{k,i}) P(E_b1^{k,i}|T_b1^i)    (7a)

The second term in the sum is the weight of the principal color c_b1^{k,i} in the PCR of R_b1^i, i.e., P(E_b1^{k,i}|T_b1^i) = p_b1^{k,i}/p_b1^i. The first term is the likelihood based on the partition evidence of principal color c_b1^{k,i}. It is evaluated from the PCRs of R_b1^i and R_t(x) as

P(T_t|E_b1^{k,i}) = (1/p_b1^{k,i}) · min( p_b1^{k,i}, Σ_j δ(c_b1^{k,i}, c_t^j) p_t^j )    (8a)

Then there is

P(T_t|T_b1^i) = (1/p_b1^i) Σ_k min( p_b1^{k,i}, Σ_j δ(c_b1^{k,i}, c_t^j) p_t^j )    (9a)

P(T_b1^i|T_t) can be obtained in a similar way. Now the matching of R_b1^i and R_t(x) based on PCR is defined as

P_L(T_t|T_b1^i) = min{ P(T_t|T_b1^i), P(T_b1^i|T_t) }    (10a)

Assuming that the colors and the gradients are independent and different weights are used, the log likelihood of observing R_b1^i at time t is

L_b1^{i,t} = ω_s log P_L(H_t|H_b1^i) + (1 − ω_s) log P_L(T_t|T_b1^i)    (11a)

where ω_s = 0.6 is chosen empirically.
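For illustration, the type-1 CBR matching of (8a)–(11a) could be sketched as follows. The PCRs are represented here as simple lists of (color, count) pairs, the OHR similarity is taken as histogram intersection (an assumption, since (6a) is not reproduced above), and ε is borrowed from the tracking section; this is only an approximate sketch, not the patent's implementation.

```python
import numpy as np

OMEGA_S = 0.6   # weight between the structure (OHR) and color (PCR) terms, per (11a)
EPS = 0.005     # color-distance threshold (assumed, as in the tracking section)

def color_dist(c1, c2):
    """Normalized color distance of (5a)."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    return 1.0 - 2.0 * np.dot(c1, c2) / (np.dot(c1, c1) + np.dot(c2, c2) + 1e-12)

def ohr_match(h_obs, h_model):
    """OHR similarity; histogram intersection is an assumption, the text only requires
    a score close to 1 for similar histograms and close to 0 otherwise."""
    h_obs, h_model = np.asarray(h_obs, float), np.asarray(h_model, float)
    return np.minimum(h_obs, h_model).sum() / max(h_model.sum(), 1.0)

def pcr_match(pcr_model, pcr_obs):
    """One-directional PCR match of (9a): fraction of the model's principal color mass
    found in the observed region. PCRs are lists of (color, count) pairs."""
    total = sum(cnt for _, cnt in pcr_model)
    score = sum(min(cnt_m, sum(cnt_o for c_o, cnt_o in pcr_obs
                               if color_dist(c_m, c_o) < EPS))
                for c_m, cnt_m in pcr_model)
    return score / max(total, 1)

def type1_log_likelihood(h_obs, h_model, pcr_obs, pcr_model, tiny=1e-6):
    """Log likelihood (11a) of observing a type-1 CBR in the current frame."""
    p_ohr = ohr_match(h_obs, h_model)
    p_pcr = min(pcr_match(pcr_model, pcr_obs), pcr_match(pcr_obs, pcr_model))  # (10a)
    return OMEGA_S * np.log(p_ohr + tiny) + (1.0 - OMEGA_S) * np.log(p_pcr + tiny)
```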
The type-2 CBRs in the example embodiment are large homogeneous regions. Only the PCR descriptor is used for each of them. Usually only part of a type-2 CBR is occluded when a foreground object overlaps it. The likelihood of observing a type-2 CBR is therefore evaluated locally, at each pixel: the probability P(I_t(x)|T_b2^i) of observing the pixel color I_t(x) given the PCR T_b2^i of the i-th type-2 CBR takes the significance weight of the principal color c_b2^{k,i} that matches I_t(x) (i.e., d(I_t(x), c_b2^{k,i}) < ε), and 0 if no principal color matches.

The log likelihood that the pixel x in the current frame belongs to R_b2^i is

L_b2^{i,t}(x) = log P(I_t(x)|T_b2^i)    (14a)
The appearance model of a type-1 CBR in the example embodiment consists of its OHR and PCR. For the i-th type-1 CBR R_b1^i, the appearance model is defined as M_a(R_b1^i) = (H_b1^i, T_b1^i). The spatial model of R_b1^i is defined as its bounding box and center point, i.e., M_s(R_b1^i) = (B_b1^i, x_b1^i).
To adapt to lighting changes from day to night, besides the active appearance model M_a(R_b1^i), a model base which contains up to K_b appearance models of R_b1^i is used. The models in the base are learned incrementally. The active appearance model is the one from the model base which best fits the current appearance of the CBR. The model base of the i-th type-1 CBR R_b1^i is MB(R_b1^i) = {M_a^k(R_b1^i)}, k ≤ K_b.
Natural lighting changes slowly and smoothly. Let D be a time duration of 3 to 5 minutes (not limiting) in the example embodiment, i.e. a long duration compared with the frame duration in the video signal. The times of observing the i-th type-1 CBR during the period are accumulated as

z_b1^{i,p} = Σ_{t ∈ D} μ_{T_L1}(L_b1^{i,t})    (15a)

and the average of the likelihood values is

L̄_b1^{i,p} = (1/z_b1^{i,p}) Σ_{t ∈ D} μ_{T_L1}(L_b1^{i,t}) · L_b1^{i,t}    (16a)

where L_b1^{i,t} > T_L1 means R_b1^i is visible at time t. If sufficient samples of R_b1^i have been observed during the previous (last) duration (e.g., z_b1^{i,p}/D > 25%) and the average likelihood value is approaching the threshold T_L1 (e.g., L̄_b1^{i,p} < 0.8·T_L1), a new appearance of R_b1^i may be observed. In the coming duration, a new appearance model M_a^c(R_b1^i) = (H_b1^{i,tc}, T_b1^{i,tc}) is obtained from a frame in which R_t(x) looks most like R_b1^i, i.e., tc = arg min_{t ∈ D} |L_b1^{i,t} − L̄_b1^{i,p}|. If the average likelihood values are low in two consecutive durations, the active appearance model is replaced. First, the new appearance model M_a^c(R_b1^i) is compared with the ones in the model base according to (11a). If one is sufficiently close to the new model (i.e. the similarity is larger than T_L1 + ε), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
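The duration-based model-base maintenance described above might be sketched roughly as follows. The data structure, thresholds and the single-duration simplification (the patent waits for two consecutive low-average durations) are assumptions made only for illustration.

```python
from collections import deque

K_B = 5              # maximum number of appearance models kept per CBR (assumed)
T_L1 = -2.0          # visibility threshold on the log likelihood (assumed value)
MIN_OBS_RATIO = 0.25 # minimum fraction of exposed frames per duration

class AppearanceModelBase:
    """Keeps an active appearance model plus a small base of older ones."""

    def __init__(self, initial_model):
        self.active = initial_model
        self.base = deque([initial_model], maxlen=K_B)   # oldest model dropped when full

    def end_of_duration(self, likelihoods, candidate_model, similarity):
        """Called once per duration D with the per-frame likelihoods L_b1^{i,t},
        a candidate model built from that duration, and a similarity function."""
        visible = [l for l in likelihoods if l > T_L1]
        if len(visible) / max(len(likelihoods), 1) < MIN_OBS_RATIO:
            return                                   # not enough exposed samples
        if sum(visible) / len(visible) >= 0.8 * T_L1:
            return                                   # appearance still fits well
        # appearance drifted: reuse a close model from the base, else adopt the new one
        close = [m for m in self.base if similarity(m, candidate_model) > 0.9]  # assumed
        self.active = close[0] if close else candidate_model
        if not close:
            self.base.append(candidate_model)
```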
Let T_b2^i be the PCR descriptor of the i-th type-2 CBR R_b2^i; the appearance model of R_b2^i is then defined as M_a(R_b2^i) = (T_b2^i). The spatial model of R_b2^i describes the range of the homogeneous region in the image. A binary mask I_b2^i(x) is used for it, i.e., M_s(R_b2^i) = (I_b2^i(x)). The spatial model may have to be adjusted in an initialization duration when sufficient samples have been observed according to the likelihood values.
Again, a model base is employed to deal with the appearance variations of the type-2 CBRs from day to night. The model base of the i-th type-2 CBR R_b2^i is MB(R_b2^i) = {M_a^k(R_b2^i)}, k ≤ K_b. The models in the model base are learned incrementally through the time durations. First, at each time step t, the binary image of observed parts for R_b2^i is generated as V_b2^{i,t}(x) = μ_{T_L2}(L_b2^{i,t}(x)). The overlapping ratio between the exposed parts and the spatial model for R_b2^i at time t is

r_b2^{i,t} = |V_b2^{i,t} ∩ I_b2^i| / |V_b2^{i,t} ∪ I_b2^i|    (17a)

where '∩' means intersection and '∪' means union. The larger the ratio is, the more parts of R_b2^i are exposed and the fewer pixels of other objects would be involved. At the end of each duration, the times of observing a large part of R_b2^i during the period is

z_b2^{i,p} = Σ_{t ∈ D} μ_{T_H}(r_b2^{i,t})    (18a)

and the average similarity value between the observed parts and its active model can be computed as

S̄_b2^{i,p} = (1/z_b2^{i,p}) Σ_{t ∈ D} μ_{T_H}(r_b2^{i,t}) · P_L(T_b2^{i,t} | T_b2^i)    (19a)

where P_L(T_b2^{i,t} | T_b2^i) is calculated according to (10a) with normalized PCRs and T_H = 75% is used. Like the operation for type-1 CBRs, if sufficient samples have been observed during the last duration (i.e., z_b2^{i,p}/D > 25%) and the average similarity value is approaching the threshold T_L2 (e.g., S̄_b2^{i,p} < 0.8·T_L2), a new appearance model M_a^c(R_b2^i) is generated from the current duration. If the average similarity values are low in two consecutive durations, the active appearance model will be replaced. If there is a model in the base which is close enough to the new appearance model M_a^c(R_b2^i) (i.e. the similarity is larger than T_L2 + ε), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
Let {R_b^i}_{i=1}^{N_b} be the CBRs of a scene. Given a coming frame I_t(x) and a local region R_t(x) centered at x in I_t(x), the posterior probability of R_t(x) belonging to a CBR R_b^i is

P(R_b^i | R_t(x)) = P(R_t(x) | R_b^i) P(R_b^i) / P(R_t(x))    (20a)

The prior probability P(R_t(x)) is the same for every pixel in an image. Then the log posterior probability of R_t(x) belonging to R_b^i in the current frame I_t(x) is defined as

Q_b^{i,t}(x) = log P(R_t(x) | R_b^i) + log P(R_b^i | x)    (21a)

The position of a type-1 CBR is already determined by its spatial model. The prior probability P(R_b1^i | x) is 1 for the position and 0 otherwise. Then, the log posterior probability is equivalent to the log likelihood at the position, i.e., Q_b1^{i,t} = L_b1^{i,t} for R_b1^i. A rate of occluded times over recent frames for each type-1 CBR is used. For R_b1^i, the rate is computed as

r_b1^{i,t} = β·r_b1^{i,t−1} + (1 − β)·(1 − μ_{T_L1}(Q_b1^{i,t}))    (22a)

where β is a smoothing factor and β = 0.5 is chosen. A high rate value (close to 1) indicates that R_b1^i has been occluded in recent frames.
From the spatial model I_b2^i(x) of the i-th type-2 CBR R_b2^i, the prior probability P(R_b2^i | x) of a pixel x belonging to the region R_b2^i can be defined (23a). Combining (21a), (14a), and (23a), the log posterior probability that x is an exposed point of R_b2^i is

Q_b2^{i,t}(x) = L_b2^{i,t}(x) + log P(R_b2^i | x)    (24a)

A rate of occluded times over recent frames at each pixel for each type-2 CBR is used. First, to be robust to noise and boundary effects, an occluded pixel of a type-2 CBR is confirmed on the local neighborhood R_t(x). Let r_1 be the proportion of pixels belonging to R_b2^i in the neighborhood region, i.e.,

r_1 = (1/|R_t(x)|) Σ_{y ∈ R_t(x)} I_b2^i(y)    (25a)

and r_2 be the proportion of exposed pixels of R_b2^i in R_t(x) according to the posterior estimates, i.e.,

r_2 = (1/|R_t(x)|) Σ_{y ∈ R_t(x)} μ_{T_Q}(Q_b2^{i,t}(y))    (26a)

where the threshold T_Q is chosen as slightly lower than T_L2. Then, an occluded pixel of R_b2^i is confirmed if the majority of the pixels within its neighborhood belong to R_b2^i and fewer of them are observed (exposed) in the current frame. Now the rate is computed as

r_b2^{i,t}(x) = β·r_b2^{i,t−1}(x) + (1 − β)·[ μ_{T_r}(r_1) · μ_{T_r}(1 − r_2) ]    (27a)

where T_r = 75% is chosen in the example embodiment.
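As a rough illustration, the neighborhood proportions and the smoothed per-pixel occlusion rate of (25a)–(27a) could be computed with box filters as sketched below; the window size and the posterior threshold are assumptions, and the posterior image Q_b2^{i,t}(x) is taken as given.

```python
import numpy as np
from scipy.ndimage import uniform_filter

BETA, T_R, T_Q = 0.5, 0.75, -1.0   # smoothing factor, majority threshold, posterior threshold (T_Q assumed)

def update_occlusion_rate(rate_prev, region_mask, log_posterior, win=9):
    """Per-pixel occluded-times rate for a type-2 CBR, one frame update.

    rate_prev     : previous rate image r_b2^{i,t-1}(x)
    region_mask   : binary spatial mask I_b2^i(x) of the CBR
    log_posterior : Q_b2^{i,t}(x), log posterior of being an exposed CBR point
    win           : side of the square neighborhood R_t(x) (assumed)
    """
    r1 = uniform_filter(region_mask.astype(float), size=win)        # proportion in the CBR
    exposed = (log_posterior > T_Q).astype(float)                   # thresholded posterior
    r2 = uniform_filter(exposed, size=win)                          # proportion exposed
    occluded_now = ((r1 > T_R) & ((1.0 - r2) > T_R)).astype(float)  # majority in CBR, few exposed
    return BETA * rate_prev + (1.0 - BETA) * occluded_now
```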
According to the result of the contextual interpretation, three learning rates can be applied at each pixel for different situations in the example embodiment:
Normal learning rate to exposed background pixels with small variations;
Low learning rate to occluded background pixels;
High learning rate to exposed background pixels with significant changes.
An image of control codes C_t(x) is used, where the value of C_t(x) is 0, 1, 2, or 3: 0 for the normal learning rate at non-context pixels (used for display), and 1, 2, and 3 where the low, normal, and high learning rates are applied respectively at the pixel. First, for the pixels not associated with any contextual background region, C_t(x) = 0 is set. The rest of C_t(x) is determined according to the results of the contextual interpretation. For a pixel x within the i-th type-1 CBR R_b1^i, if r_b1^{i,t} ≥ 0.7, which means the CBR is being blocked by a foreground object, C_t(x) = 1 is set. Otherwise, if I_t(x) is detected as a background point by background subtraction, C_t(x) = 2 is set since the CBR is exposed and no significant appearance change is found; but if I_t(x) is detected as a foreground point by background subtraction, the high rate should be applied since the pixel is an exposed CBR point with a significant appearance change, i.e., C_t(x) = 3. For a pixel of the i-th type-2 CBR R_b2^i, if r_b2^{i,t}(x) ≥ 0.7, which means the patch of the CBR is being occluded by a foreground object, C_t(x) = 1 is set. Otherwise, if I_t(x) is detected as a background point by background subtraction, C_t(x) = 2 is set for an exposed part of the type-2 CBR with no significant appearance change; but if I_t(x) is detected as a foreground point by background subtraction, C_t(x) = 3 is set for an exposed neighborhood of the type-2 CBR with a significant appearance change.
To smooth the control code temporally at each pixel, four control code images are used. The first two are the previous and current control code images described above, i.e., C_{t−1}(x) and C_t(x), and the other two are the control codes actually applied for pixel-level background maintenance, i.e., C*_{t−1}(x) and C*_t(x). The control code applied to the current frame at pixel x is determined by the votes from the three other control codes C_{t−1}(x), C_t(x), and C*_{t−1}(x). If at least two of the three codes are the same, the control code with the most votes is selected. If the three codes all differ from one another, the normal learning rate is used, i.e., C*_t(x) = 2.
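A per-pixel sketch of this control-code logic, including the three-way vote and the rate mapping used in the test below (high = double the normal rate, low = zero), might look as follows in Python; the function arguments are illustrative assumptions.

```python
LOW, NORMAL, HIGH = 1, 2, 3   # control codes (0 = non-context pixel, normal rate)

def control_code(in_cbr, occlusion_rate, is_foreground):
    """Control code for one pixel from the contextual interpretation."""
    if not in_cbr:
        return 0
    if occlusion_rate >= 0.7:                   # CBR patch blocked by a foreground object
        return LOW
    return HIGH if is_foreground else NORMAL    # exposed, with / without appearance change

def applied_code(c_prev, c_curr, c_applied_prev):
    """Temporal vote among the previous, current, and previously applied codes."""
    votes = [c_prev, c_curr, c_applied_prev]
    for v in set(votes):
        if votes.count(v) >= 2:
            return v
    return NORMAL                               # all three differ: fall back to the normal rate

def learning_rate(code, normal_rate):
    """Map a control code to a learning rate (high = 2x normal, low = 0, as in the test)."""
    return {0: normal_rate, LOW: 0.0, NORMAL: normal_rate, HIGH: 2.0 * normal_rate}[code]
```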
To evaluate the effect of context-controlled background maintenance on adaptive background subtraction, two existing methods of ABS, to which the example embodiment was applied, were implemented. They are the methods based on Mixture of Gaussians (MoG) [4] and Principal Feature Representation (PFR) [2]. Hence, four methods, MoG, Context-Controlled MoG (CC
MoG), PFR, and Context-Controlled PFR (CC PFR) were compared. In the test, the normal learning rate of the example embodiment as described above was set to the constant learning rate used for the existing methods of ABS. The high learning rate was set to double the normal learning rate and the low learning rate was set to zero. In Figure 1, the leftmost image 102 is a snapshot with manually cropped out contextual background regions e.g. 104, which are type-2 CBRs in this example. In the snapshot image 102, the type-2 CBRs are surrounded by polygon boundaries e.g. 106 of different colors. The second column 108 shows a sample frame from the sequence 110 and the corresponding ground truth 112 of the foreground. The rest of the images in the upper row 114 are: the segmented results by MoG 116, CC MoG (Context-Controlled MoG) 118, and the corresponding control image 120. The three images in the lower row 122 are the segmented results of PFR 124 and CC PFR (Context-Controlled PFR) 126, and the corresponding control image 128. In the control images 120, 128, the black regions e.g. 130 do not belong to any CBR, the gray regions e.g. 132 are exposed parts of the CBRs with no significant appearance changes, and the white regions e.g. 134 are occluded parts of the CBRs.
According to the example embodiment, for pixels in the regions of exposed parts of the CBRs with no significant appearance changes, the normal learning rate is applied; for pixels in regions of occluded parts of the CBRs, the low learning rate is used. For pixels in regions of exposed parts of CBRs with significant changes (not applicable in the scene shown in Figure 1), the high learning rate would be used as described above. The scene in the image 102 is a meeting room with four marked type-2 CBRs for the table surface, the ground surface, wall surfaces, and the chair. In this sequence of 5250 frames, there were no overstaying objects or overcrowding. However, several people (e.g. 138) kept moving around, staying somewhere for a while, and performing various activities. Therefore, the center parts of the scene were frequently occluded by persons. Using a constant learning rate in the unmodified ABS methods, some appearance features of the persons were learned into the background models, and then the background subtraction failed to extract the complete figures of the persons in the incoming frames (see images 116, 124). One example frame, Frame #102810, is displayed in Fig. 1.
By using context-controlled background maintenance of the example embodiment applied to the ABS methods, the persons were segmented satisfactorily (see images 118, 126). A quantitative evaluation on 12 frames sampled from the sequence every 200 frames started from Frame #101410 (empty frames were skipped) is listed in Table 1, where the metric value is defined as the ratio between the intersection and union of the ground truth and the segmented regions. According to [2], the performance is good if the metric value is larger than 0.5 and nearly perfect if the metric value is larger than 0.8. From Table 1, it can be seen that, by using the
context-controlled background maintenance of the example embodiment applied to the existing ABS methods, the performance of adaptive background subtraction on situations of complex foreground activities can be improved significantly.
Table 1
The contextual features of the example embodiment capture global information. Such global information may not always lead to a precise segmentation in position, especially along boundary regions of objects. However, if fed with correct samples continuously, the pixel-level statistical models can be tuned to characterize the background appearance accurately at each pixel. Then the pixel-level background models can preferably be used to achieve a precise segmentation of foreground objects.
The example embodiment exploits contextual interpretation to control the pixel-level background maintenance for adaptive background subtraction. Experimental results show that the example embodiment can improve the performance of adaptive background subtraction for at least situations of high foreground complexities.
Figure 2 shows a flow chart 200 illustrating a method of background updating for adaptive background subtraction in a video signal according to the example embodiment. At step 202, one or more contextual background representation types are defined. At step 204, an image of a scene in the video signal is segmented into foreground and background regions. At step 206, each background region is classified as belonging to one of the contextual background representation types. At step 208, an orientation histogram representation (OHR), a principle colour representation (PCR), or both, are determined of each background region. At step 210, a current image of the scene in the video signal is received. At step 212, it is determined whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed. At step 214, different learning rates are set for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
While the described example embodiment started from manually cropped out contextual background regions in a snapshot, image segmentation and background object recognition for automatic initialization of contextual models may be performed in different embodiments.
In the example embodiment, principal color representation (PCR) is applied for efficient appearance-based multi-object tracking. In a video surveillance system, object tracking may be applied to a sequence of segmented images generated by background subtraction. In such a case, each segmented image may contain one or several isolated foreground regions. Further, each region may consist of one target object (e.g., a walking person) or a group of target objects (when objects overlap from the camera viewpoint). The example embodiment uses the principal color representation (PCR) for modeling and characterizing the appearance of target objects as well as the segmented regions. For an image sequence captured from a natural public site, each image may contain one or several objects. These objects in the image may overlap on some occasions. Further, the poses, scales, and motion modes of objects can change significantly during the overlap. It has been recognized by the inventors that these issues make shape-based object tracking a rather challenging task. However, the inventors have recognized that it is much less likely that a target object changes colors in a sequence from a surveillance camera. Hence, using global color features of an individual object can provide a relatively stable and constant way for object appearance description. This can also lead to a better discrimination of multiple target objects in the scene.
In video surveillance, an object of interest (e.g., a person, vehicle, luggage, etc.) may render a few dominant colors which only span a small portion of the entire color space. Let the n-th foreground region detected from the frame at time t be R_t^n(x), where x = (x, y)^T denotes the position of a pixel in the region. Then the corresponding principal color representation (PCR) can be defined as

T_t^n = { s_n, { E_n^i = (c_n^i, s_n^i) }_{i=1}^{N} }    (1)

where s_n is the size of the region (or the total number of pixels within the region), c_n^i = (r_n^i, g_n^i, b_n^i)^T is the RGB value of the i-th most significant color under the original color resolution (i.e., 256 levels for each channel), and s_n^i is the significance of c_n^i for the region. The components E_n^i are sorted in descending order according to the significance values of the principal colors. Let the current frame of input color images be I_t(x); then the significance of the i-th principal color can be defined as

s_n^i = Σ_{x ∈ R_t^n} ω(x) δ(I_t(x), c_n^i)    (2)

where ω(x) is a weight function and δ(·,·) is a delta function. In the example embodiment, ω(x) = 1 is chosen for isolated objects or regions. When locating an object in a group, ω(x) may not be equal to 1. If necessary, other weight functions can be used, e.g. a Gaussian kernel to suppress the noise around the object's boundary [5]. δ(c1, c2) equals 1 when c1 = c2; otherwise it equals 0. However, in the example embodiment a color distance is used which is not sensitive to noise and illumination changes:

d(c1, c2) = 1 − 2⟨c1, c2⟩ / (‖c1‖² + ‖c2‖²)    (3)

where ⟨·,·⟩ denotes the dot product. The color distance in (3) is then applied to compute the delta function in (2) as

δ(c1, c2) = 1 if d(c1, c2) < ε, and 0 otherwise    (4)

ε = 0.005 is chosen in the example embodiment.

The PCR T_t^n contains the first N significant colors and their statistics for the region R_t^n(x). Since a region of one or a few objects manifests only a few dominant colors, it is possible to find a small number N to approximate the color features of the region, i.e.,

Σ_{i=1}^{N} s_n^i ≥ η·s_n with η = 90%    (5)

In the example embodiments, using N = 50 in (5) leads to satisfactory results for almost all regions containing one or a group of objects.
Fig. 3 shows two examples of PCRs where one image 300 contains two isolated individuals and another image 302 contains a group of 5 persons. The PCRs for the foreground regions are generated through scanning the respective regions, and are shown in the histograms 312, 314 respectively. Details of the algorithm for the foreground region R_t^n(x) (see white areas e.g. 304, 306 in the segmented images 308, 310 respectively) are summarized in Table 2.
TABLE 2: THE ALGORITHM TO GENERATE PCR FOR REGION R_t^n(x)
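The algorithm summarized in Table 2 is not reproduced in this text; a rough Python sketch of PCR generation along the lines of (1)–(5) is given below. The data layout is an assumption, and colors are merged with the distance of (3) rather than by exact equality.

```python
import numpy as np

def color_dist(c1, c2):
    """Normalized color distance of (3)."""
    return 1.0 - 2.0 * np.dot(c1, c2) / (np.dot(c1, c1) + np.dot(c2, c2) + 1e-12)

def build_pcr(image, region_mask, eps=0.005, coverage=0.90, max_colors=50):
    """Scan a foreground region and return its principal color representation.

    Returns (region_size, [(color, significance), ...]) sorted by significance,
    keeping the fewest colors whose significances cover `coverage` of the region.
    """
    pixels = image[region_mask].astype(float)     # Mx3 array of RGB values in the region
    principal = []                                # list of [color, count]
    for p in pixels:
        for entry in principal:
            if color_dist(p, entry[0]) < eps:     # delta function of (4)
                entry[1] += 1
                break
        else:
            principal.append([p.copy(), 1])
    principal.sort(key=lambda e: e[1], reverse=True)
    size, acc, kept = len(pixels), 0, []
    for color, count in principal[:max_colors]:
        kept.append((color, count))
        acc += count
        if acc >= coverage * size:                # condition (5)
            break
    return size, kept
```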
The aim of object tracking is to locate a tracked object in the coming frame according to its previous appearance. To achieve this, the likelihood, or the conditional probability of observing the tracked object in a region of the current frame, has to be evaluated. In the following, the likelihoods are first defined based on the original and normalized PCRs of the tracked object and a region. This is then extended to the scale-invariant likelihood.
Let O_{t−1}^m be the m-th tracked object described by its PCR

T_{t−1}^m = { s_m, { E_m^i = (c_m^i, s_m^i) }_{i=1}^{N_m} }

obtained previously when the tracked object was an isolated object, and R_t^n be the n-th foreground region detected at time t. According to the law of total probability, the likelihood of the object O_{t−1}^m in the region R_t^n can be defined as

P(R_t^n | O_{t−1}^m) = Σ_i P(R_t^n | E_m^i) P(E_m^i | O_{t−1}^m)    (6)

where each P(R_t^n | E_m^i) is the likelihood of the object O_{t−1}^m appearing in the region R_t^n based on the partition evidence E_m^i, and P(E_m^i | O_{t−1}^m) is the conditional probability of the evidence E_m^i given the object O_{t−1}^m. Using the PCR T_{t−1}^m of the object O_{t−1}^m, the conditional probability P(E_m^i | O_{t−1}^m) can be defined as the weight of the principal color c_m^i for its appearance,

P(E_m^i | O_{t−1}^m) = s_m^i / s_m    (7)

Using the PCRs T_{t−1}^m of the object O_{t−1}^m and T_t^n of the region R_t^n, the likelihood P(R_t^n | E_m^i) can be computed according to their significance values for the same color component c_m^i,

P(R_t^n | E_m^i) = (1/s_m^i) · min( s_m^i, Σ_j δ(c_m^i, c_n^j) s_n^j )    (8)

Substituting (7) and (8) into (6) yields

P(R_t^n | O_{t−1}^m) = (1/s_m) Σ_i min( s_m^i, Σ_j δ(c_m^i, c_n^j) s_n^j )    (9)

It is noted that the above likelihood (9) is evaluated under the assumption that the size variation of the object is small. However, if the size change is large, the likelihood value will be affected. Therefore, a definition of likelihood based on the normalized PCRs is used in the example embodiment. Let

T̂_{t−1}^m = { 1, { Ê_m^i = (c_m^i, ŝ_m^i) }_{i=1}^{N_m} }, where ŝ_m^i = s_m^i / s_m,

and

T̂_t^n = { 1, { Ê_n^j = (c_n^j, p̂_n^j) }_{j=1}^{N_n} }, with p̂_n^j = s_n^j / s_n,

be the normalized PCRs of the object and of the region R_t^n. The likelihood based on the normalized PCRs becomes

P̂(R_t^n | O_{t−1}^m) = Σ_i min( ŝ_m^i, Σ_j δ(c_m^i, c_n^j) p̂_n^j )    (10)

If the region R_t^n only contains a single object, the likelihood based on the normalized PCRs is more accurate than that based on the original PCRs. However, if O_{t−1}^m is one of the objects in the group R_t^n, the likelihood on the original PCRs is better. Hence, the scale-invariant likelihood of observing a given object O_{t−1}^m in the region R_t^n is defined as the larger of the likelihoods based on the original and the normalized PCRs, i.e.,

P*(R_t^n | O_{t−1}^m) = max( P(R_t^n | O_{t−1}^m), P̂(R_t^n | O_{t−1}^m) )    (11)

Heuristically, if the region R_t^n is a single object, the likelihood computed from the normalized PCRs appears more reliable. However, if O_{t−1}^m is one of the objects in a group R_t^n, the likelihood from the un-normalized PCRs appears better since the object is smaller than the group. Eq. (11) can provide a suitable measurement for these two cases in the example embodiment.
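A sketch of this likelihood computation from two PCRs (as produced by the build_pcr sketch above, and reusing its color_dist helper) might look as follows; the max combination of (11) reflects the reading of the text above and should be treated as an assumption.

```python
def pcr_likelihood_original(pcr_obj, pcr_region, eps=0.005):
    """Likelihood (9): how much of the object's principal color mass is found in the region."""
    size_obj, colors_obj = pcr_obj
    _, colors_region = pcr_region
    total = 0.0
    for c_m, s_m in colors_obj:
        matched = sum(s_n for c_n, s_n in colors_region if color_dist(c_m, c_n) < eps)
        total += min(s_m, matched)
    return total / max(size_obj, 1)

def pcr_likelihood_normalized(pcr_obj, pcr_region, eps=0.005):
    """Likelihood (10): same computation on significances normalized by object/region size."""
    size_obj, colors_obj = pcr_obj
    size_region, colors_region = pcr_region
    total = 0.0
    for c_m, s_m in colors_obj:
        matched = sum(s_n / size_region for c_n, s_n in colors_region
                      if color_dist(c_m, c_n) < eps)
        total += min(s_m / size_obj, matched)
    return total

def scale_invariant_likelihood(pcr_obj, pcr_region):
    """Scale-invariant likelihood (11), taken here as the larger of (9) and (10)."""
    return max(pcr_likelihood_original(pcr_obj, pcr_region),
               pcr_likelihood_normalized(pcr_obj, pcr_region))
```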
Object tracking in video surveillance aims at maintaining a unique identification (id) for each target object and providing its time-dependent positions in the scene. When multiple target objects frequently merge and separate from one another in a public site, tracking one individual object is no longer an independent process. Multi-object tracking can be formulated as a global Maximum A Posterior (MAP) problem for all the tracked objects. With the segmented foreground regions provided by background subtraction, in the example embodiment the global MAP problem can be approximately decomposed into two subproblems: assignment and location. Using the principal color representation (PCR) and the associated likelihood function, the example embodiment uses sequential solutions to these two subproblems, as detailed below.
Let O_{t−1} = {O_{t−1}^m}_{m=1}^{M_{t−1}} be the set of tracked target objects in the previous frame I_{t−1}(x), and θ_{t−1} be the set of state parameters describing their positions at time t − 1. The task of multi-object tracking is to estimate the states θ_t of the tracked objects in the current frame I_t(x) given their previous appearance models O_{t−1} and states θ_{t−1}. This can be formulated as the Maximum A Posterior (MAP) estimation for the state parameters θ_t,

θ_t* = arg max_{θ_t} P(θ_t | I_t(x), O_{t−1}, θ_{t−1})    (12)

When several objects overlap one another, they cannot be tracked as independent objects. With the foreground regions R_t = {R_t^k}_{k=1}^{K_t} provided by background subtraction, (12) can be simplified as

θ_t* = arg max_{θ_t} P(θ_t | R_t, O_{t−1}, θ_{t−1})    (13)

where the tracked objects {O_{t−1}^m}_{m=1}^{M_{t−1}} come from the previous frame. If a region (e.g., R_{t−1}^j) only contains one object (e.g., O_{t−1}^m), it is an isolated object; otherwise the region is a group. Objects belonging to different regions in the previous frame may merge into a new group region (e.g., R_t^k) in the current frame. Also, the objects in a group region (e.g., R_{t−1}^j) in the previous frame may separate into several regions in the current frame. For real-time processing with a moderate or high frame rate of image acquisition, the inter-frame movements of target objects are usually small. This implies that there is always an overlap between the regions of the same object in consecutive frames. Exploiting such a relation, the problem (13) can be further decomposed by using directed acyclic graphs (DAGs).
The directed acyclic graphs (DAGs) for the regions detected in the consecutive frames I_{t−1}(x) and I_t(x) are constructed in the following way. Let the regions from the previous and current frames be denoted as nodes and be laid out in two layers: the parent layer and the child layer. The parent layer consists of nodes representing the regions {R_{t−1}^j}_{j=1}^{K_{t−1}} in the previous frame I_{t−1}(x), and the child layer consists of nodes denoting the regions {R_t^k}_{k=1}^{K_t} in the current frame I_t(x). Suppose R_{t−1}^j and R_t^k are the j-th and k-th regions in the previous and current frames, respectively; then the directional link from R_{t−1}^j to R_t^k can be defined as

l_{jk} = 1 if (R_{t−1}^j ∩ R_t^k) ≠ ∅, and 0 otherwise    (14)

This implies that there is a link only when the two regions have some overlap. A directed acyclic graph (DAG) is formed by a set of nodes in which every node connects to one or more nodes in the same group. A set of DAGs (graphs) can be generated. An example of graphs for two consecutive frames is illustrated in Figure 4. The notations for the DAGs are defined as follows. For the i-th graph, the parent nodes are denoted as {n_i^p}_{p=1}^{P_i}, where each node n_i^p represents one of the regions {R_{t−1}^j}, and the child nodes are denoted as {n_i^q}_{q=1}^{Q_i}, where each node n_i^q represents one of the regions {R_t^k}. The i-th DAG can thus be denoted as the set of its parent nodes, child nodes, and the links between them.

If a parent node is associated with M_i^p = 1 object, then the node is a single object; otherwise it is a group of M_i^p objects. Each object o_i^{p,m} in a parent node is one of the tracked objects {O_{t−1}^m}_{m=1}^{M_{t−1}}. The objects in a child node n_i^q, which may be newly generated objects or objects tracked from its parent nodes, are denoted as {o_i^{q,m}}. After processing all graphs, the objects in the child nodes are reordered as {O_t^m}_{m=1}^{M_t}. They are the set of tracked target objects in the current frame.
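A rough sketch of building these links and grouping the regions into graphs follows; the region representation (boolean masks) and the union-find grouping are illustrative assumptions.

```python
import numpy as np

def build_graphs(prev_masks, curr_masks):
    """Group previous-frame and current-frame regions into DAGs via mask overlap (14).

    prev_masks, curr_masks : lists of boolean region masks of equal image size.
    Returns a list of graphs, each as (parent_indices, child_indices, links),
    where links is the set of (j, k) pairs whose regions overlap.
    """
    links = {(j, k)
             for j, pm in enumerate(prev_masks)
             for k, cm in enumerate(curr_masks)
             if np.any(pm & cm)}                      # l_jk = 1 iff the regions overlap

    # Union-find grouping of connected parent/child nodes into one graph each.
    parent_of = {("p", j): ("p", j) for j in range(len(prev_masks))}
    parent_of.update({("c", k): ("c", k) for k in range(len(curr_masks))})

    def find(a):
        while parent_of[a] != a:
            parent_of[a] = parent_of[parent_of[a]]    # path compression
            a = parent_of[a]
        return a

    for j, k in links:
        parent_of[find(("p", j))] = find(("c", k))

    graphs = {}
    for node in list(parent_of):
        graphs.setdefault(find(node), ([], [], set()))
        kind, idx = node
        graphs[find(node)][0 if kind == "p" else 1].append(idx)
    for j, k in links:
        graphs[find(("p", j))][2].add((j, k))
    return list(graphs.values())
```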
Since there is no link between different DAGs, the objects in the parent nodes of one graph can be tracked independently of the other graphs. Let O_{t−1}^i represent the set of objects in the parent nodes of graph G_i, i.e., O_{t−1}^i = {{o_i^{p,m}}_{m=1}^{M_i^p}}_{p=1}^{P_i}. Then the probability of the states for all the tracked objects in image I_t(x) becomes

P(θ_t | R_t, O_{t−1}, θ_{t−1}) = Π_{i=1}^{L_G} P(θ_t^i | R_t, O_{t−1}^i, θ_{t−1}^i)    (15)

where θ_t = (θ_t^1, ..., θ_t^{L_G}) and L_G is the number of DAGs. According to (15), (13) can be decomposed as finding a (θ_t^i)* for each graph such that

(θ_t^i)* = arg max_{θ_t^i} P(θ_t^i | R_t, O_{t−1}^i, θ_{t−1}^i)    (16)

If there are several parent and child nodes in a graph, and some parent nodes represent groups, (16) is still a nontrivial problem. The example embodiment solves the problem in two sequential steps from coarse to fine. The problem is decomposed approximately into two sub-problems: assignment and location. The coarse assignment process assigns each object in a parent node to one of its child nodes, while the fine location process determines the new states of the objects assigned to each child node.
Assignment:
In this step, the tracked objects in each parent node are assigned to its child nodes based on the largest posterior probability. Let θ_a^{i,p} = (θ_a^{i,p,1}, ..., θ_a^{i,p,M_i^p}) be the parameter vector describing the assignment of the tracked objects in the parent node n_i^p, where θ_a^{i,p,m} = q means that the object o_i^{p,m} is assigned to the child node n_i^q. The posterior probability of the assignment for graph G_i can be expressed as

P(θ_a^i | R_t, O_{t−1}^i, θ_{t−1}^i) = Π_{p=1}^{P_i} P(θ_a^{i,p} | R_t, O_{t−1}^{i,p}, θ_{t−1}^{i,p})    (17)

where O_{t−1}^{i,p} = {o_i^{p,m}}_{m=1}^{M_i^p} are the tracked objects in node n_i^p, and N_i^p = {n_i^q : l_{pq} = 1} are the child nodes of n_i^p. The parameter vector is θ_a^i = (θ_a^{i,1}, ..., θ_a^{i,P_i}). The best assignment for the tracked objects {o_i^{p,m}}_{m=1}^{M_i^p} in n_i^p is chosen such that it results in the best observation of the objects in the corresponding child nodes, that is

(θ_a^{i,p})* = arg max_{θ_a^{i,p}} P(θ_a^{i,p} | R_t, O_{t−1}^{i,p}, θ_{t−1}^{i,p})    (18)

Here θ_a^{i,p} can be considered as the coarse tracking parameters indicating in which child nodes (regions) the objects are observed, without concern for the exact new positions of the tracked objects in the child regions.
Location:
In this step, the new states of the tracked objects assigned to each child node (e.g., region R_t^k) are determined. Let O_t^{i,q} = {o_i^{q,m}}_{m=1}^{M_i^q} be the objects assigned to the child node n_i^q from its parent nodes. That is, O_t^{i,q} is a subset of O_{t−1}^i according to the assignment parameters (θ_a^i)*. After the assignment, objects in each child node can be tracked independently of objects in the other child nodes. Hence, the posterior probability of the new states for the tracked objects in the graph G_i can be evaluated as

P(θ_l^i | R_t, O_t^i, θ_{t−1}^i) = Π_{q=1}^{Q_i} P(θ_l^{i,q} | R_t, O_t^{i,q}, θ_{t−1}^{i,q})    (19)

and the new states of the objects in each child node are determined by

(θ_l^{i,q})* = arg max_{θ_l^{i,q}} P(θ_l^{i,q} | R_t, O_t^{i,q}, θ_{t−1}^{i,q})    (20)
Multi-object tracking thus becomes finding the solutions for Eqs. (18) and (20) in the example embodiment. Further sequential solutions for (18) and (20) based on PCR are used and described below.
Assuming that {R_t^k}_{k=1}^{K_t} are the foreground regions and {B_t^k}_{k=1}^{K_t} are their bounding boxes detected at time t, their PCRs can be obtained as {T_t^k}_{k=1}^{K_t}. Let {G_i}_{i=1}^{L_G} be the set of directed acyclic graphs (DAGs) for the foreground regions between the consecutive frames. If there is only one object in a graph G_i, then the object will be tracked as an isolated object. Otherwise, multi-object tracking will be performed according to Eqs. (18) and (20). For tracking multiple objects in a group, the posterior probability of the new state for each object is determined on both the spatial position and the depth relationship. Hence, a 2½D state is used for each object. The state vector for an object O_t^m is θ_t^m = (b_t^m, v_t^m), where b_t^m is the bounding box describing its spatial position and v_t^m is the likelihood value describing its depth position in the group.
Tracking Isolated Objects
If the i-th graph consists of only one child node (i.e., G_i = (n_i^1)), a new object appears and is initialized as o_t^l in G_i with a new id number. Suppose the node n_i^1 represents the region R_t^k; then the PCR and bounding box of o_t^l are set as T_t^l = T_t^k and b_t^l = B_t^k. Since o_t^l is an isolated object, it is not occluded by any other objects. The depth state is set as v_t^l = 1. Its 2½D state parameter vector is θ_t^l = (b_t^l, v_t^l).
If the i-th graph contains one parent node and one child node, and the parent node is associated with one object, the graph represents the simple case of isolated object tracking. Let the graph be G_i = (n_i^p, n_i^q, l_{pq} = 1), the object in the parent node be O_{t−1}^m, and the child node n_i^q represent the region R_t^k; then the object O_{t−1}^m is updated as o_t^m in n_i^q (i.e., O_{t−1}^m and o_t^m have the same id number). Its state becomes θ_t^m = (b_t^m, v_t^m) with b_t^m = B_t^k and v_t^m = 1. In addition, its PCR is updated as T_t^m = T_t^k to follow the gradual variation of the object.
If the ith graph only contains one parent node which has no child nodes, then the previous objects in the parent node are assumed to have disappeared in the current frame. Tracking is terminated for these objects.
Tracking Multiple Objects in a Graph
If the ith graph G_i contains multiple parent nodes or child nodes, the operations of assignment and location will be performed. In the following description of the operations for one graph, the index i for the graph G_i is omitted for notational convenience.
Assignment:
Let n_0^p be a parent node in the graph G, let O_{t-1}^p = {o_{t-1}^m}_{m=1}^{M_p} be the associated objects, and let N_1^p be its child nodes. If the parent node has more than one child node, the assignment of the objects O_{t-1}^p is determined by Eq. (18). However, with varying numbers of objects and child nodes, Eq. (18) is a nontrivial problem of optimal configuration. To make the problem tractable, a sequential solution is proposed based on the objects' PCRs and the depth relations among the objects.
In each group, the close and non-occluded objects have richer visible information than the distant or occluded objects. This means that an occluded object has less effect on the tracking of the objects occluding it. Hence, the assignment can be solved sequentially from the most visible object to the least visible one. Let the objects {o_{t-1}^m}_{m=1}^{M_p} in the parent node n_0^p be ordered according to their visible sizes. Assuming that the correct assignment of the object o_{t-1}^m is θ^m = q_m, which assigns o_{t-1}^m to the child node n_1^{q_m} (n_1^{q_m} ∈ N_1^p), and that the child node n_1^{q_m} represents the region R_t^{q_m}, then the posterior probability of the assignment for the objects O_{t-1}^p = {o_{t-1}^m}_{m=1}^{M_p} is computed as in (21), where R_t^{q_m}(m-1) represents the region after excluding the objects previously assigned to it before o_{t-1}^m. Note that the assignments of the objects are not independent: the assignment of one object is affected by the previously assigned objects with higher ranks. This means the assignment of each object can be performed one-by-one sequentially from the most to the least visible one. For each object, the posterior probability of assignment can be evaluated using Bayes' rule, as in (22).
The first term on the right-hand side of (22) is the likelihood of observing the object o_{t-1}^m in the region R_t^{q_m} with the exclusion of previously assigned objects, while the second term is the prior probability of θ^m = q_m given the previous state θ_{t-1}^m. For assignment, (22) can be evaluated on PCR. Assuming that a child node corresponds to the bounding box B_t^{q_m}, let T_t^{q_m} and T_{t-1}^m be the PCRs of B_t^{q_m} and o_{t-1}^m, respectively. Using Eqs. (21) and (22), the best assignment of the objects can be achieved one-by-one sequentially according to their depth order by Eq. (23).
The sequential solution to Eq. (18) using Eq. (23) is computed in two steps. First, the objects in the parent node n_0^p are sorted in a list according to their visible parts. An iterative process is then performed from the most visible to the least visible object. In each iteration, the object at the top of the list is assigned to one of the child nodes according to (23). Once an object is assigned to a child node, it is removed from the list and its visual evidence is excluded from the PCR of the child region. Details are described below.
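As an illustration only, the following Python sketch outlines this sequential assignment step. The helper functions pcr_match_likelihood (standing in for the PCR-based evaluation of Eq. (23)) and exclude_from_pcr (removing an assigned object's colour evidence from a child region's PCR) are assumed, hypothetical names and not part of the example embodiment itself.

```python
def assign_objects_to_children(parent_objects, child_regions,
                               pcr_match_likelihood, exclude_from_pcr):
    """Sequentially assign the tracked objects of one parent node to its
    child regions, from the most visible object to the least visible one.

    parent_objects : list of dicts with keys 'id', 'visible_size', 'pcr'
    child_regions  : dict {child_id: region_pcr}
    The two callables are assumed helpers (see lead-in above).
    """
    # Most visible (least occluded) objects are assigned first.
    ordered = sorted(parent_objects, key=lambda o: o['visible_size'], reverse=True)
    assignment = {}
    for obj in ordered:
        # Pick the child region whose remaining colour evidence best matches
        # the object's principal colours.
        best_child = max(child_regions,
                         key=lambda cid: pcr_match_likelihood(obj['pcr'],
                                                              child_regions[cid]))
        assignment[obj['id']] = best_child
        # Exclude the assigned object's evidence so that later (more occluded)
        # objects are matched against what is left of the region.
        child_regions[best_child] = exclude_from_pcr(child_regions[best_child],
                                                     obj['pcr'])
    return assignment
```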
Location:
The locations of the objects in a region are not independent of each other, but the front objects with richer visible information are less affected by the occluded ones. Hence, in the example embodiment, objects in the node are located one by one from the most visible to the least visible based on their visible parts. The posterior probability of the new states for all the objects in the node can be expressed as
(25), where Θ_t^q = (θ_t^1, ..., θ_t^{N_q}), R_t^q is the region associated with the node n_1^q, and R_t^q(n-1) represents the region in which the visual evidence of the first n-1 objects (o_t^1, ..., o_t^{n-1}) has been excluded at the located positions. According to (25), locating the objects O_t^q according to (20) is equivalent to locating them one by one sequentially according to

(θ_t^n)* = argmax P(θ_t^n | R_t^q(n-1), o_{t-1}^n, θ_{t-1}^n)    (26)

where {o_t^n}_{n=1}^{N_q} are sorted in descending order according to their visible sizes.
The sequential solution to the problem of Eqs. (20) and (26) contains two steps. In the first step, the visible parts of the objects in the node are estimated, and the objects are sorted according to their visible sizes. In the second step, an iterative process is applied to locate the objects one-by-one in the region with a mean-shift algorithm based on PCR. When an object is located, its visual evidence is excluded from its position in the region. The details are described in the following.
Assume that an object o_t^n in the child node n_1^q is assigned from the parent node n_0^p, so that it corresponds to an object o_{t-1}^m. Let {v_{t-1}^n}_{n=1}^{N_q} be the likelihoods of {o_{t-1}^n}_{n=1}^{N_q} computed from the previous frame. The likelihood of observing object o_t^n in the child node n_1^q (or region R_t^q) in the current frame can be evaluated according to Eq. (11). Since the motion of an object between two consecutive frames is assumed small, the visible part ζ_t^n of the object o_t^n in region R_t^q can be estimated as in (27), where S_n and S_k are the sizes of the object o_t^n and the region R_t^q, respectively. In Eq. (27), η is a weight to smooth the estimates from consecutive frames (η = 0.5 is chosen in this study). The objects {o_t^n}_{n=1}^{N_q} are then sorted in descending order according to the values of {ζ_t^n}_{n=1}^{N_q} and placed in a list. To perform the exclusion for R_t^q(n-1), a weight image ω_{n-1}(x) is used. If the pixel x is likely to belong to one of the previously located objects (o_t^1, ..., o_t^{n-1}), ω_{n-1}(x) is low (≈ 0); otherwise, it is high (≈ 1). For initialization, set ω_0(x) = 1 for all the pixels belonging to the region R_t^q, and ω_0(x) = 0 otherwise.
In each iteration, the top object in the list is popped. Assume that it is the nth object o_t^n, with its initial position represented by the previous bounding box B^{(0)} = b_{t-1}^n centered at x^{(0)}, and its PCR T_{t-1}^n.
Locating the object o_t^n in the region R_t^q according to Eq. (26) is equivalent to finding the position where the maximum probability density for observing the object occurs. This density maximum can be found by employing a mean-shift procedure with a weight mode which can reveal the probability density of observing the object in the neighborhood [5], [6], [7]. A two-stage mean-shift procedure is proposed based on the evidence of the object's principal colors. In the first stage, the gravity center of the pixels of each principal color component is computed as in (28), with j = 1, ..., N, where r indicates the current step of the mean-shift iteration. In the second stage, the new position of the object o_t^n is generated as the weighted average of the gravity centers,
as given by (29), where the weight of evidence from a principal color c_u^n is defined as its backprojection. This weight implies that if the candidate region contains more pixels of color c_u^n than the object can account for, then only the corresponding proportion of the pixels with color c_u^n in the bounding box belong to the object o_t^n; otherwise, all the pixels of that color belong to the object. The mean-shift procedure is terminated once the shift of the position between successive iterations becomes negligible, or the maximum number of iterations is reached (6 in the example). The new location of the object o_t^n is the bounding box b_t^n = B^{(r+1)} centered at x^{(r+1)}. Let T_t^n be the PCR of the part of the region within the bounding box b_t^n, where s_n is the size of that part. The likelihood v_t^n of the object o_t^n in the group can be estimated according to Eq. (11). The new state parameter of the object o_t^n is then obtained as θ_t^n = (b_t^n, v_t^n).
The last operation of the iteration is exclusion, which removes the visual information of the tracked object o_t^n from the region R_t^q at its position, i.e., obtains R_t^q(n). This operation is done by updating the weight image ω_n(x). For a pixel x within the located bounding box b_t^n with color c_x, the significance of c_x can be obtained in both the PCR of the object and the PCR of the part of the region within b_t^n, denoted s_n(c_x) and ŝ(c_x) respectively. Let us define Δω(x) = min(1, s_n(c_x)/ŝ(c_x)). The value of Δω(x) can be considered as the probability of x belonging to the object o_t^n. Hence, the updating of the weight image can be performed as in (31).
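The following Python sketch illustrates one plausible reading of this location step. The box-based colour counting, the backprojection cap min(1, expected/found), and the down-weighting ω_n(x) = ω_{n-1}(x)(1 - Δω(x)) are assumptions made for illustration; they are not the literal equations (28), (29) and (31) of the example embodiment.

```python
import numpy as np

def locate_objects(objects, label_img, weight, max_iter=6, eps=1.0):
    """Locate the objects assigned to one child node one-by-one, most visible
    first, with a two-stage mean-shift on principal colours and exclusion.

    objects   : list of dicts with 'visible', 'center' (x, y), 'half' (hw, hh)
                and 'pcr' = {colour_label: expected_pixel_count}
    label_img : 2-D int array of per-pixel principal-colour labels
    weight    : 2-D float array in [0, 1]; ~1 where evidence is unexplained
    """
    h, w = label_img.shape
    for obj in sorted(objects, key=lambda o: o['visible'], reverse=True):
        cx, cy = obj['center']
        hw, hh = obj['half']
        for _ in range(max_iter):
            x0, x1 = max(0, int(cx - hw)), min(w, int(cx + hw) + 1)
            y0, y1 = max(0, int(cy - hh)), min(h, int(cy + hh) + 1)
            box_lab = label_img[y0:y1, x0:x1]
            box_wgt = weight[y0:y1, x0:x1]
            ys, xs = np.mgrid[y0:y1, x0:x1]
            num_x = num_y = den = 0.0
            for col, expected in obj['pcr'].items():
                mask = (box_lab == col) * box_wgt   # unexplained evidence of this colour
                found = mask.sum()
                if found <= 0:
                    continue
                # Stage 1: gravity centre of the pixels of this principal colour.
                gx, gy = (xs * mask).sum() / found, (ys * mask).sum() / found
                # Backprojection-style cap: the box cannot contribute more
                # evidence for this colour than the object can account for.
                contrib = min(1.0, expected / found) * found
                num_x += contrib * gx
                num_y += contrib * gy
                den += contrib
            if den == 0:
                break
            # Stage 2: new centre = weighted average of the gravity centres.
            nx, ny = num_x / den, num_y / den
            shift = np.hypot(nx - cx, ny - cy)
            cx, cy = nx, ny
            if shift < eps:
                break
        obj['center'] = (cx, cy)
        # Exclusion: down-weight the located object's evidence so that the
        # remaining (more occluded) objects are located against what is left.
        x0, x1 = max(0, int(cx - hw)), min(w, int(cx + hw) + 1)
        y0, y1 = max(0, int(cy - hh)), min(h, int(cy + hh) + 1)
        for col, expected in obj['pcr'].items():
            mask = label_img[y0:y1, x0:x1] == col
            found = int(mask.sum())
            if found > 0:
                d_omega = min(1.0, expected / float(found))   # Δω as defined above
                weight[y0:y1, x0:x1][mask] *= (1.0 - d_omega)
    return objects
```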
The complete algorithm of multi-object tracking based on PCR in the example embodiment is summarized in Table 3.
TABLE 3
THE SUMMARY OF THE MULTI-OBJECT TRACKING ALGORITHM
Input: color image I_t(x) and segmented image S_t(x);
Preprocessing: generate graphs (G_i, i = 1, ..., L_G);
For G_i, i = 1, ..., L_G, do:
Assignment: for each parent node n_0^p in G_i, p = 1, ..., M_p, do:
a.1: if n_0^p has no child node, the objects in it are deleted;
a.2: if n_0^p has only one child node n_1^q, all the objects in it are assigned to n_1^q;
a.3: if n_0^p has multiple child nodes:
a.3.1: sort the objects {o_{t-1}^m} in n_0^p, and then assign them one-by-one from the first to the last as follows:
a.3.1.1: assign o_{t-1}^m to the child node n_1^{q_m} according to (23);
a.3.1.2: exclude the visual information of o_{t-1}^m from the PCR of n_1^{q_m};
Location: for each child node n_1^q in G_i, q = 1, ..., M_q, do:
1.1: if no object is assigned to the node, check if it is a disappeared object; if not, set it as a new object;
1.2: if only one object is assigned to the node, update the state and PCR of the object;
1.3: if multiple objects are assigned to the node:
1.3.1: sort the objects {o_t^n} in the node using (27), and then locate the objects one-by-one as follows:
1.3.1.1: apply mean-shift to locate o_t^n using (28) and (29);
1.3.1.2: exclude the visual evidence of o_t^n at its location in R_t^q using (31);
1.3.2: if the likelihood of observing an object in the region is less than 0.1, set the object as disappeared.
Clearance: if an object has disappeared for more than 50 frames, delete the object.
End
The algorithm in the example embodiment includes two phases of processing for each DAG (Directed Acyclic Graph): assignment and location. In the assignment phase, each parent node in the DAG is processed. In the location phase, the assigned objects in each child node are tracked. To be robust to the separation of small parts from a tracked object due to segmentation errors, small objects in a group with likelihood values less than 0.1 are set as disappeared. To prevent losing small or heavily occluded objects in a group, the records of disappeared objects are kept for 50 frames. When a new object is detected, it is compared with the disappeared objects according to their PCRs, sizes and distances. If it matches a disappeared object, the tracking will be restored; otherwise a new object is created.
In the example embodiment, segmenting individual persons in a group with domain knowledge is preferred. For example, in the example embodiment knowledge about the sizes and aspect ratios of persons in the scene is used to adapt to segmentation errors.
Figure 5 shows a flow chart 500 illustrating a method of multi-object tracking in a video signal in the example embodiment. At step 502, first and second segmented images of two consecutive frames of the video signal respectively are received, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked. At step 504, one or more directed acyclic graphs (DAGs) are generated for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node. At step 506, for each parent node having two or more child nodes, a) the corresponding objects of the foreground region contributing to said each parent node are sorted according to estimated depth in said first image; b) the corresponding object having the lowest depth is assigned to one of the child nodes of said each parent node; c) a visual content of the assigned corresponding object is removed from the visual data associated with said one child node; and steps b) to c) are iterated in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
At step 508, for each child node having only one corresponding object assigned thereto, a state and the visual content of said one object are updated based on the second image. At step 510, for each child node having two or more corresponding objects assigned thereto, d) the corresponding objects are sorted according to estimated depth in said each child node in said second image; e) a mean-shift calculation is applied to locate the corresponding object having the lowest depth in said each child node; f) the state and the visual content of the located corresponding object are updated based on the second image; g) the updated visual content of the located corresponding object is removed from the visual data associated with said each child node; and steps e) to g) are iterated in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
When an object stops moving and stays in the same position in the scene for a while, the object would gradually be absorbed into the background with existing background updating techniques. That means the object would be lost in the segmented foreground images. On the other hand, in e.g. crowded scenes, if one can separate the moving objects from the stationary objects in the scene, one can reduce the overlapping of multiple foreground objects. This would make the tracking of each individual easier and more robust. In the described example embodiment, a layer tracking algorithm is designed to track stationary objects through even frequent occlusions. When the object starts moving, the object is identified as a moving object and tracked by a moving object tracking algorithm. In the example embodiment, the stationary objects include not only static non-living objects but also motionless living objects, e.g. a standing or sitting person. Since the living objects may move again, the switching between moving object tracking and stationary object tracking for the target object is preferably smooth with no change of identity in the example embodiment.
When an object stops moving and stays in the scene over a number of frames of a video signal, the appearance variation of the object is typically small through the sequence of frames. A template image of the object is used to represent such a stationary object in the example embodiment.
Let {B_j^i} be the sequence of bounding boxes of the ith tracked object in the τ_b most recent frames, as tracked by a moving object tracking algorithm. If the object has stopped moving, the bounding boxes will overlap each other. For a selected length parameter τ_b, if the spatial intersection of all the boxes is not empty, the object is detected as a stationary object in the example embodiment. In the example embodiment, but not limiting, τ_b is set as 10 frames, corresponding to about 1 second. To track the stationary object in e.g. a busy site in which the object may be occluded frequently by moving foreground objects, a layer representation based on the object's template image is built. The layer representation of the detected stationary object comprises the template image A^i(s) of the object and the Principal Colour Representation (PCR) of the object stored when the object was detected as a stationary object. That is, the template image is based at least on the last frame of the sequence used in detecting the object as a stationary object. The layer representation further records d_t^j, the difference measure between the template A^i and the frame I_j(s) for the corresponding region of A^i; d_c^j, the difference measure between the consecutive frames I_{j-1}(s) and I_j(s) for the region of the template; d_p^j, the visibility measure of the object from the corresponding region in the frame I_j(s); and s_k, an estimated state of the stationary object at time k. Measures in the τ_d most recent frames and states in the τ_s most recent frames are recorded. The details of the calculation of these measures and of estimating states from these measures for each layer object are described below.
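As a minimal sketch of the stationarity test above (assuming axis-aligned boxes stored as (x1, y1, x2, y2) tuples, which is an assumption of this illustration):

```python
def is_stationary(boxes, tau_b=10):
    """Return True if the last tau_b bounding boxes of a track share a
    non-empty spatial intersection, i.e. the object has stopped moving."""
    if len(boxes) < tau_b:
        return False                          # not enough history yet
    recent = boxes[-tau_b:]
    # Intersection of axis-aligned boxes: max of the mins, min of the maxes.
    x1 = max(b[0] for b in recent)
    y1 = max(b[1] for b in recent)
    x2 = min(b[2] for b in recent)
    y2 = min(b[3] for b in recent)
    return x2 > x1 and y2 > y1
```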
In e.g. a busy public site, if there are objects staying in the scene, they will often merge with moving objects, resulting in high complexity for object tracking. By separating the layer (stationary) objects from moving objects and tracking the stationary and moving objects separately, the example embodiment can greatly enhance object tracking. Let c = I_t(s) be the color of a foreground point in the region of the ith template image. According to Bayes' rule, the probability of the point belonging to the background is

p(b | c) = p(c | b) p(b) / p(c)    (2b)
where p(c | b) can be obtained from the Principal Feature Representation (PFR) of the background. The PFR at each pixel is used to characterize the background. Let s = (x, y) be a pixel of the image. For each type of feature, a table which records the principal feature vectors and their statistics at s is built, where S_v(i) records the statistics of the M_v most frequent feature vectors at s. Each S_v(i) contains three components:

S_v(i) = { p_{v_i}, p_{v_i|b} = P_s(v_i | b), v_i = (v_{i1}, ..., v_{iD_v}) }    (4b)

where p_{v_i} is the learned frequency of the feature vector v_i at s, p_{v_i|b} is its learned probability conditioned on s belonging to the background, and D_v is the dimension of the vector v. The elements S_v(i) in the table are sorted in descending order with respect to the value p_{v_i}. Hence, the first N_v elements are used as the principal features. Three types of features are used in the example embodiment: a spectral feature (color), a spatial feature (gradient), and a temporal feature (color co-occurrence). Among them, the color and gradient features are stable for static background parts, and the color co-occurrence features are suitable for dynamic background parts. Three tables are used to learn the possible principal features of the three types for the background: T_c(s), T_e(s), and T_cc(s). The color vector is c = (R_t, G_t, B_t) from the input color frame. The gradient vector is e = (g_x, g_y) obtained by the Sobel operator. The color co-occurrence vector is cc = (R_{t-1}, G_{t-1}, B_{t-1}, R_t, G_t, B_t) with 32 levels for each color component.
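A rough sketch of such a per-pixel PFR table is given below; the field names, the capacity value, and the exact meaning of the stored probabilities are illustrative assumptions rather than the definitive data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FeatureStat:
    p_v: float               # learned frequency of this feature vector at pixel s
    p_vb: float              # learned probability associated with the background
    vec: Tuple[float, ...]   # the feature vector itself (colour, gradient, ...)

@dataclass
class PixelPFR:
    """One table per feature type (colour, gradient, colour co-occurrence)
    kept at each pixel; entries are sorted by p_v so that the first N_v
    elements act as the principal features of the background."""
    max_entries: int = 5                              # M_v (illustrative value)
    stats: List[FeatureStat] = field(default_factory=list)

    def principal(self, n_v: int) -> List[FeatureStat]:
        self.stats.sort(key=lambda e: e.p_v, reverse=True)
        return self.stats[:n_v]
```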
The probability of the pixel becoming a background point at the current frame can be calculated as
p(b) = N_s(b) / M_s    (5b)
where N_s(b) is the number of background points within a small window W_s centered at s in the previous frame, and M_s is the number of points within the window. Similarly, the probabilities of s belonging to the layer (stationary) object or to a moving foreground object are
p(l | c) = p(c | l) p(l) / p(c)  and  p(f | c) = p(c | f) p(f) / p(c)    (6b)
respectively. The probabilities p(c | l) and p(c | f) can be calculated with Gaussian kernels. Let c_x^l be the color of a point x in the template A^i within the window W_s. Then p(c | l) can be calculated as
p(c | l) = max_{x ∈ W_s} { k_c(c_x^l - c) k_s(x - s) }    (7b)

where k_c and k_s are Gaussian kernels for the color and spatial vectors, respectively. Again, let c_x^f be the color of a point x in the window W_s and in the region of moving foreground objects from the last frame I_{t-1}(s). The probability p(c | f) can be calculated as
p(c | f) = max_{x ∈ W_s} { k_c(c_x^f - c) k_s(x - s) }    (8b)
The priors can be calculated as
where N_s(l) and N_s(f) are the numbers of points belonging to the layer object and to the moving objects within the window W_s in the previous frame.
Comparing Eqs. (2b) and (6b), it can be seen that p(c) is a common normalization factor. Hence, the likelihoods of s belonging to the background, the layer object, or the moving object can be defined as
p'(b | c) = p(c | b) p(b),  p'(l | c) = p(c | l) p(l),  and  p'(f | c) = p(c | f) p(f)    (10b)
respectively. The pixel s would be assigned according to the greatest likelihood value. The mask for the moving objects is used as the input for moving object tracking.
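A minimal sketch of this per-pixel decision, directly following Eq. (10b) (the argument names are illustrative):

```python
def classify_pixel(p_c_given_b, p_c_given_l, p_c_given_f, p_b, p_l, p_f):
    """Assign a foreground pixel to the background, the layer (stationary)
    object, or a moving object by the largest likelihood of Eq. (10b)."""
    scores = {
        'background': p_c_given_b * p_b,   # p'(b | c)
        'layer':      p_c_given_l * p_l,   # p'(l | c)
        'moving':     p_c_given_f * p_f,   # p'(f | c)
    }
    return max(scores, key=scores.get)
```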
Stationary objects may also be involved in several changes and interactions with other objects through the sequence. A non-living object may e.g. undergo illumination changes, or be occluded or removed by other objects. A living object may change pose or move body parts, or start moving again. During tracking of the stationary object, the object's states are estimated and the template image is updated correspondingly in the example embodiment. In the example embodiment, five states are used to describe the layer object: motionless, occluded, removed, inner-motion, and start-moving. The state is estimated according to various change measures from a short sequence of the most recent frames.
Let s be a point in the template A^i(s) of the ith layer object. The difference between the template and a current frame at s can be evaluated as
where Th_d is a threshold set according to the image noise. Then, the difference measure between the template and the current frame for the layer object is defined as
where S_{A^i} is the size of the template.
Similarly, for a point s in the template A^i(s), the difference between consecutive frames at the point is evaluated as
The difference measures are calculated on color vectors.
If the changes over the region of the template are caused by motion of the object itself, then even if the differences d_t^j and d_c^j are large, the visibility measure d_p^j of the object in the current frame based on PCR would still be high, since the PCR is a global representation not related to spatial information. On the other hand, if the changes are caused by occlusion by other objects, the visibility of the layer object in the current frame would be low. Let T^i be the PCR of the layer object that was stored when the object was detected as a stationary object, and let T_t^i be the PCR from the region overlapped by the template A^i in the current frame. Then the visibility measure of the layer object in the current frame can be evaluated as
d_p^j = P(T_t^i | T^i). More particularly, let O^{t-1} be an object in I_{t-1}(s), and O_n^t be a region in I_t(s). According to Bayes' law, the probability of observing O^{t-1} in O_n^t can be computed as
P(O_n^t | O^{t-1}) = Σ_{m=1}^{N} P(O_n^t | E_m^l) P(E_m^l | O^{t-1})    (15b)
From the definition of PCR, the significance of c_m^l for O^{t-1} is P(E_m^l | O^{t-1}) = p_m^l / p^l, and the likelihood of observing O^{t-1} in O_n^t according to the evidence of c_m^l is

P(O_n^t | E_m^l) = min{ p_m^l, p_{c,m} } / p_m^l    (16b)

where p_{c,m} is the significance of c_m^l from the region O_n^t, and C(c_m^l) denotes the subset of the pixels of O_n^t whose colors match c_m^l.
With the change measures evaluated above over a short sequence of the τ_d most recent frames (i.e. image frames from I_{t-τ_d}(x) to I_t(x)), with τ_d normally set to 10 frames in the example embodiment, the states of the tracked layer object are estimated by heuristic rules in the example embodiment:
Rule 1: motionless: If both d_t^j and d_c^j are low through the sequence, the layer object is motionless;
Rule 2: occluded: If both d_t^j and d_c^j turn moderate or high and d_p^j turns low through the sequence, and there are moving objects overlapping the region of the template A^i, as determined from the bounding boxes of such moving objects in the moving object tracking algorithm applied, the layer object is occluded;
Rule 3: removed: If both d_t^j and d_c^j turn high and d_p^j turns low, and then d_c^j turns low through the sequence with no moving object overlapping the region of the template, the layer object is removed;
Rule 4: inner-motion: If both d_t^j and d_c^j turn moderate and then d_c^j turns low through the sequence, while d_p^j remains high, this means the layer object has changed its pose or moved part of its body but still stays there;
Rule 5: start-moving: If both d_t^j and d_c^j turn and remain moderate, and d_p^j remains high through the sequence, and there is a shift of the layer object, this means the layer object has started moving again.
The parameters for the rules are determined according to a knowledge base of human-perceived semantic meanings and an evaluation on real-world videos in the example embodiment. In the example embodiment, but not limiting, for the above rules the difference measures d_t^j and d_c^j are low if they are less than 0.25, moderate if they are within (0.25, 0.75), and high if they are larger than 0.75. The visibility measure d_p^j is low if it is less than 0.6; otherwise, it is high. The measure of shape shift is calculated by checking the expanding foreground pixels along the boundary of the template A^i. If the number of expanded pixels is larger than 50% of the template size, a "shift" of the object is detected. It will be appreciated that for some videos from specific cameras, e.g. cameras with unstable signals, adjustment of the thresholds may be required in different embodiments, based on the relevant knowledge base.
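The sketch below collapses Rules 1-5 into a single-frame check using the example thresholds; the real rules are evaluated over a short sequence of frames, so this is only an illustrative simplification.

```python
def estimate_layer_state(d_t, d_c, d_p, overlapped_by_mover, shape_shift):
    """Classify a layer (stationary) object from its change measures.

    d_t, d_c : template-vs-frame and frame-vs-frame differences in [0, 1]
    d_p      : PCR-based visibility measure in [0, 1]
    overlapped_by_mover : a tracked moving object overlaps the template
    shape_shift         : expanded boundary pixels exceed 50% of template size
    """
    def level(d):
        return 'low' if d < 0.25 else ('high' if d > 0.75 else 'moderate')

    visible = d_p >= 0.6
    if level(d_t) == 'low' and level(d_c) == 'low':
        return 'motionless'                                 # Rule 1
    if not visible and overlapped_by_mover:
        return 'occluded'                                   # Rule 2
    if level(d_t) == 'high' and not visible:
        return 'removed'                                    # Rule 3 (no overlapping mover)
    if visible and shape_shift:
        return 'start-moving'                               # Rule 5
    if visible:
        return 'inner-motion'                               # Rule 4
    return 'occluded'                                       # conservative fallback
```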
To track the layer object more robustly in the example embodiment, the layer model is maintained to adapt to real variations of the object without being affected by other objects in the scene. The five most recent states for each layer object (τs = 5 ) are recorded. However, it will be appreciated that other values may be used in different embodiments. If one state has more than 3 supports, the state is confirmed. For the corresponding state, the following updating is performed.
If the layer object is confirmed as being motionless, a smoothing operation is performed on the template image. If the object is recognized as being in the inner-motion state, the new image of the object in the current frame will replace the template. If the object is occluded, no updating will be performed. If the object is classified as start-moving, the object will be transformed into a moving object with the same ID and corresponding PCR, mask, and position for tracking by a moving object tracking algorithm, and the layer representation of the object will be deleted. If the object is detected as removed, the object will be transformed into a disappeared object and its layer representation will be destroyed. With these operations, a target object that moves around, stays somewhere for a while, and moves again can be tracked continuously and seamlessly by combining the example embodiment with the moving object tracking algorithm described for the example embodiment.
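A compact sketch of this maintenance logic is given below; the dictionary fields, the smoothing rate, and the flag names are assumptions for illustration only.

```python
from collections import Counter

def confirm_state(recent_states):
    """recent_states: the last tau_s = 5 state estimates of one layer object.
    A state is confirmed only if it has more than 3 supports."""
    state, support = Counter(recent_states).most_common(1)[0]
    return state if support > 3 else None

def update_layer_model(layer, confirmed_state, current_patch):
    """Apply the per-state maintenance described above (layer is a dict holding
    the template image and book-keeping flags; current_patch is the image
    region under the template in the current frame, as a numpy array)."""
    if confirmed_state == 'motionless':
        # Smooth the template towards the current appearance (rate illustrative).
        layer['template'] = 0.9 * layer['template'] + 0.1 * current_patch
    elif confirmed_state == 'inner-motion':
        layer['template'] = current_patch              # replace with the new pose
    elif confirmed_state == 'occluded':
        pass                                           # keep the template untouched
    elif confirmed_state == 'start-moving':
        layer['hand_over_to_moving_tracker'] = True    # same ID, PCR, mask, position
        layer['delete'] = True
    elif confirmed_state == 'removed':
        layer['disappeared'] = True
        layer['delete'] = True
    return layer
```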
Figure 6 shows a flow chart 600 illustrating a method of object tracking in a video signal according to the example embodiment. At step 602, it is detected that a tracked moving object has become stationary over a sequence of frames. At step 604, a template image of the stationary object is generated based on at least one of the frames in the sequence. At step 606, a state of the stationary object is tracked based on a comparison of the template image with a current frame of the video signal.
Event detection:
The structure diagram of an event detection system 700 implementation incorporating the described example embodiment is shown in Figure 7. It contains four fundamental modules: a foreground segmentation module 701, a moving object tracking module 702, a stationary object tracking module 704, and an event detection module 706.
The foreground segmentation module 701 performs the background subtraction and learning and includes the method and system for background updating of the example embodiment described above, applied to e.g. the adaptive background subtraction method proposed in [8]. The background model used in the example implementations employs Principal Feature Representation (PFR) at each pixel to characterize background appearance.
The moving objects are tracked with the deterministic 2.5D multi-object tracking algorithm of the described example embodiment in the moving object tracking module 702. As described above, to deal with large variations of target objects in shape and scale as well as complex occlusions, moving objects are represented by principal color representation models, which exploit a few most significant colors and their statistics to characterize the appearance of each tracked object. When a tracked object has been detected as having stopped moving, a layer representation, or template, for the object is established, and the object will be tracked by the stationary object tracking module 704 using the method and system of the described example embodiment. At each time step, the states of the templates for the objects are estimated with fuzzy reasoning. The template for one object may shift between five states: motionless, interior motion, occluded, starting to move, and removed. When a template for an object is detected as starting to move, the template for the object will be deleted and the object will be converted to a moving object and then tracked by the moving object tracking module 702.
In the event detection module 706, semantic models based on Finite State Machines (FSM) are designed to detect suspected scenarios. In the system 700 of the example
implementation, four types of unusual events are detected: unattended objects, theft, loitering persons, and unattended vehicles or unconscious persons.
An "event" is an abstract symbolic concept of what has happened in the scene. It is the semantic level description of the spatio-temporal concatenation of movements and actions of interesting objects in the scene. Event detection in video understanding is a high level procedure which identifies specific events by interpreting the sequences of observed perceptual features from inteπnediate level processing. It is a step that bridges the numerical level and the symbolic level. The fundamental part of event detection is event modeling. For an event, the model is determined by the task and the different instantiations. There are generally two issues for event modeling. One is to select an appropriate representation model, or formal language, and the other is to derive the descriptors for the interesting events with the model.
In implementations based on the described example embodiment, unusual events are described by the spatio-temporal evolution of objects' states, movements, and actions. On a semantic level, each event can be defined as a sequential succession of a few well-defined states. An event can be started at one or more initial states, and one state can then transit to the next state when new conditions are met as the scene evolves in time. When a specific state is reached, the event is declared. A state transition may also happen from an intermediate state back to a previous state if some conditions no longer hold for the state. The semantic representation can be modelled based on Finite State Machines (FSM). The FSM has at least two advantages: (1) it is explicit and natural for semantic description; (2) an FSM can readily and flexibly incorporate a variety of context information from intermediate-level processing.
Using a Finite State Machine, each specific event can be represented by a directed graph G_f = (S_f, E_f), where S_f is the set of nodes representing the states and E_f is the set of edges representing the transitions. One example of an FSM 800 is described in Figure 8. Any new object is initiated to state "0" 802 for all the events defined. This is the initial state. The FSM 800 is truly started only when some conditions are met and the active node transits to the next intermediate state, i.e., state "1" 804. There can be more than one intermediate state for the FSM 800 of an event, depending on the complexity of the event. The FSM 800 reaches the final state "End" 806 when all the conditions are met, and then the corresponding specific event is triggered. The FSM 800 is updated at each new frame. The FSM 800 can have a self-loop transition for each state. Although the FSM 800 may remain at the same state, some or all properties of the object may have changed; at the least, a time counter is incremented for each frame.
The more complicated an event, the larger N (the number of intermediate states in the FSM 800) becomes, and the greater the chance of delivering an unreliable detection result. Therefore, an important task in event modeling is to trim any unnecessary states by careful analysis and to identify the simplest event model.
The input to an FSM is the numerical perceptual features generated by the moving and stationary object tracking modules (compare 702 and 704 in Figure 7). The visual cues of each tracked object can include shape, position, motion, and relations with other objects. The visual cues in the example implementation are listed below (a sketch of a corresponding track-record structure follows the list):
- Object ID: the identity number of each tracked foreground object;
- Box: bounding box of the tracked object in current frame;
- Size: the area of the object in current frame;
- Status: indicates whether the tracked object is moving around or stationary;
- StayTime: indicates how long the object has stayed in the scene;
- InGroup: indicates whether the object is an isolated one or merged with others;
- Visibility: a measure within [0,1] indicates the degree of occlusion when overlapping with others;
- Motion: a measure within [0,1] indicates the degree of interior motion of a stationary object.
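The sketch below collects these cues into a single track-record structure as promised above; the field names and types are illustrative, not the literal interface of the modules.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TrackRecord:
    """Per-object visual cues handed from the tracking modules to the event FSMs."""
    object_id: int                      # identity number of the tracked object
    box: Tuple[int, int, int, int]      # bounding box in the current frame
    size: int                           # area of the object in the current frame
    status: str                         # 'moving' or 'stationary'
    stay_time: int                      # frames the object has stayed in the scene
    in_group: bool                      # merged with other objects or isolated
    visibility: float                   # [0, 1] degree of occlusion
    motion: float                       # [0, 1] interior motion when stationary
```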
The general processing flow for event detection in the example implementation is shown in Table III.
An advantage of the tracking modules (compare 702, 704 in Figure 7) is the capability to resume tracking of objects that are lost for a few frames. The two events, UNATTENDED OBJECT and THEFT, are directly concerned with object disappearance in the example implementation. Thus, when an active object does not appear in the track records of the current frame, one preferably determines whether it is temporarily lost or whether there is a genuine disappearance. To achieve this, a first-in-first-out (FIFO) buffer is built to contain the track records of N frames. OTracked are the track records of the previous N-th frame, and the triggered event is delayed by N frames. As such, in the example implementation it is possible to 'look forward' to check the case of disappearance of an object, with N = 30 in the example implementation. With a processing rate of 8 frames/sec or above, this represents a delay of less than 4 seconds. It will be appreciated that the delay can be balanced against the accuracy of detection in different implementations.
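A minimal sketch of this delayed look-up follows, assuming per-frame lists of track records; the buffer length handling and the function name are illustrative.

```python
from collections import deque

N_DELAY = 30                                # frames of look-ahead before deciding

history = deque(maxlen=N_DELAY + 1)         # FIFO of per-frame track records

def delayed_records(current_records):
    """Push the current frame's track records and return OTracked, the records
    from N_DELAY frames earlier, so an apparent disappearance can be verified
    against N_DELAY frames of later observations before an event is raised."""
    history.append(current_records)
    if len(history) <= N_DELAY:
        return None                         # not enough look-ahead yet
    return history[0]
```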
Loitering Detection
Loitering as defined in the example implementation involves one object. It is defined as a person wandering in the observed scene with a duration t > T_Loitering. The FSM is initialized for each new object. The FSM has one intermediate state, "Stay", which indicates that the tracked person is staying in the scene, whether moving around or stationary. There are two conditions for the transition from state "INIT" to state "Stay":
- The object is classified as human;
- The object moves in the scene (moving around or staying somewhere with frequent interior motion).
In state "Stay", a time counter t is continuously incremented as new frames are coming in. When t > TLoilering , the FSM transits from state "Stay" to state "Loiter" and a loitering event is triggered.
Unconscious Person Detection
As defined in the example implementation, this event also involves one object, a person. It is defined as an object becoming completely static with a duration t > T_Static. The FSM is initialized for each new object. When the tracked object is recognised as a person, the FSM transits to state "M", which indicates a person who is moving around or has significant interior motion. The second intermediate state of the FSM is "S", which indicates a person becoming and staying static, or completely motionless. There are two conditions for the transition from state "M" to state "S":
- The position of the person does not change compared to the previous frame;
- The interior motion of the person m < T_IntMotion.
In state "S", a time counter t is continuously incremented as new frames are coming in. When t > TSmic , the FSM transits from state "S" to state "UP", indicating that an unconscious person is detected. Examples of unconscious person include a sleeping or faint person. It will be appreciated that similar condictions can be used to detetc e.g. a vehicle staying overtime in a zone for short stopping, in which case the object of interest is changed to vehicle instead of person.
Unattended Object Detection
This event as defined in the example implementation involves two objects. The FSM is initialized for each new object. When a new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects. The FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. If the owner leaves the scene covered by the
camera, the FSM transits from state "Station" to state "UO" and the 'Unattended Object' is declared.
Theft Detection
This event as defined in the example implementation involves three objects. The FSM is initialized for each new object. Similar to the unattended object event, when a new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects. The FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. However, when the object disappears because another object has taken it while the owner still stays in the scene, the FSM transits from the state "Station" to the state "Theft" and a 'Theft' event is declared; meanwhile, the second person is identified as the potential thief.
The method and system of the example embodiment can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiment.
The computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.
The computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
The computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922. The computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.
The components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.
The application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilising a corresponding data storage medium drive of a data storage device 930. The application program is read and controlled in its execution by the processor 918. Intermediate storage of program data may be accomplished using RAM 920.
It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.
References
[1] D. Lowe. Distinctive image features from scale-invariant key-points. Int'l J. Computer Vision, 60(2):91-110, 2004.
[2] L. Li, W. Huang, I. Y. H. Gu, and Q. Tian. Statistical modeling of complex background for foreground object detection. IEEE Trans. Image Processing, 13(11):1459-1472, 2004.
[3] L. Li and M. K. H. Leung. Integrating intensity and texture differences for robust change detection. IEEE Trans. Image Processing, 11 (2): 105-112, 2002.
[4] C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):747-757, August 2000.
[5] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-Based Object Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564-577, 2003.
[6] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603- 619, 2002.
[7] Y. Cheng, "Mean Shift, Mode Seeking, and Clustering," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790-799, 1995.
[8] L. Li et al. IEEE Trans. Image Processing, vol. 13, no. 11, pp. 1459-1472, 2004.
[9] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):780-785, 1997.
Claims
1. A method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of: defining one or more contextual background representation types; segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
2. The method as claimed in claim 1, wherein a first learning rate for the pixels that are occluded is lower than a second learning rate for the pixels that are exposed.
3. The method as claimed in claim 2, further comprising the steps of: determining whether said respective pixels that are exposed are detected as a background point or as a foreground point in a current background subtraction for the current image; and setting different learning rates for the adaptive background subtraction for exposed pixels that are detected as foreground points and for exposed pixels that are detected as background points respectively.
4. The method as claimed in claim 3, wherein a third learning rate for the exposed pixels that are detected as foreground points is higher than the second learning rate for the exposed pixels that are detected as background points.
5. The method as claimed in claim 1, wherein one contextual background representation type A comprises a facility for the public such as a counter or a bench, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed comprises the steps of: evaluating, for each image region spatially corresponding to a type A background region, whether said each image region is occluded based on matching OHRs of the type A background region and of said each image region respectively and based on matching PCRs of the type A background region and of said each image region respectively; and determining all pixels of said each image region as either occluded or exposed depending on said evaluation.
6. The method as claimed in claim 5, wherein all pixels are determined as exposed if a match likelihood in said evaluation is above a threshold value, and are determined as occluded otherwise.
7. The method as claimed in claim 1, wherein one contextual background representation type B comprises a large homogeneous region such as a ground plane or a wall surface, and wherein the step of determining whether respective pixels in the image regions of the current image spatially corresponding to the background regions are occluded or exposed comprises the steps of: evaluating, for each image region spatially corresponding to a type B background region, whether neighborhood regions around respective pixels in said each image region are occluded based on matching PCRs of the type B background region and of the respective neighborhood regions; and determining pixels of said each image region as either occluded or exposed depending on the respective evaluations.
8. The method as claimed in claim 7, wherein each pixel is determined as occluded if a majority of neighborhood pixels in the neighborhood region of said each pixel are within said type B background region and less of the neighborhood pixels themselves are evaluated as exposed based on a match likelihood being above a threshold value, and is determined as exposed otherwise.
9. The method as claimed in claim 1, further comprising setting a zero learning rate for pixels belonging to foreground regions.
10. The method as claimed in claim 1, further comprising the step of performing adaptive background subtraction using said set rates for the respective pixels.
11. The method as claimed in claim 10, wherein the adaptive background subtraction is based on, in one example embodiment, Mixture of Gaussian or Principal Feature Representation.
12. The method as claimed in claim 1, further comprising maintaining a model base for the contextual background representation types, the model base including models for different illumination conditions.
13. The method as claimed in claim 12, further comprising adjusting an appearance, a spatial characteristic, or both, of the models in the model base over a long duration compared with a frame duration in the video signal.
14. A system for background updating for adaptive background subtraction in a video signal, the system comprising: means for defining one or more contextual background representation types; means for segmenting an image of a scene in the video signal into contextual background regions; means for classifying each contextual background region as belonging to one of the contextual background representation types; means for determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; means for receiving a current image of the scene in the video signal; means for determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and means for setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
15. A data storage medium having stored thereon computer code means for instructing a computer system to execute a method of background updating for adaptive background subtraction in a video signal, the method comprising the steps of: defining one or more contextual background representation types;
segmenting an image of a scene in the video signal into contextual background regions; classifying each contextual background region as belonging to one of the contextual background representation types; determining an orientation histogram representation (OHR), a principle colour representation (PCR), or both, of each background region; receiving a current image of the scene in the video signal; determining whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed; and setting different learning rates for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US80696406P | 2006-07-11 | 2006-07-11 | |
US60/806,964 | 2006-07-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008008045A1 true WO2008008045A1 (en) | 2008-01-17 |
Family
ID=38923513
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SG2007/000206 WO2008008046A1 (en) | 2006-07-11 | 2007-07-11 | Method and system for multi-object tracking |
PCT/SG2007/000205 WO2008008045A1 (en) | 2006-07-11 | 2007-07-11 | Method and system for context-controlled background updating |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SG2007/000206 WO2008008046A1 (en) | 2006-07-11 | 2007-07-11 | Method and system for multi-object tracking |
Country Status (2)
Country | Link |
---|---|
SG (1) | SG150527A1 (en) |
WO (2) | WO2008008046A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009133377A1 (en) * | 2008-05-01 | 2009-11-05 | Pips Technology Limited | A video camera system |
US8483481B2 (en) | 2010-07-27 | 2013-07-09 | International Business Machines Corporation | Foreground analysis based on tracking information |
WO2014038924A3 (en) * | 2012-09-06 | 2014-06-26 | Mimos Berhad | A method for producing a background model |
US8934670B2 (en) | 2008-03-25 | 2015-01-13 | International Business Machines Corporation | Real time processing of video frames for triggering an alert |
US20170330050A1 (en) * | 2016-05-16 | 2017-11-16 | Axis Ab | Method and apparatus for updating a background model used for background subtraction of an image |
CN107368784A (en) * | 2017-06-15 | 2017-11-21 | 西安理工大学 | A kind of novel background subtraction moving target detecting method based on wavelet blocks |
US20210368112A1 (en) * | 2019-05-09 | 2021-11-25 | Tencent Technology (Shenzhen) Company Limited | Method for implanting information into video, computer device and storage medium |
CN117953015A (en) * | 2024-03-26 | 2024-04-30 | 武汉工程大学 | Multi-pedestrian tracking method, system, device and medium based on video super-resolution |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8572740B2 (en) | 2009-10-01 | 2013-10-29 | Kaspersky Lab, Zao | Method and system for detection of previously unknown malware |
AU2013242830B2 (en) | 2013-10-10 | 2016-11-24 | Canon Kabushiki Kaisha | A method for improving tracking in crowded situations using rival compensation |
CN103729861B (en) * | 2014-01-03 | 2016-06-22 | 天津大学 | A kind of multi-object tracking method |
KR101631955B1 (en) | 2014-12-10 | 2016-06-20 | 삼성전자주식회사 | Target object tracking apparatus and method of operations thereof |
GB2550858A (en) | 2016-05-26 | 2017-12-06 | Nokia Technologies Oy | A method, an apparatus and a computer program product for video object segmentation |
US10360456B2 (en) * | 2016-08-12 | 2019-07-23 | Qualcomm Incorporated | Methods and systems of maintaining lost object trackers in video analytics |
US10304207B2 (en) * | 2017-07-07 | 2019-05-28 | Samsung Electronics Co., Ltd. | System and method for optical tracking |
CN108399411B (en) * | 2018-02-26 | 2019-07-05 | 北京三快在线科技有限公司 | A kind of multi-cam recognition methods and device |
CN109143222B (en) * | 2018-07-27 | 2023-04-25 | 中国科学院半导体研究所 | 3D Maneuvering Target Tracking Method Based on Divide and Conquer Sampling Particle Filter |
CN111179304B (en) * | 2018-11-09 | 2024-04-05 | 北京京东尚科信息技术有限公司 | Target association method, apparatus and computer readable storage medium |
CN113168503B (en) * | 2018-12-03 | 2024-11-08 | 瑞典爱立信有限公司 | Distributed computing for real-time object detection and tracking |
CN112395920B (en) | 2019-08-16 | 2024-03-19 | 富士通株式会社 | Radar-based attitude recognition device, method and electronic equipment |
CN110889864B (en) * | 2019-09-03 | 2023-04-18 | 河南理工大学 | Target tracking method based on double-layer depth feature perception |
CN112991382B (en) * | 2019-12-02 | 2024-04-09 | 中国科学院国家空间科学中心 | Heterogeneous visual target tracking system and method based on PYNQ framework |
CN111178218B (en) * | 2019-12-23 | 2023-07-04 | 北京中广上洋科技股份有限公司 | Multi-feature joint video tracking method and system based on face recognition |
CN111340846B (en) * | 2020-02-25 | 2023-02-17 | 重庆邮电大学 | Multi-feature fusion anti-occlusion target tracking method |
CN111726264B (en) * | 2020-06-18 | 2021-11-19 | 中国电子科技集团公司第三十六研究所 | Network protocol variation detection method, device, electronic equipment and storage medium |
CN113870311B (en) * | 2021-09-27 | 2024-11-08 | 安徽清新互联信息科技有限公司 | A single target tracking method based on deep learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040239762A1 (en) * | 2003-05-21 | 2004-12-02 | Porikli Fatih M. | Adaptive background image updating |
WO2005036456A2 (en) * | 2003-05-12 | 2005-04-21 | Princeton University | Method and apparatus for foreground segmentation of video sequences |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6542621B1 (en) * | 1998-08-31 | 2003-04-01 | Texas Instruments Incorporated | Method of dealing with occlusion when tracking multiple objects and people in video sequences |
US6879705B1 (en) * | 1999-07-14 | 2005-04-12 | Sarnoff Corporation | Method and apparatus for tracking multiple objects in a video sequence |
US6826292B1 (en) * | 2000-06-23 | 2004-11-30 | Sarnoff Corporation | Method and apparatus for tracking moving objects in a sequence of two-dimensional images using a dynamic layered representation |
IL141650A (en) * | 2001-02-26 | 2005-12-18 | Elop Electrooptics Ind Ltd | Method and system for tracking an object |
JP4444583B2 (en) * | 2003-05-21 | 2010-03-31 | 富士通株式会社 | Object detection apparatus and program |
-
2007
- 2007-07-11 WO PCT/SG2007/000206 patent/WO2008008046A1/en active Application Filing
- 2007-07-11 SG SG200901121-4A patent/SG150527A1/en unknown
- 2007-07-11 WO PCT/SG2007/000205 patent/WO2008008045A1/en active Application Filing
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005036456A2 (en) * | 2003-05-12 | 2005-04-21 | Princeton University | Method and apparatus for foreground segmentation of video sequences |
US20040239762A1 (en) * | 2003-05-21 | 2004-12-02 | Porikli Fatih M. | Adaptive background image updating |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9129402B2 (en) | 2008-03-25 | 2015-09-08 | International Business Machines Corporation | Real time processing of video frames |
US9424659B2 (en) | 2008-03-25 | 2016-08-23 | International Business Machines Corporation | Real time processing of video frames |
US9418444B2 (en) | 2008-03-25 | 2016-08-16 | International Business Machines Corporation | Real time processing of video frames |
US9418445B2 (en) | 2008-03-25 | 2016-08-16 | International Business Machines Corporation | Real time processing of video frames |
US8934670B2 (en) | 2008-03-25 | 2015-01-13 | International Business Machines Corporation | Real time processing of video frames for triggering an alert |
US9142033B2 (en) | 2008-03-25 | 2015-09-22 | International Business Machines Corporation | Real time processing of video frames |
US9123136B2 (en) | 2008-03-25 | 2015-09-01 | International Business Machines Corporation | Real time processing of video frames |
US8934013B2 (en) | 2008-05-01 | 2015-01-13 | 3M Innovative Properties Company | Video camera and event detection system |
WO2009133377A1 (en) * | 2008-05-01 | 2009-11-05 | Pips Technology Limited | A video camera system |
US8934714B2 (en) | 2010-07-27 | 2015-01-13 | International Business Machines Corporation | Foreground analysis based on tracking information |
US8483481B2 (en) | 2010-07-27 | 2013-07-09 | International Business Machines Corporation | Foreground analysis based on tracking information |
US9460361B2 (en) | 2010-07-27 | 2016-10-04 | International Business Machines Corporation | Foreground analysis based on tracking information |
WO2014038924A3 (en) * | 2012-09-06 | 2014-06-26 | Mimos Berhad | A method for producing a background model |
US10152645B2 (en) * | 2016-05-16 | 2018-12-11 | Axis Ab | Method and apparatus for updating a background model used for background subtraction of an image |
US20170330050A1 (en) * | 2016-05-16 | 2017-11-16 | Axis Ab | Method and apparatus for updating a background model used for background subtraction of an image |
CN107368784A (en) * | 2017-06-15 | 2017-11-21 | 西安理工大学 | A kind of novel background subtraction moving target detecting method based on wavelet blocks |
US20210368112A1 (en) * | 2019-05-09 | 2021-11-25 | Tencent Technology (Shenzhen) Company Limited | Method for implanting information into video, computer device and storage medium |
EP3968627A4 (en) * | 2019-05-09 | 2022-06-29 | Tencent Technology (Shenzhen) Company Limited | Method for implanting information into video, computer device and storage medium |
US11785174B2 (en) | 2019-05-09 | 2023-10-10 | Tencent Technology (Shenzhen) Company Limited | Method for implanting information into video, computer device and storage medium |
CN117953015A (en) * | 2024-03-26 | 2024-04-30 | 武汉工程大学 | Multi-pedestrian tracking method, system, device and medium based on video super-resolution |
CN117953015B (en) * | 2024-03-26 | 2024-07-09 | 武汉工程大学 | Multi-row person tracking method, system, equipment and medium based on video super-resolution |
Also Published As
Publication number | Publication date |
---|---|
WO2008008046A1 (en) | 2008-01-17 |
SG150527A1 (en) | 2009-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2008008045A1 (en) | Method and system for context-controlled background updating | |
Sheikh et al. | Bayesian modeling of dynamic scenes for object detection | |
US9230175B2 (en) | System and method for motion detection in a surveillance video | |
Hongeng et al. | Video-based event recognition: activity representation and probabilistic recognition methods | |
Godbehere et al. | Visual tracking of human visitors under variable-lighting conditions for a responsive audio art installation | |
Zhou et al. | Real time robust human detection and tracking system | |
Mittal et al. | Motion-based background subtraction using adaptive kernel density estimation | |
US9846810B2 (en) | Method, system and apparatus for tracking objects of a scene | |
Choudhury et al. | An evaluation of background subtraction for object detection vis-a-vis mitigating challenging scenarios | |
EP1836683B1 (en) | Method for tracking moving object in video acquired of scene with camera | |
Herrero-Jaraba et al. | Detected motion classification with a double-background and a neighborhood-based difference | |
Ekinci et al. | Silhouette based human motion detection and analysis for real-time automated video surveillance | |
Vu et al. | Audio-video event recognition system for public transport security | |
Cheng et al. | Segmentation of aerial surveillance video using a mixture of experts | |
Chen et al. | Vision-based traffic surveys in urban environments | |
Kim et al. | Unsupervised moving object segmentation and recognition using clustering and a neural network | |
Shahbaz et al. | Probabilistic foreground detector with camouflage detection for sterile zone monitoring | |
Thakoor et al. | Automatic video object shape extraction and its classification with camera in motion | |
Al Najjar et al. | A hybrid adaptive scheme based on selective Gaussian modeling for real-time object detection | |
Al Najjar et al. | Robust object tracking using correspondence voting for smart surveillance visual sensing nodes | |
Al Najjar et al. | Object detection | |
Ali | Feature-based tracking of multiple people for intelligent video surveillance. | |
Cuevas et al. | Tracking-based non-parametric background-foreground classification in a chromaticity-gradient space | |
Tavakkoli et al. | Background Learning with Support Vectors: Efficient Foreground Detection and Tracking for Automated Visual Surveillance | |
Huang et al. | Region-level motion-based foreground detection with shadow removal using MRFs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07794224 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
NENP | Non-entry into the national phase |
Ref country code: RU |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 07794224 Country of ref document: EP Kind code of ref document: A1 |