WO2010042068A1 - Method and system for object detection and tracking - Google Patents


Info

Publication number
WO2010042068A1
WO2010042068A1
Authority
WO
WIPO (PCT)
Application number
PCT/SG2008/000386
Other languages
French (fr)
Inventor
Liyuan Li
Kah Eng Jerry Hoe
Xinguo Yu
Ruijiang Luo
Original Assignee
Agency For Science, Technology And Research
Application filed by Agency For Science, Technology And Research filed Critical Agency For Science, Technology And Research
Priority to PCT/SG2008/000386 priority Critical patent/WO2010042068A1/en
Publication of WO2010042068A1 publication Critical patent/WO2010042068A1/en


Classifications

    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/277 Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06T7/285 Analysis of motion using a sequence of stereo image pairs
    • G06T2207/10016 Video; Image sequence
    • G06T2207/30196 Human being; Person
    • H04N2013/0081 Depth or disparity estimation from stereoscopic image signals

Abstract

A method and system for object detection and tracking in a stereo image sequence. The method comprises two or more steps of a group consisting of a) locating position of the object in a frame based on a stereo-based detection; b) locating position of the object in the frame based on an image-based detection; c) estimating position of the object in the frame based on a colour-based tracking; d) estimating position of the object in the frame based on motion history of the object in a preceding frame; further comprising e) detecting all possible candidate tracked objects in the frame from the two or more of steps a) to d); and f) calculating an expected new position of each tracked object in the frame based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.

Description

METHOD AND SYSTEM FOR OBJECT DETECTION AND

TRACKING

FIELD OF INVENTION

The invention relates broadly to a method and system for object detection and tracking, particularly human objects, and to a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of object detection and tracking.

BACKGROUND

A social service robot has to be able to interact with the human beings around it in a natural and friendly manner. One of the essential functionalities for such natural human-robot interaction (HRI) is the ability to perceive the human beings around it, that is, to detect the human objects nearby and track each individual in its view. With such ability, the service robot can identify and approach the person requiring service, interact with the person at a comfortable distance, and move together with the person in a natural environment. In the existing applications of service robots, such as delivering hospital meals, mowing lawns, or vacuuming floors, the HRI is traditionally minimal: people are more often treated as obstacles to be navigated around rather than as social beings with which to cooperate.

Many social service robot projects and commercial applications are emerging where the ability to interact with people is an important part of the service and entertainment. To achieve efficiency, most existing systems employ a one-dimensional (1D) laser range sensor and simple image models of humans (e.g. skin colour distribution, face detection, and motion detection) to find the person in front of the robot. The advantages of existing methods are efficiency and ease of implementation. However, the employed models, e.g. the face, skin colour, and head-shoulder contour, focus on a part of the human body, e.g. the head. Hence, they are not robust in real world situations. For example, the Adaptive Boosting (AdaBoost) face detector, which is available in the Open Source Computer Vision (OpenCV) library and employed in many systems, performs well in detecting the front view of faces of certain sizes, but is sensitive to head pose, view angle, scale variation, and lighting conditions. The skin colour models are also sensitive to lighting conditions and certain background objects, e.g. wooden objects. Furthermore, such models are not suitable for detecting multiple humans over a large range of distances around the robot and tracking them through complex motion modes and occlusions.

Since human bodies are non-rigid and articulated objects, other conventional methods of human detection and tracking are based on motion detection, especially background subtraction. Such methods are efficient and make no assumptions about human poses and view angles. However, they are not applicable to mobile robots, since in this case the background keeps changing while the robot is moving, and human objects must instead be detected from the visual features of a single image, which is a very difficult problem due to the variations of human appearances and cluttered backgrounds.

In other prior art approaches, high-performance detectors, e.g. the Histogram of Oriented Gradients (HOG) detector and the AdaBoost detector, have been developed based on advanced machine learning techniques. In HOG-based detection, the HOG from a detection window is employed as a feature vector to describe the local-global visual features of a human appearance. A Support Vector Machine (SVM) classifier is trained from manually collected samples for human detection. For the AdaBoost detector, Haar wavelet-like features are employed to represent the local-global features of a human appearance. The recognition engine is a cascade of weak classifiers obtained using AdaBoost learning from manually collected samples. To detect standing and walking persons of different sizes in an image, multi-scale detection windows have to be applied to scan the image. Hence, the computational cost is very high. In addition, while the false positive rate is low, these methods may still generate about 2 false positives per image at a detection rate between 80% and 90%, and may also generate multiple detections around a true human object.

With the availability of low-priced commercial stereo cameras, stereo-based human detection has also become an alternative approach, since the method can take advantage of depth information. Most existing methods are based on bottom-up segmentation of human bodies from the disparity image for human recognition. Since the disparity data is incomplete and inaccurate, such methods are not reliable for extracting humans at various distances to the camera or for segmenting human individuals in a group. In a different approach, a top-down method to detect human objects at various distances to the camera from a disparity image based on a scale-adaptive filtering technique has also been disclosed. The top-down model-driven method is robust to the incomplete and inaccurate disparity data, but the computational cost is still high.

For tracking, in existing approaches for visual human object tracking through an image sequence, three components are commonly included in the tracking process, i.e. target representation, motion prediction, and object matching. Investigated visual models include blobs of homogeneous intensities, feature points, contours, templates, colour histograms, and joint colour-spatial distributions of the object regions. The frequently used motion predictors are the Kalman filter, the particle filter, and the mean-shift tracker. The likelihood of observation is evaluated from the matching of the target model with the visual model at the new position in the current image frame. Considering the trade-off between accuracy and efficiency, colour-based mean-shift tracking algorithms have been widely used in various applications.

One sequential method for tracking multiple objects through occlusions for video surveillance has been disclosed in PCT/SG2007/000206, the contents of which are hereby incorporated. In the method, a Dominant Colour Histogram (DCH) is proposed as object model. Using DCH, the depth order of objects in a group is first estimated and the objects are then tracked in sequence from the closest to the farthest employing the DCH-based mean-shift and exclusion. In addition, a background subtraction is also applied to extract the foreground objects and filter out the cluttered background.

In existing approaches disclosed thus far, human object detection and tracking are treated separately due to the high computational cost, that is, a target object is first detected by a method (e.g. motion segmentation) and then tracked according to its colour features. Some methods have been proposed to integrate detection output in the tracking process in each frame. For example, in one method, the AdaBoost human detection and colour-histogram based tracking are integrated under a particle filter framework, and in another method, edgelet-based human detection and colour-based mean-shift tracking are integrated for human detection and tracking. However, in general, the visual tracker would often lose the target due to e.g. cluttered background, irregular motion, and complex occlusion. Furthermore, there are at least two major difficulties with such integration. Firstly, human detection methods require large computational resources. Secondly, the results of one method may not coincide with those of another due to e.g. false positives, missed detections, shifted position and inappropriate scales, etc.

A need therefore exists to provide a method and system for human object detection and tracking that seek to address at least one of the above problems.

SUMMARY

In accordance with a first aspect of the present invention there is provided a method for object detection and tracking in a stereo image sequence, the method comprising two or more steps of a group consisting of: a) locating position of the object in a frame based on a stereo-based detection; b) locating position of the object in the frame based on an image-based detection; c) estimating position of the object in the frame based on a colour-based tracking; d) estimating position of the object in the frame based on motion history of the object in a preceding frame; further comprising e) detecting all possible candidate tracked objects in the frame from the two or more of steps a) to d); and f) calculating an expected new position of each tracked object in the frame based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.

The candidate tracked objects may be weighted based on respective association values for each of the candidate tracked objects with a true new position of the tracked object in the MAP framework. The method may be applied to tracking of multiple objects, and each of the tracked objects is assigned a priority, and wherein a tracking order is from the highest priority to the lowest priority.

The priority may be calculated based on a depth distance of the object in the stereo frame.

The priority may be calculated further based on a proportion of the object that is occluded in the image.

The priority may be calculated further based on a probability of observing the object from a proportion that is visible in the image.

The priority may be calculated further based on maximum association values of the candidate objects.

Step a) may comprise locating peaks in disparity information that are substantially close to a dimension of the object.

Step a) may further comprise segmenting the peaks into regions based on the Maximum A Posterior (MAP) framework.

The segmenting step may comprise using IPL functions.

Step b) may comprise using Histogram of Oriented Gradient (HOG) descriptors for distinguishing an appearance of the object.

The appearance may be selected from a group consisting of an upper body of the object, a 2/3 body of the object and a full body of the object.

An algorithm for step b) may comprise starting at a smallest depth distance in the stereo frame; applying detection windows for locating possible objects; and repeating the above step at the next depth distance. Step b) may further comprise using disparity information obtained from step a) for finding a detection window having the highest probability of containing the object.

Step c) may comprise generating a Dominant Colour Histogram based on disparity information obtained from step a).

Step d) may comprise calculating a probability of the object being occluded by at least one different object having a smaller depth distance in the stereo frame.

The position of the object in the frame may be estimated when the probability of being occluded exceeds a selected value.

The method may further comprise updating positions of all objects in the frame after all objects have been tracked.

In accordance with a second aspect of the present invention there is provided a system for object detection and tracking in a stereo image sequence, the system comprising two or more of a group consisting of: a) means for locating position of the object in a frame based on a stereo-based detection; b) means for locating position of the object in the frame based on an image-based detection; c) means for estimating position of the object in the frame based on a colour-based tracking; d) means for estimating position of the object in the frame based on motion history of the object in a preceding frame; further comprising e) means for detecting all possible candidate tracked objects in the frame from the two or more of means a) to d); and f) means for calculating an expected new position of each tracked object in the frame based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.

In accordance with a third aspect of the present invention there is provided a computer storage medium having stored thereon computer code means for instructing a computer system to execute a method of object detection and tracking in a stereo image sequence, the method comprising two or more steps of a group consisting of: a) locating position of the object in a frame based on a stereo-based detection; b) locating position of the object in the frame based on an image-based detection; c) estimating position of the object in the frame based on a colour-based tracking; d) estimating position of the object in the frame based on motion history of the object in a preceding frame; further comprising e) detecting all possible candidate tracked objects in the frame from the two or more of steps a) to d); and f) calculating an expected new position of each tracked object in the frame based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a flow chart illustrating a method for object detection and tracking in a stereo image frame according to an example embodiment.

Figure 2 is a graph showing a relationship between K_1 and the disparity value.

Figures 3a-3c show comparisons of the stereo-based object detection results based on the approach as described in the example embodiment and a conventional approach.

Figures 4a-4c show illustrations of the half-body, 2/3 body and full body models of human appearance respectively according to an example embodiment.

Figure 5 shows an illustration of an object size in relation to the detection window at various depth distances according to an example embodiment.

Figures 6a-6b show bounding boxes and core ellipses of head and torso of a human object on colour and disparity images respectively according to an example embodiment.

Figures 7a-7c show plan views of two objects on a ground plane in example events of no occlusion, partial occlusion and full occlusion respectively according to an example embodiment.

Figures 8a-8f show a series of images illustrating separate results of the detection and estimation models and the final result according to the method of Figure 1 in an example embodiment.

Figure 9 shows a schematic diagram of a computer system for executing a method according to an example embodiment.

Figure 10 shows an implementation of a system for object detection and tracking according to an example embodiment.

Figure 11 is a flow chart illustrating a method for object detection and tracking in a stereo image frame according to an example embodiment.

DETAILED DESCRIPTION

The example embodiment provides a novel method of object detection and tracking in a stereo image frame captured by a stereo camera. As will be appreciated by a person skilled in the art, in the conventional approaches, outputs of object detection models and estimation models may not coincide with one another due to various factors such as false detection, missed detection, cluttered scenes, irregular motion of the object, and partial or full occlusion. Hence, in such cases, one-to-one assignment of a detected object to a tracked object may be inaccurate. In the example embodiment, a MultiModel Joint Association (MMJA) approach is applied to combine results of different detection and estimation models.

Figure 1 shows a flow chart 100 illustrating a method for object detection and tracking in a stereo image frame according to an example embodiment. At step 102, position of the object in the frame is located based on a stereo-based detection. At step 104, position of the object in the frame is located based on an image-based detection. At step 106, position of the object in the frame is estimated based on a colour-based tracking. At step 108, position of the object in the frame is estimated based on motion history of the object in preceding frames. At step 110, all possible candidate tracked objects in the frame from the two or more of steps 102 to 108 are detected. At step 112, an expected new position of the object in the frame is calculated based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.

In the description that follows, the method is first described with respect to obtaining results in respective steps 102 to 108 based on different detection and estimation models, before the application of the MMJA approach in steps 110 and 112. The example embodiment described relates to the detection and tracking of human objects, or "persons", which are substantially complex. However, it should be understood that the method can be applied to other types of objects, for example, vehicles.

Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "processing", "estimating", "calculating", "determining", "assigning", "generating", "computing", "locating", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.

The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.

In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.

Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.

The stereo-based object detection step 102 comprises the steps of scale-adaptive filtering, object segmentation and object verification based on stereo images from stereo cameras. Scale-adaptive filtering focuses the detection on significant peaks that are likely to correspond to objects of interest, such as humans, based on height, width and thickness information of human metrics. Object segmentation assigns disparity pixels to corresponding regions of the object. Object verification confirms the detection of the object by comparing edge evidence along the segmented region with a predetermined template.

In a stereo image, the theoretical relation between a depth distance z and a disparity value d is expressed as

z = bf/d = K_1/d   (1)

where b is the base-line distance and f is the focal length of the stereo cameras, and K_1 = bf.

In practice, K_1 decreases when d decreases, or z increases. One example of the K_1-d curve from a Stereo-on-a-Chip (STOC) stereo camera from VIDERE DESIGN is shown in Figure 2. In the example embodiment, a linear model is used for the K_1-d relationship such that

z = K_1(d)/d with K_1(d) = Kd + B   (2)

where the parameters K and B may be obtained for each camera from an offline calibration. Using model (2), the depth range of effective human detection can be extended significantly.
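As a minimal numerical illustration of model (2), the depth recovered from a disparity value is simply z = K + B/d; the sketch below uses made-up calibration constants, not values from any particular camera.

```python
def depth_from_disparity(d: float, K: float, B: float) -> float:
    """Linear model (2): z = K_1(d)/d with K_1(d) = K*d + B, i.e. z = K + B/d."""
    return K + B / d

# hypothetical calibration values, for illustration only
print(depth_from_disparity(32.0, K=0.02, B=90.0))
```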

Since humans stand on the ground surface, significant evidence of human objects can be observed from the projection of the disparity measures onto the ground plane. In the example embodiment, if d(x, y) is the disparity image of M x N pixels and L disparity levels, the projection of the stereo data onto the x-d plane is generated as h(x, d) = Σ_y δ(d(x, y), d), where δ(d(x, y), d) = 1 when d(x, y) = d and δ(d(x, y), d) = 0 otherwise. If the height of the stereo camera is H_E and H_T denotes the height of the top of a human object, then by triangle calculation a height limit y_t(d) for an object at a distance z = K_1(d)/d to the camera can be obtained as

y_t(d) = y_c - (f/z)(H_T - H_E) = y_c - K_2(d)·ΔH·d   (3)

where K_2(d) = f/K_1(d), ΔH = H_T - H_E, y_c is the vertical centre of the image (y_c ≈ N/2) and the direction of the y axis is from top to bottom.
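The x-d projection can be sketched directly from an integer-valued disparity image. The snippet below is a simple illustration; it assumes disparity levels are non-negative integers and that invalid pixels are marked with a negative value, neither of which is required by the method itself.

```python
import numpy as np

def xd_histogram(disparity: np.ndarray, num_levels: int) -> np.ndarray:
    """h(x, d) = sum over y of delta(d(x, y), d): for each image column x,
    a histogram of the integer disparity levels observed in that column.
    disparity is an (N, M) array; the result is an (M, num_levels) array."""
    n_rows, n_cols = disparity.shape
    h = np.zeros((n_cols, num_levels), dtype=np.int64)
    for x in range(n_cols):
        column = disparity[:, x]
        valid = (column >= 0) & (column < num_levels)
        h[x] = np.bincount(column[valid].astype(np.int64), minlength=num_levels)
    return h
```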

The disparity information for each object can be aggregated by scale-adaptive filtering

ĥ(x, d) = G_d * h(x, d)   (4)

where * denotes the convolution and G_d(·,·) is the adaptive two-dimensional (2D) kernel function. The subscript d indicates that the scale of the kernel is determined based on the depth information. Assuming that, at a given depth distance to the camera, the variations of the spatial and depth measures are independent, the scale-adaptive filter is chosen as

G_d(u, v) = exp(-(u/σ_x(d))² - (v/σ_d(d))²) = G_σx(d)(u)·G_σd(d)(v)   (5)

where σ_x(d) and σ_d(d) are the adaptive scale parameters of the filter for the spatial (x) and depth (d) measures. Substituting (5) into (4), the adaptively filtered image can be expressed as

ĥ(x, d) = G_σx(d) * G_σd(d) * h(x, d)   (6)

Equation (6) indicates that the 2D convolution in (4) can be decomposed into two cascaded one-dimensional (1D) convolutions. Considering the non-linear relation between z and d, a non-symmetric Gaussian-like kernel is designed as

G_σd(d)(v) = exp(-(v/σ_d+(d))²) for v ≥ 0, and exp(-(v/σ_d-(d))²) for v < 0   (7)

where σ_d+(d) = C_d(d_+ - d) and σ_d-(d) = C_d(d - d_-), with d_± = K_1(d)d/(K_1(d) ∓ D_b·d/2) and C_d = (-ln(0.5))^(-1/2), where D_b is an average of human body thickness, for example. In the example embodiment, the scale parameter for the adaptive spatial filter is σ_x(d) = 0.25·C_d·w_b(d) with w_b(d) = K_2(d)·W_b·d, where W_b is the average width of human bodies. In ĥ(x, d), human evidence has been enhanced as significant, smooth and separated peaks.
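A brute-force sketch of the separable, depth-dependent smoothing of equations (4) to (7) is given below. It assumes integer disparity bins, takes the scale functions σ_x, σ_d+ and σ_d- as plain callables, and makes no attempt at the efficiency of the actual embodiment.

```python
import numpy as np

def scale_adaptive_filter(h, sigma_x, sigma_dp, sigma_dm, radius=15):
    """Smooth the x-d histogram h with a separable kernel whose scales depend on
    the disparity level d of the output bin: sigma_x(d) along x, and the
    asymmetric sigma_dp(d)/sigma_dm(d) along the disparity axis."""
    n_d = h.shape[1]
    u = np.arange(-radius, radius + 1)
    # pass 1: smooth along x, one disparity level at a time
    g = np.empty_like(h, dtype=float)
    for d in range(n_d):
        gx = np.exp(-(u / max(sigma_x(d), 1e-6)) ** 2)
        g[:, d] = np.convolve(h[:, d].astype(float), gx, mode="same")
    # pass 2: smooth along d with the non-symmetric Gaussian-like kernel (7)
    out = np.zeros_like(g)
    for d in range(n_d):
        for v in u:
            if not 0 <= d + v < n_d:
                continue
            s = sigma_dp(d) if v >= 0 else sigma_dm(d)
            out[:, d] += g[:, d + v] * np.exp(-(v / max(s, 1e-6)) ** 2)
    return out
```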

In the segmentation step, a Maximum A Posterior (MAP) framework is applied in the example embodiment for robust segmentation from e.g. incomplete and uncertain depth measures. Potential objects such as humans in the view are located by searching significant peaks in the 2D histogram h(x, d) from the closest to the furthest depth distance relative to the stereo camera. A Gaussian model is used to characterize the distribution of the data cloud for each human object, wherein a Gaussian distribution N(m_i, Σ_i) can be obtained from the data h(x, d) within a bounding box R_i for the i-th object, where m_i is the mean vector and Σ_i is the covariance matrix. Further, in segmentation, a pixel d(x, y) in the disparity image with feature vector v = (x, y, d(x, y)) is assigned to a human region according to a MAP probability. Let c_i = (x_i, d_i) be the position of the i-th significant peak; then the cloud of disparity data for the i-th person will be within an envelope or bounding box R_i = [x_-, d_-, x_+, d_+] in the x-d plane (i.e. the maximum extents of the object in 2D), where x_± = x_i ± w_b(d_i)/2 and d_± = K_1(d_i)d_i/(K_1(d_i) ∓ d_i·D_b/2). The point assignment based on the MAP probability can be expressed as

(x, y, d(x, y)) → l_v = argmax_i P_i(v), provided P_i(v) > T_1   (8)

where

P_i(v) = exp(-(v - m_i)Σ_i^(-1)(v - m_i)^T)   (9)

Furthermore, in the example embodiment, when a significant peak in h(x, d) is located, the core part of the human object in the 2D histogram h(x, d) is suppressed. An example algorithm for suppression is as follows: for a point v = (x, d) in the x-d plane, if P_i(v) > 2T_1, set h(x, d) = 0, where T_1 is the threshold for the probability of the pixel belonging to the human body.
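The MAP assignment of equations (8) and (9) can be sketched as follows. This is a simplified illustration: the peaks are assumed to be supplied as mean/covariance pairs over (x, y, d) features, and pixels with zero disparity are treated as invalid; both are assumptions of the sketch, not of the patent.

```python
import numpy as np

def assign_pixels(disparity, peaks, threshold):
    """Label each disparity pixel with the peak (Gaussian N(m_i, Sigma_i)) under
    which its feature v = (x, y, d) is most probable, equation (8), provided
    that probability exceeds T_1; otherwise the pixel stays unlabelled (-1).
    peaks: list of (mean, covariance) pairs, mean length-3, covariance 3x3."""
    labels = -np.ones(disparity.shape, dtype=int)
    ys, xs = np.nonzero(disparity > 0)
    v = np.stack([xs, ys, disparity[ys, xs]], axis=1).astype(float)
    best_p = np.zeros(len(v))
    for i, (m, cov) in enumerate(peaks):
        diff = v - np.asarray(m, dtype=float)
        p = np.exp(-np.einsum("nj,jk,nk->n", diff, np.linalg.inv(cov), diff))  # equation (9)
        better = (p > best_p) & (p > threshold)
        labels[ys[better], xs[better]] = i
        best_p = np.maximum(best_p, p)
    return labels
```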

A template of a contour distinctive to the object of interest may be used for object verification. For example, a deformable head-shoulder template is used in the example embodiment to distinguish humans from non-human objects. The head-shoulder template may be generated manually using actual front-view human images from a database and may comprise a number of control points, e.g. 12 control points c_i = (x_i, y_i) where i = 1, ..., 12.

A belt along the boundary of the human region segmented from the disparity image is generated. Edges in the colour image within this belt are extracted as E(x, y), which represent the potential coincident edges along the body boundary. Before matching the template to E(x, y), the control points are adjusted to the corresponding positions along the upper part of the region segmented from the disparity image. The deformed template of scale γ for the k-th region is represented by a point sequence T_γ^k = {x_i^γ, y_i^γ} with n_γ^k points, where two example scales γ = 1 and 1.2 are used to deal with the scale variation of persons. For T_γ^k, the support from the evidence is computed from the vertical position y_s = y_t - 0.2·w_b(d_k) to y_e = y_t + 0.2·w_b(d_k). The likelihood P_HS^k of the upper part of the k-th detected human region being a head-shoulder contour is evaluated from the edge support E(x, y) accumulated along the deformed template points between y_s and y_e (equation (10)), where x_c is the body centre in the x dimension. If P_HS^k > T_HS (e.g. 0.5), the object is identified as a human object.

Figures 3a-3c show comparisons of the stereo-based object detection results based on the approach as described above and a conventional approach. Pictures 302, 312 and 322 show colour images of example settings and 304, 314 and 324 are the corresponding disparity images. As shown in 306, 316 and 326, the conventional approach either fails to detect the object of interest (i.e. one or more humans in the example settings) or produces multiple detections around a single object, as shown in 316. On the other hand, the stereo-based detection according to the example embodiment accurately detects the objects as shown in 308, 318 and 328.

In step 104 (Figure 1 ), Histogram of Oriented Gradients (HOG) descriptors are used to detect object shapes such as human figures in cluttered backgrounds or under various illuminations. A HOG descriptor combines the local intensity gradients or edge directions to characterize the visual structure features. A Support Vector Machine (SVM) classifier is usually trained based on the HOG descriptors for human detection.

Additionally, the object shapes as captured by the camera may correspond partially or completely to the full shapes, depending on the depth distance of the object from the camera. For example, when a human object is very close to the camera, only the head and upper torso are in the view of the camera, but when a person is far away from the camera, the full body is in the image. In the example embodiment, the image-based detection is based on three models of human appearances in the image, i.e. the upper body, 2/3 body, and full body models as shown in Figures 4a-4c respectively.

The upper body model shown in Figure 4a is used to detect human objects which are very close to the camera. In this case, only the head and upper torso are in the view of the camera. A detection window of 6x6 cells is assigned to cover the human figure and sufficient margin on the three sides except the lower side. The 2/3 body (Figure 4b) model is used to detect persons at a moderate distance to the camera, and the corresponding detection window comprises 9x6 cells which cover sufficient margin on the three sides except the bottom. In this case, the lower legs are out of the camera view. Similarly, for the full body model (Figure 4c) for persons far away from the camera, a 12x6 cell detection window, which covers sufficient margin around the four sides, is used. The cell size or the scale of the detection window is determined according to the scale of the expected person to be detected.

The HOG descriptor is generated from distributions of edge directions in each cell. In a detection window, each group of 2 x 2 cells forms a block. The blocks overlap each other in a sliding fashion, so a detection window of K x L cells contains (K - 1) x (L - 1) blocks. The local texture feature in each cell is represented by a Histogram of Oriented Gradients (HOG) where the orientations of gradients from -90° to +90° are quantised into 9 bins. Thus, each block contains a combined 36-bin histogram from all of its cells. The histogram of each block is normalised in an L2-norm or L2-Hys scheme. The histograms of the blocks in a detection window are joined to form the HOG descriptor of (K - 1) x (L - 1) x 36 bins for the window. Multi-scale detection can be performed by varying the size of the cells for the detection window.
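A quick check of the descriptor sizes implied by the three appearance models; the cell counts are taken from the description above.

```python
def hog_descriptor_length(k_cells: int, l_cells: int, bins: int = 9) -> int:
    """(K-1) x (L-1) overlapping 2x2-cell blocks, each a 4 x 9 = 36-bin histogram."""
    return (k_cells - 1) * (l_cells - 1) * 4 * bins

# upper body 6x6 cells, 2/3 body 9x6 cells, full body 12x6 cells
for name, (k, l) in [("upper body", (6, 6)), ("2/3 body", (9, 6)), ("full body", (12, 6))]:
    print(name, hog_descriptor_length(k, l))   # 900, 1440, 1980
```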

Furthermore, in the example embodiment, an algorithm is applied to significantly reduce the number of detection windows necessary to detect all potential human objects, thereby reducing the amount of computation required. The detection is carried out from the smallest depth distance to the furthest. Let Z_min be the closest distance at which a person could be detected in the image. From the linear model (2), the disparity value for Z_min is d_m = B/(Z_min - K). The body width in the image can be computed as w_b(d_m) = K_2(d_m)·W_b·d_m. Setting k = 0, d_k = d_m, and w_b^k = w_b(d_k), the HOG-based human detection is performed iteratively as follows:

(a) Detect potential persons at the same depth distance from the camera. From w_b^k, the width of the detection window can be set, for example, as w_d = 1.8·w_b^k and the cell size as w_c = w_d/6 correspondingly, as all 3 models have 6 cells horizontally. The top position for a person of average height can be obtained as y_t from (3). One HOG model is selected according to the height of the visible human body in the image, i.e. |y_b - y_t| where y_b is the bottom of the image. Human detection is performed from the horizontal centre position x_c = M/2 towards both sides of the image. The horizontal shift of the detection window is set e.g. as one third of the body width (w_b^k/3) so as to be dense enough to detect all the possible humans in the image. To adapt to the variation of human heights, at each x position, the top of the detection window is placed at three positions, i.e. y_t - w_c, y_t, and y_t + w_c. The maximum response from the three detections is marked as the detection position.

(b) Move to the next depth distance. Setting w_b^(k+1) = γ·w_b^k with γ < 1 (for example, γ = 0.7 in the example embodiment), humans close to the k-th depth distance can be detected, as the body width in the image decreases with increasing depth distance. From the linear model (2), the depth measure for a person of width w_b^(k+1) can be obtained as d_(k+1) = (B·w_b^(k+1))/(f·W_b - K·w_b^(k+1)). If d_(k+1) > d_min, set k = k + 1 and go to step (a).
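The depth-layer schedule of steps (a) and (b) can be sketched numerically, as shown below. The snippet only reproduces the counting argument (roughly 9·W_I/w_b windows per layer); the rounding is chosen arbitrarily, so its totals will only approximate the figures quoted in the next paragraph.

```python
def detection_schedule(image_width: int, w_close: float, w_far: float, gamma: float = 0.7):
    """Per depth layer: the expected body width (pixels) and an approximate
    window count, shrinking the width by gamma until it falls below w_far."""
    widths, counts = [], []
    w = float(w_close)
    while w >= w_far:
        widths.append(w)
        # 3 vertical placements x (image_width / (w/3)) horizontal placements
        counts.append(round(9 * image_width / w))
        w *= gamma
    return widths, counts

widths, counts = detection_schedule(320, 180.0, 24.0)
print(len(widths), sum(counts))   # number of layers, total number of windows
```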

As shown in Figure 5, with the above algorithm human objects are detected in layers of increasing depth distance to the camera. It generates sparse detection windows, and the closer the depth distance of the potential persons, the fewer the detection windows required, since the pictures of close persons are large in the image. For example, the size of a detected person 502 may be about one third of image 510, but in images 520 and 530 it is substantially smaller. In the k-th step of the iteration, which detects persons close to the k-th depth layer, the number of detection windows can be computed as n_w^k = (9·W_I)/w_b^k, where W_I is the width of the image and w_b^k = γ^k·w_b(d_m), since W_I/(w_b^k/3) horizontal positions and 3 vertical positions at each horizontal position are used to adapt to the variation in human heights. Let w_b^0 be the width of the persons around the closest depth layer and w_b^K be the width of the persons around the furthest depth layer; the number of iterations can be calculated as K = 1 + log(w_b^K/w_b^0)/log γ and the number of total detection windows is n_w = Σ_(k=1)^(K) n_w^k.

In the example embodiment, with w_b^0 = 180 pixels, w_b^K = 24 pixels and an image size of 320x240 pixels, the maximum number of detection windows used is 287, as compared with over 200,000 when using conventional approaches. The number of detection windows can be further reduced; for example, by setting a furthest distance for reliable detection of 6 m, about 119 to 188 detection windows are sufficient to detect all possible human objects. Furthermore, the efficiency can also be improved by exploiting the disparity image. For a detection window at the k-th depth distance, if there is no disparity evidence between d_(k-1) and d_(k+1) within the detection window, there is no need to apply HOG-based human detection on the window.

To further speed up the HOG-based human object detection, suitable scales may be applied. For example, if w_b^k is larger than 88 pixels, the HOG-based human detection is performed on an image at 1/4 the size of the original image; if w_b^k is larger than 44 pixels, it is performed on a 1/2 size image; otherwise, it is performed on the original image. In this way, the computation for generating the HOG feature vectors for close persons is reduced substantially.

Additionally, in the example embodiment, the result of step 104 may be improved by using the disparity information obtained from step 102 to find the best detection window associated with a detected object, thereby minimising or eliminating multiple detections due to overlapping detection windows. For example, suppose an object is detected in a detection window W of the k-th depth distance d_k. The average depth measure of the object may be calculated as

d_h = (1/N_p)·Σ_(s ∈ W_p) d(s)   (11)

where s = (x, y) is an image point, W_p is the core window of W with the cells along the margin removed, and N_p is the number of pixels within the core window W_p whose disparity values are within (d_(k+1), d_(k-1)). The probability of the window W being associated with the object can be defined as

P_(w,k) = 1 - (d_h - d_k)/(d_(k-1) - d_k)   if d_h ≥ d_k
P_(w,k) = 1 - (d_k - d_h)/(d_k - d_(k+1))   if d_h < d_k   (12)

whereby the closer the value of d_h to d_k, the higher the probability of the window being associated with the object. When the object is detected in multiple detection windows, only the detection window with the highest probability is retained as the result and the others are suppressed.
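Equation (12) and the suppression of overlapping windows can be sketched as below; grouping the detections per candidate object is an assumption made for the illustration.

```python
def window_probability(d_h: float, d_k: float, d_prev: float, d_next: float) -> float:
    """Equation (12): association of a window at depth layer k (disparity d_k) with an
    object of average in-window disparity d_h; d_prev and d_next are the disparities
    of the closer (k-1) and farther (k+1) layers."""
    if d_h >= d_k:
        return 1.0 - (d_h - d_k) / (d_prev - d_k)
    return 1.0 - (d_k - d_h) / (d_k - d_next)

def keep_best_windows(detections):
    """detections: iterable of (candidate_id, window, probability); keep, for each
    candidate object, only the window with the highest probability."""
    best = {}
    for cand, window, p in detections:
        if cand not in best or p > best[cand][1]:
            best[cand] = (window, p)
    return best
```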

In step 106 (Figure 1), the object is tracked based on a colour-based tracking model disclosed in PCT/SG2007/000206, in which the object in the foreground region is extracted using background subtraction for generating a Dominant Colour Histogram (DCH), which is a list of a few most significant colours and their weights for a region. In the example embodiment, however, the DCH is generated using the disparity segmentation obtained in step 102, because the background may be changing. Figures 6a and 6b show bounding boxes and core ellipses of the head and torso of a human object H on the colour and disparity images respectively. From the disparity information of a human object H detected by the stereo-based detection in a previous frame, the bounding boxes of the head B_h and torso B_t, and the depth range D = [d_-, d_+], may be determined. A DCH of the torso or head is generated from two types of pixels, i.e. those within the bounding box and depth range, or those within a core ellipse 602, 604 even though the disparity measures are unavailable. The core ellipses 602, 604 may be determined from average human metrics in a prior step. By applying the DCH-based mean-shift tracking method disclosed in PCT/SG2007/000206 on the DCH generated from the disparity segmentation, the new position of the human object in the current frame is determined.

In step 108 (Figure 1), the position of the object in the current frame is estimated based on the motion history in the preceding frame, particularly in instances where the person is occluded. Let the X-Z plane be the ground plane, where the X-direction is aligned with the image plane and the Z-direction is aligned with the optical axis of the camera. The probability of being occluded can be estimated according to the object's position on the ground plane. Figures 7a-7c show possible events for two human objects H_A and H_B. In Figure 7a, human objects H_A and H_B are completely visible to a camera 730. In Figure 7b, human object H_B is partially occluded by human object H_A who is nearer to the camera 730 (i.e. Z_A < Z_B), while in Figure 7c, human object H_B is completely occluded by human object H_A. Further, let the lines between the camera and a person's left side and right side be l_l and l_r respectively, and the angles of l_l and l_r to the X-axis be α_l and α_r respectively. The probability P_o(B|A) of H_B being occluded by H_A is then computed, in equation (20), from the overlap of the angular extent of H_B (between its α_r and α_l, as seen from the camera) with the angular extent of H_A, normalised by the angular extent of H_B, where P_o = 1 means complete occlusion, 0 < P_o < 1 means partial occlusion and P_o = 0 means no occlusion. Further, if a person is occluded by more than one person, the persons occluding him are classified into two groups, i.e. on the left and right sides. The maximum probabilities for both groups are selected and the final probability value is the sum of the two maximum values.
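A sketch of this plan-view occlusion test is given below. The angular-overlap form is an illustration consistent with the description above rather than the verbatim equation (20), and the half body width is an arbitrary value.

```python
import math

def angular_extent(x, z, half_width=0.25):
    """Angular interval (w.r.t. the X axis) subtended at the camera, at the origin,
    by a person of the given half body width standing at (x, z) on the ground plane."""
    a1 = math.atan2(z, x - half_width)
    a2 = math.atan2(z, x + half_width)
    return min(a1, a2), max(a1, a2)

def occlusion_probability(pos_a, pos_b, half_width=0.25):
    """Fraction of H_B's angular extent covered by the closer person H_A:
    0 = no occlusion, between 0 and 1 = partial, 1 = complete occlusion."""
    a_lo, a_hi = angular_extent(*pos_a, half_width)
    b_lo, b_hi = angular_extent(*pos_b, half_width)
    overlap = max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    return min(1.0, overlap / (b_hi - b_lo))

# H_A at (0.2, 2.0) m partially occludes H_B at (0.9, 3.5) m
print(occlusion_probability((0.2, 2.0), (0.9, 3.5)))
```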

In the example embodiment, from the value of P_o calculated using Equation (20), if the person is completely visible (i.e. no occlusion) or only slightly occluded, the position in the current time step is recorded and the velocity vector on the X-Z plane is updated such that, if S_t = (X_t, Z_t) is the position vector of the person in the current time step, the speed vector V^t = (V_X^t, V_Z^t) is updated as

V^t = (1 - β)·V^(t-1) + β·(S_t - S_(t-1))   (21)

where β is a smoothing factor and β = 0.25 in the example embodiment.

If a substantial part of the person is occluded, e.g. P_o > 0.33, reliable detection and segmentation from the visual features in the view might not be available, and the possible position of the person in the scene is predicted, e.g. from the simple linear model. Let t_0 be the time just before substantial occlusion; the current position of the person is estimated as

S_t = S_(t0) + (t - t_0)·V^(t0)   (22)
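A minimal sketch of equations (21) and (22) on the ground plane, with positions and velocities as (X, Z) tuples; the numbers are illustrative only.

```python
def update_velocity(v_prev, s_curr, s_prev, beta=0.25):
    """Equation (21): exponentially smoothed ground-plane velocity."""
    return tuple((1 - beta) * v + beta * (c - p)
                 for v, c, p in zip(v_prev, s_curr, s_prev))

def predict_position(s_t0, v_t0, steps_since_t0):
    """Equation (22): linear prediction used while the person is substantially
    occluded (P_o > 0.33), from the last reliably observed state at time t0."""
    return tuple(s + steps_since_t0 * v for s, v in zip(s_t0, v_t0))

v = update_velocity((0.0, 0.0), (1.2, 3.0), (1.1, 3.1))
print(predict_position((1.2, 3.0), v, 4))
```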

The steps 102 to 108 (Figure 1) as described above provide separate results based on the respective detection, tracking and estimation models. In the MMJA approach, a MAP integration is applied based on the results of steps 102 to 108. For the MAP integration, in the example embodiment, the results of steps 102 to 108 are standardised such that the results of each step comprise respectively an average disparity value d, a range of disparities d_±, the centre positions of the head s_h and torso s_t, and the bounding boxes of the head B_h, torso B_t and body B_b as inputs to the MMJA model, as shown in Table 1.

Table 1

(Table 1 lists, for each of the models in steps 102 to 108, the standardised parameters d, d_±, s_h, s_t, B_h, B_t and B_b.)

A MAP framework is established to combine the results. In the example embodiment, R human candidates (x_1, ..., x_R) are available from the human detection and estimation measurements in steps 102 to 108. R may not be four times the number of true human objects in the view, due to possible false detections, missed detections, and multiple detections from the human detection in steps 102 and 104. The posterior probability of the k-th person's position being at x_k* in the image can be expressed as P(x_k* | x_1, ..., x_R). Using Bayesian law,

P(x_k* | x_1, ..., x_R) ∝ P(x_1, ..., x_R | x_k*)·P(x_k*)   (23)

since P(x_1, ..., x_R) can be assumed to be a constant. Because the human candidates (x_1, ..., x_R) are generated by different models from different measurements, it can be assumed that they are conditionally statistically independent. The joint likelihood probability can be written as

P(x_1, ..., x_R | x_k*) = Π_(i=1)^(R) P(x_i | x_k*)   (24)

If x_k* is the true position of the k-th person in the image, the posterior probability reaches a maximum value, i.e. ∂P(x_k* | x_1, ..., x_R)/∂x_k* = 0. Combining (23) and (24):

∂P(x_k* | x_1, ..., x_R)/∂x_k* ∝ ∂[P(x_k*)·Π_(i=1)^(R) P(x_i | x_k*)]/∂x_k*   (25)

It will be appreciated that, when x_k* is the true position of the k-th person, both P(x_k*) and P(x_i | x_k*) are at their maximum points. From ∂P(x_k* | x_1, ..., x_R)/∂x_k* = 0, it can be derived that

Σ_(i=1)^(R) [1/P(x_i | x_k*)]·∂P(x_i | x_k*)/∂x_k* = 0   (26)

In the example embodiment, P(x_i | x_k*) follows a Gibbs distribution such that

P(x_i | x_k*) = (1/Z_ik)·exp(-β_ik·E_ik(x_i))   (27)

where Z_ik is a constant for normalisation, β_ik indicates the association of detection x_i to the k-th tracked person, and E_ik(x_i) = (x_i - x_k*)² is the energy term. It can also be shown that

∂P(x_i | x_k*)/∂x_k* = 2·β_ik·(x_i - x_k*)·P(x_i | x_k*)   (28)

By substituting equation (28) into equation (26), the value of x_k* can be obtained under the Maximum A Posterior (MAP) framework such that the expected true position is the weighted average of the detected and estimated positions according to their associations to the tracked person, i.e.

x_k* = (Σ_(i=1)^(R) β_ik·x_i) / (Σ_(i=1)^(R) β_ik)   (29)

The association value β_ik may be between 0 and 1, where a value close to 1 indicates a strong association while a value close to 0 indicates a weak association.
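Equation (29) in code form; the candidate positions and association values below are invented numbers used purely for illustration.

```python
import numpy as np

def expected_position(candidates, associations):
    """Equation (29): the expected true position of a tracked person is the
    association-weighted average of the candidate positions produced by the
    detection, tracking and motion-estimation models."""
    x = np.asarray(candidates, dtype=float)       # shape (R, dim)
    beta = np.asarray(associations, dtype=float)  # shape (R,), each in [0, 1]
    return (beta[:, None] * x).sum(axis=0) / beta.sum()

# four candidate head centres (pixels) for one person and their associations
print(expected_position([[120, 64], [126, 66], [118, 70], [122, 65]],
                        [0.9, 0.3, 0.6, 0.8]))
```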

Further, a sequential tracking scheme is applied in the example embodiment.

An order for tracking multiple objects in the view is determined according to the respective depth distances and visible clues from the image. A measure for the priority of a person in the tracking order at a relevant time step is defined as

O_k = O_z^k + (1 - P_o^k)·P_c^k + A_s^k + A_I^k   (30)

where O_z^k = 1 - Z_k/Z_max defines the priority on the depth distance with respect to the maximum distance (e.g. Z_max = 6 m in the example embodiment), P_o^k (from equation (20)) is the proportion of the occluded body part, P_c^k (from equation (19)) is the probability of observing the person from his visible part, and A_s^k and A_I^k are the maximum association values to the detected persons from the stereo-based and HOG-based human detection steps respectively. The values in equation (30) are obtained from the tracking results in the previous time step, such that a person who is at a close distance to the camera, has a smaller occluded part and has high probabilities of being detected in previous images has a high chance of being tracked reliably from visual information.

With the priority values O_k, k = 1, ..., N_T, for the tracked persons, the sequential tracking is performed individually from the person with the maximum priority value to the person with the minimum priority value. In each iteration, three operations are performed: colour-based tracking, application of the MMJA model, and exclusion.
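A sketch of the priority ordering of equation (30); the depth term is written as reconstructed above, and the field names and sample values are illustrative only.

```python
def priority(z, z_max, p_occluded, p_visible, assoc_stereo, assoc_hog):
    """Equation (30): tracking priority of a person, from the previous time step;
    closer, less occluded, better supported persons are tracked first."""
    return (1 - z / z_max) + (1 - p_occluded) * p_visible + assoc_stereo + assoc_hog

persons = [{"id": 1, "z": 2.0, "po": 0.0, "pc": 0.9, "as": 0.8, "ai": 0.7},
           {"id": 2, "z": 4.5, "po": 0.4, "pc": 0.6, "as": 0.5, "ai": 0.0}]
order = sorted(persons, reverse=True,
               key=lambda p: priority(p["z"], 6.0, p["po"], p["pc"], p["as"], p["ai"]))
print([p["id"] for p in order])   # tracking order, highest priority first
```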

For the k-th tracked person, the colour-based mean-shift tracking is first performed as described above and the new position is represented by the parameters shown in Table 1. The association value β_ik can be estimated from a correlation between a detected position and the position from the colour-based tracking, such that the association between two positions is estimated from their overlaps in spatial position and in the depth dimension.

The spatial correlation of the i-th person detected by the stereo-based human detection with the k-th tracked person can be defined as

a_ik^s = α_s·(w_h·P_h^o + w_t·P_t^o) + α_d·P_d^o   (31)

where α_s and α_d are weights for the 2D spatial and depth matches (α_s + α_d = 1), w_h and w_t are the weights for the matches of the head and torso parts (w_h = |B_h|/(|B_h| + |B_t|) and w_t = |B_t|/(|B_h| + |B_t|), with |B| being the area of the box), P_h^o and P_t^o are the overlapping rates of the head and torso bounding boxes between the detected and tracked persons, and P_d^o is the overlapping rate of the disparity ranges. The overlapping rates are defined as the ratio of the intersection to the union of the corresponding boxes or depth ranges, e.g.

P_d^o = |D^i ∩ D^k| / |D^i ∪ D^k|   (32)

where ∩ represents intersection and ∪ represents union, and D^i = [d_-^i, d_+^i] and D^k = [d_-^k, d_+^k] are the depth ranges of the detected and tracked persons. To avoid the partially and fully occluded person being locked onto the detected occluding person, the association of the detected person with the tracked person is finally defined as

β_ik^s = (1 - P_o^k)(1 - P_apt^i)·a_ik^s·P_H^ik   (33)

where P_o^k is from equation (20) and represents the proportion of occluded body, P_apt^i is the accumulated association of the detected person with previously tracked persons who may occlude the k-th tracked person, and P_H^ik is the probability of the heights of the detected and tracked persons being close.

The spatial correlation of the j-th person detected by the image-based detection with the k-th tracked person is defined as

a_jk^I = α_s·P_b^o + α_d·P_d^o   (34)

where P_b^o is the overlapping rate of the body bounding boxes of the detected and tracked persons, and the association value is defined analogously to equation (33) as

β_jk^I = (1 - P_o^k)(1 - P_apt^j)·a_jk^I·P_H^jk   (35)

In the example embodiment, if the k-th tracked person is occluded in the previous frame, the output of the motion prediction is a good estimate of the new position, since fewer visual clues are available in recent frames. The association value of the position determined by motion estimation with the true new position can be defined as

β_m^k = P_o^k   (35a)

On the other hand, if strong visual evidence can be observed from the position located by the colour-based tracking, the result of the colour-based tracking is a good estimate of the new position. The association value of the tracked position with the true new position is defined as

β_c^k = (1 - P_o^k)·P_c^k   (35b)

where P_c^k can be obtained from equation (19).

The MAP framework can be formulated based on Equations (33), (35), (35a) and (35b) such that the expected true position can be calculated as

x_k* = (Σ_(i=1)^(N_s) β_ik^s·x_i^s + Σ_(j=1)^(N_I) β_jk^I·x_j^I + β_m^k·x_m^k + β_c^k·x_c^k) / (Σ_(i=1)^(N_s) β_ik^s + Σ_(j=1)^(N_I) β_jk^I + β_m^k + β_c^k)   (36)

where N_s and N_I are the number of stereo-based and image-based detections of persons in the current image frame, respectively.

Equation (36) is used to calculate the positions of the parameters in Table 1, e.g. the head centre (where x_k = s_h), the torso centre (x_k = s_t), the average depth position (x_k = d) and the depth range (x_k = d_±), as well as the 2D sizes of the head and torso. The centres of the head and torso may be adjusted if they are separated too far from each other according to average human metrics.

Additionally, in the example embodiment, an exclusion step is carried out for removing the visual features of the k-th tracked human object from its new position in the colour image, for preventing the next persons from being trapped in the position of the k-th tracked person when an occlusion event happens. The weight image ω(s) is updated such that, if the colour of a pixel within the box of the new position has a high probability of being from the tracked person, the weight of the pixel is suppressed. Let T_DCH^B be the DCH from the region within the new bounding box B of the head or torso, and T_DCH^k be the DCH of the head or torso of the k-th tracked person. For a pixel s ∈ B with ω(s) > 0 and c_s = I(s), the significance of c_s in both T_DCH^B and T_DCH^k, i.e. the weights e_B(c_s) and e_k(c_s) of the dominant colours matching c_s, can be obtained (37). In the example embodiment, the probability of s belonging to the k-th tracked object is defined as Δω(s) = min(1, e_k(c_s)/e_B(c_s)). The algorithm for updating the weight image is ω(s) = max{0, ω(s) - Δω(s)}. In addition, once all the objects in the tracking order have been tracked, their positions in 3D space are updated. For example, the measurements on the 2D image and the disparity information are transformed into positions on the ground plane. The spatial relations of the objects in 3D space and the probabilities of being occluded are also updated.
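A sketch of the exclusion update follows; representing a DCH as a mapping from a quantised colour to its weight, and the dictionary-based weight image, are assumptions of the sketch rather than the patent's data structures.

```python
def exclusion_update(weights, pixels, colour_of, dch_box, dch_person):
    """Suppress the weight of pixels inside the newly tracked box whose colour is
    well explained by the tracked person's Dominant Colour Histogram."""
    for s in pixels:
        if weights.get(s, 0.0) <= 0.0:
            continue
        c = colour_of(s)
        e_box = dch_box.get(c, 0.0)        # significance of c in the box DCH
        e_person = dch_person.get(c, 0.0)  # significance of c in the person's DCH
        if e_box > 0.0:
            delta = min(1.0, e_person / e_box)  # probability s belongs to the person
            weights[s] = max(0.0, weights[s] - delta)
    return weights
```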

Figures 8a-8f show a series of images of a human object 802 in a scene, comprising respectively: a colour image of the scene (Figure 8a), a corresponding disparity image (Figure 8b), a stereo-based human detection and segmentation result (Figure 8c), a colour image with the HOG-based human detection result (Figure 8d), a colour-based tracking result (Figure 8e), and a colour image of the final expected position (Figure 8f). As shown in Figure 8c, the stereo-based detection successfully captures the position of the human object 802 and segments the profile into head and torso bounding boxes. However, the image-based detection (Figure 8d) is not successful, as the image is a side view and not a front view of the human object 802, hence none of the three human appearance models can be matched. In Figure 8e, the colour-based tracking provides an estimation of the position of the human object 802 in which the position of the head bounding box is offset from the true position. The final expected position in Figure 8f after applying the MMJA approach is a weighted average of the results in Figures 8c-8e, such that the positions of the bounding boxes are substantially the same as the true positions of the head and torso in the frame.

Some steps of method for object detection and tracking as described above may be further improved in several ways. In step 102 of the example embodiment, Intel Image Processing Library (IPL) functions can be used for morphological operations to speed up the calculations. The computational cost of step 102 can also be reduced by reducing the size of image for detecting close objects. For example, the scale-adaptive filtering can be performed on images of half a normal size for detecting persons with small depth distances (e.g. wb(d) > 50 pixels) and on normal size images for persons who are further. The amount of computation can also be reduced by applying the segmentation and verification steps only within the bounding boxes of the image where detections are more likely, thereby minimising or avoiding image processing of unrelated positions in the image.

The method and system of the example embodiment can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiment.

The computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.

The computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).

The computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922. The computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.

The components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.

The application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilising a corresponding data storage medium drive of a data storage device 930. The application program is read and controlled in its execution by the processor 918. Intermediate storage of program data may be accomplished using RAM 920.

Figure 10 shows an implementation of a system 1000 for object detection and tracking according to an example embodiment. The system 1000 comprises a stereo camera 1002, for example a STOC stereo camera from Videre Design, coupled to the computer system 900 via a data connection, e.g. a 1394 firewire connection. The computer system 900 contains computer code means for instructing the computer system 900 to execute the method of object detection and tracking as described above. The system 1000 may be coupled to a service robot (not shown) in an example embodiment such that the results of human detection and tracking are sent to a main control machine of the robot for making decisions.

Figure 11 shows a flowchart 1100 illustrating a method of object detection and tracking in a stereo image frame. At step 1102, two or more steps of a group consisting of: locating position of the object in the frame based on a stereo-based detection; locating position of the object in the frame based on an image-based detection; estimating position of the object in the frame based on a colour-based tracking; and estimating position of the object in the frame based on motion history of the object in a preceding frame, are carried out. At step 1104, all possible candidate tracked objects in the frame from the two or more steps are detected. At step 1106, an expected new position of the object in the frame is calculated based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.

It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

For example, it will be appreciated that the present invention may be applied to detection and tracking of objects other than persons, including, but not limited to, cars. In such embodiments, it will be appreciated that the parameters and models described for use with detection and tracking of persons would be changed or adjusted accordingly without departing from the spirit or scope of the present invention, including e.g. adjusting or choosing detection windows of suitable scales for the relevant objects.
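For instance, a detection-window width for a new object class could be derived from a rough physical width and the depth obtained from the disparity map, as in the sketch below. The physical widths and focal length here are rough assumptions used only to show how the scales would be adjusted for other object classes; they are not values taken from the patent.

```python
# Assumed physical widths (metres), for illustration only.
OBJECT_WIDTH_M = {"person": 0.5, "car": 1.8}

def detection_window_width(obj_class, depth_m, focal_px=500.0):
    """Expected window width in pixels from the pinhole relation w = f * W / d."""
    return focal_px * OBJECT_WIDTH_M[obj_class] / depth_m

print(detection_window_width("person", 3.0))  # about 83 px
print(detection_window_width("car", 10.0))    # 90 px
```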

Claims

1. A method for object detection and tracking in a stereo image sequence, the method comprising:
two or more steps of a group consisting of:
a) locating position of the object in a frame based on a stereo-based detection;
b) locating position of the object in the frame based on an image-based detection;
c) estimating position of the object in the frame based on a colour-based tracking;
d) estimating position of the object in the frame based on motion history of the object in a preceding frame;
further comprising:
e) detecting all possible candidate tracked objects in the frame from the two or more of steps a) to d); and
f) calculating an expected new position of each tracked object in the frame based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.
2. The method as claimed in claim 1, wherein the candidate tracked objects are weighted based on respective association values for each of the candidate tracked objects with a true new position of the tracked object in the MAP framework.
3. The method as claimed in claims 1 or 2, wherein the method is applied to tracking of multiple objects, and each of the tracked objects is assigned a priority, and wherein a tracking order is from the highest priority to the lowest priority.
4. The method as claimed in claim 3, wherein the priority is calculated based on a depth distance of the object in the stereo frame.
5. The method as claimed in claims 3 or 4, wherein the priority is calculated further based on a proportion of the object that is occluded in the image.
6. The method as claimed in claims 3 to 5, wherein the priority is calculated further based on a probability of observing the object from a proportion that is visible in the image.
7. The method as claimed in claims 2 to 5, wherein the priority is calculated further based on maximum association values of the candidate objects.
8. The method as claimed in any one of the preceding claims, wherein step a) comprises locating peaks in disparity information that are substantially close to a dimension of the object.
9. The method as claimed in claim 8, wherein step a) further comprises segmenting the peaks into regions based on the Maximum A Posterior (MAP) framework.
10. The method as claimed in claim 9, wherein the segmenting step comprises using IPL functions.
11. The method as claimed in any one of the preceding claims, wherein step b) comprises using Histogram of Oriented Gradient (HOG) descriptors for distinguishing an appearance of the object.
12. The method as claimed in claim 11, wherein the appearance is selected from a group consisting of an upper body of the object, a 2/3 body of the object and a full body of the object.
13. The method as claimed in claims 11 or 12, wherein an algorithm for step b) comprises:
starting at a smallest depth distance in the stereo frame;
applying detection windows for locating possible objects; and
repeating the above step at the next depth distance.
14. The method as claimed in claim 13, wherein step b) further comprises using disparity information obtained from step a) for finding a detection window having the highest probability of containing the object.
15. The method as claimed in any one of the preceding claims, wherein step c) comprises generating a Dominant Colour Histogram based on disparity information obtained from step a).
16. The method as claimed in any one of the preceding claims, wherein step d) comprises calculating a probability of the object being occluded by at least one different object having a smaller depth distance in the stereo frame.
17. The method as claimed in claim 16, wherein the position of the object in the frame is estimated when the probability of being occluded exceeds a selected value.
18. The method as claimed in any one of the preceding claims, further comprising updating positions of all objects in the frame after all objects have been tracked.
19. A system for object detection and tracking in a stereo image sequence, the system comprising:
two or more of a group consisting of:
a) means for locating position of the object in a frame based on a stereo-based detection;
b) means for locating position of the object in the frame based on an image-based detection;
c) means for estimating position of the object in the frame based on a colour-based tracking;
d) means for estimating position of the object in the frame based on motion history of the object in a preceding frame;
further comprising:
e) means for detecting all possible candidate tracked objects in the frame from the two or more of steps a) to d); and
f) means for calculating an expected new position of each tracked object in the frame based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.
20. A computer storage medium having stored thereon computer code means for instructing a computer system to execute a method of object detection and tracking in a stereo image sequence, the method comprising:
two or more steps of a group consisting of:
a) locating position of the object in a frame based on a stereo-based detection;
b) locating position of the object in the frame based on an image-based detection;
c) estimating position of the object in the frame based on a colour-based tracking;
d) estimating position of the object in the frame based on motion history of the object in a preceding frame;
further comprising:
e) detecting all possible candidate tracked objects in the frame from the two or more of steps a) to d); and
f) calculating an expected new position of each tracked object in the frame based on a weighted average of the candidate tracked objects in a Maximum A Posterior (MAP) framework.
PCT/SG2008/000386 2008-10-06 2008-10-06 Method and system for object detection and tracking WO2010042068A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SG2008/000386 WO2010042068A1 (en) 2008-10-06 2008-10-06 Method and system for object detection and tracking

Publications (1)

Publication Number Publication Date
WO2010042068A1 true WO2010042068A1 (en) 2010-04-15

Family

ID=42100834

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2008/000386 WO2010042068A1 (en) 2008-10-06 2008-10-06 Method and system for object detection and tracking

Country Status (1)

Country Link
WO (1) WO2010042068A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5633705A (en) * 1994-05-26 1997-05-27 Mitsubishi Denki Kabushiki Kaisha Obstacle detecting system for a motor vehicle

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968886A (en) * 2010-09-09 2011-02-09 西安电子科技大学 Centroid tracking framework based particle filter and mean shift cell tracking method
CN102930555A (en) * 2011-08-11 2013-02-13 深圳迈瑞生物医疗电子股份有限公司 Method and device for tracking interested areas in ultrasonic pictures
CN102930555B (en) * 2011-08-11 2016-09-14 深圳迈瑞生物医疗电子股份有限公司 A kind of method and device that area-of-interest in ultrasonoscopy is tracked
JP2013122763A (en) * 2011-12-12 2013-06-20 Samsung Electronics Co Ltd Video processor and video processing method
CN103473757A (en) * 2012-06-08 2013-12-25 株式会社理光 Object tracking method in disparity map and system thereof
CN103473757B (en) * 2012-06-08 2016-05-25 株式会社理光 Method for tracing object in disparity map and system
CN102779347A (en) * 2012-06-14 2012-11-14 清华大学 Method and device for tracking and locating target for aircraft
WO2019025729A1 (en) * 2017-08-02 2019-02-07 Kinestesia Analysis of a movement and/or of a posture of at least a portion of the body of a person
FR3069942A1 (en) * 2017-08-02 2019-02-08 Kinestesia Analysis of a movement and / or a posture of at least one part of the body of an individual

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 08813735; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase in: Ref country code: DE

122 Ep: pct application non-entry in european phase (Ref document number: 08813735; Country of ref document: EP; Kind code of ref document: A1)