WO2005050543A1 - Adaptive probabilistic visual tracking with incremental subspace update - Google Patents


Info

Publication number
WO2005050543A1
Authority
WO
WIPO (PCT)
Prior art keywords
digital images
image
eigenbasis
model
location
Application number
PCT/US2004/038189
Other languages
French (fr)
Inventor
Ming-Hsuan Yang
Jongwoo Lim
David Ross
Ruei-Sung Lin
Original Assignee
Honda Motor Co., Ltd.
Application filed by Honda Motor Co., Ltd.
Priority to EP04811059A (patent EP1704509A4)
Priority to JP2006539984A (patent JP4509119B2)
Publication of WO2005050543A1

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00 Image analysis
                    • G06T 7/20 Analysis of motion
                        • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 Arrangements for image or video recognition or understanding
                    • G06V 10/20 Image preprocessing
                        • G06V 10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
                    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
                            • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
                                • G06V 10/755 Deformable models or variational models, e.g. snakes or active contours
                                    • G06V 10/7557 Deformable models or variational models, e.g. snakes or active contours based on appearance, e.g. active appearance models [AAM]


Abstract

A system and method (Fig. 2) are disclosed for adaptive probabilistic tracking of an object within a motion video. The method utilizes a time-varying Eigenbasis (218) and dynamic (224), observation (230) and inference (236) models. The Eigenbasis (218) serves as a model of the target object. The dynamic model (224) represents the motion of the object and defines possible locations of the target based upon previous locations. The observation model (230) provides a measure of the distance of an observation of the object relative to the current Eigenbasis (218). The inference model (236) predicts the most likely location of the object based upon past and present observation.

Description

ADAPTIVE PROBABILISTIC VISUAL TRACKING WITH INCREMENTAL SUBSPACE UPDATE

INVENTORS: MING-HSUAN YANG, JONGWOO LIM, DAVID ROSS, RUEI-SUNG LIN

CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority under 35 USC § 119(e) to U.S. Provisional
Patent Application No. 60/520,005, titled "Adaptive Probabilistic Visual Tracking With
Incremental Subspace Update", the content of which is incorporated by reference herein in its entirety.
[0002] This application is related to U.S. Patent Application Number 10/703,294, filed on November 06, 2003, entitled "Clustering Appearances of Objects Under Varying
Illumination Conditions," the content of which is hereby incorporated by reference herein in its entirety.
[0003] This application is related to U.S. Patent Application Number 10/858,878, filed on June 01, 2004, entitled "Method, Apparatus and Program for Detecting an Object," the content of which is hereby incorporated by reference herein in its entirety.

FIELD OF THE INVENTION
[0004] The present invention generally relates to the field of computer vision, and more specifically, to visual tracking of objects within a motion video.

BACKGROUND OF THE INVENTION
[0005] From the photography aficionado's digital camera to high-end computer vision systems, digital imaging is a fast-growing technology that is becoming an integral part of everyday life. In its most basic definition, a digital image is a computer readable representation of an image of a subject taken by a digital imaging device, e.g., a camera, video camera, or the like. A computer readable representation, or digital image, typically includes a number of picture elements, or pixels, arranged in an image file or document according to one of many available graphic formats. For example, some graphic file formats include, without limitation, bitmap, Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG) format, and the like. A subject is anything that can be imaged, i.e., photographed, videotaped, or the like. In general, a subject may be an object or part thereof, a person or a part thereof, a scenic view, an animal, or the like. An image of a subject is typically captured under viewing conditions that, to some extent, make the image unique. In imaging, viewing conditions typically refer to the relative orientation between the camera and the object (i.e., the pose) and the external illumination under which the images are acquired.
[0006] Motion video is generally captured as a series of still images, or frames. Of particular interest and utility is the ability to track the location of an object of interest within the series of successive frames comprising a motion video, a concept generally referred to as visual tracking. Example applications include without limitation intelligence gathering, whereby the location and description of the target object over time are of interest, and robotics, whereby a machine may be directed to perform certain actions based upon the perceived location of a target object.
[0007] The non-stationary aspects of the target object and the background within the overall image challenge the design of visual tracking methods. Conventional algorithms may be able to track objects, either previously viewed or not, over short spans of time and in well-controlled environments. However, these algorithms usually fail to observe the object's motion or eventually encounter significant drifts, either due to drastic change in the object's appearance or large lighting variation. Although such situations have been ameliorated, most visual tracking algorithms typically operate on the premise that the target object does not change drastically over time. Consequently, these algorithms initially build static models of the target object, without accounting for changes in appearance, e.g., large variation in pose or facial expression, or in the surroundings, e.g., lighting variation. Such an approach is prone to instability.
[0008] From the above, there is a need for an improved, robust method for visual tracking that learns and adapts to intrinsic changes, e.g., in pose or shape variation of the target object itself, as well as to extrinsic changes, e.g., in camera orientation, illumination or background.

SUMMARY OF THE INVENTION
[0009] The present invention provides a method and apparatus for visual tracking that incrementally updates a description of the target object. According to the iterative tracking algorithm, an Eigenbasis represents the object being tracked. At successive frames, possible object locations near a predicted position are postulated according to a dynamic model. An observation model then provides a maximum a posteriori estimate of object location, whereby the possible location that can best be approximated by the current Eigenbasis is chosen. An inference model applies the dynamic and observation models over multiple past frames to predict the next location of the target object. Finally, the Eigenbasis is updated to account for changes in appearance of the target object.
[0010] According to one embodiment of the invention, the dynamic model represents the incremental motion of the target object using an affine warping model. This model represents linear translation, rotation and scaling as a function of each observed frame and the current target object location, according to multiple normal distributions. The observation model utilizes a probabilistic principal components distribution to evaluate the probability that the currently observed image was generated by the current Eigenbasis. A description of this is in M. E. Tipping and C.M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B 61 (1999), which is incorporated by reference herein in its entirety. The inference model utilizes a simple sampling method that operates on successive frame pairs to efficiently and effectively infer the most likely location of the target object. The Eigenbasis is updated according to application of the sequential Karhunen-Loeve algorithm, and the Eigenbasis may be optionally initialized when training information is available.
[0011] A second embodiment extends the first in that the sequential inference model operates over a sliding window comprising a selectable number of successive frames. The dynamic model represents six parameters, including those discussed above plus aspect ratio and skew direction. The observation model is extended to accommodate the orthonormal components of the distance between observations and the Eigenbasis. Finally, the Eigenbasis model and update algorithm are extended to account for variations in the sample mean while providing an exact solution, and no initialization of the Eigenbasis is necessary.

[0012] According to another embodiment of the present invention, a system is provided that includes a computer system comprising an input device to receive the digital images, a storage or memory module for storing the set of digital images, and a processor for implementing identity-based visual tracking algorithms.

[0013] The embodiments of the invention thus discussed facilitate efficient computation, robustness and stability. Furthermore, they provide object recognition in addition to tracking. Experimentation demonstrates that the method of the invention is able to track objects well in real time under large lighting, pose and scale variation.

[0014] The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The invention has other advantages and features which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:
[0016] Figure ("FIG.") 1 is a schematic illustration of the visual tracking concept.
[0017] Figure 2 shows an overall algorithm for visual tracking.
[0018] Figure 3 shows an algorithm for initial Eigenbasis construction according to one embodiment of the present invention.
[0019] Figure 4 illustrates a concept of the dynamic model according to one embodiment of the present invention.
[0020] Figure 5 illustrates a concept of the distance-to-subspace observation model according to one embodiment of the present invention.
[0021] Figure 6 illustrates a concept of the distance-to-mean observation model according to one embodiment of the present invention.
[0022] Figure 7 shows a computer-based system according to one embodiment of the present invention.
[0023] Figure 8 shows the results of an experimental application of one embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0024] The Figures ("FIG.") and the following description relate to preferred embodiments of the present invention by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of the claimed invention.
[0025] Reference will now be made in detail to several embodiments of the present invention(s), examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
[0026] The object tracking problem is illustrated schematically in Figure 1. At each time step t an image region or frame F_t is observed in sequence, and the location of the target object, L_t, is treated as an unobserved or hidden state variable. The motion of the object from one frame to the next is modeled based upon the probability of the object appearing at L_t, given that it was just at L_{t-1}. In other words, the model represents possible locations of the object at time t, as determined prior to observing the current image frame. The likelihood that the object is located at a particular possible position is then determined according to a probability distribution. The goal is to determine the most probable a posteriori object location.
[0027] Referring now to Figure 2, a first embodiment of the invention is depicted. An initial frame vector is received in step 206. This frame vector includes one element per pixel. Each pixel comprises a description of brightness, color, etc. In step 212, the initial location of the target object is determined. This may be accomplished either manually or through automatic means.
[0028] An example of automatic object location determination is face detection. One embodiment of face detection is illustrated in patent application 10/858,878, Method, Apparatus and Program for Detecting an Object, which is incorporated by reference herein in its entirety. Such an embodiment informs the tracking method of an object or area of interest within an image.
[0029] In step 218, an initial Eigenbasis is optionally constructed. The Eigenbasis is a mathematically compact representation of the class of objects that includes the target object. For example, for a set of images of a particular human face captured under different illumination conditions, a polyhedral cone may be defined by a set of lines, or eigenvectors, in a multidimensional space R^S, where S is the number of pixels in each image. The cone then bounds the set of vectors corresponding to that human's face under all possible or expected illumination conditions. An Eigenbasis representing the cone may in turn be defined within the subspace R^M, where M < S. By defining multiple such subspaces corresponding to different human subjects, and by computing the respective distances to an image including an unidentified subject, the identity of the subject may be efficiently determined. The same concepts apply generally to other classes of objects of interest, including, e.g., animals, automobiles, geometric shapes, etc.
[0030] Figure 3 illustrates initialization of the Eigenbasis from a set of training images of the object of interest or of similar objects. While such initialization may accelerate convergence of the Eigenbasis, it may be eliminated for simplicity, or where training images are unavailable. In step 312, all training images are histogram-equalized. In step 318, the mean is subtracted from the data. The desired principal components are computed in step 324. Finally, the Eigenbasis is created in step 330.
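These steps lend themselves to a compact implementation. The following is a minimal sketch, assuming numpy and 8-bit grayscale images; the helper names and the default of 16 retained eigenvectors (the value used in the experiments below) are illustrative, not code from the patent.

```python
import numpy as np

def histogram_equalize(img):
    """Step 312: histogram-equalize one 8-bit grayscale image."""
    hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # normalized CDF in [0, 1]
    return np.interp(img.ravel(), np.arange(256), 255.0 * cdf).reshape(img.shape)

def build_initial_eigenbasis(training_images, num_eigenvectors=16):
    """Steps 318-330: subtract the mean, then keep the top principal components."""
    data = np.column_stack([histogram_equalize(im).ravel() for im in training_images])
    mean = data.mean(axis=1, keepdims=True)                     # step 318
    U, S, _ = np.linalg.svd(data - mean, full_matrices=False)   # step 324: PCA via SVD
    return U[:, :num_eigenvectors], S[:num_eigenvectors], mean  # step 330
```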
[0031] Returning to Figure 2, in step 224, a dynamic model is employed to predict possible locations of the target object in the next frame, L_{t+1}, based upon the location within the current frame, L_t, according to a distribution p(L_{t+1} | L_t). This is shown conceptually in Figure 4, including location in the current frame 410 and possible locations in the next frame 420(i). In other words, a probability distribution provided by the dynamic model encodes beliefs about where the target object might be at time t, prior to observing the respective frame and image region.

[0032] According to dynamic model 224, L_t, the location of the target object at time t, is represented using the four parameters of a similarity transformation, i.e., x_t and y_t for translation in x and y, r_t for rotation, and s_t for scaling. This transformation warps the image, placing the target window, corresponding to the boundary of the object being tracked, in a rectangle centered at coordinates (0,0), with the appropriate width and height. This warping operates as a function of an image region F_t and the object location L_t, i.e., w(F_t, L_t).

[0033] The initialization of dynamic model 224 assumes that each parameter is independently distributed, according to a normal distribution, around a predetermined location L_0. Specifically,

    p(L_1 | L_0) = N(x_1; x_0, σ_x^2) N(y_1; y_0, σ_y^2) N(r_1; r_0, σ_r^2) N(s_1; s_0, σ_s^2)    (1)

where N(z; μ, σ^2) denotes evaluation of the normal distribution function for data point z, with mean μ and variance σ^2.
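As a sketch of how candidate locations can be drawn under this model, each of the four similarity parameters is perturbed by independent Gaussian noise (numpy assumed; the function name is illustrative):

```python
import numpy as np

def sample_candidate_locations(prev_loc, sigmas, num_samples, rng=None):
    """Draw samples from p(L_t | L_{t-1}), i.e., the form of Equation (1).

    prev_loc: (x, y, r, s), the similarity-transform parameters at time t-1.
    sigmas:   per-parameter standard deviations (sigma_x, sigma_y, sigma_r, sigma_s).
    """
    rng = np.random.default_rng() if rng is None else rng
    prev = np.asarray(prev_loc, dtype=float)
    # Each parameter is independently normally distributed about its previous value.
    return prev + rng.normal(0.0, np.asarray(sigmas), size=(num_samples, len(prev)))
```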
[0034] Returning to Figure 2, in step 230, an image observation model is next applied.
Since the Eigenbasis is used to model the target object's appearance, the observation model evaluates the probability that the currently observed image was generated by the current Eigenbasis. A probabilistic principal components distribution (also known as sensible PCA) may serve as a basis for this model. A description of this is in S. Roweis, "EM algorithms for PCA and SPCA," Advances in Neural Information Processing Systems, M.I. Jordan, M.J. Kearns and S.A. Solla eds., 10 MIT Press (1997), which is incorporated by reference herein in its entirety. Given a location l_t, this model assumes that the observed image region was generated by sampling an appearance of the object from the Eigenbasis and inserting it at l_t. Following Roweis, and as illustrated conceptually in Figure 5, the probability of observing a datum z given the Eigenbasis B and mean μ is N(z; μ, BB^T + εI), where the εI term corresponds to the covariance of additive Gaussian noise present in the observation process. Such noise might arise, for example, from data quantization, errors in the video sensor or thermal effects. In the limit as ε → 0, N(z; μ, BB^T + εI) is proportional to the negative exponential of the squared distance between z and the linear subspace B,

    ||(z - μ) - BB^T(z - μ)||^2.
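In this small-noise limit, scoring a candidate region therefore reduces to its squared distance from the current eigenbasis. A minimal sketch (numpy assumed):

```python
import numpy as np

def log_likelihood_distance_to_subspace(z, B, mu):
    """log N(z; mu, B B^T + eps I) up to a constant, as eps -> 0: the negative
    squared distance between z and the linear subspace spanned by B."""
    d = z - mu
    residual = d - B @ (B.T @ d)   # component of (z - mu) orthogonal to the basis
    return -float(residual @ residual)
```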
[0035] Again referring to Figure 2, an inference model 236 is next applied to predict the location of the target object. According to the probabilistic model of Figure 1, since L_t is never directly observed, full Bayesian inference would require computation of the distribution p(L_t | F_t, F_{t-1}, ..., F_1, L_0) at each time step. Unfortunately, this distribution is infeasible to compute in closed form. Instead, it is approximated using a normal distribution of the same form as that in Equation (1) around the maximum l*_t of p(L_t | F_t, l*_{t-1}). Using Bayes' rule to integrate the observation with the prior belief yields the conclusion that the most probable a posteriori object location is at the maximum l*_t of

    p(F_t | L_t) p(L_t | l*_{t-1}).    (2)

[0036] An approximation to l*_t can be efficiently and effectively computed using a simple sampling method. Specifically, a number of sample locations are drawn from the prior p(L_t | l*_{t-1}). For each sample l_s, the posterior probability p_s = p(l_s | F_t, l*_{t-1}) is computed; p_s is simply the likelihood of l_s under the probabilistic PCA distribution, times the probability with which l_s was sampled, disregarding the normalization factor which is constant across all samples. Finally, the sample with the largest posterior probability is selected to be the approximate l*_t, i.e., the maximizer of (2) over the drawn samples. This method has the advantageous property that a single parameter, namely the number of samples, can be used to control the tradeoff between speed and tracking accuracy.
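A sketch of this sampling approximation, reusing the two helpers above; `warp` stands in for the warping function w(F_t, L_t) and is an assumed callable, not code given in the patent:

```python
import numpy as np

def map_location(frame, prev_map_loc, sigmas, B, mu, warp, num_samples=200):
    """Approximate l*_t: draw samples from the prior, weight by likelihood times
    sampling probability (the common normalization factor is ignored), keep the best."""
    samples = sample_candidate_locations(prev_map_loc, sigmas, num_samples)
    log_prior = -0.5 * np.sum(
        ((samples - np.asarray(prev_map_loc)) / np.asarray(sigmas)) ** 2, axis=1)
    log_lik = np.array([log_likelihood_distance_to_subspace(warp(frame, l), B, mu)
                        for l in samples])
    return samples[int(np.argmax(log_lik + log_prior))]   # approximate MAP location
```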
[0037] To allow for incremental updates to the target object model, the probability distribution of observations is not fixed over time. Rather, recent observations are used to update this distribution, albeit in a non-Bayesian fashion. Given an initial Eigenbasis B_{t-1} and a new appearance w_t = w(F_t, l*_t), a new basis B_t is computed using the sequential Karhunen-Loeve (K-L) algorithm, as described below. A description of this is in A. Levy and M. Lindenbaum, "Sequential Karhunen-Loeve basis extraction and its application to images," IEEE Transactions on Image Processing 9 (2000), which is incorporated by reference herein in its entirety. The new basis is used when calculating p(F_{t+1} | L_{t+1}). Alternately, the mean of the probabilistic PCA model can be updated online, as described below.

[0038] The sampling method thus described is flexible and can be applied to automatically localize targets in the first frame, though manual initialization or sophisticated object detection algorithms are also applicable. By specifying a broad prior (e.g., a Gaussian distribution with a larger covariance matrix or larger standard deviation) over the entire image, and by drawing enough samples, the target can be located by the maximum response using the current distribution and the initial Eigenbasis.
[0039] Since the appearance of the target object or its illumination may be time-varying, and since an Eigenbasis is used for object representation, it is important to continually update the Eigenbasis from the time-varying covariance matrix. This is represented by step 242 in Figure 2. This problem has been studied in the signal processing community, where several computationally efficient techniques have been proposed in the form of recursive algorithms. A description of this is in B. Champagne and Q.G. Liu, "Plane rotation-based EVD updating schemes for efficient subspace tracking," IEEE Transactions on Signal Processing 46 (1998), which is incorporated by reference herein in its entirety. In this embodiment, a variant of the efficient sequential Karhunen-Loeve algorithm is utilized to update the Eigenbasis, as explained in Levy and Lindenbaum, which was cited above. This in turn is based on the classic R-SVD method. A description of this is in G.H. Golub and C.F. Van Loan, "Matrix Computations," The Johns Hopkins University Press (1996), which is incorporated by reference herein in its entirety.
[0040] Let X = UΣV^T be the SVD of a data matrix of size M x P, where each column vector is an observation (e.g., an image). The R-SVD algorithm provides an efficient way to carry out the SVD of a larger matrix X' = ( X | E ), where E is an M x K matrix consisting of K additional observations (e.g., incoming images), as follows:

1. Use an orthonormalization process (e.g., the Gram-Schmidt algorithm) on ( U | E ) to obtain an orthonormal matrix U' = ( U | Ẽ ), where Ẽ spans the component of E orthogonal to U.

2. Form the matrix

    V' = [ V  0 ]
         [ 0  I ]

where I is a K-dimensional identity matrix.

3. Let

    Σ' = U'^T X' V' = [ Σ  U^T E ]
                      [ 0  Ẽ^T E ]

since Σ = U^T X V and Ẽ^T X V = 0. Note that the K rightmost columns of Σ' are the new image vectors, represented in the updated orthonormal basis spanned by the columns of U'.

4. Compute the SVD of the small matrix, Σ' = Ũ Σ̃ Ṽ^T, so that the SVD of X' is

    X' = U' (Ũ Σ̃ Ṽ^T) V'^T = (U'Ũ) Σ̃ (V'Ṽ)^T.

[0041] By exploiting the orthonormal properties and block structure, the SVD of X' can be carried out efficiently using the smaller matrices Ũ, Ṽ and the SVD of the smaller matrix Σ'.
[0042] Based on the R-SVD method, the sequential Karhunen-Loeve algorithm further exploits the low dimensional subspace approximation and only retains a small number of eigenvectors as new data arrive, as explained in Levy and Lindenbaum, which was cited above.
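A compact sketch of this update, retaining only the basis and singular values (V is not needed for tracking); the matrix names follow the steps above, and the truncation parameter is an assumption:

```python
import numpy as np

def r_svd_update(U, S, E, max_basis_size=16):
    """Sequential Karhunen-Loeve step: fold K new (mean-free) observations E
    into the existing SVD factors U (basis) and S (singular values)."""
    E_proj = U.T @ E                            # coordinates of E in the current basis
    E_tilde, _ = np.linalg.qr(E - U @ E_proj)   # step 1: orthonormal complement of E
    # Step 3: the small block matrix Sigma' (the V' bookkeeping of step 2 is skipped).
    Sigma_new = np.block([
        [np.diag(S),                           E_proj],
        [np.zeros((E_tilde.shape[1], len(S))), E_tilde.T @ E]])
    # Step 4: SVD of the small matrix, then rotate and truncate the basis.
    U_t, S_t, _ = np.linalg.svd(Sigma_new, full_matrices=False)
    U_new = np.hstack([U, E_tilde]) @ U_t
    return U_new[:, :max_basis_size], S_t[:max_basis_size]
```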
[0043] Referring again to Figure 2, following the first Eigenbasis update, the loop control comprising steps 248 and 256 causes dynamic model 224, observation model 230, inference model 236 and Eigenbasis update 242 to be applied to successive frames until the last frame has been processed.
[0044] This embodiment is flexible in that it can be carried out with or without constructing an initial Eigenbasis as per step 218. For the case where training images of the object are available and well cropped, an Eigenbasis can be constructed that is useful at the onset of tracking. However, since training images may be unavailable, the algorithm can gradually construct and update an Eigenbasis from the incoming images if the target object is localized in the first frame.

[0045] According to a second embodiment of the visual tracking algorithm, no training images of the target object are required prior to the start of tracking. That is, after target region initialization, the method learns a low dimensional eigenspace representation online and incrementally updates it. In addition, the method incorporates a particle filter so that the sample distributions are propagated over time. Based on the Eigenspace model with updates, an effective likelihood estimation function is developed. Also, the R-SVD algorithm updates both the sample mean and Eigenbasis as new data arrive. Finally, the present method utilizes a robust error norm for likelihood estimation in the presence of noisy data or partial occlusions, thereby rendering accurate and robust tracking results.

[0046] Referring again to Figure 2, according to the present method, the initial frame vector is received in step 206 and the initial location of the target object is established in step 212. However, step 218, Eigenbasis initialization, is eliminated, thus advantageously allowing tracking of objects for which no description is available a priori. As described below, the Eigenbasis is learned online and updated during the object tracking process.

[0047] In this embodiment, dynamic model 224 is implemented as an affine image-warping algorithm that approximates the motion of a target object between two consecutive frames. A state variable X_t describes the affine motion parameters, and thereby the location, of the target at time t. In particular, six parameters model the state transition from X_{t-1} to X_t of a target object being tracked. Let X_t = (x_t, y_t, θ_t, s_t, α_t, φ_t), where x_t, y_t, θ_t, s_t, α_t, φ_t denote x-y translation, rotation angle, scale, aspect ratio, and skew direction at time t. Each parameter in X_t is modeled independently by a Gaussian distribution around its counterpart in X_{t-1}. That is,

    p(X_t | X_{t-1}) = N(X_t; X_{t-1}, Ψ)

where Ψ is a diagonal covariance matrix whose elements are the corresponding variances of the affine parameters, i.e., σ_x^2, σ_y^2, σ_θ^2, σ_s^2, σ_α^2, σ_φ^2.
[0048] According to this embodiment, observation model 230 employs a probabilistic interpretation of principal component analysis. A description of this is in M. E. Tipping and C.M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B, 61(3), 1999, which is incorporated by reference herein in its entirety. Given a target object predicted by X_t, this model assumes that the observed image I_t was generated from a subspace spanned by U and centered at μ, as depicted in Figure 6. The probability that a sample of the target object was generated from the subspace is inversely proportional to the distance d from the sample to the reference point, i.e., center, of the subspace, μ. This distance can be decomposed into the distance-to-subspace d_t and the distance-within-subspace d_w from the projected sample to the subspace center. This distance formulation is based on an orthonormal subspace and its complement space, and is similar in spirit to the description given in B. Moghaddam and A. Pentland, "Probabilistic visual learning for object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 1997, which is incorporated by reference herein in its entirety.

[0049] The probability that a sample was generated from subspace U, p_d(I_t | X_t), is governed by a Gaussian distribution:

    p_d(I_t | X_t) = N(I_t; μ, UU^T + εI)

where I is an identity matrix, μ is the mean, and εI corresponds to the additive Gaussian noise in the observation process. It can be shown that the negative exponential distance from I_t to the subspace spanned by U, i.e., exp(-||(I_t - μ) - UU^T(I_t - μ)||^2), is proportional to p_d(I_t | X_t) = N(I_t; μ, UU^T + εI) as ε → 0, as explained in Roweis, which was cited above.

[0050] Within a subspace, the likelihood of the projected sample can be modeled by the Mahalanobis distance from the mean as follows:

    p_w(I_t | X_t) = N(I_t; μ, U Σ^{-2} U^T)

where μ is the center of the subspace and Σ is the matrix of singular values corresponding to the columns of U.

[0051] Combining the above, the likelihood of a sample being generated from the subspace is governed by

    p(I_t | X_t) = p_d(I_t | X_t) p_w(I_t | X_t) = N(I_t; μ, UU^T + εI) N(I_t; μ, U Σ^{-2} U^T).    (3)
[0052] Given a drawn sample X_t and the corresponding image region I_t, the observation model of this embodiment computes p(I_t | X_t) using (3). To minimize the effects of noisy pixels, the robust error norm

    ρ(x, σ) = x^2 / (σ^2 + x^2)

is used instead of the Euclidean norm ρ(x) = ||x||^2, to ignore the "outlier" pixels, e.g., the pixels that are not likely to appear inside the target region given the current Eigenspace. A description of this is in M. J. Black and A. D. Jepson, "Eigentracking: Robust matching and tracking of articulated objects using view-based representation," Proceedings of European Conference on Computer Vision, 1996, which is incorporated by reference herein in its entirety. A method similar to that used in Black and Jepson is applied in order to compute d_t and d_w. This robust error norm is helpful especially when a rectangular region is used to enclose the target, which region inevitably contains some "noisy" background pixels.

[0053] Again referring to Figure 2, inference model 236 is next applied. According to this embodiment, given the set of observed images I_{1:t} = {I_1, ..., I_t}, the value of the hidden state variable X_t is estimated. Using Bayes' theorem,

    p(X_t | I_{1:t}) ∝ p(I_t | X_t) ∫ p(X_t | X_{t-1}) p(X_{t-1} | I_{1:t-1}) dX_{t-1}.
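Putting the pieces of the observation model together, the following sketch evaluates the log-likelihood with the robust error norm applied to the residual; the noise parameters eps and sigma are assumptions, as the text does not give values:

```python
import numpy as np

def robust_norm(x, sigma):
    """rho(x, sigma) = x^2 / (sigma^2 + x^2), applied per pixel to damp outliers."""
    x2 = x * x
    return x2 / (sigma * sigma + x2)

def log_observation_likelihood(I, U, S, mu, eps=0.01, sigma=0.1):
    """log p(I | X) up to constants: a robust distance-to-subspace term d_t plus
    the within-subspace Mahalanobis term d_w."""
    d = I - mu
    proj = U.T @ d                   # coordinates within the subspace
    residual = d - U @ proj          # component orthogonal to the subspace
    d_t = np.sum(robust_norm(residual, sigma))   # robust distance-to-subspace
    d_w = np.sum((proj / S) ** 2)                # Mahalanobis distance from the mean
    return -(d_t / eps + d_w)
```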
[0054] The tracking process is governed by the observation model p(I_t | X_t), which measures the likelihood of observing I_t given state X_t, and the dynamic model between two states, p(X_t | X_{t-1}). The Condensation algorithm, based on factored sampling, approximates an arbitrary distribution of observations with a stochastically generated set of weighted samples. A description of this is in M. Isard and A. Blake, "Contour tracking by stochastic propagation of conditional density," Proceedings of the Fourth European Conference on Computer Vision, Volume 2, 1996, which is incorporated by reference herein in its entirety. According to this embodiment, the inference model uses a variant of the Condensation algorithm to model the distribution over the object's location as it evolves over time. In other words, this embodiment is a Bayesian approach that integrates the information over time.
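One Condensation-style step might look as follows; the `likelihood` callback plays the role of p(I_t | X_t) above (e.g., the exponential of log_observation_likelihood), and the resampling details are assumptions:

```python
import numpy as np

def condensation_step(particles, weights, frame, sigmas, likelihood, rng=None):
    """Resample by weight, propagate with Gaussian dynamics p(X_t | X_{t-1}),
    and re-weight with the observation likelihood."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights / weights.sum())    # factored sampling
    particles = particles[idx] + rng.normal(0.0, sigmas, size=particles.shape)
    weights = np.array([likelihood(frame, x) for x in particles])
    return particles, weights / weights.sum()
```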
[0055] Referring again to Figure 2, the Eigenbasis is next updated in step 242. In this embodiment, variations in the mean are accommodated as successive frames arrive. Although conventional methods may accomplish this, they only accommodate one datum per update, and provide only approximate results. Advantageously, this embodiment handles multiple data at each Eigenbasis update, and renders exact solutions. A description of this is in P. Hall, D. Marshall, and R. Martin, "Incremental Eigenanalysis for classification," Proceedings of British Machine Vision Conference, 1998, which is incorporated by reference herein in its entirety. Given a sequence of -dimensional image vectors I r., let , = ff,. l2> •••> V. «* <W,.*--W ' and = )-
[0056] Given the mean $\bar{I}_p$ and the SVD of the existing data $\mathcal{E}_p$, i.e., $U_p \Sigma_p V_p^T$, and given the counterparts for the new data $\mathcal{E}_q$, the mean $\bar{I}_r$ and the SVD of $\mathcal{E}_r$, i.e., $U_r \Sigma_r V_r^T$, are computed easily by extending the method of the first embodiment as follows:

1. Compute $\bar{I}_r = \frac{n}{n+m}\,\bar{I}_p + \frac{m}{n+m}\,\bar{I}_q$ and $\hat{E} = \left[\,(I_{n+1} - \bar{I}_q)\;\cdots\;(I_{n+m} - \bar{I}_q)\;\;\sqrt{\tfrac{nm}{n+m}}\,(\bar{I}_p - \bar{I}_q)\,\right]$;
2. Compute the R-SVD with $U_p \Sigma_p V_p^T$ and $\hat{E}$ to obtain $U_r \Sigma_r V_r^T$.
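A compact sketch of this mean-corrected block update follows. For clarity it forms the result by a direct SVD of the augmented matrix, which yields the same $U_r$ and $\Sigma_r$ as (though less efficiently than) a recursive R-SVD; all variable names are assumptions of the sketch.

```python
def update_eigenbasis(mu_p, U_p, S_p, n, new_images):
    """Block update of the mean and Eigenbasis with m new image vectors.

    mu_p       : (d,)   mean of the n previously seen images
    U_p, S_p   : (d, k) Eigenbasis and (k,) singular values of the old data
    new_images : (d, m) new observations, one flattened image per column
    """
    _, m = new_images.shape
    mu_q = new_images.mean(axis=1)
    # Step 1: combined mean and the augmented block E-hat, whose extra
    # column corrects for the shift between the old and new means.
    mu_r = (n * mu_p + m * mu_q) / (n + m)
    mean_shift = np.sqrt(n * m / (n + m)) * (mu_p - mu_q)
    E_hat = np.column_stack([new_images - mu_q[:, None], mean_shift])
    # Step 2: the left singular vectors/values of [U_p diag(S_p) | E-hat]
    # equal those of [U_p Sigma_p V_p^T | E-hat], so V_p need not be stored.
    U_r, S_r, _ = np.linalg.svd(np.column_stack([U_p * S_p, E_hat]),
                                full_matrices=False)
    k = len(S_p)
    return mu_r, U_r[:, :k], S_r[:k], n + m   # keep the k leading components
```

Truncating back to $k$ components after each update keeps the representation low dimensional, as the experiments described below do with 16 eigenvectors.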
[0057] In many visual tracking applications, the low-dimensional approximation of image data can be further exploited by putting larger weights on more recent observations, or, equivalently, by down-weighting the contributions of previous observations. For example, as the appearance of a target object gradually changes, more weight may be placed on recent observations in updating the Eigenbasis, since recent observations are more likely to resemble the current appearance of the target. A forgetting factor $f$ can be used under this premise, as suggested in Levy and Lindenbaum, which was cited above, i.e., $A' = (fA \mid \hat{E}) = (U(f\Sigma)V^T \mid \hat{E})$, where $A$ and $A'$ are the original and weighted data matrices, respectively.
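In terms of the sketch above, the forgetting factor simply scales the retained singular values before the block update; the value of `f` below is an assumed tuning constant, not one prescribed by the disclosure.

```python
# Down-weight earlier observations with a forgetting factor 0 < f <= 1:
# A' = (fA | E-hat) corresponds to scaling the old singular values by f.
f = 0.95
mu_p, U_p, S_p, n = update_eigenbasis(mu_p, U_p, f * S_p, n, new_images)
```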
[0058] Now referring to Figure 7, a system according to one embodiment of the present invention is shown. Computer system 700 comprises an input module 710, a memory device 714, a processor 716, and an output module 718. In an alternative embodiment, an image processor 712 can be part of the main processor 716 or a dedicated device to pre-format digital images to a preferred image format. Similarly, memory device 714 may be a standalone memory device (e.g., a random access memory chip, flash memory, or the like) or an on-chip memory with the processor 716 (e.g., cache memory). Likewise, computer system 700 can be a stand-alone system, such as a server, a personal computer, or the like. Alternatively, computer system 700 can be part of a larger system, such as, for example, a robot having a vision system (e.g., the ASIMO advanced humanoid robot of Honda Motor Co., Ltd., Tokyo, Japan), a security system (e.g., an airport security system), or the like.

[0059] According to this embodiment, computer system 700 comprises an input module 710 to receive the digital images I. The digital images I may be received directly from an imaging device 701, for example, a digital camera 701a (e.g., robotic eyes), a video system 701b (e.g., closed-circuit television), an image scanner, or the like. Alternatively, the input module 710 may be a network interface to receive digital images from another network system, for example, an image database, another vision system, Internet servers, or the like. The network interface may be a wired interface, such as a USB, RS-232 serial port, Ethernet card, or the like, or may be a wireless interface module, such as a wireless device configured to communicate using a wireless protocol, e.g., Bluetooth, WiFi, IEEE 802.11, or the like.
[0060] An optional image processor 712 may be part of the processor 716 or a dedicated component of the system 700. The image processor 712 could be used to pre-process the digital images I received through the input module 710 to convert them to the preferred format on which the processor 716 operates. For example, if the digital images I received through the input module 710 come from a digital camera 701a in a JPEG format and the processor is configured to operate on raster image data, image processor 712 can be used to convert from JPEG to raster image data.

[0061] The digital images I, once in the preferred image format (if an image processor 712 is used), are stored in the memory device 714 to be processed by processor 716. Processor 716 applies a set of instructions that, when executed, perform one or more of the methods according to the present invention, e.g., dynamic model, Eigenbasis update, and the like. While executing the set of instructions, processor 716 accesses memory device 714 to perform the operations according to methods of the present invention on the image data stored therein.
[0062] Processor 716 tracks the location of the target object within the input images, I, and outputs indications of the tracked object's identity and location through the output module 718 to an external device 725 (e.g., a database 725a, a network element or server 725b, a display device 725c, or the like). Like the input module 710, output module 718 can be wired or wireless. Output module 718 may be a storage drive interface (e.g., a hard-drive or optical-drive driver), a network interface device (e.g., an Ethernet interface card, wireless network card, or the like), a display driver (e.g., a graphics card, or the like), or any other such device for outputting the target object identification and/or location.

[0063] To evaluate the performance of the image tracking algorithm, videos were recorded in indoor and outdoor environments where the target objects changed pose under different lighting conditions. Each video comprises a series of 320x240-pixel gray-scale images and was recorded at 15 frames per second. For the Eigenspace representation, each target image region was resized to a 32x32 patch, and the number of eigenvectors used in all experiments was set to 16, though fewer eigenvectors may also work well. The tracking algorithm was implemented in MATLAB with MEX, and runs at 4 frames per second on a standard computer with 200 possible particle locations.
[0064] Figure 8 shows nine panels of excerpted information for a sequence containing an animal doll moving through different poses, scales, and lighting conditions. Within each panel, the topmost image is the captured frame. The frame number is denoted in the upper left corner, and the superimposed rectangles represent the estimated location of the target object. The images in the second row of each panel show the current sample mean, the tracked image region, the reconstructed image based on the mean and Eigenbasis, and the reconstruction error, respectively. The third and fourth rows show the ten largest Eigenvectors. All Eigenbases were constructed automatically without resort to training and were constantly updated to model the target object as its appearance changed. Despite significant camera motion, low frame rate, large pose changes, cluttered background, and lighting variation, the tracking algorithm remained stably locked on the target. Also, despite the presence of noisy background pixels within the rectangular sample window, the algorithm faithfully modeled the appearance of the target, as shown in the Eigenbases and reconstructed images.
[0065] Advantages of the present invention include the ability to efficiently, robustly, and stably track an object within a motion video based upon a method that learns and adapts to intrinsic as well as extrinsic changes. The tracking may be aided by one or more initial training images, but is nonetheless capable of execution where no training images are available. In addition to object tracking, the invention provides object recognition. Experimental confirmation demonstrates that the method of the invention is able to track objects well in real time under large lighting, pose, and scale variation.

[0066] Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a method and apparatus for visual tracking of objects through the disclosed principles of the present invention. Thus, while particular embodiments and applications of the present invention have been illustrated and described, it is to be understood that the invention is not limited to the precise construction and components disclosed herein and that various modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present invention disclosed herein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A computer-based method for tracking a location of an object within two or more digital images of a set of digital images, the method comprising the steps of:
receiving a first image vector representing a first image within the set of digital images;
determining the location of the object from said first image vector;
applying a dynamic model to said first image vector to determine a possible motion of the object between said first image vector and a successive image vector representing a second image within the set of digital images;
applying an observation model to said first image vector to determine a most likely location of the object within said successive image vector from a set of possible locations of the object within said successive image vector;
applying an inference model to said dynamic model and to said observation model to predict said most likely location of the object; and
updating an Eigenbasis representing an image space of the two or more digital images.
2. The method of claim 1, wherein said dynamic model represents linear translation, rotation, and scaling according to an affine warping.
3. The method of claim 1, wherein said observation model comprises a probabilistic principal components distribution and represents a normal component of a distance between observations and said Eigenbasis.
4. The method of claim 1, wherein said inference model comprises a sampling method that operates on successive pairs of the digital images of the set of digital images.
5. The method of claim 1, wherein said updating an Eigenbasis comprises recursive singular value decomposition and application of a sequential Karhunen-Loeve algorithm.
6. The method of claim 1, wherein said dynamic model represents linear translation, rotation, scaling, aspect ratio and skew according to an affine warping.
7. The method of claim 1, wherein said observation model comprises a probabilistic principal components distribution and represents orthonormal components of a distance between observations and said Eigenbasis.
8. The method of claim 1, wherein said inference model operates over a sliding window comprising a selectable number of successive ones of the digital images of the set of digital images.
9. The method of claim 1, wherein said updating an Eigenbasis comprises recursive singular value decomposition and application of a sequential Karhunen-Loeve algorithm and accounts for variations in the sample mean.
10. The method of claim 1, further comprising the step of constructing an initial Eigenbasis representing said image space of the two or more digital images.
11. The method of claim 10, wherein said dynamic model represents linear translation, rotation, and scaling according to an affine warping.
12. The method of claim 10, wherein said observation model comprises a probabilistic principal components distribution and represents a normal component of a distance between observations and said Eigenbasis.
13. The method of claim 10, wherein said inference model comprises a simple sampling method that operates on successive pairs of the digital images of the set of digital images.
14. The method of claim 10, wherein said updating an Eigenbasis comprises recursive singular value decomposition and application of a sequential Karhunen-Loeve algorithm.
15. The method of claim 10, wherein said dynamic model represents linear translation, rotation, scaling, aspect ratio and skew according to an affine warping.
16. The method of claim 10, wherein said observation model comprises a probabilistic principal components distribution and represents orthonormal components of a distance between observations and said Eigenbasis.
17. The method of claim 10, wherein said inference model operates over a sliding window comprising a selectable number of successive ones of the digital images of the set of digital images.
18. The method of claim 10, wherein said updating an Eigenbasis comprises recursive singular value decomposition and application of a sequential Karhunen-Loeve algorithm and accounts for variations in the sample mean.
19. A computer system for tracking the location of an object within two or more digital images of a set of digital images, the system comprising:
means for receiving a first image vector representing a first image within the set of digital images;
means for determining the location of the object from said first image vector;
means for applying a dynamic model to said first image vector to determine a possible motion of the object between said first image vector and a successive image vector representing a second image within the set of digital images;
means for applying an observation model to said first image vector to determine a most likely location of the object within said successive image vector from a set of possible locations of the object within said successive image vector;
means for applying an inference model to said dynamic model and to said observation model to predict said most likely location of the object; and
means for updating an Eigenbasis representing an image space of the two or more digital images.

20. The system of claim 19, further comprising means for constructing an initial Eigenbasis representing said image space of the two or more digital images.
21. An image processing computer system for tracking the location of an object within a set of digital images, comprising:
an input module for receiving data representative of the set of digital images;
a memory device coupled to said input module for storing said data representative of the set of digital images;
a processor coupled to said memory device for iteratively retrieving data representative of two or more digital images of the set of digital images, said processor configured to:
apply a dynamic model to a first digital image of said two or more digital images to determine a possible motion of the object between said first digital image of said two or more digital images and a successive digital image of said two or more digital images;
apply an observation model to said first digital image to determine a most likely location of the object within said successive digital image from a set of possible locations of the object within said successive digital image;
apply an inference model to said dynamic model and to said observation model to predict said most likely location of the object within said successive digital image; and
update an Eigenbasis representing an image space of said two or more digital images.
PCT/US2004/038189 2003-11-13 2004-11-15 Adaptive probabilistic visual tracking with incremental subspace update WO2005050543A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP04811059A EP1704509A4 (en) 2003-11-13 2004-11-15 Adaptive probabilistic visual tracking with incremental subspace update
JP2006539984A JP4509119B2 (en) 2003-11-13 2004-11-15 Adaptive stochastic image tracking with sequential subspace updates

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US52000503P 2003-11-13 2003-11-13
US60/520,005 2003-11-13

Publications (1)

Publication Number Publication Date
WO2005050543A1 true WO2005050543A1 (en) 2005-06-02

Family

ID=34619417

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/038189 WO2005050543A1 (en) 2003-11-13 2004-11-15 Adaptive probabilistic visual tracking with incremental subspace update

Country Status (4)

Country Link
US (1) US7463754B2 (en)
EP (1) EP1704509A4 (en)
JP (1) JP4509119B2 (en)
WO (1) WO2005050543A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8548195B2 (en) 2007-06-14 2013-10-01 Omron Corporation Tracking method and device adopting a series of observation models with different life spans
US8649556B2 (en) 2008-12-30 2014-02-11 Canon Kabushiki Kaisha Multi-modal object signature

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007044044A2 (en) * 2004-12-21 2007-04-19 Sarnoff Corporation Method and apparatus for tracking objects over a wide area using a network of stereo sensors
US8189905B2 (en) 2007-07-11 2012-05-29 Behavioral Recognition Systems, Inc. Cognitive model for a machine-learning engine in a video analysis system
FR2944629B1 (en) * 2009-04-17 2017-01-20 Univ De Tech De Troyes SYSTEM AND METHOD FOR TARGET LOCATION BY A CAMERA NETWORK
EP2345998B1 (en) * 2009-12-01 2019-11-20 Honda Research Institute Europe GmbH Multi-object tracking with a knowledge-based, autonomous adaptation of the tracking modeling level
JP5091976B2 (en) * 2010-04-09 2012-12-05 株式会社東芝 Video display device, video display method, and video display program
US9268996B1 (en) 2011-01-20 2016-02-23 Verint Systems Inc. Evaluation of models generated from objects in video
JP5964108B2 (en) * 2012-03-30 2016-08-03 株式会社メガチップス Object detection device
US20130279804A1 (en) * 2012-04-23 2013-10-24 Daniel Kilbank Dual transform lossy and lossless compression
US9659235B2 (en) 2012-06-20 2017-05-23 Microsoft Technology Licensing, Llc Low-dimensional structure from high-dimensional data
CN104216894B (en) * 2013-05-31 2017-07-14 国际商业机器公司 Method and system for data query
US9084411B1 (en) * 2014-04-10 2015-07-21 Animal Biotech Llc Livestock identification system and method
US9697614B2 (en) 2014-12-08 2017-07-04 Mitsubishi Electric Research Laboratories, Inc. Method for segmenting and tracking content in videos using low-dimensional subspaces and sparse vectors
US10440350B2 (en) 2015-03-03 2019-10-08 Ditto Technologies, Inc. Constructing a user's face model using particle filters
US10515390B2 (en) 2016-11-21 2019-12-24 Nio Usa, Inc. Method and system for data optimization

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226388B1 (en) * 1999-01-05 2001-05-01 Sharp Labs Of America, Inc. Method and apparatus for object tracking for automatic controls in video devices
US6236736B1 (en) * 1997-02-07 2001-05-22 Ncr Corporation Method and apparatus for detecting movement patterns at a self-service checkout terminal
US6295367B1 (en) * 1997-06-19 2001-09-25 Emtera Corporation System and method for tracking movement of objects in a scene using correspondence graphs
US6539288B2 (en) * 2000-05-24 2003-03-25 Matsushita Electric Industrial Co., Ltd. Vehicle rendering device for generating image for drive assistance

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5416899A (en) 1992-01-13 1995-05-16 Massachusetts Institute Of Technology Memory based method and apparatus for computer graphics
US5680531A (en) 1993-07-02 1997-10-21 Apple Computer, Inc. Animation system which employs scattered data interpolation and discontinuities for limiting interpolation ranges
DE69533870T2 (en) * 1994-10-19 2005-05-25 Matsushita Electric Industrial Co., Ltd., Kadoma Device for image decoding
US5960097A (en) * 1997-01-21 1999-09-28 Raytheon Company Background adaptive target detection and tracking with multiple observation and processing stages
US6047078A (en) * 1997-10-03 2000-04-04 Digital Equipment Corporation Method for extracting a three-dimensional model using appearance-based constrained structure from motion
US6363173B1 (en) * 1997-12-19 2002-03-26 Carnegie Mellon University Incremental recognition of a three dimensional object
US6400831B2 (en) * 1998-04-02 2002-06-04 Microsoft Corporation Semantic video object segmentation and tracking
US6346124B1 (en) 1998-08-25 2002-02-12 University Of Florida Autonomous boundary detection system for echocardiographic images
US6757423B1 (en) * 1999-02-19 2004-06-29 Barnes-Jewish Hospital Methods of processing tagged MRI data indicative of tissue motion including 4-D LV tissue tracking
AU3002500A (en) 1999-02-19 2000-09-04 Barnes-Jewish Hospital Methods of processing tagged mri data indicative of tissue motion including 4-d lv tissue tracking
TW413795B (en) * 1999-02-26 2000-12-01 Cyberlink Corp An image processing method of 3-D head motion with three face feature points
US7003134B1 (en) * 1999-03-08 2006-02-21 Vulcan Patents Llc Three dimensional object pose estimation which employs dense depth information
US6337927B1 (en) * 1999-06-04 2002-01-08 Hewlett-Packard Company Approximated invariant method for pattern detection
US6683968B1 (en) * 1999-09-16 2004-01-27 Hewlett-Packard Development Company, L.P. Method for visual tracking using switching linear dynamic system models
US6870945B2 (en) * 2001-06-04 2005-03-22 University Of Washington Video object tracking by estimating and subtracting background
US6999600B2 (en) * 2003-01-30 2006-02-14 Objectvideo, Inc. Video scene background maintenance using change detection and classification
US7558402B2 (en) * 2003-03-07 2009-07-07 Siemens Medical Solutions Usa, Inc. System and method for tracking a global shape of an object in motion
JP2005099903A (en) * 2003-09-22 2005-04-14 Kddi Corp Update device for authenticating dictionary in biometrics authentication

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6236736B1 (en) * 1997-02-07 2001-05-22 Ncr Corporation Method and apparatus for detecting movement patterns at a self-service checkout terminal
US6295367B1 (en) * 1997-06-19 2001-09-25 Emtera Corporation System and method for tracking movement of objects in a scene using correspondence graphs
US6226388B1 (en) * 1999-01-05 2001-05-01 Sharp Labs Of America, Inc. Method and apparatus for object tracking for automatic controls in video devices
US6539288B2 (en) * 2000-05-24 2003-03-25 Matsushita Electric Industrial Co., Ltd. Vehicle rendering device for generating image for drive assistance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1704509A4 *


Also Published As

Publication number Publication date
EP1704509A1 (en) 2006-09-27
US20050175219A1 (en) 2005-08-11
EP1704509A4 (en) 2010-08-25
US7463754B2 (en) 2008-12-09
JP2007513408A (en) 2007-05-24
JP4509119B2 (en) 2010-07-21

Similar Documents

Publication Publication Date Title
Dornaika et al. On appearance based face and facial action tracking
Ho et al. Visual tracking using learned linear subspaces
US7650011B2 (en) Visual tracking using incremental fisher discriminant analysis
US7212665B2 (en) Human pose estimation with data driven belief propagation
US9269012B2 (en) Multi-tracker object tracking
US7463754B2 (en) Adaptive probabilistic visual tracking with incremental subspace update
Hu et al. Incremental tensor subspace learning and its applications to foreground segmentation and tracking
Yang et al. Efficient mean-shift tracking via a new similarity measure
Jeyakar et al. Robust object tracking with background-weighted local kernels
US7376246B2 (en) Subspace projection based non-rigid object tracking with particle filters
Porikli et al. Object detection and tracking
Zhang et al. Graph-embedding-based learning for robust object tracking
US20060285770A1 (en) Direct method for modeling non-rigid motion with thin plate spline transformation
Grenander et al. Asymptotic performance analysis of Bayesian target recognition
Tu et al. Accurate head pose tracking in low resolution video
Gai et al. Studentized dynamical system for robust object tracking
Mei et al. Probabilistic visual tracking via robust template matching and incremental subspace update
Yow Automatic human face detection and localization
Sun et al. Panoramic capturing and recognition of human activity
Davoine et al. Head and facial animation tracking using appearance-adaptive models and particle filters
Rätsch et al. Efficient object tracking by condentional and cascaded image sensing
Prince et al. Statistical cue integration for foveated wide-field surveillance
Gupta et al. CONDENSATION-Based Predictive EigenTracking.
Rätsch et al. Coarse-to-fine particle filters for multi-object human computer interaction
Abdallah Investigation of new techniques for face detection

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2004811059

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2006539984

Country of ref document: JP

WWP Wipo information: published in national office

Ref document number: 2004811059

Country of ref document: EP