AU2002318859A1 - A Method for Video Object Detection and Tracking Based on 3D Segment Displacement - Google Patents

A Method for Video Object Detection and Tracking Based on 3D Segment Displacement

Info

Publication number
AU2002318859A1
Authority
AU
Australia
Prior art keywords
segments
segment
significant
intersection
view plane
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
AU2002318859A
Other versions
AU2002318859B2 (en)
Inventor
Brian John Parker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AUPR9632A external-priority patent/AUPR963201A0/en
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2002318859A priority Critical patent/AU2002318859B2/en
Publication of AU2002318859A1 publication Critical patent/AU2002318859A1/en
Application granted granted Critical
Publication of AU2002318859B2 publication Critical patent/AU2002318859B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Landscapes

  • Image Analysis (AREA)

Description

S&F Ref: 617607
AUSTRALIA
PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT
ORIGINAL
Name and Address of Applicant: Canon Kabushiki Kaisha, 30-2, Shimomaruko 3-chome, Ohta-ku, Tokyo 146, Japan
Actual Inventor(s): Brian John Parker
Address for Service: Spruson & Ferguson, St Martins Tower, Level 31 Market Street, Sydney NSW 2000 (CCN 3710000177)
Invention Title: A Method for Video Object Detection and Tracking Based on 3D Segment Displacement
Associated Provisional Application Details: [33] Country: AU; [31] Applic. No(s): PR9632; [32] Application Date: 19 Dec 2001
The following statement is a full description of this invention, including the best method of performing it known to me/us:

A METHOD FOR VIDEO OBJECT DETECTION AND TRACKING BASED ON 3D SEGMENT DISPLACEMENT

Technical Field of the Invention
The present invention relates generally to video processing and, in particular, to the detection and tracking of objects in a video sequence.
Background
An important initial step in deriving an understanding of a video sequence is to first extract any semantic video objects (SVO's) that correspond to significant real-world objects, and, secondarily, to track those objects across time. The advent of the MPEG-4 standard for video object plane (VOP)-based encoding was an important driver for this research, but there are many other applications. Practically any video content-analysis task is assisted by the ability to detect objects within a video frame and track them over time (i.e. across a sequence of such frames). An example is content summarisation, in which an object-based description of the video content is desirable for indexing, browsing, and searching functionalities. Another example is active camera control, in which the parameters of a camera may be altered to optimise the filming of a certain detected object.
Another example is object removal or other object-based image enhancements applied to an entire video clip or a single frame of an image sequence.
The simplest definition of an SVO is a contiguous segment of pixels with a similar extraction cue. Hence an SVO is a homogeneous segment. Two important low-level cues that can be used as extraction cues to extract these SVO's are colour and motion.
The advantage of colour as an SVO extraction cue is the generally clear distinction between pixels of one colour and those of another. This leads to sharply localised segments of homogeneous colour, but these segments by themselves are insufficient to define entire significant SVO's, as such SVO's typically are multi-coloured and therefore consist of several such segments. Hence other extraction cues are required, separately or in addition to colour, to extract SVO's.
Motion, as an SVO extraction cue, can separate out areas of video that are moving with a similar velocity and hence that are likely to represent significant SVO's.
However, the limitations of using motion alone as an SVO extraction cue are that: (i) motion tends to be an ephemeral cue, that is, when an object stops moving it can no longer be separated from a background; and (ii) motion estimation algorithms tend to produce results that are poorly demarcated spatially and are otherwise inaccurate. Such algorithms are typically pixel based and rely on detecting the local motion of features of objects, especially sharp edges, and so tend to give diffuse results at the boundaries of those objects.
Existing art typically calculates relative motion between consecutive frames, using such techniques as frame-differencing, optical flow, etc., and so tends to detect small local motions of an object in addition to significant global motion. However, for most applications the semantically significant objects in a video tend to be only those that have a significant absolute movement over some background and not those that are essentially fixed and only exhibit local movements.
Summary of the Invention
It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
According to a first aspect of the invention, there is provided a method of detecting and tracking semantic video objects across a sequence of video frames, said method comprising the steps of: forming a 3D pixel data block from said sequence of video frames; segmenting said 3D data block on the basis of colour into a set of 3D segments using 3D spatiotemporal segmentation; estimating an absolute motion of each 3D segment by calculating a vector difference between a 2D centroid of an intersection of said 3D segment with a view plane and a 2D centroid of an intersection of said 3D segment with said view plane in a previous frame interval; classifying each 3D segment, wherein 3D segments having absolute motion exceeding, in any direction, a calculated bound of said intersection of said 3D segment with a view plane in that direction, are classified as significant; and grouping adjacent significant 3D segments into semantic video objects.
Other aspects of the invention are also disclosed.
Brief Description of the Drawings
One or more embodiments of the present invention will now be described with reference to the drawings and appendix, in which:
Fig. 1 shows a flow chart of the main processing steps of a method of detecting and tracking objects across a sequence of video frames;
Fig. 2 shows a graphical representation of a data structure that may be used to store segments;
Fig. 3 is a schematic block diagram representation of a programmable device in which arrangements described may be implemented;
Fig. 4 illustrates a sequence of video frames, a 3D segmentor sliding window and view plane;
Figs. 5A and 5B show the 2D intersections of two 3D segments in two separate frame intervals;
Figs. 6A and 6B show the 2D intersections of three 3D segments in two separate frame intervals;
Fig. 7 shows how 3D segments extending across the view plane allow tracking of an object across time; and
Fig. 8 is a schematic block diagram representing the segmentation steps used in the method of video tracking shown in Fig. 1.
Detailed Description including Best Mode
Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and symbolic representations of operations on data within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities.
Apparatus
Fig. 3 shows a programmable device 700 for performing the operations of the method described below. Such a programmable device 700 may be specially constructed for the required purposes, or may comprise a general-purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms presented herein are not inherently related to any particular computer or other apparatus.
The programmable device 700 comprises a computer module 701, input devices such as a keyboard 702 and mouse 703, and output devices including a display device 714. A Modulator-Demodulator (Modem) transceiver device 716 is used by the computer module 701 for communicating to and from a communications network 720, for example connectable via a telephone line 721 or other functional medium. The modem 716 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN).
The computer module 701 typically includes at least one processor unit 705, a memory unit 706, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output interfaces including a video interface 707, and an I/O interface 713 for the keyboard 702 and mouse 703, and an interface 708 for the modem 716 and a camera 750 through connection 748. A storage device 709 is provided and typically includes a hard disk drive 710 and a floppy disk drive 711. A CD-ROM drive 712 is typically provided as a non-volatile source of data.
The components 705 to 713 of the computer module 701, typically communicate via an interconnected bus 704 and in a manner which results in a conventional mode of operation of the programmable device 700 known to those in the relevant art.
In another implementation the computer module 701 is located inside the camera 750.
The method may be implemented as software, such as an application program executing within the programmable device 700. The application program may be stored on a computer readable medium, including the storage devices 709. The application program is loaded into the computer from the computer readable medium, and then executed by the processor 705. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the programmable device 700 preferably effects an advantageous apparatus for detecting and tracking objects across a sequence of video frames.
Typically, the application program of the preferred embodiment is resident on the hard disk drive 710 and read and controlled in its execution by the processor 705.
Intermediate storage of the program and any data fetched from the network 720 and camera 750 may be accomplished using the semiconductor memory 706, possibly in concert with the hard disk drive 710. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 712 or 711, or alternatively may be read by the user from the network 720 via the modem device 716. The foregoing is merely exemplary of relevant computer readable mediums. Other computer readable media may be practiced without departing from the scope and spirit of the invention.
The programmable device 700 may be constructed from one or more integrated circuits performing the functions or sub-functions and, for example, incorporated in the digital video camera 750. As seen, the camera 750 includes a display screen 752 which can be used to display objects and information regarding the same. In this fashion, a user of the camera may record a video sequence and, using the processing methods described below, create metadata that may be associated with the video sequence to conveniently describe the video sequence, thereby permitting the video sequence to be used or otherwise manipulated without a specific need for a user to view the video sequence.
Method
When attempting to extract a significant object from a video sequence, a fundamental step is to define the meaning of the significant object to be extracted. This meaning needs to be defined in terms of the available low level cues that will be used. The definition used herein is that a significant object is one that has a significant absolute translational motion against a background relative to the object's dimensions, over some time period. This definition differs from others that have been used which, for example, consider areas of the image that are moving with the same instantaneous motion to be part of the same significant object, or that consider areas with a large instantaneous motion to be significant. Although the definition of "significant object" depends somewhat on the application to which results of the object tracking will be applied, the definition used here corresponds well with an intuitive notion that an independent object is one that is spatially circumscribed and that can be translated to an arbitrary new position.
Fig. 1 shows a flow chart of a method 100 for detecting and tracking objects across a sequence of video frames. The steps of method 100 are effected by instructions in the application program that are executed by the processor 705 of the programmable device 700 (Fig. 3). In the description that follows, the camera 750 is assumed to be fixed. Generalising the method 100 to a moving camera is described in a later section.
Method 100 starts in step 110 where pixel data of a next frame in a video sequence is received as input to the computer module 701 of the programmable device 700. The pixel data at each frame interval t comprises a two dimensional array of data, typically in some colour space such as RGB or LUV. Optionally, the luminance channel of the video data in the LUV colour space may be down-weighted. The video data may be received directly from camera 750, or may be from a recorded video sequence stored on the storage device 709 or 712.
Step 120 follows wherein the processor 705 performs spatiotemporal (3D) segmentation of the video data based on colour, with the video data including the frame data of the N most recently received frames. Fig. 4 illustrates a sequence of video frames, with those frames illustrated in phantom being future frames not yet received by the programmable device 700. A window 400 includes the N most recently received frames, forming a "block" of pixel data. The block of data also includes a view plane 410, which, in the preferred implementation, is coplanar with the oldest frame in the current window 400. The view plane 410 may generally be any plane in the window 400. As a new frame is received in step 110, the new frame is added to the window 400, while the oldest frame in the window 400 is removed from the window 400, thereby maintaining N frames in the window 400. The segmentation step 120 is performed as each new frame is received in step 110 to form a set of three-dimensional connected segments so that every pixel in the block is related to one segment Si in which all pixels belonging to the same segment Si have homogeneous colour values. The preferred 3D segmentation is described in a later section.
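The sliding-window arrangement of steps 110 and 120 can be pictured with a short sketch. The following Python fragment is illustrative only: the window length N, the frame shape and the choice of the oldest frame as the view plane are assumptions consistent with the description above, not code taken from the patent.

```python
from collections import deque
import numpy as np

N = 8  # assumed window length; the patent leaves N unspecified


class FrameWindow:
    """Sliding window of the N most recent frames forming a 3D pixel block."""

    def __init__(self, n_frames=N):
        self.frames = deque(maxlen=n_frames)  # oldest frame is dropped automatically

    def push(self, frame):
        """Add the newest frame (H x W x C array) to the window 400."""
        self.frames.append(np.asarray(frame))

    def block(self):
        """Return the 3D data block (T x H x W x C) handed to the 3D segmentor."""
        return np.stack(self.frames, axis=0)

    def view_plane_frame(self):
        """Frame coplanar with the view plane (here: the oldest frame in the window)."""
        return self.frames[0]


# usage sketch: feed synthetic frames and read back the block
window = FrameWindow()
for t in range(12):
    window.push(np.full((4, 4, 3), t, dtype=np.uint8))
print(window.block().shape, window.view_plane_frame()[0, 0, 0])  # (8, 4, 4, 3) 4
```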
The output of the 3D segmentation step 120 at each frame interval is an updated graph G, where vertices represent the 3D segments Si, and edges represent boundaries between the 3D segments Si. Fig. 2 shows a graphical representation of a data structure 900 that may be used to store the graph G. The data structure 900 is essentially an adjacency list representation of the segmentation. The data structure 900 includes a segment list 910 and a boundary list 920 for each segment Si in the segment list 910. The segment list 910 contains, for each segment Si, a segment identifier i and the coordinates of all pixels in that segment Si.
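A minimal sketch of data structure 900 is given below, assuming Python as the implementation language. The class and field names (Segment, boundaries, first_centroid, and so on) are illustrative choices rather than identifiers from the patent, and the persistent fields anticipate the per-segment data described in the following paragraphs.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Vertex of graph G: one 3D segment Si with its persistent per-segment data."""
    seg_id: int
    pixels: list = field(default_factory=list)      # (x, y, t) coordinates of member voxels
    boundaries: dict = field(default_factory=dict)  # neighbour seg_id -> shared boundary area
    first_centroid: tuple = None                    # centroid at first view-plane intersection
    bounds_est: tuple = (0.0, 0.0)                  # estimated unoccluded (width, height)
    significant: bool = False                       # "sticky" classification flag
    object_id: int = None                           # SVO label, persistent across frames


class SegmentGraph:
    """Adjacency-list representation of the segmentation (data structure 900)."""

    def __init__(self):
        self.segments = {}                          # segment list 910: seg_id -> Segment

    def add_boundary(self, i, j, area):
        # boundary list 920: record the shared boundary area symmetrically
        self.segments[i].boundaries[j] = self.segments[i].boundaries.get(j, 0.0) + area
        self.segments[j].boundaries[i] = self.segments[j].boundaries.get(i, 0.0) + area


# usage sketch
g = SegmentGraph()
g.segments[1] = Segment(seg_id=1)
g.segments[2] = Segment(seg_id=2)
g.add_boundary(1, 2, area=12.5)
```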
Graph G is a persistent data structure across video frames. That is, graph G is dynamically updated during the segmentation step 120 and incorporates the new frame data received in step 110 into the existing graph G. Data associated with each 3D segment Si persists between frames, and additional data stored with the vertices and edges of the 3D segment Si in the graph G will also be maintained between frames.
Step 130 follows where the processor 705 computes a per-segment absolute motion estimate using only the intrinsic information of the segments Si from the 3D segmentation step 120 themselves. By using motion estimation derived from the segments Si themselves, method 100 avoids the need to compute a motion field estimate using a separate motion estimation algorithm, and so avoids the problems of poor demarcation of same.
In particular, step 130 computes an estimate of the motion of the 2D intersection of each 3D segment Si with the view plane 410 (Fig. 4). The estimation is performed over a number of frames as the 3D segments move through the window and intersect with the view plane 410. In the following description, the absolute motion of each segment Si is estimated over the entire life-span of that 3D segment Si, that is, from when the 3D segment Si first intersects the view plane to the current frame interval under consideration.
Thus, upon a new 3D segment Si intersecting the view plane 410, a 2D centroid (x̄_i, ȳ_i) of the 2D intersection is saved in the persistent data structure 900 representing that 3D segment, that is, graph G.
The centroid (x̄_i, ȳ_i) of the 2D intersection of a 3D segment Si is calculated as:

x̄_i = m_10 / m_00,    (1)

ȳ_i = m_01 / m_00,    (2)

m_00 = Σ_x Σ_y f_i(x, y),    (3)

m_10 = Σ_x Σ_y x f_i(x, y),    (4)

m_01 = Σ_x Σ_y y f_i(x, y),    (5)

where f_i(x, y) is the binary 2D image formed by the intersection of the 3D segment Si with the 2D viewing plane 410, wherein pixels (x, y) forming part of segment Si have a value of 1 and those pixels (x, y) not forming part of segment Si have a value of 0.
In later frame intervals, for each existing 3D segment Si, the absolute motion (Δx̄_i, Δȳ_i) is estimated by the processor 705 calculating the vector difference between a current centroid (x̄_i, ȳ_i), that is, the centroid of the current 2D intersection of the 3D segment Si under consideration with the 2D view plane 410, and the centroid when the 3D segment Si under consideration first intersected the view plane 410.
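As a concrete illustration of Equations (1) to (5), the following sketch computes the centroid of a binary view-plane intersection using NumPy. It is a hedged example only; the function name and the use of NumPy are assumptions of this sketch.

```python
import numpy as np


def centroid_of_intersection(mask):
    """Centroid (x_bar, y_bar) of a binary view-plane intersection f_i(x, y)
    via the image moments of Equations (1)-(5); mask[y, x] is 1 inside the segment."""
    ys, xs = np.nonzero(mask)
    m00 = xs.size                 # Equation (3)
    if m00 == 0:
        return None               # segment does not intersect the view plane
    m10 = xs.sum()                # Equation (4)
    m01 = ys.sum()                # Equation (5)
    return m10 / m00, m01 / m00   # Equations (1) and (2)


# usage sketch: a 3x4 blob whose centroid is easy to verify by hand
mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:5, 3:7] = 1
print(centroid_of_intersection(mask))   # (4.5, 3.0)
```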
Unfortunately, the absolute motion may give a spurious estimate of the absolute 2D motion of a segment in two cases, namely: (i) an occluded segment Si will have a centroid shift induced by the movement of an overlying segment; and (ii) fine intensity changes cause boundary changes of the segment Si, which cause a jitter in the position of the calculated centroid (x̄_i, ȳ_i). Equivalently, the shape of the segment can change due to a change in the shape of the original subject.
Considering case (i), which is known as the occlusion problem, reference is made to Figs. 5A and 5B, where the 2D intersections of two 3D segments S1 and S2 are shown in two separate frame intervals, with segment S2 partially occluding segment S1. In the illustrated example, segment S1 is stationary, while segment S2 is moving, exposing more of segment S1 as it moves. Due to the actual movement of segment S2, segment S2 has an absolute motion Δx̄_2 in the x-direction. However, even though segment S1 is stationary, it will also have a non-zero absolute motion Δx̄_1 in the x-direction, because the centroid (x̄_1, ȳ_1) of segment S1 moved in the x-direction as more of segment S1 is revealed. This is an example of false motion.
Let the set {x_min, y_min, x_max, y_max} define an initial bounding box of an occluded segment. In the illustration, only the x-direction bounds x_min and x_max of segment S1 are indicated.
For any segment Si, the x-component of the centroid x̄_i will fall between the x-direction bounds x_min and x_max of that segment Si:

x_min ≤ x̄_i ≤ x_max,    (6)

and an x-component of the centroid motion due to occlusion, Δx̄_occlusion, has the following restriction:

Δx̄_occlusion ≤ (x_max − x_min) = Δx_bound.    (7)

Similarly, a y-component of the centroid motion due to occlusion, Δȳ_occlusion, has the restriction:

Δȳ_occlusion ≤ (y_max − y_min) = Δy_bound.    (8)

Therefore, if the actual bounds of the unoccluded 2D intersection of each 3D segment Si, defined by Δx_bound and Δy_bound, are known, then the true motion (Δx̄_i, Δȳ_i) could be robustly distinguished from false motion due to occlusion by setting the threshold for significant segment movement in the x and y directions to be greater than the actual bounds Δx_bound and Δy_bound respectively. Thus a centroid shift due to occlusion is limited to being smaller than the segment's true width and height, as defined in Equations (7) and (8). In general, however, the actual bounds Δx_bound and Δy_bound of the full, unoccluded segment's 2D intersection are not known, as one or more segments Si may first appear already occluded by another segment. However, estimates Δx̂_bound and Δŷ_bound of the actual bounds may be calculated, and these estimates may be updated over time, as occluding segments reveal more of the underlying occluded segment(s), thereby improving the bounds estimates Δx̂_bound and Δŷ_bound.
Referring again to Fig. 1, the processor 705 excludes spurious motion estimates, for example caused by occlusion of segments Si, in step 140. When a given segment Si first intersects with the view plane 410, the initial bounds estimates Δx̂_bound and Δŷ_bound are set to the width and height of the bounding box of the segment's 2D intersection with the view plane 410. This information is also stored with the persistent data structure 900 (Fig. 2).
In each subsequent frame interval, the bounds estimates Δx̂_bound and Δŷ_bound are updated by determining the maximum of the previous bounds estimates and the bounds of the segment's 2D intersection for the current frame interval, Δx_bound^current and Δy_bound^current:

Δx̂_bound^new = max(Δx̂_bound^old, Δx_bound^current),    (9)

Δŷ_bound^new = max(Δŷ_bound^old, Δy_bound^current).    (10)

Thus, the updated bounds estimates Δx̂_bound and Δŷ_bound give the best estimate of the segment's true (unoccluded) extent based on the information up to that point in time.
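The bounds update of Equations (9) and (10) is simple to express in code. The sketch below is illustrative only; the helper names and the use of NumPy are assumptions, not part of the patent.

```python
import numpy as np


def bounding_box_extent(mask):
    """Width and height (dx_bound, dy_bound) of the current 2D intersection."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return 0.0, 0.0
    return float(xs.max() - xs.min()), float(ys.max() - ys.min())


def update_bounds_estimate(prev_est, mask):
    """Equations (9) and (10): keep the largest extent seen so far as the best
    estimate of the segment's true, unoccluded width and height."""
    dx_cur, dy_cur = bounding_box_extent(mask)
    return max(prev_est[0], dx_cur), max(prev_est[1], dy_cur)


# usage sketch: the estimate grows as an occluding segment reveals more of this one
est = (0.0, 0.0)
frame1 = np.zeros((8, 8), int); frame1[2:4, 2:4] = 1      # partly occluded intersection
frame2 = np.zeros((8, 8), int); frame2[2:6, 2:6] = 1      # more of the segment revealed
for mask in (frame1, frame2):
    est = update_bounds_estimate(est, mask)
print(est)   # (3.0, 3.0)
```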
Referring again to Figs. 5A and 5B, a false motion like Δx̄_1 will always be less than Δx_bound (Equation (7)), whereas a true motion like Δx̄_2 may be greater than the corresponding bound of segment S2.
Referring now to Figs. 6A and 6B, where the 2D intersections of three 3D segments S3, S4 and S5 are shown in two separate frame intervals, segment S4 is partially occluded by both segments S3 and S5. In this example, the occluded segment S4 is again stationary, while segments S3 and S5 are both moving in the x-direction, resulting in true motions in the x-direction Δx̄_3 and Δx̄_5 respectively. As segment S3 moves, it occludes more of segment S4 while, at the same time, segment S5 is occluding less of segment S4. This change in occlusion causes a change in the centroid (x̄_4, ȳ_4) of segment S4, which in turn causes segment S4 to have an absolute motion Δx̄_4 in the x-direction.
However, in this case the updated bounds estimates Δx̂_bound and Δŷ_bound remain unchanged, and do not give improved estimates of the segment's true (unoccluded) extent. It is therefore noted that, in such pathological situations, the updated bounds estimates Δx̂_bound and Δŷ_bound may never closely approximate the segment's true unoccluded bounds Δx_bound and Δy_bound. This is a manifestation of the so-called aperture problem, whereby, for example, an occlusion is removed from one side of a segment Si whilst another occlusion occurs on the opposite side, so that the updated bounds estimates Δx̂_bound and Δŷ_bound do not change, as is the case described with reference to Figs. 6A and 6B. In extreme cases this may lead to a false motion, such as the false motion Δx̄_4 of segment S4, exceeding the threshold used for the classification described below, hence giving a false positive classification as true motion.
This rare case cannot, in principle, be distinguished using motion cues alone, and other cues, such as range, or other post-processing steps outside the scope of this description, are required to deal with such cases.
The spurious estimate of absolute 2D motion in the case of fine intensity changes which cause boundary changes of the segment Si (case (ii) above) typically leads to smaller centroid shifts than case (i), and the false motions induced by it are subsumed by the tests for case (i).
Referring again to Fig. 1, method 100 continues to step 150 where the processor 705 classifies certain segments Si as significant segments. In particular, from the above discussion, if a segment Si satisfies the following test:

Δx̄_i > Δx̂_bound  OR  Δȳ_i > Δŷ_bound,    (11)

then that segment Si is identified as being likely to represent a portion of an object with significant movement.
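A hedged sketch of the significance test of Equation (11) follows. Comparing the magnitude of each displacement component against the estimated bound, and the "sticky" flag, are interpretations of the description here and in the next paragraph, not code from the patent.

```python
def is_significant(first_centroid, current_centroid, bounds_est, already_significant=False):
    """Equation (11): a segment is significant once its centroid displacement, in
    either direction, exceeds the estimated unoccluded bound in that direction.
    The classification is treated as sticky: once significant, always significant."""
    if already_significant:
        return True
    dx = abs(current_centroid[0] - first_centroid[0])   # magnitude of the x displacement
    dy = abs(current_centroid[1] - first_centroid[1])   # magnitude of the y displacement
    dx_bound, dy_bound = bounds_est
    return dx > dx_bound or dy > dy_bound


# usage sketch: a segment about 3 pixels wide that has drifted 5 pixels in x is significant
print(is_significant((10.0, 10.0), (15.0, 10.5), (3.0, 2.0)))   # True
```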
Motion is an ephemeral cue, that is, the estimated motion (Δx̄_i, Δȳ_i) can go back below the threshold of Equation (11) if the centroid (x̄_i, ȳ_i) of that segment Si moves back to its original position. Therefore, the classification as a significant segment above is treated as sticky, that is, once a segment Si has been classified as significant, that segment Si stays classified as significant for the rest of its life-span. This is implemented by storing the classification result with the persistent segment data in the graph G.
Having classified the significant segments that are likely to be constituents of significant semantic video objects in step 150, step 160 follows where the significant segments are grouped by the processor 705 into separate SVO's that are then labelled with some object identifier that is persistent across time.
A key point here is that the 3D segments typically have shorter life-spans than the SVO that they are part of. This may, for example, be due to lighting changes and object movement over time. Hence, at any particular moment in time the intersection of a particular SVO with the 2D view plane 410 (Fig. 4) will typically consist of multiple 3D colour segments Si, only some of which may have been previously labelled with the object identifier of the particular SVO, with the other segments Si representing newly created 3D segments Si and/or segments Si that have been newly classified as significant due to movement, but which have yet to be labelled with an object identifier. Labelled segments Si adjacent to those unlabelled significant segments assist in ensuring consistent labelling of objects across time.
To implement step 160, an algorithm based on a graph connected-component algorithm, known per se in the art, is used. A connected component is defined to be all adjacent segments with the same classification as assigned in step 150. In the preferred embodiment, a breadth-first connected-component search is used.
In the following, it is assumed that the graph G' is a subset of the graph G consisting only of segments and boundaries that have a maximum time bound greater than or equal to that of the view plane (i.e. they intersect the view plane), and so the graph G' contains only adjacency information of relevance to the current frame interval. Also, it is assumed that segments Si produced by the 3D segmentor in step 120 are non-branching.
As the connected-component algorithm of step 160 traverses and identifies each unlabelled segment Si, the unlabelled segment Si is assigned an object identifier. There are two cases to consider. In the first case, the unlabelled segment Si has an adjacent segment Sj that has been previously labelled with an object identifier. If there is only one distinct object identifier involved, then the unlabelled segment Si is labelled with the object identifier of the adjacent segment Sj. This ensures that a consistent object identifier is maintained and tracked across video frames. In the case where the unlabelled segment Si has adjacent segments Sj with different object identifiers (this is a manifestation of the merging sub-problem, which can occur if, for example, two previously separated subjects of the same colour move together), then some majority voting or other scheme can be used to select the most representative and likely object identifier for the unlabelled segment Si, and that segment Si is labelled accordingly. In the preferred embodiment, the minimum merging cost of the segment over the adjacent segments in each of the said previously labelled objects is used to determine the most likely object identifier. Additionally, the angle of the segment displacement vectors can be used to distinguish objects that have merged from different directions. Alternatively, depending on the application, a special object identifier can be used that represents an object that is the merger of two or more objects.
The other case to consider is when the unlabelled segment Si has no adjacent segments Sj that have previously been labelled with an object identifier. This case represents a new object entering the view plane 410. In this case such an unlabelled segment Si is labelled with a new unique object identifier.
For example, in Fig. 7 segments SA and SB have been previously assigned some object identifier I in view plane V1. At a later frame interval, and in view plane V2, segment SC is appearing and has not yet been assigned an object identifier. However, because it is adjacent to segments SA and SB, applying the above graph connected-component algorithm, segment SC is assigned the same object identifier I, as it is very likely that segment SC is also a part of the same object as segments SA and SB.
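The grouping and labelling of steps 160 can be sketched as a breadth-first search over the significant subgraph, as below. This is an illustrative simplification: a plain majority vote stands in for the patent's minimum-merging-cost tie-break, and all names are hypothetical.

```python
from collections import deque


def group_significant_segments(adjacency, significant, object_ids, next_new_id=1):
    """Breadth-first grouping sketch for step 160.
    adjacency: seg_id -> set of seg_ids adjacent in the current view plane (graph G');
    significant: seg_id -> classification result from step 150;
    object_ids: seg_id -> existing object identifier or None (persistent labels)."""
    visited = set()
    for seed in adjacency:
        if seed in visited or not significant.get(seed, False):
            continue
        component, queue = [], deque([seed])
        visited.add(seed)
        while queue:                                   # collect one connected component
            s = queue.popleft()
            component.append(s)
            for nbr in adjacency[s]:
                if nbr not in visited and significant.get(nbr, False):
                    visited.add(nbr)
                    queue.append(nbr)
        labels = [object_ids[s] for s in component if object_ids.get(s) is not None]
        if labels:
            obj = max(set(labels), key=labels.count)   # inherit an existing label
        else:
            obj, next_new_id = next_new_id, next_new_id + 1   # new object entering the view
        for s in component:
            object_ids[s] = obj
    return object_ids


# usage sketch mirroring Fig. 7: SC is adjacent to the already-labelled SA and SB
adj = {"SA": {"SB", "SC"}, "SB": {"SA", "SC"}, "SC": {"SA", "SB"}}
sig = {"SA": True, "SB": True, "SC": True}
ids = {"SA": 7, "SB": 7, "SC": None}
print(group_significant_segments(adj, sig, ids))       # SC inherits identifier 7
```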
Referring again to Fig. 1, the method 100 proceeds to step 170 where the labelled SVO's are returned as output. In a preferred implementation the frame that is currently coplanar with the view plane 410 (Fig. 4) is displayed on the display device 714, with each SVO in that video frame marked. This may be done, for example, by only showing the pixels of SVO's. Such pixels may be superimposed on any other image or video sequence. Alternatively, each SVO may be bounded by a distinct line border. It is noted that the displayed frame is delayed by N−1 frame intervals from the input frame.
In a further implementation, each SVO is displayed with a tag corresponding to its object identifier.
Method 100 returns to step 110 where the next frame is received as input and steps 120 to 170 are repeated with the video data now including that frame.
In what follows, a number of refinements of method 100 are described, followed by a detailed description of the 3D spatiotemporal segmentation of step 120.
Hysteresis thresholding: The absolute motion test used in step 150 to classify segments Si as significant has, intrinsically, a delay, as a moving segment Si may take several frame intervals before its estimated motion (Δx̄_i, Δȳ_i) satisfies the threshold imposed by Equation (11).
Accordingly, different segments Si of a newly appearing object can take different numbers of frame intervals to be classified as significant by step 150, which could cause the SVO representing the object to appear piecemeal, and possibly be wrongly interpreted as consisting of distinct objects.
To counter this effect, a time element is added to the absolute motion test of significance to form a two-level test: firstly, a 3D segment Si needs to move greater than some spatial threshold, as defined in Equation (11), and is then classified as significant only some fixed number M of frame intervals later.
However, segments Si that are adjacent to a segment Sj classified as being significant, and that pass the first absolute motion test (Equation (11)) but not the second time test, are also classified as significant in the connected-component algorithm of step 160. This two-level threshold can be considered a form of hysteresis thresholding, and if the fixed number M of frame intervals is chosen to be greater than the average number of frame intervals that segments Si take to move beyond the motion threshold (Equation (11)), then objects will tend to appear instantaneously and fully-formed.
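The two-level hysteresis test can be expressed compactly as follows. The value of M and the function signature are assumptions of this sketch; the patent leaves M to the implementer.

```python
M = 5   # assumed hysteresis delay in frame intervals


def hysteresis_classify(passed_motion_at, now, neighbour_is_significant):
    """Two-level significance test of the hysteresis refinement (sketch).
    passed_motion_at: frame index at which Equation (11) was first satisfied, or None.
    A segment becomes significant M frame intervals after passing the motion test,
    or immediately if it has passed the motion test and an adjacent segment is
    already significant (the promotion applied during the grouping of step 160)."""
    if passed_motion_at is None:
        return False                        # first (spatial) test not yet passed
    if now - passed_motion_at >= M:
        return True                         # second (time) test satisfied
    return neighbour_is_significant         # promoted early by an adjacent significant segment


# usage sketch
print(hysteresis_classify(passed_motion_at=10, now=12, neighbour_is_significant=False))  # False
print(hysteresis_classify(passed_motion_at=10, now=12, neighbour_is_significant=True))   # True
```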
Extension for arbitrary non-zooming camera motion.
The above discussion assumes a fixed background against which the moving objects are detected. To monitor a large field of view, a fixed 180 or 360 degree spherical or cylindrical projection camera system could be used. For the general case of a moving camera 750, a separate egomotion estimation algorithm, as known to those skilled in the art, can be used to estimate the camera motion and its current pan/tilt position, and the translational components of the induced relative image motion can then be subtracted from the absolute segment motions (Δx̄_i, Δȳ_i) as estimated by step 130 to give a true absolute motion for processing by later steps.
In detail, the current camera absolute pan and tilt angles are stored in the corresponding segment data structure at segment creation time. The angular units of the pan/tilt may be arbitrary; in the preferred implementation, the angular units are normalized so that half of the camera field of view equals 1.0. This has the advantage that the actual camera field of view in degrees need not be known. A given segment's current angular position (elevation and azimuth) relative to the current camera view may be approximated by the relative x and y displacement of the segment centroid from the centre of the image and so the current segment's absolute angular position may be determined by subtracting the known camera pan/tilt angular location at segment creation time from the current segment spherical coordinates.
Because the camera motion will typically include translational motion and hence parallax effects, and because the egomotion estimate itself will have some degree of error, the absolute motion estimate may be limited to some fixed previous frame interval F rather than the full lifetime of the corresponding segment Si. Thus error accumulation is limited to this fixed frame interval F which is chosen such that this error is expected to be much smaller than the false motion error that is detected in step 140.
The previous discussion strictly assumes a spherical projection camera system.
For the typical planar projection camera system with a relatively small field of view and limited camera pan/tilt, the approximation to a spherical projection may be ignored. The segment displacement tests may be made robust to any small errors due to the approximation by increasing the thresholds used. Optionally, for a moving camera with large pan/tilts and/or a large field of view, and where the camera field of view is known, the segment coordinates may be rectified to a reference planar or spherical projection before the above-described tests, for more accurate estimates, using transforms known to those skilled in the art.
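Under the normalised angular units described above (half the camera field of view equals 1.0) and the small-angle planar approximation, the camera-induced image translation can be removed from a segment's raw centroid displacement as sketched below. The sign convention, in which a positive pan shifts scene content towards negative x, is an assumption of this sketch, as is the function itself.

```python
def compensated_motion(centroid_now, centroid_first, pan_tilt_now, pan_tilt_first, image_size):
    """Sketch of the egomotion compensation for a panning/tilting camera.
    Pan/tilt are in normalised units (half the field of view = 1.0), so a pan of 1.0
    corresponds to an image shift of half the image width."""
    width, height = image_size
    cam_dx = (pan_tilt_now[0] - pan_tilt_first[0]) * (width / 2.0)
    cam_dy = (pan_tilt_now[1] - pan_tilt_first[1]) * (height / 2.0)
    raw_dx = centroid_now[0] - centroid_first[0]
    raw_dy = centroid_now[1] - centroid_first[1]
    # add back the camera-induced image translation to recover the true absolute motion
    return raw_dx + cam_dx, raw_dy + cam_dy


# usage sketch: the camera panned right by 0.2 units, so a static segment drifted
# about 32 pixels left in a 320-pixel-wide image; the compensated motion is near zero
print(compensated_motion((90.0, 60.0), (122.0, 60.0), (0.2, 0.0), (0.0, 0.0), (320, 240)))
```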
Retrospective Classification: The above description assumed that future frame data is unavailable, as is the case in real-time streaming video object tracking. In many cases, such as the processing of previously captured stored video sequences, it is possible to go back in time to revisit previous determinations. There are at least three steps of method 100 where this additional flexibility can be utilised.
Firstly, in step 140, the updated bounds estimates Δx̂_bound and Δŷ_bound determined using Equations (9) and (10) at the end of a segment's lifespan are more accurate estimates of the bounds Δx_bound and Δy_bound of the unoccluded 2D intersection of each 3D segment. Thus, these end bounds estimates Δx̂_bound and Δŷ_bound may be used in a post-processing classification step 150 to provide an improved classification of which segments are significant.
Secondly, in step 150, after a segment Si has been classified as moving, a later application of step 160 may use this segment classification from the beginning of that segment's lifespan, and hence remove the effects of the latency inherent in the motion-based segment classifier.
Finally, in step 160, by essentially running the connected-component algorithm over the classified segment data in reverse time order, the merging problem becomes the branching problem and vice versa. So by combining these two sources of object labelling information, a more robust final labelling can be achieved.
Branching Problem: In the case where a single object bifurcates into several separate objects of the same colour (e.g. a group of people where one or more people leave the group), it may happen that method 100 has assigned the same object identifier to 3D segments Si that were once adjacent, but whose intersections with the view plane 410 in the current frame interval are not adjacent. This case may be handled by testing for it on a per frame interval basis using a 2D connected-component test, and then assigning separate object identifiers for each such 3D segment Si when detected. In the preferred implementation the smaller-volume 3D segment has its object identifier rescinded such that it will be reassigned a new identifier by the later grouping passes.
Head Tracking: By performing a second pass of the grouping phase (step 160) that restricts each connected-component search to be within the previously extracted video objects, and which is based on the Mahalanobis distance from predetermined skin colour models (mean skin colours and covariance matrices), skin patches within the video objects may be labelled.
Such a combination of a motion test followed by a skin colour test is used as a robust basis for head tracking and gesture recognition applications.
Object Removal: Once the foreground video objects in a scene have been labelled, a model of the background of the scene may be built up. This allows one to remove selected objects from a video or single frame of an image sequence from a digital video camera and replace them with the background model.
To store the background model, an image-sized frame buffer is used. The background frame buffer is updated at each new frame by simply overwriting it with the current image data masked by the current foreground object mask, that is, the new background model is equal to the current frame background where the foreground mask is 0, and the value of the previous background model elsewhere.
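The background-buffer update and the object replacement step can be sketched as below. This is a hedged illustration using NumPy with single-channel frames for brevity; the function names are not from the patent, and camera alignment and the morphological growing of the mask mentioned next are omitted.

```python
import numpy as np


def update_background(background, frame, foreground_mask):
    """One background-model update: where the foreground object mask is 0 the buffer
    takes the current frame's pixels, elsewhere it keeps the previous background value."""
    bg = background.copy()
    keep_current = (foreground_mask == 0)
    bg[keep_current] = frame[keep_current]
    return bg


def remove_object(frame, background, object_mask):
    """Replace the selected object's pixels with the background model."""
    out = frame.copy()
    out[object_mask == 1] = background[object_mask == 1]
    return out


# usage sketch with a tiny 4x4 single-channel "video"
background = np.zeros((4, 4), dtype=np.uint8)
frame = np.full((4, 4), 9, dtype=np.uint8)
fg_mask = np.zeros((4, 4), dtype=np.uint8); fg_mask[1:3, 1:3] = 1   # a 2x2 foreground object
background = update_background(background, frame, fg_mask)
print(remove_object(frame, background, fg_mask))   # the object's pixels fall back to the model
```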
If there is significant camera movement, then the previous background model must be aligned with the current frame before updating, using the known or estimated camera ego-motion parameters.
Additionally, several iterations of a morphological growing operation may be applied to the foreground video object mask to exclude stray boundary pixels that may have been missed in the object segmentation.
Initially, areas of the background model behind objects will be unknown, but as the objects move they reveal more of their background, and hence the background model becomes more complete.
In the preferred implementation, the video object is only presented to the user as being selectable for removal when the object has moved greater than its bounding box, ensuring that a complete background model is available to replace the object when it is selected. Note that this implies that the constituent segments of the object will have moved greater than their width, and thus this ensures that the segment displacement test described with reference to Equation (11) will work correctly for this application.
The above describes the situation for offline processing using the retrospective labelling pass. For real-time object removal a more conservative threshold for user selection is needed, such as, when the object has moved greater than twice its width in the worst case. This more conservative threshold ensures that replacement background data is available.
3D Segmentation
The spatiotemporal (3D) segmentation of the video data based on colour performed by the processor 705 in step 120 will now be described in more detail. Further description of a spatiotemporal (3D) segmentation technique is provided in the publication "Three-dimensional Video Segmentation Using a Variational Method", International Conference on Image Processing (ICIP 2001), by Parker, B. and Magarey, J.
The segmentation uses the block of pixel data of the N most recently received frames. Each pixel in the two dimensional array of a frame at frame interval n has colour values represented as a vector of measurements y(x,y,n), which includes colour values from an image sensor, such as that in camera 750. The three-dimensional block of pixels, or voxels, is segmented into a set of three-dimensional segments so that every pixel in the block is related to one segment Si in which all pixels belonging to the same segment Si have homogeneous colour values y(x,y,n).
An assumption underlying the segmentation problem is that each colour value is associated with a particular state. The model used to define the states is decided upon in advance. Each state is defined by an unknown segment model parameter vector a_i of length c, with each state being assumed to be valid over the contiguous 3D segment Si.
The aim of segmentation is to identify these 3D segments Si and the model parameters for each segment Si.
A model vector of measurements ȳ(x,y,n) over each segment Si is assumed to be a linear projection of the c-vector model parameter a_i for that segment Si:

ȳ(x,y,n) = A(x,y,n) a_i,  (x,y,n) ∈ Si,    (12)

where A(x,y,n) is an m by c matrix, which relates the state of segment Si to the model measurements ȳ(x,y,n), thereby encapsulating the nature of the predefined model.
In the colour video segmentation case, c = m and matrix A(x,y,n) is the c by c identity matrix for all (x,y,n).
Each vector of actual colour values y(x,y,n) is subject to a random error e(x,y,n) such that

y(x,y,n) = ȳ(x,y,n) + e(x,y,n).    (13)

Further, the error e(x,y,n) may be assumed to be drawn from a zero-mean normal (Gaussian) distribution with covariance Λ(x,y,n):

e(x,y,n) ~ N(0, Λ(x,y,n)),    (14)

wherein Λ(x,y,n) is a c by c covariance matrix. Each component of the error e(x,y,n) is assumed to be independently and identically distributed, i.e.:

Λ(x,y,n) = σ²(x,y,n) I_c,    (15)

where I_c is the c by c identity matrix.
Variational segmentation requires that a cost function E be assigned to each possible segmentation. A partition into segments Si may be compactly described by a binary function J(d) on the boundary elements, in which the value one is assigned to each boundary element d bordering a segment Si. This function J(d) is referred to as a boundary map.
The cost function E used in the preferred implementation is one in which a model fitting error is balanced with an overall complexity of the model. The sum of the statistical residuals of each segment Si is used as the model fitting error. Combining Equations (12) to (15), the residual over segment Si as a function of the model parameters a_i is given by

E_i(a_i) = Σ_{(x,y,n) ∈ Si} [y(x,y,n) − A(x,y,n) a_i]ᵀ [y(x,y,n) − A(x,y,n) a_i].    (16)

The model complexity is simply the number of segment-bounding elements d.
Hence the overall cost functional E may be defined as

E(ȳ, J) = Σ_i E_i + λ Σ_d J(d),    (17)

where the (non-negative) parameter λ controls the relative importance of model fitting error and model complexity. The contribution of the model fitting error to the cost functional E encourages a proliferation of segments, while the model complexity encourages few segments. The functional E must therefore balance the two components to achieve a reasonable result. The aim of variational segmentation is to find a minimising model measurement ȳ and a minimising boundary map J(d) of the overall cost functional E, for a given parameter λ value.
Note that if the segment boundaries d are given as a valid boundary map J(d), the minimising model parameters â_j over each segment Sj may be found by minimising the segment residuals E_j. This may be evaluated using a simple weighted linear least squares calculation. Given this fact, any valid boundary map J(d) will fully and uniquely describe a segmentation. Therefore, the cost function E may be regarded as a function over the space of valid edge maps (J-space), whose minimisation yields an optimal segment partition Ĵ_λ for a given parameter λ. The corresponding minimising model parameters â_i may then be assumed to be those which minimise the residuals E_i over each segment Si. The corresponding minimum residuals for segment Si will hereafter be written as Ê_i.
If parameter λ is low, many boundaries are allowed, giving "fine" segmentation.
As parameter λ increases, the segmentation gets coarser. At one extreme, the optimal segment partition Ĵ_0, where the model complexity is completely discounted, is the trivial segmentation, in which every pixel constitutes its own segment Si, and which gives zero model fitting error. On the other hand, the optimal segment partition where the model fitting error is completely discounted is the null or empty segmentation, in which the entire block is represented by a single segment. Somewhere between these two extremes lies the segmentation Ĵ_λ which will appear ideal, in that the segments Si correspond to a semantically meaningful partition.
To find an approximate solution to the variational segmentation problem, a segment merging strategy has been employed, wherein properties of neighbouring segments Si and Sj are used to determine whether those segments come from the same model state, thus allowing the segments Si and Sj to be merged as a single segment S_ij.
The segment residual E_ij also increases after any two neighbouring segments Si and Sj are merged.
Knowing that the trivial segmentation is the optimal segment partition Ĵ_λ for the smallest possible parameter value of λ = 0, in segment merging each voxel in the block is initially labelled as its own unique segment Si, with model parameters a_i set to the colour values y(x,y,n). Adjacent segments Si and Sj are then compared using some similarity criterion and merged if they are sufficiently similar. In this way, small segments take shape and are gradually built into larger ones.
The segmentations Ĵ_λ before and after the merger differ only in the two segments Si and Sj. Accordingly, in determining the effect on the total cost functional E after such a merger, the computation may be confined to those segments Si and Sj. By examining Equations (16) and (17), a merging cost for the adjacent segment pair {Si, Sj} may be written as

τ_ij = [Ê_ij − (Ê_i + Ê_j)] / l(∂_ij),    (18)

where l(∂_ij) is the area of the common boundary between three-dimensional segments Si and Sj. If the merging cost τ_ij is less than parameter λ, the merge is allowed.
The key to efficient segment growing is to compute the numerator of the merging cost τ_ij as fast as possible. Firstly, Equation (16) is rewritten as

E_j(a_j) = (F_j − H_j a_j)ᵀ (F_j − H_j a_j),    (19)

where H_j is a (v_j c) by c matrix composed of the c by c identity matrices A(x,y,n) stacked on top of one another as (x,y,n) varies over segment Sj, with v_j the number of voxels in segment Sj, and F_j is a column vector of length (v_j c) composed of the individual colour value vectors y(x,y,n) stacked on top of one another.
By weighted least squares theory, the minimising model parameter vector â_j for the segment Sj is given by the mean of the colour values over segment Sj.
κ_j is the confidence in the model parameter estimate â_j, defined as the inverse of its covariance:

κ_j = H_jᵀ H_j,    (20)

which simply evaluates to v_j I_c. The corresponding residual is given by

Ê_j = (F_j − H_j â_j)ᵀ (F_j − H_j â_j).    (21)

When merging two segments Si and Sj, the "merged" matrix H_ij is obtained by concatenating matrix H_i with matrix H_j; likewise for the vector F_ij. These facts may be used to show that the best fitting model parameter vector for the merged segment S_ij is given by

â_ij = (v_i â_i + v_j â_j) / (v_i + v_j),    (23)

and the merged confidence is

κ_ij = κ_i + κ_j = (v_i + v_j) I_c.    (24)

The merged residual is given by

Ê_ij = Ê_i + Ê_j + [v_i v_j / (v_i + v_j)] (â_i − â_j)ᵀ (â_i − â_j).    (25)

Combining Equations (24) and (25), the merging cost τ_ij in Equation (18) may be computed as

τ_ij = [v_i v_j / (v_i + v_j)] (â_i − â_j)ᵀ (â_i − â_j) / l(∂_ij),    (26)

from the model parameters and confidences of the segments Si and Sj to be merged. If the merge is allowed, Equations (23) and (24) give the model parameter â_ij and confidence κ_ij of the merged segment S_ij.
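The merged-residual identity of Equation (25) is what makes the numerator of the merging cost cheap to compute: only the per-segment means and voxel counts are needed, not the raw voxels. The short numerical check below, using synthetic colour data, is illustrative only and is not code from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# two small "segments" of colour vectors (rows are voxels, columns are the c colour channels)
F_i = rng.normal(loc=[100, 50, 50], scale=2.0, size=(30, 3))
F_j = rng.normal(loc=[104, 52, 49], scale=2.0, size=(10, 3))


def residual(F):
    """Equation (21): residual about the least-squares (mean-colour) model parameter."""
    a = F.mean(axis=0)
    return float(((F - a) ** 2).sum()), a, len(F)


E_i, a_i, v_i = residual(F_i)
E_j, a_j, v_j = residual(F_j)
E_ij, _, _ = residual(np.vstack([F_i, F_j]))          # residual of the merged segment

# Equation (25): the merged residual computed from the parts, without the raw voxels
E_ij_pred = E_i + E_j + (v_i * v_j / (v_i + v_j)) * float((a_i - a_j) @ (a_i - a_j))
print(np.isclose(E_ij, E_ij_pred))                    # True: the identity holds
```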
During segment-merging segmentation, the merging of segments must stop once the merging cost τ_ij exceeds a predetermined threshold τ_stop. Note that under this strategy, only Equations (23), (24) and (26) need to be applied throughout the merging process.
Only the model parameters a_i and their confidences κ_i for each segment Si are therefore required as segmentation proceeds. Further, neither the original colour values nor the model structure itself (i.e. the matrices A(x,y,n)) are required.
Fig. 8 shows the segment-merging segmentation step 120 (Fig. 1) in more detail.
The segment-merging segmentation step 120 starts at sub-step 802 and proceeds to step 804, which sets the model parameters a(x,y,n) to the colour values y(x,y,n) and the model confidences κ(x,y,n) to the identity matrix I_c, for each voxel in the block of N frames. Segment-merging starts with the trivial segmentation where each pixel forms its own segment Si. Step 806 then determines all adjacent segment pairs Si and Sj, and computes the merging cost τ_ij according to Equation (26) for each of the boundaries between adjacent segment pairs Si and Sj. Step 808 inserts the boundaries with their merging costs τ_ij into a priority queue P in priority order.
Step 810 takes the first entry from the priority queue P(1) and merges the corresponding segment pair Si and Sj (i.e. the segment pair Si and Sj with the lowest merging cost τ_ij) to form a new segment S_ij. Step 812 records the merging cost τ_ij in a list U.
Step 814 identifies all boundaries between segments S_t adjoining either of the merged segments Si and Sj and merges any duplicate boundaries, adding their areas. Step 818 follows by calculating a new merging cost τ_ij,t for each boundary between the merged segment S_ij and an adjacent segment S_t. The new merging costs τ_ij,t effectively reorder the priority queue P into the final sorted queue in step 818.
Step 818 passes control to step 822 to determine whether the merging cost τ_ij corresponding to the segments Si and Sj at the top of the priority queue P (entry P(1)) has a value greater than the predetermined threshold τ_stop, which signifies the stopping point of the merging. If the merging has reached the stopping point, then the segment-merging segmentation step 120 ends in step 830. Alternatively, control is returned to step 810 from where steps 810 to 822 are repeated, merging the two segments with the lowest merging cost τ_ij every cycle, until the stopping point is reached.
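A compact sketch of the merging loop of Fig. 8 (steps 806 to 830) is given below. For clarity it rebuilds the priority queue each iteration rather than updating it in place as steps 814 to 818 do; the result is the same, only slower. The data layout and names are assumptions of this sketch.

```python
import numpy as np


def merge_segments(means, counts, boundaries, lam_stop):
    """Sketch of the segment-merging loop.
    means: seg_id -> mean colour vector (the state of Equation (23));
    counts: seg_id -> voxel count (the confidence of Equation (24) is counts * I_c);
    boundaries: (i, j) pair -> shared boundary area; lam_stop: stopping threshold."""
    labels = {s: s for s in means}            # current label of every original segment

    def cost(i, j, area):                     # Equation (26)
        d = means[i] - means[j]
        return (counts[i] * counts[j] / (counts[i] + counts[j])) * float(d @ d) / area

    while True:
        # steps 806-808: costs of all current boundaries, cheapest first
        queue = sorted((cost(i, j, a), i, j, a) for (i, j), a in boundaries.items())
        if not queue or queue[0][0] > lam_stop:
            break                             # step 822: stopping point reached
        _, i, j, _ = queue[0]                 # step 810: merge the cheapest pair
        vi, vj = counts[i], counts[j]
        means[i] = (vi * means[i] + vj * means[j]) / (vi + vj)   # Equation (23)
        counts[i] = vi + vj                                      # Equation (24)
        del means[j], counts[j]
        labels = {s: (i if lbl == j else lbl) for s, lbl in labels.items()}
        # step 814: redirect j's boundaries to i, merging duplicates by adding their areas
        new_bounds = {}
        for (p, q), a in boundaries.items():
            p, q = (i if p == j else p), (i if q == j else q)
            if p != q:
                key = (min(p, q), max(p, q))
                new_bounds[key] = new_bounds.get(key, 0.0) + a
        boundaries = new_bounds
    return labels


# usage sketch: the two similar-coloured segments merge, the distinct one survives
means = {0: np.array([10.0, 10.0, 10.0]), 1: np.array([11.0, 10.0, 9.0]), 2: np.array([200.0, 5.0, 5.0])}
counts = {0: 4, 1: 4, 2: 4}
bounds = {(0, 1): 2.0, (1, 2): 2.0}
print(merge_segments(means, counts, bounds, lam_stop=50.0))   # {0: 0, 1: 0, 2: 2}
```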
With reference again to Fig. 4 and as noted previously, as a new frame is received in step 110, the new frame is added to the window 400, while the oldest frame in the window 400 is removed from the window 400. The segmentation step 120 is performed as each new frame is received in step 110. However, after the segmentation step 120 described with reference to Fig. 8 has been performed a first time, in subsequent executions of the segmentation step 120 the segments Si formed in a previous segmentation are maintained in step 804, with only the model parameters a(x,y,n) and model confidences κ(x,y,n) of the new frame being set to the colour values and the identity matrix I_c respectively. The effect of the segmentation step 120 is thus to merge the unsegmented pixels of the new frame into the existing segments Si from a previous segmentation. However, those existing segments Si from a previous segmentation may adjust due to the information contained in the new frame.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including" and not "consisting only of". Variations of the word comprising, such as "comprise" and "comprises" have corresponding meanings.

Claims (4)

1. A method of detecting and tracking semantic video objects across a sequence of video frames, said method comprising the steps of: forming a 3D pixel data block from said sequence of video frames; segmenting said 3D data block on the basis of colour into a set of 3D segments using 3D spatiotemporal segmentation; estimating an absolute motion of each 3D segment by calculating a vector difference between a 2D centroid of an intersection of said 3D segment with a view plane and a 2D centroid of an intersection of said 3D segment with said view plane in a previous frame interval; classifying each 3D segment, wherein 3D segments having absolute motion exceeding, in any direction, a calculated bound of said intersection of said 3D segment with a view plane in that direction, are classified as significant; and grouping adjacent significant 3D segments into semantic video objects.
2. A method as claimed in claim 1 wherein said calculated bound is a maximum bound of said intersection of said 3D segment over previous frame intervals.
3. A method as claimed in claim 1 or 2, where step comprises the sub-steps of: (ea) identifying adjacent intersections of significant 3D segments with said view plane; and (eb) labelling said adjacent significant 3D segments with an object identifier.
4. A method as claimed in claim 3, wherein sub-step (eb) comprises the sub-steps of:
(eb1) determining whether at least one of said adjacent significant 3D segments have been labelled with said object identifier;
(eb2) if all said labelled adjacent significant 3D segments have a same object identifier, labelling the remainder of said adjacent significant 3D segments with said same object identifier;
(eb3) if said labelled adjacent significant 3D segments have different object identifiers, labelling said remainder of said adjacent significant 3D segments with a most likely object identifier of said different object identifiers; and
(eb4) if all said adjacent significant 3D segments are unlabelled, labelling said adjacent significant 3D segments with a new object identifier.

5. An apparatus for detecting and tracking semantic video objects across a sequence of video frames, said apparatus comprising:
means for forming a 3D pixel data block from said sequence of video frames;
means for segmenting said 3D data block on the basis of colour into a set of 3D segments using 3D spatiotemporal segmentation;
means for estimating an absolute motion of each 3D segment by calculating a vector difference between a 2D centroid of an intersection of said 3D segment with a view plane and a 2D centroid of an intersection of said 3D segment with said view plane in a previous frame interval;
means for classifying each 3D segment, wherein 3D segments having absolute motion exceeding, in any direction, a calculated bound of said intersection of said 3D segment with a view plane in that direction, are classified as significant; and
means for grouping adjacent significant 3D segments into semantic video objects.

6. An apparatus as claimed in claim 5 wherein said calculated bound is a maximum bound of said intersection of said 3D segment over previous frame intervals.

7. An apparatus as claimed in claim 5 or 6, where said means for grouping comprises:
means for identifying adjacent intersections of significant 3D segments with said view plane; and
means for labelling said adjacent significant 3D segments with an object identifier.

8. An apparatus as claimed in claim 7, wherein said means for labelling comprises:
means for determining whether at least one of said adjacent significant 3D segments have been labelled with said object identifier;
means for labelling the remainder of said adjacent significant 3D segments, wherein:
if all said labelled adjacent significant 3D segments have a same object identifier, said remainder of said adjacent significant 3D segments are labelled with said same object identifier;
if said labelled adjacent significant 3D segments have different object identifiers, said remainder of said adjacent significant 3D segments are labelled with a most likely object identifier of said different object identifiers; and
if all said adjacent significant 3D segments are unlabelled, said adjacent significant 3D segments are labelled with a new object identifier.

9. A program stored in a memory medium for detecting and tracking semantic video objects across a sequence of video frames, said program comprising:
code for forming a 3D pixel data block from said sequence of video frames;
code for segmenting said 3D data block on the basis of colour into a set of 3D segments using 3D spatiotemporal segmentation;
code for estimating an absolute motion of each 3D segment by calculating a vector difference between a 2D centroid of an intersection of said 3D segment with a view plane and a 2D centroid of an intersection of said 3D segment with said view plane in a previous frame interval;
code for classifying each 3D segment, wherein 3D segments having absolute motion exceeding, in any direction, a calculated bound of said intersection of said 3D segment with a view plane in that direction, are classified as significant; and
code for grouping adjacent significant 3D segments into semantic video objects.

10. A program as claimed in claim 9 wherein said calculated bound is a maximum bound of said intersection of said 3D segment over previous frame intervals.

11. A program as claimed in claim 9 or 10, where said code for grouping comprises:
code for identifying adjacent intersections of significant 3D segments with said view plane; and
code for labelling said adjacent significant 3D segments with an object identifier.

12. A program as claimed in claim 11, wherein said code for labelling comprises:
code for determining whether at least one of said adjacent significant 3D segments have been labelled with said object identifier;
code for labelling the remainder of said adjacent significant 3D segments, wherein:
if all said labelled adjacent significant 3D segments have a same object identifier, said remainder of said adjacent significant 3D segments are labelled with said same object identifier;
if said labelled adjacent significant 3D segments have different object identifiers, said remainder of said adjacent significant 3D segments are labelled with a most likely object identifier of said different object identifiers; and
if all said adjacent significant 3D segments are unlabelled, said adjacent significant 3D segments are labelled with a new object identifier.

13. A method of removing a semantic video object across a sequence of video frames, said method comprising the steps of:
forming a 3D pixel data block from said sequence of video frames;
segmenting said 3D data block on the basis of colour into a set of 3D segments using 3D spatiotemporal segmentation;
estimating an absolute motion of each 3D segment by calculating a vector difference between a 2D centroid of an intersection of said 3D segment with a view plane and a 2D centroid of an intersection of said 3D segment with said view plane in a previous frame interval;
classifying each 3D segment, wherein 3D segments having absolute motion exceeding, in any direction, a calculated bound of said intersection of said 3D segment with a view plane in that direction, are classified as significant;
grouping adjacent significant 3D segments into semantic video objects;
forming a background image by updating a memory at each frame interval with pixel data of non-significant 3D segments;
receiving a selection of one or more semantic video objects for removal; and
replacing pixel data of selected semantic video objects with pixel data from said background image at each frame interval.

14. A method as claimed in claim 13 wherein step comprises for each semantic video object the sub-steps of:
determining whether said absolute motion of said semantic video object exceeds a bounding box dimension of said semantic video object; and
accepting said selection of said semantic video objects for removal upon determination that said absolute motion of said semantic video object exceeds said bounding box dimension of said semantic video object.

15. An apparatus for removing a semantic video object across a sequence of video frames, said apparatus comprising:
means for forming a 3D pixel data block from said sequence of video frames;
means for segmenting said 3D data block on the basis of colour into a set of 3D segments using 3D spatiotemporal segmentation;
means for estimating an absolute motion of each 3D segment by calculating a vector difference between a 2D centroid of an intersection of said 3D segment with a view plane and a 2D centroid of an intersection of said 3D segment with said view plane in a previous frame interval;
means for classifying each 3D segment, wherein 3D segments having absolute motion exceeding, in any direction, a calculated bound of said intersection of said 3D segment with a view plane in that direction, are classified as significant;
means for grouping adjacent significant 3D segments into semantic video objects;
means for forming a background image by updating a memory at each frame interval with pixel data of non-significant 3D segments;
means for receiving a selection of one or more semantic video objects for removal; and
means for replacing pixel data of selected semantic video objects with pixel data from said background image at each frame interval.

16. An apparatus as claimed in claim 15 wherein said selection of said semantic video object(s) for removal is only accepted upon determination that said absolute motion of said semantic video object exceeds a bounding box dimension of said semantic video object.

17. A program stored on a memory medium for removing a semantic video object across a sequence of video frames, said program comprising:
code for forming a 3D pixel data block from said sequence of video frames;
code for segmenting said 3D data block on the basis of colour into a set of 3D segments using 3D spatiotemporal segmentation;
code for estimating an absolute motion of each 3D segment by calculating a vector difference between a 2D centroid of an intersection of said 3D segment with a view plane and a 2D centroid of an intersection of said 3D segment with said view plane in a previous frame interval;
code for classifying each 3D segment, wherein 3D segments having absolute motion exceeding, in any direction, a calculated bound of said intersection of said 3D segment with a view plane in that direction, are classified as significant;
code for grouping adjacent significant 3D segments into semantic video objects;
code for forming a background image by updating a memory at each frame interval with pixel data of non-significant 3D segments;
code for receiving a selection of one or more semantic video objects for removal; and
code for replacing pixel data of selected semantic video objects with pixel data from said background image at each frame interval.

18. A method of detecting and tracking semantic video objects across a sequence of video frames, said method being substantially as described herein with reference to the accompanying drawings.

19. An apparatus for detecting and tracking semantic video objects across a sequence of video frames, said apparatus being substantially as described herein with reference to the accompanying drawings.

20. A program for detecting and tracking semantic video objects across a sequence of video frames, said program being substantially as described herein with reference to the accompanying drawings.

DATED this Seventeenth Day of DECEMBER 2002
Canon Kabushiki Kaisha
Patent Attorneys for the Applicant
SPRUSON&FERGUSON
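For readers approaching the claims from an implementation standpoint, the following is a minimal sketch of the significance test recited in claims 5, 9, 13, 15 and 17: the 2D centroid of a 3D segment's intersection with the view plane is compared against its centroid at the previous frame interval, and the segment is classified as significant when the displacement exceeds, in either direction, a bound calculated from that intersection. The per-direction bound is taken here as half the bounding-box extent of the intersection, which is one plausible reading of the claimed "calculated bound", and the optional max_extents argument stands in for the maximum bound over previous frame intervals of claims 6 and 10. The data layout (a per-pixel segment-label map per frame) is an assumption for illustration, not the patented implementation.

```python
import numpy as np

def centroid_and_extent(label_map, seg_id):
    """Centroid (x, y) and per-direction bound (bx, by) of one segment's
    intersection with the view plane; None if the segment is absent."""
    ys, xs = np.nonzero(label_map == seg_id)
    if xs.size == 0:
        return None
    cx, cy = xs.mean(), ys.mean()
    bx = (xs.max() - xs.min() + 1) / 2.0   # assumed bound in the x direction
    by = (ys.max() - ys.min() + 1) / 2.0   # assumed bound in the y direction
    return (cx, cy), (bx, by)

def classify_significant(prev_labels, curr_labels, max_extents=None):
    """Return the ids of 3D segments whose centroid displacement between the
    previous and current view-plane intersections exceeds the calculated
    bound in either direction (the claimed significance test)."""
    significant = set()
    for seg_id in np.unique(curr_labels):
        curr = centroid_and_extent(curr_labels, seg_id)
        prev = centroid_and_extent(prev_labels, seg_id)
        if curr is None or prev is None:
            continue  # segment not present in both intervals: no motion estimate
        (cx, cy), (bx, by) = curr
        (px, py), _ = prev
        if max_extents is not None:        # maximum bound over previous intervals
            bx, by = max_extents.get(seg_id, (bx, by))
        if abs(cx - px) > bx or abs(cy - py) > by:
            significant.add(seg_id)
    return significant
```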
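The grouping and labelling recited in claims 4, 8 and 12 can be pictured as a connected-component pass over the significant segments in the current view-plane intersection, carrying object identifiers forward from earlier frame intervals. The sketch below is an illustrative reading only: scipy.ndimage.label supplies the adjacency, the object_ids dictionary is an assumed bookkeeping structure, and the claimed "most likely object identifier" is approximated by a simple majority vote over the already-labelled segments in the group.

```python
from collections import Counter

import numpy as np
from scipy import ndimage

def group_significant(label_map, significant, object_ids, next_object_id=0):
    """Group adjacent significant segments and label each group with an
    existing, most-likely, or new object identifier."""
    mask = np.isin(label_map, list(significant))
    components, n = ndimage.label(mask)          # adjacent significant regions
    for comp_id in range(1, n + 1):
        segs = set(np.unique(label_map[components == comp_id]))
        labelled = [object_ids[s] for s in segs if s in object_ids]
        if not labelled:                         # all unlabelled: new identifier
            chosen, next_object_id = next_object_id, next_object_id + 1
        else:                                    # propagate the most common identifier
            chosen = Counter(labelled).most_common(1)[0][0]
        for s in segs:                           # label the remainder of the group
            object_ids[s] = chosen
    return object_ids, next_object_id
```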
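Claims 13 and 15 to 17 additionally recite object removal against a maintained background. A hedged sketch of that step follows: the background memory is refreshed each frame interval from the pixels of non-significant segments, and the pixels belonging to any selected semantic video object are then replaced from that memory. The frame layout (H x W x 3 arrays), the object_ids mapping and the selection interface are assumptions for illustration; the acceptance test of claims 14 and 16 (absolute motion exceeding a bounding-box dimension) would gate the selected set before this point.

```python
import numpy as np

def update_background(background, frame, label_map, significant):
    """Refresh the background memory with pixel data of non-significant segments."""
    static = ~np.isin(label_map, list(significant))
    background[static] = frame[static]
    return background

def remove_selected_objects(frame, label_map, object_ids, selected, background):
    """Replace pixel data of the selected semantic video objects with background."""
    out = frame.copy()
    remove_segs = [s for s, oid in object_ids.items() if oid in selected]
    mask = np.isin(label_map, remove_segs)
    out[mask] = background[mask]
    return out
```

Chaining these routines per frame interval (classify, group, update the background, then substitute the selected objects) mirrors the overall flow of claim 13, though the patented method operates on 3D segments from the spatiotemporal segmentation rather than the per-frame label maps assumed here.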
AU2002318859A 2001-12-19 2002-12-18 A Method for Video Object Detection and Tracking Based on 3D Segment Displacement Ceased AU2002318859B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002318859A AU2002318859B2 (en) 2001-12-19 2002-12-18 A Method for Video Object Detection and Tracking Based on 3D Segment Displacement

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AUPR9632 2001-12-19
AUPR9632A AUPR963201A0 (en) 2001-12-19 2001-12-19 A method for video object detection and tracking based on 3d segment displacement
AU2002318859A AU2002318859B2 (en) 2001-12-19 2002-12-18 A Method for Video Object Detection and Tracking Based on 3D Segment Displacement

Publications (2)

Publication Number Publication Date
AU2002318859A1 true AU2002318859A1 (en) 2003-08-28
AU2002318859B2 AU2002318859B2 (en) 2004-11-25

Family

ID=34063799

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2002318859A Ceased AU2002318859B2 (en) 2001-12-19 2002-12-18 A Method for Video Object Detection and Tracking Based on 3D Segment Displacement

Country Status (1)

Country Link
AU (1) AU2002318859B2 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6075875A (en) * 1996-09-30 2000-06-13 Microsoft Corporation Segmentation of image features using hierarchical analysis of multi-valued image data and weighted averaging of segmentation results
US6711278B1 (en) * 1998-09-10 2004-03-23 Microsoft Corporation Tracking semantic objects in vector image sequences
US6272250B1 (en) * 1999-01-20 2001-08-07 University Of Washington Color clustering for scene change detection and object tracking in video sequences

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AT502516B1 (en) * 2005-10-05 2007-04-15 Advanced Comp Vision Gmbh Acv Tracking objects, especially persons/vehicles, involves checking trajectory segments using similarity measure, determining similarity values of permitted possible connections, connecting trajectory ends if similarity value exceeds threshold
EP1988505A1 (en) 2007-05-03 2008-11-05 Sony Deutschland Gmbh Method and system for initializing templates of moving objects
US8045812B2 (en) 2007-05-03 2011-10-25 Sony Deutschland Gmbh Method and system for initializing templates of moving objects
CN101299272B (en) * 2007-05-03 2012-11-28 索尼德国有限责任公司 Method and system for initializing templates of moving objects
EP3683768A1 (en) * 2007-05-03 2020-07-22 Sony Deutschland GmbH Method and system for initializing templates of moving objects
CN109643452A (en) * 2016-08-12 2019-04-16 高通股份有限公司 The method and system of lost objects tracker is maintained in video analysis
CN109215037A (en) * 2018-09-18 2019-01-15 Oppo广东移动通信有限公司 Destination image partition method, device and terminal device
CN109215037B (en) * 2018-09-18 2021-04-02 Oppo广东移动通信有限公司 Target image segmentation method and device and terminal equipment

Also Published As

Publication number Publication date
AU2002318859B2 (en) 2004-11-25

Similar Documents

Publication Publication Date Title
Messelodi et al. A computer vision system for the detection and classification of vehicles at urban road intersections
US9275289B2 (en) Feature- and classifier-based vehicle headlight/shadow removal in video
Senior et al. Appearance models for occlusion handling
Senior Tracking people with probabilistic appearance models
Zhu et al. VISATRAM: A real-time vision system for automatic traffic monitoring
Deng et al. A symmetric patch-based correspondence model for occlusion handling
US20160154999A1 (en) Objection recognition in a 3d scene
Karasulu Review and evaluation of well-known methods for moving object detection and tracking in videos
CN108830319B (en) Image classification method and device
Kermad et al. Automatic image segmentation system through iterative edge–region co-operation
Rajagopal Intelligent traffic analysis system for Indian road conditions
Babaei Vehicles tracking and classification using traffic zones in a hybrid scheme for intersection traffic management by smart cameras
AU2002318859B2 (en) A Method for Video Object Detection and Tracking Based on 3D Segment Displacement
WO2000018128A1 (en) System and method for semantic video object segmentation
Froba et al. Face tracking by means of continuous detection
Minetto et al. IFTrace: Video segmentation of deformable objects using the Image Foresting Transform
Monteleone et al. Pedestrian tracking in 360 video by virtual PTZ cameras
Michael et al. Fast change detection for camera-based surveillance systems
WO2006076760A1 (en) Sequential data segmentation
Ren et al. Multi-view visual surveillance and phantom removal for effective pedestrian detection
Haxhimusa et al. Evaluating minimum spanning tree based segmentation algorithms
Palaio et al. Ground plane velocity estimation embedding rectification on a particle filter multi-target tracking
AU2002318862B2 (en) A Method for Video Object Detection and Tracking Using a Dense Motion or Range Field
Osuto et al. Vision Based Road Traffic Density Estimation and Vehicle Classification for Stationary and Moving Traffic Scenes during Daytime
Sivagami et al. Adaptive Foreground Object Extraction in Real Time Videos Using Fuzzy C Means with Weber Principle

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)
MK14 Patent ceased section 143(a) (annual fees not paid) or expired