AU2003202411A1

AU2003202411A1 - Automatic Annotation of Digital Video Based on Human Activity

Info

Publication number: AU2003202411A1
Application number: AU2003202411A
Authority: AU
Inventors: Julian Frank Andrew Magarey
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-03-22
Filing date: 2003-03-21
Publication date: 2003-10-16
Anticipated expiration: 2023-03-21
Also published as: AU2003202411B2

Description

S&F Ref: 621256

AUSTRALIA

PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT

ORIGINAL

Name and Address of Applicant: Canon Kabushiki Kaisha 30-2, Shimomaruko 3-chome, Ohta-ku Tokyo 146 Japan Julian Frank Andrew Magarey Actual Inventor(s): Address for Service: Invention Title: Spruson Ferguson St Martins Tower,Level 31 Market Street Sydney NSW 2000 (CCN 3710000177) Automatic Annotation of Digital Video Based on Human Activity ASSOCIATED PROVISIONAL APPLICATION DETAILS [33] Country [31] Applic. No(s) AU PS 1323 [32] Application Date 22 Mar 2002 The following statement is a full description of this invention, including the best method of performing it known to me/us:- 5815c -1- AUTOMATIC ANNOTATION OF DIGITAL VIDEO BASED ON HUMAN

ACTIVITY

Technical Field of the Invention The present invention relates generally to digital video analysis and, in particular, to the classification of objects in a video sequence as human and the creation of metadata of appearances of humans in the video sequence.

Background Consumer video presents a huge challenge to the art of content annotation. The volume of data generated is vast, the quality is variable, and the attention span of the potential viewer is limited. Annotation, or content summarisation, is the task of indexing the content of the video so as to highlight features or portions of interest to a viewer, linking them to sites in the video, and storing the results in some readily searchable format. The results of annotation are often referred to as metadata. Such indexing metadata is particularly useful for reviewing and editing the video, since it allows random access to the extracted features and portions. An analogy of such metadata for text data is the index appearing at the end of a book, which lists items of interest in alphabetical order together with links to the places where they appear in the text, which is usually one or more page numbers or sections.

The task of annotation may be carried out manually, but this is clearly a timeconsuming task. Computer-based systems designed to perform automatic or even semiautomatic annotation, that is, assisted by a human user, offer a clear advantage of convenience over manual systems. The problem is that there is generally a trade-off between the level of semantic abstraction and the amount of user input required to obtain the annotation. That is, simple low-level descriptions, such as colour histograms or motion statistics, may be obtained fully automatically. However, such low-level descriptions are less than meaningful to casual human viewers. In order to extract more meaningful descriptions, a system usually requires a significant amount of user guidance.

One feature of consumer video that is commonly regarded as meaningful is the presence of people, their movements and interactions. Such descriptions would enable queries of the kind "Locate all appearances of this person" to be handled. Fully automatic systems that produce descriptions of human activity in consumer video are therefore highly desirable.

621256.DOC -2- Systems exist that attempt to automatically detect and track humans in video, both in real time and as post-processors. Some of these systems use elaborate models of the human body, either two-dimensional or three-dimensional, and attempt to fit the chosen model to detected moving regions in the video frame. These systems, while potentially highly accurate, usually require some kind of active tracking targets to be placed in advance on the filmed person, significant user assistance, or highly constrained environments. Such systems are therefore unsuitable for operation on general consumer video.

A more promising approach followed is to detect and track general moving objects, classify them as human or non-human using simple heuristics, and store the properties of the detected humans. Such a human classification step is often neglected in prior art systems, or is designed to exclude only spurious objects resulting from detection or tracking errors. The latter approach is still vulnerable to false positives triggered by, for example, mechanical objects or animals. Other systems use the presence of human skin, detected according to some trained skin colour model, as a classification cue. This helps to exclude many false positives, but many others remain.

In consumer video a person often becomes completely occluded, only to reappear some time later. It is clearly advantageous for later searching and browsing to be able to determine whether two non-time-overlapping (disjoint) detected persons are actually instances of the same person.

A further, possible final aspect of a complete system for automatically annotating video is the saving of the potentially huge volume of extracted human activity metadata in a convenient format. The metadata format should be structured so as to allow efficient searching by a client application, followed (if requested) by efficient retrieval of the indicated section of the raw video. Efficiency in this context means that only a subset of the metadata or data that is relevant to the query should be searched. Previous approaches to human-metadata extraction have not demonstrated such efficiency of result storage.

Summary of the Invention It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.

According to a first aspect of the invention, there is provided a method of classifying an object in a video sequence as human, said method comprising the steps of: estimating motion vectors of pixels of said object; 621256.DOC -3calculating a rigidity measure for said object, said rigidity measure quantifying an extent to which said motion vectors fit a model of uniform translation and rotation; and classifying said object as one of human or non-human dependent on said rigidity measure.

According to a second aspect of the invention, there is provided a method of detecting human activity in a video sequence, said method comprising the steps of: identifying three-dimensional objects within said video sequence; estimating motion vectors of pixels of said three-dimensional objects; calculating a rigidity measure for said three-dimensional objects, said rigidity measure quantifying an extent to which said motion vectors associated with each said three-dimensional objects fit a model of uniform translation and rotation; classifying each of said three-dimensional objects as either human or nonhuman using at least said rigidity measure; calculating a feature vector for each of said three-dimensional objects classified as being human; and labelling said three-dimensional objects classified as being human, wherein nontime-overlapping three-dimensional objects having similar feature vectors are labelled with a same label.

According to another aspect of the invention, there is provided an apparatus for implementing any one of the aforementioned methods.

According to another aspect of the invention there is provided a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the methods described above.

According to another aspect of the invention, there is provided a data structure for storing metadata of objects in a video sequence, said data structure comprising: a first-level sub-structure for storing, at each frame interval of said video sequence, position data of each object at that frame interval; and a second-level sub-structure for storing for each object, data defining an object set that said object is grouped in and data defining frame intervals where said object was present, wherein said object set is a group of objects.

Other aspects of the invention are also disclosed.

621256.DOC -4- Brief Description of the Drawings One or more embodiments of the present invention will now be described with reference to the drawings, in which: Fig. 1 shows a flow diagram of a method of extracting metadata of human activity from a sequence of video frames; Fig. 2 is a flow diagram of the sub-steps of a classification step performed during the method of extracting metadata of human activity; Fig. 3 shows a perspective camera coordinate system used for computing a rigidity measure; Fig. 4 is a flow diagram of the processes for computing the rigidity measure; Fig. 5A is a flow diagram of the sub-steps of the tentative labelling step performed during the method of extracting metadata of human activity; Fig. 5B is a flow diagram of the sub-steps of an alternative tentative labelling step; Figs. 6A to 6C show the contents of the hierarchically structured metadata; Fig. 7 illustrates a response to a typical content-based query using the human activity metadata; Fig. 8 shows a programmable device for performing the steps of the method of extracting metadata of human activity from the sequence of video frames; Fig. 9 shows a Graphical User Interface within which the content-based editing queue may be rendered; Fig. 10 shows a flow diagram of the main processing steps for detecting and tracking objects across the sequence of video frames; Fig. 11 shows a graphical representation of a data structure that may be used to store the output of the 3D segmentation; and Fig. 12 illustrates a sequence of video frames, with a window including the L most recently received frames, forming a "block" of pixel data.

Detailed Description including Best Mode Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and symbolic representations of operations on data 621256.DOC within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities.

Apparatus Fig. 8 shows a programmable device 700 for performing the operations of a method of extracting metadata of human activity from a sequence of video frames, which is described below. Such a programmable device 700 may be specially constructed for the required purposes, or may comprise a general-purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms presented herein are not inherently related to any particular computer or other apparatus.

The programmable device 700 comprises a computer module 701, input devices such as a keyboard 702 and mouse 703, and output devices including a display device 714. A Modulator-Demodulator (Modem) transceiver device 716 is used by the computer module 701 for communicating to and from a communications network 720, for example connectable via a telephone line 721 or other functional medium. The modem 716 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN).

The computer module 701 typically includes at least one processor unit 705, a memory unit 706, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output interfaces including a video interface 707, and an I/O interface 713 for the keyboard 702 and mouse 703, and an interface 708 for the modem 716 and a camera 750 through connection 748. A storage device 709 is provided and typically includes a hard disk drive 710 and a floppy disk drive 711. A CD-ROM/DVD drive 712 is typically provided as a non-volatile source of data. The components 705 to 713 of the computer module 701, typically communicate via an interconnected bus 704 and in a manner which results in a conventional mode of operation of the programmable device 700 known to those in the relevant art.

In another implementation the computer module 700 is located inside the camera 750.

The method may be implemented as software, such as an application program executing within the programmable device 700. The application program may be stored 621256.DOC -6on a computer readable medium, including the storage devices 709. The application program is loaded into the computer from the computer readable medium, and then executed by the processor 705. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the programmable device 700 preferably effects an advantageous apparatus for extracting metadata of human activity from a sequence of video frames In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 712 or 711, or alternatively may be read by the user from the network 720 via the modem device 716.

The foregoing is merely exemplary of relevant computer readable mediums. Other computer readable media may be practiced without departing from the scope and spirit of the invention.

The programmable device 700 may be constructed from one or more integrated circuits performing the functions or sub functions and for example incorporated in the digital video camera 750. A user of the camera may record a video sequence, and using the processing methods described below, create metadata that may be associated with the video sequence to conveniently describe the video sequence, thereby permitting the video sequence to be used or otherwise manipulated with a specific need for a user.

Method of extracting metadata of human activity Fig. 1 shows a flow diagram of the main processing steps of a method 100 of extracting metadata of human activity from a sequence of video frames. The steps of method 100 are effected by instructions in the application program that are executed by the processor 705 of the programmable device 700 (Fig. Method 100 starts in step 105.

In step 110 the processor 705 receives frame data from a video sequence, with the frame data at each frame interval t comprising a two dimensional array x (or equivalently of pixel data, typically in some colour space such as RGB or LUV.

The frame data may be received directly from camera 750, or may be from a recorded video sequence stored on the storage device 709 or DVD inserted in the CD- ROM/DVD drive 712.

The current frame data is passed in step 115 to an object tracking module which performs semantic video object (SVO) detection and tracking. The object tracking module processes the frame data of the current frame together with frame data from 621256.DOC -7one or more previous frames, and returns an object label map where n is the output frame interval.

Some SVO tracking algorithms have an inherent latency. That is, there is a finite non-zero delay L between the input frame interval t and the output frame interval n, to which the SVOs oj k and object label map o, correspond. That is, n=t-L (1) To produce the object label map o, the object tracking module partitions the pixels x at frame interval n into sets, with each of those sets forming an SVO such that the pixels x belonging to the SVO o, make up the projection of a real-world object onto the plane of the image sensor of camera 750 at frame interval n. The integer k is a unique identifier for the SVO o'.

The object label map returned by the object tracking module is a 2D scalar array made up of the object identifiers k at each pixel x of frame interval n. If a pixel x is not assigned to an SVO it is deemed to be part of the background, whose identifier k is by convention set to zero.

The detection of SVOs on is a task which (in principle) can be carried out independently at each frame. The tracking of SVOs o, in contract thereto, requires for the object tracking module to assign persistent identifiers k to each SVO ok detected in each frame, such that the SVO o' (with identifier k at frame interval n) corresponds to the same real-world object as SVO ok in the preceding output frame interval n-i. The SVO tracking task therefore requires at least some memory of previous frames.

Any one of a number of published SVO tracking algorithms may be used for performing step 115. The preferred SVO tracking algorithm is described in more detail below.

The main requirement for the object tracking module is a capability to maintain a persistent identifier k for each SVO o' as long as the SVO o is at least partly visible.

The generality of application of the method 100 is improved if the object tracking module is also capable of handling a moving camera and does not require a "background" image to be captured or maintained.

621256.DOC The resulting sequence of two-dimensional SVOs o k may be compactly written as a cumulative SVO 0 k having the property: OU,0 (2) The cumulative SVO O k may be regarded as being made up of voxels n), with voxel n) belonging to cumulative SVO 0' if and only if pixel x belongs to the SVO o (x,n)e0,k, (3) After step 115 performed SVO detection and tracking to produce the object label map method 100 advances to step 120 where the cumulative SVOs O k are classified as human or non-human. The classification is sticky, which means that once a cumulative SVO 0' is classified as human, it remains so classified for the duration of the appearance of the associated SVOs o k The classification of the cumulative SVOs O k as human or non-human is based on two aspects: Presence of skin coloured voxels and Rigidity non-rigidity of SVO motion.

If a cumulative SVO 0O has a significant proportion of its voxels n) classified as being human skin pixels, and the cumulative SVO O k moves non-rigidly, then that cumulative SVO Oa is classified as human. The human classification step 120 is described in more detail below.

A cumulative SVO 0 classified as being human by step 120 is hereinafter termed an appearance Ak Each appearance Ak also has a unique identifier k, which it inherits from its corresponding cumulative SVO O. New appearances A k are also assigned a start frame property st(k), which is the current frame interval n.

As a real-world object is occluded by other objects or moves out of frame, the object tracking module (step 115) loses track of the corresponding SVO o, that is, the object label map contains no pixels x labelled with its identifier k, causing the corresponding SVO o, and therefore also the cumulative SVO to expire. If the cumulative SVO O was classified by step 120 as being human and hence an appearance 621256.DOC A, the appearance A also expires. Expired appearances A' no longer need updating with new frame data so they are written as expired appearances Ak. Step 125 closes off newly expired appearances Ak and adds them to to a list of expired appearances. An end frame property end(k) of each newly expired appearance Ak is also set to the previous output frame interval, n-1.

Step 130 follows where the processor 705 increments the identifier k to that of the next current (non-expired) appearance A Step 135 then saves the frame metadata of the appearance A k to the storage device 709. A full description of the frame metadata appears later. In particular, the identifier k together with the coordinates of a rectangular box bounding the pixels x of the appearance A, k in the frame of frame interval n are stored.

Often, the person associated to an expired appearance Ak will later reappear in the frame. The object tracking module (step 115) will detect and track the cumulative SVO having a new identifier 1, related to that reappearance of the person. Step 120 will also assign a new appearance A' to that cumulative SVO O0 when it is classified as being human by step 120. It is clearly advantageous to be able to determine whether two or more non-time-overlapping (disjoint) appearances A' are actually instances of the same person. To make such a determination, some form of clustering of appearances A' by the identity of the corresponding person is required, and for each of the clustered appearances A to be labelled with a long-term identifier (LID) that is designed to be unique to a particular person. The identifier k used for appearances A k is hereafter referred to as the short-term identifier (SID) to distinguish it from the LID used for identifying a particular person.

Labelling is based on properties of each appearance accumulated over the duration of the appearance Each current appearance A k therefore has a corresponding feature vector V, whose aim is to capture a robust and yet distinctive property of the appearance A k The feature vectors V, of expired appearances Ak are frozen in their final states to become permanent feature vectors Vk. The preferred feature vector V, is a colour histogram, since the colour histogram of an appearance Ak is simple to compute and robust to changes of viewing angle and posture. Accordingly, to 621256.DOC enable labelling of appearances A: the feature vector VP of the appearance A k under consideration is updated in step 140. The details of the feature vector V k extraction are set out below.

Method 100 continues to step 145 where the appearance A' under consideration is assigned a tentative long-term identifier (TLID) w(k) by comparing the feature vector V of the appearance A' under consideration with the permanent feature vectors V of expired appearances Aj. The default TLID w(k) for a current appearance is k. The tentative labelling performed in step 145 is described in more detail below. The main constraint on the tentative labelling is that appearances A' with the same TLID w(k) must be disjoint, i.e. non-overlapping, in time.

In step 150 the processor 705 determines whether all current appearances A' has been processed by steps 130 to 145. If a current appearance A* remains, then steps 130 to 145 are repeated for each of the remaining current appearances A,.

Once all current appearances A' have been assigned a TLID the processor 705 determines in step 155 whether more frame data I(x) is available for processing. If more frame data is available for processing, the method 100 returns to step 110 where the frame data from the next frame in the video sequence is received.

When the processor 705 determines in step 155 that the end of the video sequence has been reached, that is there are no more frame data available for processing, all current appearances A k are finally classified as expired appearances Ak in step 160.

In step 165 a final LID W(k) is assigned to each expired appearance Ak, being the TLID w(k) of the cluster of expired appearances Ak to which the particular expired appearance Ak belongs.

The metadata extracted during the steps of method 100 is saved in step 170.

Preferably the metadata is saved in a form that is convenient for later search, retrieval, and (possibly) object-based editing. The preferred form used for saving the human activity metadata is set out below in detail.

The method 100 of extracting metadata of human activity from a sequence of video frames end in step 175. There now follow more details on same of the steps of method 100.

621256.DOC -11- Object detection and tracking Semantic video object detection and tracking is performed by the object tracking

J

module in step t 15 (Fig. 1) from the frame data of the current frame together with frame data from one or more previous frames, and returns an object label map o Fig. 10 shows a flow diagram of the main processing steps of step 115 for detecting and tracking objects across the sequence of video frames. In the description that follows, the camera 750 is assumed to be fixed. Generalising step 115 to a moving camera follows.

Step 115 starts in sub-step 350 where pixel data of a next frame in a video sequence is received as input. Sub-step 355 follows wherein the processor 705 performs spatiotemporal (3D) segmentation of the video data based on colour, with the video data including the frame data of the L most recently received frames. Fig. 12 illustrates a sequence of video frames, with those frames illustrated in phantom being future frames not yet received by the programmable device 700. A window 490 includes the L most recently received frames, forming a "block" of pixel data. The block of data also includes a view plane 491, which, in the preferred implementation, is coplanar with the oldest frame in the current window 490. The position of the view plane 491 determines the output frame interval n relative to the input frame interval t. As a new frame l(x) is received in sub-step 350, the new frame is added to the window 490, while the oldest frame in the window 490 is removed from the window 490, thereby maintaining L frames in the window 490. The segmentation sub-step 355 is performed as each new frame I(x) is received in sub-step 350 to form a set of three-dimensional connected segments so that every pixel in the block belongs to one segment Si, and all pixels belonging to the same segment S, have homogeneous colour values.

The output of the 3D segmentation sub-step 355 at each frame interval t is an updated graph G, where vertices represent the 3D segments Si, and edges represent boundaries between the 3D segments Si. Fig. 11 shows a graphical representation of a data structure 900 that may be used to store the graph G. The data structure 900 is essentially an adjacency list representation of the segmentation. The data structure 900 includes a segment list 910 and a boundary list 920 for each segment Si in the segment list 910. The segment list 910 contains for each segment S a segment identifier i and the coordinates of all pixels in that segment Si.

Graph G is a persistent data structure across video frames. That is, graph G is dynamically updated during the segmentation sub-step 355 and incorporates the new frame data received in sub-step 350 into the existing graph G. Data associated with 621256.DOC -12each 3D segment Si persists between frames, and additional data stored with the vertices and edges of the 3D segment Si in the graph G will also be maintained between frames.

Sub-step 360 follows where the processor 705 computes a per-segment absolute motion estimate using only the intrinsic information of the segments Si from the 3D segmentation sub-step 355 themselves. By using motion estimation derived from the segments Si themselves, step 115 avoids the need to compute a motion field estimate using a separate motion estimation algorithm, and so avoids problems of poor demarcation of estimated motion fields.

Sub-step 360 computes an estimate of the 2D motion of each segment Si. The estimation is performed over a number of frames as Si intersects with the moving view plane 491. In the following description, the absolute 2D motion of each segment Si is estimated over the entire life-span of Si, that is, from when Si first intersects the view plane 491, to the current output frame interval n.

The 2D intersection of the 3D segment Si with the view plane 491 (Fig. 12) is written as When a 3D segment Si intersects the view plane 491 for the first time, a 2D centroid of 2D intersection is saved in the persistent data structure 900 containing all 3D segments Si, that is, graph G.

In later frame intervals, for each existing 3D segment Si, an absolute motion is estimated by calculating the vector difference between a current centroid (Yi of2D intersection and the initial centroid (4,Yi x=xi _o (4) y= Yi i Unfortunately, the absolute motion (Axi,A^y) of 3D segment Si may give a spurious estimate of the absolute 2D motion of a segment S, in two cases, namely: An occluded 2D segment will have a centroid shift induced by the movement of the occluding segment and Fine intensity changes cause boundary changes of the 2D segment which in turn cause a jitter in the position of the calculated centroid yi Equivalently, the shape of the segment can change due to a change in the shape of the original subject.

621256.DOC -13- It may be shown that a centroid shift due to occlusion is necessarily smaller than the occluded 2D segment's true width and height, whereas a true motion may be greater than the corresponding width of the segment s.

Therefore, if the actual bounds of the unoccluded 2D segment defined by Axbound and AYbolf are known, then the true motion (A3F,,AT,) could be robustly distinguished in sub-step 365 from false motion due to occlusion by setting the threshold for significant segment movement in the x- and y-directions to be greater than the actual bounds (M un...d and AyI.. d respectively).

In general however, the actual bounds AxOUf a and Ay. of the full, unoccluded, 2D segment sn z are not known, as a segment may first appear already occluded by another segment. However, estimates Ai" and of the actual b oun d ,,nd bounds AX 1 and Ay, may be calculated when the segment S first intersects the view plane 491, and these estimates AV d and A may be updated over time, as occluding segments reveal more of underlying occluded segment(s), thereby improving the bounds estimates.

The spurious 2D motion in the case of fine intensity changes which cause boundary changes of the segment (case above), typically leads to smaller centroid shifts than case and the false motions induced by it are therefore subsumed by the tests for case Referring again to Fig. 10, step 115 continues to sub-step 370 where the processor 705 classifies certain segments Si as foreground segments. In particular, from the above discussion, if a segment Si satisfies the following test: At Ak i ound OR A. Agound (6) then that segment Si is classified as being likely to represent a portion of an object with significant movement, i.e. aforeground segment.

Motion is an ephemeral cue. That is, the estimated motion (Ai 1 ,Ay) can go back below the threshold of Equation if the centroid (r of the 2D segment moves back to its original position. Therefore, the above classification as a foreground segment is treated as sticky, that is, once a segment Si has been classified as foreground, that segment Si stays classified as foreground for the rest of its lifespan. This is 621256.DOC -14implemented by storing the classification result with the persistent segment data in the graph G.

Having classified the foreground segments in sub-step 370, sub-step 375 follows, in which the 2D segments of foreground segments S, are grouped by the processor 705 into separate SVO's o k This is accomplished by labelling segments s, belonging to the same SVO with an object identifier k. This identifier is made persistent across time by storing it with the persistent segment data in the graph G.

A key point here is that the 3D segments S, typically have shorter life-spans than the SVOs o,,k that they are part of. This may, for example, be due to lighting changes and object movement over time. Hence, an SVO o,, k will typically consist of multiple segments Si, only some of which have been previously labelled with the object identifier k k of the particular SVO o, k The other segments Si are newly appearing or newly classified as foreground, but have yet to be labelled with an object identifier k. Labelled segments Si adjacent to those unlabelled foreground segments assist in ensuring persistent labelling of SVOs k across time.

To implement sub-step 375, an algorithm based on a graph connectedcomponent algorithm, known per se in the art, is used. A connected component is defined to be all adjacent segments Si with the same classification as assigned in sub-step 370.

In the following, it is assumed that the graph G consists only of segments and boundaries that intersect the view plane 491, so the graph G contains only adjacency information of relevance to the current frame interval n. Also, it is assumed that segments S, produced by the 3D segmentor in sub-step 355 are non-branching, that is, their intersection with the view plane 491 consists of a single contiguous 2D region sn'.

As the connected-component algorithm of sub-step 375 traverses graph G and identifies each unlabelled foreground segment the unlabelled segment Si is assigned an SVO identifier k. There are two cases to consider: 1. The unlabelled segment Si has an adjacent segment Sj that has been previously labelled with an SVO identifier k. If there is only one distinct SVO identifier k involved, then the unlabelled segment Si is labelled with the SVO identifier k of the adjacent segment S. This ensures that a consistent object identifier k is maintained and tracked across video frames. In the case where the unlabelled segment Si has adjacent segments Sj and Sk with different SVO identifiers I and m (which can occur if, for example, two previously separated objects move together), then some majority voting or other scheme is used to select the most likely SVO identifier for the unlabelled segment Si, and that 621256.DOC segment Si is labelled accordingly. Alternatively, depending on the application, a special object identifier can be used that represents an object that is the merger of two or more objects.

2. The unlabelled segment Si has no adjacent segment S that has previously been labelled with an SVO identifier. This case represents a new SVO entering the view plane 491. In this case such an unlabelled segment S, is labelled with a new unique SVO identifier.

Referring again to Fig. 10, step 115 proceeds to sub-step 380 where SVOs o,f are returned as output.

The description hereto assumed a fixed background against which the moving objects are detected. For the general case of a moving camera 750, a separate egomotion estimation algorithm, known to those skilled in the art, can be used to estimate the camera motion and hence its current pan/tilt position. The translational components of the induced relative image motion can then be subtracted from the absolute segment motions as estimated by sub-step 360 to give a true absolute motion for processing by later sub-steps of step 115.

The description hereto also assumed that future frame data is unavailable, as is the case in real-time streaming video object tracking. In the case of post-processing stored video sequences, it is possible during a second pass through the sequence to look ahead in time using the results of a previous pass of the object tracking algorithm through the sequence. There are at least two sub-steps of step 115 where this additional flexibility can be utilised.

Firstly, in sub-step 365, the updated bounds estimates A ound and AP^ound prevailing at the end of a segment's lifespan are more accurate estimates of the actual bounds Axfound and b...dAx of the unoccluded 2D intersection of each 3D segment Si.

Thus, the final bounds estimates A ound and A- bound may be used in a second-pass classification sub-step 370 to provide an improved foreground classification.

Secondly, the grouping sub-step 375 of the second pass may use the results of the segment foreground background classification sub-step 370 of the first pass from the beginning of that segment's lifespan. The absolute motion test (Equation used in substep 370 to classify segments Si as foreground has an intrinsic delay, as a moving segment Si may require several frame intervals before its estimated motion (A i Ay.) satisfies the threshold of Equation Accordingly, differently sized segments s, of a newly 621256.DOC -16appearing SVO o k can take different numbers of frame intervals to be classified as foreground by sub-step 370. This may cause the SVO o' to appear piecemeal, and in some cases a single semantic object o may be detected as several distinct SVOs Using the classification results from the first pass during the grouping sub-step 375 of the second pass will minimise this effect and allow SVOs o' to be detected fully-formed.

Note that the above described SVO tracking algorithm has an inherent latency equal to the length L in frames of the block 490 in Fig. 12. That is, there is a finite nonzero delay L between the input frame interval t and the output frame interval n, to which the SVOs o' and object label map correspond, as defined in Equation In the preferred implementation the non-zero latency L is set to 4.

Human Classification Fig. 2 is a flow diagram of the sub-steps of the classification step 120 of method 100 (Fig. 1) wherein the two classification aspects, that is the presence of skin and rigidity of SVO motion, are applied to each cumulative SVO O k in sequence. Both aspects involve computing a quantity and comparing that quantity with a predetermined threshold in order to make the classification decision.

The human classification step 120 starts in sub-step 201 where the next cumulative SVO O is considered by incrementing the SVO identifier k. Sub-step 203 follows where the processor 705 finds the number q of skin pixels in the SVO o. Many methods have been published for determining whether a pixel x represents human skin. Most are based on obtaining a large amount of training data, extracting a skin histogram in some colour space from the training data, and then comparing the pixel x under consideration with the skin histogram, either directly or using some parameterised model of the skin histogram. The result, after thresholding, is a binary map for each frame indicating the presence of skin pixels. Any of these methods may be used.

In the preferred implementation, a Gaussian model of the skin colour histogram in the Luv colour space is used. The Gaussian model of the skin colour histogram is encapsulated in two parameters: the mean colour (a 3-vector) p, and the (3 by 3) covariance matrix C.

A pixel x of the frame data in Luv space is labelled as skin if it satisfies the following condition: 621256.DOC -17- (I T C- (I (7) where ts is a skin detection threshold. In the preferred implementation, the values used for the skin detection threshold ts, the mean colour p, and the inverse of the covariance matrix C' are as follows: 0.01032 0.00207 -0.00132 C' 0.00207 0.04688 -0.02929 (8) -0.00132 -0.02929 0.05124 S70.511 u 16.27, and (9) [8.251 t 5 =9 In sub-step 205 the processor 705 accumulates the number q of pixels satisfying the condition of Equation over the history of the cumulative SVO O k to form a variable Q, denoting the volume of detected skin voxels in cumulative SVO O The total volume of the cumulative SVO denoted by 1 0 1, is also accumulated in sub-step 207.

Sub-step 209 computes a cumulative rigidity measure R' for the cumulative SVO O k which quantifies how well an estimated motion of the cumulative SVO O, fits a uniform translation rotation model over its whole extent in both time and space. If the uniform translation rotation model fits well, the cumulative SVO O k is moving rigidly and the cumulative rigidity measure R, is high. If the cumulative rigidity measure R' falls below a predetermined threshold Tr, then the cumulative SVO 0* is deemed to be moving non-rigidly, and the cumulative SVO O, may represent a person.

The basis for the cumulative rigidity measure R k is a perspective camera model shown in Fig. 3. A scene point P [X Y Z]T is projected to a point x [xy]I on the image plane 300 as follows: x f r(1X1) 621256.DOC -18wherefis the focal length of a pinhole camera model.

Note that in this section the image plane coordinates x and y are measured in absolute length units originating at the centre of the image plane: x= -j (12) y= -Yi (13) where i and j are conventional matrix coordinates (row and column), Lx and Ly are the dimensions of the frame (columns and rows respectively), and l ry3J is the pixel size (horizontal and vertical components).

Consider the case where the scene point P belongs to an object which translates and rotates undergoes rigid motion) between frames. The new position P' of the scene point is given by X' 1 z -Qy X Tx Y' z 1 Ox Y Ty (14) Z' fQy -Q 1 Z T z where T= [Tx Ty Tz]T is the translation vector and Q [Ox Or Qz]T is the vector of small Euler angles characterising the rotation.

It is also assumed that the camera 750 zooms by a fractional amount s between frames, so that the new focal lengthf is given by f The new pixel coordinates x' of the scene point P on the image plane 300 are given by (16) y' Z'

Y'

Combining Equations (15) and an expression may be obtained for the new image coordinates x' of the scene point P as follows: 621256.DOC 19- +fTX-zy z (17) x f(s+1) +(fS (17) -Ox+Qxy+ f 1+-n Y fzx The expression in Equation (17) is a non-linear relation between the image coordinates x and its correspondent x' under rigid motion and a zooming camera. It comprises two separate and independent equations, whose parameters are: The focal lengthf(global, known); The zoom s (global, unknown); The translation and rotation vectors T and Q (constant over the object, unknown); and The depth Z (varies over the object; unknown).

It is first assumed that the translation vector T is non-zero. The depth Z may be eliminated from the Equation leaving a single constraint which we may be written in linear form as: u(x) =0 (18) where: yx' x' xy' yy' y' 1] (19) is a data vector (of length 9) and T Tz TYRQX TZ Tz x T, TAQX Tz 0= -Txx Tznz TZfl TX (Tx Q z Ty Xs 1) (TzQ TXs 1) TyQXs+) is a 9-vector of unknown parameters. The data vector is computed using angular image plane coordinates [x y] given by [x [x y] (21) f 621256.DOC Next, the SVO on is considered. If the real-world object represented by SVO o k is moving rigidly, then the translation vector T and rotation vector 0 are constant for all pixels x in SVO The constraint of Equation (19) will therefore hold for all pixels x in SVO Equation (18) may equivalently be written as uf (22) where u k is a matrix with 9 columns formed by stacking the data vectors u(x) for all pixels x in SVO o, This is the rigidity constraint for the SVO o,.

It may be shown that the same constraint applies to objects whose translation T is identically zero. To determine whether an SVO o is moving rigidly, the matrix uR is formed and it is determined whether the constraint of Equation (22) is satisfied for any non-zero vector 0. This will occur if matrix u is not of full rank, i.e. has at least one singular value of zero.

The matrix u has as many rows as SVO o has pixels. Performing a singular value decomposition (SVD) on matrices u which each potentially contain thousands of rows is extremely computationally intensive. An equivalent, more efficient, method, is to form the 9 by 9 matrix m' as follows: mn yr ur(x (23) xeo The eigenvalues of matrix m" are the squares of the singular values of matrix uk. Therefore, the reciprocal of the smallest eigenvalue of matrix mk may be used as an indicator of the rigidity of the SVO o' at frame interval n. However, two further manipulations are performed to obtain a robust rigidity measure R First, the eigenvalue is normalised by the number of pixels in SVO o, which is denoted by n. The reciprocal of the normalised eigenvalue is then mapped to a number p' in the range [0, which provides the rigidity measure p' for the two-dimensional SVO o,: 1 p 1n (24) pn 1 K Arr m k n.

621256.DOC -21where Ain() denotes finding of the smallest eigenvalue of a matrix. Note that the computational complexity of the eigenvalue computation is independent of the size of the SVO o k K is a constant whose value is set so that a "rigidity threshold" Tr is somewhere around Forming the matrix m requires finding corresponding pixel pairs over the SVO o. This task may be performed by a standard motion estimation module, known to those skilled in the art of video processing. Each (backward) estimated motion vector represents the perspective projection of the change in the position of the object point P relative to the camera coordinate system [x y] between frame intervals n-1 and n. Note that the motion vectors received from the motion estimation module (in units of pixels) must be converted to absolute image plane coordinates as follows (see Equation x =x where fx(x) and fy(x) are the horizontal and vertical components respectively of the estimated motion vectorfn(x).

The rigidity measure pj is a measure of how well a rigid motion model fits the instantaneous SVO This is subject to inevitable random fluctuations from frame to frame, and hence might cause erroneous human classification of an SVO o In the preferred implementation, classification is based on a cumulative rigidity measure R k for the cumulative SVO It is clear from Equation (23) that the rigidity could be accumulated over the whole frame history of cumulative SVO O, by accumulating the individual matrices m, to form a cumulative matrix: Mk (26) xEO By analogy with Equation the cumulative rigidity measure R k is given by Rk 2 7 nn (Mn M1 K (27) Nk 621256.DOC -22where N k is the volume of the cumulative SVO O.

However, in the preferred implementation the accumulation has a finite memory, so that the more distant history of cumulative SVO 0'i is gradually discarded in favour of the more recent history. This time-weighting may be accomplished by the recursive formulae:

M

k +k aMM k (28) and n n +a N, (29) This is equivalent to an exponential time-weighting of the instantaneous rigidity matrices: M: =Za' .m The parameter a (in the range 11) is a memory length parameter. If the memory length parameter a is set to the value 1, infinite memory holds, such as in Equation (27) whereas, if the memory length parameter a has the value 0, the accumulation has zero memory and R k p. Notionally, the accumulation memory length is roughly a(t-oa).

Fig. 4 is a flow diagram of the processes used to compute the cumulative rigidity measure R k in sub-step 209 (Fig. The current frame data is delayed in a delay buffer 402 The frame data and the delayed frame data are then used by a backward motion estimator 405 to compute the estimated motion vectorf,(x).

Process 408 uses the object label map o, the SID k, and the estimated motion vectors to calculate the matrix m (using Equation and the number n, of pixels in SVO on.

Process groups 410 and 415 implement Equations (28) and (29) respectively to calculate the cumulative matrix M and the volume Nk of the cumulative SVO 0O.

Finally the cumulative rigidity measure R k is calculated by process 420 by implementing Equation (27).

621256.DOC -23 Referring again to Fig. 2, in sub-step 211 the processor 705 compares the fraction of the volume of detected skin voxels Q k of the total volume of the cumulative SVO 0 that is Q: /I O: I, with a predetermined threshold T, If the fraction Q k/1' Ok is smaller than the threshold then the cumulative SVO O k contains a small percentage of skin coloured voxels and hence is unlikely to correspond to a human. Step 120 then continues to sub-step 219 where it is determined whether all the cumulative SVOs O k have been classified.

If the fraction Qk i O1 is larger than the threshold then the cumulative SVO O' is likely to correspond to a human and step 120 continues to sub-step 213 where the processor 705 determines whether the cumulative rigidity measure calculated in substep 209, is smaller than the predetermined threshold If the cumulative rigidity measure R' is not smaller than the predetermined threshold T, then the cumulative SVO 0, is determined not to be representing a human and step 120 continues to sub-step 219.

Alternatively, if the cumulative rigidity measure R' is smaller than the predetermined threshold then the cumulative SVO O k is determined to represent a human and step 120 continues to sub-step 215 where the processor 705 determines whether the cumulative SVO O has previously been classified as being human. If the cumulative SVO O* has previously been classified as being human, then step 120 continues to sub-step 219. Alternatively, if the cumulative SVO O0 has not previously been classified as being human, the cumulative SVO O' is classified as being human in sub-step 217 before step 120 continues to sub-step 219.

Sub-step 219 determines whether all the cumulative SVO O k have been classified. If at least one cumulative SVO O' remains to be classified, then step 120 returns to sub-step 201. Alternatively, if all cumulative SVOs O k have been classified, then step 120 ends.

Feature vector extraction As set out in relation to step 140 (Fig. each cumulative SVO O k that has been classified as a human appearance Ak in step 120 has an associated feature vector n3 621256.DOC -24-

V,

k The preferred feature vector V, k is the colour histogram. Computation of the colour histogram is well known to those skilled in the art of image processing, but the method used in the preferred implementation is described briefly below.

Each colour component of the frame data Il(x) forming part of the twodimensional SVO o of appearance A, denoted as instantaneous appearance is first mapped to a range by a linear transformation based on the maximum and minimum possible values of the colour components in the colour space. A number B of bins for each colour component is chosen. With vector i [ij k] representing a bin in the colour space, with i, j, and k each being a colour component bin in the range a histogram value h, for the bin represented by vector i is given by h eKL J IJ (31) where: d is a divisor Q/B; La] indicates a round-towards-zero operation of a variable a; and eq(a, b) is an equality function that is 1 when its two vector arguments are equal, and 0 otherwise.

The histogram value h k is a scalar at each bin represented by vector i, with the scalar indicating the number of colour pixels within instantaneous appearance a, corresponding to the colour represented by that bin. A cumulative histogram Hk for the appearance A' may be computed by simple accumulation of the histograms hk obtained from each instantaneous appearance a k Cumulative histogram H' may be rearranged into the vector V of length B 3 as follows: V(iB 2 jB (32) The distance D(V V k between the feature vectors Vj and V may be any standard normalised histogram difference measure known to those skilled in the art of colour image processing. In the preferred implementation, the X2 test is used, which is defined as follows: 621256.DOC A+ VRiH R 2 H 2 where: H is the cumulative histogram reconstituted from the feature vector j; [Aj and A k are the volumes of the expired appearance Aj and appearance A: respectively i.e. the number of samples in the cumulative histograms H.

and H k respectively; and

R

1 and R 2 are defined as follows: 1 A: k (34) The z test is used because: It provides a normalised output in the range 1] that may be compared with an absolute threshold; It takes into account the number of samples in each histogram; and It is soundly based in the theory of statistical distributions.

Tentative labelling Tentative labelling is carried out in step 145 (Fig. 1) on current appearances A'n with the aim of assigning a TLD w(k) to each appearance A k Tentative labelling is managed using tentative clusters CJ which are sets of SID's k corresponding to expired appearances A, and at most one current appearance The cluster index j is the TLID of the constituent appearances A, and i.e.

k C 1 j ec* j A tentative cluster C. is labelled as active if that tentative cluster Cj contains a current appearance A, or inactive otherwise. Each tentative cluster Cj has a feature vector V associated with it that is obtained from the accumulation of the histograms of its constituent appearances A, and A k Each tentative cluster Cj also has an end frame end( C 1 that is the latest end frame of its constituent appearances A, and A k 621256.DOC 26 The main constraint on the tentative labelling is that appearances A, with the same TLID w(k) must be disjoint, i.e. non-overlapping in time. This is achieved by ensuring that current appearances A" are only compared with inactive tentative clusters Ci, and that the end frame end(Cj of the tentative cluster Cj is less than the start frame st(k) of the current appearance A, Fig. 5A is a flow diagram of the sub-steps of the tentative labelling step 145 that is carried out at each frame interval n for each current appearance A k Step 145 starts in sub-step 501 where the processor 705 determines whether the appearance A, is of sufficient duration. This is done by determining the duration of the appearance A which is the current frame interval n minus the start frame property st(k) of the appearance A, and determining whether the duration is above a predetermined threshold time This ensures that a sufficiently stable cumulative histogram Hk has been constructed for reliable comparison. If the duration is determined not to be sufficient, that is not above the predetermined threshold time then no tentative labelling is performed on the appearance A k and step 145 ends.

If the duration is determined to be sufficient, then sub-step 145 continues to substep 507 where the cluster index j is altered to point to a next tentative cluster C in reverse order of the end frame properties end(j) of the cluster Cj. The processor 705 then tests in sub-step 508 whether the cluster Cj is already labelled as active. If the cluster Cj is labelled as active, then sub-step 145 returns to sub-step 507 where the next tentative cluster Cj is considered. If the cluster Cj is labelled as inactive, then sub-step 509 determines whether current appearance A: and the cluster C. under consideration are disjoint in time by determining whether the start frame property st(k) of the current appearance A k is larger than, that is later than, the end frame property end(Cj of the cluster If the current appearance A, and the cluster C. under consideration are not disjoint in time, then step 145 returns to sub-step 507 from where the next tentative cluster Cj is considered.

621256.DOC -27- Alternatively, if the current appearance A k and the tentative cluster Cj are disjoint, then step 145 continues to step 511 where the feature difference D( Vj) between the current appearance A and the cluster C 1 is evaluated using Equation (33).

Sub-step 513 follows where it is determined whether the feature difference D( Vj) is below a predetermined threshold T1. If the feature difference Vj) is not below the predetermined threshold Th, then the current appearance Ak is considered not similar enough to the cluster C 1 i, and step 145 returns to sub-step 507 from where the next cluster C. is considered.

J

However, if the feature difference D(Vi, Vj) is below the predetermined threshold T 1 then the current appearance A' is considered to be of the same person as that represented by the cluster Ci, and the TLID w(k) of the current appearance A k is set to that of the cluster Cj, which is j, in sub-step 515. The identifier k is also added to cluster Cj and removed from any other cluster where it may have appeared (indicated by the previous value of TLID In step 516 the tentative cluster Cj is labelled as active. With the current appearance A k being assigned a TLID, step 145 ends.

Fig. 5B is a flow diagram of the sub-steps of an alternative tentative labelling step 145' which provides tentative labelling that is made sticky, that is, once a TLID w(k) has been assigned to an inactive tentative cluster Cj, it may not be changed. This may be implemented by inserting a further sub-step 505 for determining whether the current appearance A k still has the default TLID of k. The remainder of the steps are the same as that of step 145 described in relation to Fig. If it is dctcrmincd that the current appearance Ak already has been assigned to a tentative cluster Cj, then step 145' ends, whereas the current appearance A. is tentatively labelled otherwise by performing steps 507 to 516.

Sticky labelling, while more error-prone, is more useful if the result is to be acted on in real time, for example by an automated filming system, in which case a constantly changing TLID w(k) would be confusing to the system. In the sticky labelling 621256.DOC -28case shown in Fig. 5B, it is sensible to set a lower threshold Th in sub-step 513, thereby minimising the number of false tentative labellings.

Note also that when closing off an appearance A, its TLID w(k) is examined.

If its TLID w(k) is still equal to its SID k, a new inactive tentative cluster C, is created with TLID k containing one clement k, and its feature vector set to If not, its histogram Hk is merged with that of its tentative cluster and the feature vector Vj updated accordingly. The tentative cluster C. is then labelled as inactive, ready to receive new appearances. In both cases the end frame of the cluster is set to the end frame of the expired appearance A Saving metadata of human activity A three-level hierarchical structure is preferably used to save human activity metadata, which is convenient for later search, retrieval, and (possibly) object-based editing, for reasons set out below. From highest to lowest, the levels of the three-level hierarchical structure are: Clip-level metadata; Object-level metadata; and Frame-level metadata.

Each level contains summary and indexing information for the level below. The idea is that search and retrieval start at the highest level, which contains the clip-level metadata and is the most compact and quickest to digest, and then proceed to "drill down" to the desired portions of the lower level metadata. The indexing information enables this to be carried out efficiently.

The frame-level metadata is saved at every frame interval n for each current appearance A, in step 135. Referring to Fig. 6A where the contents of the frame-level metadata are illustrated, the frame-level metadata for each frame interval n consists of the instantaneous properties of the current instantaneous appearances which are the SID k, and simple location information such as centroid and bounding box, obtained from the object label map The object label map may also be stored.

The object-level metadata is a summary of the list of all expired appearances Ak saved in step 170. Referring to Fig. 6C where the contents of the object-level metadata are illustrated, each entry in the object-level metadata consists of the SID k for the 621256.DOC -29appearance Ak, and its LID start frame property st(k), and end frame property end(k).

The clip-level metadata is a summary of the object-level metadata, in the form of a lookup table as illustrated in Fig. 6B. Each entry, indexed by the LIDs 1, is a list of the SIDs k of the appearances Ak belonging to cluster i.e. those k such that W(k) 1.

The clip-level metadata also contains details about the input frame data such as the number of frames, the width and height of each frame, and a pointer to the raw frame data sequence. Details of how this metadata structure is used in a typical search and retrieval are set out below.

Search, retrieval, and editing using human metadata The hierarchical structure of the human activity metadata, illustrated in Figs. 6A to 6C, enables efficient responses to content-based queries. A typical content-based query is: "Find all appearances of a person with SID k, and edit them together in chronological order, drawing a box around the person". The processing steps performed by the processor 705 in response to this query and the manner in which the hierarchical structure of the human activity metadata is used are illustrated in Fig. 7. From the query the SID k of the appearance A k is extracted in step 601. Using the object-level metadata, and in particular the k-th entry of the object-level metadata, the LID W(k) for appearance Ak is identified.

With the LID W(k) of the person identified, in step 602 the SIDs li, lp, Ip of all the separate appearances A, are identified by looking up the LID W(k) in the clip-level metadata, which has the list of SIDs labelled with LID W(k).

In step 603 the start and end frame intervals, identified by the start frame property st(lp) and end frame property end(lp), of each appearance A, are retrieved from the object-level metadata. The frame-level metadata is accessed in step 604 to find the bounding box parameters for appearance A, by looking up the particular frame interval st(lp), and then within that frame interval listings, the SID 1p of appearance A, which has the coordinates of the rectangular box in frame interval st(l,) that bounds the appearance A In step 605 the frame data of all the separate appearances A, are extracted from the un-edited video sequence 650, that is the frame data IL(x) for frame intervals 621256.DOC from st(lp) to end(l4) for each of the appearances A, having LID and placed in an editing queue 660. Finally, in step 606, the rectangular box 670 that bounds the appearance A, is also added to each frame 665 placed in the editing queue 660.

The editing queue 660 is then passed to a rendering engine, such as that described below, to render the frames 665 as an annotated movie.

Pixel list metadata, defining the actual outline of the person in each frame 665, saved as the object label map at each frame interval, may be employed for objectbased editing, such as compositing the person onto an alternative background, in a manner well known to those skilled in the art of video production.

Rendering engine Fig. 9 shows a Graphical User Interface (GUI) 800 of a rendering engine within which the content-based editing queue 660 (Fig. 7) may be rendered. The GUI 800 is formed on display device 714 of the programmable device 700 illustrated in Fig. 8, and is controlled by the processor 705.

The GUI 800 includes a browser window 810 which allows the user to search and/or browse a database or directory structure for video sequences and into which files containing video sequences may be loaded. The video sequences are typically loaded from a CD-ROM/DVD inserted into the CD-ROM/DVD drive 712 (Fig. 8).

Each file containing a video sequence is represented by an icon 804 once loaded into the browser window 810. The icon 804 may be a keyframe of the video sequence.

When an icon 804 is selected by the user, its associated video sequence is transferred to the review/edit window 812. More than one icon 804 may be selected, in which case the selected media content will be placed in the review/edit window 812 one after the other.

After selecting the aforementioned icons 804, a play button 814 on the review/edit window 812 may be selected, for example by pointing a cursor controlled by the mouse 703 and clicking a button of the mouse 703. The video sequence(s) associated with the aforementioned selected icon(s) 804 are played from a selected position and in the desired sequence, in a contiguous fashion as a single presentation, and continues until the end of the presentation at which point playback stops. The video frame data is displayed within the display area 840 of the review/edit window 812.

A playlist summary bar 820 is also provided on the review/edit window 812, presenting to the user an overall timeline representation of the entire production being considered. The playlist summary bar 820 has a playlist scrubber 825, which moves 621256.DOC -31 along the playlist summary bar 820 and indicates the relative position within the presentation presently being played. The user may browse the production by moving the playlist scrubber 825 along the playlist summary bar 820 to a desired position to commence play at that desired position, typically by using the mouse 703. The review/edit window 812 typically also includes other viewing controls including a pause button, a fast forward button, a rewind button, a frame step forward button, a frame step reverse button, a clip-index forward button, and a clip-index reverse button. The viewer play controls, referred to collectively as 850, may be activated by the user to initiate various kinds of playback within the presentation.

The user may also initiate a human activity extraction function, in which case the processor 705 performs the functions of method 100 (Fig. 1) on the selected video sequence. The human activity metadata described in relation to Figs. 6A to 6C is then formed and stored on the hard disk drive 710 of the programmable device 700.

The user may then browse the video sequence by using the viewer play controls 850 until a person of interest is shown in the display area 840. By using the mouse 703, the user may point to and select that person within the display area 840.

Once the person of interest is selected, the user may initiate a content based query, such as "Find all appearances of the selected person, and edit them together in chronological order, drawing a box around the person". This query initiates the processor 705 to perform the steps described in relation to Fig. 7 to create the editing queue 660 (Fig. The editing queue 660 may then be played on the GUI 800, which renders the frames 665 containing the selected person bounded by a rectangular box 670 in the display area 840.

On the playlist summary bar 820, transition lines 822 illustrate borders of segments in the editing queue, such as segment 830. The borders correspond with frame intervals st(lp) and end(lp) of appearance AI, of the selected person. The length of the playlist summary bar between the respective transition lines 822 represents the proportionate duration of an individual segment compared to the overall presentation duration.

The segments 830 are selectable and manipulable by common editing commands such as "drag and drop", "copy", "paste", "delete" and so on. Automatic "snapping" is also provided whereby, in a drag and drop operation, a dragged segment is automatically inserted at a point between two other segments, thereby retaining the unity of the segments 830.

621256.DOC 32 The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including" and not "consisting only of'. Variations of the word comprising, such as "comprise" and "comprises" have corresponding meanings.

621256.DOC

Claims

1. A method of classifying an object in a video sequence as human, said method comprising the steps of: estimating motion vectors of pixels of said object; calculating a rigidity measure for said object, said rigidity measure quantifying an extent to which said motion vectors fit a model of uniform translation and rotation; and classifying said object as one of human or non-human dependent on said rigidity measure.

2. A method as claimed in claim I wherein, in step motion vectors of pixels appearing earlier in said video sequence contribute less to said rigidity measure than motion vectors of pixels appearing later in said video sequence.

3. A method as claimed in claim 1 or 2, comprising the further step of: calculating a ratio of pixels of said object having a colour of human skin, wherein said object is further classified dependent on said ratio.

4. A method of detecting human activity in a video sequence, said method comprising the steps of: identifying three-dimensional objects within said video sequence; estimating motion vectors of pixels of said three-dimensional objects; calculating a rigidity measure for said three-dimensional objects, said rigidity measure quantifying an extent to which said motion vectors associated with each said three-dimensional objects fit a model of uniform translation and rotation; classifying each of said three-dimensional objects as either human or non- human using at least said rigidity measure; calculating a feature vector for each of said three-dimensional objects classified as being human; and labelling said three-dimensional objects classified as being human, wherein non- time-overlapping three-dimensional objects having similar feature vectors are labelled with a same label.

621256.DOC 34- A method as claimed in claim 4 wherein said three-dimensional objects are further classified dependent on a ratio of pixels of said three-dimensional object having a colour of human skin. 6. A method as claimed in claim 4 or 5 wherein said feature vector is a colour histogram of said three-dimensional object. 7. A method as claimed in any one of claims 4 to 6 wherein the similarity of said feature vectors is measured using histogram comparison. 8. A data structure for storing metadata of objects in a video sequence, said data structure comprising: a first-level sub-structure for storing, at each frame interval of said video sequence, position data of each object at that frame interval; and a second-level sub-structure for storing for each object, data defining an object set that said object is grouped in and data defining frame intervals where said object was present, wherein said object set is a group of objects. 9. A data structure as claimed in claim 8, wherein said data structure further comprises a third level sub-structure for storing data defining said objects belonging to each of said object sets. A data structure as claimed in claim 8 or 9, wherein said position data is coordinates of a rectangular box bounding said object. 11. A data structure as claimed in any one of claims 8 to 10, wherein said data defining frame intervals where said object was present comprises a start frame interval and an end frame interval. 12. An apparatus for classifying an object in a video sequence as human, said apparatus comprising: means for estimating motion vectors of pixels of said object; 621256.DOC means for calculating a rigidity measure for said object, said rigidity measure quantifying an extent to which said motion vectors fit a model of uniform translation and rotation; and means for classifying said object as one of human or non-human dependent on said rigidity measure. 13. An apparatus as claimed in claim 12 wherein motion vectors of pixels appearing earlier in said video sequence contribute less to said rigidity measure than motion vectors of pixels appearing later in said video sequence. 14. An apparatus as claimed in claim 12 or 13, further comprising: means for calculating a ratio of pixels of said object having a colour of human skin, and wherein said object is further classified dependent on said ratio. 15. An apparatus for detecting human activity in a video sequence, said apparatus comprising: means for identifying three-dimensional objects within said video sequence; means for estimating motion vectors of pixels of said three-dimensional objects; means for calculating a rigidity measure for said three-dimensional objects, said rigidity measure quantifying an extent to which said motion vectors associated with each said three-dimensional objects fit a model of uniform translation and rotation; means for classifying each of said three-dimensional objects as either human or non-human using at least said rigidity measure; means for calculating a feature vector for each of said three-dimensional objects classified as being human; and means for labelling said three-dimensional objects classified as being human, wherein non-time-overlapping three-dimensional objects having similar feature vectors are labelled with a same label. 16. An apparatus as claimed in claim 15 wherein said three-dimensional objects are further classified as being human dependent on a ratio of pixels of said three- dimensional object having a colour of human skin. 621256.DOC -36 17. An apparatus as claimed in claim 15 or 16 wherein said feature vector is a colour histogram of said three-dimensional object. 18. An apparatus as claimed in any one of claims 15 to 17 wherein the similarity of said feature vectors is measured using histogram comparison. 19. A program stored in a memory medium for classifying an object in a video sequence as human, said program comprising: code for estimating motion vectors of pixels of said object; code for calculating a rigidity measure for said object, said rigidity measure quantifying an extent to which said motion vectors fit a model of uniform translation and rotation; and code for classifying said object as one of human or non-human dependent on said rigidity measure. A program as claimed in claim 19 wherein motion vectors of pixels appearing earlier in said video sequence contribute less to said rigidity measure than motion vectors of pixels appearing later in said video sequence. 21. A program as claimed in claim 19 or 20, further comprising: code for calculating a ratio of pixels of said object having a colour of human skin, and wherein said object is further classified dependent on said ratio. 22. A program stored in a memory medium for detecting human activity in a video sequence, said program comprising: code for identifying three-dimensional objects within said video sequence; code for estimating motion vectors of pixels of said three-dimensional objects; code for calculating a rigidity measure for said three-dimensional objects, said rigidity measure quantifying an extent to which said motion vectors associated with each said three-dimensional objects fit a model of uniform translation and rotation; code for classifying each of said three-dimensional objects as either human or non-human using at least said rigidity measure; code for calculating a feature vector for each of said three-dimensional objects classified as being human; and 621256.DOC 37- code for labelling said three-dimensional objects classified as being human, wherein non-time-overlapping three-dimensional objects having similar feature vectors are labelled with a same label. 23. A program as claimed in claim 22 wherein said three-dimensional objects are further classified as being human dependent on a ratio of pixels of said three- dimensional object having a colour of human skin. 24. A program as claimed in claim 22 or 23 wherein said feature vector is a colour histogram of said three-dimensional object. An program as claimed in any one of claims 22 to 24 wherein the similarity of said feature vectors is measured using histogram comparison. 26. A data structure substantially as herein described in relation to Figs. 6A to 6C of the accompanying drawings. 27. A method of classifying an object in a video sequence as human, said method being substantially as herein described in relation to Figs. 3 and 4 of the accompanying drawings. 28. A method of detecting human activity in a video sequence, said method being substantially as herein described in relation to Figs. 3 and 4 of the accompanying drawings. 29. An apparatus for classifying an object in a video sequence as human, said apparatus being substantially as herein described in relation to Figs. 3 and 4 of the accompanying drawings. 621256.DOC -38- An apparatus for detecting human activity in a video sequence, said apparatus being substantially as herein described in relation to Figs. 3 and 4 of the accompanying drawings. DATED this Twenty-first Day of March 2003 CANON KABUSHIKI KAISHA Patent Attorneys for the Applicant SPRUSON&FERGUSON 621256.DOC