WO2006092765A2 - Method of video indexing - Google Patents

Method of video indexing Download PDF

Info

Publication number
WO2006092765A2
Authority
WO
WIPO (PCT)
Prior art keywords
shots
segments
video
shot
data objects
Prior art date
2005-03-04
Application number
PCT/IB2006/050634
Other languages
French (fr)
Other versions
WO2006092765A3 (en)
Inventor
Mauro Barbieri
Nevenka Dimitrova
Lalitha Agnihotri
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
2005-03-04
Filing date
2006-03-01
Publication date
2006-09-08
Application filed by Koninklijke Philips Electronics N.V.
Publication of WO2006092765A2
Publication of WO2006092765A3

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G11B 27/32 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording on separate auxiliary tracks of the same or an auxiliary record carrier
    • G11B 27/327 - Table of contents
    • G11B 27/329 - Table of contents on a disc [VTOC]
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval of video data; Database structures therefor; File system structures therefor
    • G06F 16/71 - Indexing; Data structures therefor; Storage structures


Abstract

There is provided an apparatus (100) for indexing video data objects (110), said apparatus (100) comprising an input (110) for receiving the video data objects, and a processor (150, 160, 170, 180) for analyzing the video data objects (110) to identify therefrom occurrences of different categories of shots or segments (10, 20, 30) therein, each shot or segment (10, 20, 30) comprising a sequence of one or more images, each shot or segment (10, 20, 30) being identified in response to image feature content of its one or more images, said processor (150, 160, 170, 180) being operable to process the video data objects (110) based on at least the occurrences of the different categories of shots or segments (10, 20, 30) to provide video index classifications of the video data objects (110).

Description

Method of video indexing
The present invention relates to methods of video indexing, for example to methods of processing stored video data objects to extract therefrom information useable for indexing purposes. Moreover, the present invention also relates to apparatus operable to implement the methods.
Advances in computers, communication systems and networks, and data storage media capacity have resulted in recent years in the generation of large archives of video data objects, for example in servers of communication networks operable to deliver video data objects to customers, as well as in domestic video entertainment systems where users often accumulate large personal libraries of video data objects. These archives tend to grow with time and can potentially comprise a wide variety of subject matter in their video data objects. There arises therefrom a technical problem of adequately and automatically indexing video data objects to allow users to search through these archives to identify video data objects of interest. The archives can potentially comprise several terabytes of data objects in substantially disorganized form, and searching through them manually is a tedious and time-consuming task.
Although video data objects are, in some situations, stored with corresponding meta-data, such meta-data has been found in practice to be too coarse and also lacking in information content to enable certain types of searching operations to be executed.
Methods of automatically extracting semantically significant events from video are known. For example, there is described in published United States Patent No. US 6,721,454 a multi-level technique for detecting semantically meaningful events in video. In the multi-level technique, video sequences are visually analyzed at a first level to detect shot boundaries, to measure color and texture of content in the sequences and to detect objects in the content. At a second level of the technique, image objects present in the sequences are classified and the content in each shot is thereby summarized. Moreover, at a third level of the technique, there is applied an event inference module for inferring occurrence of events on the basis of temporal and spatial phenomena disclosed in the shot summaries.
Contemporary methods of automatically extracting semantically significant events from video tend to be highly computationally demanding such that use of these methods in domestic equipment of moderate cost is impractical.
It is known, for example from a conference paper "Content-based retrieval of video data by the grammar of film", Yoshitaka et al., Proc. IEEE Symposium on Visual Languages 1997, pp. 310-317, that it is feasible to devise a method of retrieving video data from films by way of specifying semantic contents of scenes. In connection therewith, there is a so-called "grammar of film" which is an accumulation of knowledge and rules for expressing certain semantics of a scene more effectively. Thus, features of video data objects can be observed as a consequence of the effects of the grammar of film. Methods employing the "grammar of film" to index video data objects are found to be more reliable in comparison to other ways of evaluating sequences of images in detail; such reliability is enhanced because indexing by attention to "film grammar" provides a closer approach to a semantic level of understanding of humans. However, use of the "grammar of film" is not appropriate in all circumstances, for example when video sequences are produced by video surveillance cameras where compliance with "film grammar" is not a requirement. It is therefore desirable to provide an alternative method of video indexing.
According to a first aspect of the invention, there is provided a method of indexing video data objects, said method comprising steps of:
(a) receiving the video data objects and analyzing them to identify occurrences of different categories of shots or segments therein, each shot or segment comprising a sequence of one or more images, each shot or segment being identified in response to image feature content of its one or more images; and
(b) processing the video data objects based on at least the occurrences of the different categories of shots or segments for providing video index classifications for the video data objects.
The invention is of advantage in that use of shot or segment information, namely field-of-view information, is capable of providing an enhanced method of video indexing. For the purposes of describing the invention, shots comprising one or more images can also be regarded as video content segments, such segments comprising one or more mutually related images by virtue of image feature content. However, a segment can include several corresponding shots, including fractional parts of shots. Optionally, in the method, said different categories of shots or segments include at least one of: a long-shot, a medium-shot and a close-up shot; said long-shot corresponding to one or more images depicting an entire area of action; said medium-shot corresponding to one or more images depicting one or more persons or objects of interest occupying substantially 50% of an area of said one or more images; and said close-up shot corresponding to a part of a person or object of interest. Substantially 50% of an area of said one or more images optionally corresponds to a range of 25% to 75% of said area, more optionally to a range of 40% to 60% of said area. Other definitions for "substantially 50%" are possible and are described qualitatively later. Moreover, other categories of shots or segments are possible, for example an extreme long-shot or an extreme close-up shot.
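By way of a hedged illustration only, the following Python sketch encodes the three categories and the optional 25% to 75% reading of "substantially 50%" as a simple decision rule; the threshold values, the names and the idea of a single "subject fraction" input are assumptions for illustration, not a rule prescribed by the invention.

    from enum import Enum

    class ShotCategory(Enum):
        LONG = "long-shot"          # depicts an entire area of action
        MEDIUM = "medium-shot"      # subject occupies substantially 50% of the frame
        CLOSE_UP = "close-up shot"  # depicts a part of a person or object of interest

    def categorize_by_subject_area(subject_fraction: float) -> ShotCategory:
        """Map the fraction of image area occupied by the subject of interest
        to a shot category, using the optional 25%-75% band given above."""
        if subject_fraction < 0.25:
            return ShotCategory.LONG
        if subject_fraction <= 0.75:
            return ShotCategory.MEDIUM
        return ShotCategory.CLOSE_UP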
Optionally, in the method, identification of the occurrences of the shots or segments involves computing at least one of:
(a) a measure of durations of the shots or segments;
(b) a measure of video content bit rates associated with the shots or segments;
(c) a number, size and position of person faces detectable in images comprising the shots or segments;
(d) a presence, size and position of overlaid text in the shots or segments;
(e) audio classification labels for the shots or segments;
(f) a measure of audio change of pace in the shots or segments;
(g) a measure of an average number of feature edges in images of the shots or segments;
(h) a measure of edge histogram computed from the number of feature edges in images of the shots or segments;
(i) a measure of homogeneous texture from images of the shots or segments;
(j) a measure of a number and area of regions resulting from image segmentation in the shots or segments;
(k) a measure of intensity and spatial distribution of motion activity within the shots or segments;
(l) an estimation of camera motion occurring within the shots or segments;
(m) a measure of a field-of-view difference between neighboring shots or segments;
(n) a measure of a ratio between areas in the images in the shots or segments which are substantially in-focus to those areas which are out-of-focus;
(o) a correlation in time of geometrically invariant properties of regions of images resulting from segmentation;
(p) a measure of average color drift in the shots or segments;
(q) a cumulative frame difference providing a measure of dynamic activity in a shot or segment; and
(r) a measure of temporal interval between substantially similar shots or segments.
The term "video content bit rates" is to be construed to relate to the number of bits employed to encode video shots or segments, taking into account a temporal duration of the video segments or shots. Moreover, audio classification labels are to be construed, for example, to relate to additional audio features including the number of speaking persons and audio loudness. Audio classification labels can include one or more of: speech, music, silence, combinations of speech and music, noise, combinations of speech and noise, environmental sounds (for example traffic), laughter, applause and crowd cheering.
Such measures and estimates are found in practice to be useful in combination with identification of shots or segments to provide more reliable indexing of video data objects.
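To make the combination of such measures concrete, the following Python sketch (NumPy assumed as a dependency) assembles a small per-shot feature vector from measures (a), (b) and (g) above; the remaining measures would be appended analogously. The function name and the simple gradient-threshold edge detector are illustrative assumptions, not prescribed by the method.

    import numpy as np

    def shot_feature_vector(frames: list, encoded_bits: int, fps: float = 25.0) -> np.ndarray:
        """Per-shot feature vector from measures (a), (b) and (g).
        frames: greyscale images of one shot, each a 2-D uint8 NumPy array.
        encoded_bits: number of bits used to encode the shot."""
        duration = len(frames) / fps                    # measure (a): shot duration
        bit_rate = encoded_bits / max(duration, 1e-9)   # measure (b): bits per second
        # Measure (g): average count of edge pixels per image, using a plain
        # gradient-magnitude threshold as a stand-in edge detector.
        edge_counts = []
        for f in frames:
            gy, gx = np.gradient(f.astype(float))
            edge_counts.append(int((np.hypot(gx, gy) > 32.0).sum()))
        avg_edges = float(np.mean(edge_counts)) if edge_counts else 0.0
        return np.array([duration, bit_rate, avg_edges])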
Optionally, in the method, the correlation in time is based on invariant properties dependent on color and shape. Color and shape information is generally relatively easy to extract from images and, in combination with identification of shots or segments, surprisingly provides more reliable automatic indexing of video data objects.
Optionally, in the method, the step of analyzing to identify the occurrences of the shots or segments employs at least one of: linear classifiers, statistical classifiers, neural networks, support vector machines, hidden Markov models. Such approaches are capable of enhancing reliability of video indexing whilst requiring more modest computing resources when implemented, thereby rendering the method useable in more modest apparatus, for example home video equipment.
Optionally, in the method, processing the video data objects includes training using example images and associated shot or segment classifications. Such an option of training in the method renders the method easier to customize to users' specific types of video data objects for obtaining enhanced reliability of video indexing.
Optionally, the method is operable to provide video indexing in at least one of video recorders and video-on-demand systems, to implement content-based features including at least one of: automatic video summarization, key-frame extraction for providing data for a table-of-contents, violent scene detection, video genre classification, intelligent chaptering, and automatic editing of home videos.
According to a second aspect of the invention, there is provided an apparatus for indexing video data objects, said apparatus comprising an input for receiving the video data objects, and a processor for analyzing the video data objects to identify therefrom occurrences of different categories of shots or segments therein, each shot or segment comprising a sequence of one or more images, each shot or segment being identified in response to image feature content of its one or more images, said processor being operable to process the video data objects based on at least the occurrences of the different categories of shots or segments to provide video index classifications of the video data objects. Optionally, the apparatus includes functional units including:
(a) a shot-cut detector or a video segmentation unit for segmenting the video data objects into corresponding constituent shots or segments;
(b) a feature extractor for extracting features of the constituent shots or segments;
(c) a statistical classifier for performing a statistical analysis of the extracted features; and
(d) a video summarizer for analyzing statistical output data from the statistical classifier for generating said video index classifications of the video data objects.
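As an informal illustration of how these units cooperate, the following Python sketch expresses the dataflow of units (a) to (d) as a plain function pipeline; the callables and their signatures are placeholders, not an implementation of the apparatus.

    from typing import Callable, Iterable

    def index_video(frames: Iterable,
                    detect_cuts: Callable,       # unit (a): frames -> list of shots
                    extract_features: Callable,  # unit (b): shot -> feature vector
                    classify: Callable,          # unit (c): vector -> class probabilities
                    summarize: Callable):        # unit (d): (shots, probabilities) -> summary
        """Dataflow of the functional units, expressed as a pipeline."""
        shots = detect_cuts(frames)                     # constituent shots or segments
        vectors = [extract_features(s) for s in shots]  # extracted features
        probabilities = [classify(v) for v in vectors]  # statistical analysis output
        return summarize(shots, probabilities)          # video index classifications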
Optionally, the apparatus is adapted for incorporation into video recorders and video-on-demand systems. Moreover, the apparatus is beneficially operable to provide at least one of: automatic video summarization, key-frame extraction for providing data for a table-of-contents for user presentation, violent scene detection, video genre classification, intelligent chaptering, and automatic editing of home videos.
According to a third aspect of the invention, there is provided software recorded on a data carrier, said software being executable on computing hardware for implementing a method according to the first aspect of the invention, said method concerning video indexing of video data objects based on determination of shots or segments included in the video data objects. It will be appreciated that features of the invention are susceptible to being combined in any combination without departing from the scope of the invention.
Embodiments of the invention will now be described, by way of example only, with reference to the following diagrams wherein:
Figure 1 is a graphical illustration of a long-shot, a medium-shot and a close- up shot; and
Figure 2 is a schematic diagram of an apparatus for analyzing video content according to the invention.
In order to describe the present invention in contradistinction to known techniques for analyzing video data objects, aspects of these known techniques will firstly be elucidated.
Contemporary video data object content analysis techniques utilize algorithms derived from image processing in general. These content analysis techniques include pattern recognition and artificial intelligence that aim to create annotations of video material automatically. Such annotations relate to a wide spectrum of features, from low-level signal-related properties such as color and texture to higher-level information such as the presence of faces. It is contemporary practice to search for and retrieve video data objects in large unstructured archives of video content after the content has been indexed using content analysis techniques. Results of such content analysis techniques are used, for example, for detecting occurrences of commercials and advertisements, for implementing intelligent chaptering, and for providing video previews and video summaries.
Thus, contemporary content analysis techniques aim to automatically index video content by extracting various features from images of the video content, for example image color, texture, shape and motion, as well as from audio associated with the video content, for example from audio energy and audio frequency spectra. When analyzing low-level features, contemporary content analysis techniques aim to imitate human understanding by detecting image features such as faces, image objects, audio silences, music and similar. In consequence, contemporary content analysis techniques are limited by inherent limitations of the algorithms employed to implement them and are therefore capable of identifying only a very limited set of features, for example faces, outdoor environments and indoor environments.
The inventors have appreciated that it is beneficial when implementing automatic content analysis to utilize additional information relating to fields of view of video segments. Such additional information concerns a predefined set of categories such as close-up shot, medium shot, long shot and similar. It is beneficial to use the additional information to significantly enhance content analysis, such as in respect of key-frame extraction, scene boundary detection, summarization, automatic editing and such like.
The inventors have further appreciated that aesthetic aspects of video content, for example light, color, space, time, motion and sound, contribute in a significant manner to a message conveyed by the video content. Such aesthetic aspects are found in practice to be determined by the aforementioned "film grammar". An example of "film grammar" in video content arises when two persons are mutually involved in a dialogue, wherein it is conventional filming practice to employ close-up shots of the two persons alternated with medium shots of the two persons together in a same location.
In order to further elucidate the present invention, "film grammar" will be further described. In professional video production, certain common conventions are used which are generally referred to as "the grammar of film". Film grammar includes conventions for conveying meaning by way of particular camera filming techniques and editing techniques when generating video content. For example, video content producers convey meaning through video by choice of field of view. Such field of view pertains to a size of a subject in relation to a size of an overall corresponding image frame including the subject. Moreover, the field of view is technically dependent on a focal length of a camera lens employed to capture the image frame. Based on fields of view, video content images can be categorized pursuant to three basic groups: long shots, medium shots and close-up shots. These three basic groups will now be further elucidated.
Referring to Figure 1, there is illustrated a long-shot wherein an image 10 captures an entire area of action, for example a place, persons at the place and inanimate objects at the place, to provide an overall appearance. In aforementioned film grammar, long-shots are employed to establish all elements in a scene so that viewers of corresponding video content including the long-shots are able to appreciate persons depicted in the scene and their spatial location and relationship to the scene. For example, entrances and exits of persons from the scene are conventionally, pursuant to aforesaid film grammar, conveyed to viewers by way of long-shots. However, it is also conventional film grammar practice to employ long-shots sparingly in view of the limited resolution and screen size provided in contemporary video content presentation apparatus, for example televisions with a 40 cm diagonal measurement to their screens.
In Figure 1, there is illustrated a medium-shot wherein an image 20 captures a subject, for example a person or an inanimate object, occupying approximately as much of an area of an image as does corresponding surroundings. For example, persons are typically filmed from above their knees, or just below their waists so that their gestures, facial expressions and movements are visible to viewers. It is contemporary film grammar practice that medium-shots generally comprise a majority of images in a given film or video content. Referring again to Figure 1, there is illustrated a close-up shot such that a corresponding image 30 in video content conveys a relatively small proportion of the aforesaid scene. In film grammar, close-up shots are used to focus attention on a person's feelings or reactions, for example as represented in facial expressions. Thus, close-up shots are often employed in interviews to show a person in a state of emotional excitement, grief or joy. Close-up shots are especially useful for performing the following functions:
(a) playing up narrative highlights: such playing up relates to important dialogue, action performed by a person, or the reaction of a person. Playing up, pursuant to aforementioned film grammar, is employed whenever dramatic emphasis or increased audience attention is required;
(b) isolating significant subject matter and eliminating all non-essential material from view: such isolation and elimination is employed to concentrate viewer attention on an important action, for example a particular object or a meaningful expression on a person's face;
(c) magnifying small-scale actions: such magnification is employed when a given action is too small to be adequately represented in a medium-shot or long-shot;
(d) providing a time lapse: insertion of a close-up shot is beneficially used to shorten a time interval needed to convey otherwise temporally lengthy actions;
(e) distracting a viewer's attention: a close-up shot is beneficially employed to disguise a "jump-cut" caused by mismatched or missing action in video content;
(f) substituting for hidden action: for example, substituting for an action which cannot be filmed for physical reasons;
(g) presenting reactions of off-screen persons: a reaction of an off-screen person may, in certain circumstances, be more aesthetically significant than an action performed by a principal person on-screen; and
(h) cueing viewers of video content on how they should react: such cueing is susceptible to stimulating a viewer to have a similar feeling to that presented to the viewer in the close-up shot, for example a portrayal of fear, tension, awe, pity and such like.
Thus, the present invention is concerned with apparatus capable of receiving video content comprising video data objects, for example stored in a database or extracted from a database, and analyzing the video content to identify, amongst other things, the occurrence of long-shots, medium-shots and close-up shots for use in appropriately indexing the video content. The long-shots, medium-shots and close-up shots are optionally determined in response to spatial information included in their corresponding sequences of images, rather than merely in response to the duration of the shots as is known in the art. Such analysis of video content has been found to result in more reliable indexing.
The present invention therefore provides an apparatus for analyzing video content and thereby appropriately indexing the video content. The apparatus can be implemented in practice as dedicated image processing electronic hardware, by way of software executable on computing hardware, or as a mixture of such hardware and software.
Referring to Figure 2, the apparatus is indicated generally by 100 and includes an input 110 for receiving video content V including video data objects, and an output 120 for providing analysis data which is subsequently useful for video content indexing purposes, for example when a viewer is desirous to search for a given type of video data object. The apparatus 100 has functional units, implemented either by way of hardware or software or a combination of both. The functional units include a shot-cut detector 150 for generating indications of identified shots 155, a feature extractor 160 for generating feature vectors 165, a statistical classifier 170 for generating field-of-view class probabilities 175, and a video summarizer 180 for generating a video summary at the output 120 for video indexing purposes.
In operation, the apparatus 100 receives video content at its input 110, which is passed to the shot-cut detector 150. The shot-cut detector 150 is operable to segment the video content into corresponding constituent shots, each shot comprising a sequence of one or more images. Beneficially, the detector 150 utilizes contemporary methods of shot-cut detection, for example as described in the publication "Video Keyframe Extraction and Filtering: A Keyframe is not a Keyframe to Everybody", Proceedings of the ACM Conference on Information and Knowledge Management, November 1997, which is hereby incorporated by reference. For each shot identified in the video content, the detector 150 in co-operation with the feature extractor 160 is operable to process the shot to compute one or more of the following:
(i) a measure of duration of the shot;
(ii) a measure of a bit rate of the shot: for example, long shots usually represent more cluttered scenes and are therefore more difficult to encode; alternatively, a bit rate of one representative keyframe extracted from the shot is utilized;
(iii) the number and sizes of faces detectable in the shot;
(iv) a presence of overlaid text in the shot;
(v) an audio classification label for the shot: close-up shots are more likely to be classified as speech, whereas long shots are likely to be accompanied by a music background;
(vi) a measure of audio change of pace: for example, how frequently audio labels change during the shot;
(vii) a measure of an average number of edges in images per shot;
(viii) a measure of an average distance between edges, for example substantially vertical edges, in images per shot;
(ix) a measure of edge histogram, for example expressed as an average for the shot; such an edge histogram is employed in the MPEG-7 standard as described in ISO/IEC 15938:2001, "Multimedia content description interface - Part 3: Visual", which is hereby incorporated by reference;
(x) a measure of homogeneous texture: for example, such homogeneous texture is as defined in the aforementioned MPEG-7 standard;
(xi) a measure, for example an average, of a number and area of regions resulting from image segmentation in the shot;
(xii) a measure of intensity and spatial distribution of motion activity in the shot; the measure is beneficially determined according to the aforesaid MPEG-7 standard;
(xiii) an estimation of camera motion occurring within the shot: for example, the estimation can provide a measure of pan, tilt and zoom types of camera motion utilized in filming the shot;
(xiv) a measure of field-of-view category of neighboring shots present in the video content: for example, long shots are firstly used to establish a place of a scene, followed by close-up shots to capture a dialog between main persons portrayed in the video content;
(xv) a measure, for example an average, of a ratio between areas in one or more images in the shot which are in focus to those areas which are out-of-focus;
(xvi) a correlation in time, for example between successive shots, of geometrically invariant properties of regions resulting from segmentation, for example color and shape, and their relative spatial arrangement; and
(xvii) a measure of an average color drift in the shot: such a measure concerns whether a color composition of the shot remains substantially constant during an entire duration of the shot, or whether numerous new colors are added as the shot progresses. The measure is beneficially computed from a color histogram difference between successive images or frames in the shot.
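As a concrete, hedged illustration of measure (xvii), the following Python sketch (NumPy assumed) computes an average color drift from color-histogram differences between successive frames, as described above; the bin count and the per-pixel normalization are arbitrary illustrative choices.

    import numpy as np

    def average_color_drift(frames: list, bins: int = 64) -> float:
        """Measure (xvii): mean color-histogram difference between successive
        frames of a shot. A value near zero indicates the color composition
        remains substantially constant for the shot's duration; larger values
        indicate numerous new colors appear as the shot progresses.
        frames: RGB images as uint8 NumPy arrays of identical shape."""
        diffs = []
        for prev, curr in zip(frames, frames[1:]):
            d = 0.0
            for channel in range(3):  # histogram each color channel separately
                h_prev, _ = np.histogram(prev[..., channel], bins=bins, range=(0, 256))
                h_curr, _ = np.histogram(curr[..., channel], bins=bins, range=(0, 256))
                d += np.abs(h_prev - h_curr).sum()
            diffs.append(d / prev[..., 0].size)  # normalise per pixel
        return float(np.mean(diffs)) if diffs else 0.0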
The statistical classifier 170 is trained on a substantially comprehensive set of manually annotated video shots, such annotation assigning to each shot a probability value representative of an estimated probability of the shot belonging to, for example, one of the aforementioned three groups, namely long-shot, medium-shot and close-up shot. It will be appreciated that there can be more or fewer than three groups; for example, extreme long-shot could be a fourth group which the classifier 170 is operable to utilize.
Reiterating the foregoing, the groups can be defined, for example, as follows:
(A) extreme close-up shot: for example, a detail of a person;
(B) close-up shot: for example, all of a person's head;
(C) medium close-up shot: for example, head and shoulders of a person;
(D) medium shot: for example, equal importance given to a person and associated surroundings;
(E) long shot: a person is included in their entirety in an image together with much of the associated surroundings; and
(F) extreme long shot: a person is captured as a small part of an image with associated surroundings.
The statistical classifier 170 optionally employs processing functions known in contemporary pattern classification. Such functions optionally include at least one of: linear classifiers, neural networks, support vector machines, hidden Markov models and such like. Beneficially, the statistical classifier 170 is taken through a "training" phase during which it is given examples of field-of-view segments with corresponding manually annotated labels, for example close-up shot, long-shot and so forth, which the classifier 170 is expected to classify. More optionally, training of the classifier 170 is undertaken using several feature vectors that are extracted from shots, each of which has been labeled according to the field of view which is to be detected by the classifier 170.
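A minimal sketch of such a training phase follows, assuming scikit-learn's support vector machine (one of the classifier options named above) and synthetic placeholder data standing in for real manually annotated shots:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Placeholder training set: one feature vector per manually annotated shot
    # (for example, as produced by a feature extractor such as unit 160) and
    # its field-of-view label; real data would replace these random values.
    X_train = rng.random((120, 3))                             # n_shots x n_features
    y_train = rng.choice(["long", "medium", "close-up"], 120)  # annotated labels

    classifier = SVC(probability=True)  # SVM with probability estimates
    classifier.fit(X_train, y_train)    # the "training" phase

    # For a new, unlabelled shot, obtain class probabilities of the kind the
    # classifier 170 passes on to the video summarizer 180:
    new_shot = rng.random((1, 3))
    print(classifier.predict_proba(new_shot))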
In some situations, screenplay information accompanying video content provides information describing types of shot employed in filming the video content. In the invention, temporal boundaries in video content are identified by way of "dialog alignment". Thus, audio labels are optionally used to locate audio scene descriptions and thereby include additional time stamps. When implementing the present invention, an analysis is employed for parsing screenplay information and thereby finding scene descriptions and matching them with potential time-stamped shot segments. The screenplay is thereby able to assist in identifying types of shots, for example a close-up shot or a long-shot, for assisting with video content indexing.
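Screenplay formats and shot-type vocabulary vary in practice, so purely as an assumed illustration, the following Python sketch scans screenplay text for common shot-type cues; the cue patterns and labels are hypothetical, and the resulting line positions would still have to be mapped to time stamps by the dialog alignment described above.

    import re

    # Hypothetical shot-type cues of the kind found in screenplay shot headings;
    # listed from most to least specific, since the first match per line wins.
    SHOT_CUES = {
        r"\bEXTREME CLOSE[- ]?UP\b|\bECU\b": "extreme close-up",
        r"\bCLOSE[- ]?UP\b|\bCU\b": "close-up",
        r"\bMEDIUM SHOT\b|\bMS\b": "medium shot",
        r"\bLONG SHOT\b|\bWIDE SHOT\b|\bLS\b": "long shot",
    }

    def shot_types_from_screenplay(text: str):
        """Return (line number, shot type) pairs for lines containing a cue;
        dialog alignment would then attach time stamps to these positions."""
        hits = []
        for number, line in enumerate(text.splitlines(), start=1):
            for pattern, label in SHOT_CUES.items():
                if re.search(pattern, line.upper()):
                    hits.append((number, label))
                    break  # first, most specific cue wins for this line
        return hits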
The present invention is suitable for employment in any situation where indexing of video content is required. For example, the present invention is susceptible to being utilized in video recorders and video-on-demand systems to implement content-based features such as automatic video summarization, key-frame extraction for providing data for a table-of-contents for presentation to a user, violent scene detection, video genre classification, intelligent chaptering and automatic editing of home videos, to mention a few examples. The present invention is also applicable to professional video editing software and systems to facilitate video editing and production.
In the accompanying claims, numerals and other symbols included within brackets are included to assist understanding of the claims and are not intended to limit the scope of the claims in any way.
It will be appreciated that embodiments of the invention described in the foregoing are susceptible to being modified without departing from the scope of the invention as defined by the accompanying claims. Expressions such as "comprise", "include", "incorporate", "contain", "is" and "have" are to be construed in a non-exclusive manner when interpreting the description and its associated claims, namely construed to allow for other items or components which are not explicitly defined also to be present. Reference to the singular is also to be construed to be a reference to the plural and vice versa.

Claims

CLAIMS:
1. A method of indexing video data objects (110), said method comprising steps of:
(a) receiving the video data objects (110) and analyzing them to identify occurrences of different categories of shots or segments (10, 20, 30) therein, each shot or segment (10, 20, 30) comprising a sequence of one or more images, each shot or segment (10, 20, 30) being identified in response to image feature content of its one or more images; and
(b) processing the video data objects (110) based on at least the occurrences of the different categories of shots or segments (10, 20, 30) for providing video index classifications for the video data objects (110).
2. A method as claimed in claim 1, wherein said different categories of shots or segments (10, 20, 30) include at least one of: a long-shot (10), a medium-shot (20) and a close-up shot (30); said long-shot (10) corresponding to one or more images depicting an entire area of action; said medium-shot (20) corresponding to one or more images depicting one or more persons or objects of interest occupying substantially 50% of an area of said one or more images; and said close-up shot (30) corresponding to a part of a person or object of interest.
3. A method as claimed in claim 1, wherein identification of the occurrences of the shots or segments (10, 20, 30) involves computing at least one of:
(a) a measure of durations of the shots or segments;
(b) a measure of video content bit rates associated with the shots or segments;
(c) a number, size and position of person faces detectable in images comprising the shots or segments;
(d) a presence, size and position of overlaid text in the shots or segments;
(e) audio classification labels for the shots or segments;
(f) a measure of audio change of pace in the shots or segments;
(g) a measure of an average number of feature edges in images of the shots or segments;
(h) a measure of edge histogram computed from the number of feature edges in images of the shots or segments;
(i) a measure of homogeneous texture from images of the shots or segments;
(j) a measure of a number and area of regions resulting from image segmentation in the shots or segments;
(k) a measure of intensity and spatial distribution of motion activity within the shots or segments;
(l) an estimation of camera motion occurring within the shots or segments;
(m) a measure of a field-of-view difference between neighboring shots or segments;
(n) a measure of a ratio between areas in the images in the shots or segments which are substantially in-focus to those areas which are out-of-focus;
(o) a correlation in time of geometrically invariant properties of regions of images resulting from segmentation;
(p) a measure of average color drift in the shots or segments;
(q) a cumulative frame difference providing a measure of dynamic activity in a shot or segment; and
(r) a measure of temporal interval between substantially similar shots or segments.
4. A method as claimed in claim 3, wherein the correlation in time is based on invariant properties dependent on color and shape.
5. A method as claimed in claim 1, wherein the step of analyzing to identify the occurrences of the shots or segments (10, 20, 30) employs at least one of: linear classifiers, statistical classifiers, neural networks, support vector machines, hidden Markov models.
6. A method as claimed in claim 5, wherein processing the video data objects includes training using example images and associated shot classifications.
7. A method as claimed in claim 1, said method being operable to provide video indexing in at least one of video recorders and video-on-demand systems, to implement content-based features including at least one of: automatic video summarization, key-frame extraction for providing data to a table-of-contents, violent scene detection, video genre classification, intelligent chaptering, and automatic editing of home videos.
8. An apparatus (100) for indexing video data objects, said apparatus (100) comprising an input (110) for receiving the video data objects, and a processor (150, 160, 170, 180) for analyzing the video data objects to identify therefrom occurrences of different categories of shots or segments (10, 20, 30) therein, each shot or segment (10, 20, 30) comprising a sequence of one or more images, each shot or segment (10, 20, 30) being identified in response to image feature content of its one or more images, said processor (150, 160, 170, 180) being operable to process the video data objects (110) based on at least the occurrences of the different categories of shots or segments (10, 20, 30) to provide video index classifications of the video data objects.
9. An apparatus (100) as claimed in claim 8, said apparatus (100) including functional units (150, 160, 170, 180) including:
(a) a shot-cut detector or a video segmentation unit (150) for segmenting the video data objects into corresponding constituent shots or segments;
(b) a feature extractor (160) for extracting features of the constituent shots or segments;
(c) a statistical classifier (170) for performing a statistical analysis of the extracted features; and
(d) a video summarizer (180) for analyzing statistical output data from the statistical classifier for generating said video index classifications of the video data objects.
10. An apparatus (100) as claimed in claim 8, said apparatus (100) being adapted for incorporation into video recorders and video-on-demand systems.
11. An apparatus (100) as claimed in claim 8, said apparatus (100) being operable to provide at least one of: automatic video summarization, key-frame extraction for providing data for a table-of-contents for user presentation, violent scene detection, video genre classification, intelligent chaptering, and automatic editing of home videos.
12. Computer programme product comprising instructions that are executable on computing hardware for implementing a method as claimed in claim 1, said method concerning video indexing of video data objects based on determination of shots or segments included in the video data objects.
13. Record carrier carrying the computer programme product as claimed in claim 12.
PCT/IB2006/050634 2005-03-04 2006-03-01 Method of video indexing WO2006092765A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05101705.1 2005-03-04
EP05101705 2005-03-04

Publications (2)

Publication Number Publication Date
WO2006092765A2 true WO2006092765A2 (en) 2006-09-08
WO2006092765A3 WO2006092765A3 (en) 2006-11-09

Family

ID=36693610

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/050634 WO2006092765A2 (en) 2005-03-04 2006-03-01 Method of video indexing

Country Status (1)

Country Link
WO (1) WO2006092765A2 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008127319A3 (en) * 2007-01-31 2009-08-13 Thomson Licensing Method and apparatus for automatically categorizing potential shot and scene detection information
WO2012036658A1 (en) 2010-09-17 2012-03-22 Thomson Licensing Method for semantics based trick mode play in video system
CN103605786A (en) * 2013-11-27 2014-02-26 姚领众 Massive video retrieving method based on sample video clips
EP2922060A1 (en) * 2014-03-17 2015-09-23 Fujitsu Limited Extraction method and device
US9866922B2 (en) 2010-03-31 2018-01-09 Thomson Licensing Trick playback of video data
CN110169055A (en) * 2017-01-20 2019-08-23 华为技术有限公司 A kind of method and apparatus generating shot information
CN111147914A (en) * 2019-12-24 2020-05-12 珠海格力电器股份有限公司 Video processing method, storage medium and electronic equipment

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
A. Ekin, A. M. Tekalp, R. Mehrotra: "Automatic Soccer Video Analysis and Summarization", www.ece.rochester.edu, [Online] 2003, XP002393278. Retrieved from the Internet: URL:http://www.ece.rochester.edu/users/tekalp/papers/ip_ekin4157.pdf [retrieved on 2006-08-02] *
Ekin, A. et al.: "Framework for Tracking and Analysis of Soccer Video", Proceedings of the SPIE - The International Society for Optical Engineering, SPIE-Int. Soc. Opt. Eng., USA, vol. 4671, 2002, pages 763-774, XP002393277, ISSN: 0277-786X *
Peng Xu, Lexing Xie, Shih-Fu Chang: "Algorithms and System for Segmentation and Structure Analysis in Soccer Video", IEEE International Conference on Multimedia and Expo, [Online] 2001, pages 928-931, XP002393276, ISBN: 0-7695-1198-8. Retrieved from the Internet: URL:http://citeseer.ist.psu.edu/cache/papers/cs/26571/ftp:zSzzSzftp.ee.columbia.eduzSzCTR-ResearchzSzadventzSzpubliczSzpaperszSz01zSzicme01_soccer.pdf/xu01algorithms.pdf [retrieved on 2006-08-02] *
Shu-Ching Chen et al.: "Detection of Soccer Goal Shots Using Joint Multimedia Features and Classification Rules", MDM/KDD'03, Washington DC, 24 August 2003 (2003-08-24), XP002342134 *
Xiao-Feng Tong et al.: "Shot Classification in Sports Video", Signal Processing, 2004. Proceedings. ICSP '04. 7th International Conference on, Beijing, China, Aug. 31 - Sept. 4, 2004, Piscataway, NJ, USA, IEEE, 31 August 2004 (2004-08-31), pages 1364-1367, XP010810734, ISBN: 0-7803-8406-7 *
Xie, L. et al.: "Structure Analysis of Soccer Video with Domain Knowledge and Hidden Markov Models", Pattern Recognition Letters, North-Holland Publ., Amsterdam, NL, May 2004 (2004-05), pages 767-775, XP004500944, ISSN: 0167-8655, the whole document *
Yi-Hua Zhou et al.: "An SVM-Based Soccer Video Shot Classification", Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on, Guangzhou, China, 18-21 Aug. 2005, Piscataway, NJ, USA, IEEE, 18 August 2005 (2005-08-18), pages 5398-5403, XP010847738, ISBN: 0-7803-9091-1 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010517469A (en) * 2007-01-31 2010-05-20 トムソン ライセンシング Method and apparatus for automatically classifying potential shot and scene detection information
CN101601302B (en) * 2007-01-31 2012-07-04 汤姆森特许公司 Method and apparatus for automatically categorizing potential shot and scene detection information
US8891020B2 (en) 2007-01-31 2014-11-18 Thomson Licensing Method and apparatus for automatically categorizing potential shot and scene detection information
WO2008127319A3 (en) * 2007-01-31 2009-08-13 Thomson Licensing Method and apparatus for automatically categorizing potential shot and scene detection information
US11418853B2 (en) 2010-03-31 2022-08-16 Interdigital Madison Patent Holdings, Sas Trick playback of video data
US9866922B2 (en) 2010-03-31 2018-01-09 Thomson Licensing Trick playback of video data
WO2012036658A1 (en) 2010-09-17 2012-03-22 Thomson Licensing Method for semantics based trick mode play in video system
US9438876B2 (en) 2010-09-17 2016-09-06 Thomson Licensing Method for semantics based trick mode play in video system
CN103605786A (en) * 2013-11-27 2014-02-26 姚领众 Massive video retrieving method based on sample video clips
EP2922060A1 (en) * 2014-03-17 2015-09-23 Fujitsu Limited Extraction method and device
EP3565243A4 (en) * 2017-01-20 2020-01-01 Huawei Technologies Co., Ltd. Method and apparatus for generating shot information
CN110169055B (en) * 2017-01-20 2021-06-15 华为技术有限公司 Method and device for generating lens information
CN110169055A (en) * 2017-01-20 2019-08-23 华为技术有限公司 A kind of method and apparatus generating shot information
CN111147914A (en) * 2019-12-24 2020-05-12 珠海格力电器股份有限公司 Video processing method, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2006092765A3 (en) 2006-11-09

Similar Documents

Publication Publication Date Title
Uchihashi et al. Video manga: generating semantically meaningful video summaries
Truong et al. Video abstraction: A systematic review and classification
Li et al. Techniques for movie content analysis and skimming: tutorial and overview on video abstraction techniques
US7555149B2 (en) Method and system for segmenting videos using face detection
Vijayakumar et al. A study on video data mining
Smoliar et al. Content based video indexing and retrieval
EP1692629B1 (en) System & method for integrative analysis of intrinsic and extrinsic audio-visual data
You et al. A multiple visual models based perceptive analysis framework for multilevel video summarization
US20080187231A1 (en) Summarization of Audio and/or Visual Data
Jiang et al. Automatic consumer video summarization by audio and visual analysis
Chen et al. Detection of soccer goal shots using joint multimedia features and classification rules
WO2006092765A2 (en) Method of video indexing
Jiang et al. Advances in video summarization and skimming
Ferman et al. Effective content representation for video
Tapu et al. DEEP-AD: a multimodal temporal video segmentation framework for online video advertising
El-Bendary et al. PCA-based home videos annotation system
Hammoud Introduction to interactive video
Adami et al. The ToCAI description scheme for indexing and retrieval of multimedia documents
Bailer et al. A distance measure for repeated takes of one scene
Smith et al. Multimodal video characterization and summarization
Bailer et al. Skimming rushes video using retake detection
Rui et al. A unified framework for video summarization, browsing and retrieval
Abdullah et al. Integrating audio visual data for human action detection
Choroś Reduction of faulty detected shot cuts and cross dissolve effects in video segmentation process of different categories of digital videos
You et al. Semantic audiovisual analysis for video summarization

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

NENP Non-entry into the national phase in:

Ref country code: RU

WWW WIPO information: withdrawn in national office

Country of ref document: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06710993

Country of ref document: EP

Kind code of ref document: A2

WWW WIPO information: withdrawn in national office

Ref document number: 6710993

Country of ref document: EP