WO2006092765A2 - Method of video indexing - Google Patents

Method of video indexing Download PDF

Info

Publication number
WO2006092765A2
Authority
WO
WIPO (PCT)
Prior art keywords
shots
segments
video
shot
data objects
Prior art date
2005-03-04
Application number
PCT/IB2006/050634
Other languages
French (fr)
Other versions
WO2006092765A3 (en)
Inventor
Mauro Barbieri
Nevenka Dimitrova
Lalitha Agnihotri
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
2005-03-04
Filing date
2006-03-01
Publication date
2006-09-08
Application filed by Koninklijke Philips Electronics N.V.
Publication of WO2006092765A2
Publication of WO2006092765A3

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G11B 27/32 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording on separate auxiliary tracks of the same or an auxiliary record carrier
    • G11B 27/327 - Table of contents
    • G11B 27/329 - Table of contents on a disc [VTOC]
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval of video data; Database structures therefor; File system structures therefor
    • G06F 16/71 - Indexing; Data structures therefor; Storage structures


Abstract

There is provided an apparatus (100) for indexing video data objects (110), said apparatus (100) comprising an input (110) for receiving the video data objects, and a processor (150, 160, 170, 180) for analyzing the video data objects (110) to identify therefrom occurrences of different categories of shots or segments (10, 20, 30) therein, each shot or segment (10, 20, 30) comprising a sequence of one or more images, each shot or segment (10, 20, 30) being identified in response to image feature content of its one or more images, said processor (150, 160, 170, 180) being operable to process the video data objects (110) based on at least the occurrences of the different categories of shots or segments (10, 20, 30) to provide video index classifications of the video data objects (110).

Description

Method of video indexing
The present invention relates to methods of video indexing, for example to methods of processing stored video data objects to extract therefrom information useable for indexing purposes. Moreover, the present invention also relates to apparatus operable to implement the methods.
Advances in computers, communication systems and networks, and data storage media capacity have resulted in recent years in the generation of large archives of video data objects, for example in servers of communication networks operable to deliver video data objects to customers, as well as in domestic video entertainment systems where users often accumulate large personal libraries of video data objects. These archives tend to grow with time and can potentially comprise a wide variety of subject matter in their video data objects. There arises therefrom a technical problem of adequately and automatically indexing video data objects to allow users to search through these archives to identify video data objects of interest. The archives can potentially comprise several terabytes of data objects in substantially disorganized form, and searching through them manually is a tedious and time-consuming task.
Although video data objects are, in some situations, stored with corresponding meta-data, such meta-data has been found in practice to be too coarse and also lacking in information content to enable certain types of searching operations to be executed.
Methods of automatically extracting semantically significant events from video are known. For example, there is described in published United States Patent No. US 6,721,454 a multi-level technique for detecting semantically meaningful events in video. In the multi-level technique, video sequences are visually analyzed at a first level to detect shot boundaries, to measure color and texture of content in the sequences and to detect objects in the content. At a second level of the technique, image objects present in the sequences are classified and the content in each shot is thereby summarized. Moreover, at a third level of the technique, there is applied an event inference module for inferring occurrence of events on the basis of temporal and spatial phenomena disclosed in the shot summaries.
Contemporary methods of automatically extracting semantically significant events from video tend to be highly computationally demanding such that use of these methods in domestic equipment of moderate cost is impractical.
It is known, for example from a conference paper "Content-based retrieval of video data by the grammar of film", Yoshitaka et al., Proc. IEEE Symposium on Visual Languages 1997, pp. 310-317, that it is feasible to devise a method of retrieving video data from films by way of specifying semantic contents of scenes. In connection therewith, there is a so-called "grammar of film" which is an accumulation of knowledge and rules for expressing certain semantics of a scene more effectively. Thus, features of video data objects can be observed as a consequence of the effects of the grammar of film. Methods employing the "grammar of film" to index video data objects are found to be more reliable in comparison to other ways of evaluating sequences of images in detail; such reliability is enhanced because indexing by attention to "film grammar" provides a closer approach to a semantic level of understanding of humans. However, use of the "grammar of film" is not appropriate in all circumstances, for example when video sequences are produced by video surveillance cameras where compliance with "film grammar" is not a requirement. It is therefore desirable to provide an alternative method of video indexing.
According to a first aspect of the invention, there is provided a method of indexing video data objects, said method comprising steps of:
(a) receiving the video data objects and analyzing them to identify occurrences of different categories of shots or segments therein, each shot or segment comprising a sequence of one or more images, each shot or segment being identified in response to image feature content of its one or more images; and
(b) processing the video data objects based on at least the occurrences of the different categories of shots or segments for providing video index classifications for the video data objects.
The invention is of advantage in that use of shot or segment information, namely field-of-view information, is capable of providing an enhanced method of video indexing. For the purposes of describing the invention, shots comprising one or more images can also be regarded as video content segments, such segments comprising one or more mutually related images by virtue of image feature content. However, a segment can include several corresponding shots, including fractional parts of shots. Optionally, in the method, said different categories of shots or segments include at least one of: a long-shot, a medium-shot and a close-up shot; said long-shot corresponding to one or more images depicting an entire area of action; said medium-shot corresponding to one or more images depicting one or more persons or objects of interest occupying substantially 50% of an area of said one or more images; and said close-up shot corresponding to a part of a person or object of interest. Substantially 50% of an area of said one or more images optionally corresponds to a range of 25% to 75% of said area, more optionally to a range of 40% to 60% of said area. Other definitions for "substantially 50%" are possible and are described qualitatively later. Moreover, other categories of shots or segments are possible, for example an extreme long-shot or an extreme close-up shot.
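By way of a hedged illustration only, the following Python sketch encodes the three categories and the optional 25% to 75% reading of "substantially 50%" as a simple decision rule; the threshold values, the names and the idea of a single "subject fraction" input are assumptions for illustration, not a rule prescribed by the invention.

    from enum import Enum

    class ShotCategory(Enum):
        LONG = "long-shot"          # depicts an entire area of action
        MEDIUM = "medium-shot"      # subject occupies substantially 50% of the frame
        CLOSE_UP = "close-up shot"  # depicts a part of a person or object of interest

    def categorize_by_subject_area(subject_fraction: float) -> ShotCategory:
        """Map the fraction of image area occupied by the subject of interest
        to a shot category, using the optional 25%-75% band given above."""
        if subject_fraction < 0.25:
            return ShotCategory.LONG
        if subject_fraction <= 0.75:
            return ShotCategory.MEDIUM
        return ShotCategory.CLOSE_UP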
Optionally, in the method, identification of the occurrences of the shots or segments involves computing at least one of:
(a) a measure of durations of the shots or segments;
(b) a measure of video content bit rates associated with the shots or segments;
(c) a number, size and position of person faces detectable in images comprising the shots or segments;
(d) a presence, size and position of overlaid text in the shots or segments;
(e) audio classification labels for the shots or segments;
(f) a measure of audio change of pace in the shots or segments;
(g) a measure of an average number of feature edges in images of the shots or segments;
(h) a measure of edge histogram computed from the number of feature edges in images of the shots or segments;
(i) a measure of homogeneous texture from images of the shots or segments;
(j) a measure of a number and area of regions resulting from image segmentation in the shots or segments;
(k) a measure of intensity and spatial distribution of motion activity within the shots or segments;
(l) an estimation of camera motion occurring within the shots or segments;
(m) a measure of a field-of-view difference between neighboring shots or segments;
(n) a measure of a ratio between areas in the images in the shots or segments which are substantially in-focus to those areas which are out-of-focus;
(o) a correlation in time of geometrically invariant properties of regions of images resulting from segmentation;
(p) a measure of average color drift in the shots or segments;
(q) a cumulative frame difference providing a measure of dynamic activity in a shot or segment; and
(r) a measure of temporal interval between substantially similar shots or segments.
The term "video content bit rates" is to be construed to relate to the number of bits employed to encode video shots or segments, taking into account a temporal duration of the video segments or shots. Moreover, audio classification labels are to be construed, for example, to relate to additional audio features including the number of speaking persons and audio loudness. Audio classification labels can include one or more of: speech, music, silence, combinations of speech and music, noise, combinations of speech and noise, environmental sounds (for example traffic), laughter, applause and crowd cheering.
Such measures and estimates are found in practice to be useful in combination with identification of shots or segments to provide more reliable indexing of video data objects.
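To make the combination of such measures concrete, the following Python sketch (NumPy assumed as a dependency) assembles a small per-shot feature vector from measures (a), (b) and (g) above; the remaining measures would be appended analogously. The function name and the simple gradient-threshold edge detector are illustrative assumptions, not prescribed by the method.

    import numpy as np

    def shot_feature_vector(frames: list, encoded_bits: int, fps: float = 25.0) -> np.ndarray:
        """Per-shot feature vector from measures (a), (b) and (g).
        frames: greyscale images of one shot, each a 2-D uint8 NumPy array.
        encoded_bits: number of bits used to encode the shot."""
        duration = len(frames) / fps                    # measure (a): shot duration
        bit_rate = encoded_bits / max(duration, 1e-9)   # measure (b): bits per second
        # Measure (g): average count of edge pixels per image, using a plain
        # gradient-magnitude threshold as a stand-in edge detector.
        edge_counts = []
        for f in frames:
            gy, gx = np.gradient(f.astype(float))
            edge_counts.append(int((np.hypot(gx, gy) > 32.0).sum()))
        avg_edges = float(np.mean(edge_counts)) if edge_counts else 0.0
        return np.array([duration, bit_rate, avg_edges])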
Optionally, in the method, the correlation in time is based on invariant properties dependent on color and shape. Color and shape information is generally relatively easy to extract from images and, in combination with identification of shots or segments, surprisingly provides more reliable automatic indexing of video data objects.
Optionally, in the method, the step of analyzing to identify the occurrences of the shots or segments employs at least one of: linear classifiers, statistical classifiers, neural networks, support vector machines, hidden Markov models. Such approaches are capable of enhancing reliability of video indexing whilst requiring more modest computing resources when implemented, thereby rendering the method useable in more modest apparatus, for example home video equipment.
Optionally, in the method, processing the video data objects includes training using example images and associated shot or segment classifications. Such an option of training in the method renders the method easier to customize to users' specific types of video data objects for obtaining enhanced reliability of video indexing.
Optionally, the method is operable to provide video indexing in at least one of video recorders and video-on-demand systems, to implement content-based features including at least one of: automatic video summarization, key-frame extraction for providing data for a table-of-contents, violent scene detection, video genre classification, intelligent chaptering, and automatic editing of home videos.
According to a second aspect of the invention, there is provided an apparatus for indexing video data objects, said apparatus comprising an input for receiving the video data objects, and a processor for analyzing the video data objects to identify therefrom occurrences of different categories of shots or segments therein, each shot or segment comprising a sequence of one or more images, each shot or segment being identified in response to image feature content of its one or more images, said processor being operable to process the video data objects based on at least the occurrences of the different categories of shots or segments to provide video index classifications of the video data objects. Optionally, the apparatus includes functional units including:
(a) a shot-cut detector or a video segmentation unit for segmenting the video data objects into corresponding constituent shots or segments;
(b) a feature extractor for extracting features of the constituent shots or segments;
(c) a statistical classifier for performing a statistical analysis of the extracted features; and
(d) a video summarizer for analyzing statistical output data from the statistical classifier for generating said video index classifications of the video data objects.
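As an informal illustration of how these units cooperate, the following Python sketch expresses the dataflow of units (a) to (d) as a plain function pipeline; the callables and their signatures are placeholders, not an implementation of the apparatus.

    from typing import Callable, Iterable

    def index_video(frames: Iterable,
                    detect_cuts: Callable,       # unit (a): frames -> list of shots
                    extract_features: Callable,  # unit (b): shot -> feature vector
                    classify: Callable,          # unit (c): vector -> class probabilities
                    summarize: Callable):        # unit (d): (shots, probabilities) -> summary
        """Dataflow of the functional units, expressed as a pipeline."""
        shots = detect_cuts(frames)                     # constituent shots or segments
        vectors = [extract_features(s) for s in shots]  # extracted features
        probabilities = [classify(v) for v in vectors]  # statistical analysis output
        return summarize(shots, probabilities)          # video index classifications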
Optionally, the apparatus is adapted for incorporation into video recorders and video-on-demand systems. Moreover, the apparatus is beneficially operable to provide at least one of: automatic video summarization, key-frame extraction for providing data for a table-of-contents for user presentation, violent scene detection, video genre classification, intelligent chaptering, and automatic editing of home videos.
According to a third aspect of the invention, there is provided software recorded on a data carrier, said software being executable on computing hardware for implementing a method according to the first aspect of the invention, said method concerning video indexing of video data objects based on determination of shots or segments included in the video data objects. It will be appreciated that features of the invention are susceptible to being combined in any combination without departing from the scope of the invention.
Embodiments of the invention will now be described, by way of example only, with reference to the following diagrams wherein:
Figure 1 is a graphical illustration of a long-shot, a medium-shot and a close- up shot; and
Figure 2 is a schematic diagram of an apparatus for analyzing video content according to the invention.
In order to describe the present invention in contradistinction to known techniques for analyzing video data objects, aspects of these known techniques will firstly be elucidated.
Contemporary video data object content analysis techniques utilize algorithms derived from image processing in general. These content analysis techniques include pattern recognition and artificial intelligence that aim to create annotations of video material automatically. Such annotations relate to a wide spectrum of features, from low-level signal-related properties such as color and texture to higher-level information such as the presence of faces. It is contemporary practice to search for and retrieve video data objects in large unstructured archives of video content after the content has been indexed using content analysis techniques. Results of such content analysis techniques are used, for example, for detecting occurrences of commercials and advertisements, for implementing intelligent chaptering, and for providing video previews and video summaries.
Thus, contemporary content analysis techniques aim to automatically index video content by extracting various features from images of the video content, for example image color, texture, shape and motion, as well as from audio associated with the video content, for example from audio energy and audio frequency spectra. When analyzing low-level features, contemporary content analysis techniques aim to imitate human understanding by detecting image features such as faces, image objects, audio silences, music and similar. In consequence, contemporary content analysis techniques are limited by inherent limitations of the algorithms employed to implement them and are therefore capable of identifying only a very limited set of features, for example faces, outdoor environments and indoor environments.
The inventors have appreciated that it is beneficial when implementing automatic content analysis to utilize additional information relating to fields of view of video segments. Such additional information concerns a predefined set of categories such as close-up shot, medium shot, long shot and similar. It is beneficial to use the additional information to significantly enhance content analysis, such as in respect of key-frame extraction, scene boundary detection, summarization, automatic editing and such like.
The inventors have further appreciated that aesthetic aspects of video content, for example light, color, space, time, motion and sound, contribute in a significant manner to a message conveyed by the video content. Such aesthetic aspects are found in practice to be determined by the aforementioned "film grammar". An example of "film grammar" in video content arises when two persons are mutually involved in a dialogue, wherein it is conventional filming practice to employ close-up shots of the two persons alternated with medium shots of the two persons together in a same location.
In order to further elucidate the present invention, "film grammar" will be further described. In professional video production, certain common conventions are used which are generally referred to as "the grammar of film". Film grammar includes conventions for conveying meaning by way of particular camera filming techniques and editing techniques when generating video content. For example, video content producers convey meaning through video by choice of field of view. Such field of view pertains to a size of a subject in relation to a size of an overall corresponding image frame including the subject. Moreover, the field of view is technically dependent on a focal length of a camera lens employed to capture the image frame. Based on fields of view, video content images can be categorized pursuant to three basic groups: long shots, medium shots and close-up shots. These three basic groups will now be further elucidated.
Referring to Figure 1, there is illustrated a long-shot wherein an image 10 captures an entire area of action, for example a place, persons at the place and inanimate objects at the place, to provide an overall appearance. In aforementioned film grammar, long-shots are employed to establish all elements in a scene so that viewers of corresponding video content including the long-shots are able to appreciate persons depicted in the scene and their spatial location and relationship to the scene. For example, entrances and exits of persons from the scene are conventionally, pursuant to aforesaid film grammar, conveyed to viewers by way of long-shots. However, it is also conventional film grammar practice to employ long-shots sparingly in view of the limited resolution and screen size provided in contemporary video content presentation apparatus, for example televisions with a 40 cm diagonal measurement to their screens.
In Figure 1, there is illustrated a medium-shot wherein an image 20 captures a subject, for example a person or an inanimate object, occupying approximately as much of an area of an image as does corresponding surroundings. For example, persons are typically filmed from above their knees, or just below their waists so that their gestures, facial expressions and movements are visible to viewers. It is contemporary film grammar practice that medium-shots generally comprise a majority of images in a given film or video content. Referring again to Figure 1, there is illustrated a close-up shot such that a corresponding image 30 in video content conveys a relatively small proportion of the aforesaid scene. In film grammar, close-up shots are used to focus attention on a person's feelings or reactions, for example as represented in facial expressions. Thus, close-up shots are often employed in interviews to show a person in a state of emotional excitement, grief or joy. Close-up shots are especially useful for performing the following functions:
(a) playing up narrative highlights: such playing up relates to important dialogue, action performed by a person, or the reaction of a person. Playing up, pursuant to aforementioned film grammar, is employed whenever dramatic emphasis or increased audience attention is required;
(b) isolating significant subject matter and eliminating all non-essential material from view: such isolation and elimination is employed to concentrate viewer attention on an important action, for example a particular object or a meaningful expression on a person's face;
(c) magnifying small-scale actions: such magnification is employed when a given action is too small to be adequately represented in a medium-shot or long-shot;
(d) providing a time lapse: insertion of a close-up shot is beneficially used to shorten a time interval needed to convey otherwise temporally lengthy actions;
(e) distracting a viewer's attention: a close-up shot is beneficially employed to disguise a "jump-cut" caused by mismatched or missing action in video content;
(f) substituting for hidden action: for example, substituting for an action which cannot be filmed for physical reasons;
(g) presenting reactions of off-screen persons: a reaction of an off-screen person may, in certain circumstances, be more aesthetically significant than an action performed by a principal person on-screen; and
(h) cueing viewers of video content on how they should react: such cueing is susceptible to stimulating a viewer to have a similar feeling to that presented to the viewer in the close-up shot, for example a portrayal of fear, tension, awe, pity and such like.
Thus, the present invention is concerned with apparatus capable of receiving video content comprising video data objects, for example stored in a database or extracted from a database, and analyzing the video content to identify, amongst other things, the occurrence of long-shots, medium-shots and close-up shots for use in appropriately indexing the video content. The long-shots, medium-shots and close-up shots are optionally determined in response to spatial information included in their corresponding sequences of images, rather than merely in response to the duration of the shots as is known in the art. Such analysis of video content has been found to result in more reliable indexing.
The present invention therefore provides an apparatus for analyzing video content and thereby appropriately indexing the video content. The apparatus can be implemented in practice as dedicated image processing electronic hardware, by way of software executable on computing hardware, or as a mixture of such hardware and software.
Referring to Figure 2, the apparatus is indicated generally by 100 and includes an input 110 for receiving video content V including video data objects, and an output 120 for providing analysis data which is subsequently useful for video content indexing purposes, for example when a viewer is desirous to search for a given type of video data object. The apparatus 100 has functional units, implemented either by way of hardware or software or a combination of both. The functional units include a shot-cut detector 150 for generating indications of identified shots 155, a feature extractor 160 for generating feature vectors 165, a statistical classifier 170 for generating field-of-view class probabilities 175, and a video summarizer 180 for generating a video summary at the output 120 for video indexing purposes.
In operation, the apparatus 100 receives video content at its input 110, which is passed to the shot-cut detector 150. The shot-cut detector 150 is operable to segment the video content into corresponding constituent shots, each shot comprising a sequence of one or more images. Beneficially, the detector 150 utilizes contemporary methods of shot-cut detection, for example as described in the publication "Video Keyframe Extraction and Filtering: A Keyframe is not a Keyframe to Everybody", Proceedings of the ACM Conference on Information and Knowledge Management, November 1997, which is hereby incorporated by reference. For each shot identified in the video content, the detector 150 in co-operation with the feature extractor 160 is operable to process the shot to compute one or more of the following:
(i) a measure of duration of the shot;
(ii) a measure of a bit rate of the shot: for example, long shots usually represent more cluttered scenes and are therefore more difficult to encode; alternatively, a bit rate of one representative keyframe extracted from the shot is utilized;
(iii) the number and sizes of faces detectable in the shot;
(iv) a presence of overlaid text in the shot;
(v) an audio classification label for the shot: close-up shots are more likely to be classified as speech, whereas long shots are likely to be accompanied by a music background;
(vi) a measure of audio change of pace: for example, how frequently audio labels change during the shot;
(vii) a measure of an average number of edges in images per shot;
(viii) a measure of an average distance between edges, for example substantially vertical edges, in images per shot;
(ix) a measure of edge histogram, for example expressed as an average for the shot; such an edge histogram is employed in the MPEG-7 standard as described in ISO/IEC 15938:2001, "Multimedia content description interface - Part 3: Visual", which is hereby incorporated by reference;
(x) a measure of homogeneous texture: for example, such homogeneous texture is as defined in the aforementioned MPEG-7 standard;
(xi) a measure, for example an average, of a number and area of regions resulting from image segmentation in the shot;
(xii) a measure of intensity and spatial distribution of motion activity in the shot; the measure is beneficially determined according to the aforesaid MPEG-7 standard;
(xiii) an estimation of camera motion occurring within the shot: for example, the estimation can provide a measure of pan, tilt and zoom types of camera motion utilized in filming the shot;
(xiv) a measure of field-of-view category of neighboring shots present in the video content: for example, long shots are firstly used to establish a place of a scene, followed by close-up shots to capture a dialog between main persons portrayed in the video content;
(xv) a measure, for example an average, of a ratio between areas in one or more images in the shot which are in focus to those areas which are out-of-focus;
(xvi) a correlation in time, for example between successive shots, of geometrically invariant properties of regions resulting from segmentation, for example color and shape, and their relative spatial arrangement; and
(xvii) a measure of an average color drift in the shot: such a measure concerns whether a color composition of the shot remains substantially constant during an entire duration of the shot, or whether numerous new colors are added as the shot progresses. The measure is beneficially computed from a color histogram difference between successive images or frames in the shot.
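As a concrete, hedged illustration of measure (xvii), the following Python sketch (NumPy assumed) computes an average color drift from color-histogram differences between successive frames, as described above; the bin count and the per-pixel normalization are arbitrary illustrative choices.

    import numpy as np

    def average_color_drift(frames: list, bins: int = 64) -> float:
        """Measure (xvii): mean color-histogram difference between successive
        frames of a shot. A value near zero indicates the color composition
        remains substantially constant for the shot's duration; larger values
        indicate numerous new colors appear as the shot progresses.
        frames: RGB images as uint8 NumPy arrays of identical shape."""
        diffs = []
        for prev, curr in zip(frames, frames[1:]):
            d = 0.0
            for channel in range(3):  # histogram each color channel separately
                h_prev, _ = np.histogram(prev[..., channel], bins=bins, range=(0, 256))
                h_curr, _ = np.histogram(curr[..., channel], bins=bins, range=(0, 256))
                d += np.abs(h_prev - h_curr).sum()
            diffs.append(d / prev[..., 0].size)  # normalise per pixel
        return float(np.mean(diffs)) if diffs else 0.0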
The statistical classifier 170 is trained on a substantially comprehensive set of manually annotated video shots, such annotation assigning to each shot a probability value representative of an estimated probability of the shot belonging to, for example, one of the aforementioned three groups, namely long-shot, medium-shot and close-up shot. It will be appreciated that there can be more or fewer than three groups; for example, extreme long-shot could be a fourth group which the classifier 170 is operable to utilize.
Reiterating the foregoing, the groups can be defined, for example, as follows:
(A) extreme close-up shot: for example, a detail of a person;
(B) close-up shot: for example, all of a person's head;
(C) medium close-up shot: for example, head and shoulders of a person;
(D) medium shot: for example, equal importance given to a person and associated surroundings;
(E) long shot: a person is included in their entirety in an image together with much of the associated surroundings; and
(F) extreme long shot: a person is captured as a small part of an image with associated surroundings.
The statistical classifier 170 optionally employs processing functions known in contemporary pattern classification. Such functions optionally include at least one of: linear classifiers, neural networks, support vector machines, hidden Markov models and such like. Beneficially, the statistical classifier 170 is taken through a "training" phase during which it is given examples of field-of-view segments with corresponding manually annotated labels, for example close-up shot, long-shot and so forth, which the classifier 170 is expected to classify. More optionally, training of the classifier 170 is undertaken using several feature vectors that are extracted from shots, each of which has been labeled according to the field of view which is to be detected by the classifier 170.
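A minimal sketch of such a training phase follows, assuming scikit-learn's support vector machine (one of the classifier options named above) and synthetic placeholder data standing in for real manually annotated shots:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Placeholder training set: one feature vector per manually annotated shot
    # (for example, as produced by a feature extractor such as unit 160) and
    # its field-of-view label; real data would replace these random values.
    X_train = rng.random((120, 3))                             # n_shots x n_features
    y_train = rng.choice(["long", "medium", "close-up"], 120)  # annotated labels

    classifier = SVC(probability=True)  # SVM with probability estimates
    classifier.fit(X_train, y_train)    # the "training" phase

    # For a new, unlabelled shot, obtain class probabilities of the kind the
    # classifier 170 passes on to the video summarizer 180:
    new_shot = rng.random((1, 3))
    print(classifier.predict_proba(new_shot))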
In some situations, screenplay information accompanying video content provides information describing types of shot employed in filming the video content. In the invention, temporal boundaries in video content are identified by way of "dialog alignment". Thus, audio labels are optionally used to locate audio scene descriptions and thereby include additional time stamps. When implementing the present invention, an analysis is employed for parsing screenplay information and thereby finding scene descriptions and matching them with potential time-stamped shot segments. The screenplay is thereby able to assist in identifying types of shots, for example a close-up shot or a long-shot, for assisting with video content indexing.
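Screenplay formats and shot-type vocabulary vary in practice, so purely as an assumed illustration, the following Python sketch scans screenplay text for common shot-type cues; the cue patterns and labels are hypothetical, and the resulting line positions would still have to be mapped to time stamps by the dialog alignment described above.

    import re

    # Hypothetical shot-type cues of the kind found in screenplay shot headings;
    # listed from most to least specific, since the first match per line wins.
    SHOT_CUES = {
        r"\bEXTREME CLOSE[- ]?UP\b|\bECU\b": "extreme close-up",
        r"\bCLOSE[- ]?UP\b|\bCU\b": "close-up",
        r"\bMEDIUM SHOT\b|\bMS\b": "medium shot",
        r"\bLONG SHOT\b|\bWIDE SHOT\b|\bLS\b": "long shot",
    }

    def shot_types_from_screenplay(text: str):
        """Return (line number, shot type) pairs for lines containing a cue;
        dialog alignment would then attach time stamps to these positions."""
        hits = []
        for number, line in enumerate(text.splitlines(), start=1):
            for pattern, label in SHOT_CUES.items():
                if re.search(pattern, line.upper()):
                    hits.append((number, label))
                    break  # first, most specific cue wins for this line
        return hits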
The present invention is suitable for employment in any situation where indexing of video content is required. For example, the present invention is susceptible to being utilized in video recorders and video-on-demand systems to implement content-based features such as automatic video summarization, key-frame extraction for providing data for a table-of-contents for presentation to a user, violent scene detection, video genre classification, intelligent chaptering and automatic editing of home videos, to mention a few examples. The present invention is also applicable to professional video editing software and systems to facilitate video editing and production.
In the accompanying claims, numerals and other symbols included within brackets are included to assist understanding of the claims and are not intended to limit the scope of the claims in any way.
It will be appreciated that embodiments of the invention described in the foregoing are susceptible to being modified without departing from the scope of the invention as defined by the accompanying claims. Expressions such as "comprise", "include", "incorporate", "contain", "is" and "have" are to be construed in a non-exclusive manner when interpreting the description and its associated claims, namely construed to allow for other items or components which are not explicitly defined also to be present. Reference to the singular is also to be construed to be a reference to the plural and vice versa.

Claims

CLAIMS:
1. A method of indexing video data objects (110), said method comprising steps of:
(a) receiving the video data objects (110) and analyzing them to identify occurrences of different categories of shots or segments (10, 20, 30) therein, each shot or segment (10, 20, 30) comprising a sequence of one or more images, each shot or segment (10, 20, 30) being identified in response to image feature content of its one or more images; and
(b) processing the video data objects (110) based on at least the occurrences of the different categories of shots or segments (10, 20, 30) for providing video index classifications for the video data objects (110).
2. A method as claimed in claim 1, wherein said different categories of shots or segments (10, 20, 30) include at least one of: a long-shot (10), a medium-shot (20) and a close-up shot (30); said long-shot (10) corresponding to one or more images depicting an entire area of action; said medium-shot (20) corresponding to one or more images depicting one or more persons or objects of interest occupying substantially 50% of an area of said one or more images; and said close-up shot (30) corresponding to a part of a person or object of interest.
3. A method as claimed in claim 1, wherein identification of the occurrences of the shots or segments (10, 20, 30) involves computing at least one of:
(a) a measure of durations of the shots or segments;
(b) a measure of video content bit rates associated with the shots or segments;
(c) a number, size and position of person faces detectable in images comprising the shots or segments;
(d) a presence, size and position of overlaid text in the shots or segments;
(e) audio classification labels for the shots or segments;
(f) a measure of audio change of pace in the shots or segments;
(g) a measure of an average number of feature edges in images of the shots or segments;
(h) a measure of edge histogram computed from the number of feature edges in images of the shots or segments;
(i) a measure of homogeneous texture from images of the shots or segments;
(j) a measure of a number and area of regions resulting from image segmentation in the shots or segments;
(k) a measure of intensity and spatial distribution of motion activity within the shots or segments;
(l) an estimation of camera motion occurring within the shots or segments;
(m) a measure of a field-of-view difference between neighboring shots or segments;
(n) a measure of a ratio between areas in the images in the shots or segments which are substantially in-focus to those areas which are out-of-focus;
(o) a correlation in time of geometrically invariant properties of regions of images resulting from segmentation;
(p) a measure of average color drift in the shots or segments;
(q) a cumulative frame difference providing a measure of dynamic activity in a shot or segment; and
(r) a measure of temporal interval between substantially similar shots or segments.
4. A method as claimed in claim 3, wherein the correlation in time is based on invariant properties dependent on color and shape.
5. A method as claimed in claim 1, wherein the step of analyzing to identify the occurrences of the shots or segments (10, 20, 30) employs at least one of: linear classifiers, statistical classifiers, neural networks, support vector machines, hidden Markov models.
6. A method as claimed in claim 5, wherein processing the video data objects includes training using example images and associated shot classifications.
7. A method as claimed in claim 1, said method being operable to provide video indexing in at least one of video recorders and video-on-demand systems, to implement content-based features including at least one of: automatic video summarization, key-frame extraction for providing data to a table-of-contents, violent scene detection, video genre classification, intelligent chaptering, and automatic editing of home videos.
8. An apparatus (100) for indexing video data objects, said apparatus (100) comprising an input (110) for receiving the video data objects, and a processor (150, 160, 170, 180) for analyzing the video data objects to identify therefrom occurrences of different categories of shots or segments (10, 20, 30) therein, each shot or segment (10, 20, 30) comprising a sequence of one or more images, each shot or segment (10, 20, 30) being identified in response to image feature content of its one or more images, said processor (150, 160, 170, 180) being operable to process the video data objects (110) based on at least the occurrences of the different categories of shots or segments (10, 20, 30) to provide video index classifications of the video data objects.
9. An apparatus (100) as claimed in claim 8, said apparatus (100) including functional units (150, 160, 170, 180) including:
(a) a shot-cut detector or a video segmentation unit (150) for segmenting the video data objects into corresponding constituent shots or segments;
(b) a feature extractor (160) for extracting features of the constituent shots or segments;
(c) a statistical classifier (170) for performing a statistical analysis of the extracted features; and
(d) a video summarizer (180) for analyzing statistical output data from the statistical classifier for generating said video index classifications of the video data objects.
10. An apparatus (100) as claimed in claim 8, said apparatus (100) being adapted for incorporation into video recorders and video-on-demand systems.
11. An apparatus (100) as claimed in claim 8, said apparatus (100) being operable to provide at least one of: automatic video summarization, key-frame extraction for providing data for a table-of-contents for user presentation, violent scene detection, video genre classification, intelligent chaptering, and automatic editing of home videos.
12. Computer programme product comprising instructions that are executable on computing hardware for implementing a method as claimed in claim 1, said method concerning video indexing of video data objects based on determination of shots or segments included in the video data objects.
13. Record carrier carrying the computer programme product as claimed in claim 12.
PCT/IB2006/050634 2005-03-04 2006-03-01 Method of video indexing WO2006092765A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05101705.1 2005-03-04
EP05101705 2005-03-04

Publications (2)

Publication Number Publication Date
WO2006092765A2 true WO2006092765A2 (en) 2006-09-08
WO2006092765A3 WO2006092765A3 (en) 2006-11-09

Family

ID=36693610

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/050634 WO2006092765A2 (en) 2005-03-04 2006-03-01 Method of video indexing

Country Status (1)

Country Link
WO (1) WO2006092765A2 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008127319A3 (en) * 2007-01-31 2009-08-13 Thomson Licensing Method and apparatus for automatically categorizing potential shot and scene detection information
WO2012036658A1 (en) 2010-09-17 2012-03-22 Thomson Licensing Method for semantics based trick mode play in video system
CN103605786A (en) * 2013-11-27 2014-02-26 姚领众 Massive video retrieving method based on sample video clips
EP2922060A1 (en) * 2014-03-17 2015-09-23 Fujitsu Limited Extraction method and device
US9866922B2 (en) 2010-03-31 2018-01-09 Thomson Licensing Trick playback of video data
CN110169055A (en) * 2017-01-20 2019-08-23 华为技术有限公司 A kind of method and apparatus generating shot information
CN111147914A (en) * 2019-12-24 2020-05-12 珠海格力电器股份有限公司 Video processing method, storage medium and electronic equipment

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
A. Ekin, A. M. Tekalp, R. Mehrotra: "Automatic Soccer Video Analysis and Summarization", www.ece.rochester.edu, [Online] 2003, XP002393278. Retrieved from the Internet: URL:http://www.ece.rochester.edu/users/tekalp/papers/ip_ekin4157.pdf [retrieved on 2006-08-02] *
Ekin, A. et al.: "Framework for Tracking and Analysis of Soccer Video", Proceedings of the SPIE - The International Society for Optical Engineering, SPIE-Int. Soc. Opt. Eng., USA, vol. 4671, 2002, pages 763-774, XP002393277, ISSN: 0277-786X *
Peng Xu, Lexing Xie, Shih-Fu Chang: "Algorithms and System for Segmentation and Structure Analysis in Soccer Video", IEEE International Conference on Multimedia and Expo, [Online] 2001, pages 928-931, XP002393276, ISBN: 0-7695-1198-8. Retrieved from the Internet: URL:http://citeseer.ist.psu.edu/cache/papers/cs/26571/ftp:zSzzSzftp.ee.columbia.eduzSzCTR-ResearchzSzadventzSzpubliczSzpaperszSz01zSzicme01_soccer.pdf/xu01algorithms.pdf [retrieved on 2006-08-02] *
Shu-Ching Chen et al.: "Detection of Soccer Goal Shots Using Joint Multimedia Features and Classification Rules", MDM/KDD'03, Washington DC, 24 August 2003 (2003-08-24), XP002342134 *
Xiao-Feng Tong et al.: "Shot Classification in Sports Video", Signal Processing, 2004. Proceedings. ICSP '04. 7th International Conference on, Beijing, China, Aug. 31 - Sept. 4, 2004, Piscataway, NJ, USA, IEEE, 31 August 2004 (2004-08-31), pages 1364-1367, XP010810734, ISBN: 0-7803-8406-7 *
Xie, L. et al.: "Structure Analysis of Soccer Video with Domain Knowledge and Hidden Markov Models", Pattern Recognition Letters, North-Holland Publ., Amsterdam, NL, May 2004 (2004-05), pages 767-775, XP004500944, ISSN: 0167-8655, the whole document *
Yi-Hua Zhou et al.: "An SVM-Based Soccer Video Shot Classification", Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on, Guangzhou, China, 18-21 Aug. 2005, Piscataway, NJ, USA, IEEE, 18 August 2005 (2005-08-18), pages 5398-5403, XP010847738, ISBN: 0-7803-9091-1 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010517469A (en) * 2007-01-31 2010-05-20 トムソン ライセンシング Method and apparatus for automatically classifying potential shot and scene detection information
CN101601302B (en) * 2007-01-31 2012-07-04 汤姆森特许公司 Method and apparatus for automatically categorizing potential shot and scene detection information
US8891020B2 (en) 2007-01-31 2014-11-18 Thomson Licensing Method and apparatus for automatically categorizing potential shot and scene detection information
WO2008127319A3 (en) * 2007-01-31 2009-08-13 Thomson Licensing Method and apparatus for automatically categorizing potential shot and scene detection information
US11418853B2 (en) 2010-03-31 2022-08-16 Interdigital Madison Patent Holdings, Sas Trick playback of video data
US9866922B2 (en) 2010-03-31 2018-01-09 Thomson Licensing Trick playback of video data
WO2012036658A1 (en) 2010-09-17 2012-03-22 Thomson Licensing Method for semantics based trick mode play in video system
US9438876B2 (en) 2010-09-17 2016-09-06 Thomson Licensing Method for semantics based trick mode play in video system
CN103605786A (en) * 2013-11-27 2014-02-26 姚领众 Massive video retrieving method based on sample video clips
EP2922060A1 (en) * 2014-03-17 2015-09-23 Fujitsu Limited Extraction method and device
EP3565243A4 (en) * 2017-01-20 2020-01-01 Huawei Technologies Co., Ltd. Method and apparatus for generating shot information
CN110169055B (en) * 2017-01-20 2021-06-15 华为技术有限公司 Method and device for generating lens information
CN110169055A (en) * 2017-01-20 2019-08-23 华为技术有限公司 A kind of method and apparatus generating shot information
CN111147914A (en) * 2019-12-24 2020-05-12 珠海格力电器股份有限公司 Video processing method, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2006092765A3 (en) 2006-11-09

Similar Documents

Publication Publication Date Title
Uchihashi et al. Video manga: generating semantically meaningful video summaries
Truong et al. Video abstraction: A systematic review and classification
Li et al. Techniques for movie content analysis and skimming: tutorial and overview on video abstraction techniques
US7555149B2 (en) Method and system for segmenting videos using face detection
Vijayakumar et al. A study on video data mining
Smoliar et al. Content based video indexing and retrieval
EP1692629B1 (en) System & method for integrative analysis of intrinsic and extrinsic audio-visual data
You et al. A multiple visual models based perceptive analysis framework for multilevel video summarization
US20080187231A1 (en) Summarization of Audio and/or Visual Data
Jiang et al. Automatic consumer video summarization by audio and visual analysis
Chen et al. Detection of soccer goal shots using joint multimedia features and classification rules
WO2006092765A2 (en) Method of video indexing
Jiang et al. Advances in video summarization and skimming
Ferman et al. Effective content representation for video
Tapu et al. DEEP-AD: a multimodal temporal video segmentation framework for online video advertising
El-Bendary et al. PCA-based home videos annotation system
Hammoud Introduction to interactive video
Adami et al. The ToCAI description scheme for indexing and retrieval of multimedia documents
Bailer et al. A distance measure for repeated takes of one scene
Smith et al. Multimodal video characterization and summarization
Bailer et al. Skimming rushes video using retake detection
Rui et al. A unified framework for video summarization, browsing and retrieval
Abdullah et al. Integrating audio visual data for human action detection
Choroś Reduction of faulty detected shot cuts and cross dissolve effects in video segmentation process of different categories of digital videos
You et al. Semantic audiovisual analysis for video summarization

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

NENP Non-entry into the national phase in:

Ref country code: RU

WWW WIPO information: withdrawn in national office

Country of ref document: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06710993

Country of ref document: EP

Kind code of ref document: A2

WWW WIPO information: withdrawn in national office

Ref document number: 6710993

Country of ref document: EP