WO2009044351A1 - Generation of image data summarizing a sequence of video frames - Google Patents

Generation of image data summarizing a sequence of video frames

Info

Publication number
WO2009044351A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
data
frames
sequence
video frames
Application number
PCT/IB2008/053995
Other languages
French (fr)
Inventor
Johannes Weda
Mauro Barbieri
Original Assignee
Koninklijke Philips Electronics N.V.
Application filed by Koninklijke Philips Electronics N.V.
Publication of WO2009044351A1

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G06F16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/786 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using motion, e.g. object motion or camera motion
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034 Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel

Abstract

A method of generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, includes analyzing data obtained from the digital video data to select at least one video frame. At least one frame of the sequence of video frames is scaled down. The video data summarizing the video sequence is produced so as to include video data representing a scaled down version of each selected video frame. The step (27) of analyzing at least some of the video data includes performing a method of analyzing the video data for identifying video frames corresponding to close-up shots of at least one type.

Description

Generation of image data summarizing a sequence of video frames
FIELD OF THE INVENTION
The invention relates to a method of generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the method including: analyzing data obtained from the digital video data to select at least one video frame, and scaling down at least one frame of the sequence of video frames and producing the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame. The invention also relates to a system for generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the system being configured to: analyze data obtained from the digital video data to select at least one video frame, scale down at least one frame of the sequence of video frames, and produce the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame.
The invention also relates to a computer program.
BACKGROUND OF THE INVENTION
WO 2007/046708 discloses a method of displaying video data within result presentations in information access systems or information search systems. Video summaries may be created as a selection from the original video. After selection of the appropriate frames for the video summary and potential resizing to client devices, the resulting frames are compressed by encoding with a video codec. The simplest selection of frames for the video summary is a section from the beginning of the video. Another way is by analyzing the video, identifying scenes (uninterrupted camera shots), and selecting an image to represent each scene. These images may then be displayed as a slideshow to quickly visualize the video content. Alternatively, a number of frames can be selected from each scene and assembled as a video. A method of selection can be used to extract the most relevant scenes. Selection of which scenes to include in the video summary can be done by looking at the length and motion of a scene. Video thumbnails can be computed based on the same techniques as discussed for video summaries. The video thumbnail will typically extract much less data from the video than the video summary, typically, just a single still frame or a small set of still frames that can be viewed in an animated way.
A problem of the known method is that thumbnails of the frames selected from the selected scenes may not display well on small screens or in small screen areas because too much of a frame with much detail is removed when the thumbnail is created. This makes it difficult for users to use such thumbnails when selecting video segments to watch in full.
SUMMARY OF THE INVENTION
It is an object of the invention to provide a method, system and computer program of the types referred to above that result in video data summarizing a video segment that is informative when displayed on a small screen or within a small screen area.
This object is achieved by the method according to the invention, in which the step of analyzing includes performing a method of analyzing video data for identifying frames corresponding to close-up shots of at least one type.
By identifying and selecting from frames corresponding to close-up shots, frames with relatively low spatial frequencies are selected. Thus, such frames can be scaled down - whereby they are sub-sampled in space - to generate scaled down frames that are still informative. Because the close-up shots generally correspond to important elements of a video segment - they provide a detailed view of the actors, important dialogues for the storyline - they are of themselves relatively informative of the whole video segment. The scaled down selected frames are well-suited for display on small screens or within small parts of a larger screen area.
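The patent does not prescribe a way of verifying the low-spatial-frequency property, but the intuition can be illustrated with a short sketch. The following Python/OpenCV snippet (an illustration with assumed file names, not part of the patent) downscales a frame to thumbnail size and compares fine detail before and after using the variance of the Laplacian, a common sharpness proxy.

```python
import cv2

def detail_retention(frame_path, thumb_size=(90, 72)):
    """Fraction of fine detail that survives thumbnailing.

    Close-up frames are dominated by low spatial frequencies, so they
    lose proportionally less detail when scaled down than wide shots.
    """
    gray = cv2.cvtColor(cv2.imread(frame_path), cv2.COLOR_BGR2GRAY)
    thumb = cv2.resize(gray, thumb_size, interpolation=cv2.INTER_AREA)
    # Variance of the Laplacian is a common proxy for fine detail.
    full = cv2.Laplacian(gray, cv2.CV_64F).var()
    small = cv2.Laplacian(thumb, cv2.CV_64F).var()
    return small / full  # closer to 1.0 means more detail survives

# Hypothetical file names: a close-up typically scores higher here
# than a crowded wide shot.
print(detail_retention("closeup.png"), detail_retention("wide_shot.png"))
```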
An embodiment of the method includes selecting a plurality of video frames and assembling scaled down versions of the selected video frames into a video summary sequence. As a result, information is conveyed about a video segment representing an evolving sequence of events. The summary sequence is suitable for representing a plurality of events. As the video frames are analyzed for close-up shots and the appropriate video frames or at least video frames adjacent thereto are selected, each selected video frame is likely to show an important event. Because the selected plurality of video frames are scaled down, the summary sequence is suitable for displaying scenes representing important events in a clearly recognizable manner on a small screen area.
A variant includes obtaining at least one audio track segment for synchronized reproduction with the video summary sequence. As a result, an increase of the information content of the video summary is achieved.
A variant includes determining how many frames should be assembled into the video summary sequence in dependence on the lengths of the audio track segments.
As a result, synchronization is relatively easy to achieve. Audio information would become unintelligible more quickly if the audio tracks were to be adapted to the length of the video summary sequence, whereas the addition or omission of video frames will not generally affect the intelligibility of the video summary sequence to the same extent.
A variant includes obtaining text data associated with the video segment and synthesizing at least one of the audio track segments, based on the text data. As a result, audio information can be added to the video summary sequence without having to truncate dialogues or background music in any original soundtrack accompanying the video segment to be summarized.
In an embodiment, the method of analyzing video data comprises a method of analyzing at least some of the data representative of the video frames. As a result, the identification of video frames corresponding to close-up shots is not dependent on auxiliary data associated with the sequence of video frames.
In a variant, the method of analyzing at least some of the data representative of the video frames includes analyzing data representative of a version of the sequence of video frames obtainable by at least one of sub-sampling in time and reduction in data size of the video frames.
As a result, the analysis is more efficient and quicker. Objects in close-up shots generally occupy large sections of the frame area, so that sub-sampling and reducing in size will not generally result in the loss of so much information that the analysis no longer yields relevant results. In a variant, the method of analyzing at least some of the data representative of the video frames includes performing a method of detecting video frames containing at least one area corresponding to a face.
As a result, relatively informative video frames are selected, since the faces are indicative of dialogue and represent the main characters in a storyline. In a variant, the step of analyzing data obtained from the digital video data includes identifying video frames determined to contain at least one area corresponding to a face and occupying more than a pre-determined proportion of a total area represented by the video frame. It is thus ensured that close-ups are selected, rather than scenes containing, for example, extras.
An embodiment includes selecting a plurality of video frames and assembling scaled down versions of the selected video frames into a video summary sequence, wherein the step of selecting video frames includes analyzing video frames to determine at least one of a measure of brightness, of contrast and of camera motion.
As a result, sections are identified within the video segment to be summarized, so that, where there are many close-ups, it is possible to select frames from as many sections as possible. This allows for a comprehensive summary to be created.
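As an illustration of such per-frame measures (a sketch, not code from the patent), the following computes brightness as mean luminance, contrast as its standard deviation, and a rough camera-motion estimate from dense optical flow with OpenCV:

```python
import cv2
import numpy as np

def frame_measures(prev_gray, gray):
    """Brightness, contrast and a rough camera-motion estimate for one
    frame, given the previous frame; both inputs are 8-bit grayscale."""
    brightness = float(gray.mean())   # mean luminance
    contrast = float(gray.std())      # spread of luminance values
    # Dense optical flow; the mean flow magnitude is a crude stand-in
    # for camera motion (object motion contributes to it as well).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion = float(np.linalg.norm(flow, axis=2).mean())
    return brightness, contrast, motion
```

Section boundaries can then be placed where these measures change abruptly, so that frames are drawn from as many sections of the video segment as possible.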
According to another aspect of the invention, there is provided a system for generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the system being configured to: analyze data obtained from the digital video data to select at least one frame, scale down at least one frame of the sequence of video frames, and produce the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame, wherein the system is configured to analyze by performing a method of analyzing video data for identifying frames corresponding to close-up shots of at least one type.
The system is capable of producing video data that is at once informative and suitable for display on a small screen area. It is in particular useful for providing summaries for use by a user interface for accessing the original video segment or a scaled-down or sub-sampled version of the complete original video segment.
In an embodiment, the system is configured to perform a method according to the invention.
According to another aspect, the computer program according to the invention includes a set of instructions capable, when incorporated in a machine-readable medium, of causing a system having information processing capabilities to perform a method according to the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will be explained in further detail with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of a system for accessing video segments comprised of a sequence of video frames; Fig. 2 is a simplified screen shot illustrating a graphical user interface based on summaries of video segments; and
Fig. 3 is a flow chart illustrating a method of producing the summaries.
DETAILED DESCRIPTION OF THE EMBODIMENTS
In the following, an application using a portable media player 1 will be used to explain a method of generating video data summarizing a video segment comprised of a sequence of video frames to generate an audiovisual summary, as well as uses of the audiovisual summary. The portable media player 1 includes a central processing unit 2 for controlling the operation of the media player 1, and main memory 3. It further includes a data storage device 4 such as a magnetic or optical disk drive. Software enabling the media player 1 to carry out the methods outlined herein can be stored in the data storage device 4 or in Read-Only Memory 5. The media player is provided with means for rendering audiovisual data in perceptible form, which means include a cache memory 6, audio digital signal processor 7, audio amplifier 8, loudspeakers 9 and a display 10. The display 10 is a fixed-pixel-array display, and can be a Liquid Crystal Display, Organic Light-emitting Diode display, or one based on electronic ink technology. It can have a display resolution of 800 x 480 pixels or lower, e.g. 320 x 240 pixels.
The media player 1 also includes user controls for navigating a graphical user interface provided for selection of video segments. The graphical user interface allows selection of video segments for play-back on the media player 1 or on an external display device (not shown) attached to it, or for managing files including data encoding video segments, e.g. to organize them into directories or to transfer them from and to a personal computer 11 attached to the media player 1 via a communications interface 12.
The graphical user interface of the media player 1 is arranged to make audiovisual summaries of video segments available for rendering on the media player 1. In one embodiment, the media player 1 generates audiovisual summaries summarizing video segments encoded in files stored in the data storage device 4.
The graphical user interface may also allow a user of the media player to browse audiovisual summaries of video segments encoded in files available for transfer from a server 13 through a network 14, using a network interface 15 of the media player 1. The transfer may be on demand over an Internet Protocol (IP) network, or it may be a digital television service broadcast over an IP network or direct broadcast network, e.g. a digital cable or terrestrial network such as one based on DVB-T or DVB-H standards. The audiovisual summaries are generated externally and transferred to the media player first, prior to selection of the full video file for transfer or storage. Thus, unnecessary transfer and/or storage of large volumes of data is avoided.
In addition, or alternatively, data encoding video segments may be downloaded to the personal computer 11. The personal computer 11 is configured to provide a graphical user interface for selecting video segments by means of a display device 16 and a user input device 17. To this end, the personal computer 11 generates audiovisual summaries of video segments encoded in data stored on the personal computer 11 and/or it receives audiovisual summaries of video segments encoded in data available for transfer through the network 14 to the computer 11.
Figure 2 is a screen shot of a graphical user interface provided on the personal computer 11. Video summary sequences are rendered concurrently in separate areas 18-21 of the screen. Where the first area 18 has been selected, audio track segments associated with the video summary sequence being displayed in that area 18 are reproduced in synchronized fashion with the video summary sequence. Each of the video summary sequences comprises a selection of characteristic video frames from a sequence of which the original video segment is comprised. The video frames constituting the video summary sequence have been scaled down in resolution relative to the video frames on which they are based. In an embodiment, the video summary sequences are comprised of frames having a resolution of 90 x 72 pixels. In effect, the video summary sequences constitute video summaries in thumbnail format. This makes it possible to display video summary sequences concurrently in the four areas 18-21 of the screen. In an embodiment, the video summary sequences are of a relatively short duration, e.g. in the range of 15 to 20 seconds. This allows for a quick selection to be made, since it is possible to select each of the four areas 18-21 in turn to reproduce the video summary sequence with accompanying audio. In an embodiment (not shown) where, at any one time, only one video summary sequence is shown on a screen, e.g. that of the display 10 of the media player 1, browsing using video summaries is made feasible by the use of short summaries in thumbnail format.
Although it is possible to select and display characteristic frames individually in thumbnail format, as static key frames, the video summary sequences are both easier to represent on a small screen and more informative, because it is easier to appreciate what the storyline is and who the main actors are. Moreover, the video summary sequences, especially when reproduced in synchrony with audio track segments, are more interesting for users.
Figure 3 illustrates one way of generating an audiovisual summary summarizing a video segment comprised of a sequence of video frames. This method is based on data encoding the sequence of video frames and text data provided in association with the sequence of video frames.
Thus, the method includes obtaining text data (step 22). The text data is converted into one or more audio track segments (step 23) for synchronized reproduction with the video summary sequence of the audiovisual summary. The text data should be appropriate to the desired duration of the audiovisual summary when reproduced. A typical length would be about two sentences or lines of text. In one embodiment, the step 22 of obtaining text data includes obtaining a first set of text data and applying a text summarization technique to obtain a second set of text data of the desired length. In another embodiment, if a first set of text data obtained from a first source is too small, additional text data representing information about the video segment to be summarized is obtained, e.g. from a second source. In one embodiment, the text data is obtained from descriptive data associated with a data object comprising data encoding the video segment to be summarized. Such descriptive data is, for example, included with objects generated according to the MPEG-7 standard. Other ways of obtaining the data include electronic program guides and services and databases made accessible via the Internet.
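The patent leaves the speech synthesizer of step 23 unspecified. One possible realization, sketched below under the assumption that a local TTS engine suffices, uses the pyttsx3 library; the speaking rate and output file name are illustrative choices.

```python
import pyttsx3

def synthesize_track(summary_text, out_path="narration.wav"):
    """Render the (summarized) descriptive text to an audio file for
    synchronized playback with the video summary sequence."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 160)   # words per minute; assumed value
    engine.save_to_file(summary_text, out_path)
    engine.runAndWait()               # blocks until the file is written
    return out_path

synthesize_track("About two sentences of plot description, drawn e.g. "
                 "from MPEG-7 metadata or an electronic program guide.")
```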
In an embodiment, the synthesized speech is added to audio data extracted from a soundtrack in the video data, as a voice-over for example. This extracted audio data can encode the title song of a film, for example. Having generated the audio data for the audiovisual summary, it is possible to determine the duration of the audio track segment or segments to be reproduced with the video summary sequence. This is used to determine (step 24) how many frames are to be assembled into the video summary sequence.
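Step 24 then reduces to simple arithmetic: the duration of the synthesized audio multiplied by the summary frame rate gives the frame budget. A minimal sketch, with the frame rate assumed and the file name hypothetical:

```python
import wave

def frame_budget(audio_path, fps=25):
    """Derive how many summary frames to assemble from the duration of
    the synthesized audio track segment (step 24)."""
    with wave.open(audio_path, "rb") as w:
        duration = w.getnframes() / w.getframerate()  # seconds of audio
    return int(round(duration * fps))

# e.g. 16 s of narration at an assumed 25 fps -> a budget of 400 frames
print(frame_budget("narration.wav"))
```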
The video summary sequence is generated in parallel by obtaining video data (step 25) encoding the video segment, which comprises a sequence of video frames. In an embodiment, the video data is in a compressed video data format, for example according to the H.264 or MPEG-2 standard. As is known, such standards make use of I frames, P frames and B frames, or I, P and B macroblocks. I frames contain data encoded without reference to any other frames, whereas P frames and B frames require prior decoding of some other frames for complete decoding.
In the illustrated method, a next step 26 involves sub-sampling the video data in time and space. In particular, where the video data is obtained in one of the above-mentioned formats, only the I frames are retained. These are scaled down by reducing the pixel resolution.
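The patent names no tool for step 26. One convenient way to keep only the I frames and scale them in a single pass is ffmpeg's select and scale filters, as in this sketch (the 90 x 72 thumbnail size echoes the example given earlier; the invocation is an assumption, not the patent's method):

```python
import subprocess

def extract_scaled_iframes(video_path, out_pattern="iframe_%04d.png",
                           width=90, height=72):
    """Keep only the I frames (decodable in isolation) and scale them to
    thumbnail resolution in a single ffmpeg pass (cf. step 26)."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vf", f"select='eq(pict_type,I)',scale={width}:{height}",
        "-vsync", "vfr",   # keep only the timestamps of selected frames
        out_pattern,
    ], check=True)

extract_scaled_iframes("segment.mp4")  # hypothetical input file
```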
In a next step 27, the remaining video data representative of the remaining video frames is analyzed to identify video frames corresponding to close-up shots of at least one type. That is to say that the video frames are analyzed for the presence of characteristics indicating that the frame corresponds to a close-up shot. Depending on the characteristics, more or fewer types of close-up shots can be identified, e.g. those preceded by zooming in, all those including close-up images of humans, etc.
It is possible to identify close-up shots using a method comprising the steps of assigning portions of the frame to at least a first cluster or a second cluster, the clusters having different ranges of depth values associated therewith, and determining the shot type of the frame on the basis of whether both the first and second clusters have been assigned at least one portion or whether there is a stepped or gradual change in the difference between the depth values of the first and second clusters. Such a method is set out more fully in WO 2007/036823.
In the example used herein, the step 27 of analyzing the video data includes identifying video frames determined to contain at least one area corresponding to a face and occupying more than a pre-determined proportion of the total area represented by the video frame. Examples of suitable methods are described in Viola, P. and Jones, M., "Rapid object detection using a boosted cascade of simple features", Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001, Kauai, U.S.A., pp. 511-518, and in Rong, X. et al., "Robust multipose face detection in images", IEEE Transactions on Circuits and Systems for Video Technology, 14 (1), 1 Jan. 2004, pp. 31-41. These methods also make it possible to determine the location, size and pose (angles) of the faces, any one of which may be used as an additional criterion for selecting video frames in this step 27.
In addition to the identification of areas corresponding to faces, further video analyses are applied in some embodiments. The further video analyses can include analysis of the brightness, contrast, and camera motion. They may be applied to frames other than those selected in the preceding step 26, but adjacent to frames determined to contain at least one area corresponding to a face. On the basis of the analysis or analyses, a plurality of video frames is selected (step 28). Frames, such as B and P frames in the case of compressed video data, adjacent to the frames determined to contain at least one area corresponding to a face are obtained to construct a number of shots of approximately 75-100 frames each. The exact number is determined in dependence on the length of the audio track segments as determined in step 24. In other words, the time duration given by the speech synthesizer is used as input to set the proper thresholds for retrieving the segments with the close-up shots.
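As a concrete illustration of the face-area criterion of step 27 (a sketch, not the patent's implementation), the following uses OpenCV's pre-trained Viola-Jones cascade, in the spirit of the cited CVPR 2001 paper, and flags a frame as a close-up when a detected face exceeds an assumed fraction of the frame area; the patent does not specify the threshold value.

```python
import cv2

# Viola-Jones style detector in the spirit of the cited CVPR 2001 paper;
# opencv-python ships pre-trained frontal-face cascades.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_closeup(gray_frame, min_face_fraction=0.15):
    """Flag a frame as a close-up when a detected face occupies more
    than a pre-determined proportion of the frame area; the 0.15
    threshold is an assumption, not a value from the patent."""
    h, w = gray_frame.shape[:2]
    faces = cascade.detectMultiScale(gray_frame,
                                     scaleFactor=1.1, minNeighbors=5)
    return any((fw * fh) / float(w * h) > min_face_fraction
               for (_, _, fw, fh) in faces)
```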
Finally (step 29), the thumbnail video with synchronized audio track segments is assembled. The total amount of data thus generated is relatively small, allowing the audiovisual summaries to be used in devices such as personal digital hard disk recorders, media centers, small form factor devices such as mobile phones, personal digital assistants, etc.
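A final muxing pass in the spirit of step 29 could look as follows; the codec choice and the ffmpeg invocation are assumptions, since the patent fixes no output format:

```python
import subprocess

def assemble_summary(frames_pattern, audio_path,
                     out_path="summary.mp4", fps=25):
    """Mux the selected scaled-down frames with the synthesized audio
    track into one small thumbnail video (cf. step 29)."""
    subprocess.run([
        "ffmpeg", "-framerate", str(fps), "-i", frames_pattern,
        "-i", audio_path,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",  # widely playable
        "-shortest",            # stop with the shorter of the two streams
        out_path,
    ], check=True)

assemble_summary("selected_%04d.png", "narration.wav")  # hypothetical names
```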
It should be noted that the embodiments described above illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. In an alternative embodiment, the step 27 of analyzing at least some of the video data representative of the video frames is augmented by analyzing at least some of the video data representative of an accompanying synchronized original audio track to identify characteristic frames, for example.
'Means', as will be apparent to a person skilled in the art, are meant to include any hardware (such as separate or integrated circuits or electronic elements) or software (such as programs or parts of programs) which perform in operation or are designed to perform a specified function, be it a sole function or in conjunction with other functions, be it in isolation or in co-operation with other elements. 'Computer program' is to be understood to mean any software product stored on a computer-readable medium, such as an optical disk, downloadable via a network, such as the Internet, or marketable in any other manner.

Claims

CLAIMS:
1. Method of generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the method including: analyzing data obtained from the digital video data to select at least one video frame, scaling down at least one frame of the sequence of video frames, and producing the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame, wherein the step (27) of analyzing includes performing a method of analyzing video data for identifying video frames corresponding to close-up shots of at least one type.
2. Method according to claim 1, including selecting a plurality of video frames and assembling scaled down versions of the selected video frames into a video summary sequence.
3. Method according to claim 2, including obtaining at least one audio track segment for synchronized reproduction with the video summary sequence.
4. Method according to claim 3, including determining how many frames to assemble into the video summary sequence in dependence on the lengths of the audio track segments.
5. Method according to claim 3 or 4, including obtaining text data associated with the video segment and synthesizing at least one of the audio track segments based on the text data.
6. Method according to any one of claims 1 to 5, wherein the method of analyzing video data comprises a method of analyzing at least some of the data representative of the video frames.
7. Method according to claim 6, wherein the method of analyzing at least some of the data representative of the video frames includes analyzing data representative of a version of the sequence of video frames obtainable by at least one of sub-sampling in time and reduction in data size of the video frames.
8. Method according to claim 6 or 7, wherein the method of analyzing at least some of the data representative of the video frames includes performing a method of detecting video frames containing at least one area corresponding to a face.
9. Method according to claim 8, wherein the step (27) of analyzing data obtained from the digital video data includes identifying video frames determined to contain at least one area corresponding to a face and occupying more than a pre-determined proportion of a total area represented by the video frame.
10. Method according to any one of claims 6 to 9, including selecting a plurality of video frames and assembling scaled down versions of the selected video frames into a video summary sequence, wherein the step (28) of selecting video frames includes analyzing video frames to determine at least one of a measure of brightness, of contrast and of camera motion.
11. System for generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the system being configured to: analyze data obtained from the digital video data to select at least one video frame, and scale down at least one frame of the sequence of video frames, and produce the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame, wherein the system is configured to analyze by performing a method of analyzing video data for identifying frames corresponding to close-up shots of at least one type.
12. System for generating video data according to claim 11, configured to perform a method according to any one of claims 1 to 10.
13. Computer program including a set of instructions capable, when incorporated in a machine-readable medium, of causing a system (1,11,13) having information processing capabilities to perform a method according to any one of claims 1 to 10.
PCT/IB2008/053995 2007-10-04 2008-10-01 Generation of image data summarizing a sequence of video frames WO2009044351A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07117874.3 2007-10-04
EP07117874 2007-10-04

Publications (1)

Publication Number Publication Date
WO2009044351A1 (en) 2009-04-09

Family

ID=40149762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/053995 WO2009044351A1 (en) 2007-10-04 2008-10-01 Generation of image data summarizing a sequence of video frames

Country Status (1)

Country Link
WO (1) WO2009044351A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040130567A1 (en) * 2002-08-02 2004-07-08 Ahmet Ekin Automatic soccer video analysis and summarization
WO2007126666A2 (en) * 2006-03-30 2007-11-08 Eastman Kodak Company Method for enabling preview of video files

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIENHART R ET AL: "VIDEO ABSTRACTING", COMMUNICATIONS OF THE ACM, vol. 40, no. 12, 1 December 1997 (1997-12-01), pages 55 - 62, XP000765719, ISSN: 0001-0782 *
SMITH M A ET AL: "Video Skimming for Quick Browsing based on Audio and Image Characterization", CARNEGIE MELLON UNIVERSITY, 30 June 1995 (1995-06-30), pages 1 - 22, XP002470969, Retrieved from the Internet <URL:http://www.informedia.cs.cmu.edu/documents/cmu-cs-95-186.pdf> [retrieved on 20080225] *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853286A (en) * 2010-05-20 2010-10-06 上海全土豆网络科技有限公司 Intelligent selection method of video thumbnails
CN101853286B (en) * 2010-05-20 2016-08-10 上海全土豆网络科技有限公司 Intelligent selection method of video thumbnails
EP3125245A1 (en) * 2015-07-27 2017-02-01 Thomson Licensing Method for selecting at least one sequence of frames and corresponding method for creating an audio and/or video digest, electronic devices, computer readable program product and computer readable storage medium
EP3125246A3 (en) * 2015-07-27 2017-03-08 Thomson Licensing Method for selecting sequences of frames and corresponding electronic device, computer readable program product and computer readable storage medium
WO2019054871A1 (en) * 2017-09-15 2019-03-21 Endemol Shine Ip B.V. A media system for providing searchable video data for generating a video comprising parts of said searched video data and a corresponding method
NL2019556B1 (en) * 2017-09-15 2019-03-27 Endemol Shine Ip B V A media system for providing searchable video data for generating a video comprising parts of said searched video data and a corresponding method.
CN113365104A (en) * 2021-06-04 2021-09-07 中国建设银行股份有限公司 Video concentration method and device
CN113365104B (en) * 2021-06-04 2022-09-09 中国建设银行股份有限公司 Video concentration method and device

Similar Documents

Publication Publication Date Title
US9372926B2 (en) Intelligent video summaries in information access
Yeung et al. Video visualization for compact presentation and fast browsing of pictorial content
KR100915847B1 (en) Streaming video bookmarks
JP4200741B2 (en) Video collage creation method and device, video collage display device, and video collage creation program
Truong et al. Video abstraction: A systematic review and classification
Aigrain et al. Content-based representation and retrieval of visual media: A state-of-the-art review
CN101443849B (en) Video browsing user interface
US9966112B1 (en) Systems and methods to associate multimedia tags with user comments and generate user modifiable snippets around a tag time for efficient storage and sharing of tagged items
JP5507386B2 (en) Generating video content from image sets
Bolle et al. Video query: Research directions
CN101300567B (en) Method for media sharing and authoring on the web
CN101150699B (en) Information processing apparatus, information processing method
Nam et al. Dynamic video summarization and visualization
Srinivasan et al. " What is in that video anyway?": In Search of Better Browsing
US20070101266A1 (en) Video summary description scheme and method and system of video summary description data generation for efficient overview and browsing
Mei et al. Near-lossless semantic video summarization and its applications to video analysis
US20090079840A1 (en) Method for intelligently creating, consuming, and sharing video content on mobile devices
WO2003088665A1 (en) Meta data edition device, meta data reproduction device, meta data distribution device, meta data search device, meta data reproduction condition setting device, and meta data distribution method
KR20070090751A (en) Image displaying method and video playback apparatus
JP2001028722A (en) Moving picture management device and moving picture management system
KR20030059398A (en) Multimedia data searching and browsing system
KR101440168B1 (en) Method for creating a new summary of an audiovisual document that already includes a summary and reports and a receiver that can implement said method
US20040181545A1 (en) Generating and rendering annotated video files
US20080320046A1 (en) Video data management apparatus
KR20140115659A (en) Apparatus, method and computer readable recording medium of creating and playing a live picture file

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08835341

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08835341

Country of ref document: EP

Kind code of ref document: A1