WO2009044351A1 - Generation of image data summarizing a sequence of video frames - Google Patents

Generation of image data summarizing a sequence of video frames

Info

Publication number
WO2009044351A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
data
frames
sequence
video frames
Application number
PCT/IB2008/053995
Other languages
French (fr)
Inventor
Johannes Weda
Mauro Barbieri
Original Assignee
Koninklijke Philips Electronics N.V.
Application filed by Koninklijke Philips Electronics N.V.
Publication of WO2009044351A1

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G06F16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/786 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using motion, e.g. object motion or camera motion
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034 Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel

Abstract

A method of generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, includes analyzing data obtained from the digital video data to select at least one video frame. At least one frame of the sequence of video frames is scaled down. The video data summarizing the video sequence is produced so as to include video data representing a scaled down version of each selected video frame. The step (27) of analyzing at least some of the video data includes performing a method of analyzing the video data for identifying video frames corresponding to close-up shots of at least one type.

Description

Generation of image data summarizing a sequence of video frames
FIELD OF THE INVENTION
The invention relates to a method of generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the method including: analyzing data obtained from the digital video data to select at least one video frame, and scaling down at least one frame of the sequence of video frames and producing the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame. The invention also relates to a system for generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the system being configured to: analyze data obtained from the digital video data to select at least one video frame, scale down at least one frame of the sequence of video frames, and produce the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame.
The invention also relates to a computer program.
BACKGROUND OF THE INVENTION
WO 2007/046708 discloses a method of displaying video data within result presentations in information access systems or information search systems. Video summaries may be created as a selection from the original video. After selection of the appropriate frames for the video summary and potential resizing to client devices, the resulting frames are compressed by encoding with a video codec. The simplest selection of frames for the video summary is a section from the beginning of the video. Another way is by analyzing the video, identifying scenes (uninterrupted camera shots), and selecting an image to represent each scene. These images may then be displayed as a slideshow to quickly visualize the video content. Alternatively, a number of frames can be selected from each scene and assembled as a video. A method of selection can be used to extract the most relevant scenes. Selection of which scenes to include in the video summary can be done by looking at the length and motion of a scene. Video thumbnails can be computed based on the same techniques as discussed for video summaries. The video thumbnail will typically extract much less data from the video than the video summary, typically, just a single still frame or a small set of still frames that can be viewed in an animated way.
A problem of the known method is that thumbnails of the frames selected from the selected scenes may not display well on small screens or in small screen areas because too much of a frame with much detail is removed when the thumbnail is created. This makes it difficult for users to use such thumbnails when selecting video segments to watch in full.
SUMMARY OF THE INVENTION
It is an object of the invention to provide a method, system and computer program of the types referred to above that result in video data summarizing a video segment that is informative when displayed on a small screen or within a small screen area.
This object is achieved by the method according to the invention, in which the step of analyzing includes performing a method of analyzing video data for identifying frames corresponding to close-up shots of at least one type.
By identifying and selecting from frames corresponding to close-up shots, frames with relatively low spatial frequencies are selected. Thus, such frames can be scaled down - whereby they are sub-sampled in space - to generate scaled down frames that are still informative. Because the close-up shots generally correspond to important elements of a video segment - they provide a detailed view of the actors, important dialogues for the storyline - they are of themselves relatively informative of the whole video segment. The scaled down selected frames are well-suited for display on small screens or within small parts of a larger screen area.
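The patent does not prescribe a way of verifying the low-spatial-frequency property, but the intuition can be illustrated with a short sketch. The following Python/OpenCV snippet (an illustration with assumed file names, not part of the patent) downscales a frame to thumbnail size and compares fine detail before and after using the variance of the Laplacian, a common sharpness proxy.

```python
import cv2

def detail_retention(frame_path, thumb_size=(90, 72)):
    """Fraction of fine detail that survives thumbnailing.

    Close-up frames are dominated by low spatial frequencies, so they
    lose proportionally less detail when scaled down than wide shots.
    """
    gray = cv2.cvtColor(cv2.imread(frame_path), cv2.COLOR_BGR2GRAY)
    thumb = cv2.resize(gray, thumb_size, interpolation=cv2.INTER_AREA)
    # Variance of the Laplacian is a common proxy for fine detail.
    full = cv2.Laplacian(gray, cv2.CV_64F).var()
    small = cv2.Laplacian(thumb, cv2.CV_64F).var()
    return small / full  # closer to 1.0 means more detail survives

# Hypothetical file names: a close-up typically scores higher here
# than a crowded wide shot.
print(detail_retention("closeup.png"), detail_retention("wide_shot.png"))
```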
An embodiment of the method includes selecting a plurality of video frames and assembling scaled down versions of the selected video frames into a video summary sequence. As a result, information is conveyed about a video segment representing an evolving sequence of events. The summary sequence is suitable for representing a plurality of events. As the video frames are analyzed for close-up shots and the appropriate video frames or at least video frames adjacent thereto are selected, each selected video frame is likely to show an important event. Because the selected plurality of video frames are scaled down, the summary sequence is suitable for displaying scenes representing important events in a clearly recognizable manner on a small screen area.
A variant includes obtaining at least one audio track segment for synchronized reproduction with the video summary sequence. As a result, an increase of the information content of the video summary is achieved.
A variant includes determining how many frames should be assembled into the video summary sequence in dependence on the lengths of the audio track segments.
As a result, synchronization is relatively easy to achieve. Audio information would become unintelligible more quickly if the audio tracks were to be adapted to the length of the video summary sequence, whereas the addition or omission of video frames will not generally affect the intelligibility of the video summary sequence to the same extent.
A variant includes obtaining text data associated with the video segment and synthesizing at least one of the audio track segments, based on the text data. As a result, audio information can be added to the video summary sequence without having to truncate dialogues or background music in any original soundtrack accompanying the video segment to be summarized.
In an embodiment, the method of analyzing video data comprises a method of analyzing at least some of the data representative of the video frames. As a result, the identification of video frames corresponding to close-up shots is not dependent on auxiliary data associated with the sequence of video frames.
In a variant, the method of analyzing at least some of the data representative of the video frames includes analyzing data representative of a version of the sequence of video frames obtainable by at least one of sub-sampling in time and reduction in data size of the video frames.
As a result, the analysis is more efficient and quicker. Objects in close-up shots generally occupy large sections of the frame area, so that sub-sampling and reducing in size will not generally result in the loss of so much information that the analysis no longer yields relevant results. In a variant, the method of analyzing at least some of the data representative of the video frames includes performing a method of detecting video frames containing at least one area corresponding to a face.
As a result, relatively informative video frames are selected, since the faces are indicative of dialogue and represent the main characters in a storyline. In a variant, the step of analyzing data obtained from the digital video data includes identifying video frames determined to contain at least one area corresponding to a face and occupying more than a pre-determined proportion of a total area represented by the video frame. It is thus ensured that close-ups are selected, rather than scenes containing, for example, extras.
An embodiment includes selecting a plurality of video frames and assembling scaled down versions of the selected video frames into a video summary sequence, wherein the step of selecting video frames includes analyzing video frames to determine at least one of a measure of brightness, of contrast and of camera motion.
As a result, sections are identified within the video segment to be summarized, so that, where there are many close-ups, it is possible to select frames from as many sections as possible. This allows for a comprehensive summary to be created.
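As an illustration of such per-frame measures (a sketch, not code from the patent), the following computes brightness as mean luminance, contrast as its standard deviation, and a rough camera-motion estimate from dense optical flow with OpenCV:

```python
import cv2
import numpy as np

def frame_measures(prev_gray, gray):
    """Brightness, contrast and a rough camera-motion estimate for one
    frame, given the previous frame; both inputs are 8-bit grayscale."""
    brightness = float(gray.mean())   # mean luminance
    contrast = float(gray.std())      # spread of luminance values
    # Dense optical flow; the mean flow magnitude is a crude stand-in
    # for camera motion (object motion contributes to it as well).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion = float(np.linalg.norm(flow, axis=2).mean())
    return brightness, contrast, motion
```

Section boundaries can then be placed where these measures change abruptly, so that frames are drawn from as many sections of the video segment as possible.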
According to another aspect of the invention, there is provided a system for generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the system being configured to: analyze data obtained from the digital video data to select at least one frame, scale down at least one frame of the sequence of video frames, and produce the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame, wherein the system is configured to analyze by performing a method of analyzing video data for identifying frames corresponding to close-up shots of at least one type.
The system is capable of producing video data that is at once informative and suitable for display on a small screen area. It is in particular useful for providing summaries for use by a user interface for accessing the original video segment or a scaled-down or sub-sampled version of the complete original video segment.
In an embodiment, the system is configured to perform a method according to the invention.
According to another aspect, the computer program according to the invention includes a set of instructions capable, when incorporated in a machine-readable medium, of causing a system having information processing capabilities to perform a method according to the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will be explained in further detail with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of a system for accessing video segments comprised of a sequence of video frames; Fig. 2 is a simplified screen shot illustrating a graphical user interface based on summaries of video segments; and
Fig. 3 is a flow chart illustrating a method of producing the summaries.
DETAILED DESCRIPTION OF THE EMBODIMENTS
In the following, an application using a portable media player 1 will be used to explain a method of generating video data summarizing a video segment comprised of a sequence of video frames to generate an audiovisual summary, as well as uses of the audiovisual summary. The portable media player 1 includes a central processing unit 2 for controlling the operation of the media player 1, and main memory 3. It further includes a data storage device 4 such as a magnetic or optical disk drive. Software enabling the media player 1 to carry out the methods outlined herein can be stored in the data storage device 4 or in Read-Only Memory 5. The media player is provided with means for rendering audiovisual data in perceptible form, which means include a cache memory 6, audio digital signal processor 7, audio amplifier 8, loudspeakers 9 and a display 10. The display 10 is a fixed-pixel-array display, and can be a Liquid Crystal Display, Organic Light-emitting Diode display, or one based on electronic ink technology. It can have a display resolution of 800 x 480 pixels or lower, e.g. 320 x 240 pixels.
The media player 1 also includes user controls for navigating a graphical user interface provided for selection of video segments. The graphical user interface allows selection of video segments for play-back on the media player 1 or on an external display device (not shown) attached to it, or for managing files including data encoding video segments, e.g. to organize them into directories or to transfer them from and to a personal computer 11 attached to the media player 1 via a communications interface 12.
The graphical user interface of the media player 1 is arranged to make audiovisual summaries of video segments available for rendering on the media player 1. In one embodiment, the media player 1 generates audiovisual summaries summarizing video segments encoded in files stored in the data storage device 4.
The graphical user interface may also allow a user of the media player to browse audiovisual summaries of video segments encoded in files available for transfer from a server 13 through a network 14, using a network interface 15 of the media player 1. The transfer may be on demand over an Internet Protocol (IP) network, or it may be a digital television service broadcast over an IP network or direct broadcast network, e.g. a digital cable or terrestrial network such as one based on DVB-T or DVB-H standards. The audiovisual summaries are generated externally and transferred to the media player first, prior to selection of the full video file for transfer or storage. Thus, unnecessary transfer and/or storage of large volumes of data is avoided.
In addition, or alternatively, data encoding video segments may be downloaded to the personal computer 11. The personal computer 11 is configured to provide a graphical user interface for selecting video segments by means of a display device 16 and a user input device 17. To this end, the personal computer 11 generates audiovisual summaries of video segments encoded in data stored on the personal computer 11 and/or it receives audiovisual summaries of video segments encoded in data available for transfer through the network 14 to the computer 11.
Figure 2 is a screen shot of a graphical user interface provided on the personal computer 11. Video summary sequences are rendered concurrently in separate areas 18-21 of the screen. Where the first area 18 has been selected, audio track segments associated with the video summary sequence being displayed in that area 18 are reproduced in synchronized fashion with the video summary sequence. Each of the video summary sequences comprises a selection of characteristic video frames from a sequence of which the original video segment is comprised. The video frames constituting the video summary sequence have been scaled down in resolution relative to the video frames on which they are based. In an embodiment, the video summary sequences are comprised of frames having a resolution of 90 x 72 pixels. In effect, the video summary sequences constitute video summaries in thumbnail format. This makes it possible to display video summary sequences concurrently in the four areas 18-21 of the screen. In an embodiment, the video summary sequences are of a relatively short duration, e.g. in the range of 15 to 20 seconds. This allows for a quick selection to be made, since it is possible to select each of the four areas 18-21 in turn to reproduce the video summary sequence with accompanying audio. In an embodiment (not shown) where, at any one time, only one video summary sequence is shown on a screen, e.g. that of the display 10 of the media player 1, browsing using video summaries is made feasible by the use of short summaries in thumbnail format.
Although it is possible to select and display characteristic frames individually in thumbnail format, as static key frames, the video summary sequences are both easier to represent on a small screen and more informative, because it is easier to appreciate what the storyline is and who the main actors are. Moreover, the video summary sequences, especially when reproduced in synchrony with audio track segments, are more interesting for users.
Figure 3 illustrates one way of generating an audiovisual summary summarizing a video segment comprised of a sequence of video frames. This method is based on data encoding the sequence of video frames and text data provided in association with the sequence of video frames.
Thus, the method includes obtaining text data (step 22). The text data is converted into one or more audio track segments (step 23) for synchronized reproduction with the video summary sequence of the audiovisual summary. The text data should be appropriate to the desired duration of the audiovisual summary when reproduced. A typical length would be about two sentences or lines of text. In one embodiment, the step 22 of obtaining text data includes obtaining a first set of text data and applying a text summarization technique to obtain a second set of text data of the desired length. In another embodiment, if a first set of text data obtained from a first source is too small, additional text data representing information about the video segment to be summarized is obtained, e.g. from a second source. In one embodiment, the text data is obtained from descriptive data associated with a data object comprising data encoding the video segment to be summarized. Such descriptive data is, for example, included with objects generated according to the MPEG-7 standard. Other ways of obtaining the data include electronic program guides and services and databases made accessible via the Internet.
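The patent leaves the speech synthesizer of step 23 unspecified. One possible realization, sketched below under the assumption that a local TTS engine suffices, uses the pyttsx3 library; the speaking rate and output file name are illustrative choices.

```python
import pyttsx3

def synthesize_track(summary_text, out_path="narration.wav"):
    """Render the (summarized) descriptive text to an audio file for
    synchronized playback with the video summary sequence."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 160)   # words per minute; assumed value
    engine.save_to_file(summary_text, out_path)
    engine.runAndWait()               # blocks until the file is written
    return out_path

synthesize_track("About two sentences of plot description, drawn e.g. "
                 "from MPEG-7 metadata or an electronic program guide.")
```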
In an embodiment, the synthesized speech is added to audio data extracted from a soundtrack in the video data, as a voice-over for example. This extracted audio data can encode the title song of a film, for example. Having generated the audio data for the audiovisual summary, it is possible to determine the duration of the audio track segment or segments to be reproduced with the video summary sequence. This is used to determine (step 24) how many frames are to be assembled into the video summary sequence.
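Step 24 then reduces to simple arithmetic: the duration of the synthesized audio multiplied by the summary frame rate gives the frame budget. A minimal sketch, with the frame rate assumed and the file name hypothetical:

```python
import wave

def frame_budget(audio_path, fps=25):
    """Derive how many summary frames to assemble from the duration of
    the synthesized audio track segment (step 24)."""
    with wave.open(audio_path, "rb") as w:
        duration = w.getnframes() / w.getframerate()  # seconds of audio
    return int(round(duration * fps))

# e.g. 16 s of narration at an assumed 25 fps -> a budget of 400 frames
print(frame_budget("narration.wav"))
```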
The video summary sequence is generated in parallel by obtaining video data (step 25) encoding the video segment, which comprises a sequence of video frames. In an embodiment, the video data is in a compressed video data format, for example according to the H.264 or MPEG-2 standard. As is known, such standards make use of I frames, P frames and B frames, or I, P and B macroblocks. I frames contain data encoded without reference to any other frames, whereas P frames and B frames require prior decoding of some other frames for complete decoding.
In the illustrated method, a next step 26 involves sub-sampling the video data in time and space. In particular, where the video data is obtained in one of the above-mentioned formats, only the I frames are retained. These are scaled down by reducing the pixel resolution.
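The patent names no tool for step 26. One convenient way to keep only the I frames and scale them in a single pass is ffmpeg's select and scale filters, as in this sketch (the 90 x 72 thumbnail size echoes the example given earlier; the invocation is an assumption, not the patent's method):

```python
import subprocess

def extract_scaled_iframes(video_path, out_pattern="iframe_%04d.png",
                           width=90, height=72):
    """Keep only the I frames (decodable in isolation) and scale them to
    thumbnail resolution in a single ffmpeg pass (cf. step 26)."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vf", f"select='eq(pict_type,I)',scale={width}:{height}",
        "-vsync", "vfr",   # keep only the timestamps of selected frames
        out_pattern,
    ], check=True)

extract_scaled_iframes("segment.mp4")  # hypothetical input file
```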
In a next step 27, the remaining video data representative of the remaining video frames is analyzed to identify video frames corresponding to close-up shots of at least one type. That is to say that the video frames are analyzed for the presence of characteristics indicating that the frame corresponds to a close-up shot. Depending on the characteristics, more or fewer types of close-up shots can be identified, e.g. those preceded by zooming in, all those including close-up images of humans, etc.
It is possible to identify close-up shots using a method comprising the steps of assigning portions of the frame to at least a first cluster or a second cluster, the clusters having different ranges of depth values associated therewith, and determining the shot type of the frame on the basis of whether both the first and second clusters have been assigned at least one portion or whether there is a stepped or gradual change in the difference between the depth values of the first and second clusters. Such a method is set out more fully in WO 2007/036823.
In the example used herein, the step 27 of analyzing the video data includes identifying video frames determined to contain at least one area corresponding to a face and occupying more than a pre-determined proportion of the total area represented by the video frame. Examples of suitable methods are described in Viola, P. and Jones, M., "Rapid object detection using a boosted cascade of simple features", Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001, Kauai, U.S.A., pp. 511-518, and in Rong, X. et al., "Robust multipose face detection in images", IEEE Transactions on Circuits and Systems for Video Technology, 14 (1), 1 Jan. 2004, pp. 31-41. These methods also make it possible to determine the location, size and pose (angles) of the faces, any one of which may be used as an additional criterion for selecting video frames in this step 27.
In addition to the identification of areas corresponding to faces, further video analyses are applied in some embodiments. The further video analyses can include analysis of the brightness, contrast, and camera motion. They may be applied to frames other than those selected in the preceding step 26, but adjacent to frames determined to contain at least one area corresponding to a face. On the basis of the analysis or analyses, a plurality of video frames is selected (step 28). Frames, such as B and P frames in the case of compressed video data, adjacent to the frames determined to contain at least one area corresponding to a face are obtained to construct a number of shots of approximately 75-100 frames each. The exact number is determined in dependence on the length of the audio track segments as determined in step 24. In other words, the time duration given by the speech synthesizer is used as input to set the proper thresholds for retrieving the segments with the close-up shots.
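As a concrete illustration of the face-area criterion of step 27 (a sketch, not the patent's implementation), the following uses OpenCV's pre-trained Viola-Jones cascade, in the spirit of the cited CVPR 2001 paper, and flags a frame as a close-up when a detected face exceeds an assumed fraction of the frame area; the patent does not specify the threshold value.

```python
import cv2

# Viola-Jones style detector in the spirit of the cited CVPR 2001 paper;
# opencv-python ships pre-trained frontal-face cascades.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_closeup(gray_frame, min_face_fraction=0.15):
    """Flag a frame as a close-up when a detected face occupies more
    than a pre-determined proportion of the frame area; the 0.15
    threshold is an assumption, not a value from the patent."""
    h, w = gray_frame.shape[:2]
    faces = cascade.detectMultiScale(gray_frame,
                                     scaleFactor=1.1, minNeighbors=5)
    return any((fw * fh) / float(w * h) > min_face_fraction
               for (_, _, fw, fh) in faces)
```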
Finally (step 29), the thumbnail video with synchronized audio track segments is assembled. The total amount of data thus generated is relatively small, allowing the audiovisual summaries to be used in devices such as personal digital hard disk recorders, media centers, small form factor devices such as mobile phones, personal digital assistants, etc.
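A final muxing pass in the spirit of step 29 could look as follows; the codec choice and the ffmpeg invocation are assumptions, since the patent fixes no output format:

```python
import subprocess

def assemble_summary(frames_pattern, audio_path,
                     out_path="summary.mp4", fps=25):
    """Mux the selected scaled-down frames with the synthesized audio
    track into one small thumbnail video (cf. step 29)."""
    subprocess.run([
        "ffmpeg", "-framerate", str(fps), "-i", frames_pattern,
        "-i", audio_path,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",  # widely playable
        "-shortest",            # stop with the shorter of the two streams
        out_path,
    ], check=True)

assemble_summary("selected_%04d.png", "narration.wav")  # hypothetical names
```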
It should be noted that the embodiments described above illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. In an alternative embodiment, the step 27 of analyzing at least some of the video data representative of the video frames is augmented by analyzing at least some of the video data representative of an accompanying synchronized original audio track to identify characteristic frames, for example.
'Means', as will be apparent to a person skilled in the art, are meant to include any hardware (such as separate or integrated circuits or electronic elements) or software (such as programs or parts of programs) which perform in operation or are designed to perform a specified function, be it a sole function or in conjunction with other functions, be it in isolation or in co-operation with other elements. 'Computer program' is to be understood to mean any software product stored on a computer-readable medium, such as an optical disk, downloadable via a network, such as the Internet, or marketable in any other manner.

Claims

CLAIMS:
1. Method of generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the method including: analyzing data obtained from the digital video data to select at least one video frame, scaling down at least one frame of the sequence of video frames, and producing the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame, wherein the step (27) of analyzing includes performing a method of analyzing video data for identifying video frames corresponding to close-up shots of at least one type.
2. Method according to claim 1, including selecting a plurality of video frames and assembling scaled down versions of the selected video frames into a video summary sequence.
3. Method according to claim 2, including obtaining at least one audio track segment for synchronized reproduction with the video summary sequence.
4. Method according to claim 3, including determining how many frames to assemble into the video summary sequence in dependence on the lengths of the audio track segments.
5. Method according to claim 3 or 4, including obtaining text data associated with the video segment and synthesizing at least one of the audio track segments based on the text data.
6. Method according to any one of claims 1 to 5, wherein the method of analyzing video data comprises a method of analyzing at least some of the data representative of the video frames.
7. Method according to claim 6, wherein the method of analyzing at least some of the data representative of the video frames includes analyzing data representative of a version of the sequence of video frames obtainable by at least one of sub-sampling in time and reduction in data size of the video frames.
8. Method according to claim 6 or 7, wherein the method of analyzing at least some of the data representative of the video frames includes performing a method of detecting video frames containing at least one area corresponding to a face.
9. Method according to claim 8, wherein the step (27) of analyzing data obtained from the digital video data includes identifying video frames determined to contain at least one area corresponding to a face and occupying more than a pre-determined proportion of a total area represented by the video frame.
10. Method according to any one of claims 6 to 9, including selecting a plurality of video frames and assembling scaled down versions of the selected video frames into a video summary sequence, wherein the step (28) of selecting video frames includes analyzing video frames to determine at least one of a measure of brightness, of contrast and of camera motion.
11. System for generating video data summarizing a video segment comprised of a sequence of video frames, which video segment is encoded in digital video data, the system being configured to: analyze data obtained from the digital video data to select at least one video frame, and scale down at least one frame of the sequence of video frames, and produce the video data summarizing the video sequence so as to include video data representing a scaled down version of each selected video frame, wherein the system is configured to analyze by performing a method of analyzing video data for identifying frames corresponding to close-up shots of at least one type.
12. System for generating video data according to claim 11, configured to perform a method according to any one of claims 1 to 10.
13. Computer program including a set of instructions capable, when incorporated in a machine-readable medium, of causing a system (1,11,13) having information processing capabilities to perform a method according to any one of claims 1 to 10.
PCT/IB2008/053995 2007-10-04 2008-10-01 Generation of image data summarizing a sequence of video frames WO2009044351A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07117874.3 2007-10-04
EP07117874 2007-10-04

Publications (1)

Publication Number Publication Date
WO2009044351A1 (en) 2009-04-09

Family

ID=40149762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/053995 WO2009044351A1 (en) 2007-10-04 2008-10-01 Generation of image data summarizing a sequence of video frames

Country Status (1)

Country Link
WO (1) WO2009044351A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040130567A1 (en) * 2002-08-02 2004-07-08 Ahmet Ekin Automatic soccer video analysis and summarization
WO2007126666A2 (en) * 2006-03-30 2007-11-08 Eastman Kodak Company Method for enabling preview of video files

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIENHART R ET AL: "VIDEO ABSTRACTING", COMMUNICATIONS OF THE ACM, vol. 40, no. 12, 1 December 1997 (1997-12-01), pages 55 - 62, XP000765719, ISSN: 0001-0782 *
SMITH M A ET AL: "Video Skimming for Quick Browsing based on Audio and Image Characterization", CARNEGIE MELLON UNIVERSITY, 30 June 1995 (1995-06-30), pages 1 - 22, XP002470969, Retrieved from the Internet <URL:http://www.informedia.cs.cmu.edu/documents/cmu-cs-95-186.pdf> [retrieved on 20080225] *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853286A (en) * 2010-05-20 2010-10-06 上海全土豆网络科技有限公司 Intelligent selection method of video thumbnails
CN101853286B (en) * 2010-05-20 2016-08-10 上海全土豆网络科技有限公司 Intelligent selection method of video thumbnails
EP3125245A1 (en) * 2015-07-27 2017-02-01 Thomson Licensing Method for selecting at least one sequence of frames and corresponding method for creating an audio and/or video digest, electronic devices, computer readable program product and computer readable storage medium
EP3125246A3 (en) * 2015-07-27 2017-03-08 Thomson Licensing Method for selecting sequences of frames and corresponding electronic device, computer readable program product and computer readable storage medium
WO2019054871A1 (en) * 2017-09-15 2019-03-21 Endemol Shine Ip B.V. A media system for providing searchable video data for generating a video comprising parts of said searched video data and a corresponding method
NL2019556B1 (en) * 2017-09-15 2019-03-27 Endemol Shine Ip B V A media system for providing searchable video data for generating a video comprising parts of said searched video data and a corresponding method.
CN113365104A (en) * 2021-06-04 2021-09-07 中国建设银行股份有限公司 Video concentration method and device
CN113365104B (en) * 2021-06-04 2022-09-09 中国建设银行股份有限公司 Video concentration method and device

Similar Documents

Publication Publication Date Title
US9372926B2 (en) Intelligent video summaries in information access
Yeung et al. Video visualization for compact presentation and fast browsing of pictorial content
KR100915847B1 (en) Streaming video bookmarks
JP4200741B2 (en) Video collage creation method and device, video collage display device, and video collage creation program
Truong et al. Video abstraction: A systematic review and classification
Aigrain et al. Content-based representation and retrieval of visual media: A state-of-the-art review
CN101443849B (en) Video browsing user interface
US9966112B1 (en) Systems and methods to associate multimedia tags with user comments and generate user modifiable snippets around a tag time for efficient storage and sharing of tagged items
JP5507386B2 (en) Generating video content from image sets
Bolle et al. Video query: Research directions
CN101300567B (en) Method for media sharing and authoring on the web
CN101150699B (en) Information processing apparatus, information processing method
Nam et al. Dynamic video summarization and visualization
Srinivasan et al. " What is in that video anyway?": In Search of Better Browsing
US20070101266A1 (en) Video summary description scheme and method and system of video summary description data generation for efficient overview and browsing
Mei et al. Near-lossless semantic video summarization and its applications to video analysis
US20090079840A1 (en) Method for intelligently creating, consuming, and sharing video content on mobile devices
WO2003088665A1 (en) Meta data edition device, meta data reproduction device, meta data distribution device, meta data search device, meta data reproduction condition setting device, and meta data distribution method
KR20070090751A (en) Image displaying method and video playback apparatus
JP2001028722A (en) Moving picture management device and moving picture management system
KR20030059398A (en) Multimedia data searching and browsing system
KR101440168B1 (en) Method for creating a new summary of an audiovisual document that already includes a summary and reports and a receiver that can implement said method
US20040181545A1 (en) Generating and rendering annotated video files
US20080320046A1 (en) Video data management apparatus
KR20140115659A (en) Apparatus, method and computer readable recording medium of creating and playing a live picture file

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08835341

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08835341

Country of ref document: EP

Kind code of ref document: A1