WO2009150567A2 - Method and apparatus for generating a summary of an audio/visual data stream - Google Patents

Info

Publication number
WO2009150567A2
Authority
WO
WIPO (PCT)
Prior art keywords
data stream
audio
shots
shot
visual
Prior art date
Application number
PCT/IB2009/052318
Other languages
French (fr)
Other versions
WO2009150567A3 (en
Inventor
Milan Pastrnak
Pedro Fonseca
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Priority to JP2011512260A priority Critical patent/JP2011523291A/en
Priority to EP09762094A priority patent/EP2291844A2/en
Priority to US12/994,164 priority patent/US8542983B2/en
Priority to CN2009801217253A priority patent/CN102057433A/en
Publication of WO2009150567A2 publication Critical patent/WO2009150567A2/en
Publication of WO2009150567A3 publication Critical patent/WO2009150567A3/en

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers

Definitions

  • the present invention relates to a method and apparatus for generating a summary of an audio/visual data stream.
  • One existing solution is to provide a user with a summary of the event which shows the main highlights.
  • Existing summarization systems typically aim at choosing the best segments of a video sequence that fit a pre-defined time interval. For example, if the user asks for a summary of 5 minutes, the system then detects which are the best segments that fit that summary of 5 minutes.
  • US 2007/0292112 discloses a method of searching a highlight in a film of a tennis game.
  • a plurality of long-field view shots are detected in the film and the audio energy of the long-field view shots is used to determine desired long-field view shots belonging to the highlights.
  • the audio energy is used to identify applause during the long-field view shots to determine the highlights.
  • the present invention seeks to provide a method whereby a summary that includes the most important highlights of an audio/visual data stream is generated.
  • the present invention further seeks to improve the accuracy of detecting the most important highlights.
  • a method of generating a summary of an audio/visual data stream comprising a plurality of consecutive frames having audio and visual properties
  • the method comprising the steps of: detecting a plurality of shots of an audio/visual data stream; determining a plurality of segments of the audio/visual data stream, each segment comprising a plurality of the shots of the data stream having similar visual properties; selecting a segment of the determined plurality of segments; for each shot of the selected segment of the data stream, extracting the audio in a plurality of consecutive frames which occur after the end of the shot; selecting at least one of the shots based on the extracted audio; and generating a summary to include the selected at least one of the shots.
  • an apparatus for generating a summary of an audio/visual data stream comprising a plurality of consecutive frames having audio and visual properties
  • the apparatus comprising: a shot detector for detecting a plurality of shots of an audio/visual data stream; a determining means for determining a plurality of segments of the audio/visual data stream, each segment comprising a plurality of the shots of the data stream having similar visual properties; a first selector for selecting a segment of the determined plurality of segments; an extractor for extracting, for each shot of the selected segment of the data stream, the audio in a plurality of consecutive frames which occur after the end of the shot; a second selector for selecting at least one of the shots based on the extracted audio; and a summary generator for generating a summary to include the selected at least one of the shots.
  • the user's experience of watching a summary is enriched since the interesting shots are identified and separated from the original audio/visual data stream thus forming the summary.
  • the summary will depend on how interesting each shot in the data stream is. Further, the criteria of "how interesting" a shot is can be adapted. The application can raise or lower the threshold in order to get correspondingly smaller or larger summaries.
  • This control can be offered in a very simple way to the user. As a result of this control, the summary that is generated includes the most important (e.g. the most interesting) highlights of the audio/visual data stream. The detected events are therefore combined and presented in a summary of a more customized format.
  • the important highlights are accurately detected by only extracting the audio of the frames immediately after the shots and selecting the shots based on the level of that audio.
  • the audio during the shots of the selected segment of the data stream is disregarded. This eliminates any errors in the audio readings that may be caused by unwanted noise such as the commentator's voice or sounds made by the players.
  • by extracting the audio after the shots and selecting shots based on the level of that audio, the natural delay in the audience response to important events is captured. This method is particularly effective when used in relation to tennis, for example, as the crowd is forbidden to make noise during the game play and can only react after each point has been played, i.e. after each rally.
  • the step of detecting a plurality of shots of an audio/visual data stream may comprise the steps of: comparing visual properties of each frame of the data stream with visual properties of a respective subsequent frame of the data stream; and detecting a plurality of shots, each shot comprising a plurality of consecutive frames for which compared visual properties are similar.
  • the step of determining a plurality of segments of an audio/visual data stream may comprise the steps of: comparing visual properties of each shot of the data stream; and determining a plurality of segments comprising a plurality of the shots for which compared visual properties are similar.
  • the shots containing similar visual properties define the segments. This enables certain events to be determined as highlights. For example, when an important event is present in the data stream, the shots that include the important event are likely to include the same visual properties since the important event will be covered by a plurality of visually similar shots. For example, in a tennis match, the important event may be a rally and the visual features of the shots that include the rally are likely to be similar. When the rally is over, the visual properties are likely to change in a particular shot and so this shot is not included in the segment. This enables the important events of a data stream to be determined in a simple but effective way.
  • the visual properties may comprise at least one of dominant colour, colour structure, colour layout, colour hue histogram, luma histogram, edge histograms, average histogram change and average pixel change.
  • a change in the histogram between two consecutive frames signifies a change in the visual properties of the frames and therefore frames that include the same event (i.e. frames that have the same visual properties) can easily be determined.
  • the step of selecting a segment of the determined plurality of segments comprises the step of: selecting the longest segment of the determined plurality of segments. As a result, the most interesting segment, e.g. the one containing all tennis rallies, can be distinguished from the less interesting segments.
  • the visual properties may also include the content of each of the plurality of consecutive frames and the method may further comprise the step of: detecting and analysing the content of each of the plurality of consecutive frames. This allows a more refined determination of the interesting frames. For example, the court lines present in the frames may be detected and analysed to enable a more accurate determination of important segments. Alternatively, the motion of the ball may be detected and analysed to extract the most interesting segments.
  • the step of extracting the audio in a plurality of consecutive frames which occur after the end of the shot comprises the step of: for each shot of the selected segment of the data stream, calculating the audio power of a plurality of consecutive frames which occur after the end of the shot for a predefined frequency band; and the step of selecting at least one of the shots based on the extracted audio comprises the step of: selecting at least one of the shots, wherein the audio power of the plurality of consecutive frames which occur after the end of the shot for the predefined frequency band exceeds a threshold.
  • the predefined frequency band may be predefined as the whole of the frequency spectrum or as a part of the frequency spectrum.
  • low frequency bands convey the general audio power
  • bands with slightly higher frequencies typically convey information about the human voice (for example, the voice of the commentator)
  • bands with even higher frequencies convey information regarding the general noise made by the audience.
  • the step of extracting the audio in a plurality of consecutive frames which occur after the end of the shot comprises the steps of: calculating a first moving average of audio power of the data stream over a first predetermined length of the data stream; calculating a second moving average of audio power of the data stream over a second predetermined length of the data stream; wherein the first predetermined length of the data stream is different from the second predetermined length of the data stream; and comparing the first and second moving averages.
  • the step of selecting at least one of the shots may comprise the step of: selecting each shot in which the difference between the first average and the second average exceeds a threshold.
  • the highlight detection algorithm is more independent of the characteristics of the broadcast, event, audience, commentator etc.
  • the audio power for each frequency band (or alternatively for the entire audio spectrum) is typically computed over a running window that analyses a group of audio frames lasting for a certain duration of time.
  • the audio power is often dependent on the characteristics of the broadcast, event, audience, commentator, etc. For example, if the stadium is full, the overall audio level or power will be much higher than if the stadium is half full but this does not necessarily mean that the match is less interesting.
  • the second averaging window normalises the audio so that the highlight detection algorithm is more independent of such characteristics.
  • the threshold may be a predetermined threshold.
  • the data stream may be representative of a racquet sport and the determined plurality of segments may correspond to a rally.
  • the user might record a tennis match, for example, on his personal video recorder.
  • the device is then able to present the most interesting rallies and skip those that did not get audience attention and therefore might be considered as of no high interest.
  • the technology can provide navigation through individual rallies and skip commercials and breaks between rallies or provide points of the actual game and skip beginning and end of the recording that does not belong to the actual tennis match.
  • Fig. 1 is a simplified schematic of apparatus for generating a summary of an audio/visual data stream
  • Fig. 2 is a flowchart of a method of generating a summary of an audio/visual data stream.
  • the apparatus 100 comprises an input terminal 102 for input of an audio/visual data stream into a shot detector 110.
  • the output of the shot detector 110 is connected to a determining means 112.
  • the output of the determining means 112 is connected to the input of a first selector 113.
  • the output of the first selector 113 is connected to the input of an extractor 114.
  • the output of the extractor 114 is connected to the input of a second selector 116.
  • the output of the second selector 116 is connected to the input of a summary generator 118.
  • the summary generator 118 outputs a summary via an output terminal 120 to a display such as a television or other display means.
  • An audio/visual data stream is received on the input terminal 102 (step 202) and is input into the shot detector 110.
  • the audio/visual data stream comprises a plurality of consecutive frames having audio and visual properties.
  • the audio/visual data stream may be available on local storage, received from a broadcast channel, or downloaded from the internet and may, for example, be representative of a racquet sport such as tennis, badminton, squash, table tennis, etc.
  • the shot detector 110 detects a plurality of shots of the audio/visual data stream (step 204).
  • the shot detector 110 uses the visual properties of the frames to detect sudden changes in the visual properties of consecutive frames.
  • the sudden changes in the visual properties may be, for example, sudden changes in the sets of histogram of the original colour spaces such as sudden changes in the original YCbCr colour space (the family of color spaces used in video systems, where Y is the luminance component and Cb and Cr are the blue and red chrominance components).
  • the sudden changes in the visual properties correspond to transitions between shots in the data stream.
  • the shot detector 110 outputs the detected plurality of shots of the audio/visual data stream to the determining means 112.
  • the determining means 112 determines a plurality of segments of the audio/visual data stream, each segment comprising a plurality of the shots of the data stream having similar visual properties (step 206), the plurality of the shots not necessarily all being consecutive. In other words, the determining means 112 clusters together visually similar shots to form a segment. For example, the determining means 112 clusters together two shots of the data stream if the difference between the visual properties of the two shots is below a predetermined value to form a segment.
  • the visual properties are, for example, at least one of dominant colour, colour structure, colour layout, colour hue histogram, luma histogram, edge histograms, average histogram change and average pixel change.
  • the visual properties may also include the content of each of the shots and the determining means 112 may detect and analyse the content of each of the plurality of shots.
  • the content includes, for example, court lines detected in the frames, tennis ball drops detected in the frames, faces detected in the frames or any other content.
  • the determining means 112 outputs the determined plurality of segments to the first selector 113.
  • the first selector 113 selects one segment of the determined plurality of segments (step 208). For example, the first selector 113 selects the longest segment of the determined plurality of segments. In this way, the first selector 113 selects the biggest cluster of similar shots. In some instances, the longest segment may indicate one of more interest or one which is more eventful. In the case of the data stream being representative of a racquet sport, the selected segment may, for example, correspond to rallies since the shots that correspond to rallies are visually very similar and are also the most frequently occurring shots in the broadcast of a racquet sport.
  • the first selector 113 outputs the selected segment to the extractor 114.
  • the extractor 114 extracts, for each shot of the selected segment of the data stream, the audio in a plurality of consecutive frames which occur after the end of the shot (step 210).
  • the extractor 114 disregards the audio during the shots. In other words, the extractor 114 extracts the audio power features in the intervals between the shots of the selected segment.
  • the extractor 114 only extracts the audio between the start and the extended end of each interval. This captures, for example, the natural delay in the audience response.
  • the extractor 114 extracts the audio by calculating, for each shot of the selected segment of the data stream, the audio power of a plurality of consecutive frames which occur after the end of the shot for a predefined frequency band.
  • the predefined frequency band may be predefined as a certain part of the frequency spectrum (for example, a frequency band of 1 to 5 kHz).
  • the extractor 114 only calculates the audio in the plurality of consecutive frames which occur after the end of the shot for that part of the frequency spectrum. By frequency filtering the extracted audio in this way, the influence of the different types of audio in the audio/visual data stream is better analysed.
  • low frequency bands convey the general audio power
  • bands with slightly higher frequencies typically convey information about the human voice (for example, the voice of the commentator)
  • bands with even higher frequencies convey information regarding the general noise made by the audience.
  • the frequency band may be predefined as the whole of the frequency spectrum (i.e. all frequencies).
  • the extractor 114 calculates the audio in the plurality of consecutive frames which occur after the end of the shot for the whole of the frequency spectrum (i.e. for all frequencies). This calculated audio is the global audio power.
  • the extractor 114 outputs the extracted audio to the second selector 116.
  • the second selector 116 selects at least one of the shots based on the extracted audio (step 212). For example, the second selector 116 selects at least one of the shots, wherein the audio power of the plurality of consecutive frames which occur after the end of the at least one of the shots for the predefined frequency band exceeds a threshold.
  • the threshold may be predetermined and can be set by the user, or adjusted automatically in response to the user's feedback, to a desired level so as to include more or less interesting highlights.
  • the extractor 114 extracts the audio by calculating two moving averages of audio power over two different lengths of the data stream.
  • the extractor 114 calculates a first moving average of audio power of the data stream over a first predetermined length of the data stream and calculates a second moving average of audio power of the data stream over a second predetermined length of the data stream.
  • the first predetermined length of the data stream is different from the second predetermined length of the data stream.
  • the extractor 114 calculates a first moving average for a short window of the data stream (e.g. 1 second) and a second moving average for a long window of the data stream (e.g. 20 seconds).
  • the second averaging window is typically larger than the first one (usually by a factor of the order of 10) and captures the "global" characteristics of the audio.
  • the extractor 114 therefore processes the audio power features in selected intervals of the data stream in order to classify, for example, the response of an audience to events at the court of a tennis match. The extractor 114 then compares the first and second moving averages.
  • the extractor 114 outputs the compared first and second moving averages of the audio power for each shot to the second selector 116.
  • the second selector 116 selects each shot in which the difference between the first running average and the second running average exceeds a threshold.
  • the selector 116 detects any sudden rise of audio power above the general characteristics.
  • the threshold may be predetermined and can be set by the user, or adjusted automatically in response to the user's feedback, to a desired level so as to include more or less interesting highlights.
  • the second selector 116 outputs the selected at least one of the shots into the summary generator 118.
  • the summary generator 118 generates a summary to include the selected at least one of the shots (step 214) and outputs the summary via the output terminal 120 for display by, for example a television or any other display means.
  • 'Means' as will be apparent to a person skilled in the art, are meant to include any hardware (such as separate or integrated circuits or electronic elements) or software (such as programs or parts of programs) which reproduce in operation or are designed to reproduce a specified function, be it solely or in conjunction with other functions, be it in isolation or in co-operation with other elements.
  • the invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the apparatus claim enumerating several means, several of these means can be embodied by one and the same item of hardware.
  • 'Computer program product' is to be understood to mean any software product stored on a computer-readable medium, such as a floppy disk, downloadable via a network, such as the Internet, or marketable in any other manner.

Landscapes

  • Television Signal Processing For Recording (AREA)
  • Studio Devices (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

A method of generating a summary of an audio/visual data stream is provided, the data stream comprising a plurality of consecutive frames having audio and visual properties. A plurality of shots of an audio/visual data stream are detected (step 204). A plurality of segments of the audio/visual data stream are determined (step 206), each segment comprising a plurality of the shots of the data stream having similar visual properties. A segment of the determined plurality of segments is selected (step 208). For each shot of said selected segment of said data stream, the audio in a plurality of consecutive frames which occur after the end of said shot is extracted (step 210). At least one of the shots is selected based on the extracted audio (step 212). A summary is generated to include the selected at least one of the shots (step 214).

Description

Method and apparatus for generating a summary of an audio/visual data stream
FIELD OF THE INVENTION
The present invention relates to a method and apparatus for generating a summary of an audio/visual data stream.
BACKGROUND TO THE INVENTION
Watching broadcast sport events has become increasingly popular, as reflected by the increasing number of sport channels. However, the vast amount of available content makes it impossible for a user to watch all of it.
One existing solution is to provide a user with a summary of the event which shows the main highlights. Existing summarization systems typically aim at choosing the best segments of a video sequence that fit a pre-defined time interval. For example, if the user asks for a summary of 5 minutes, the system then detects which are the best segments that fit that summary of 5 minutes.
Tennis is a very widely watched sport, and even though there are usually no more than three or four tournaments broadcast at the same time, the number of matches (especially during the initial rounds of the competitions) is high enough to prevent users from watching all of the matches. Moreover, the structure of tennis corresponds to an alternating sequence of rallies and breaks, and the breaks are quite often filled with commercials. As a result, it is desirable for the user to be able to watch the highlights as opposed to the complete match, in particular, to watch those rallies that are interesting, spectacular or important for the end result.
US 2007/0292112 discloses a method of searching a highlight in a film of a tennis game. A plurality of long-field view shots are detected in the film and the audio energy of the long-field view shots is used to determine desired long-field view shots belonging to the highlights. For example, the audio energy is used to identify applause during the long-field view shots to determine the highlights.
However, from the method of US 2007/0292112, it is not possible to determine the most important (for example, the most interesting) highlights. Further, the audio energy used to identify applause is not particularly accurate as it is likely to include unwanted noise such as the commentator's voice-over or sounds made by the players such as screams, ball hits, etc.
SUMMARY OF INVENTION
The present invention seeks to provide a method whereby a summary that includes the most important highlights of an audio/visual data stream is generated. The present invention further seeks to improve the accuracy of detecting the most important highlights.
This is achieved, according to an aspect of the invention, by a method of generating a summary of an audio/visual data stream, the data stream comprising a plurality of consecutive frames having audio and visual properties, the method comprising the steps of: detecting a plurality of shots of an audio/visual data stream; determining a plurality of segments of the audio/visual data stream, each segment comprising a plurality of the shots of the data stream having similar visual properties; selecting a segment of the determined plurality of segments; for each shot of the selected segment of the data stream, extracting the audio in a plurality of consecutive frames which occur after the end of the shot; selecting at least one of the shots based on the extracted audio; and generating a summary to include the selected at least one of the shots.
This is also achieved, according to another aspect of the invention, by an apparatus for generating a summary of an audio/visual data stream, the data stream comprising a plurality of consecutive frames having audio and visual properties, the apparatus comprising: a shot detector for detecting a plurality of shots of an audio/visual data stream; a determining means for determining a plurality of segments of the audio/visual data stream, each segment comprising a plurality of the shots of the data stream having similar visual properties; a first selector for selecting a segment of the determined plurality of segments; an extractor for extracting, for each shot of the selected segment of the data stream, the audio in a plurality of consecutive frames which occur after the end of the shot; a second selector for selecting at least one of the shots based on the extracted audio; and a summary generator for generating a summary to include the selected at least one of the shots.
In this way, the user's experience of watching a summary (for example, highlights such as tennis highlights) is enriched since the interesting shots are identified and separated from the original audio/visual data stream thus forming the summary. Advantageously, the summary will depend on how interesting each shot in the data stream is. Further, the criteria of "how interesting" a shot is can be adapted. The application can raise or lower the threshold in order to get correspondingly smaller or larger summaries. This control can be offered in a very simple way to the user. As a result of this control, the summary that is generated includes the most important (e.g. the most interesting) highlights of the audio/visual data stream. The detected events are therefore combined and presented in a summary of a more customized format.
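The effect of the adjustable threshold on summary size can be illustrated with a minimal Python sketch. The function name `build_summary`, the shot labels and the interest scores are all hypothetical and are not taken from the application; the scores simply stand in for whatever audio-based measure is used to rate each shot.

```python
from typing import List, Sequence, Tuple


def build_summary(shot_scores: Sequence[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep, in original order, every shot whose score exceeds the threshold."""
    return [shot for shot, score in shot_scores if score > threshold]


# Hypothetical interest scores for four rallies (illustrative values only).
shot_scores = [("rally_1", 0.2), ("rally_2", 0.9), ("rally_3", 0.5), ("rally_4", 0.7)]

short_summary = build_summary(shot_scores, threshold=0.8)  # high threshold, few shots
long_summary = build_summary(shot_scores, threshold=0.4)   # low threshold, more shots

print(short_summary)
print(long_summary)
```

Raising the threshold keeps only the highest-scoring shots, while lowering it admits more of them, which is the single control the text suggests exposing to the user.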
Further, the important highlights are accurately detected by only extracting the audio of the frames immediately after the shots and selecting the shots based on the level of that audio. In other words, the audio during the shots of the selected segment of the data stream is disregarded. This eliminates any errors in the audio readings that may be caused by unwanted noise such as the commentator's voice or sounds made by the players. Further, by extracting the audio after the shots and selecting shots based on the level of that audio, the natural delay in the audience response to important events is captured. This method is particularly effective when used in relation to tennis, for example, as the crowd is forbidden to make noise during the game play and can only react after each point has been played, i.e. after each rally.
The step of detecting a plurality of shots of an audio/visual data stream may comprise the steps of: comparing visual properties of each frame of the data stream with visual properties of a respective subsequent frame of the data stream; and detecting a plurality of shots, each shot comprising a plurality of consecutive frames for which compared visual properties are similar. This provides an effective way of determining the shots that are focussing on the same event by analysing the change in the visual properties of consecutive frames, for example, when the visual properties of the frames change from a long-field view shot to a short field view shot. The frames that contain similar visual properties are likely to be of the same view shot and can therefore easily be determined. In this way, the transitions between shots are identified thus providing a simple, yet effective way of detecting the different shots in the data stream.
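The frame-to-frame comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each frame is represented by a normalised histogram, the dissimilarity measure is a sum of absolute bin differences, and the names `histogram_distance`, `detect_shots` and `cut_threshold` are hypothetical. The application does not prescribe a particular measure or threshold value.

```python
from typing import List, Sequence, Tuple

Histogram = Sequence[float]


def histogram_distance(a: Histogram, b: Histogram) -> float:
    """Sum of absolute bin differences between two normalised histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_shots(frame_histograms: Sequence[Histogram],
                 cut_threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return shots as inclusive (first_frame, last_frame) index pairs.

    A new shot starts wherever two consecutive frames differ by more
    than cut_threshold (an assumed, illustrative value).
    """
    if not frame_histograms:
        return []
    shots = []
    start = 0
    for i in range(1, len(frame_histograms)):
        if histogram_distance(frame_histograms[i - 1], frame_histograms[i]) > cut_threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(frame_histograms) - 1))
    return shots


# Toy 3-bin histograms: a wide court view and a close-up (illustrative only).
wide = [0.7, 0.2, 0.1]
close = [0.1, 0.3, 0.6]
frames = [wide, wide, wide, close, close, wide, wide]

shots = detect_shots(frames)
print(shots)
```

Frames within a shot differ little from their neighbours, so only the abrupt changes between the wide view and the close-up are reported as shot boundaries.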
The step of determining a plurality of segments of an audio/visual data stream may comprise the steps of: comparing visual properties of each shot of the data stream; and determining a plurality of segments comprising a plurality of the shots for which compared visual properties are similar. As a result, the shots containing similar visual properties define the segments. This enables certain events to be determined as highlights. For example, when an important event is present in the data stream, the shots that include the important event are likely to include the same visual properties since the important event will be covered by a plurality of visually similar shots. For example, in a tennis match, the important event may be a rally and the visual features of the shots that include the rally are likely to be similar. When the rally is over, the visual properties are likely to change in a particular shot and so this shot is not included in the segment. This enables the important events of a data stream to be determined in a simple but effective way.
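One possible way of grouping visually similar shots into segments is sketched below in Python. The greedy strategy, the per-shot feature vectors, the distance measure and the names `cluster_shots` and `max_distance` are assumptions made for illustration; the application only requires that shots with similar visual properties end up in the same segment.

```python
from typing import List, Sequence


def feature_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two shot feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_shots(shot_features: Sequence[Sequence[float]],
                  max_distance: float = 0.3) -> List[List[int]]:
    """Greedily group shot indices into segments of visually similar shots.

    A shot joins the first segment whose first member lies within
    max_distance (an assumed value); otherwise it starts a new segment.
    The shots of a segment need not be consecutive in the stream.
    """
    segments: List[List[int]] = []
    for index, features in enumerate(shot_features):
        for segment in segments:
            if feature_distance(shot_features[segment[0]], features) < max_distance:
                segment.append(index)
                break
        else:
            segments.append([index])
    return segments


# Hypothetical per-shot features, e.g. an averaged colour histogram per shot.
shot_features = [
    [0.70, 0.20, 0.10],  # rally view
    [0.10, 0.30, 0.60],  # close-up
    [0.68, 0.22, 0.10],  # rally view
    [0.20, 0.60, 0.20],  # crowd
    [0.72, 0.18, 0.10],  # rally view
]

segments = cluster_shots(shot_features)
print(segments)
```

In this toy input the three rally-view shots fall into one segment even though other shots occur between them.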
The visual properties may comprise at least one of dominant colour, colour structure, colour layout, colour hue histogram, luma histogram, edge histograms, average histogram change and average pixel change. For example, a change in the histogram between two consecutive frames signifies a change in the visual properties of the frames and therefore frames that include the same event (i.e. frames that have the same visual properties) can easily be determined.
The step of selecting a segment of the determined plurality of segments comprises the step of: selecting the longest segment of the determined plurality of segments. As a result, the most interesting segment, e.g. the one containing all tennis rallies, can be distinguished from the less interesting segments.
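Selecting the longest segment can be sketched in Python as below. The text does not say how length is measured, so this example assumes it is the total number of frames over all shots in the segment; counting shots would be an equally plausible reading. The names `segment_length` and `select_longest_segment` and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_length(segment: Sequence[int], shots: Sequence[Tuple[int, int]]) -> int:
    """Total number of frames covered by the shots of one segment."""
    return sum(shots[i][1] - shots[i][0] + 1 for i in segment)


def select_longest_segment(segments: Sequence[List[int]],
                           shots: Sequence[Tuple[int, int]]) -> List[int]:
    """Return the segment with the largest total frame count (assumed criterion)."""
    return max(segments, key=lambda segment: segment_length(segment, shots))


# Shots as inclusive (first_frame, last_frame) pairs; segments as lists of shot indices.
shots = [(0, 99), (100, 129), (130, 249), (250, 269), (270, 399)]
segments = [[0, 2, 4], [1], [3]]

longest = select_longest_segment(segments, shots)
print(longest, segment_length(longest, shots))
```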
The visual properties may also include the content of each of the plurality of consecutive frames and the method may further comprise the step of: detecting and analysing the content of each of the plurality of consecutive frames. This allows a more refined determination of the interesting frames. For example, the court lines present in the frames may be detected and analysed to enable a more accurate determination of important segments. Alternatively, the motion of the ball may be detected and analysed to extract the most interesting segments.
According to one embodiment, the step of extracting the audio in a plurality of consecutive frames which occur after the end of the shot comprises the step of: for each shot of the selected segment of the data stream, calculating the audio power of a plurality of consecutive frames which occur after the end of the shot for a predefined frequency band; and the step of selecting at least one of the shots based on the extracted audio comprises the step of: selecting at least one of the shots, wherein the audio power of the plurality of consecutive frames which occur after the end of the shot for the predefined frequency band exceeds a threshold.
The predefined frequency band may be predefined as the whole of the frequency spectrum or as a part of the frequency spectrum.
As a result of frequency filtering the extracted audio in this way, the influence of the different types of audio in the audio/visual data stream is better analysed. For example, low frequency bands convey the general audio power, bands with slightly higher frequencies typically convey information about the human voice (for example, the voice of the commentator) and bands with even higher frequencies convey information regarding the general noise made by the audience.
According to an alternative embodiment, the step of extracting the audio in a plurality of consecutive frames which occur after the end of the shot comprises the steps of: calculating a first moving average of audio power of the data stream over a first predetermined length of the data stream; calculating a second moving average of audio power of the data stream over a second predetermined length of the data stream; wherein the first predetermined length of the data stream is different from the second predetermined length of the data stream; and comparing the first and second moving averages. The step of selecting at least one of the shots may comprise the step of: selecting each shot in which the difference between the first average and the second average exceeds a threshold.
In this way, the highlight detection algorithm is more independent of the characteristics of the broadcast, event, audience, commentator etc. For example, the audio power for each frequency band (or alternatively for the entire audio spectrum) is typically computed over a running window that analyses a group of audio frames lasting for a certain duration of time. However, the audio power is often dependent on the characteristics of the broadcast, event, audience, commentator, etc. For example, if the stadium is full, the overall audio level or power will be much higher than if the stadium is half full, but the lower level in the half-full stadium does not necessarily mean that the match is less interesting. The second averaging window normalises the audio so that the highlight detection algorithm is more independent of such characteristics.
The threshold may be a predetermined threshold.
The data stream may be representative of a racquet sport and the selected segment may correspond to rallies. In this way, the user might record a tennis match, for example, on his personal video recorder. The device is then able to present the most interesting rallies and skip those that did not get audience attention and may therefore be considered of little interest. Further, the technology can provide navigation through individual rallies and skip commercials and breaks between rallies, or provide points of the actual game and skip the beginning and end of the recording that do not belong to the actual tennis match.
BRIEF DESCRIPTION OF DRAWINGS
For a more complete understanding of the present invention, reference is now made to the following description taken in conjunction with the accompanying drawings in which:
Fig. 1 is a simplified schematic of apparatus for generating a summary of an audio/visual data stream; and
Fig. 2 is a flowchart of a method of generating a summary of an audio/visual data stream.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
With reference to Figure 1, the apparatus 100 comprises an input terminal 102 for input of an audio/visual data stream into a shot detector 110. The output of the shot detector 110 is connected to a determining means 112. The output of the determining means 112 is connected to the input of a first selector 113. The output of the first selector 113 is connected to the input of an extractor 114. The output of the extractor 114 is connected to the input of a second selector 116. The output of the second selector 116 is connected to the input of a summary generator 118. The summary generator 118 outputs a summary via an output terminal 120 to a display such as a television or other display means.
Operation of the apparatus of Figure 1 will now be described in detail with reference to Figure 2. An audio/visual data stream is received on the input terminal 102 (step 202) and is input into the shot detector 110. The audio/visual data stream comprises a plurality of consecutive frames having audio and visual properties. The audio/visual data stream may be available on local storage, received from a broadcast channel, or downloaded from the internet and may, for example, be representative of a racquet sport such as tennis, badminton, squash, table tennis etc. The shot detector 110 detects a plurality of shots of the audio/visual data stream (step 204). This is achieved by the shot detector 110 comparing the visual properties of each frame of the audio/visual data stream with the visual properties of a respective subsequent frame of the data stream and determining a plurality of shots, each comprising a plurality of consecutive frames for which compared visual properties are similar. In other words, the shot detector 110 uses the visual properties of the frames to detect sudden changes in the visual properties of consecutive frames. The sudden changes in the visual properties may be, for example, sudden changes in the sets of histograms of the original colour spaces such as sudden changes in the original YCbCr colour space (the family of colour spaces used in video systems, where Y is the luminance component and Cb and Cr are the blue and red chrominance components). The sudden changes in the visual properties correspond to transitions between shots in the data stream.
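The frame-to-frame comparison performed by the shot detector 110 can be illustrated with a minimal sketch. It assumes that each frame has already been reduced to a normalised histogram (a list of bin weights summing to 1). The function names `histogram_distance` and `detect_shots` and the cut threshold of 0.5 are hypothetical choices made for illustration and are not taken from the specification.

```python
from typing import List, Sequence, Tuple


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance between two normalised histograms (range 0..1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shots(histograms: List[Sequence[float]],
                 cut_threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return shots as (first_frame, last_frame) index pairs, both inclusive.

    A transition is declared wherever a frame's histogram differs from that
    of the next frame by more than cut_threshold (an assumed value).
    """
    if not histograms:
        return []
    shots = []
    start = 0
    for i in range(len(histograms) - 1):
        if histogram_distance(histograms[i], histograms[i + 1]) > cut_threshold:
            shots.append((start, i))  # the current shot ends at frame i
            start = i + 1             # the next shot begins at frame i + 1
    shots.append((start, len(histograms) - 1))
    return shots
```

Consecutive frames whose histograms stay within the threshold are grouped into one shot, so a sudden change in the histogram marks the boundary between two shots.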
The shot detector 110 outputs the detected plurality of shots of the audio/visual data stream to the determining means 112. The determining means 112 determines a plurality of segments of the audio/visual data stream, each segment comprising a plurality of the shots of the data stream having similar visual properties (step 206), the plurality of the shots not necessarily all being consecutive. In other words, the determining means 112 clusters together visually similar shots to form a segment. For example, the determining means 112 clusters together two shots of the data stream if the difference between the visual properties of the two shots is below a predetermined value to form a segment.
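The clustering performed by the determining means 112 might be sketched as follows. This is one simple greedy scheme, assumed here for illustration only; the specification does not prescribe a particular clustering algorithm. Each shot is assumed to be summarised by a feature vector (for example an averaged histogram), and `feature_distance`, `cluster_shots` and the value 0.3 for the predetermined value are hypothetical.

```python
from typing import List, Sequence


def feature_distance(f1: Sequence[float], f2: Sequence[float]) -> float:
    """L1 distance between the feature vectors of two shots."""
    return sum(abs(a - b) for a, b in zip(f1, f2))


def cluster_shots(shot_features: List[Sequence[float]],
                  max_distance: float = 0.3) -> List[List[int]]:
    """Group shot indices into segments of visually similar shots.

    A shot joins the first segment whose founding shot lies within
    max_distance of it; otherwise it founds a new segment. The shots of a
    segment therefore need not be consecutive in the data stream.
    """
    segments: List[List[int]] = []
    for index, features in enumerate(shot_features):
        for segment in segments:
            founder = shot_features[segment[0]]
            if feature_distance(features, founder) < max_distance:
                segment.append(index)
                break
        else:
            # No existing segment is similar enough: start a new one.
            segments.append([index])
    return segments
```

Comparing each shot only with the founding shot of a segment keeps the sketch short; a practical system could equally compare against a running mean of the segment.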
The visual properties are, for example, at least one of dominant colour, colour structure, colour layout, colour hue histogram, luma histogram, edge histograms, average histogram change and average pixel change. The visual properties may also include the content of each of the shots and the determining means 112 may detect and analyse the content of each of the plurality of shots. The content, for example, includes court lines detected in the frames, tennis ball drops detected in the frames, faces detected in the frames or any other content.
The determining means 112 outputs the determined plurality of segments to the first selector 113.
The first selector 113 selects one segment of the determined plurality of segments (step 208). For example, the first selector 113 selects the longest segment of the determined plurality of segments. In this way, the first selector 113 selects the biggest cluster of similar shots. In some instances, the longest segment may indicate one of more interest or one which is more eventful. In the case of the data stream being representative of a racquet sport, the selected segment may, for example, correspond to rallies since the shots that correspond to rallies are visually very similar and are also the most frequently occurring shots in the broadcast of a racquet sport.
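Selecting the longest segment can be sketched in a few lines. The sketch assumes that length is measured as the total number of frames in a segment (the number of shots would be an equally plausible measure), that shots are given as inclusive (first_frame, last_frame) pairs, and that segments are lists of shot indices. The name `select_longest_segment` is hypothetical.

```python
from typing import List, Tuple


def select_longest_segment(segments: List[List[int]],
                           shots: List[Tuple[int, int]]) -> List[int]:
    """Return the segment whose shots cover the most frames in total.

    Raises ValueError if `segments` is empty.
    """
    def total_frames(segment: List[int]) -> int:
        return sum(shots[i][1] - shots[i][0] + 1 for i in segment)

    return max(segments, key=total_frames)
```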
The first selector 113 outputs the selected segment to the extractor 114.
The extractor 114 extracts, for each shot of the selected segment of the data stream, the audio in a plurality of consecutive frames which occur after the end of the shot (step 210). The extractor 114 disregards the audio during the shots. In other words, the extractor 114 extracts the audio power features in the intervals between the shots of the selected segment. By extracting the audio in a plurality of consecutive frames which occur after the end of the shots, the extractor 114 only extracts the audio between the start and the extended end of each interval. This captures, for example, the natural delay in the audience response. In one embodiment, the extractor 114 extracts the audio by calculating, for each shot of the selected segment of the data stream, the audio power of a plurality of consecutive frames which occur after the end of the shot for a predefined frequency band. The predefined frequency band may be predefined as a certain part of the frequency spectrum (for example, a frequency band of 1 to 5 kHz). In this case, the extractor 114 only calculates the audio power in the plurality of consecutive frames which occur after the end of the shot for that part of the frequency spectrum. By frequency filtering the extracted audio in this way, the influence of the different types of audio in the audio/visual data stream is better analysed. For example, low frequency bands convey the general audio power, bands with slightly higher frequencies typically convey information about the human voice (for example, the voice of the commentator) and bands with even higher frequencies convey information regarding the general noise made by the audience. Alternatively, the frequency band may be predefined as the whole of the frequency spectrum (i.e. all frequencies). In this case, the extractor 114 calculates the audio power in the plurality of consecutive frames which occur after the end of the shot for the whole of the frequency spectrum (i.e. for all frequencies). This calculated audio power is the global audio power.
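The audio power in a predefined frequency band can be illustrated with the following sketch. It assumes that the audio following the end of a shot is available as a list of PCM samples and uses a direct discrete Fourier transform so that only the standard library is needed; a practical implementation would use an FFT or a band-pass filter. The function name `band_power` and its parameters are hypothetical.

```python
import cmath
import math
from typing import Sequence


def band_power(samples: Sequence[float], sample_rate: float,
               f_lo: float, f_hi: float) -> float:
    """Mean power of `samples` contributed by frequencies in [f_lo, f_hi] Hz.

    Direct O(N^2) DFT, intended for short windows only. Passing f_lo = 0 and
    f_hi = sample_rate / 2 gives the power over the whole spectrum.
    """
    n = len(samples)
    if n == 0:
        return 0.0
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if not (f_lo <= freq <= f_hi):
            continue
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        # Bins other than DC and Nyquist also account for their mirror bin.
        weight = 1.0 if k == 0 or 2 * k == n else 2.0
        power += weight * abs(coeff) ** 2 / (n * n)
    return power
```

For a unit-amplitude sine wave that falls inside the band, the function returns approximately 0.5, the mean power of such a sine; for a band that excludes it, the result is approximately zero.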
The extractor 114 outputs the extracted audio to the second selector 116. The second selector 116 selects at least one of the shots based on the extracted audio (step 212). For example, the second selector 116 selects at least one of the shots, wherein the audio power of the plurality of consecutive frames which occur after the end of the at least one of the shots for the predefined frequency band exceeds a threshold.
In this way, the shots that provoked a more intense response are determined. These shots are most likely to be more interesting to the audience or the commentator. The threshold may be predetermined and can be set by the user, or adjusted automatically in response to the user's feedback, to a level that admits more or fewer highlights as desired.
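The threshold-based selection of step 212 can be sketched as follows, assuming that an audio power value has already been computed for every frame and that shots are given as inclusive (first_frame, last_frame) pairs. The name `select_shots_by_response`, the window of three frames and the threshold of 1.0 are hypothetical values chosen for illustration.

```python
from typing import List, Sequence, Tuple


def select_shots_by_response(frame_power: Sequence[float],
                             shots: List[Tuple[int, int]],
                             n_after: int = 3,
                             threshold: float = 1.0) -> List[int]:
    """Return indices of shots followed by a strong audio response.

    Only the n_after frames immediately after the end of each shot are
    examined; the audio during the shot itself is disregarded.
    """
    selected = []
    for index, (_, last_frame) in enumerate(shots):
        after = frame_power[last_frame + 1:last_frame + 1 + n_after]
        # Skip shots that end at the very end of the stream (no frames left).
        if after and sum(after) / len(after) > threshold:
            selected.append(index)
    return selected
```

Raising the threshold keeps only the shots followed by the strongest response, while lowering it admits more shots into the summary.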
In an alternative embodiment, the extractor 114 extracts the audio by calculating two moving averages of audio power over two different lengths of the data stream. In other words, the extractor 114 calculates a first moving average of audio power of the data stream over a first predetermined length of the data stream and calculates a second moving average of audio power of the data stream over a second predetermined length of the data stream. The first predetermined length of the data stream is different from the second predetermined length of the data stream. For example, the extractor 114 calculates a first moving average for a short window of the data stream (e.g. 1 second) and a second moving average for a long window of the data stream (e.g. 20 seconds). The second averaging window is typically larger than the first one (usually by a factor of the order of 10) and captures the "global" characteristics of the audio. The extractor 114 therefore processes the audio power features in selected intervals of the data stream in order to classify, for example, the response of an audience to events on the court in a tennis match. The extractor 114 then compares the first and second moving averages.
The extractor 114 outputs the compared first and second moving averages of the audio power for each shot to the second selector 116.
The second selector 116 selects each shot in which the difference between the first moving average and the second moving average exceeds a threshold. In other words, by comparing the audio power computed for the first window with the audio power computed for the second window, the second selector 116 detects any sudden rise of audio power above the general characteristics. Where the difference between the first moving average and the second moving average exceeds a threshold, the response of the audience is considered as one reflecting a highlight. Again, the threshold may be predetermined and can be set by the user, or adjusted automatically in response to the user's feedback, to a level that admits more or fewer highlights as desired.
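The comparison of a short and a long moving average can be sketched as follows. The sketch assumes one audio power value per frame, trailing (backward-looking) windows, and that the averages are sampled a few frames after the end of each shot. All names and parameter values are hypothetical; at 25 frames per second, windows of 1 second and 20 seconds would correspond to 25 and 500 frames.

```python
from typing import List, Sequence, Tuple


def trailing_average(values: Sequence[float], end: int, length: int) -> float:
    """Mean of up to `length` values ending at index `end` (inclusive)."""
    start = max(0, end - length + 1)
    window = values[start:end + 1]
    return sum(window) / len(window)


def select_by_relative_rise(frame_power: Sequence[float],
                            shots: List[Tuple[int, int]],
                            short_len: int, long_len: int,
                            n_after: int, threshold: float) -> List[int]:
    """Return indices of shots followed by a rise above the local norm.

    For each shot the short and long moving averages are evaluated n_after
    frames after its last frame; the shot is selected when the short average
    exceeds the long one by more than `threshold`.
    """
    if not frame_power:
        return []
    selected = []
    for index, (_, last_frame) in enumerate(shots):
        probe = min(last_frame + n_after, len(frame_power) - 1)
        short_avg = trailing_average(frame_power, probe, short_len)
        long_avg = trailing_average(frame_power, probe, long_len)
        if short_avg - long_avg > threshold:
            selected.append(index)
    return selected
```

Because the long window follows the overall level of the broadcast, the same threshold can serve for a full stadium and a half-full one: only a rise relative to the recent average triggers a selection.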
The second selector 116 outputs the selected at least one of the shots to the summary generator 118. The summary generator 118 generates a summary to include the selected at least one of the shots (step 214) and outputs the summary via the output terminal 120 for display by, for example, a television or any other display means.
Although embodiments of the present invention have been illustrated in the accompanying drawings and described in the foregoing detailed description, it will be understood that the invention is not limited to the embodiments disclosed, but is capable of numerous modifications without departing from the scope of the invention as set out in the following claims.
'Means', as will be apparent to a person skilled in the art, are meant to include any hardware (such as separate or integrated circuits or electronic elements) or software (such as programs or parts of programs) which reproduce in operation or are designed to reproduce a specified function, be it solely or in conjunction with other functions, be it in isolation or in co-operation with other elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the apparatus claim enumerating several means, several of these means can be embodied by one and the same item of hardware. 'Computer program product' is to be understood to mean any software product stored on a computer-readable medium, such as a floppy disk, downloadable via a network, such as the Internet, or marketable in any other manner.

Claims

1. A method of generating a summary of an audio/visual data stream, said data stream comprising a plurality of consecutive frames having audio and visual properties, the method comprising the steps of: detecting (204) a plurality of shots of an audio/visual data stream; determining (206) a plurality of segments of said audio/visual data stream, each segment comprising a plurality of said shots of said data stream having similar visual properties; selecting (208) a segment of said determined plurality of segments; for each shot of said selected segment of said data stream, extracting (210) the audio in a plurality of consecutive frames which occur after the end of said shot; selecting (212) at least one of said shots based on the extracted audio; and generating (214) a summary to include said selected at least one of said shots.
2. A method according to claim 1, wherein the step of selecting (212) at least one of said shots based on the extracted audio comprises the step of: selecting at least one of said shots, wherein the extracted audio in a plurality of consecutive frames which occur after the end of said at least one of said shots exceeds a predetermined threshold.
3. A method according to claim 1, wherein the step of detecting (204) a plurality of shots of an audio/visual data stream comprises the steps of: comparing visual properties of each frame of said data stream with visual properties of a respective subsequent frame of said data stream; and detecting a plurality of shots, each shot comprising a plurality of consecutive frames for which compared visual properties are similar.
4. A method according to claim 1, wherein the step of determining (206) a plurality of segments of an audio/visual data stream comprises the steps of: comparing visual properties of each shot of said data stream; and determining a plurality of segments comprising a plurality of said shots for which compared visual properties are similar.
5. A method according to claim 1, wherein the step of selecting (208) a segment of said determined plurality of segments comprises the step of: selecting the longest segment of said determined plurality of segments.
6. A method according to claim 1, wherein the visual properties include the content of each of said shots and the method further comprises the step of: detecting and analysing the content of each of said shots.
7. A method according to claim 1, wherein the step of extracting (210) the audio in a plurality of consecutive frames which occur after the end of said shot comprises the step of: for each shot of said selected segment of said data stream, calculating the audio power of a plurality of consecutive frames which occur after the end of said shot for a predefined frequency band; and wherein the step of selecting (212) at least one of said shots based on the extracted audio comprises the step of: selecting at least one of said shots, wherein the audio power of said plurality of consecutive frames which occur after the end of said shot for said predefined frequency band exceeds a threshold.
8. A method according to claim 7, wherein the predefined frequency band is predefined as the whole of the frequency spectrum.
9. A method according to claim 7, wherein the predefined frequency band is predefined as a part of the frequency spectrum.
10. A method according to claim 1, wherein the step of extracting (210) the audio in a plurality of consecutive frames which occur after the end of said shot comprises the steps of: calculating a first moving average of audio power of said data stream over a first predetermined length of said data stream; calculating a second moving average of audio power of said data stream over a second predetermined length of said data stream; wherein said first predetermined length of said data stream is different from said second predetermined length of said data stream; and comparing said first and second moving averages.
11. A method according to claim 10, wherein the step of selecting (212) at least one of said shots comprises the step of: selecting each shot in which the difference between said first average and said second average exceeds a threshold.
12. A method according to claim 1, wherein said data stream is representative of a racquet sport and said selected segment corresponds to rallies.
13. A computer program product comprising a plurality of program code portions for carrying out the method according to any one of the preceding claims.
14. Apparatus (100) for generating a summary of an audio/visual data stream, said data stream comprising a plurality of consecutive frames having audio and visual properties, the apparatus comprising: a shot detector (110) for detecting a plurality of shots of an audio/visual data stream; a determining means (112) for determining a plurality of segments of said audio/visual data stream, each segment comprising a plurality of said shots of said data stream having similar visual properties; a first selector (113) for selecting a segment of said determined plurality of segments; an extractor (114) for extracting, for each shot of said selected segment of said data stream, the audio in a plurality of consecutive frames which occur after the end of said shot; a second selector (116) for selecting at least one of said shots based on the extracted audio; and a summary generator (118) for generating a summary to include said selected at least one of said shots.
PCT/IB2009/052318 2008-06-09 2009-06-02 Method and apparatus for generating a summary of an audio/visual data stream WO2009150567A2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2011512260A JP2011523291A (en) 2008-06-09 2009-06-02 Method and apparatus for generating a summary of an audio / visual data stream
EP09762094A EP2291844A2 (en) 2008-06-09 2009-06-02 Method and apparatus for generating a summary of an audio/visual data stream
US12/994,164 US8542983B2 (en) 2008-06-09 2009-06-02 Method and apparatus for generating a summary of an audio/visual data stream
CN2009801217253A CN102057433A (en) 2008-06-09 2009-06-02 Method and apparatus for generating a summary of an audio/visual data stream

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP08157834.6 2008-06-09
EP08157834 2008-06-09

Publications (2)

Publication Number Publication Date
WO2009150567A2 (en) 2009-12-17
WO2009150567A3 (en) 2010-02-04

Family

ID=41268285

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2009/052318 WO2009150567A2 (en) 2008-06-09 2009-06-02 Method and apparatus for generating a summary of an audio/visual data stream

Country Status (6)

Country Link
US (1) US8542983B2 (en)
EP (1) EP2291844A2 (en)
JP (1) JP2011523291A (en)
KR (1) KR20110023878A (en)
CN (1) CN102057433A (en)
WO (1) WO2009150567A2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104683933A (en) * 2013-11-29 2015-06-03 杜比实验室特许公司 Audio object extraction method
CN107077595A (en) 2014-09-08 2017-08-18 谷歌公司 Selection and presentation representative frame are for video preview
US9965685B2 (en) * 2015-06-12 2018-05-08 Google Llc Method and system for detecting an audio event for smart home devices
US10223613B2 (en) * 2016-05-31 2019-03-05 Microsoft Technology Licensing, Llc Machine intelligent predictive communication and control system

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5717818A (en) * 1992-08-18 1998-02-10 Hitachi, Ltd. Audio signal storing apparatus having a function for converting speech speed
US6480823B1 (en) * 1998-03-24 2002-11-12 Matsushita Electric Industrial Co., Ltd. Speech detection for noisy conditions
US6823303B1 (en) * 1998-08-24 2004-11-23 Conexant Systems, Inc. Speech encoder using voice activity detection in coding noise
US6453291B1 (en) * 1999-02-04 2002-09-17 Motorola, Inc. Apparatus and method for voice activity detection in a communication system
JP3789246B2 (en) * 1999-02-25 2006-06-21 株式会社リコー Speech segment detection device, speech segment detection method, speech recognition device, speech recognition method, and recording medium
US20040167767A1 (en) * 2003-02-25 2004-08-26 Ziyou Xiong Method and system for extracting sports highlights from audio signals
US20060059120A1 (en) * 2004-08-27 2006-03-16 Ziyou Xiong Identifying video highlights using audio-visual objects
KR20060116335A (en) 2005-05-09 2006-11-15 삼성전자주식회사 Apparatus and method for summaring moving-picture using events, and compter-readable storage storing compter program controlling the apparatus
JP2006324743A (en) * 2005-05-17 2006-11-30 Toshiba Corp Delimiter information setting method and apparatus for video signal utilizing silence part
JP4757876B2 (en) * 2005-09-30 2011-08-24 パイオニア株式会社 Digest creation device and program thereof
US7584428B2 (en) * 2006-02-09 2009-09-01 Mavs Lab. Inc. Apparatus and method for detecting highlights of media stream
US20070292112A1 (en) * 2006-06-15 2007-12-20 Lee Shih-Hung Searching method of searching highlight in film of tennis game
JP4810335B2 (en) * 2006-07-06 2011-11-09 株式会社東芝 Wideband audio signal encoding apparatus and wideband audio signal decoding apparatus
JP4872871B2 (en) * 2007-09-27 2012-02-08 ソニー株式会社 Sound source direction detecting device, sound source direction detecting method, and sound source direction detecting camera
KR101437830B1 (en) * 2007-11-13 2014-11-03 삼성전자주식회사 Method and apparatus for detecting voice activity
US8311390B2 (en) * 2008-05-14 2012-11-13 Digitalsmiths, Inc. Systems and methods for identifying pre-inserted and/or potential advertisement breaks in a video sequence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None

Also Published As

Publication number Publication date
JP2011523291A (en) 2011-08-04
KR20110023878A (en) 2011-03-08
WO2009150567A3 (en) 2010-02-04
US20110075993A1 (en) 2011-03-31
US8542983B2 (en) 2013-09-24
EP2291844A2 (en) 2011-03-09
CN102057433A (en) 2011-05-11

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200980121725.3

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09762094

Country of ref document: EP

Kind code of ref document: A2

REEP Request for entry into the european phase

Ref document number: 2009762094

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2009762094

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 12994164

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2011512260

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 8399/CHENP/2010

Country of ref document: IN

ENP Entry into the national phase

Ref document number: 20117000276

Country of ref document: KR

Kind code of ref document: A