WO2022189359A1 - Method and device for generating an audio-video abstract - Google Patents

Method and device for generating an audio-video abstract

Info

Publication number
WO2022189359A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
video
parts
abstract
keywords
Prior art date
Application number
PCT/EP2022/055755
Other languages
French (fr)
Inventor
Gwenaelle Marquant
Thomas Morin
Serge Defrance
Original Assignee
Interdigital Ce Patent Holdings, Sas
Priority date
Filing date
Publication date
Application filed by Interdigital Ce Patent Holdings, Sas filed Critical Interdigital Ce Patent Holdings, Sas
Publication of WO2022189359A1 publication Critical patent/WO2022189359A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866Management of end-user data
    • H04N21/25891Management of end-user data being end-user preferences

Definitions

  • the present disclosure generally relates to applications for audio-video. At least one embodiment relates to the generation of audio-video abstracts.
  • Audio-video abstracts typically represent short summaries of audio-video content, such as, for example, a sporting event, a news program or a movie.
  • a viewer may want to only view the audio-video abstract (e.g., via a video replay service, via a video-on-demand service) rather than watching the entire audio-video content.
  • the audio-video abstracts are generally focused on a global audience and are directed towards specific topics that are considered to be of common interest.
  • Such audio-video abstracts may not be of interest to an individual viewer. For example, if the individual viewer is interested in recent football matches wherein a specific player appears or in movies starring a specific actor, many audio-video abstracts are available to the viewer. However, if, for example, a football fan is only interested in free kicks by the opposing team and/or saves by the keeper, few if any audio-video abstracts may be found.
  • the disclosure is directed to generating audio-video abstracts.
  • the described embodiments may be implemented on devices, such as, for example, mobile phones, tablets, set top boxes and digital televisions.
  • Some methods or processes implemented by elements of the disclosure may be computer implemented. Accordingly, such elements may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as “circuit”, “module” or “system”. Furthermore, such elements may take the form of a computer program product embodied in any tangible medium of expression having computer useable program code embodied in the medium.
  • a tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid-state memory device and the like.
  • a transient carrier medium may include a signal such as an electrical signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g., a microwave or RF signal.
  • FIG. 1 is an exemplary system for generating audio-video abstracts
  • FIG. 2 is a flowchart of an embodiment of a method for generating an audio-video abstract
  • FIG. 3 is audio-video content as a function of time
  • FIG. 4 is the audio-video content of FIG. 3, further including segments indicative of team colors for the team performing an action in a corresponding audio-video segment;
  • FIG. 5 is the segmented audio-video content of FIG. 4 further including audio level plotted as a function of time and associated with an action taking place in the corresponding segment;
  • FIG. 6 is selection of segments for generating an abstract according to an embodiment;
  • FIG. 7 is a user interface according to an embodiment of the disclosure.
  • FIG. 8 is an embodiment of a user interface in which three (3) categories are selected and a maximum abstract duration for the audio-video abstract is indicated;
  • FIG. 9 is another embodiment depicting a graphical user interface which may be used to generate audio-video abstracts according to the disclosure;
  • FIG. 10 is a list of selections available to a user of the graphical user interface shown in FIG. 9;
  • FIG. 11 is a graphical user interface shown in FIG. 10 after a user selection is made
  • FIG. 12 is the graphical user interface shown in FIG. 10 after user selections are made.
  • FIG. 13 is a flow chart of an embodiment of a method according to the present disclosure.
  • the term ‘video’ or ‘audio/video’ or ‘audio-video’ is used to indicate a video accompanied or not by associated audio.
  • the term ‘video abstract’ is used to indicate a video (possibly including audio) that is constituted of excerpts (parts, segments) of a longer (e.g., full size) video (that may include audio).
  • a video abstract may include audio, e.g., audio that is associated with the video excerpts.
  • a video abstract is for example a (personalized) summary of a longer (e.g., full size) video sequence, or audio/video sequence.
  • FIG. 1 is a block diagram of a system 100 in which various aspects of the embodiments may be implemented.
  • the system 100 may include front-end 110, back-end 120 and a network 130.
  • Back-end 120 includes at least one processor 124 and an audio/video encoder 128.
  • Processor 124 may execute instructions configured to process an audio-video sequence, comprising associated audio or not, into excerpts (parts, segments) and extract metadata therefrom.
  • the excerpts may be based on scene cuts, scene change detection, etc., as will be further discussed below.
  • the audio/video encoder 128 converts the audio/video sequence into a compressed format (e.g., MPEG2, MPEG4, MKV) suitable for being transmitted (e.g., broadcast, multicast, unicast) according to a transport protocol (e.g., MPEG2 Transport Stream).
  • the extracted metadata may include, for example, keywords, timestamps, video frame numbers.
  • the processor 124 may include embedded memory (not shown), an input-output interface (not shown), and various other circuitries as known in the art. Program code may be loaded into processor 124 to perform the various processes described hereinbelow.
  • Back-end 120 may include at least one memory (e.g., a volatile memory, a non-volatile memory) configured to store program code to be loaded into the processor 124 for subsequent execution.
  • Back-end 120 may include a storage device (not shown), which may include non-volatile memory, including but not limited to EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive, e.g., for storage of audio/video sequences and metadata.
  • the storage device may comprise an internal storage device, an attached storage device, and/or a network-accessible storage device, as non-limiting examples.
  • Back-end 120 may be coupled to network 130. Encoded audio/video is transmitted to front-end 110 via network 130.
  • Front-end 110 includes an audio/video decoder 112, at least one processor 116 and possibly a user interface/display 118.
  • the at least one processor 116 may be configured for storing and processing audio/video.
  • Device 110 may include memory (not shown) that comprises program instructions that are executable by the at least one processor 116 and that are configured to implement embodiments disclosed herein.
  • the audio/video decoder 112 converts the encoded audio/video sequences received from device 120 via network 130 to a decompressed format.
  • the decoded audio/video is provided to the at least one processor 116.
  • the at least one processor 116 may include a graphics processing unit or GPU.
  • the at least one processor 116 may be configured to implement the further described embodiments for generating video abstracts and may therefore execute a process that implements methods according to described embodiments, further referred to as an audio-video abstract generation engine.
  • the presently disclosed methods and devices may be used to generate an audio-video abstract from a longer (e.g., a full-size or full-length) audio/video content, based on preference(s) (e.g., user selectable preference(s)) or (pre-) configuration or user (viewing behavior) profiles.
  • Device 120 may transmit audio/video sequences to device 110 while another device (not shown) may transmit metadata related to the audio/video sequences to device 110.
  • device 120 is a video-on-demand server, or a video broadcast server, while the above mentioned another device may be a dedicated metadata server.
  • FIG. 2 is a flowchart of a particular embodiment of a method 200 for generating an audio-video abstract.
  • Embodiments of the disclosed methods may be carried out by front-end device 110 (FIG. 1).
  • audio-video content is provided to the front-end device 110.
  • the audio-video content that is provided is mapped to timeslots (to a timestamp, to multiple timestamps, for example a start timestamp and an end timestamp, wherein a timestamp points to an audio-video frame) relative to a timeline of the audio-video and is categorized with keywords.
  • the keywords with the associated timestamps referring to the original (full size/length) audio-video are further referred to as (extracted) metadata.
  • a segment may for example be a scene consisting of a group of contiguous video (camera) shots that may be, for example, semantically consistent and that may form a semantic unit.
  • audio-video content may be segmented based on various phases of play during the football match, such as, goal scoring, slow motion replays, referee whistle blows and high audio level.
  • audio-video content may for example be segmented based on chaptering the program, detection of change of subject (e.g., determined through use of artificial intelligence).
  • Segments may overlap.
  • a segment from 0 to 10 minutes may be semantically labeled with metadata: “Team A attack” from 0 to 5 minutes + “player A” from 3 to 8 minutes + “player B” from 6 to 10 minutes + “slow motion replay” from 9 to 10 minutes + “goal” at 10 minutes relative to a timeline of the original (full, longer) content.
  • Metadata associated with a segment may include keywords and timestamps, and audio-video frame numbers. Keywords may be associated with a segment based on the content of the video, audio or both video and audio, relative to the segment. Deep learning methods (artificial intelligence) are for example able to perform such association.
  • FIG. 3 is an audio-video sequence shown on a time line indicating metadata that includes keywords (‘FK’ for ‘Free Kick’ (e.g., 305), ‘S’ for ‘Shot on Goal’ (e.g., 320), ‘G’ for ‘Goal’ (e.g., 310), ‘C’ for ‘Corner Shot’ (e.g., 315), ‘Red’ for ‘Red Card’ (e.g., 325)) that point to segments (see FIG. 4) of the audio-video sequence on the timeline.
  • a segment may be associated with a keyword pertinent to an action of the football match, such as, for example, a free kick (FK) 305, a goal (G) 310, a corner kick (C) 315, a shot on goal (S) 320 and red card (Red) 325.
  • keywords pertinent to an action may be for example ‘fight’, ‘dance’ or ‘sing’.
  • keywords, e.g., for a sporting match, may include the name and/or colors of teams playing the match, or of individual players; e.g., for a movie, associated keywords (i.e., metadata) may include the name of actors.
  • in FIG. 4, a graph 200 shows audio-video segments for the football match previously illustrated in FIG. 3, plotted as a function of the match time (t).
  • the segments shown on the timeline of figure 4 may be blue 410 or green 420 colored (here depicted as white rectangles (representing the blue color) and forward hatched rectangles (representing the green color) respectively).
  • the blue 410 or green 420 colored segments are plotted along a time axis (a time line) representing the football match video time (0 is start of the match; an end-time may be indicated (not shown) indicating the end time of the match), wherein the color (hatching) of the segments (Fig. 4) may be indicative of the color of the team (e.g., blue team, green team) member performing the action of the football match associated with the corresponding audio-video segment.
  • audio level or excitement level may be part of the metadata relative to a segment.
  • the audio level of the commentator and/or of the crowd or public may be extracted from the audio-video and may be associated with an action of the football match that corresponds to an audio-video segment.
  • the audio level may be representative of an excitement level (e.g., of the commentator, of the crowd or public attending the football match).
  • 500 depicts extracted audio-video segments for the football match previously illustrated in FIG. 4, shown as a function of the match time (t), further including crowd audio level 510 and commentator audio level 515.
  • the crowd audio level 510 and commentator audio level 515 are associated with an action of the football match that corresponds to an extracted audio-video segment.
  • Score information may be used for segmentation, and score information may be part of metadata of a segment.
  • in the case of a news program, keywords (for example relating to a subject discussed) may be extracted, for example using speech recognition methods, be used to split segments and be added to metadata relative to a segment.
  • the ‘extraction’ may be a purely logical operation, consisting of storing pointers (e.g., video frame numbers and/or timestamps) to an audio-video sequence in metadata, in order to be able to, at a later time, generate an audio-video abstract.
  • pointers may define a segment by a start pointer (pointing to the start of the segment) and an end pointer (pointing to the end of the segment).
  • a segment start and/or end pointer may be associated with a keyword or a set of keywords (a keyword string or keyword phrase), e.g., a keyword (keyword string) indicating an action (e.g., ‘attack of player B of the blue team’).
  • one or more of the audio-video segments provided in step 210 are identified for the audio-video abstract.
  • One or more audio-video segments are selected based on at least one keyword associated with the audio-video content, by matching the keywords associated with the audio-video content to a set of keywords that are, e.g., selected by a user using a user interface, or that are stored in a user profile or (user) configuration that is, for example, constructed from information entered by a user using a graphical user interface or from information obtained by monitoring user viewing behavior.
  • FIG. 6 is a user interface which may be used to implement embodiments of the disclosure.
  • the user interface may include a time line 610 that contains a representation of the extracted audio-video segments of the audio-video content 620 on the time line depicted there above.
  • the extracted audio-video segments, including keywords and further metadata, that are shown above time line 610 are those previously shown in FIG. 5.
  • the segments that are selected on the time line 610 correspond for example to audio-video segments 630 a viewer may be interested in based on the keywords (e.g., ‘FK’, ‘S’, ‘C’, ‘G’, and ‘Red’) and further metadata that is shown above the time line 610.
  • the time line 610 is identified with the legend “Selection”.
  • the selected segments may be relative to the “blue” team and to that team’s actions: free kicks (FK), corner kicks (C), goals (G) and shots on goal (S).
  • FIG. 7 is another user interface 700 which may be used to implement embodiments of the disclosure.
  • the graphical user interface (GUI) 700 may be used to generate a (user) profile or (user) configuration with the purpose of generating an audio-video abstract based on the (user) profile/configuration.
  • a graphical user interface is populated with audio-video segments 710 that are selected according to their mapping to the keywords and other metadata (e.g., as preconfigured, or configured by a user/viewer).
  • the audio-video segments 710 may be identified as Category 1 to Category N.
  • a category may be based on a (set of) keyword(s) and a (set of) keyword(s) may be mapped to one or more segments.
  • a maximum duration 720 may be specified for the to-be-generated audio-video abstract.
  • a timeline 730 of the original audio-video may be displayed along with a duration counter 740 corresponding to the duration of the selected segments for the audio-video abstract.
  • the abstract duration may be presented as a time or as a percentage of the original audio/video duration.
  • the timeline 730 indicates where, chronologically, the selected audio-video segments, selected according to the selected metadata category, appear (no segments are shown on timeline 730, but the presentation may be similar to that of figure 4 or 6 (timeline 630)).
  • the segments may be color-coded based on the different categories selected; see FIG. 8.
  • three (3) categories have been selected 815 (Category 1, Category k, Category N) and a maximum duration 820 (5 min, 30 sec) for the to-be-generated audio-video abstract is indicated.
  • when a category is selected (e.g., using checkbox 815), a color (e.g., 805a (e.g., green for Category 1), 805b (e.g., yellow for Category k), 805c (e.g., purple for Category N)) is assigned to the category.
  • in figure 8, colors are represented by the use of different types of hatching.
  • Each selected category 815 may have a different color.
  • the associated color-coded category segments 825 is/are indicated on timeline 730 corresponding to the full (original, not abstracted) content and the abstract duration counter 740 indicates the totaled duration of all selected segments.
  • Category k (yellow) maps to three audio-video segments (the second, the third and the next-to-last segment, following the timeline from left to right), as indicated by arrows. The arrows are shown for explanative purposes and may, or may not, be included in the user interface.
  • An audio-video abstract corresponding to the selections may be generated by selecting the “Build my abstract” button 850.
  • FIG. 9 is a further embodiment depicting a graphical user interface which may be used to generate audio-video abstracts according to the disclosure.
  • a graphical user interface (GUI) for a video replay store is shown, such as, for example, a video replay store that enables replay of recent rugby matches.
  • two options are of particular interest: replay option 910 and audio-video abstract generation option 920.
  • the replay option 910 replays the sporting event in its entirety.
  • the audio-video abstract option 920 opens a new window (see FIG. 10) where a viewer may make selections for generating his/her audio-video abstract of the selected sporting event.
  • a selection menu 1010 displays a list of choices/preferences 1005a-b.
  • two columns (left column, right column) of example selection choices/preferences are shown that are specific for two-team sports events such as a football match, a tennis match, and a rugby match; i.e., a left column 1005a for selection of actions of a blue team, and a right column 1005b for selections of a white team (‘blue’ and ‘white’ may refer here to colors of players’ T-shirts).
  • the option “Only blue team” (left column, top) selects, for the generation of the audio-video abstract, only actions of the blue team.
  • a similar selectable option “Only white team” (right column, top) enables selection of only actions of the white team.
  • Further selectable options are ‘Try’ (Rugby terminology - indicating a way of scoring points), ‘Scrum’ (Rugby terminology - a method of restarting play in rugby football that involves players packing closely together with their heads down), ‘Crucial phase of play’ (an important moment in the sports event).
  • a crucial phase of play in an audio-video of a sports event may for example be detected in the audio-video (e.g., by the audio-video abstract generation engine) and then added to the metadata, based on any of the following:
  • a relatively high audio level of the commentator’s voice (i.e., relatively high compared to the mean audio level of the commentator’s voice);
  • a relatively high pitch of the commentator’s voice (i.e., relatively high compared to a mean pitch of the commentator’s voice);
  • a relatively high speed of the commentator’s speech (fast talking) (i.e., relatively high compared to a mean speed of the commentator’s speech);
  • alternatively, a crucial phase of play may be added by hand to the metadata associated with the audio-video (e.g., by a video editor).
  • Still further options that can be selected may be ‘red/yellow cards’ (the referee gives a red or yellow card to a player), the selection of slow motion sequence(s), or selection of actions happening in the 22-meter line (specific to rugby), or following actions of a specific player (selectable through 1025), e.g., of player 9 (indicated by #9 in the figure, where ‘#’ stands for ‘number’) of the blue team, and of players 9 and 6 of the white team, or only events related to the first half of the match (selectable option 1035a), or only events related to the second half of the match (selectable option 1035b).
  • Metadata comprise keywords/key phrases with associated timestamps referring to the timeline of the original (full) audio-video.
  • the metadata may be generated automatically or manually (e.g., using a video editor) during, or after, the sports event, e.g., in a step 210 (FIG. 2).
  • a maximum duration 1020 may be selected for the audio-video abstract. If a maximum duration of the audio-video abstract is specified, the audio-video abstract generation engine will keep the generated audio-video abstract within (certain limits of) the specified duration.
  • FIG. 11 shows an embodiment of a UI that may be shown when selections have been made (e.g., using the UI shown in FIG. 10).
  • a timeline 1130 is displayed which provides visual feedback.
  • one selection is made, shown by a black circle selecting the ‘try’ option in the left selection column.
  • the corresponding selected segments 1105 (which are generated by the audio-video abstract generation engine based on the selected options and the metadata associated with the original (full) audio-video of the sports event) are depicted on timeline 1130. Additionally, the total time duration 1140 for the selected segments is displayed.
  • the timeline 1130 enables the viewer to visualize the selected segments as well as the total duration 1140 of his/her personal audio-video abstract to be generated based on the selected option(s).
  • FIG. 12 shows the UI of figure 11 when multiple selections 1205 are made.
  • the corresponding selected audio-video segments 1210, 1215 are depicted on timeline 1230a.
  • segments 1210 correspond to actions of the “blue” team and are depicted on timeline 1230a with a “blue” color (shown here by hatched rectangles).
  • the segments 1215 correspond to the “white” team and are depicted on timeline 1230a with a “white” color (shown here by non-hatched rectangles).
  • the “white” and “blue” designation for the segments provides a quick visualization and makes it easy to distinguish the selected segments for the two teams of the sporting event.
  • Timeline 1230b shows the timeline of the audio-video abstract that will be generated from the selected audio-video segments when the ‘build’ button is activated. It can be seen that the non-selected parts of the full (original) audio-video sequence of the sports event are not selected for the generation of the audio-video abstract and only selected segments of the original (full) audio-video sequence of the sports event contribute to (are selected for, are included in) the audio-video abstract.
  • the arrows indicate which segment on timeline 1230a contributes to which segment of the audio-video abstract shown on timeline 1230b.
  • the segments in the audio-video abstract are separated by an introductive audio-video sequence (for example, a set of audio-video frames that introduces the content of the segment that follows, e.g., ‘Blue Team: try action’, ‘White team: red card for player number 9’), the introductive audio-video sequences being added (inlayed) to the audio-video abstract when being generated, e.g., as prepended audio-video frames.
  • the audio-video segments that are selected from the original audio-video sequence according to the selection(s) 1205 are put in an order different from that shown on timeline 1230b when the audio-video abstract is generated; for example, segments related to actions of the blue team may be regrouped and may be followed by similarly regrouped actions of the white team.
  • the personalized audio-video abstract may be generated (e.g., by the audio-video abstract generation engine).
  • the abstract generation may be triggered by selection of the “Build” abstract button 850, 1050 (FIG.8, FIGs. 10-12).
  • in step 230, the audio-video abstract is built (constructed, generated) based on the original (full) audio-video sequence associated with metadata and the selected options (e.g., as selected by a user using the UIs shown in FIGs. 7-12).
  • an enhanced audio-video abstract can also be generated by optionally adding information inlays between segments, such as, for example, video frame inlays including time, score, or state of play.
  • a user may edit the selected audio-video segments and modify them; for example, using timeline 1230b, a user may select a segment, remove the thus selected segment, or move the thus selected segment to a different location on timeline 1230b, so that the segment will be located at a different time in the generated audio-video abstract.
  • the audio-video abstracts may be generated fully automatically without viewer/user intervention, i.e., without user selections via graphical user interfaces as shown in the figures.
  • the device implementing disclosed embodiment(s) may generate audio-video abstracts based on a configuration accessible to the device, i.e., stored in the device itself, or in a profile (e.g., a user profile or a device profile) stored in a network (cloud).
  • the profile may therefore contain a set of keywords that are similar to those selected by a user via the user interfaces shown in figures, and that are then parsed by the audio-video abstract generation engine to create an audio-video abstract of an audio-video sequence.
  • the keywords may be selected for inclusion in the profile according to user actions; for example, when a user often watches video sequences of football matches concerning the ‘blue’ team, and skips through the video sequence using forward and reverse play actions to mainly view goals of the blue team, the profile may be filled with keywords based on the user actions observed while watching sports events related to the blue team (an illustrative sketch of one way such a profile could be derived is given at the end of this list).
  • the profile may be adjusted according to multiple viewings of audio-video sequences so that the profile best corresponds to what seems to be of particular interest to the user based on his/her viewing behavior.
  • the profile may include information similar to the information (selections) made by a user via the user interfaces presented in the appended figures, including, for example, a maximum abstract duration, that may be preconfigured by the user, or that may be determined from viewer behavior monitoring.
  • FIG. 13 shows an embodiment of a method 1300 according to the present disclosure.
  • audio-video content and associated metadata are received.
  • the metadata is mapped to (contains pointers to) parts (segments) of the audio-video content and includes first keywords (or a set of keywords or key phrases) categorizing the parts (the segments).
  • one or more parts (segments) of the audio-video content are selected based on matching of at least one of first keywords in the metadata to at least one of second keywords stored in the configuration information.
  • an audio-video abstract is generated, based on the selected one or more parts of the audio-video content.
  • the present disclosure thus relates to a method implemented by a device, the method comprising: receiving audio-video content and associated metadata mapped to parts of the audio-video content and including first keywords categorizing the parts; selecting one or more parts of the audio-video content based on matching at least one of the first keywords to at least one of second keywords stored in configuration information; and generating an audio-video abstract based on the selected one or more parts.
  • the method further comprises, as part of the generating the audio-video abstract, prepending the selected one or more parts with audio-video frames comprising a description of the selected one or more parts, based on the matching at least one of the first and second keywords.
  • At least part of the configuration information is obtained via a user interface.
  • At least part of the configuration information is obtained by monitoring viewer behavior.
  • the method comprises receiving a maximum desired duration for the audio-video abstract from the user interface and storing the received maximum desired duration in the configuration information.
  • the method comprises determining a maximum desired duration for the audio-video abstract based on monitoring viewer behavior and storing the determined maximum desired duration in the configuration information.
  • the present disclosure also relates to a device comprising at least one processor, configured to:
  • the at least one processor is further configured to, as part of the generating the audio-video abstract, prepend the selected one or more parts with audio-video frames comprising a description of the selected one or more parts, based on the matching at least one of the first and second keywords.
  • the at least one processor is further configured to obtain at least part of the configuration information via a user interface. According to an embodiment, the at least one processor is further configured to obtain at least part of the configuration information by monitoring viewer behavior.
  • the at least one processor is further configured to receive a maximum desired duration for the audio-video abstract from the user interface and to store the received maximum desired duration in the configuration information.
  • the at least one processor is further configured to determine a maximum desired duration for the audio-video abstract based on monitoring, by the at least one processor, viewer behavior, and to store the determined maximum desired duration in the configuration information.
  • the present disclosure also relates to a computer program product comprising instructions which, when executed, cause a processor to implement any one of the described embodiments of the disclosed method.
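
As a minimal illustrative sketch of how the second keywords of such a profile could be derived from monitored viewing behavior (referred to above); the event format, function name and top-N policy are assumptions and are not part of the disclosure:

```python
from collections import Counter

def derive_profile_keywords(watch_events, top_n=10):
    """Derive the 'second keywords' of a user profile from monitored viewing
    behavior. Each event is assumed to carry the metadata keywords of a
    segment the viewer actually watched or replayed (segments skipped with
    forward play are simply not reported)."""
    counts = Counter()
    for event in watch_events:
        counts.update(kw.lower() for kw in event["keywords"])
    return [kw for kw, _ in counts.most_common(top_n)]

# Example: repeated viewing of blue-team goals could yield a profile such as
# ["goal", "blue team"], later matched against the first keywords in metadata.
```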

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Graphics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A method and device for generating audio-video abstracts is disclosed. Audio-video content is provided. The audio-video content includes metadata mapped to segments or parts of the audio-video content, and the segments or parts of the audio-video content are categorized with keywords in the metadata. One or more audio-video segments or parts of the audio-video content are selected based on matching at least one first keyword included in the associated metadata to at least one second keyword included in configuration information. An audio-video abstract is generated based on the one or more selected audio-video segments or parts.

Description

METHOD AND DEVICE FOR GENERATING AN AUDIO-VIDEO ABSTRACT
TECHNICAL FIELD
The present disclosure generally relates to applications for audio-video. At least one embodiment relates to the generation of audio-video abstracts.
BACKGROUND
Audio-video abstracts typically represent short summaries of audio-video content, such as, for example, a sporting event, a news program or a movie. A viewer may want to only view the audio-video abstract (e.g., via a video replay service, via a video-on-demand service) rather than watching the entire audio-video content.
The audio-video abstracts are generally focused on a global audience and are directed towards specific topics that are considered to be of common interest. Unfortunately, such audio-video abstracts may not be of interest to an individual viewer. For example, if the individual viewer is interested in recent football matches wherein a specific player appears or in movies starring a specific actor, many audio-video abstracts are available to the viewer. However, if, for example, a football fan is only interested in free kicks by the opposing team and/or saves by the keeper, few if any audio-video abstracts may be found.
In this regard, the viewer is unable to easily obtain access to audio-video abstracts that match his/her individual taste or preferences. The embodiments herein have been devised with the foregoing in mind.
SUMMARY
The disclosure is directed to generating audio-video abstracts. The described embodiments may be implemented on devices, such as, for example, mobile phones, tablets, set top boxes and digital televisions.
According to a first aspect of the disclosure, there are disclosed methods according to the appended claims. According to a second aspect of the disclosure, there is disclosed a device according to the appended claims.
Some methods or processes implemented by elements of the disclosure may be computer implemented. Accordingly, such elements may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as “circuit”, “module” or “system”. Furthermore, such elements may take the form of a computer program product embodied in any tangible medium of expression having computer useable program code embodied in the medium.
Since elements of the disclosure can be implemented in software, the present disclosure can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid-state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g., a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Other features and advantages of embodiments shall appear from the following description, given by way of indicative and non-exhaustive examples and from the appended drawings, of which:
FIG. 1 is an exemplary system for generating audio-video abstracts;
FIG. 2 is a flowchart of an embodiment of a method for generating an audio-video abstract;
FIG. 3 is audio-video content as a function of time;
FIG. 4 is the audio-video content of FIG. 3, further including segments indicative of team colors for the team performing an action in a corresponding audio-video segment;
FIG. 5 is the segmented audio-video content of FIG. 4 further including audio level plotted as a function of time and associated with an action taking place in the corresponding segment;
FIG. 6 is selection of segments for generating an abstract according to an embodiment;
FIG. 7 is a user interface according to an embodiment of the disclosure;
FIG. 8 is an embodiment of a user interface in which three (3) categories are selected and a maximum abstract duration for the audio-video abstract is indicated;
FIG. 9 is another embodiment depicting a graphical user interface which may be used to generate audio-video abstracts according to the disclosure;
FIG. 10 is a list of selections available to a user of the graphical user interface shown in FIG. 9;
FIG. 11 is a graphical user interface shown in FIG. 10 after a user selection is made;
FIG. 12 is the graphical user interface shown in FIG. 10 after user selections are made; and
FIG. 13 is a flow chart of an embodiment of a method according to the present disclosure.
DETAILED DESCRIPTION
In the following, the term ‘video’ or ‘audio/video’ or ‘audio-video’ is used to indicate a video accompanied or not by associated audio. In the following, the term ‘video abstract’ is used to indicate a video (possibly including audio) that is constituted of excerpts (parts, segments) of a longer (e.g., full size) video (that may include audio). A video abstract may include audio, e.g., audio that is associated with the video excerpts. A video abstract is for example a (personalized) summary of a longer (e.g., full size) video sequence, or audio/video sequence.
FIG. 1 is a block diagram of a system 100 in which various aspects of the embodiments may be implemented. The system 100 may include front-end 110, back-end 120 and a network 130.
Back-end 120 includes at least one processor 124 and an audio/video encoder 128. Processor 124 may execute instructions configured to process an audio-video sequence, comprising associated audio or not, into excerpts (parts, segments) and extract metadata therefrom. For example, the excerpts may be based on scene cuts, scene change detection, etc., as will be further discussed below. The audio/video encoder 128 converts the audio/video sequence into a compressed format (e.g., MPEG2, MPEG4, MKV) suitable for being transmitted (e.g., broadcast, multicast, unicast) according to a transport protocol (e.g., MPEG2 Transport Stream). The extracted metadata may include, for example, keywords, timestamps, video frame numbers.
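
As an illustration only (not part of the claimed subject matter), the compression step performed by an encoder such as audio/video encoder 128 could be realized with an external tool; the sketch below assumes the ffmpeg command-line tool is available, and the file names and bitrate are hypothetical.

```python
import subprocess

def encode_to_transport_stream(src_path: str, dst_path: str) -> None:
    """Hypothetical stand-in for an audio/video encoder: compress an
    audio/video sequence and wrap it in an MPEG-2 Transport Stream."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", src_path,        # input audio/video sequence
            "-c:v", "mpeg2video",  # MPEG-2 video compression
            "-b:v", "8M",          # example video bitrate
            "-c:a", "mp2",         # MPEG audio layer II
            "-f", "mpegts",        # MPEG-2 Transport Stream container
            dst_path,
        ],
        check=True,
    )

# Example (hypothetical file names):
# encode_to_transport_stream("match.mp4", "match.ts")
```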
The processor 124 may include embedded memory (not shown), an input-output interface (not shown), and various other circuitries as known in the art. Program code may be loaded into processor 124 to perform the various processes described hereinbelow.
Back-end 120 may include at least one memory (e.g., a volatile memory, a non-volatile memory) configured to store program code to be loaded into the processor 124 for subsequent execution. Back-end 120 may include a storage device (not shown), which may include non-volatile memory, including but not limited to EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive, e.g., for storage of audio/video sequences and metadata. The storage device may comprise an internal storage device, an attached storage device, and/or a network-accessible storage device, as non-limiting examples.
Back-end 120 may be coupled to network 130. Encoded audio/video is transmitted to front-end 110 via network 130.
Front-end 110 includes an audio/video decoder 112, at least one processor 116 and possibly a user interface/display 118. The at least one processor 116 may be configured for storing and processing audio/video. Device 110 may include memory (not shown) that comprises program instructions that are executable by the at least one processor 116 and that are configured to implement embodiments disclosed herein.
The audio/video decoder 112 converts the encoded audio/video sequences received from device 120 via network 130 to a decompressed format. The decoded audio/video is provided to the at least one processor 116. The at least one processor 116 may include a graphics processing unit or GPU. The at least one processor 116 may be configured to implement the further described embodiments for generating video abstracts and may therefore execute a process that implements methods according to described embodiments, further referred to as an audio-video abstract generation engine. For example, the presently disclosed methods and devices may be used to generate an audio-video abstract from a longer (e.g., a full-size or full-length) audio/video content, based on preference(s) (e.g., user selectable preference(s)) or (pre-)configuration or user (viewing behavior) profiles. Device 120 may transmit audio/video sequences to device 110 while another device (not shown) may transmit metadata related to the audio/video sequences to device 110. For example, device 120 is a video-on-demand server, or a video broadcast server, while the above-mentioned other device may be a dedicated metadata server.
FIG. 2 is a flowchart of a particular embodiment of a method 200 for generating an audio-video abstract.
Embodiments of the disclosed methods may be carried out by front-end device 110 (FIG. 1). In step 210, audio-video content is provided to the front-end device 110. The audio-video content that is provided is mapped to timeslots (to a timestamp, to multiple timestamps, for example a start timestamp and an end timestamp, wherein a timestamp points to an audio-video frame) relative to a timeline of the audio-video and is categorized with keywords. The keywords with the associated timestamps referring to the original (full size/length) audio-video are further referred to as (extracted) metadata.
Several solutions may be used to map the audio-video content to timeslots and to split (to partition, to segment) the audio-video content into (extracted) segments, parts, or excerpts corresponding to the mapped timeslots. Solutions for mapping the audio-video content to timeslots and for splitting (segmenting, partitioning) the audio-video content may be based, for example, on detection of scene cuts, scene change detection, etc. (further possibilities for detecting where an audio-video may be split to create a new segment will be described further on). A segment may for example be a scene consisting of a group of contiguous video (camera) shots that may be, for example, semantically consistent and that may form a semantic unit. As a non-limiting example, for a football match, audio-video content may be segmented based on various phases of play during the football match, such as goal scoring, slow motion replays, referee whistle blows and high audio level. For a news program, audio-video content may for example be segmented based on chaptering the program or detection of a change of subject (e.g., determined through use of artificial intelligence).
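
Purely as an illustrative sketch of one such cue (and not the disclosed segmentation method itself), a naive scene-cut test could flag a candidate segment boundary wherever consecutive decoded frames differ strongly; the function name, the threshold value and the assumption that frames are available as NumPy arrays are all hypothetical.

```python
import numpy as np

def naive_scene_cuts(frames, threshold=30.0):
    """Toy scene-cut test: flag frame i as a potential segment boundary when
    the mean absolute pixel difference to frame i-1 exceeds `threshold`.
    `frames` is assumed to be a list of decoded frames as NumPy arrays."""
    cuts = []
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32) - frames[i - 1].astype(np.float32)).mean()
        if diff > threshold:
            cuts.append(i)
    return cuts
```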
Segments may overlap. As a non-limiting example, a segment from 0 to 10 minutes may be semantically labeled with metadata: “Team A attack” from 0 to 5 minutes + “player A” from 3 to 8 minutes + “player B” from 6 to 10 minutes + “slow motion replay” from 9 to 10 minutes + “goal” at 10 minutes relative to a timeline of the original (full, longer) content.
When, or while, the audio-video content is segmented, metadata is created and associated with the segments. Metadata associated with a segment may include keywords and timestamps, and audio-video frame numbers. Keywords may be associated with a segment based on the content of the video, audio or both video and audio, relative to the segment. Deep learning methods (artificial intelligence) are for example able to perform such association.
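
As a minimal illustration (field and variable names are assumptions, not taken from the disclosure), such metadata could be held as simple records of keywords with start/end timestamps and frame pointers; the example below encodes the overlapping labels of the 0-to-10-minute example above, assuming a 25 fps frame rate.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SegmentMetadata:
    keywords: List[str]  # e.g. ["Team A attack"] or ["goal"]
    start_s: int         # start timestamp, in seconds on the original timeline
    end_s: int           # end timestamp, in seconds on the original timeline
    start_frame: int     # pointer to the first audio-video frame of the part
    end_frame: int       # pointer to the last audio-video frame of the part

FPS = 25  # assumed frame rate, only for the illustrative frame numbers below

def seg(keywords, start_min, end_min):
    return SegmentMetadata(keywords, start_min * 60, end_min * 60,
                           start_min * 60 * FPS, end_min * 60 * FPS)

# Overlapping semantic labels of the "0 to 10 minutes" example above:
metadata = [
    seg(["Team A attack"], 0, 5),
    seg(["player A"], 3, 8),
    seg(["player B"], 6, 10),
    seg(["slow motion replay"], 9, 10),
    seg(["goal"], 10, 10),
]
```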
FIG. 3 is an audio-video sequence shown on a time line indicating metadata that includes keywords (‘FK’ for ‘Free Kick’ (e.g., 305), ‘S’ for ‘Shot on Goal’ (e.g., 320), ‘G’ for ‘Goal’ (e.g., 310), ‘C’ for ‘Corner Shot’ (e.g., 315), ‘Red’ for ‘Red Card’ (e.g., 325)) that point to segments (see FIG. 4) of the audio-video sequence on the timeline. In the case of a football match, a segment may correspond to a phase of play (actions) that is plotted as a function of the time (t) of the match. A segment may be associated with a keyword pertinent to an action of the football match, such as, for example, a free kick (FK) 305, a goal (G) 310, a corner kick (C) 315, a shot on goal (S) 320 and red card (Red) 325. For other types of audio-video content, e.g., a movie, keywords pertinent to an action may be for example ‘fight’, ‘dance’ or ‘sing’.
According to a further embodiment, keywords, e.g., for a sporting match may include the name and/or colors of teams playing the match, or of individual players, or e.g., for a movie, associated keywords (i.e., metadata) may include the name of actors. For example, as shown in FIG. 4, a graph 200 shows audio-video segments for the football match previously illustrated in FIG. 3 that are plotted as a function of the match time (t). For example, the segments shown on the timeline of figure 4 may be blue 410 or green 420 colored (here depicted as white rectangles (representing the blue color) and forward hatched rectangles (representing the green color) respectively). The blue 410 or green 420 colored segments are plotted along a time axis (a time line) representing the football match video time (0 is start of the match; an end-time may be indicated (not shown) indicating the end time of the match), wherein the color (hatching) of the segments (Fig. 4) may be indicative of the color of the team (e.g., blue team, green team) member performing the action of the football match associated with the corresponding audio-video segment.
Additional details of the audio-video content may be included in the metadata, such as audio level or excitement level and may be part of the metadata relative to a segment. As a non-limiting example, the audio level of the commentator and/or of the crowd or public may be extracted from the audio-video and may be associated with an action of the football match that corresponds to an audio-video segment. The audio level may be representative of an excitement level (e.g., of the commentator, of the crowd or public attending the football match).
In FIG. 5, 500 depicts extracted audio-video segments for the football match previously illustrated in FIG. 4, shown as a function of the match time (t), further including crowd audio level 510 and commentator audio level 515. The crowd audio level 510 and commentator audio level 515 are associated with an action of the football match that corresponds to an extracted audio-video segment.
As another non-limiting example, in a tennis match, extracting the score is indicative of whether a match or game is at a break point. Score information may be used for segmentation, and score information may be part of metadata of a segment. In the case of a news program, keywords (for example relating to a subject discussed) may be extracted, for example using speech recognition methods, be used to split segments and be added to metadata relative to a segment.
While using the term ‘extracting’ when related to audio-video segments, it is not necessary to physically extract (and store) audio-video segments. The ‘extraction’ may be a purely logical operation, consisting of storing pointers (e.g., video frame numbers and/or timestamps) to an audio-video sequence in metadata, in order to be able to, at a later time, generate an audio-video abstract. For example, pointers may define a segment by a start pointer (pointing to the start of the segment) and an end pointer (pointing to the end of the segment). In the metadata, a segment start and/or end pointer may be associated with a keyword or a set of keywords (a keyword string or keyword phrase), e.g., a keyword (keyword string) indicating an action (e.g., ‘attack of player B of the blue team’).
Referring again to FIG. 2, at step 220, one or more of the audio-video segments provided in step 210 are identified for the audio-video abstract. One or more audio-video segments are selected based on at least one keyword associated with the audio-video content, by matching the keywords associated with the audio-video content to a set of keywords that are, e.g., selected by a user using a user interface, or that are stored in a user profile or (user) configuration that is, for example, constructed from information entered by a user using a graphical user interface or from information obtained by monitoring user viewing behavior.
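
A minimal sketch of this matching step, reusing the hypothetical SegmentMetadata records introduced earlier (all names are assumptions), might look as follows.

```python
def select_parts(metadata, second_keywords):
    """Step 220 sketch: keep the parts whose (first) keywords in the metadata
    match at least one (second) keyword from the user selection, profile or
    configuration information."""
    wanted = {kw.lower() for kw in second_keywords}
    return [part for part in metadata
            if any(kw.lower() in wanted for kw in part.keywords)]

# Example: select_parts(metadata, ["goal", "slow motion replay"]) keeps only
# the parts labeled with one of those keywords.
```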
FIG. 6 is a user interface which may be used to implement embodiments of the disclosure. The user interface may include a time line 610 that contains a representation of the extracted audio-video segments of the audio-video content 620 on the time line depicted there above.
In the example shown in FIG. 6, the extracted audio-video segments, including keywords and further metadata, that are shown above time line 610 are those previously shown in FIG. 5. The segments that are selected on the time line 610 correspond for example to audio-video segments 630 a viewer may be interested in based on the keywords (e.g., ‘FK’, ‘S’, ‘C’, ‘G’, and ‘Red’) and further metadata that is shown above the time line 610. In FIG. 6, for example, the time line 610 is identified with the legend “Selection”. For example, the selected segments may be relative to the “blue” team and to that team’s actions: free kicks (FK), corner kicks (C), goals (G) and shots on goal (S).
FIG. 7 is another user interface 700 which may be used to implement embodiments of the disclosure. The graphical user interface (GUI) 700 may be used to generate a (user) profile or (user) configuration with the purpose of generating an audio-video abstract based on the (user) profile/configuration.
In this non-limiting example, a graphical user interface (GUI) is populated with audio-video segments 710 that are selected according to their mapping to the keywords and other metadata (e.g., as preconfigured, or configured by a user/viewer). The audio-video segments 710 may be identified as Category 1 to Category N. A category may be based on a (set of) keyword(s) and a (set of) keyword(s) may be mapped to one or more segments.
A maximum duration 720 may be specified for the to-be-generated audio-video abstract. A timeline 730 of the original audio-video may be displayed along with a duration counter 740 corresponding to the duration of the selected segments for the audio-video abstract. The abstract duration may be presented as a time or as a percentage of the original audio/video duration. The timeline 730 indicates where, chronologically, the selected audio-video segments, selected according to the selected metadata category, appear (no segments are shown on timeline 730, but the presentation may be similar to that of figure 4 or 6 (timeline 630)). The segments may be color-coded based on the different categories selected; see FIG. 8.
According to the embodiment of FIG. 8, three (3) categories have been selected 815 (Category 1, Category k, Category N) and a maximum duration 820 (5 min, 30 sec) for the to-be-generated audio-video abstract is indicated. When a category is selected (e.g., using checkbox 815), a color (e.g., 805a (e.g., green for Category 1), 805b (e.g., yellow for Category k), 805c (e.g., purple for Category N)) is assigned to the category. In figure 8, colors are represented by the use of different types of hatching. Each selected category 815 may have a different color. The associated color-coded category segments 825 is/are indicated on timeline 730 corresponding to the full (original, not abstracted) content and the abstract duration counter 740 indicates the totaled duration of all selected segments. A category may be based on a (set of) keyword(s) and a (set of) keyword(s) may be mapped to one or more segments. As a non-limiting example, Category k (yellow) maps to three audio-video segments (the second, the third and the next-to-last segment, following the timeline from left to right), as indicated by arrows. The arrows are shown for explanative purposes and may, or may not, be included in the user interface. An audio-video abstract corresponding to the selections may be generated by selecting the “Build my abstract” button 850.
FIG. 9 is a further embodiment depicting a graphical user interface which may be used to generate audio-video abstracts according to the disclosure. A graphical user interface (GUI) for a video replay store is shown, such as, for example, a video replay store that enables replay of recent rugby matches. In this graphical user interface two options are of particular interest: replay option 910 and audio-video abstract generation option 920. The replay option 910 replays the sporting event in its entirety. The audio-video abstract option 920 opens a new window (see FIG. 10) where a viewer may make selections for generating his/her audio-video abstract of the selected sporting event.
Now referring to FIG. 10, which represents a UI according to an embodiment after the user has selected the audio-video abstract option 920 (see FIG. 9), a selection menu 1010 displays a list of choices/preferences 1005a-b. According to the embodiment of figure 10, two columns (left column, right column) of example selection choices/preferences are shown that are specific for two-team sports events such as a football match, a tennis match, and a rugby match; i.e., a left column 1005a for selection of actions of a blue team, and a right column 1005b for selections of a white team (‘blue’ and ‘white’ may refer here to colors of players’ T-shirts). The option “Only blue team” (left column, top) selects, for the generation of the audio-video abstract, only actions of the blue team. A similar selectable option “Only white team” (right column, top) enables selection of only actions of the white team. Further selectable options are ‘Try’ (Rugby terminology - indicating a way of scoring points), ‘Scrum’ (Rugby terminology - a method of restarting play in rugby football that involves players packing closely together with their heads down), and ‘Crucial phase of play’ (an important moment in the sports event). A crucial phase of play in an audio-video of a sports event may for example be detected in the audio-video (e.g., by the audio-video abstract generation engine) and then added to the metadata, based on any of the following:
• A relatively high audio level of the commentator’s voice (i.e., relatively high compared to the mean audio level of the commentator’s voice);
• A relatively high pitch of the commentator’s voice (i.e., relatively high compared to a mean pitch of the commentator’s voice);
• Keywords pronounced by the commentator’s voice (e.g., ‘attack’, ‘goal’);
• A relatively high speed of the commentator’s speech (fast talking) (i.e., relatively high compared to a mean speed of the commentator’s speech);
• A relatively high audio level of the audio generated by the crowd/public/fans attending the match.
Alternatively, the crucial phase of play may be added by hand to the metadata associated with the audio-video (e.g., by a video editor).
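
As a hedged illustration of the first of these cues (the disclosure does not prescribe a particular detector), the sketch below flags time windows whose commentator or crowd audio level exceeds the mean level of the track by a configurable factor; the window length, the factor and the assumption of one RMS value per analysis window are all hypothetical.

```python
import numpy as np

def crucial_spans(levels, window_s=1.0, factor=1.5):
    """Flag 'crucial phase of play' candidates: windows whose commentator (or
    crowd) audio level exceeds `factor` times the mean level of the track.
    `levels` holds one RMS value per analysis window of length `window_s`;
    returns (start_s, end_s) spans on the original timeline."""
    levels = np.asarray(levels, dtype=np.float32)
    hot = levels > factor * levels.mean()
    spans, start = [], None
    for i, flag in enumerate(hot):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start * window_s, i * window_s))
            start = None
    if start is not None:
        spans.append((start * window_s, len(hot) * window_s))
    return spans
```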
Still further options that can be selected may be ‘red/yellow cards’ (the referee gives a red or yellow card to a player), the selection of slow motion sequence(s), or selection of actions happening in the 22-meter line (specific to rugby), or following actions of a specific player (selectable through 1025), e.g., of player 9 (indicated by #9 in the figure, where ‘#’ stands for ‘number’) of the blue team, and of players 9 and 6 of the white team, or only events related to the first half of the match (selectable option 1035a), or only events related to the second half of the match (selectable option 1035b).
All these selectable options correspond to metadata associated with the audio-video of the sports event (the metadata comprise keywords/key phrases with associated timestamps referring to the timeline of the original (full) audio-video). The metadata may be generated automatically or manually (e.g., using a video editor) during, or after, the sports event, e.g., in a step 210 (FIG. 2).
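The disclosure does not prescribe any particular metadata syntax. Purely as an assumed example, the keywords/key phrases and timestamps could be serialized as simple records referring to the original timeline (the field names are illustrative, not mandated by the disclosure):

```python
# Hypothetical metadata layout: one record per segment of the original audio-video,
# with timestamps (in seconds) on the original timeline and categorizing keywords.
metadata_example = [
    {"start": 754.0,  "end": 792.5,  "keywords": ["blue team", "try", "player #9"]},
    {"start": 1310.0, "end": 1322.0, "keywords": ["white team", "scrum"]},
    {"start": 2105.5, "end": 2140.0, "keywords": ["crucial phase of play", "second half"]},
]
```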
A maximum duration 1020 may be selected for the audio-video abstract. If a maximum duration is specified, the audio-video abstract generation engine keeps the generated audio-video abstract within (or close to) the specified duration.
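The disclosure does not mandate how the duration limit is enforced. One simple possibility, sketched here as an assumption only, is a greedy pass that keeps matching segments in timeline order until the budget is exhausted:

```python
def cap_duration(segments, max_duration):
    """Keep segments in timeline order until the maximum abstract duration is reached.

    segments: list of dicts with 'start' and 'end' (seconds on the original timeline).
    Returns the kept segments and their total duration.
    """
    kept, total = [], 0.0
    for seg in sorted(segments, key=lambda s: s["start"]):
        length = seg["end"] - seg["start"]
        if total + length <= max_duration:
            kept.append(seg)
            total += length
    return kept, total

# e.g., cap_duration(matching_segments, max_duration=5 * 60 + 30)  # "5 min, 30 sec"
```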
FIG. 11 shows an embodiment of a UI that may be displayed when selections have been made (e.g., using the UI shown in FIG. 10). A timeline 1130 is displayed which provides visual feedback. In this example, one selection is made, shown by a black circle selecting the ‘try’ option in the left selection column. The corresponding selected segments 1105 (which are generated by the audio-video abstract generation engine based on the selected options and the metadata associated with the original (full) audio-video of the sports event) are depicted on timeline 1130. Additionally, the total time duration 1140 for the selected segments is displayed. The timeline 1130 enables the viewer to visualize the selected segments as well as the total duration 1140 of the personal audio-video abstract to be generated based on the selected option(s).
FIG. 12 shows the UI of FIG. 11 when multiple selections 1205 are made. The corresponding selected audio-video segments 1210, 1215 are depicted on timeline 1230a. For this example, segments 1210 correspond to actions of the “blue” team and are depicted on timeline 1230a with a “blue” color (shown here by hatched rectangles). The segments 1215 correspond to the “white” team and are depicted on timeline 1230a with a “white” color (shown here by non-hatched rectangles). The “white” and “blue” designation of the segments provides a quick visualization and makes it easy to distinguish the selected segments for the two teams of the sporting event. Timeline 1230b shows the timeline of the audio-video abstract that will be generated from the selected audio-video segments when the ‘build’ button is activated. It can be seen that the non-selected parts of the full (original) audio-video sequence of the sports event do not contribute to the audio-video abstract; only the selected segments of the original (full) audio-video sequence contribute to (are selected for, are included in) the audio-video abstract. The arrows indicate which segment on timeline 1230a contributes to which segment of the audio-video abstract shown on timeline 1230b. According to an embodiment, the segments in the audio-video abstract are separated by an introductive audio-video sequence (for example, a set of audio-video frames that introduces the content of the segment that follows, e.g., ‘Blue team: try action’, ‘White team: red card for player number 9’), the introductive audio-video sequences being added (inlayed) to the audio-video abstract when it is generated, e.g., as prepended audio-video frames. According to an embodiment, the audio-video segments that are selected from the original audio-video sequence according to the selection(s) 1205 are put in an order different from that shown on timeline 1230b when the audio-video abstract is generated; for example, segments related to actions of the blue team may be regrouped and followed by similarly regrouped actions of the white team.
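As an illustration only, the assembly and optional regrouping described above could be expressed as an edit list that a downstream cutting/concatenation tool then executes; the record fields (‘team’, ‘label’) and the format of the introductive entries are assumptions made for this sketch, not part of the disclosure:

```python
def build_edit_list(selected_segments, regroup_by_team=False):
    """
    Produce an ordered edit list for the abstract: an optional introductive entry
    (e.g., 'Blue team: try action') followed by the segment itself.
    Each entry is a dict that a downstream cutter/concatenator could execute.
    """
    segments = list(selected_segments)
    if regroup_by_team:
        # e.g., all blue-team actions first, then white-team actions,
        # each group kept in timeline order
        segments.sort(key=lambda s: (s["team"], s["start"]))
    else:
        segments.sort(key=lambda s: s["start"])

    edit_list = []
    for seg in segments:
        edit_list.append({"type": "intro", "text": f"{seg['team']}: {seg['label']}"})
        edit_list.append({"type": "cut", "start": seg["start"], "end": seg["end"]})
    return edit_list
```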
Referring again to step 230 of FIG. 2 and to FIG. 8, after selection of one or more of the audio-video segments identified in step 220, the personalized audio-video abstract may be generated (e.g., by the audio-video abstract generation engine). The abstract generation may be triggered by selection of the “Build” button 850, 1050 (FIG. 8, FIGs. 10-12).
In step 230 (FIG. 2), the audio-video abstract is built (constructed, generated) based on the original (full) audio-video sequence, its associated metadata, and the selected options (e.g., as selected by a user using the UIs shown in FIGs. 7-12). According to a further embodiment, an enhanced audio-video abstract can also be generated by optionally adding information inlays between segments, such as, for example, video frame inlays including time, score, or state of play. According to a further embodiment, a user may edit and modify the selected audio-video segments; for example, using timeline 1230b, a user may select a segment and remove it, or move it to a different location on timeline 1230b, so that the segment will be located at a different time in the generated audio-video abstract.
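The editing operations mentioned above (removing a selected segment, or moving it to another position on timeline 1230b) amount, under the assumptions of the previous sketches, to simple manipulations of the ordered segment list:

```python
def remove_segment(abstract_segments, index):
    """Drop the segment the user selected on the abstract timeline."""
    return abstract_segments[:index] + abstract_segments[index + 1:]

def move_segment(abstract_segments, index, new_index):
    """Move a selected segment so it appears at a different time in the abstract."""
    segments = list(abstract_segments)
    seg = segments.pop(index)
    segments.insert(new_index, seg)
    return segments
```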
According to a further embodiment, the audio-video abstracts may be generated fully automatically, without viewer/user intervention, i.e., without user selections via graphical user interfaces as shown in the figures. For example, according to an embodiment, the device implementing the disclosed embodiment(s) may generate audio-video abstracts based on configuration information accessible to the device, e.g., stored in the device itself, or in a profile (e.g., a user profile or a device profile) stored in a network (cloud). The profile may contain a set of keywords similar to those selected by a user via the user interfaces shown in the figures, which are then parsed by the audio-video abstract generation engine to create an audio-video abstract of an audio-video sequence. The keywords may be selected for inclusion in the profile according to user actions. For example, when a user often watches video sequences of football matches concerning the ‘blue’ team and skips through the video sequences, using forward and reverse play actions, mainly to view goals of the blue team, the profile may be filled with keywords derived from these user actions. The profile may be adjusted over multiple viewings of audio-video sequences so that it best corresponds to what seems to be of particular interest to the user based on his/her viewing behavior. The profile may include information similar to the selections made by a user via the user interfaces presented in the appended figures, including, for example, a maximum abstract duration, which may be preconfigured by the user or determined from viewer behavior monitoring.
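As a non-authoritative sketch of how such a profile could be filled from viewing behavior: segments that the user replays are credited with their metadata keywords, skipped segments are debited, and the best-scoring keywords become the keywords stored in the profile. The event names, the weights, and the top-10 cut-off are assumptions made for this example only:

```python
from collections import Counter

def update_profile(profile_counts, viewing_events, metadata):
    """
    profile_counts: Counter mapping keyword -> score (the stored profile).
    viewing_events: list of dicts like {"start": ..., "end": ..., "action": "replayed"}.
    metadata: list of segment records with 'start', 'end', 'keywords'.
    """
    weights = {"replayed": 2, "watched": 1, "skipped": -1}
    for event in viewing_events:
        for seg in metadata:
            overlaps = event["start"] < seg["end"] and seg["start"] < event["end"]
            if overlaps:
                for keyword in seg["keywords"]:
                    profile_counts[keyword] += weights.get(event["action"], 0)
    return profile_counts

def profile_keywords(profile_counts, top_n=10):
    """Keywords stored in the profile, later matched against the content metadata."""
    return [kw for kw, score in profile_counts.most_common(top_n) if score > 0]
```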
FIG. 13 shows an embodiment of a method 1300 according to the present disclosure. In a step 1301, audio-video content and associated metadata are received. The metadata is mapped to (contains pointers to) parts (segments) of the audio-video content and includes first keywords (or sets of keywords or key phrases) categorizing the parts (the segments). In a step 1302, one or more parts (segments) of the audio-video content are selected based on matching of at least one of the first keywords in the metadata to at least one of second keywords stored in configuration information. In a step 1303, an audio-video abstract is generated based on the selected one or more parts of the audio-video content.
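Pulling steps 1301-1303 together in one hedged sketch (the data layout follows the illustrative metadata example above, and cap_duration refers to the earlier illustrative helper; none of this is meant as the definitive implementation):

```python
def generate_abstract(metadata, second_keywords, max_duration=None):
    """Steps 1302-1303: select the parts whose first keywords match the configured
    second keywords, then return the ordered parts the abstract is built from."""
    selected = [
        seg for seg in metadata
        if set(seg["keywords"]) & set(second_keywords)   # step 1302: keyword matching
    ]
    selected.sort(key=lambda s: s["start"])
    if max_duration is not None:
        selected, _ = cap_duration(selected, max_duration)  # reuse the greedy sketch above
    return selected  # step 1303: these parts are concatenated into the abstract
```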
The disclosed embodiments are explained with the help of sports events and audio-video sequences related to sports events. The reader will readily understand that the disclosed embodiments are not limited to sports events or audio-video sequences related to sports events, but may be used to create audio-video abstracts of other types of events and of audio-video sequences related to these other types of events, such as movies and documentaries.
According to an embodiment, there is disclosed a method, implemented by a device, the method comprising:
• receiving audio-video content and associated metadata mapped to parts of the audio-video content and first keywords categorizing the parts;
• selecting one or more parts of the audio-video content based on matching of at least one of first keywords to at least one of second keywords stored in configuration information; and
• generating an audio-video abstract based on the selected one or more parts of the audio-video content.
According to an embodiment, the method further comprises, as part of the generating the audio-video abstract, prepending the selected one or more parts by audio-video frames comprising a description of the selected one or more parts based on the matching at least one of first and second keywords.
According to an embodiment, at least part of the configuration information is obtained via a user interface.
According to an embodiment, at least part of the configuration information is obtained by monitoring viewer behavior.
According to an embodiment, the method comprises receiving a maximum desired duration for the audio-video abstract from the user interface and storing the received maximum desired duration in the configuration information.
According to an embodiment, the method comprises determining a maximum desired duration for the audio-video abstract based on monitoring viewer behavior and storing the determined maximum desired duration in the configuration information.
The present disclosure also relates to a device comprising at least one processor, configured to:
• receive audio-video content and associated metadata mapped to parts of the audio-video content and first keywords categorizing the parts;
• select one or more parts of the audio-video content based on matching of at least one of first keywords to at least one of second keywords stored in configuration information;
• generate an audio-video abstract based on the selected one or more parts of the audio-video content.
According to an embodiment, the at least one processor is further configured to, as part of the generating the audio-video abstract, prepend the selected one or more parts by audio-video frames comprising a description of the selected one or more parts based on the matching at least one of first and second keywords.
According to an embodiment, the at least one processor is further configured to obtain at least part of the configuration information via a user interface. According to an embodiment, the at least one processor is further configured to obtain at least part of the configuration information by monitoring viewer behavior.
According to an embodiment, the at least one processor is further configured to receive a maximum desired duration for the audio-video abstract from the user interface and to store the received maximum desired duration in the configuration information.
According to an embodiment, the at least one processor is further configured to determine a maximum desired duration for the audio-video abstract based on monitoring, by the at least one processor, viewer behavior, and to store the determined maximum desired duration in the configuration information.
The present disclosure also relates to a computer program product comprising instructions which, when executed, cause a processor to implement any one of the described embodiments of the disclosed method.
Although the present embodiments have been described hereinabove with reference to specific embodiments, the present disclosure is not limited to the specific embodiments, and modifications which lie within the scope of the claims will be apparent to a person skilled in the art.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular, the different features from different embodiments may be interchanged where appropriate.

Claims

1. A method, implemented by a device, the method comprising: receiving audio-video content and associated metadata mapped to parts of the audio-video content and first keywords categorizing the parts; selecting one or more parts of the audio-video content based on matching of at least one of first keywords to at least one of second keywords stored in configuration information; and generating an audio-video abstract based on the selected one or more parts of the audio-video content.
2. The method according to claim 1, further comprising, as part of the generating the audio-video abstract, prepending the selected one or more parts by audio-video frames comprising a description of the selected one or more parts based on the matching at least one of first and second keywords.
3. The method according to claim 1 or 2, wherein at least part of the configuration information is obtained via a user interface.
4. The method according to claim 1 or 2, wherein at least part of the configuration information is obtained by monitoring viewer behavior.
5. The method according to claim 3, further comprising: receiving a maximum desired duration for the audio-video abstract from the user interface and storing the received maximum desired duration in the configuration information.
6. The method according to claim 4, further comprising: determining a maximum desired duration for the audio-video abstract based on monitoring viewer behavior and storing the determined maximum desired duration in the configuration information.
7. A device comprising at least one processor, configured to: receive audio-video content and associated metadata mapped to parts of the audio-video content and first keywords categorizing the parts; select one or more parts of the audio-video content based on matching of at least one of first keywords to at least one of second keywords stored in configuration information; and generate an audio-video abstract based on the selected one or more parts of the audio-video content.
8. The device according to claim 7, wherein the at least one processor is further configured to, as part of the generating the audio-video abstract, prepend the selected one or more parts by audio-video frames comprising a description of the selected one or more parts based on the matching at least one of first and second keywords.
9. The device according to claim 7 or 8, wherein the at least one processor is further configured to obtain at least part of the configuration information via a user interface.
10. The device according to claim 7 or 8, wherein the at least one processor is further configured to obtain at least part of the configuration information by monitoring viewer behavior.
11. The device according to claim 9, wherein the at least one processor is further configured to receive a maximum desired duration for the audio-video abstract from the user interface and to store the received maximum desired duration in the configuration information.
12. The device according to claim 10, wherein the at least one processor is further configured to determine a maximum desired duration for the audio-video abstract based on monitoring, by the at least one processor, viewer behavior, and to store the determined maximum desired duration in the configuration information.
13. A computer program product comprising instructions which when executed cause a processor to implement the method of any one of claims 1 to 6.
PCT/EP2022/055755 2021-03-08 2022-03-07 Method and device for generating an audio-video abstract WO2022189359A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP21305268 2021-03-08
EP21305268.1 2021-03-08

Publications (1)

Publication Number Publication Date
WO2022189359A1 true WO2022189359A1 (en) 2022-09-15

Family

ID=75302438

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/055755 WO2022189359A1 (en) 2021-03-08 2022-03-07 Method and device for generating an audio-video abstract

Country Status (1)

Country Link
WO (1) WO2022189359A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325869A1 (en) * 2012-06-01 2013-12-05 Yahoo! Inc. Creating a content index using data on user actions
WO2016057416A1 (en) * 2014-10-09 2016-04-14 Thuuz, Inc. Generating a customized highlight sequence depicting one or more events
US20190373310A1 (en) * 2018-06-05 2019-12-05 Thuuz, Inc. Audio processing for detecting occurrences of crowd noise in sporting event television programming
