CN113795882B - Emotion-based multimedia content summarization - Google Patents

Emotion-based multimedia content summarization

Info

Publication number
CN113795882B
Authority
CN
China
Prior art keywords
emotion, full length, emotion-based time, video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201980092247.1A
Other languages
Chinese (zh)
Other versions
CN113795882A (en)
Inventor
Tariq Chowdhury
Jian Tang
Declan O'Sullivan
Owen Conlan
Jeremy Debattista
Fabrizio Orlandi
Majid Latifi
Matthew Nicholson
Islam Hassan
Killian McCabe
Declan McKibben
Daniel Turner
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN113795882A
Application granted
Publication of CN113795882B
Legal status: Active

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 - Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 - Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/034 - Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/102 - Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B 27/105 - Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Methods and systems for generating an emotion-based summary video for a full-length movie are provided herein. The method comprises the following steps: receiving a full-length movie; receiving a Knowledge Graph (KG) annotation model of the full-length movie, wherein the KG annotation model is generated by annotating extracted features of the full-length movie; segmenting the full-length movie into a plurality of emotion-based time intervals based on an analysis of the KG, wherein each time interval represents a certain dominant emotion; calculating a score for each of the plurality of emotion-based time intervals based on one or more of a plurality of metrics, wherein the metrics represent a level of relevance of the respective emotion-based time interval to the narrative of the full-length movie; generating an emotion-based summary video by concatenating a subset of the plurality of emotion-based time intervals having scores exceeding a predefined threshold; and outputting the emotion-based summary video for presentation to one or more users.

Description

Emotion-based multimedia content summarization
Technical Field
In some embodiments, the present invention relates to generating a summary video for multimedia content, and more particularly, but not exclusively, to generating an emotion-based summary video for multimedia content, particularly full-length movies.
Background
Multimedia content, such as video content and/or audio content, is growing at a staggering rate, providing endless choices for content consumption.
A large portion of multimedia content, such as a movie, a television show, and/or a television program, may have a very long duration.
Thus, creating a summary video to visually and/or audibly summarize such multimedia content, particularly full-length movies and/or multi-season series, in a much shorter duration may be highly desirable and beneficial for a number of applications, services, uses, and objectives. Such applications may include, for example, enabling a user to select multimedia content for consumption and/or categorizing multimedia content (e.g., by genre, narrative, etc.), and the like.
However, one of the main challenges in creating these summarized videos is to create an efficient summarized video that presents narratives of multimedia content, e.g., plot, progress, dominant facts and/or moments of truth, etc., in a concise and coherent manner so that users (viewers) viewing the summarized video can accurately understand the narratives of the multimedia content.
Disclosure of Invention
It is an object of embodiments of the present invention to provide a solution that alleviates or solves the disadvantages and problems of conventional solutions. The above and further objects are solved by the subject matter of the independent claims. Further advantageous embodiments can be found in the dependent claims.
The present invention aims to provide a solution for creating a video summary summarizing multimedia content, in particular full-length movies, in a coherent, concise and accurate manner.
According to a first aspect of the present invention, there is provided a method for generating a summary video based on emotion for a full-length movie, comprising:
-receiving a full-length movie;
-receiving a knowledge graph (KG) created for the full-length movie, wherein the KG comprises an annotated model of the full-length movie generated by annotating features extracted for the full-length movie;
-segmenting the full-length movie into a plurality of emotion-based time intervals based on an analysis of the KG, wherein each time interval represents a certain dominant emotion;
-calculating a score for each of the plurality of emotion-based time intervals according to one or more of a plurality of metrics, wherein the metrics represent a level of relevance of the respective emotion-based time interval to the narrative of the full-length movie;
-generating an emotion-based summary video by concatenating a subset of the plurality of emotion-based time intervals having scores exceeding a predefined threshold; and
-outputting the emotion-based summary video for presentation to one or more users.
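As a purely illustrative aid (not part of the claimed subject matter), the threshold-based selection and concatenation described above can be sketched in Python; the data model and all names (EmotionInterval, generate_summary) and the representation of an interval are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EmotionInterval:
    start: float           # start time on the movie timeline, in seconds
    end: float             # end time on the movie timeline, in seconds
    dominant_emotion: str  # e.g. "sad" or "excited"
    score: float           # narrative-relevance score computed from the metrics

def generate_summary(intervals: List[EmotionInterval],
                     threshold: float) -> List[EmotionInterval]:
    """Keep the emotion-based time intervals whose score exceeds the predefined
    threshold and return them in timeline order; this ordered list defines the
    concatenation order of the emotion-based summary video."""
    selected = [iv for iv in intervals if iv.score > threshold]
    return sorted(selected, key=lambda iv: iv.start)
```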
According to a second aspect of the present invention, there is provided a system for generating an emotion-based summarized video for a full-length movie, comprising one or more processors for executing code, wherein the code comprises:
-code instructions for receiving a full-length movie;
-code instructions for receiving a KG created for the full-length movie, wherein the KG comprises an annotated model of the full-length movie generated by annotating features extracted for the full-length movie;
-code instructions for segmenting the full-length movie into a plurality of emotion-based time intervals based on an analysis of the KG, wherein each time interval represents a certain dominant emotion;
-code instructions for calculating a score for each of the plurality of emotion-based time intervals based on one or more of a plurality of metrics, wherein the metrics represent a level of relevance of the respective emotion-based time interval to the narrative of the full-length movie;
-code instructions for generating an emotion-based summary video by concatenating a subset of the plurality of emotion-based time intervals having scores exceeding a predefined threshold; and
-code instructions for outputting the emotion-based summarized video for presentation to one or more users.
According to a third aspect of the present invention, there is provided a computer readable storage medium comprising computer program code instructions executable by a computer for performing the above identified method.
In another implementation of the first and/or second aspect, each annotated feature in the KG annotation model is associated with a timestamp indicating a temporal position of the respective feature in a timeline of the full-length movie. The time stamp may map each feature represented by the KG annotation model to a time of occurrence on a time axis of the full-length movie. Mapping the features onto the time axis is necessary to accurately associate the features with their times to identify a plurality of emotion-based time intervals and to apply the metric used to calculate the score to each emotion-based time interval.
In another implementation form of the first and/or second aspect, the full-length movie is segmented into a plurality of emotion-based time intervals based on an analysis of the KG according to one or more features representing one or more of background music, semantic content of speech, an emotional facial expression of the character, and an emotion-indicative gesture of the character. These features may be highly indicative of the mood, in particular the dominant mood, expressed within the respective mood-based time interval.
In another implementation form of the first and/or second aspect, the plurality of metrics includes: the number of actors appearing in a mood-based time interval, the length of time each actor appears in the mood-based time interval, and the number of actions associated with each actor in the mood-based time interval. These metrics can provide an accurate and easily measurable method for assessing the relevance of each emotion-based time interval to the narrative of the full-length movie.
In an optional implementation of the first and/or second aspect, at least some of the subset of emotion-based time intervals are selected according to a score of a diversity metric calculated for each of the plurality of emotion-based time intervals, wherein the diversity metric represents the difference of each emotion-based time interval from its neighboring emotion-based time intervals with respect to one or more interval attributes, the one or more interval attributes being one of a group consisting of the characters appearing within an emotion-based time interval, the dominant emotion of an emotion-based time interval, and the actions seen within an emotion-based time interval. Selecting the emotion-based time intervals based on the diversity scores allows a diverse set of emotion-based time intervals to be selected that can convey a detailed, broad, and/or comprehensive scope of the narrative. Using the diversity score may also avoid selecting redundant emotion-based time intervals that differ little and/or insignificantly from other selected emotion-based time intervals.
In an optional implementation of the first and/or second aspect, the subset of emotion-based time intervals is selected in dependence on a length of time defined for the emotion-based summary video. Adjusting the choice of emotion-based time intervals according to the predefined length of time (length) of the summary video may allow a high flexibility in choosing emotion-based time intervals to best provide, present and/or convey narratives within the time constraints applicable to the summary video.
In another implementation manner of the first and/or second aspect, the KG annotation model is created by automatically lifting features extracted from at least one of: video content of the full-length movie, audio content of the full-length movie, voice content of the full-length movie, one or more subtitle records associated with the full-length movie, and metadata records associated with the full-length movie. The KG annotation model may be a tool that provides very rich, extensive and accurate information describing the full-length movie, and may be used to extract accurate and extensive features, to segment the full-length movie into mood-based time intervals and/or to calculate scores for mood-based time intervals, etc.
In an optional implementation manner of the first and/or second aspect, the KG annotation model uses one or more manually annotated features extracted for the full-length movie. Manually annotated features may be used to enhance the KG annotation model in cases where automated tools may be somewhat limited.
Other systems, methods, features and advantages of the invention will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present invention, exemplary methods and/or materials are described below. In case of conflict, the present patent specification will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
Methods and/or systems implementing embodiments of the invention may involve performing or completing selected tasks manually, automatically, or a combination thereof. Furthermore, according to actual instrumentation and equipment of method and/or system embodiments of the present invention, several selected tasks could be implemented by hardware, software or firmware or a combination thereof using an operating system.
For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks of an exemplary embodiment of a method and/or system described in accordance with the invention are performed by a data processor, e.g., a computing platform, for executing a plurality of instructions. Optionally, the data processor includes volatile memory for storing instructions and/or data, and/or non-volatile memory, such as a magnetic disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is also provided. Optionally, a display and/or a user input device, such as a keyboard or mouse, may also be provided.
Drawings
Some embodiments of the invention are described herein, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the embodiments of the present invention. Thus, it will be apparent to one skilled in the art from this description of the drawings how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a flow diagram of an exemplary process for generating an emotion-based summary video for a full-length movie, according to some embodiments of the present invention;
FIG. 2 is a schematic diagram of an exemplary system for generating an emotion-based summary video for a full-length movie, according to some embodiments of the present invention;
FIG. 3 is a diagram of an exemplary emotion-based segmentation of a full-length video provided by some embodiments of the present invention;
FIG. 4 is a graph of experimental score distributions provided by users presented with emotion-based summary videos to rate their understanding of the narrative of a full-length movie, according to some embodiments of the present invention;
FIG. 5 is a chart of experimental results provided by some embodiments of the present invention for evaluating emotion-based summary videos created for three full-length movies.
Detailed Description
In some embodiments, the present invention relates to generating a summary video for multimedia content, and more particularly, but not by way of limitation, to generating an emotion-based summary video for multimedia content, particularly full-length movies.
Creating a summarized video that visually (rather than textually) summarizes multimedia content, particularly full-length movies, may be highly desirable and beneficial for a variety of applications, services, uses, and goals, e.g., enabling a user to select multimedia content to consume and/or categorizing multimedia content (e.g., by genre, narrative, etc.), and the like.
However, in order for the summary video of the full-length movie to be useful, the much shorter video summary should be concise and coherent while conveying (delivering) the narrative of the full-length movie, such as its plot, progress, dominant facts, and/or critical moments.
Thus, the summary video may be created by selecting and concatenating a plurality of segments (time intervals) extracted from the full-length movie such that their combined total duration (length) is much shorter than the original full-length movie, for example, about 15% to 25% of its length.
There may be various methods and concepts for selecting the time interval to be included in the summary video from the full-length movie. The time intervals should be short enough to provide sufficient granularity and localization, thus allowing selection of time intervals that can reliably and accurately convey narratives of the full-length movie. However, the time intervals should be long enough to create a coherent summary video, wherein the selected time intervals are effectively logically connected to each other.
According to some embodiments of the present invention, methods, systems, and computer program products are provided for generating summary videos by concatenating selected time intervals from an original full-length movie to visually summarize multimedia content, particularly full-length movies. The summary video is created based on the emotions (feelings) represented in the time intervals (e.g., happy, sad, depressed, anxious, excited, angry, and/or impatient, etc.).
Studies and experiments (as described below) demonstrate that emotion-based time intervals, typically on the order of a few minutes long, are sufficiently short (in duration) to serve as efficient segmentation (splitting) units for segmenting a full-length movie into multiple high-resolution time intervals. However, emotion-based time intervals have proven to be long enough to express substantial aspects of the narrative of a full-length movie in a reliable, consistent and concise manner.
Thus, an emotion-based summary video is created for the full-length movie by concatenating a subset of emotion-based time intervals selected from a plurality of emotion-based time intervals. The plurality of emotion-based time intervals are created by segmenting the full-length movie according to the emotion represented by each of the emotion-based time intervals. Further, since one or more of the mood-based time intervals may represent multiple moods, the full-length movie may be segmented into multiple mood-based time intervals according to the dominant mood represented by each of the mood-based time intervals.
The full-length movie is segmented based on an analysis of a Knowledge Graph (KG) annotation model created by manually and/or automatically lifting features extracted from the full-length movie. The creation of the KG annotation model is outside the scope of the present invention; it can be created for a full-length movie by lifting and annotating features extracted from one or more data sources associated with the full-length movie. For example, the one or more data sources are video (visual) content, audio content, voice content, subtitle records associated with the full-length movie, metadata records associated with the full-length movie, textual descriptions and/or summaries of the full-length movie, lists of actors and/or characters, and so on. Thus, the KG annotation model can provide very rich, extensive, and accurate information describing the full-length movie, the scenes of the full-length movie, and/or the features extracted from the full-length movie.
The full-length movie may be segmented into emotion-based time intervals based on one or more emotion-indicative features extracted from the KG annotation model, e.g., background music, semantic content of speech, emotion-indicative facial expressions of the characters, and/or emotion-indicative gestures of the characters, etc.
After the full-length movie is partitioned into a plurality of emotion-based time intervals, a summary video may be created by concatenating a selected subset of the emotion-based time intervals used to convey, present, and/or deliver the narrative for the full-length movie.
To select the emotion-based time intervals to be included in the video summary, a set of metrics is defined to enable assessing the level of relevance of each emotion-based time interval to the narrative, in particular, the contribution of each emotion-based time interval to the understanding of the narrative by a user (viewer) viewing the summary video.
The newly defined metrics may include relevance metrics such as, for example, a number of heroes that occurred within a respective emotion-based time interval, a length of occurrence of each hero within a respective emotion-based time interval, and/or a number of actions associated with each hero within a respective emotion-based time interval, among others.
A hero appearing in a full-length movie may have a major relevance, contribution, and/or impact on the narrative, plot, and/or progress of the full-length movie, particularly in comparison to other characters, such as supporting characters, minor characters, and/or extras. Thus, the number of heroes that occur within an emotion-based time interval of the full-length movie, and the length of time they occur within the emotion-based time interval, can be highly indicative of and reflect the level of relevance (degree of relevance, expressiveness, consistency, and/or alignment) of the emotion-based time interval to the narrative of the full-length movie. Further, actions related to heroes detected within emotion-based time intervals (e.g., actions performed by, applied to, and related to the hero) may also be highly indicative of the level of relevance of the respective emotion-based time interval to the narrative of the full-length movie.
A relevance score for each of the emotion-based time intervals may be calculated by one or more relevance metrics. Further, the relevance scores computed for one or more emotion-based time intervals may be based on an aggregation of relevance scores computed from a plurality of relevance metrics.
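One possible way to aggregate per-metric relevance scores into a single score per interval is a weighted sum, as in the sketch below; the metric names and the choice of equal default weights are assumptions, since the description only states that scores from a plurality of relevance metrics may be aggregated.

```python
def aggregate_relevance(metric_scores: dict, weights: dict = None) -> float:
    """Combine per-metric relevance scores (e.g. hero-count fraction, hero
    screen-time fraction, hero-action count) of one emotion-based time interval
    into a single relevance score. Equal weights are assumed when none are given."""
    if not metric_scores:
        return 0.0
    if weights is None:
        weights = {name: 1.0 for name in metric_scores}
    total_weight = sum(weights.values())
    return sum(weights[name] * score for name, score in metric_scores.items()) / total_weight

# e.g. aggregate_relevance({"hero_fraction": 0.5, "hero_duration": 0.8, "hero_actions": 0.3})
```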
Further, the defined metrics may include a diversity metric defined to represent the difference of each emotion-based time interval from its one or more adjacent emotion-based time intervals, i.e., the previous emotion-based time interval and the subsequent emotion-based time interval. The diversity metric may be defined over one or more interval attributes of each emotion-based time interval relative to its neighboring emotion-based time intervals. For example, the diversity metric may comprise differences in (the identity of) the characters appearing in the emotion-based time interval compared to the adjacent intervals, differences between the emotions, in particular, between the dominant emotion represented in the emotion-based time interval and the emotion represented in an adjacent interval, and/or differences between the actions described in the emotion-based time interval and the actions described in an adjacent time interval, etc.
A diversity score for each of the mood-based time intervals may be calculated from the diversity metrics, typically by aggregating the diversity scores calculated for each diversity metric with respect to one or more interval attributes. Thus, the diversity score indicates how different each emotion-based time interval is from its neighboring emotion-based time intervals. Identifying the diversity and differences between each emotion-based time interval and its neighbors enables selection of a different set of emotion-based time intervals, including a broad range of narratives for the full-length movie, while avoiding selection of similar, possibly redundant, emotion-based time intervals.
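A minimal sketch of such a diversity score follows, assuming each interval is represented by the set of characters appearing in it, its dominant emotion, and the set of actions seen in it; treating the per-attribute differences as set-overlap distances and averaging them over the neighbouring intervals is an assumption made for illustration, not the patented formula.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 minus the overlap ratio of two sets; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def diversity_score(interval: dict, neighbours: list) -> float:
    """Average difference of an emotion-based time interval from its neighbouring
    intervals over three interval attributes: the characters appearing in it,
    its dominant emotion, and the actions seen in it. Each interval is a dict
    with the keys 'characters' (set), 'dominant_emotion' (str), 'actions' (set)."""
    if not neighbours:
        return 1.0  # no neighbours: treat the interval as maximally different
    diffs = []
    for nb in neighbours:
        diffs.append((
            jaccard_distance(interval["characters"], nb["characters"])
            + (0.0 if interval["dominant_emotion"] == nb["dominant_emotion"] else 1.0)
            + jaccard_distance(interval["actions"], nb["actions"])
        ) / 3.0)
    return sum(diffs) / len(diffs)
```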
Thus, the subset of emotion-based time intervals selected for the summary video may include emotion-based time intervals selected according to their scores, e.g., an aggregation of their relevance scores and diversity scores. For example, an emotion-based time interval may be selected for the subset used to create the summary video, each score exceeding some predefined threshold. In another example, a number of highest scoring emotion-based time intervals may be selected for the subset used to create the summary video.
Optionally, the emotion-based time intervals and/or the number thereof are selected for the subset that constitutes the summarized video in accordance with one or more timing parameters, such as the total time duration defined for the summarized video and/or the time duration of one or more of the emotion-based time intervals, etc.
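When a total summary duration is defined, the selection can be sketched as a greedy pass over the intervals in descending score order, stopping once the time budget is exhausted; the greedy strategy itself is an assumption, as the description only requires that the selection respects the defined length.

```python
def select_for_duration(intervals: list, max_total_seconds: float) -> list:
    """Greedily pick the highest-scoring emotion-based time intervals until the
    target summary duration is reached, then restore chronological order.
    Each interval is assumed to be a dict with 'start', 'end' and 'score' keys."""
    chosen, total = [], 0.0
    for iv in sorted(intervals, key=lambda iv: iv["score"], reverse=True):
        duration = iv["end"] - iv["start"]
        if total + duration <= max_total_seconds:
            chosen.append(iv)
            total += duration
    return sorted(chosen, key=lambda iv: iv["start"])
```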
The selected subset of emotion-based time intervals may be concatenated to generate an emotion-based video summary of the full-length movie, which may be output for presentation to one or more users (viewers).
Summary video based on emotion may present major advantages and benefits over existing methods and systems of creating video summaries.
First, automatically creating the summarized video may greatly reduce the effort and/or time required to create a summarized video manually, as is done, at least in part, by some existing methods.
Moreover, some existing approaches may use one or more video inference and/or interpretation tools, algorithms, and/or techniques to automatically (at least partially) create a summarized video. However, these approaches may typically process the entire full-length movie, which may require a large amount or even an excessive amount of computing resources (e.g., processing resources and storage resources, etc.) and/or computing time. On the other hand, the emotion-based video summarization is based on processing only a limited set of time intervals of the full-length movie, thus greatly reducing the computational resources and/or computational time required to create the summarized video.
Furthermore, one of the main challenges in creating a summary video is that the created summary video should be highly representative of the narrative of the full-length movie while remaining coherent within a duration (a predefined short time) much shorter than the original full-length movie. Some existing methods for automatically creating a summarized video may include in the summarized video short time segments, typically action-related segments, extracted from the full-length movie. These short time segments may not be able to accurately, logically, and/or coherently deliver the narrative of the full-length movie. In contrast, emotion-based time intervals may define efficient segmentation units that are long enough to contain substantial segments of the full-length movie, conveying its narrative in a coherent manner, and short enough to allow selection of a large number of time intervals that present a collection of different segments (differing in content and/or context) of the full-length movie, conveying an accurate and broad narrative summary.
Furthermore, the introduction of new metrics, in particular relevance metrics, enables efficient automatic selection of important emotion-based time intervals, thereby further reducing the computational resources and/or computational time required to create a summarized video.
Furthermore, applying the newly introduced diversity metric for automatically selecting emotion-based time intervals can be used to select a wide and diverse set of emotion-based time intervals, which can highly represent the narrative of full-length movies, especially for movies with complex narratives in which the main aspects of the narrative may be distributed among many segments of the movie. Furthermore, the use of diversity scores may also avoid selecting redundant emotion-based time intervals that differ little and/or insignificantly from other selected emotion-based time intervals.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium (or media) including computer-readable program instructions for causing a processor to perform various aspects of the present invention.
The computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory, a Static Random Access Memory (SRAM), a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein should not be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or may be downloaded to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
The computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language (e.g., Smalltalk, C++, etc.) and a conventional procedural programming language (e.g., the "C" programming language, or a similar programming language).
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (e.g., through the internet using an internet service provider). In certain embodiments, an electronic circuit comprising programmable logic circuits, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), etc., can execute the computer-readable program instructions to personalize the electronic circuit using state information of the computer-readable program instructions in order to implement aspects of the present invention.
Aspects of the present invention are described herein in connection with flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Reference is now made to the drawings. Fig. 1 is a flow diagram of an exemplary process for generating an emotion-based summary video for a full-length movie, according to some embodiments of the present invention. Exemplary process 100 may be performed to generate an emotion-based summary video summarizing multimedia content, particularly a full-length movie. The mood-based summary video can be greatly shortened, possibly containing one or more segments of the full-length movie that are highly correlated with the overall narrative (ontology) of the full-length movie, thereby reliably conveying a summary of the full-length movie.
The generated emotion-based summarized video may be presented to one or more users (viewers) for one or more goals, uses, and/or applications, such as movie selection and/or movie classification, and the like.
Further reference is made to fig. 2. Fig. 2 is a schematic diagram of an exemplary system for generating an emotion-based summarized video for a full-length movie, according to some embodiments of the present invention. Exemplary video summarization system 200, e.g., a computer, server, computing node and/or cluster of computing nodes, etc., may execute a process, such as process 100, to generate (create) emotion-based summarized video for one or more full-length movies.
The video summarization system 200 may include an I/O interface 210, a processor 212 for executing the process 100, and a memory 214 for storing code (program memory) and/or data.
The I/O interface 210 may include one or more interfaces, ports, and/or interconnects for exchanging data with one or more external resources.
For example, I/O interface 210 may include one or more network interfaces for connecting to one or more wired and/or wireless networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a cellular network, and/or the internet. Through the network interface provided by the I/O interface 210, the video summarization system 200 may communicate with one or more remote network resources (e.g., devices, computers, servers, computing nodes, clusters of computing nodes, servers, systems, services, storage resources, cloud systems, cloud services, and/or cloud platforms, etc.).
In another example, the I/O interface 210 may include one or more interfaces and/or ports, such as a Universal Serial Bus (USB) port and/or a Serial port, etc., for connecting to one or more connectable media devices, such as a storage medium (e.g., a flash drive and memory stick, etc.) and/or a mobile device (e.g., a laptop, a smartphone, a tablet, etc.), etc.
The video summarization system 200 may receive one or more full-length movies 250, e.g., fiction movies, documentary movies, educational movies, and/or series comprising multiple seasons, etc., via the I/O interface 210.
The video summarization system 200 may also receive the KG annotation model 255 created for each full-length movie 250 through the I/O interface 210. The creation of the KG annotation model 255 is outside the scope of the present invention; it may be created for a full-length movie 250 by boosting and annotating features extracted from the video content of the full-length movie 250, the audio content of the full-length movie 250, the voice content of the full-length movie 250, at least one subtitle record associated with the full-length movie 250, and/or the metadata records associated with the full-length movie 250, among others. Thus, the KG annotation model 255, which may be created automatically, manually, and/or by a combination of both, may provide a very rich, broad, and accurate source of information describing the full-length movie 250, the scenes of the full-length movie 250, and/or the features extracted from the full-length movie 250.
The video summarization system 200 may output a video summary 260 via the I/O interface 210. The video summary 260 summarizes the full-length movie 250 in a much shorter duration than the full-length movie 250, e.g., 15%, 20%, or 25% of the length of the full-length movie 250.
Processor 212 may include one or more homogeneous or heterogeneous processors, each including one or more processing nodes for parallel processing, as a cluster and/or one or more multi-core processors. Processor 212 may execute one or more software (code) modules, e.g., processes, applications, agents, utilities, tools, scripts, etc., each comprising a plurality of program instructions stored in a non-transitory medium, such as memory 214, and executed by one or more processors, such as processor 212. For example, processor 212 may execute a video summarizer 220 that implements process 100.
Video summarization system 200 may further comprise one or more hardware components to support the execution of video summarizer 220, such as, for example, circuitry, Integrated Circuits (ICs), Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), and/or Graphics Processing Units (GPUs), among others.
Thus, video summarizer 220 may be executed, used and/or implemented by one or more software modules, one or more hardware components and/or combinations thereof.
The memory 214 for storing data and/or code (program storage) may include one or more non-transitory storage devices, such as persistent non-volatile devices, e.g., ROM, flash memory arrays, hard disk drives, Solid State Drives (SSDs), and/or magnetic disks, etc. Memory 214 may also typically include one or more volatile devices such as Random Access Memory (RAM) devices, cache memory, and/or the like. Optionally, the storage 214 further includes one or more network storage resources accessible to the video summarizer 220 via the I/O interface 210, such as a storage server, a Network Attached Storage (NAS), and/or a network drive, etc.
Optionally, the video summarization system 200 and/or the video summarizer 220 are provided, executed, and/or used, at least in part, by one or more cloud computing services, for example, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), and/or the like, provided by one or more cloud infrastructures such as Amazon Web Services, Google Cloud, and/or Microsoft Azure, etc.
Additionally, video summarization system 200, and in particular processor 212, may also execute one or more applications, services, and/or hosts to enable one or more users, e.g., content experts and/or knowledge engineers, to interact with video summarizer 220 to access, evaluate, define, adjust, and/or control process 100 and/or portions thereof. The user may access, for example, video summarizer 220, full-length movie 250, KG annotation model 255, summary video 260, and/or one or more intermediate products of process 100.
Access to the video summarization system 200 may be implemented in one or more architectures, deployments, and/or methods. For example, video summarization system 200 may execute one or more host applications and/or web applications, etc., providing one or more remote users access to video summarizer 220. The remote user may access the video summarization system 200 through one or more networks using a client device, e.g., a computer, server, and/or mobile device, etc., executing a browser, and/or local proxy, etc. access application, where the one or more networks are connected to the video summarization system 200 through the I/O interface 210. In another example, one or more users may be able to access the video summarization system 200 through one or more Human Machine Interfaces (HMI) provided by the I/O interface 210, such as a keyboard, a pointer (e.g., a mouse, a trackball, etc.), a display, and/or a touch screen, among others.
Optionally, one or more of the video summarizer 220, the full-length movie 250, the KG annotation model 255, the summary video 260, and/or the intermediate products of the process 100 may be accessed through one or more databases, applications, and/or interfaces. For example, one or more databases, such as a SPARQL database, may be deployed in association with the video summarization system 200 (and in particular the video summarizer 220). Thus, a user may issue one or more database queries and/or one or more update instructions to interact with video summarizer 220 and evaluate, define, adjust, and/or control process 100 and/or portions thereof.
As shown at 102, process 100 begins with video summarizer 220 receiving full-length movie 250. Full-length movies 250, e.g., fiction movies, documentary movies, and/or educational movies, etc., may typically comprise very long (e.g., 90 minutes, 120 minutes, and/or 180 minutes, etc.) video streams. The full-length movie 250 may also include a series with multiple seasons and/or a mini-series, etc.
As shown at 104, the video summarizer 220 receives the KG annotation model 255 created for the full-length movie 250. The creation of the KG annotation model 255 is beyond the scope of the present invention; it can be created for the full-length movie 250 to provide an enhanced feature set for the full-length movie 250, offering rich, broad, and accurate information describing the full-length movie 250, one or more scenes of the full-length movie 250, the narrative of the full-length movie 250, and/or the ontology of the full-length movie 250, etc.
The KG annotation model 255 may be created by annotating and boosting features extracted from the full-length movie 250 and/or from one or more data sources and/or data records associated with and/or corresponding to the full-length movie 250. For example, KG annotation model 255 may include enhanced features created by boosting features extracted from the video (visual) content of full-length movie 250. In another example, the KG annotation model 255 may include enhanced features created by boosting features extracted from the audio content of the full-length movie 250. In another example, the KG annotation model 255 may include enhanced features created by boosting features extracted from the speech content of the full-length movie 250. In another example, KG annotation model 255 may include enhanced features created by boosting features extracted from one or more subtitle records associated with full-length movie 250. In another example, the KG annotation model 255 may include enhanced features created by boosting features extracted from one or more metadata records associated with the full-length movie 250.
The KG annotation model 255 may be created manually, automatically, and/or by a combination of both. For example, one or more Natural Language Processing (NLP) methods, algorithms, and/or tools may be applied to the full-length movie 250, and in particular to the features extracted from the full-length movie 250, to annotate, enhance, and/or enrich the extracted features. In addition, one or more trained Machine Learning (ML) models, e.g., a neural network, a deep neural network, and/or a support vector machine, etc., may be applied to the features extracted from the full-length movie 250 to annotate and boost the extracted features. The ML model, in particular the NLP ML model, may be trained using one or more training datasets comprising sample features extracted, simulated, and/or manipulated for a plurality of full-length movies, optionally of the same genre as the received full-length movie 250.
Each annotation feature described by the KG annotation model 255 can be associated with (assigned with) a timestamp that temporally maps the respective annotation feature to a temporal location in the timeline of the full-length movie 250. With a timestamp associated with each annotated feature, the video summarizer 220 can thus identify the temporal location of the respective annotated feature in the timeline of the full-length movie 250.
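A possible representation of such a timestamped annotated feature is sketched below; the field names, the feature kinds, and the example values are assumptions made for illustration, not the actual KG schema.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedFeature:
    kind: str         # e.g. "background_music", "facial_expression", "character", "action"
    value: str        # the annotation payload, e.g. an emotion, character name or action label
    timestamp: float  # seconds from the start of the full-length movie's timeline

# Example: a fear-indicating facial expression detected 3605 seconds into the movie
# (purely illustrative values).
feature = AnnotatedFeature(kind="facial_expression", value="fear", timestamp=3605.0)
```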
As shown at 106, video summarizer 220 analyzes KG annotation model 255 and, based on the analysis, segments full-length movie 250 into time intervals, particularly emotion-based time intervals S_i (i = 0, …, N). Each emotion-based time interval represents a certain dominant emotion, e.g., happy, sad, depressed, anxious, excited, angry, and/or impatient, etc.
While one or more emotion-based time intervals may represent multiple emotions, video summarizer 220 may identify a single dominant emotion. Within each mood-based time interval, the single dominant mood is expressed with a higher estimated intensity, strength, and/or magnitude, etc. than the other moods represented within the respective mood-based time interval.
Video summarizer 220 may identify a plurality of emotion-based time intervals by analyzing one or more emotion-indicative annotation features described in KG annotation model 255. Since each annotated feature is associated with a respective timestamp, video summarizer 220 may accurately map the annotated features to a temporal position in the timeline of full-length movie 250 to set a start and end time for each of a plurality of emotion-based time intervals.
Emotion-indicative annotation features can include, for example, features that reflect background music (acoustic music) played over one or more emotion-based time intervals. For example, dramatic music may be highly indicative of a dramatic scene that may express emotions such as depression and/or sadness. In another example, romantic music may be highly indicative of romantic scenes that may express emotions of happiness, joy, relaxation, and/or the like. In another example, rhythmic music may be highly indicative of action scenes that may express emotions such as anxiety, excitement, and/or fear.
In another example, emotion-indicative annotation features may also include features representing the semantic content of speech detected within one or more emotion-based time intervals, e.g., keywords and/or context words, etc. For example, words expressing love and/or affection may be highly indicative of romantic scenes, which may express emotions such as happiness and/or relaxation. In another example, words expressing weapons, cars, and/or violence may be highly indicative of action and/or battle scenes, which may express emotions such as anxiety, excitement, and/or fear.
Emotion-indicative annotation features may also include features representing pitch, intonation, and/or volume, which may be combined with the corresponding semantic content features such that the semantic content can be associated with a pitch, intonation, and/or volume. For example, keywords spoken at low volume and/or in a low-pitched whisper may be highly indicative of romantic and/or dramatic scenes, and may express emotions such as sadness, depression, hurt, happiness, joy, and/or relaxation. In another example, keywords spoken with high intonation and at high volume may be highly indicative of action and/or battle scenes, and may express emotions such as anxiety, excitement, and/or fear.
In another example, emotion-indicative annotation features may include features representing one or more emotion-indicative (expressive) facial expressions of one or more characters identified within one or more emotion-based time intervals. Emotion-indicative facial expressions may reflect, for example, anger, happiness, hurt, anxiety, excitement, fear, relaxation, and/or love, among others.
In another example, emotion-indicative annotation features may include features representing one or more emotion-indicative (expressive) gestures of one or more characters recognized within one or more emotion-based time intervals. Emotion-indicative gestures, for example, hugging, kissing, fighting, running, and/or driving, etc., may be highly indicative of a character's emotion, for example, love, anxiety, excitement, and/or fear, etc.
Video summarizer 220 may also aggregate a plurality of emotion-indicative annotation features in order to evaluate the dominant emotion represented in each of the one or more emotion-based time intervals and segment full-length movie 250 accordingly.
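As a minimal sketch of this aggregation step, the dominant emotion of a candidate interval could be taken as the most frequent emotion label among the emotion-indicative features whose timestamps fall inside the interval; using plain frequency is an assumption, since the description also allows intensity or magnitude estimates.

```python
from collections import Counter

def dominant_emotion(emotion_labels: list):
    """Return the most common emotion label among the emotion-indicative
    annotated features of one candidate interval (background music mood,
    speech semantics, facial expressions, gestures), or None if there are none."""
    counts = Counter(emotion_labels)
    return counts.most_common(1)[0][0] if counts else None

# e.g. dominant_emotion(["fear", "excited", "fear"]) returns "fear"
```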
As shown at 108, video summarizer 220 may calculate a score, and in particular, a relevance score for each of a plurality of emotion-based time intervals, that represents the relevance of the respective emotion-based time interval to the narrative and/or ontology of full-length movie 250.
The video summarizer 220 may calculate a relevance score based on one or more relevance metrics specifically defined and applied to evaluate the relevance of the time intervals extracted from the full-length movie 250 to the narrative and/or ontology of the full-length movie 250. In other words, the metric is defined and applied to reflect the level of relevance (degree), i.e., the degree of relevance, expressiveness, consistency, and/or alignment of the extracted time interval with the narrative of the full-length movie 250.
The relevance metrics can be for the characters in the full-length movie 250 and the actions associated with those characters. To facilitate the application of the relevance metric approach, video summarizer 220 may analyze KG annotation model 255 (and in particular the annotation features of KG annotation model 255) to identify the characters presented in full-length movie 250.
Further, based on the analysis of KG annotation model 255, video summarizer 220 may identify which of these characters are protagonists, i.e., lead characters (heroes) playing a dominant role in full-length movie 250, and which of these characters are supporting characters, minor characters, and/or extras.
Based on the analysis of KG annotation model 255, video summarizer 220 may also identify actions associated with each character in full-length movie 250, e.g., actions taken by the character, actions applied to the character and/or actions involved with the character, etc.
In particular, video summarizer 220 associates the characters, the heroes, and the actions with the plurality of emotion-based time intervals according to the timestamps of the features that represent the time of occurrence of each character and/or the time of each action on the time axis of full-length movie 250. As such, each of the emotion-based time intervals is associated with one or more characters seen during the emotion-based time interval and one or more actions associated with those characters during the emotion-based time interval.
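This association can be sketched as a simple lookup of each timestamped annotation against the interval boundaries; the dict-based representation of intervals and features below is hypothetical.

```python
def assign_to_intervals(intervals: list, annotated_features: list) -> list:
    """Attach each timestamped character or action annotation to the emotion-based
    time interval whose [start, end) range contains its timestamp. Intervals and
    features are dicts with the (hypothetical) fields used below."""
    for iv in intervals:
        iv["characters"], iv["actions"] = set(), []
    for feat in annotated_features:
        for iv in intervals:
            if iv["start"] <= feat["timestamp"] < iv["end"]:
                if feat["kind"] == "character":
                    iv["characters"].add(feat["value"])
                elif feat["kind"] == "action":
                    iv["actions"].append(feat["value"])
                break
    return intervals
```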
Reference is now made to fig. 3. Fig. 3 is a diagram of an exemplary emotion-based segmentation of a full-length video according to some embodiments of the present invention. A video summarizer, such as video summarizer 220, may analyze a KG annotation model (e.g., KG annotation model 255) created for a full-length movie comprising a plurality of scenes (e.g., full-length movie 250) to segment full-length movie 250 into a plurality of emotion-based time intervals.
For example, assume that the total (overall) duration of the full-length movie 250 is T_n. Based on the analysis of KG annotation model 255, video summarizer 220 may identify a certain scene that starts at time T_{n-50} and ends at time T_{n-10}. Based on the analysis of KG annotation model 255, video summarizer 220 may also identify an emotion-based time interval that starts at time T_{n-47}, ends at T_{n-32}, and represents a dominant emotion MOOD_q. Based on the analysis of KG annotation model 255, video summarizer 220 may also identify one or more actions seen in the scene, e.g., an action A_w completed from T_{n-38} to T_{n-33}, an action A_x completed from T_{n-33} to T_{n-28}, an action A_y completed from T_{n-28} to T_{n-25}, and an action A_z completed from T_{n-25} to T_{n-19}.
Video summarizer 220 may associate each of the actions with one or more characters identified within the time of the action, and may further associate each of the actions with a respective mood-based time interval based on the detection time of the respective action on the time axis of full-length movie 250 as identified by the timestamp assigned to the feature of the respective action.
Based on the analysis of KG annotation model 255, the video summarizer associates each of the emotion-based time intervals with one or more characters, heroes or other characters, and the actions related to the identified characters. Once the association is complete, the relevance metrics may be applied.
The relevance metric may include, for example, the number of heroes that occur within a certain mood-based time interval.
The heroes may have a primary relevance, contribution, and/or impact on the narrative, plot, and/or progress of the full-length movie 250. In particular, the relevance, e.g., relevance, contribution, and/or impact, of the heroes to the full-length movie 250 may be much higher than that of the other characters described in the full-length movie 250, e.g., supporting characters, minor characters, and/or extras, etc. Thus, the number of heroes described within a time interval of the full-length movie 250 may highly indicate and reflect the level of relevance (relevance, expressiveness, consistency, and/or degree of alignment) of the time interval to the narrative of the full-length movie 250.
For example, a time interval in which a large number of protagonists of full-length movie 250 appear may be highly relevant to the narrative of full-length movie 250, while a time interval in which only a small number of protagonists appear, e.g., a single protagonist, may be less relevant to the narrative of full-length movie 250. In another example, a time interval of the full-length movie in which no protagonist appears may have a low and potentially negligible relevance, contribution and/or impact on the narrative, story, plot and/or progress of full-length movie 250.
Video summarizer 220 may calculate a relevance score for one or more emotion-based time intervals from the protagonist metric, based on the number of protagonists identified in each emotion-based time interval. Video summarizer 220 may calculate the protagonist score ImpChar(S_i) of each emotion-based time interval S_i according to the exemplary formula shown in equation 1 below, which indicates how many of all the characters depicted in each emotion-based time interval are protagonists.
Equation 1:
$$\mathrm{ImpChar}(S_i) = \frac{|\mathrm{MainChar} \in S_i|}{|\mathrm{Char} \in S_i|}$$

where |MainChar ∈ S_i| reflects the number of protagonists identified within the respective emotion-based time interval S_i, and |Char ∈ S_i| reflects the number of all characters identified within the respective emotion-based time interval S_i.
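For illustration, a minimal Python sketch of equation 1, continuing the interval representation assumed above, may look as follows (all names are illustrative assumptions):

def imp_char(interval):
    # Fraction of the characters identified in S_i that are protagonists:
    # |MainChar in S_i| / |Char in S_i|
    characters = interval["characters"]
    if not characters:
        return 0.0
    protagonists = [c for c in characters if c["is_protagonist"]]
    return len(protagonists) / len(characters)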
The relevance metrics may also include the time of appearance (duration) of each protagonist depicted within the time interval. Since the protagonists may have a primary relevance to the narrative, plot and/or progress of full-length movie 250, the time of appearance of these protagonists within a time interval may also be highly indicative of and reflect the level of relevance (relevance, expressiveness, consistency and/or degree of alignment) of the time interval to the narrative of full-length movie 250. The metric may be calculated to represent, for example, a relation (e.g., a fraction) of the length of appearance of each protagonist seen within the respective emotion-based time interval to the total (overall) length of the respective emotion-based time interval.
Video summarizer 220 may calculate a relevance score for one or more emotion-based time intervals based on the protagonist duration metric computed for the one or more protagonists appearing within each emotion-based time interval. Additionally, in case multiple protagonists appear within a certain emotion-based time interval, video summarizer 220 may also calculate and/or adjust the relevance score by aggregating the protagonist duration metrics of the multiple protagonists shown within that emotion-based time interval.
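A minimal Python sketch of one possible reading of the protagonist duration metric is given below; the exact formula is not spelled out above, so the clipping of each protagonist's appearance to the interval boundaries and the summation over multiple protagonists are assumptions.

def protagonist_duration(interval):
    # Aggregated on-screen time of all protagonists in S_i relative to the
    # total length of S_i (may exceed 1.0 when several protagonists overlap).
    length = interval["end"] - interval["start"]
    if length <= 0:
        return 0.0
    total = 0.0
    for c in interval["characters"]:
        if c["is_protagonist"]:
            overlap = min(c["end"], interval["end"]) - max(c["start"], interval["start"])
            total += max(overlap, 0.0)
    return total / length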
Another relevance metric applicable for calculating the relevance score of one or more emotion-based time intervals is the number of actions detected within the respective emotion-based time interval that relate to each protagonist (hereinafter referred to as important actions), e.g., actions performed by, applied to, and/or involving the protagonist.
Since the protagonists have a primary relevance to the narrative of full-length movie 250, the important actions associated with these protagonists may also have a primary relevance to the narrative of full-length movie 250. In particular, the relevance, e.g., the contribution and/or impact, of important actions related to the protagonists may be much higher than that of actions related to other characters (i.e., actions performed by, applied to, and/or involving those characters).
Thus, the number of important actions identified within a time interval of full-length movie 250 that are associated with one or more protagonists seen within the time interval may be highly indicative of and reflect the level of relevance (relevance, expressiveness, consistency and/or degree of alignment) of the time interval to the narrative of full-length movie 250.
Video summarizer 220 may calculate a relevance score for one or more emotion-based time intervals from the important action metric, based on the number of actions identified for each protagonist identified within the respective emotion-based time interval. Video summarizer 220 may calculate the important action score ImpAct(S_i) of each emotion-based time interval S_i according to the exemplary formula shown in equation 2 below, which indicates how many of all the actions detected within the respective emotion-based time interval are related to the protagonists.
Equation 2:
$$\mathrm{ImpAct}(S_i) = \frac{|\mathrm{Actions}(\mathrm{Main}(C_n)) \in S_i|}{|\mathrm{Actions} \in S_i|}$$

where C_n denotes a character identified within the respective emotion-based time interval S_i, Main(C_n) denotes a protagonist identified within the respective emotion-based time interval S_i, |Actions ∈ S_i| denotes all the actions associated with the characters C_n identified within the respective emotion-based time interval S_i, and |Actions(Main(C_n)) ∈ S_i| denotes the important actions associated with the protagonists identified within the respective emotion-based time interval S_i.
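For illustration, a minimal Python sketch of equation 2, continuing the interval representation assumed above, may look as follows (all names are illustrative assumptions):

def imp_act(interval):
    # Fraction of all actions detected in S_i that relate to a protagonist:
    # |Actions(Main(C_n)) in S_i| / |Actions in S_i|
    actions = interval["actions"]
    if not actions:
        return 0.0
    protagonist_names = {c["name"] for c in interval["characters"] if c["is_protagonist"]}
    important = [a for a in actions if a["character"] in protagonist_names]
    return len(important) / len(actions)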
Video summarizer 220 may calculate and/or adjust the relevance score of one or more emotion-based time intervals of full-length movie 250 by aggregating the relevance scores calculated for each emotion-based time interval according to a plurality of metrics selected from the protagonist metric, the protagonist duration metric, and the important action metric.
As shown at 110, video summarizer 220 may calculate a diversity score for one or more emotion-based time intervals, the diversity score representing a difference between each emotion-based time interval and its neighboring one or more emotion-based time intervals, i.e., a previous emotion-based time interval and a subsequent emotion-based time interval.
Thus, the diversity score indicates how different each emotion-based time interval is from its neighboring emotion-based time intervals. Identifying the diversity and differences between each emotion-based time interval and its neighbors enables selecting a diverse set of emotion-based time intervals covering a broad range of the narrative of full-length movie 250, while avoiding selection of similar, possibly redundant, emotion-based time intervals.
In particular, video summarizer 220 computes diversity scores based on a diversity metric defined by one or more interval attributes of respective emotion-based time intervals and their neighboring emotion-based time intervals. These interval attributes may include, for example, character attributes reflecting a character (identity) appearing within the emotion-based time interval, emotion attributes reflecting an emotion represented within the emotion-based time interval, and/or action attributes reflecting an action described within the emotion-based time interval.
Thus, the diversity metrics may represent the difference between each emotion-based time interval and its neighboring emotion-based time intervals with respect to each interval attribute. For example, a first diversity metric may represent the difference between the characters appearing in each emotion-based time interval and the characters appearing in the adjacent emotion-based time intervals. In another example, a second diversity metric may represent the difference between the emotions, specifically the dominant emotion represented in each emotion-based time interval and the emotion represented in the adjacent emotion-based time intervals. In another example, a third diversity metric may represent the difference between the actions depicted within each emotion-based time interval and the actions depicted within the adjacent emotion-based time intervals.
Video summarizer 220 may calculate a partial diversity score for each interval attribute between every two consecutive emotion-based time intervals. For example, the partial diversity score may be calculated as the intersection of the respective interval attribute identified in two consecutive emotion-based time intervals S_i and S_{i+1} compared to the union of the respective interval attribute identified in these two emotion-based time intervals, as shown in equation 3 below.
Equation 3:
$$d_C(S_i, S_{i+1}) = \frac{|(C \in S_i) \cap (C \in S_{i+1})|}{|(C \in S_i) \cup (C \in S_{i+1})|}$$

$$d_M(S_i, S_{i+1}) = \frac{|(M \in S_i) \cap (M \in S_{i+1})|}{|(M \in S_i) \cup (M \in S_{i+1})|}$$

$$d_A(S_i, S_{i+1}) = \frac{|(A \in S_i) \cap (A \in S_{i+1})|}{|(A \in S_i) \cup (A \in S_{i+1})|}$$

where C ∈ S_i is the set of characters appearing within emotion-based time interval S_i and C ∈ S_{i+1} is the set of characters appearing within emotion-based time interval S_{i+1}. Similarly, M ∈ S_i is the set of emotions represented within emotion-based time interval S_i, M ∈ S_{i+1} is the set of emotions represented within emotion-based time interval S_{i+1}, A ∈ S_i is the set of actions seen within emotion-based time interval S_i, and A ∈ S_{i+1} is the set of actions seen within emotion-based time interval S_{i+1}.
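A minimal Python sketch of equation 3, continuing the interval representation assumed above, is given below; it follows the intersection-to-union reading of the description (depending on the intended sense of "difference", the complement of this ratio may be used instead), and the "emotions" field is an additional assumption.

def partial_diversity(attr_i, attr_next):
    # Intersection of an interval attribute in S_i and S_{i+1} compared to its union.
    union = attr_i | attr_next
    if not union:
        return 0.0
    return len(attr_i & attr_next) / len(union)

def partial_scores(interval_i, interval_next):
    d_c = partial_diversity({c["name"] for c in interval_i["characters"]},
                            {c["name"] for c in interval_next["characters"]})
    d_m = partial_diversity(set(interval_i["emotions"]),
                            set(interval_next["emotions"]))
    d_a = partial_diversity({a["label"] for a in interval_i["actions"]},
                            {a["label"] for a in interval_next["actions"]})
    return d_c, d_m, d_a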
Video summarizer 220 may calculate a relative diversity score for every two consecutive emotion-based time intervals by aggregating the two or more partial diversity scores calculated separately for each interval attribute. For example, video summarizer 220 may calculate the relative diversity score according to the exemplary formula shown in equation 4 below.
Equation 4:
$$d(S_{i,i+1}) = \frac{d_C(S_i, S_{i+1}) + d_M(S_i, S_{i+1}) + d_A(S_i, S_{i+1})}{3}$$
Video summarizer 220 may then calculate the diversity score Div(S_i) of each emotion-based time interval S_i based on the relative diversity scores calculated for every two consecutive emotion-based time intervals, for example, according to the exemplary formula shown in equation 5 below.
Equation 5:
$$\mathrm{Div}(S_i) = \frac{d(S_{i-1,i}) + d(S_{i,i+1})}{2}$$
Since the first emotion-based time interval and the last emotion-based time interval each have only one adjacent emotion-based time interval, video summarizer 220 may set the diversity scores of these two emotion-based time intervals equal to the single relative diversity score computed for them, i.e., assuming that full-length movie 250 is divided into n emotion-based time intervals, d(S_{0,1}) for the first emotion-based time interval and d(S_{n-1,n}) for the last emotion-based time interval.
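A minimal Python sketch of equations 4 and 5, continuing the sketch of equation 3 above, may look as follows; averaging the partial scores and the neighboring relative scores is an assumption consistent with the handling of the first and last intervals described above.

def relative_diversity(interval_i, interval_next):
    # Equation 4: aggregate the partial per-attribute scores into one relative score.
    d_c, d_m, d_a = partial_scores(interval_i, interval_next)
    return (d_c + d_m + d_a) / 3.0

def diversity_scores(intervals):
    # Equation 5: average the relative scores of the two neighboring pairs;
    # the first and last intervals reuse their single neighboring pair.
    if len(intervals) < 2:
        return [0.0] * len(intervals)
    pair = [relative_diversity(intervals[i], intervals[i + 1])
            for i in range(len(intervals) - 1)]
    div = []
    for i in range(len(intervals)):
        if i == 0:
            div.append(pair[0])
        elif i == len(intervals) - 1:
            div.append(pair[-1])
        else:
            div.append((pair[i - 1] + pair[i]) / 2.0)
    return div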
As shown at 112, video summarizer 220 may generate an emotion-based summary video for full-length movie 250 that is intended to summarize full-length movie 250 and present the narrative, story and/or progress of full-length movie 250 in a much shorter duration, e.g., 15%, 20% and/or 25% of the duration of full-length movie 250.
Video summarizer 220 may generate the emotion-based summary video by concatenating together a subset of emotion-based time intervals selected from the plurality of emotion-based time intervals based on the score calculated for each emotion-based time interval. The score calculated by video summarizer 220 for each emotion-based time interval includes the relevance score calculated for the respective emotion-based time interval and, optionally, the diversity score calculated for the respective emotion-based time interval.
Accordingly, the score calculated by video summarizer 220 for each emotion-based time interval may represent an aggregated score, e.g., a weighted average of the scores calculated according to the metrics described herein before. In particular, the weighted average score is calculated based on the relevance score calculated from the relevance metrics, optionally in combination with the diversity score calculated based on the diversity metrics derived from the interval attributes.
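For illustration, a minimal Python sketch of such a weighted aggregation, continuing the metric sketches above, is given below; the specific weight values and the normalization are assumptions.

def interval_score(interval, div_score, w_char=1.0, w_dur=1.0, w_act=1.0, w_div=1.0):
    # Weighted average of the relevance metrics and the (optional) diversity score.
    relevance = (w_char * imp_char(interval)
                 + w_dur * protagonist_duration(interval)
                 + w_act * imp_act(interval))
    return (relevance + w_div * div_score) / (w_char + w_dur + w_act + w_div)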
Video summarizer 220 may apply one or more methods, techniques, and/or implementations to select the subset of emotion-based time intervals used for generating the emotion-based summary video.
For example, video summarizer 220 may select all emotion-based time intervals that have scores that exceed some predefined threshold. In another example, video summarizer 220 may select a predefined number of emotion-based time intervals with the highest scores.
Alternatively, video summarizer 220 may select a subset of emotion-based time intervals based on a predefined time duration for the emotion-based summary video to be created for full-length movie 250. For example, assuming that a certain duration is predefined for the emotion-based summary video, video summarizer 220 may select a number of the highest-scoring emotion-based time intervals having a combined (time) duration less than or equal to the predefined duration.
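A minimal Python sketch of such a duration-bounded selection is given below; greedily picking the highest-scoring intervals and then restoring their timeline order for concatenation is an assumption, not the only possible strategy.

def select_intervals(intervals, scores, max_duration):
    # Pick the highest-scoring intervals whose combined length fits the budget,
    # then return them in their original timeline order for concatenation.
    ranked = sorted(range(len(intervals)), key=lambda i: scores[i], reverse=True)
    chosen, total = [], 0.0
    for i in ranked:
        length = intervals[i]["end"] - intervals[i]["start"]
        if total + length <= max_duration:
            chosen.append(i)
            total += length
    return [intervals[i] for i in sorted(chosen)]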
The resulting emotion-based summary video is thus a sequence of emotion-based time intervals selected according to the scores assigned by the metrics described herein before. In particular, the emotion-based time intervals are shorter than scenes and longer than actions, thereby providing a convenient and efficient means for creating an emotion-based summary video that accurately, concisely and consistently represents the narrative, plot and progress of full-length movie 250 in a much shorter period of time, e.g., 20 to 30 minutes.
As shown at 112, video summarizer 220 outputs the emotion-based summary video, which one or more users may use for one or more uses, purposes, goals, and/or applications. For example, one or more users may view the emotion-based summary video to determine whether they wish to view full-length movie 250. In another example, one or more users may view the emotion-based summary video in order to categorize full-length movie 250 into one or more categories and/or libraries, and/or the like.
Optionally, video summarizer 220 may adjust one or more weights assigned to the relevance score, the diversity score, and/or any component thereof, in order to adjust (decrease or increase) the contribution and/or relevance of one or more of the metrics and thus adjust the score computed from these metrics. Video summarizer 220 may also apply one or more ML models trained on a large dataset comprising multiple full-length movies such as full-length movie 250, where each full-length movie is associated with a respective KG annotation model (e.g., KG annotation model 255), to adjust the weights assigned to each metric. In particular, the training dataset may be labeled with feedback scores assigned by users who viewed the emotion-based summary videos created for the multiple full-length movies. The feedback scores may reflect the users' understanding and/or conception of the narrative, story, and/or plot of a full-length movie based on viewing the respective emotion-based summary video.
It has been experimentally demonstrated that users gain a high level of understanding and/or conception of the narrative of a full-length movie from the respective emotion-based summary video presented to them.
The computing hardware used for the experiments was selected to support the intensive computer vision processing required to execute the video summarizer 220 of process 100. In particular, the computing hardware is based on a workstation comprising two Intel Xeon E5-2600 CPUs with 6 cores per CPU, running at 2.00 GHz with 15 MB SmartCache, supported by 24 GB of DRAM, and an NVIDIA GeForce GTX 1080 Xtreme D5X GPU with 8 GB of DRAM. Video summarizer 220 is implemented in the Python and JavaScript programming languages and uses source code provided on GitHub.
However, the computing hardware and programming languages used for the experiments should not be construed as limiting, as numerous other implementations may be employed to implement the video summarization system 200 and the video summarizer 220 executing the process 100.
The focus of experimental evaluation is to gauge two main aspects:
1) User understanding, which refers to how much the users presented with the emotion-based summarized video understand (appreciate) the narrative (i.e., plot, story, etc.) of the full-length movie for which the emotion-based summarized video was created, and whether these users are able to understand the main aspects, moments, and/or concepts of the emotion-based summarized video description.
The purpose of this evaluation aspect is to evaluate the quality of the emotion-based summary video, in particular to gauge the user's understanding after viewing the emotion-based summary video. This aspect should refer to the amount of information the user obtains from viewing the emotion-based summary video, and should not reflect the user's subjective view of the quality of the emotion-based summary video (if possible).
2) Alignment with existing summaries, which refers to how similar the emotion-based summary video is to existing online movie summaries generated manually by humans (e.g., movie plot summaries provided by Wikipedia and/or IMDb, etc.), and whether the emotion-based summary video shares the common moments and important parts of the full-length movie's story with the manually generated movie summaries.
The purpose of this evaluation aspect is to determine the extent to which the automatically generated emotion-based summary videos include the key events of their respective full-length movies, by comparison with text summaries authored by humans. These text summaries may be obtained, for example, from movie plot descriptions provided by Wikipedia, IMDb, and/or TV review websites, among others.
The evaluation was performed for three full-length movies: "Mission: Impossible - Rogue Nation", "Casino Royale" and "The Girl with the Dragon Tattoo". These movies were chosen to represent a range of complexity (plot, characters, structure, turning points, etc.) of the movie narrative, where "Mission: Impossible - Rogue Nation" may be the simplest, "The Girl with the Dragon Tattoo" may be more complicated, and "Casino Royale" may be the most complex.
To evaluate the user understanding of the first aspect, a plurality of users who were unfamiliar with the three evaluated full-length movies and had not seen these movies before were presented with the three emotion-based summary videos created for the evaluated full-length movies. The users then filled out a questionnaire comprising questions intended to determine how well they understood the narrative of each full-length movie. The questionnaire was prepared according to the Likert scale known in the art, with a scale of 1 to 7, where 1 indicates a very low level of understanding and 7 indicates a very high level of understanding.
The evaluation results are shown in fig. 4. Fig. 4 is a chart of the distribution of experimental scores given by users presented with the emotion-based summary videos to rate their understanding of the narrative of the full-length movies, according to some embodiments of the present invention.
Chart 400 presents the average cumulative scores rating the understanding of the three full-length movies by the multiple users based on viewing the emotion-based summary videos created for the three full-length movies. As expected, the level of understanding expressed by the users for the three full-length movies matches the complexity levels of the full-length movies stated above.
To assess the alignment with existing summaries of the second aspect, a plurality of persons (evaluators) familiar with the three evaluated full-length movies identified and marked important events, key moments, and the like in one or more manually generated movie summaries of the three evaluated full-length movies, and negotiated agreement on these important events, key moments, and the like.
For example, from the Wikipedia page of the full-length movie "Casino Royale", the following text snippet was analyzed and processed to extract key movie facts:
"the six military specialties James Bond surreptitiously kills the traitor Dryden and his terrorist contact Fisher, responsible for the six military departments, by parking the bragg embassy in the uk, thereby gaining a killer status and obtaining the specialties with code 00. The military valve Steven Obanno \8230; "the" Steven Obanno of the Saint-Jorda, mystery, mister's contact, mister, to terrorist finance family Le Chiffre introduced "
The following facts were extracted from the snippet:
- James Bond becomes a 00 agent
- James Bond kills the traitor Dryden, an MI6 section chief
- James Bond kills Dryden at the British Embassy in Prague
- James Bond kills the terrorist Fisher
- Mr. White introduces Steven Obanno, a warlord of the Lord's Resistance Army, to Le Chiffre
……
The task of the evaluators is to check how many of the facts extracted from the text summary are presented in the corresponding emotion-based summary video. The evaluators are presented with a list of facts extracted from the text summaries, and after viewing the emotion-based summary videos they must mark the facts that are fully or partially contained in each emotion-based summary video. Although text summaries generated manually by humans are available on well-known websites such as Wikipedia or IMDb and/or on specific community-based online wikis, the text summaries obtained from the Wikipedia plot sections and from IMDb were used because both enforce strict guidelines for authors and provide summaries that are very similar in structure and granularity.
The advantage of this assessment strategy is that it does not require questionnaires or user-based assessments, which may be subjective, and it provides an accurate and specific assessment of the effectiveness of the generated emotion-based summary videos. Simply counting the key facts contained in the emotion-based summary video provides a reliable assessment of the selection of emotion-based time intervals and of the summarization method, as described in process 100.
The evaluation was performed according to the following guidelines:
of the total 7 evaluators, 6 evaluators evaluated each movie summary to agree on facts and/or moments of criticality, etc.
The selected movies and genres comprise the three evaluated full-length movies: "Mission: Impossible - Rogue Nation", "Casino Royale" and "The Girl with the Dragon Tattoo".
Three video summaries have been evaluated, one for each movie according to a specific summarization strategy selected by the evaluation team.
The summarization strategies used for generating the emotion-based summary videos comprise three different summarization strategies, each using a different metric for calculating the scores of the emotion-based time intervals, in particular the presence of protagonists, the emotions, the activity types, etc.
A similar total duration (length) of about 30 minutes was set for all the emotion-based summary videos generated according to the different summarization strategies, so that a similar compression rate is applied to the emotion-based summary videos of the different strategies. The length of each evaluated full-length movie exceeds 2 hours, so that the length of each emotion-based summary video is about 25% of the length of the respective full-length movie.
Two text summaries (one from the Wikipedia plot section and one from the IMDb summary) were used for each of the three evaluated full-length movies.
The lists of extracted facts created by the evaluators from the text summaries include the following numbers of facts:
"Mission: Impossible - Rogue Nation": 51 facts
"Casino Royale": 65 facts
"The Girl with the Dragon Tattoo": 47 facts
Table 1 below gives the results of the experiments performed by the 7 evaluators for the three evaluated full-length movies. It should be noted that the experiments were performed on a small scale, and thus may lack statistical significance; however, these experiments may provide insight into the summarization algorithms developed.
Table 1:
[Table 1, provided as an image in the original publication, lists for each evaluated full-length movie the understanding score (1 to 7 scale) and the percentages of present, partial and missing key facts reported by the evaluators.]
The user understanding of the first assessment aspect is presented in the "degree of understanding" row for each full-length movie, representing the respective evaluators' understanding of the respective full-length movie after viewing the respective summary video on a 1 to 7 scale. As shown in Table 1, the evaluators produced consistent results with similar values. The standard deviation of the percentages reported in Table 1 ranges from ±4.1% to ±8.9%. However, as mentioned above, given the limited number of evaluators, the ability to derive meaningful statistics may be quite limited.
The alignment with the text summaries of the second evaluation aspect is reflected in the "present", "partial" and "missing" rows in Table 1. "Present" represents the percentage of facts found by each evaluator to be aligned (matching) between each summary video and its corresponding text summary. "Partial" represents the percentage of facts found by each evaluator to be partially aligned (matching) between each summary video and its corresponding text summary. "Missing" represents the percentage of facts found by each evaluator to be missing from each summary video compared to its corresponding text summary.
As shown in Table 1, the average percentage of missing key facts (right-most column) for the emotion-based summary videos created for the three evaluated full-length movies ranges between 53% and 59%. Thus, 41% to 47% of the key facts are (partially or fully) contained in the emotion-based summary videos created for the three evaluated full-length movies. This may be considered a good result, especially considering that, based on a rough estimate, an emotion-based summary video that includes all the facts of a respective full-length movie (e.g., "Mission: Impossible - Rogue Nation") may have a duration of about one hour. Achieving this level of consistency between the emotion-based summary video and the respective text summary in half that time (about 30 minutes) may therefore be a significant improvement.
The results of the alignment of the emotion-based summary videos with the text summaries are shown in fig. 5. Fig. 5 is a chart of experimental results for evaluating the emotion-based summary videos created for the three full-length movies, according to some embodiments of the present invention. Pie chart 502 shows the distribution of present, partial and missing facts assessed by the evaluators for the emotion-based summary video created for the full-length movie "Mission: Impossible - Rogue Nation". Pie chart 504 shows the distribution of present, partial and missing facts assessed by the evaluators for the emotion-based summary video created for the full-length movie "Casino Royale". Pie chart 506 shows the distribution of present, partial and missing facts assessed by the evaluators for the emotion-based summary video created for the full-length movie "The Girl with the Dragon Tattoo".
It is expected that during the life of a patent growing from this application, many relevant multi-core processors and shared memories will be developed and the scope of the term "processor and memory" is intended to include all such new technologies a priori.
The term "about" as used herein means ± 10%.
The terms "including," comprising, "" having, "and variations thereof mean" including, but not limited to.
The term "consisting of (8230) \ 8230; composition" means "including and limited to".
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. For example, the term "compound" or "at least one compound" may comprise a plurality of compounds, including mixtures thereof.
Throughout this application, various embodiments of the present invention may be presented in a range format. It is to be understood that the description of the range format is merely for convenience and brevity and should not be construed as a fixed limitation on the scope of the present invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible sub-ranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed sub-ranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6, etc., as well as individual numbers within the range such as 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
When a range of numbers is indicated herein, it is meant to include any recited number (fractional or integer) within the indicated range. The phrases "ranging between a first indicated number and a second indicated number" and "ranging from a first indicated number to a second indicated number" are used interchangeably herein and are meant to include the first and second indicated numbers and all fractions and integers therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination, or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments unless the embodiment is inoperative without those features.
While the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, all such alternatives, modifications, and variations that fall within the spirit and scope of the appended claims are intended to be included within this invention.
All publications, patents and patent specifications mentioned in this specification are herein incorporated in the specification by reference, and likewise, each individual publication, patent or patent specification is specifically and individually incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting. In addition, any priority documents of the present application are incorporated herein by reference.

Claims (9)

1. A method for generating an emotion-based summarized video for a full-length movie, comprising:
receiving a full-length movie;
receiving a Knowledge Graph (KG) created for the full-length movie, wherein the KG includes an annotation model of the full-length movie generated by annotating features extracted for the full-length movie;
segmenting the full-length movie into a plurality of emotion-based time intervals based on the analysis of the KGs, wherein each time interval represents a certain dominant emotion;
calculating a score for each of the plurality of emotion-based time intervals based on at least one of a plurality of metrics, wherein the metrics represent a level of relevance of the respective emotion-based time interval to the narrative of the full-length movie;
generating a sentiment-based summary video by concatenating a subset of the plurality of sentiment-based time intervals having scores exceeding a predefined threshold; and
outputting the emotion-based summarized video for presentation to at least one user;
the method further comprises the following steps:
selecting at least some of the subset of emotion-based time intervals according to a score of a diversity metric calculated for each of the plurality of emotion-based time intervals, the diversity metric representing a difference in each emotion-based time interval relative to its neighboring emotion-based time intervals with respect to at least one interval attribute, the at least one interval attribute being one of a group consisting of a person appearing within an emotion-based time interval, a dominant emotion of an emotion-based time interval, and an action seen within an emotion-based time interval.
2. The method of claim 1, wherein each annotated feature in the KG annotation model is associated with a timestamp indicating a temporal position of the respective feature in a timeline of the full-length movie.
3. The method of claim 1, wherein the full-length movie is segmented into a plurality of emotion-based time intervals based on an analysis of the KGs according to at least one feature representing at least one of background music, semantic content of speech, emotion-indicative facial expressions of characters, and emotion-indicative gestures of characters.
4. The method of claim 1, wherein the plurality of metrics comprises: the number of protagonists appearing within an emotion-based time interval, the length of appearance of each protagonist within the emotion-based time interval, and the number of actions associated with each protagonist within the emotion-based time interval.
5. The method according to any one of claims 1-4, further comprising: selecting a subset of the emotion-based time intervals according to a length of time defined for the emotion-based summary video.
6. The method of claim 1, wherein the KG annotation model is created by automatically annotating features extracted from at least one of: video content of the full-length movie, audio content of the full-length movie, speech content of the full-length movie, at least one subtitle record associated with the full-length movie, and a metadata record associated with the full-length movie.
7. The method of claim 6, wherein the KG annotation model uses at least one manually annotated feature extracted for the full-length movie.
8. A system for generating an emotion-based summary video for a full-length movie, comprising:
at least one processor configured to execute code, wherein the code is executed by the at least one processor to implement the method of any one of claims 1 to 7.
9. A computer readable storage medium comprising computer executable computer program code instructions which, when run on a computer, perform the method of any one of claims 1 to 7.
CN201980092247.1A 2019-09-27 2019-09-27 Emotion-based multimedia content summarization Active CN113795882B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/076266 WO2021058116A1 (en) 2019-09-27 2019-09-27 Mood based multimedia content summarization

Publications (2)

Publication Number Publication Date
CN113795882A CN113795882A (en) 2021-12-14
CN113795882B true CN113795882B (en) 2022-11-25

Family

ID=68084846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980092247.1A Active CN113795882B (en) 2019-09-27 2019-09-27 Emotion-based multimedia content summarization

Country Status (2)

Country Link
CN (1) CN113795882B (en)
WO (1) WO2021058116A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220006926A (en) * 2020-07-09 2022-01-18 삼성전자주식회사 Device and method for generating summary video
CN117807995B (en) * 2024-02-29 2024-06-04 浪潮电子信息产业股份有限公司 Emotion-guided abstract generation method, system, device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1267056A (en) * 1992-12-17 2000-09-20 三星电子株式会社 Disk recording medium, and replaying method and device
CN1333628A (en) * 2000-03-30 2002-01-30 索尼公司 Magnetic tape record equipment and method, magnetic tape reproducing equipment and method and record medium
WO2005101413A1 (en) * 2004-04-15 2005-10-27 Koninklijke Philips Electronics N.V. Method of generating a content item having a specific emotional influence on a user
CN104065977A (en) * 2014-06-06 2014-09-24 百度在线网络技术(北京)有限公司 Audio/video file processing method and device
CN107948732A (en) * 2017-12-04 2018-04-20 京东方科技集团股份有限公司 Playback method, video play device and the system of video

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080187231A1 (en) * 2005-03-10 2008-08-07 Koninklijke Philips Electronics, N.V. Summarization of Audio and/or Visual Data
US20140298364A1 (en) * 2013-03-26 2014-10-02 Rawllin International Inc. Recommendations for media content based on emotion
KR102184987B1 (en) * 2013-11-15 2020-12-01 엘지전자 주식회사 Picture display device and operating method thereof
US20150243325A1 (en) * 2014-02-24 2015-08-27 Lyve Minds, Inc. Automatic generation of compilation videos
US20160014482A1 (en) * 2014-07-14 2016-01-14 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
KR102306538B1 (en) * 2015-01-20 2021-09-29 삼성전자주식회사 Apparatus and method for editing content
US10169659B1 (en) * 2015-09-24 2019-01-01 Amazon Technologies, Inc. Video summarization using selected characteristics
US9721165B1 (en) * 2015-11-13 2017-08-01 Amazon Technologies, Inc. Video microsummarization
US10915569B2 (en) * 2016-03-15 2021-02-09 Telefonaktiebolaget Lm Ericsson (Publ) Associating metadata with a multimedia file
US11256741B2 (en) * 2016-10-28 2022-02-22 Vertex Capital Llc Video tagging system and method
US10192584B1 (en) * 2017-07-23 2019-01-29 International Business Machines Corporation Cognitive dynamic video summarization using cognitive analysis enriched feature set
CN109408672B (en) * 2018-12-14 2020-09-29 北京百度网讯科技有限公司 Article generation method, article generation device, server and storage medium
CN110166650B (en) * 2019-04-29 2022-08-23 北京百度网讯科技有限公司 Video set generation method and device, computer equipment and readable medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1267056A (en) * 1992-12-17 2000-09-20 三星电子株式会社 Disk recording medium, and replaying method and device
CN1333628A (en) * 2000-03-30 2002-01-30 索尼公司 Magnetic tape record equipment and method, magnetic tape reproducing equipment and method and record medium
WO2005101413A1 (en) * 2004-04-15 2005-10-27 Koninklijke Philips Electronics N.V. Method of generating a content item having a specific emotional influence on a user
CN104065977A (en) * 2014-06-06 2014-09-24 百度在线网络技术(北京)有限公司 Audio/video file processing method and device
CN107948732A (en) * 2017-12-04 2018-04-20 京东方科技集团股份有限公司 Playback method, video play device and the system of video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A comprehensive music recommendation method integrating user emotion factors; Ju Chunhua et al.; Journal of the China Society for Scientific and Technical Information (《情报学报》); 2017-12-24 (Issue 06); full text *

Also Published As

Publication number Publication date
WO2021058116A1 (en) 2021-04-01
CN113795882A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
Stappen et al. The multimodal sentiment analysis in car reviews (muse-car) dataset: Collection, insights and improvements
Zadeh et al. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos
Yuksel et al. Human-in-the-loop machine learning to increase video accessibility for visually impaired and blind users
Muszynski et al. Recognizing induced emotions of movie audiences from multimodal information
JP2019511036A (en) System and method for linguistic feature generation across multiple layer word representations
Rehm et al. Strategic research agenda for multilingual Europe 2020
Park et al. Crowdsourcing micro-level multimedia annotations: The challenges of evaluation and interface
Van der Sluis et al. Explaining student behavior at scale: the influence of video complexity on student dwelling time
Sanchez-Cortes et al. In the mood for vlog: Multimodal inference in conversational social video
Tian et al. Recognizing induced emotions of movie audiences: Are induced and perceived emotions the same?
US11677991B1 (en) Creating automatically a short clip summarizing highlights of a video stream
US20200125671A1 (en) Altering content based on machine-learned topics of interest
CN113795882B (en) Emotion-based multimedia content summarization
Lv et al. Understanding the users and videos by mining a novel danmu dataset
CN114503100A (en) Method and device for labeling emotion related metadata to multimedia file
Yousefi et al. Examining Multimodel Emotion Assessment and Resonance with Audience on YouTube
WO2016103519A1 (en) Data analysis system, data analysis method, and data analysis program
Huang et al. VideoMark: A video-based learning analytic technique for MOOCs
KR20200066134A (en) Method and device for multimodal character identification on multiparty dialogues
Murtagh et al. New methods of analysis of narrative and semantics in support of interactivity
Baldwin et al. A Character Recognition Tool for Automatic Detection of Social Characters in Visual Media Content
Suglia et al. Going for GOAL: A resource for grounded football commentaries
Murray et al. Comparing player responses to choice-based interactive narratives using facial expression analysis
Chávez-Martínez et al. Happy and agreeable? Multi-label classification of impressions in social video
Murray et al. Proposal for analyzing player emotions in an interactive narrative using story intention graphs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant