CN114503100A - Method and device for labeling emotion related metadata to multimedia file - Google Patents

Method and device for labeling emotion related metadata to multimedia file

Info

Publication number
CN114503100A
Authority
CN
China
Prior art keywords
multimedia object
scene
data analysis
emotion
movie
Prior art date
Legal status
Pending
Application number
CN202080071054.0A
Other languages
Chinese (zh)
Inventor
塔里克·乔杜里
德克兰·奥沙利文
欧文·康兰
杰里米·德巴蒂斯塔
法布里齐奥·奥兰迪
马吉德·拉蒂菲
马修·尼科尔森
哈桑伊斯兰·哈桑
基利安·麦凯比
丹尼尔·特纳
德克兰·麦基本
唐健
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN114503100A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames

Abstract

This disclosure describes apparatus, systems, architectures and methods for summarizing multimedia objects having a main narrative. Methods and processes may include: performing data analysis to identify scene-related emotions indicated in a scene of the multimedia object; generating a knowledge graph associating each of the scenes with a respective one or more scene-related emotions; calculating a plurality of scores using the knowledge graph, each score indicating a relative importance of a scene of the plurality of scenes to conveying the primary narrative; selecting a subset of the plurality of scenes based on the plurality of scores; and/or generating a summary of the multimedia object from the subset.

Description

Method and device for labeling emotion related metadata to multimedia file
Background
The present invention, in some embodiments, relates to media technology, and more particularly, but not by way of limitation, to movie annotation software.
the introduction of streaming media and video-on-demand platforms has provided the movie industry with new tools for delivering products to targeted audiences. As a result, the online video content available to users has increased dramatically. One of the consequences of this is that it is increasingly difficult for viewers to find content from rich content that is relevant to personal preferences. The traditional video search tool can only search movie contents through general indexes and cannot comprehensively reflect the contents. Moreover, subjectively, conventional methods of producing a compressed version of a movie typically require a significant amount of human labor. These methods also typically take a significant amount of time, And may not necessarily be able to transition between users as normal.
Disclosure of Invention
It is an object of the present invention to provide an apparatus, system and method for enriching movie metadata. It is an object of the present invention to provide an apparatus, system and method for creating meaningful, semantically rich descriptions of motion picture scenes. It is an object of the present invention to provide an apparatus, system and method for efficiently generating a condensed version of a multimedia object. It is an object of the present invention to provide an apparatus, system and method for creating a shortened form of a movie that maintains the overall tone and basic narrative of the movie. It is an object of the present invention to provide an apparatus, system and method for searching and classifying multimedia content based on indicators that more fully represent the content. It is an object of the present invention to provide an apparatus, system and method for facilitating a viewer's selection of preferred multimedia objects based on specific interests, tones, styles and/or emotions.
The foregoing and other objects are achieved by the features of the independent claims. Other implementations are apparent from the dependent claims, the detailed description and the accompanying drawings.
According to a first aspect of the present invention, there is provided a system for summarizing a multimedia object having a main narrative, comprising: a processor configured to execute machine-readable instructions to: perform data analysis to identify, in each of a plurality of scenes of the multimedia object, one or more scene-related emotions indicated in the scene; generate a knowledge graph associating each scene of the plurality of scenes with a respective one or more scene-related emotions; calculate a plurality of scores using the knowledge graph, each score indicating a relative importance of a scene of the plurality of scenes to conveying the primary narrative; select a subset of the plurality of scenes based on the plurality of scores; and generate a summary of the multimedia object from the subset.
According to a second aspect of the present invention, there is provided a method for summarizing a multimedia object having a main narrative, comprising: performing data analysis to identify, in each of a plurality of scenes of the multimedia object, one or more scene-related emotions indicated in the scene; generating a knowledge graph associating each scene of the plurality of scenes with a respective one or more scene-related emotions; calculating a plurality of scores using the knowledge graph, each score indicating a relative importance of a scene of the plurality of scenes to conveying the primary narrative; selecting a subset of the plurality of scenes based on the plurality of scores; and generating a summary of the multimedia object from the subset.
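For illustration only (code is not part of the claimed subject matter), the following Python sketch shows one hedged reading of the claimed flow, with the knowledge graph reduced to a plain mapping and emotion diversity used as a stand-in score; the function names and the scoring rule are assumptions.

```python
# Minimal sketch of the claimed summarization flow; all names and the toy
# scoring rule are illustrative assumptions, not the disclosed implementation.
from typing import Dict, List

def summarize(scene_emotions: Dict[str, List[str]], top_k: int = 5) -> List[str]:
    """scene_emotions maps a scene id to the scene-related emotions found by data analysis."""
    # Knowledge graph, reduced here to the plain scene -> emotions mapping.
    knowledge_graph = scene_emotions
    # Score each scene; emotion diversity is a toy proxy for relative importance.
    scores = {scene_id: len(set(emotions)) for scene_id, emotions in knowledge_graph.items()}
    # Select the highest-scoring subset of scenes.
    subset = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # The summary is generated from this subset, restored to playback order.
    return sorted(subset)

print(summarize({"scene01": ["Happy"], "scene02": ["Sad", "Fearful"], "scene03": []}, top_k=2))
# prints ['scene01', 'scene02']
```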
In implementations of various aspects of the invention, the data analysis includes a pre-processing of the multimedia object, the pre-processing including extracting from the multimedia object at least one of: a video file; a subtitle file; a chapter text file detailing a start time of a chapter of the multimedia object; audio file segments of actor speech and non-speech portions.
In possible implementations of various aspects of the invention, the data analysis includes: scraping associated metadata describing the multimedia object; analyzing the associated metadata to indicate the one or more scene-related emotions.
In possible implementations of various aspects of the present invention, the data analysis includes implementing semantic enhancement according to a scene ontology to capture raw multimedia information of the multimedia object.
In a possible implementation of the various aspects of the invention, the data analysis comprises an interconnection with an external source describing the characteristics of the multimedia object.
In possible implementations of the various aspects of the invention, the features include at least one of: a scene of the multimedia object; an activity in the multimedia object scene; an actor performing in the multimedia object; a character depicted in the multimedia object.
In a possible implementation of the various aspects of the invention, the data analysis comprises analyzing descriptive audio soundtracks of the multimedia object to indicate the one or more scene-related emotions.
In a possible implementation of the various aspects of the invention, the data analysis comprises extracting the emotion from a visual emotion indicator comprising at least one of: a facial expression image; a body posture image; a video sequence of emotion-indicating behavior.
In possible implementations of the various aspects of the invention, the data analysis includes extracting the emotion from an auditory emotion indicator, the auditory emotion indicator including at least one of: a musical soundtrack representing the emotion; an emotion-implying vocal indicator.
In possible implementations of aspects of the invention, the data analysis includes extracting the emotion from a textual emotion indicator, the textual emotion indicator including at least one of: an explicit emotion descriptor; an implicit emotion indicator.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present invention, exemplary methods and/or materials are described below. In case of conflict, the present specification will control. In addition, these materials, methods, and examples are illustrative only and not necessarily limiting.
Drawings
Some embodiments of the invention are described herein, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the embodiments of the present invention. Thus, it will be apparent to one skilled in the art from the description of the figures how embodiments of the invention may be practiced.
In the drawings:
FIG. 1A is a schematic flow chart of an alternative operational procedure provided by some embodiments of the present invention;
FIG. 1B is a schematic diagram of an exemplary system provided by some embodiments of the present invention;
FIG. 1C is a schematic diagram of an exemplary system provided by some embodiments of the invention;
FIG. 2 is a schematic diagram of an exemplary system architecture provided by some embodiments of the present invention;
FIG. 3 is a schematic diagram of an exemplary system architecture provided by some embodiments of the present invention;
FIG. 4 is a schematic diagram of an exemplary system architecture provided by some embodiments of the present invention;
FIG. 5A is a schematic diagram representing aspects of the exemplary system architecture of FIG. 4;
FIG. 5B is a schematic diagram representing aspects of the exemplary system architecture of FIG. 4;
FIG. 6 is a schematic diagram of an exemplary system architecture provided by some embodiments of the invention;
FIG. 7 is a schematic diagram of an exemplary system architecture provided by some embodiments of the invention;
FIG. 8 is a schematic diagram of an exemplary system architecture provided by some embodiments of the present invention;
FIG. 9 is a schematic diagram of an exemplary system architecture provided by some embodiments of the invention;
FIG. 10 is a schematic diagram of an exemplary system architecture provided by some embodiments of the invention;
FIG. 11 is a schematic diagram of an exemplary system architecture provided by some embodiments of the present invention;
FIG. 12A is a schematic diagram of an exemplary system architecture provided by some embodiments of the invention;
FIG. 12B is a diagram illustrating a scenario indicated by the architecture of FIG. 12A.
Detailed Description
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
Embodiments of the invention include one or more apparatuses, one or more systems, one or more methods, one or more architectures, and/or one or more computer program products. The computer program product may include a computer readable storage medium having computer readable program instructions that cause a processor to perform various aspects of the present invention.
The computer readable storage medium may be a tangible device capable of retaining and storing instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a corresponding computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network.
The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs), or Programmable Logic Arrays (PLAs), may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein in connection with flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products provided by various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Each media segment in a multimedia object, such as a movie or video, may have a particular emotion that describes the feeling of the movie at that moment. Aspects of the invention may include a metadata model, such as a Resource Description Framework (RDF) model, that is defined and developed to serve as a basis for a knowledge graph of video content.
Aspects of the invention may include annotation toolkits, algorithms, processes, tools, interfaces, and/or operations. In some embodiments of the invention, one or more annotators may be responsible for populating observable actions and emotions in the semantic model. The annotators may be familiar with the characters and plot. An annotator may attempt to mark portions of the content of the movie according to activity and emotion.
Emotion is an important aspect of movies and is influenced by various factors. The annotation of emotions can be based on different data sources, such as background music. The movie score most likely reflects the mood intended by the production team rather than the mood of a character or of the viewer.
It is not an easy matter to decide at what granularity to show details. For example, in a fighting scene, each punch in the scene may be labeled as a separate activity, or alternatively, most segments of the movie may be labeled as "character x and y fighting".
The methods and apparatus of the present invention may include or involve: annotating multimedia files with metadata based on visual and/or audio emotional cues therein; incorporating means for annotating the movie according to emotion; utilizing features of the emotion-based annotations; ranking the importance of scenes and/or sub-scenes according to the diversity of main characters, emotions and/or elements between adjacent scenes; and/or storing emotion-based annotations as structured information in a knowledge graph, for example as Resource Description Framework (RDF) triples.
Embodiments of the present invention can be used to exploit the richness of semantic technologies to enrich movie metadata and/or create meaningful and semantically rich movie scene descriptions using various video and/or audio processing technologies. The invention may include a scene ontology that semantically defines concepts that capture moments in the movie, such as mood-based moments, in order to capture the original multimedia information of the video using a meaningful and semantically rich representation. For example, semantic information may be extracted from descriptive audio soundtracks (e.g., for a visually impaired viewer) and/or from movie subtitles (e.g., spoken by actors). The invention can include creating a knowledge graph that can be queried to select a video summary.
In some embodiments, the emotion-based annotation approach may utilize the structured information of a knowledge graph, which may be stored as RDF triples that may be queried using the W3C-standardized SPARQL protocol and RDF query language (SPARQL). The semantic model or ontology may represent data in the knowledge graph through various features, attributes and resources that can be used to identify movie scenes relevant to movie-related tasks.
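As a hedged illustration of storing emotion annotations as RDF triples and querying them with SPARQL, the following Python sketch uses the rdflib library; the ex: namespace and the property names (ex:hasEmotion, ex:startTime) are assumed placeholders, not the ontology defined in this disclosure.

```python
# Sketch: emotion annotations as RDF triples, queried with SPARQL via rdflib.
# The ontology namespace and property names below are illustrative assumptions.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/movie-ontology#")

g = Graph()
g.bind("ex", EX)

# One small triple set per scene: type, a time bound, and a scene-related emotion.
g.add((EX.scene12, RDF.type, EX.Scene))
g.add((EX.scene12, EX.startTime, Literal(1830.0)))
g.add((EX.scene12, EX.hasEmotion, EX.Sad))
g.add((EX.scene13, RDF.type, EX.Scene))
g.add((EX.scene13, EX.hasEmotion, EX.Happy))

# SPARQL query: all scenes annotated with a given emotion.
results = g.query(
    """
    PREFIX ex: <http://example.org/movie-ontology#>
    SELECT ?scene WHERE { ?scene a ex:Scene ; ex:hasEmotion ex:Sad . }
    """
)
for row in results:
    print(row.scene)   # -> http://example.org/movie-ontology#scene12
```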
Embodiments of the present invention include systems, architectures, devices, and methods for movie and/or video annotation. The systems and architectures may involve one or more steps of the methods and/or may include one or more features of the apparatus.
The systems include a system for summarizing multimedia objects. The methods include a method of summarizing a multimedia object. The multimedia objects may comprise movies or any other suitable audio and/or video objects, such as video games or music files. The multimedia object may include a main narrative. The multimedia object may include video data. The multimedia object may include audio data. The term "multimedia" as used herein refers to one or more presentation media, such as audio and/or video. Summarizing the multimedia object may include generating a knowledge graph of the multimedia object.
The system may include a processor for executing machine readable instructions for implementing the method. The system may include a memory for storing the machine-readable instructions. Alternatively and/or additionally, the method may be manual, automatic and/or partially automatic.
The method may comprise performing one or more data analyses to identify one or more scene-related emotions and/or actions indicated in one or more movie scenes of the multimedia object. The method may include generating a knowledge graph. The knowledge graph can associate one or more scenes with corresponding one or more scene-related emotions.
The method may include calculating a score indicating the relative importance of one or more scenes to conveying the primary narrative. For example, a low score may indicate the absence of a main character and/or theme element, may represent narrative that is not strongly related to the main theme, and/or may represent redundant theme elements. A high score may indicate the presence of a main character and/or theme element, may represent the main narrative, and/or may indicate transitional theme elements, where, for example, the narrative takes a significant turn before or after a climax. The calculation may be based on the knowledge graph.
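A minimal sketch of such a scoring rule, assuming hypothetical feature sets (main characters, theme elements, and the elements of the adjacent scene) and illustrative weights, might look as follows in Python.

```python
# Toy relative-importance score following the criteria described above;
# the feature names and weights are assumptions for illustration only.
from typing import Set

def scene_score(characters: Set[str], theme_elements: Set[str],
                main_characters: Set[str],
                prev_theme_elements: Set[str]) -> float:
    score = 0.0
    # Presence of main characters raises importance.
    score += 2.0 * len(characters & main_characters)
    # Theme elements tied to the main narrative raise importance.
    score += 1.0 * len(theme_elements)
    # Elements already present in the adjacent (previous) scene are redundant.
    score -= 0.5 * len(theme_elements & prev_theme_elements)
    return score

score = scene_score(
    characters={"Alice", "Bob"},
    theme_elements={"heist", "betrayal"},
    main_characters={"Alice"},
    prev_theme_elements={"heist"},
)
print(score)   # 2.0 (Alice) + 2.0 (two theme elements) - 0.5 (repeated "heist") = 3.5
```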
The method may include selecting a subset of the scenes based on the scores. The subset may include higher scoring scenes. The method may comprise generating a summary of the multimedia object from the subset.
The method may comprise pre-processing of the multimedia object. The analysis may comprise the pre-processing of the multimedia object. The pre-processing may include extracting stored data from the multimedia object. The stored data may include one or more video files, subtitle files, chapter text files detailing the start times of chapters of the multimedia object, audio file segments of actor speech and/or non-speech portions, and/or any other suitable data source. The pre-processing may provide some or all of the information for the analysis. The pre-processing may provide a technical improvement: richer scene feature indicators help to detect key emotions and other key features relevant to the summary, such as actions and/or activities. The term "activity" as used herein refers to something that a viewer may observe happening in a multimedia object.
The analysis may include scraping the associated metadata. The associated metadata may describe the multimedia object. The stored data may include the associated metadata. The associated metadata may provide some or all of the information for the analysis. Including associated metadata may provide an improvement by yielding emotion-related information that is not recorded in, or easily extracted from, other stored data files and does not require subjective input from individual users.
The analysis may include using a semantic web system to transform structured and/or semi-structured information into linked data, such as linked open data, using existing and/or novel vocabularies, domain-specific ontologies, and/or W3C-recommended RDF, RDFS, OWL, R2RML and/or SPARQL technologies. The analysis may include implementing semantic enhancement. Semantic enhancement can capture the raw multimedia information of a multimedia object. Semantic enhancement may follow the scene ontology. Semantic enhancement may be used to semantically represent video content and/or support user analysis. Semantic enhancement may provide an improvement by capturing the raw multimedia information of the multimedia object in a meaningful and semantically rich manner that would otherwise not be transferable between different human and/or machine readers, e.g., for emotion detection and/or object summarization. The scene ontology may help make the knowledge graph semantically rich and/or machine-interoperable. The scene ontology may help the knowledge graph fully describe the content of the multimedia object.
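The following Python sketch illustrates one possible form of this semantic enhancement step under stated assumptions: semi-structured scene annotations (a list of dictionaries, as might be produced by pre-processing) are transformed into RDF with rdflib according to a hypothetical scene ontology namespace.

```python
# Sketch of a semantic enhancement ("uplift") step: semi-structured annotations
# are transformed into RDF per a hypothetical scene ontology. Vocabulary assumed.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/movie-ontology#")

annotations = [
    {"scene": "scene01", "start": 0.0,  "end": 95.5,  "emotions": ["Happy"]},
    {"scene": "scene02", "start": 95.5, "end": 240.0, "emotions": ["Sad", "Fearful"]},
]

def uplift(rows) -> Graph:
    g = Graph()
    g.bind("ex", EX)
    for row in rows:
        scene = EX[row["scene"]]
        g.add((scene, RDF.type, EX.Scene))
        g.add((scene, EX.startTime, Literal(row["start"], datatype=XSD.decimal)))
        g.add((scene, EX.endTime, Literal(row["end"], datatype=XSD.decimal)))
        for emotion in row["emotions"]:
            g.add((scene, EX.hasEmotion, EX[emotion]))
    return g

print(uplift(annotations).serialize(format="turtle"))
```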
The method may include linking the data. The method may include publishing the data, for example over the internet. The method may include interlinking. The method may include interlinking data between data sources. The linked data may be configured to facilitate access by a semantic browser. The linked data may facilitate navigation between data sources through Resource Description Framework (RDF) links. The data sources may comprise external sources describing features of the multimedia object. The data sources may include the stored data. The interlinking can improve the knowledge graph by making its external information semantically richer.
The described multimedia object features may include one or more features or attributes of at least one of: one or more scenes of the multimedia object; an activity in a scene of the multimedia object; an actor performing in the multimedia object; and/or a character depicted in the multimedia object. Including these features in the analysis may provide an improvement by yielding a knowledge graph that is semantically rich, can fully describe the content of the multimedia object, and/or can generate a reliable summary of the multimedia object faithful to the key narrative.
The stored data may include descriptive audio soundtracks of the multimedia object. The stored data may include elements of a descriptive audio soundtrack. The soundtrack may provide an improvement by more directly representing the emotion that the author of the multimedia object intends to express, rather than relying solely on subjective user input.
The analysis may include extracting emotions from visual emotion indicators. The visual indicators may include facial expression images, body posture images, and/or video sequences of emotion-indicating behavior. The visual emotion indicators may provide an improvement by yielding rich social and/or emotional information that is not available from other sources, thereby producing a semantically rich knowledge graph that can fully describe the content of the multimedia object and/or can generate a reliable summary of the multimedia object faithful to the key narrative.
The analysis may include extracting emotions from auditory emotion indicators, such as musical soundtracks indicative of emotion and/or vocal indicators indicative of emotion. Auditory emotion indicators may provide an improvement by yielding rich social and/or emotional information that is not available from other sources, thereby producing a semantically rich knowledge graph that fully describes the content of the multimedia object and generates a reliable summary of the multimedia object faithful to the key narrative.
The analysis may include extracting emotions from textual emotion indicators, such as explicit emotion descriptors and/or implicit emotion indicators. These textual emotion indicators may provide an improvement by yielding emotion information that is otherwise not readily apparent, producing a semantically rich knowledge graph that fully describes the content of the multimedia object and can generate a reliable summary faithful to the key narrative.
The methods may include a method of dividing a movie into component scenes. A scene may include one or more actions and/or activities. The method may include tagging one or more of the component scenes. The tagging may be based on one or more emotions detected in the scene or scenes.
As described above, the systems and architectures of the invention may involve one or more steps of the method and/or may include one or more features of the apparatus. The architecture may include one or more features of the system. The inventive method may comprise and the inventive system may relate to a method of annotating and/or summarizing multimedia objects. The term "multimedia" as used herein refers to one or more presentation media, such as audio and/or video. One or more steps of the method may be manual, automatic, and/or partially automatic. The method may include one or more machine learning processes.
Referring to FIG. 1A, FIG. 1A illustrates an exemplary multimedia object summarization process 100A. The method may include summarizing process 100A.
The multimedia objects may comprise one or more movies or any other suitable audio and/or video objects, such as video games or music files. The multimedia object may include a main narrative. The multimedia object may comprise video data. The multimedia object may include audio data. Summarizing a multimedia object may include generating one or more knowledge graphs of the multimedia object.
Process 100A may begin at step 101. In step 101, one or more data analyses may be performed. The method may include performing a data analysis, such as described in step 101. The data analysis may identify one or more scene-related emotions and/or actions indicated in one or more scenes of a movie of the multimedia object. The first column in table 1 below shows exemplary moods.
The analysis may involve one or more sets of stored data. The stored data may include information of the multimedia object. The multimedia object may include stored data. The stored data may include one or more video files, subtitle files, chapter text files detailing the start times of chapters of the multimedia object, audio file segments of the actor's voice and/or non-voice portions, and/or any other suitable data source.
The stored data may include descriptive audio soundtracks of the multimedia object. The stored data may include elements of a descriptive audio soundtrack. The soundtrack may provide an improvement by more directly representing the emotion that the author of the multimedia object intends to express, rather than relying solely on subjective user input.
The method may comprise pre-processing of the multimedia object. The analysis may comprise the pre-processing of the multimedia object. The pre-processing may include extracting stored data from the multimedia object. The pre-processing may provide some or all of the information for the data analysis. The pre-processing may provide a technical improvement: richer indicators of scene features may help to detect key emotions, as well as other key features relevant to the summary, such as actions and/or activities. The term "activity" as used herein refers to something that a viewer may observe happening in a multimedia object.
The data analysis may include scraping the associated metadata. The associated metadata may describe the multimedia object. The stored data may include the associated metadata. The associated metadata may provide some or all of the information for the data analysis. Including associated metadata may provide an improvement by yielding emotion-related information that is not recorded in, or easily extracted from, other stored data files and does not require subjective input from individual users.
The data analysis may include using a semantic web system to transform structured and/or semi-structured information into linked data, such as linked open data, using existing and/or novel vocabularies, domain-specific ontologies, and/or W3C-recommended RDF, RDF Schema (RDFS), Web Ontology Language (OWL), R2RML and/or SPARQL technologies.
The data analysis may include implementing semantic enhancement. Semantic enhancement can capture the raw multimedia information of a multimedia object. Semantic enhancement may follow the scene ontology. Semantic enhancement may be used to semantically represent video content and/or support user analysis. Semantic enhancement may provide an improvement by capturing the raw multimedia information of the multimedia object in a meaningful and semantically rich manner that would otherwise not be transferable between different human and/or machine readers, e.g., for emotion detection and/or object summarization. The scene ontology may help make the knowledge graph semantically rich and/or machine-interoperable. The scene ontology may help the knowledge graph fully describe the content of the multimedia object.
The method may include linking the data. The method may include publishing the data, for example over the internet. The method may include interlinking. The method may include interlinking data between data sources. The linked data may be configured to facilitate access by a semantic browser. The linked data may facilitate navigation between data sources through Resource Description Framework (RDF) links. The data sources may comprise external sources describing features of the multimedia object. The data sources may include the stored data. The interlinking can improve the knowledge graph by making its external information semantically richer.
The interconnections may help the knowledge graph have semantically richer external information that may not be available from the movie content or movie metadata. For example, the metadata of a movie may only provide an association between the actor and the depicted fictional character. Interconnecting the actor entities with corresponding entities in an external source may provide more information about the actor, such as gender, date of birth, etc.
The knowledge graph may be interconnected with external sources, such as linked data networks. The external sources may include common cross-domain knowledge graphs, such as DBpedia, Wikidata, and/or YAGO. For example, Wikidata may be used for movie data and personal data, such as actors and movie production crews. Data links for other entities, such as the city and country of a certain scene or segment, can be determined from DBpedia resources.
The apparatus and methods of the present invention may include a mechanism to facilitate the interconnection of entities with the most appropriate external source. Interconnections can be represented by SPARQL.
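As a hedged example of such interlinking, the sketch below adds owl:sameAs links from a local actor entity to external DBpedia and Wikidata identifiers using rdflib; the local entity name and the external URIs are placeholders, and in practice the mapping would come from an entity-linking step rather than being hard-coded.

```python
# Sketch of interlinking a local actor entity with external knowledge graphs
# via owl:sameAs links; entity names and identifiers below are placeholders.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

EX = Namespace("http://example.org/movie-ontology#")

g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

actor = EX.actor_jane_doe   # local entity taken from the movie metadata (placeholder)
g.add((actor, OWL.sameAs, URIRef("http://dbpedia.org/resource/Jane_Doe")))   # placeholder DBpedia URI
g.add((actor, OWL.sameAs, URIRef("http://www.wikidata.org/entity/Q00000")))  # placeholder Wikidata Q-id

print(g.serialize(format="turtle"))
```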
The described multimedia object features may include one or more features or attributes of at least one of: one or more scenes of the multimedia object; an activity in a scene of the multimedia object; an actor performing in the multimedia object; and/or a character depicted in the multimedia object. Including these features in the data analysis may provide an improvement by yielding a knowledge graph that is semantically rich, can fully describe the content of the multimedia object, and/or can generate a reliable summary of the multimedia object faithful to the key narrative.
The data analysis may include extracting emotions from visual emotion indicators. The stored data may include the visual emotion indicators. The visual indicators may include facial expression images, body posture images, and/or video sequences of emotion-indicating behavior. The visual emotion indicators may provide an improvement by yielding rich social and/or emotional information that is not available from other sources, thereby producing a semantically rich knowledge graph that can fully describe the content of the multimedia object and/or can generate a reliable summary of the multimedia object faithful to the key narrative.
The data analysis may include extracting emotions from auditory emotion indicators, such as musical soundtracks indicative of emotion and/or vocal indicators indicative of emotion. The stored data may include the auditory emotion indicators. Auditory emotion indicators may provide an improvement by yielding rich social and/or emotional information that may not be available from other sources, thereby producing a semantically rich knowledge graph that fully describes the content of the multimedia object and generates a reliable summary of the multimedia object faithful to the key narrative.
The data analysis may include extracting emotions from textual emotion indicators, such as explicit emotion descriptors and/or implicit emotion indicators. These textual emotion indicators may provide an improvement by yielding emotion information that is otherwise not readily apparent, producing a semantically rich knowledge graph that fully describes the content of the multimedia object and can generate a reliable summary faithful to the key narrative.
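A toy Python sketch of extracting explicit textual emotion indicators from subtitle lines with a small lexicon is shown below; the lexicon, labels, and matching rule are illustrative assumptions rather than the analysis prescribed by this disclosure.

```python
# Toy extraction of explicit textual emotion indicators from one subtitle line;
# the lexicon and emotion labels are illustrative assumptions.
import re

EMOTION_LEXICON = {
    "happy": "Happy", "laugh": "Happy", "delighted": "Happy",
    "cry": "Sad", "tears": "Sad", "mourn": "Sad",
    "furious": "Angry", "rage": "Angry",
    "terrified": "Fearful", "scream": "Fearful",
}

def emotions_in_subtitle(line: str) -> set:
    """Return the set of emotions explicitly suggested by one subtitle line."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return {EMOTION_LEXICON[t] for t in tokens if t in EMOTION_LEXICON}

print(emotions_in_subtitle("[She bursts into tears] I can't believe he's gone."))
# -> {'Sad'}
```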
Various aspects of the method may be implemented by one or more user interfaces. The user interface may include a Graphical User Interface (GUI). The user interface may include one or more interface features. The interface features may include widgets and/or virtual buttons. The interface features may be specific to one or more steps of the method. Interface features may be used to avoid subjective individual user input.
Interface features may facilitate the generation of an emotion-based knowledge graph that can be queried for video summaries. The emotion-based knowledge graph can be used to create a shortened form of a movie that preserves the overall mood and story. The method may include creating an emotion-based knowledge graph using the annotation information to select the most informative scenes for the movie summary.
The method may involve a knowledge-based system including a knowledge base representing information of the multimedia object. The method may include the representation and/or definition of categories, attributes and relationships between the concepts, data and/or entities that characterize the multimedia object. The knowledge-based system may include one or more inference engines for deriving new information and/or discovering inconsistencies. The method may include generating a knowledge graph in step 103. The knowledge graph can associate one or more scenes with corresponding one or more scene-related emotions. Aspects of the present invention may avoid the need for subjective user input of emotions to create the knowledge graph.
The method may include calculating a score in step 105, the score indicating the relative importance of one or more scenes to conveying the primary narrative. For example, a lower score may indicate the absence of a main character and/or theme element, may represent narrative that is not strongly related to the main theme, and/or may represent redundant theme elements. A higher score may indicate the presence of a main character and/or theme element, may represent the main narrative, and/or may represent transitional theme elements, where, for example, the narrative takes a significant turn before or after a climax. Scores may be calculated from the knowledge graph.
The method may include selecting a subset of the scenes based on the scores in step 107. The subset may include higher scoring scenes. Selecting may include setting one or more thresholds to be included in the subset.
The method may comprise generating a summary of the multimedia object from the subset. Generating the summary may include merging only scenes with scores that satisfy a threshold. Generating the summary may include filtering out scenes that do not meet a threshold score. Generating the summary may include deleting scenes whose score does not meet a threshold.
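For illustration, the following Python sketch applies a score threshold to select scenes and then emits an ordered edit list of (start, end) cut points for the summary; the scene identifiers, scores, and time bounds are assumed example values.

```python
# Sketch of threshold-based selection and summary edit-list generation;
# all scene ids, scores, and time bounds below are assumed example values.
from typing import Dict, List, Tuple

def select_scenes(scores: Dict[str, float], threshold: float) -> List[str]:
    """Filter out scenes whose score does not meet the threshold."""
    return [sid for sid, score in scores.items() if score >= threshold]

def build_edit_list(selected: List[str],
                    bounds: Dict[str, Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return (start, end) cut points, ordered by start time, for the summary."""
    return sorted(bounds[sid] for sid in selected)

scores = {"scene01": 1.5, "scene02": 4.0, "scene03": 0.5, "scene04": 3.0}
bounds = {"scene01": (0.0, 95.5), "scene02": (95.5, 240.0),
          "scene03": (240.0, 300.0), "scene04": (300.0, 420.0)}
print(build_edit_list(select_scenes(scores, threshold=2.0), bounds))
# -> [(95.5, 240.0), (300.0, 420.0)]
```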
The systems and architectures of the invention can include, and the methods and processes of the invention can include, systems for annotating and/or summarizing multimedia objects. The system may include a processor for executing machine readable instructions for implementing the method. The system may include a memory for storing the machine-readable instructions.
Referring to FIG. 1B, a software system 100B is shown. The system may include any or all of the features of system 100B. The system may be used to perform any or all of the steps of the method 100A shown in fig. 1A. The system may include one or more modules for performing one, any or all of the steps of the method 100A. The term "module" as used herein refers to one or more software components and/or one or more portions of one or more programs, and may also include and/or refer to hardware for executing software components and/or program portions. The hardware may include a processor to execute instructions and/or a memory to store instructions. The program may contain one or more routines. A program may include one or more modules. As explained in the following paragraphs, the representation of the modules may be used to illustrate functional features of a system architecture embodying the present invention and/or implementing the methods of the present invention. Modules may be incorporated into the programs and/or software via one or more interfaces. The instructions executed by the processor may include one or more modules.
The system 100B shown in fig. 1B is used to annotate multimedia files based on visual and/or audio emotional cues therein. The system 100B includes two main sections, namely a semantic enhancement section 102 and a video summarization section 104. The semantic enhancement section 102 creates a knowledge graph from the emotions and/or activities. The system 100B may be used to implement one or more steps of the process 100A shown in fig. 1A.
The semantic enhancement section 102 may include an automatic media processing module 106. The media processing module 106 may be used to automatically pre-process the multimedia objects selected by the user. Module 106 may be used to perform any or all of the pre-processing steps described with respect to process 100A. The module 106 may extract video, audio and/or subtitles from the movie. Module 106 may extract chapter text files detailing the start times of the movie chapters, audio files, actor speech segments, and the like.
The semantic enhancement section 102 may include a Natural Language Processing (NLP) module 108. The natural language processing module 108 may be used to locate and classify named entity data processed by the module 106. Module 108 may be used to perform named entity recognition on unstructured multimedia data processed by module 106.
The semantic enhancement section 102 includes an annotation tool module 110. Module 110 may include the graphical user interface described above with respect to step 101 of process 100A. Module 110 facilitates user selection of a movie for processing. Module 110 may facilitate user interaction with modules 106 and 108. Module 110 may include any or all of the features of the graphical user interface described with respect to step 101 of process 100A.
The semantic enhancement section 102 includes an ontology module 112. Ontology module 112 facilitates modules 108 and 110 in performing data analysis. Ontology module 112 may facilitate the classification of named entity data processed by module 106. Module 112 may facilitate named entity recognition on unstructured multimedia data processed by module 106. Ontology module 112 may facilitate semantically defining concepts that capture moments in the movie selected by the user, such as emotion-based moments, in order to obtain the raw multimedia information of the movie using a meaningful and semantically rich representation. For example, semantic information may be extracted from descriptive audio soundtracks of a movie (e.g., soundtracks for a visually impaired viewer) and/or from movie subtitles (e.g., words spoken by actors and/or visual cues for a hearing-impaired viewer). Semantic information may be extracted from one or more transcriptions of one or more audio portions of the multimedia object. The transcription may include a description of speech and/or non-speech elements. The transcription may include one or more languages. The semantic information may be extracted from closed captions, open captions, and/or subtitles. Semantic information may be extracted from translations of dialog, sound effects, related musical cues, and/or any other suitable related audio data.
The semantic enhancement section 102 includes a semantic enhancement module 114 and an automatic metadata scraping module 116. The automatic metadata scraping module 116 is used to automatically scrape metadata from the movie files of the movies selected by the user for processing by the semantic enhancement module 114. The automatic metadata scraping module 116 can scrape movie metadata without the need for pre-processing by modules 106 and 108.
The semantic enhancement module 114 semantically enhances the data and facilitates semantic representation of the movie content. The semantic enhancement module 114 is used to process the raw multimedia information and/or the processed multimedia information of the multimedia object directly through the annotation tool module 110, through the ontology module 112, and/or through the automatic metadata scraping module 116. The semantic enhancement module 114 supports user analysis through the annotation tool module 110.
The semantic enhancement section 102 includes an external source interconnection identification module 118. The external source interconnection identification module 118 may facilitate connection to external sources of movie metadata. For example, the external source interconnection identification module 118 may facilitate scraping movie metadata from one or more movie database application programming interfaces (e.g., IMDb, DBpedia, Wikidata, and/or other open data sources). The interconnection with external sources may give the knowledge graph semantically richer external information that may not be available from the movie content or movie metadata.
The semantic enhancement section 102 includes a movie knowledge graph generation module 120. The knowledge graph generation module 120 implements ontology-driven generation of a movie knowledge graph. The knowledge graph generation module 120 generates a semantically rich and machine-interoperable knowledge graph that fully describes the movie in terms of scenes, emotions, activities, actors, and so on. The knowledge graph generation module 120 can store the knowledge graph as Resource Description Framework (RDF) triples that can be queried using SPARQL (e.g., via the SPARQL query endpoint module 128) and that can be connected to external resources. The semantic model or ontology may represent data in the knowledge graph through various features, attributes, and resources that may be used to identify movie scenes useful for movie-related tasks (e.g., movie summaries).
The video summarization section 104 includes a summary user interface module 130, a summary Application Programming Interface (API) module 132, and a summary master module 136. The summary master module 136 may include modules for scene template selection and general selections through which a user customizes preferred features of a movie summary, as well as a scene template processor and a scene ranker for scoring movie scenes according to the user's selections. When a user requests a movie summary through interface module 130, summary API module 132 communicates with summary master module 136 to select a subset of the highest-ranked scenes, ranked according to the user-selected template, for inclusion in movie summary 138. The movie summary 138 may be generated by module 136 to include only scenes whose ranked movie data, extracted by module 128 from the movie knowledge graph module 120 and/or external movie metadata sources, is of high importance to conveying the main narrative of the movie.
Referring to FIG. 1C, an illustrative block diagram of an illustrative system 100C is shown. The system 100C may include any or all of the features of the system 100B. The system 100C may be used to perform any or all of the steps of the method 100A shown in fig. 1A. The system 100C may include one or more modules for performing one, any or all of the steps of the method 100A. The system 100C is based on a computer 141. The computer 141 has a processor 143 that controls the operation of the computer 141 and related components of the computer 141. Computer 141 includes RAM 145, ROM 147, input/output module 149, and memory 155. Processor 143 executes software running on computer 141, such as operating system 157 and software including the steps of process 100A. Other components typically used in computers, such as EEPROM or flash memory, or any other suitable component, may also be part of the computer 141.
The memory 155 comprises any suitable storage technology, such as a hard disk. The memory 155 stores software including an operating system 157 and applications 159, as well as data 151 required for operation of the system 100C. For example, memory 155 may also store video, text, and/or audio auxiliary files including multimedia objects. The video, text, and/or audio auxiliary files may also be stored in cache memory or any other suitable memory. Alternatively, some or all of the computer-executable instructions comprising those of process 100A may be embodied in hardware or firmware (not shown), for example. The computer 141 executes instructions embodied in software to perform various functions, such as the steps of the process 100A.
Input/output (I/O) module 149 may include connections to a microphone, keyboard, touch screen, mouse, and/or stylus through which a user of computer 141 can provide input. Input may be provided by moving a cursor. The input may be included in a transfer event and/or an escape event. Input/output module 149 may also include speakers for providing audio output and/or one or more video display devices for providing textual, audio, audiovisual and/or graphical output. The inputs and outputs may be associated with computer application functions, such as facilitating one or more steps of the process 100A.
The network connections depicted in FIG. 1C include a Local Area Network (LAN) 153 and a Wide Area Network (WAN) 169, but may also include other networks. For example, the system 100C is connected to other systems through the LAN interface 153. System 100C may operate in a networked environment using connections to one or more remote computers, such as systems 181 and 191. Systems 181 and 191 can be personal computers or servers including many or all of the elements described above with respect to system 100C. When used in a LAN networking environment, the computer 141 is connected to the LAN 153 through a LAN interface 151. When used in a WAN networking environment, the computer 141 can include a modem 151 or other means for establishing communications over the WAN 169, such as the Internet 171.
It will be appreciated that the network connections shown are illustrative and other means of establishing a communications link between the computers 141, 181 and 191 can be used. The existence of various well-known protocols such as TCP/IP, Ethernet, FTP, HTTP and the like is presumed, and system 100C can be operated in a client-server configuration to permit a user to retrieve web pages, such as from a web-based server. The web-based server may be used to transmit data to any other suitable computer system, such as 181 and 191. The web-based server may also send computer-readable instructions along with data to any suitable computer system. Computer readable instructions may include storing data in a cache memory, a hard disk, a secondary memory, or any other suitable memory. The transmission of data along with computer-readable instructions may enable a computer system to quickly retrieve the data as needed. Because computer systems are able to retrieve data quickly, web-based servers do not need to stream data to the computer systems. This may provide benefits to the computer system because retrieval is faster than data streaming. Thus, the user avoids the frustration of waiting for the application to run. Conventional data stream processing requires extensive use of processors and cache memories. As envisioned, retrieving data when stored in a memory of a computer system may avoid extensive use of the processor and cache memory. Any conventional web browser can be used to display and/or manipulate data retrieved on a web page.
Further, the application program(s) 159 used by the computer 141 may include computer-executable instructions comprising the steps of the process 100A.
The computer 141 and/or systems 181 and 191 may also include various other components, such as a battery, speaker, and antenna (not shown).
Systems 181 and 191 can be portable devices such as laptops, smart phones, or any other suitable device for storing, transmitting, and/or communicating relevant information. Systems 181 and 191 can include other devices. These devices may be the same as or different from system 100C. These differences relate to hardware components and/or software components.
The method may include, and the system may involve, one or more steps of an emotion-based knowledge graph generation process. Referring to FIG. 2, an exemplary emotion-based knowledge graph generation process 200 is shown. The method may include, and the system may involve, one or more steps of the emotion-based knowledge graph generation process 200. Process 200 may be performed by one or more modules of system 100B. The process 200 may be performed by one or more modules of the semantic enhancement section 102 shown in FIG. 1B. Process 200 may begin at step 202.
In step 202 of process 200, a user may select a multimedia object, such as a movie and/or movie file, stored on a DVD or any suitable storage medium to generate a knowledge graph. The movie may be of any suitable genre, e.g. action and/or thriller.
In step 204, video pre-processing may include extracting video, audio and/or subtitles from the movie. Video pre-processing may provide multimedia object data. The multimedia object data may include any suitable data, such as one or more chapter text files detailing the start times of chapters of the multimedia object; a plurality of audio files of the movie, for example one audio file per available soundtrack; and a relatively large number of smaller audio files containing segments of actor speech (dialog) and/or segments of non-speech portions. Step 204 may be performed by the automatic media processing module 106 and/or the NLP module 108 of the semantic enhancement section 102 shown in FIG. 1B. Step 204 may include natural language processing of the extracted movie data.
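One hedged way to implement such a pre-processing step is with the ffmpeg and ffprobe command-line tools, as sketched below in Python; the input file name, stream indices, and output names are illustrative, and the disclosure does not prescribe any particular tooling.

```python
# Sketch of step 204 pre-processing using ffmpeg/ffprobe (assumed installed);
# file names and stream indices are illustrative assumptions.
import json
import subprocess

SOURCE = "movie.mkv"   # hypothetical input file

# Extract the first audio track as an uncompressed WAV file.
subprocess.run(["ffmpeg", "-y", "-i", SOURCE, "-map", "0:a:0", "-vn", "audio_0.wav"],
               check=True)

# Extract the first (text-based) subtitle track as an SRT file.
subprocess.run(["ffmpeg", "-y", "-i", SOURCE, "-map", "0:s:0", "subtitles.srt"],
               check=True)

# Read chapter metadata (titles and start times) as JSON.
probe = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_chapters", SOURCE],
    check=True, capture_output=True, text=True)
chapters = json.loads(probe.stdout).get("chapters", [])
for chapter in chapters:
    print(chapter.get("tags", {}).get("title", "untitled"), chapter["start_time"])
```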
In step 206, movie metadata may be automatically scraped from the loaded movie without the pre-processing of step 204. Alternatively or additionally, movie metadata may be scraped from an external source (e.g., external source 212). Step 206 may be performed by the automatic metadata scraping module 116 of the semantic enhancement section 102 shown in FIG. 1B.
Step 208 shows an implementation of an algorithm and annotation toolkit for annotating movies according to emotion. Step 208 may be performed by the automatic annotation tool module 110 of the semantic enhancement section 102 shown in FIG. 1B. As described above with respect to step 101 of process 100A shown in FIG. 1A, step 208 may include one or more user interactions with a dedicated graphical user interface. Step 208 may include one or more steps of the data analysis. Step 208 may assist the user in performing any or all of the steps of process 200.
In step 210, a semantic enhancement module, such as module 114 shown in FIG. 1B, may be used to semantically represent the movie and support analysis of the movie. The semantic model implemented may include a scene ontology. The scene ontology may include a media annotation ontology. The content of the semantic representation may facilitate navigation between data sources through Resource Description Framework (RDF) links. Step 210 includes semantically enhancing the data and facilitating semantic representation of the movie content. Step 210 may include semantic enhancement of raw or pre-processed movie data.
Step 212 of process 200 includes interfacing with external data sources of movie metadata. The interconnection with external sources gives the knowledge graph external information that is semantically richer and not available from the movie content or movie metadata. Movie metadata from external sources may be scraped using a movie database application programming interface (e.g., IMDb, DBpedia, Wikidata, and/or other open data sources). Step 212 may be mediated by one or more modules of the semantic enhancement section 102 shown in FIG. 1B, such as the automatic metadata scraping module 116 and/or the interconnection identification module 118.
Step 214 depicts the generation of an ontology-driven movie knowledge graph. In step 214, the user may follow semantic web and linked data best practices. A user may create a novel ontology and/or reuse one or more existing ontologies. Through this framework, the generated knowledge graph can be semantically rich and machine-interoperable, and can comprehensively describe the content of the movie, such as scenes and the activities within them, as well as metadata, such as descriptions of the movie and actors. Step 214 may be mediated by one or more modules of the semantic enhancement section 102 (e.g., knowledge graph generation module 120) shown in FIG. 1B.
The data analysis may include one or more processes that parse the multimedia object into action-based and/or emotion-based components. Fig. 3 is a graphical depiction of an exemplary emotion labeling process 300. Process 300 may begin at step 302. Data analysis of the multimedia object, as shown in step 101 of FIG. 1, may include any or all of the steps of process 300.
Step 302 comprises analyzing a movie having a start time T0 and an end time Tn. Through data analysis, a movie or other multimedia object may be determined to have a dynamic, complex structure, as shown at step 302. In process 300, the entire movie may be marked as beginning at time T0 and ending at time Tn.
In step 304, it is determined by data analysis that the structure includes one or more component scenes (Sm). Step 304 may include marking the scenes of the multimedia object. A scene may have a fixed start time and end time. Step 304 may include dividing the movie into component scenes. For example, in step 101 of fig. 1A, the data analysis may include dividing the multimedia object into component scenes. The method may include marking one or more of the component scenes. The marking may be based on one or more temporally fixed emotions detected in the one or more scenes.
As shown in FIG. 3, the first scene (S0) may be determined by data analysis to begin at time T0, and the last scene (Sm) may end at time Tn. Intermediate scenes STn-50 and STn-10 may be determined by data analysis to end at times Tn-50 and Tn-10, respectively. The scenes may follow the overall narrative. Scene transitions can be obvious or subtle. A depicted transition may indicate a new scene. Scene transitions may be identified, for example, by changes in location, intonation, mood, events, narrative stage, and/or main narrative development.
In step 306, it may be determined through data analysis that each scene includes one or more temporally fixed, observable actions and/or activities, such as actions Aw, Ax, Ay, and Az. The actions may overlap each other (not shown).
In step 308, it may be determined through data analysis that each scene includes one or more temporally fixed, observable, and possibly overlapping emotions. The scene mood may be different from the overall genre of the multimedia object. The scene mood may set the mood of a particular scene. Table 1 below shows illustrative scene emotions.
The marking may be based on one or more emotion indicators, such as auditory indicators of the tone of the soundtrack in a given scene, and/or audiovisual behavioral indicators such as a person's behavior and/or facial expressions. For example, the loudness and/or intensity of the portions of the soundtrack that overlap with a scene may indicate an angry mood in the scene. Soft music may indicate a gentle, melancholy, and/or fearful mood. High-pitched music may indicate a happy and/or excited mood. Low-pitched music may indicate sad and/or serious emotions. Auditory and/or visual signals representing laughter may indicate a happy atmosphere in the scene, while crying and frowning facial expressions may indicate sad emotions. Table 1 below shows illustrative emotion indicators.
Table 1: Illustrative emotion indicators. Music intensity and tone represent relative values. Musical pitch is expressed in hertz and tempo in beats per minute.
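The following sketch illustrates the kind of rule-based mapping from auditory indicators to candidate scene emotions suggested above. The concrete thresholds and emotion labels are assumptions chosen for illustration, since the disclosure leaves them open.

```python
# Rule-based sketch of Table 1-style auditory indicators; thresholds and
# labels are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class AudioFeatures:
    loudness: float   # relative value in [0, 1]
    pitch_hz: float   # dominant musical pitch
    tempo_bpm: float  # beats per minute

def candidate_scene_emotions(f: AudioFeatures) -> list[str]:
    emotions = []
    if f.loudness > 0.8:
        emotions.append("anger")
    if f.loudness < 0.3:
        emotions.extend(["calm", "melancholy", "fear"])
    if f.pitch_hz > 400 and f.tempo_bpm > 120:
        emotions.append("happiness")
    if f.pitch_hz < 150 and f.tempo_bpm < 80:
        emotions.append("sadness")
    return emotions or ["neutral"]

print(candidate_scene_emotions(AudioFeatures(loudness=0.9, pitch_hz=500, tempo_bpm=140)))
```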
Alternatively and/or additionally, the marking may be based on explicit and/or implicit emotion cues contained in the subtitles of the multimedia object. Alternatively and/or additionally, the marking may be based on external data scraped from a media database.
The method can include creating semantic annotations using a semantic annotation architecture. For example, to generate movie metadata, the metadata may be scraped and lifted into a knowledge graph using a data source (e.g., a movie DB API). The media annotation ontology (prefixed with "ma" in the figures below) can capture the required information. The knowledge graph may include other predicates, such as "hasDirector". The knowledge graph can include a set of classes to represent the different professions in the movie industry, such as directors. These concepts may be formally defined as part of the multimedia scene ontology. The method may include generating one or more features of a scene ontology. The knowledge graph can include a scene ontology, which can be created during step 103 of the process 100A shown in fig. 1A and/or can be generated by one or more modules (e.g., modules 112, 118, and 120) of the semantic enhancement component 102 of the system 100B shown in fig. 1B. Referring to FIG. 4, an illustrative scene annotation ontology 400 is shown; referring to FIG. 5A, an exemplary RDF schema legend 501 provided by the present invention is shown; referring to FIG. 5B, an illustrative mood moment definition 503 is shown, using the legend 501 shown in FIG. 5A.
The ontology 400 semantically defines concepts that capture moments in the movie, in order to translate the raw multimedia information of the video into a meaningful and semantically rich representation. The ontology 400 may be designed to be machine-interoperable while remaining readily understandable by humans.
As shown in fig. 4, according to the legend 501 shown in fig. 5A, in the ontology 400 a scene 402 is defined as an existing Media Annotation Ontology (MA) media fragment 404 corresponding to an existing MA media resource 406 over a time interval 408. The media resource 406 is based on the existing RDFS resource 410. The scene 402 is associated with an existing location 412 through the MA property "ma:hasRelatedLocation". For example, it is determined by data analysis that the scene 402 includes a moment 414 having a time interval 416. The moment 414 is determined and marked as including a mood moment 418 and/or an observable action 420. According to the mood moment definition 503 shown in fig. 5B, and according to the legend 501 shown in fig. 5A, a mood moment such as mood moment 418 can be determined and labeled by data analysis, such as function 519, as having an associated mood 521. The moment 414 may be scored according to importance, for example using an XML Schema Definition (XSD) boolean score 422. The score 422 may be based on the mood 521 and/or action 420 shown in fig. 5B and/or other relevant cues about the scene's importance for conveying the main narrative of the movie, such as an actor 424 being in the scene.
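A minimal sketch of how the scene-ontology terms just described could be expressed with rdflib is shown below. The namespace URI for the "so" prefix is a placeholder, and only the class and property names (so:Scene, so:MoodMoment, so:ObservableAction, so:hasMoment, ma:MediaFragment) follow the description above; everything else is an assumption.

```python
# Sketch of the scene-ontology vocabulary using rdflib.
from rdflib import Graph, Namespace, RDF, RDFS

SO = Namespace("http://example.org/scene-ontology#")  # assumed URI
MA = Namespace("http://www.w3.org/ns/ma-ont#")        # W3C Ontology for Media Resources

g = Graph()
g.bind("so", SO)
g.bind("ma", MA)

# Scenes are media fragments of a media resource.
g.add((SO.Scene, RDF.type, RDFS.Class))
g.add((SO.Scene, RDFS.subClassOf, MA.MediaFragment))

# Each scene has moments; mood moments carry an emotion, observable actions an activity.
for cls in (SO.Moment, SO.MoodMoment, SO.ObservableAction):
    g.add((cls, RDF.type, RDFS.Class))
g.add((SO.MoodMoment, RDFS.subClassOf, SO.Moment))
g.add((SO.ObservableAction, RDFS.subClassOf, SO.Moment))
g.add((SO.hasMoment, RDFS.domain, SO.Scene))
g.add((SO.hasMoment, RDFS.range, SO.Moment))

print(g.serialize(format="turtle"))
```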
The creation of a scene ontology, such as the ontology 400 shown in fig. 4, may be accomplished through one or more algorithmic processes. Referring to fig. 6, fig. 6 is a flow chart of an exemplary process and/or algorithm 600. The algorithm 600 may be used to annotate a movie. The algorithm 600 may begin at step 601. Step 601 may include launching the algorithm 600. The algorithm 600 may include any and/or all of the steps of the processes 100A, 200, and 300 described with reference to fig. 1A, 2, and 3, respectively. The processes 100A, 200, and 300 may include any and/or all of the steps of the algorithm 600. The algorithm 600 may be performed by one or more modules of the semantic enhancement component 102 of the system 100B shown in FIG. 1B.
In step 602, a particular movie may be selected and/or loaded by a user. The stored annotation data set may also be loaded to annotate the movie. Metadata for the movie may also be loaded. Step 602 may include any and/or all of the features described with respect to step 202 of process 200 shown in fig. 2.
In step 604, the user can add and/or edit annotations for the movie. The annotations may be stored in an annotation database.
In step 606, the algorithm and/or user may populate the media ontology. The ontology may include any and/or all of the features described with respect to the ontology 400 shown in fig. 4.
In step 608, the user and/or algorithm can derive a movie knowledge graph. The knowledge graph may be stored as RDF triples. Derivation of the knowledge graph can be performed by module 120 of system 100B shown in fig. 1B. Step 608 may include any and/or all of the features of step 103 of process 100A shown in fig. 1A and/or of steps 212 and/or 214 of process 200 shown in fig. 2. Step 103 of process 100A may include any and/or all of the features of step 608.
The steps of the algorithmic process may be carried out by a user through one or more user interfaces. The user interface may include any or all of the features of the GUI described with respect to step 101 of process 100A and/or step 208 of process 200 shown in fig. 2. The module 110 of the system 100B shown in FIG. 1B may include one or more features of a user interface.
Referring to FIG. 7, an exemplary graphical user interface (GUI) 700 for an exemplary annotation toolkit that facilitates annotation according to mood is shown. The user interface may include any or all of the features of GUI 700. GUI 700 may facilitate one or more steps of the processes described with respect to process 100A of fig. 1A, process 200 of fig. 2, and process 300 of fig. 3. The annotation toolkit may provide options for the annotator to record information about the activities and emotions observed in the movie. GUI 700 includes four main features.
Feature 702 includes illustrative global actions, global emotions, and entities for the loaded movie. The user can select predefined actions, emotions, and/or entities in feature 702. The user can define new actions, emotions, and/or entities in feature 702. Saved actions, emotions, and/or entities may be available for use throughout the movie.
Feature 704 is used to display the movie. Feature 704 may include means for viewing the movie; means for listening to the soundtrack; means for pausing the movie; means for skipping scenes; and/or any other suitable movie display means. Feature 704 may help the user detect emotions and/or actions in a scene viewed within a specified time frame.
Feature 706 is used to display and/or edit annotation details according to the aspect represented, such as scene, action, and/or mood. Various configuration options are provided in feature 706 based on the user's selection of the annotation aspect in feature 708. Each annotation aspect includes corresponding options. For example, if the user wants to define a new action, options are provided that include setting the start and end points in the timeline editor. In addition, the entities involved in the action (e.g., the receiving entity and the performing entity) are also assigned to a particular time frame.
GUI 700 is used to facilitate semantic modeling of movie content. The annotator selects the performing entities, the participating activities, and/or the receiving entities of the associated action from those defined in feature 702. Control buttons may be included to facilitate access to subsequent and/or previous options, to delete annotations, and/or to perform other suitable operations. The operations may be performed on one or more objects and/or regions selected in feature 708. Feature 708 is used to facilitate switching between scenes, actions, and/or emotions.
The panel of feature 708 is used to create and/or define scenes, to determine temporal states such as emotions, and to label activities that occur within a specified time. As shown, the horizontal axis of the panel (left to right) represents the timeline. The vertical axis is divided into three sections: "scene", "action", and "emotion".
A user interface, such as GUI 700, may be included in the annotation tool. The tool may include software and/or hardware for performing media annotation of multimedia objects. Referring to FIG. 8, an illustrative system architecture 800 for an annotation tool is shown. The annotation tool can include the annotation toolkit previously described with respect to the GUI 700 of FIG. 7. The architecture 800 may include one or more features of the system 100B shown in fig. 1B. Architecture 800 may be used to perform one or more steps of process 100A of fig. 1A, process 200 of fig. 2, process 300 of fig. 3, and/or process 600 of fig. 6. The architecture 800 may be used to generate an ontology such as the ontology 400 of fig. 4.
In the system architecture 800, the scraped metadata 802 for a movie is presented through the user interface 804. User interface 804 may include one or more features of GUI 700 of fig. 7. The user interface 804 interacts with an annotation module 806 for the movie in a timeline editor 808. The data from the annotation tool is then stored as RDF triples in the knowledge graph 810.
The method and system of the present invention may include semantic analysis. The method and system of the present invention may include text mining. The method and system of the present invention may include deep learning. A search of the knowledge graph data can return concepts that are closely related to the search term. For example, "punch" and "kick" may be returned for the more general search term "fight". If an activity were stored simply as a string label, the machine could not understand synonyms; semantic analysis avoids such semantic gaps in the generated activities. Semantic analysis may also facilitate searching for general concepts, such as in a summarization task.
Semantic analysis may include building a Simple Knowledge Organization System (SKOS) taxonomy so that searches can be expanded from broad terms/concepts (e.g., fighting) to more detailed terms (e.g., punching). Using SKOS can enrich movie activities with richer semantic meaning and context. Building the taxonomy may involve creating a large corpus by preprocessing action-movie scripts.
A skip-gram word2vec model can be built using this corpus. The skip-gram model may take each word in the movie script together with the surrounding words within a predefined window to build word pairs. The word pairs may be used as training data for a single-layer neural network. The trained neural network may then be used to predict the association probabilities of words.
Referring to fig. 9, a single-layer neural network input-output example 900 is depicted, and referring to fig. 10, a visualization 1000 of a portion of an illustrative SKOS taxonomy is depicted. Example 900 includes exemplary inputs and outputs of the trained neural network. Example 900 illustrates the computed probability that the more specific word "punch" appears in proximity to the broader word "fight", based on the text of an illustrative script that includes the broader action-oriented word "fight" and the more specific action-oriented word "punch". When the probability that a more specific word appears near a broader word exceeds a threshold, the user may incorporate the more specific concept associated with that word into the broader category associated with the broader word, so that scenes marked with the more specific concept also carry the broader label.
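A sketch of the skip-gram training step, using gensim, is shown below. It assumes the script corpus has already been tokenized into sentences; the parameter values and the tiny inline corpus are illustrative only.

```python
# Skip-gram word2vec sketch over a (toy) script corpus.
from gensim.models import Word2Vec

sentences = [
    ["he", "threw", "a", "punch", "during", "the", "fight"],
    ["the", "fight", "ended", "with", "a", "kick"],
    # ... many more sentences mined from action-movie scripts
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window around each word
    sg=1,             # 1 = skip-gram
    min_count=1,
    epochs=50,
)

# Words whose embeddings sit close to the broad term are candidates for
# skos:narrower concepts under it.
for word, score in model.wv.most_similar("fight", topn=5):
    print(word, round(score, 3))
```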
The structure 1000 shown in FIG. 10 illustrates the "skos:narrower" concept hierarchy. To establish the activity taxonomy, abstract terms may be created in the knowledge graph for the manually curated activities. An abstract term may be treated as a broad SKOS concept, and the manually curated activities may be defined as its skos:narrower concepts.
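The following sketch expresses a fragment of such a SKOS taxonomy with rdflib, with "fighting" as the broad concept and "punching" and "kicking" as skos:narrower activities. The concept URIs are placeholders.

```python
# SKOS taxonomy fragment sketch; concept URIs are assumed.
from rdflib import Graph, Namespace, RDF
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/activities#")  # assumed URI

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

for concept in (EX.fighting, EX.punching, EX.kicking):
    g.add((concept, RDF.type, SKOS.Concept))

# "fighting" is the broad concept; the curated activities sit below it.
g.add((EX.punching, SKOS.broader, EX.fighting))
g.add((EX.kicking, SKOS.broader, EX.fighting))
g.add((EX.fighting, SKOS.narrower, EX.punching))
g.add((EX.fighting, SKOS.narrower, EX.kicking))

# Expanding a search for the broad term to its narrower activities.
narrower = list(g.objects(EX.fighting, SKOS.narrower))
print(narrower)
```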
The method may include generating one or more synthetic metadata ontologies that interconnect action and/or emotion metadata with scenes of multimedia objects. The knowledge graph may include a metadata ontology.
Reference is now made to fig. 11, 12A, and 12B. Fig. 11 depicts an ontology 1100 comprising illustrative segments of multimedia metadata generated in accordance with the principles of the present invention. Similarly, FIG. 12A illustrates, via ontology 1200, how the scene 1207 shown in FIG. 12A (including the frame 1201 depicted in FIG. 12B) can be translated into triples of type "so:ObservableAction".
In the ontology 1100, the movie is represented by the identifier "movie123456", as shown in block 1102. In block 1103, the movie is typed as "ma:MediaResource", a concept of the media annotation ontology. The movie metadata is enriched with a number of movie attributes, including: the creation of the movie, using the "ma:createdIn" predicate 1105; the director of the movie, using the "so:hasDirector" predicate 1107; and the featured actors, using the "ma:features" predicate 1109.
Similarly, other instances have multiple properties associated with them, including a human-readable label, as shown by the dashed box 1104 associated with the "rdfs:label" predicate. Each instance is classified with a dashed box indicating its type in an external ontology, e.g., block 1103 indicating "ma:MediaResource". Other instances refer to classes defined in the scene ontology, such as the box 1110 representing "so:FictionalEntity". Further, each instance may be enriched with one or more links to external resources, such as the link shown in the dash-dotted box 1108.
In the ontology 1200 shown in FIG. 12A, a movie 1206 is shown to include a scene 1207, associated with the movie 1206 through the "ma:isFragmentOf" predicate 1202. The scene 1207 includes one or more moments, such as the "so:ObservableAction" moment 1204 in FIG. 12A, or a "so:MoodMoment" (not shown). The scene and moment identifiers are labeled with timestamps. The timestamps comply with the W3C Media Fragments URI recommendation. Timestamps are expressed with the URI fragment symbol "#" and the parameter "t=ts,te", where ts denotes the start time and te denotes the end time of the media fragment. As shown in fig. 12A, the scene 1207 is labeled with the timestamp "scene#t=60,316", indicating that the scene 1207 starts at the 60th second and ends at the 316th second.
As shown in fig. 12A, an observable action included in the scene 1207 and corresponding to the moment 1204, such as the action 1209, starts at the 195th second and ends at the 209th second, and may include, for example, a character 1213 walking on a plane wing 1215, as shown in fig. 12B. An instance of an observable action is associated with a scene instance through the "so:hasMoment" predicate (e.g., predicate 1208). Each "ObservableAction" defines an activity, such as the activity 1210, and has a "so:performingEntity", such as so:performingEntity 1203, or a "so:receivingEntity", such as so:receivingEntity 1205, or both (as shown in FIG. 12A). The "ObservableAction" and "Scene" instances are enriched with a set of temporal triples, which are omitted from FIG. 12A for simplicity of description.
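For illustration, the sketch below builds the instance-level triples just described (a scene identified by the media-fragment timestamp t=60,316 that contains an observable-action moment at t=195,209) using rdflib. The base URIs and the so:hasActivity property name are assumptions.

```python
# Instance-data sketch; base URIs and so:hasActivity are assumed.
from rdflib import Graph, Namespace, RDF, Literal

SO = Namespace("http://example.org/scene-ontology#")  # assumed URI
MA = Namespace("http://www.w3.org/ns/ma-ont#")
EX = Namespace("http://example.org/movies/")          # assumed URI

g = Graph()
movie = EX["movie123456"]
scene = EX["movie123456/scene#t=60,316"]
moment = EX["movie123456/scene#t=195,209"]

g.add((movie, RDF.type, MA.MediaResource))
g.add((scene, RDF.type, SO.Scene))
g.add((scene, MA.isFragmentOf, movie))
g.add((scene, SO.hasMoment, moment))
g.add((moment, RDF.type, SO.ObservableAction))
g.add((moment, SO.hasActivity, Literal("walking")))  # assumed property name

print(g.serialize(format="turtle"))
```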
The apparatus may relate to, and the method may comprise, querying the movie knowledge graph. The semantic enhancement module may be used to semantically represent video content and support analysis of multimedia objects. The semantic enhancement module can use emotions and/or related activities as core components for semantically enhancing movies and creating knowledge graphs.
As described above, the abstract semantic model may be designed according to a scene ontology, which may include one or more aspects of the media annotation ontology. The result data generated by the annotation tool can follow the same ontology/schema. Thus, a semantic knowledge graph can be generated, enabling sophisticated reasoning and query capabilities and supporting novel, personalized, customer-facing content representations.
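As a sketch of the query capabilities this enables, the following example runs a SPARQL query over such a knowledge graph to retrieve scenes containing a given emotion, ordered by an importance score. The property names mirror the earlier sketches and remain assumptions, as does the file name.

```python
# SPARQL query sketch over an assumed knowledge-graph file.
from rdflib import Graph

g = Graph()
g.parse("movie_knowledge_graph.ttl", format="turtle")  # assumed file

QUERY = """
PREFIX so: <http://example.org/scene-ontology#>
SELECT ?scene ?score WHERE {
  ?scene a so:Scene ;
         so:hasMoment ?moment ;
         so:importanceScore ?score .
  ?moment a so:MoodMoment ;
          so:hasEmotion ?emotion .
  FILTER(STR(?emotion) = "happiness")
}
ORDER BY DESC(?score)
"""

for row in g.query(QUERY):
    print(row.scene, row.score)
```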
As disclosed, an emotion-based labeling process can help create the knowledge graph. In some embodiments of the invention, the knowledge graph may in turn facilitate the creation of a shorter video summary of the movie, for example a summary that does not exceed 25% of the original length of the movie. Nevertheless, the summary may preserve the important narrative portions of the movie.
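A minimal sketch of the corresponding selection step is shown below: scenes are ranked by their knowledge-graph importance scores and greedily added until the summary reaches 25% of the running time, then re-ordered chronologically to preserve the narrative. The data structure and the source of the scores are assumptions for illustration.

```python
# Greedy summary-selection sketch under a 25% duration budget.
from dataclasses import dataclass

@dataclass
class ScoredScene:
    scene_id: str
    start: float   # seconds
    end: float     # seconds
    score: float   # importance for conveying the main narrative

    @property
    def duration(self) -> float:
        return self.end - self.start

def select_summary(scenes: list[ScoredScene], movie_length: float,
                   budget_ratio: float = 0.25) -> list[ScoredScene]:
    budget = movie_length * budget_ratio
    chosen, used = [], 0.0
    for scene in sorted(scenes, key=lambda s: s.score, reverse=True):
        if used + scene.duration <= budget:
            chosen.append(scene)
            used += scene.duration
    # Re-order chronologically so the summary preserves the narrative flow.
    return sorted(chosen, key=lambda s: s.start)
```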
In some embodiments of the invention, the annotation of the movie may include manual and/or automatic processing and semantic enhancement of the movie. The automatic part of the enhancement may include Natural Language Processing (NLP) techniques applied, for example, to the movie's descriptive audio commentary.
Other systems, methods, features and advantages of the invention will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims.
The description of the various embodiments of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent of this application, many relevant processes of multimedia object knowledge graph generation and summarization will be developed, and the scope of the term multimedia object knowledge graph generation and summarization is intended to include all such new technologies a priori.
The term "about" as used herein means ± 10%.
The terms "including", "having" and variations thereof mean "including but not limited to". This term includes the terms "consisting of … …" and "consisting essentially of … …".
The phrase "consisting essentially of …" means that the composition or method may include additional ingredients and/or steps, provided that the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. For example, the term "compound" or "at least one compound" may comprise a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration. Any "exemplary" embodiment is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the presence of other combinations of features of embodiments.
The word "optionally" is used herein to mean "provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless these features are mutually inconsistent.
In the present application, various embodiments of the present invention may be presented in a range format. It is to be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the present invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible sub-ranges as well as individual numerical values within that range. For example, a description of a range such as from 1 to 6 should be considered to have specifically disclosed sub-ranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6, etc., as well as individual numbers within that range, such as 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
When a range of numbers is indicated herein, the expression includes any number (fractional or integer) within the indicated range. The phrases "ranging between a first indicated number and a second indicated number" and "ranging from a first indicated number to a second indicated number" are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numbers therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination, or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims (20)

1. A system for summarizing a multimedia object having a main narrative, comprising:
a processor executing readable instructions to:
performing data analysis to identify, in each of a plurality of scenes of the multimedia object, one or more scene-related emotions indicated in the scene;
generating a knowledge graph associating each scene of the plurality of scenes with a respective one or more scene-related emotions;
calculating a plurality of scores using the knowledge graph, each score indicating a relative importance of a scene of the plurality of scenes to conveying the primary narrative;
selecting a subset of the plurality of scenes based on the plurality of scores;
generating a summary of the multimedia object from the subset.
2. The system of claim 1, wherein the data analysis comprises pre-processing the multimedia object, the pre-processing comprising extracting from the multimedia object at least one of:
a video file;
a subtitle file;
a chapter text file detailing a start time of chapters of the multimedia object;
audio file segments of actor speech and non-speech portions.
3. The system of claim 1, wherein the data analysis comprises:
erasing associated metadata describing the multimedia object;
analyzing the associated metadata to indicate the one or more scene-related emotions.
4. The system of claim 1, wherein the data analysis comprises performing semantic enhancement according to scene ontology to capture raw multimedia information of the multimedia object.
5. The system of claim 1, wherein the data analysis comprises an interconnection with an external source that characterizes the multimedia object.
6. The system of claim 5, wherein the characteristics comprise at least one of:
a scene of the multimedia object;
an activity in the multimedia object scene;
an actor performing in the multimedia object;
a character depicted in the multimedia object.
7. The system of claim 1, wherein the data analysis comprises analyzing descriptive audio soundtracks of the multimedia object to indicate the one or more scene-related emotions.
8. The system of claim 1, wherein the data analysis comprises extracting the emotion from visual emotion indicators, the visual emotion indicators comprising at least one of:
a facial expression image;
a body posture image;
a video sequence of mood-indicative behavior.
9. The system of claim 1, wherein the data analysis includes extracting the emotion from auditory emotion indicators, the auditory emotion indicators including at least one of:
representing the mood of the musical soundtrack;
an emotional implicit vocal indicator.
10. The system of claim 1, wherein the data analysis comprises extracting the emotion from a textual emotion indicator comprising at least one of:
an explicit mood descriptor;
a suggestive mood indicator.
11. A method for summarizing a multimedia object having a main narrative, comprising:
performing data analysis to identify, in each of a plurality of scenes of the multimedia object, one or more scene-related emotions indicated in the scene;
generating a knowledge graph associating each scene of the plurality of scenes with a respective one or more scene-related emotions;
calculating a plurality of scores using the knowledge graph, each score indicating a relative importance of a scene of the plurality of scenes to conveying the primary narrative;
selecting a subset of the plurality of scenes based on the plurality of scores;
generating a summary of the multimedia object from the subset.
12. The method of claim 11, wherein the data analysis comprises pre-processing the multimedia object, the pre-processing comprising extracting at least one of the following from the multimedia object:
a video file;
a subtitle file;
a chapter text file detailing a start time of a chapter of the multimedia object;
audio file segments of actor speech and non-speech portions.
13. The method of claim 11, wherein the data analysis comprises:
erasing associated metadata describing the multimedia object;
analyzing the associated metadata to indicate the one or more scene-related emotions.
14. The method of claim 11, wherein the data analysis comprises performing semantic enhancement according to scene ontology to capture raw multimedia information of the multimedia object.
15. The method of claim 11, wherein the data analysis comprises an interconnection with an external source that characterizes the multimedia object.
16. The method of claim 15, wherein the characteristic comprises at least one of:
a scene of the multimedia object;
an activity in the multimedia object scene;
an actor performing in the multimedia object;
a character depicted in the multimedia object.
17. The method of claim 11, wherein the data analysis comprises analyzing descriptive audio soundtracks of the multimedia object to indicate the one or more scene-related emotions.
18. The method of claim 11, wherein the data analysis comprises extracting the emotion from visual emotion indicators, the visual emotion indicators comprising at least one of:
a facial expression image;
a body posture image;
a video sequence of mood-indicative behavior.
19. The method of claim 11, wherein the data analysis includes extracting the emotion from auditory emotion indicators, the auditory emotion indicators including at least one of:
representing the mood of the musical soundtrack;
an emotional implicit vocal indicator.
20. The method of claim 11, wherein the data analysis comprises extracting the emotion from a textual emotion indicator, the textual emotion indicator comprising at least one of:
an explicit mood descriptor;
a suggestive mood indicator.
CN202080071054.0A 2020-01-30 2020-01-30 Method and device for labeling emotion related metadata to multimedia file Pending CN114503100A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/052216 WO2021151485A1 (en) 2020-01-30 2020-01-30 A method and apparatus to annotate mood related metadata to multimedia files

Publications (1)

Publication Number Publication Date
CN114503100A true CN114503100A (en) 2022-05-13

Family

ID=69467513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080071054.0A Pending CN114503100A (en) 2020-01-30 2020-01-30 Method and device for labeling emotion related metadata to multimedia file

Country Status (2)

Country Link
CN (1) CN114503100A (en)
WO (1) WO2021151485A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220006926A (en) * 2020-07-09 2022-01-18 삼성전자주식회사 Device and method for generating summary video

Also Published As

Publication number Publication date
WO2021151485A1 (en) 2021-08-05

Similar Documents

Publication Publication Date Title
Labatut et al. Extraction and analysis of fictional character networks: A survey
CN107918653B (en) Intelligent playing method and device based on preference feedback
Sundaram et al. A utility framework for the automatic generation of audio-visual skims
KR101715971B1 (en) Method and system for assembling animated media based on keyword and string input
WO2019100350A1 (en) Providing a summary of a multimedia document in a session
CN106462640B (en) Contextual search of multimedia content
JP2008234431A (en) Comment accumulation device, comment creation browsing device, comment browsing system, and program
JP2021069117A5 (en)
CN115082602B (en) Method for generating digital person, training method, training device, training equipment and training medium for model
CN111279333B (en) Language-based search of digital content in a network
CN111506794A (en) Rumor management method and device based on machine learning
US20230022966A1 (en) Method and system for analyizing, classifying, and node-ranking content in audio tracks
Lv et al. Understanding the users and videos by mining a novel danmu dataset
Wu et al. Mumor: A multimodal dataset for humor detection in conversations
Qi et al. Fakesv: A multimodal benchmark with rich social context for fake news detection on short video platforms
Sharma et al. Sentiments mining and classification of music lyrics using SentiWordNet
Yang et al. Multimodal indicators of humor in videos
CN114503100A (en) Method and device for labeling emotion related metadata to multimedia file
US11386163B2 (en) Data search method and data search system thereof for generating and comparing strings
Hussien et al. Multimodal sentiment analysis: a comparison study
Chang et al. Report of 2017 NSF workshop on multimedia challenges, opportunities and research roadmaps
Dalla Torre et al. Deep learning-based lexical character identification in TV series
Salim An Alternative Representation of Video via Feature Extraction (RAAVE)
Li Enabling Structured Navigation of Longform Spoken Dialog with Automatic Summarization
Zhu Hotspot Detection for Automatic Podcast Trailer Generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination