US12266175B2 - Combining visual and audio insights to detect opening scenes in multimedia files - Google Patents

Combining visual and audio insights to detect opening scenes in multimedia files

Info

Publication number
US12266175B2
Authority
US
United States
Prior art keywords
scene
media file
classification
index data
opening song
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US18/090,843
Other versions
US20240221379A1 (en)
Inventor
Yonit Hoffman
Mordechai Kadosh
Zvi Figov
Eliyahu Strugo
Mattan Serry
Michael Ben-Haym
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to US18/090,843
Assigned to Microsoft Technology Licensing, LLC. Assignment of assignors interest (see document for details). Assignors: Ben-Haym, Michael; Figov, Zvi; Hoffman, Yonit; Kadosh, Mordechai; Serry, Mattan; Strugo, Eliyahu
Publication of US20240221379A1
Application granted
Publication of US12266175B2
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/244 Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font
    • G06V30/245 Font recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/102 Programmed access in sequence to addressed parts of tracks of operating record carriers

Definitions

  • Multimedia files, for example series episodes and movies, often include an introduction with an opening song (hereinafter “opening song”).
  • The characteristics of an opening song can vary, such as with regard to length, audio/visual (AV) content, and the temporal location of the opening song within the video.
  • Some opening songs play concomitantly with the opening credits, while others do not.
  • Some opening songs play at the very beginning of the episode or movie, while others may play after one or two scenes of the episode or movie have already transpired.
  • The ability to detect the placement of an opening song in a media file can be important for facilitating playback functionality, as well as for post-production editing of the content.
  • For example, the ability to automatically detect an opening song can facilitate a ‘skip intro’ capability, where a viewer can jump right to the main multimedia content and pass over the opening song.
  • Precision is required: one must be able to detect the exact beginning and end of the opening song. Otherwise, a portion of the main content may be inadvertently skipped, rather than just the opening song.
  • The detection of an opening song can also facilitate selective processing of the media content (e.g., editing of the introduction in different languages).
  • Disclosed embodiments are directed to systems and methods for classifying portions of multimedia content included in a media file.
  • Systems and methods are provided for facilitating the automatic detection of opening scene(s) (e.g., a predefined introduction or opening song) in multimedia content of a media file.
  • The disclosed embodiments may include or be practiced on computing systems configured with modules for implementing the disclosed methods.
  • The disclosed methods include acts for designating sequential blocks of time in the multimedia content as scene(s), then detecting certain feature(s) of those scene(s).
  • The extracted scene feature(s) may be analyzed by machine learning model(s), or other types of artificial intelligence (AI) model(s), to classify those scenes as either part of, or not part of, the introduction/opening song, based on a probability derived from the scene feature(s).
  • The machine learning model(s) may be trained so as to give higher or lower weight to certain scene feature(s), based on the past success of those feature(s) in accurately predicting whether a scene is part of the introduction/opening song.
  • FIG. 1 illustrates an example of how multimedia content of a multimedia file can be broken up into sequential blocks of time designated as scenes, in order to detect the location of a predefined introduction/opening song.
  • FIG. 2A illustrates an example flow chart for how scene(s) in a multimedia file can be analyzed for certain feature(s), and those feature(s) used to classify the scene(s) as either being part of, or not being part of, a predefined introduction/opening song.
  • FIG. 2B illustrates an example flow chart for how different types of data from multimedia content included in a multimedia file—for example, visual and/or audio data—can be analyzed for certain scene feature(s).
  • FIG. 3 illustrates an example flow chart for how scene feature(s) can be analyzed using a scene classification model to determine a probability that each scene is part of a predefined introduction/opening song, and to classify each scene as either part of, or not part of, the introduction/opening song.
  • FIG. 4 illustrates an example graph showing how a probability assigned to each scene in a multimedia file may be used to determine which scene, or series of scenes, constitutes a predefined introduction/opening song.
  • FIG. 5 illustrates an example flow chart for automatically detecting a predefined introduction/opening song in a multimedia file.
  • FIG. 6 illustrates an example computing environment in which a computing system incorporates and/or is utilized to perform aspects of the disclosed embodiments.
  • Some of the disclosed embodiments are directed toward systems and methods for detecting a particular portion of multimedia files based on features extracted from the multimedia content of a multimedia file, as well as for tagging or otherwise indexing the detected portions of the multimedia files.
  • A multimedia file, or media file, comprises multimedia content with associated metadata about the multimedia content. Additionally, the multimedia content can be formatted in various file formats. In some instances, the multimedia file comprises the raw multimedia content or a compressed or compiled version of the multimedia content. Multimedia content refers to electronic content which comprises multiple different media types, for example, audio content and visual content. Features can be extracted from the multimedia content, wherein certain features correspond to each type of media represented in the multimedia content. It should be appreciated that the systems and methods disclosed, while described in application to multimedia content, may also be applied to media files comprising a single type of media.
  • Some of the disclosed embodiments are specifically directed to improved systems and methods for automatically detecting opening scene(s) (e.g., an opening song) that are included in the multimedia file. This can be beneficial, particularly when considering conventional systems, for enabling playback and post-production editing of the media files without requiring manual review and editing of each file being processed. For at least this reason, the disclosed embodiments may be implemented to provide many technical advantages over existing media processing systems, as will now be described in more detail.
  • A frame refers to any temporal unit associated with the multimedia file.
  • A frame is selected based on structural and semantic properties associated with the temporal unit.
  • In some instances, a frame refers to a temporal unit comprising a still image associated with the multimedia file. In such instances, a plurality of frames is combined to form a moving picture.
  • In other instances, a frame refers to a temporal unit comprising a limited portion of an audio file. In such instances, a plurality of frames is combined to form continuous audio.
  • Even a multimedia file comprising only a few minutes of multimedia content can contain thousands of frames. This results in a high computational cost, whether manually or using a computing system, to process the frames, identify which frames correspond to the opening song, and then tag the frames that have been identified.
  • Disclosed embodiments are directed to improved systems and methods for detection of an opening song to overcome the disadvantages of current detection solutions.
  • The system of the present disclosure differs from prior systems in that it allows for the automatic detection of the opening song and automatic tagging of the multimedia file, without the need for manual/human detection or tagging.
  • This automation significantly reduces the time and cost required to process and edit a multimedia file with segment tagging. It can also improve the consistency with which segment boundaries are identified, at least as compared to the subjective/arbitrary tagging that sometimes results from human error and variations in human perception.
  • The disclosed embodiments are able to achieve these benefits of automatic detection by segmenting the multimedia file into scenes, rather than frames, as the building blocks for analyzing the multimedia file.
  • Each segment of the multimedia file comprises a particular portion of multimedia content included in the multimedia file.
  • The technical advantage of this is that there are far fewer scenes than frames in a multimedia file, so analyzing a limited number of scenes instead of thousands of frames significantly reduces the computational expense.
  • A scene refers to a particular portion of the multimedia content which is characterized by having continuous and distinct features relative to an adjacent portion of the multimedia content.
  • A scene is a multi-modality object which is extracted from electronic content. Scenes can be extracted based on visual and/or audio features. For example, a scene is typically associated with a particular set or environment in which the characters of the story are interacting. When one or more characters begin interacting in a different location of the story (e.g., set or geolocation), typically a new scene has begun. In some instances, a scene involves the same set of characters, or at least one or more of the same characters, for some continuous length of time in the same environment. Because of the ability to detect a scene, features associated with a particular scene can be extracted and analyzed to determine which scenes are associated with the opening song.
  • Additional technical benefits include improved training of machine learning models used to automatically detect the opening song, resulting in improved machine learning models which are more accurate, consistent, and transparent. Because multiple different features are extractable from the different scenes of the multimedia file, the machine learning model can be trained on different sets of features which help it to detect opening songs of new multimedia files. Each scene corresponds to a particular subset of all features that are or can be extracted from the multimedia content included in multimedia file. Features can also be extracted from metadata included in the multimedia file which corresponds to the multimedia content. These features contribute to the model, both in training and during run-time.
  • Each feature can be assigned, either manually or by a machine learning model, a particular weight that reflects how much that feature contributes to the prediction that the scene corresponds to the opening song.
  • Some features may be more indicative or distinctive of an opening song than other features. For example, some features, like a series of written names appearing on the scene, may correlate to the opening song more than other features, like a background melody playing, which may appear more frequently throughout the entire multimedia content instead of exclusively in the opening scene.
  • the results of the machine learning model are more transparent to users. For example, a user is able to understand why the machine learning model returned a particular result (e.g., why a particular scene or set of scenes was detected as the opening song portion) because the user is able to see which features were identified and how important each feature was (e.g., the weight applied to each feature) in predicting whether the scene(s) corresponded to the opening song.
  • A user can tune how many features are to be extracted and processed. For example, if a user wants to improve the accuracy of the results, the user or the machine learning model can select a higher or total number of the features available.
  • Alternatively, to reduce computation, the user or the machine learning model can select a lower or limited number of the features available (e.g., the categories of features that have the highest weights).
  • When a new feature becomes available, the machine learning model is able to be trained on that new feature in isolation, or in combination with existing features, to update the model to be able to extract the new feature. The model can then use this new feature to augment and improve the detection of the opening song.
  • The machine learning model is also configured to learn which feature, or combination of features, results in more accurate detection of the opening song for a particular type of media file. Accordingly, the machine learning model can add or omit certain features dynamically upon determining a particular type of media file.
  • FIG. 1 illustrates, in one embodiment, a multimedia file 100 which is separated into sequential scene(s) 120 (e.g., scene 1, scene 2, scene 3, scene 4, scene 5, scene 6, scene 7, scene 8, scene 9, scene 10, scene 11, scene 12, scene 13, scene 14, scene 15, scene 16, scene 17, and so forth). This may be done manually, or by an AI model, such as a machine learning model or other type of AI model.
  • For example, an AI model may recognize that the people/characters in a series of sequential shots in the multimedia file 100 do not change, or that the background scenery does not change.
  • A shot is an inner unit of the scene, wherein a plurality of shots is identified in a media file. For example, a shot is a contiguous sequence of frames with the same or similar camera angle.
  • A sequential subset of the plurality of shots is then aggregated into a scene. In some instances, multiple different subsets, where a subset comprises a certain number of shots, are aggregated into different scenes.
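The shot-to-scene aggregation described above can be sketched as follows. This is an illustrative assumption, not the patent's actual algorithm: each hypothetical `Shot` record carries a set of detected labels (characters, setting), and adjacent shots are grouped while their label overlap stays above a chosen threshold.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    start: float         # seconds (hypothetical record shape)
    end: float
    signature: set       # detected character/background labels (assumed)

def aggregate_shots_into_scenes(shots, min_overlap=0.5):
    """Group contiguous shots into scenes while adjacent shots share
    enough visual labels; the 0.5 threshold is an assumed value."""
    scenes, current = [], [shots[0]]
    for prev, nxt in zip(shots, shots[1:]):
        union = prev.signature | nxt.signature
        overlap = len(prev.signature & nxt.signature) / len(union) if union else 1.0
        if overlap >= min_overlap:
            current.append(nxt)      # same characters/scenery: same scene
        else:
            scenes.append(current)   # visual break: start a new scene
            current = [nxt]
    scenes.append(current)
    return scenes
```

A run of kitchen shots with the same character would form one scene, and a cut to a new location would open the next.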
  • The model may classify a group of shots, frames, or blocks of sequential time 110 in the multimedia file as individual scene(s) 120. As shown in FIG. 1, the file may be broken up into blocks of time 110 that are designated as sequential scene(s) 120.
  • Scene(s) 120 may be determined by analyzing both audio and visual data from multimedia content of the multimedia file 100 .
  • The disclosed systems also analyze each scene, extract one or more features for each scene, and predict how likely it is that each scene corresponds to the opening song, such that the system is able to detect which scene(s) correspond to the opening song.
  • In the example of FIG. 1, scenes 1-8 constitute the opening song 130.
  • In some instances, all of the scenes in the file are analyzed.
  • In other instances, once the opening song has been detected, the system refrains from analyzing the rest of the scenes, which reduces the computational expense of processing the file.
  • Each scene 120 may be characterized by certain feature(s) 220.
  • The multimedia file 100 is processed using a Feature Extraction Model 210 to determine whether a particular scene possesses certain feature(s) 220.
  • The particular scene is then classified by a Scene Classification Model 230 to determine the probability 410 (shown in FIG. 4) that the scene is associated with the opening song 130. This process is repeated for some or all of the scenes in the multimedia file.
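The two-stage flow above can be sketched as a loop over scenes; the two callables are stand-ins for Feature Extraction Model 210 and Scene Classification Model 230, not implementations of them.

```python
def classify_scenes(scenes, extract_features, classify):
    """Per-scene pipeline sketch: extract features for each scene,
    then classify those features. Both model arguments are stand-ins."""
    return [classify(extract_features(scene)) for scene in scenes]
```

For example, with a toy extractor and classifier, `classify_scenes` returns one classification result per scene, in order.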
  • Different types of data from the multimedia content of multimedia file 100, or from metadata included in multimedia file 100, are analyzed by the Feature Extraction Model(s) 210 to determine the feature(s) 220 associated with each scene 120.
  • For example, visual data 250 from the multimedia content of multimedia file 100 may be analyzed by Feature Extraction Model(s) 210, such as an OCR model or a Black Frame Detector model, to determine extracted scene feature(s) 220.
  • An OCR model may be applied to text identified in the scene, in order to perform character recognition on the text.
  • The Black Frame Detector model may be used to detect a black or blank frame within a predetermined proximity to the scene (e.g., before or after a particular scene).
  • The extracted scene feature(s) 220 taken from visual data 250 may include, for example: the number of known words (detecting words commonly used in opening credits, such as “introducing,” “producer,” or “produced by”); people names (because opening credits usually list the names of the actors, producers, etc.); and/or the existence of a black frame within a predetermined proximity of that scene.
  • The extracted scene feature(s) 220 taken from visual data 250 may also include font characteristics of text that appears on the screen, including font size.
  • The extracted scene feature(s) 220 taken from visual data 250 may also include known media assets, such as known TV show or movie names.
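The visual features above can be sketched from a scene's OCR output. The `CREDIT_TERMS` list and the capitalized-word-pair heuristic for people names are illustrative assumptions; a production system would use trained name recognition rather than a regex.

```python
import re

# Hypothetical keyword list; the description mentions words like
# "introducing," "producer," and "produced by" as known credit terms.
CREDIT_TERMS = {"introducing", "producer", "produced by", "directed by", "starring"}

def visual_features(ocr_text, has_black_frame_nearby):
    """Derive visual scene features from OCR text and black-frame detection."""
    text = ocr_text.lower()
    known_words = sum(1 for term in CREDIT_TERMS if term in text)
    # Crude people-name heuristic (assumption): capitalized word pairs.
    people_names = len(re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", ocr_text))
    return {
        "known_credit_words": known_words,
        "people_name_count": people_names,
        "black_frame_nearby": int(has_black_frame_nearby),
    }
```

A credits card reading "Produced by Jane Smith introducing John Doe" would score two known credit terms and two name-like pairs.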
  • Audio data 260 may also be analyzed by Feature Extraction Model(s) 210, such as a speech-to-text model, a music detector model, and/or a silence detector model.
  • The Feature Extraction Model(s) 210 may identify a language spoken in the scene and employ a speech-to-text function for the recognition and translation of the spoken language into text.
  • The Feature Extraction Model(s) 210 may detect any music that plays during a scene and recognize the duration of time that the music plays during the scene.
  • The feature extraction may also detect a particular volume associated with the music, a volume change from one scene to the next, or a volume contrast of music relative to spoken words.
  • The Feature Extraction Model(s) 210 may detect silence of a predetermined duration that occurs either within a scene or within a predetermined time after a scene ends.
  • The extracted scene feature(s) 220 taken from audio data 260 may include, for example: the rate of words spoken during a scene, the number of speakers participating in the scene, the amount of time that music plays in the scene, and/or the presence of silence near the end of a scene or just after the scene.
  • The Feature Extraction Model(s) 210 used to analyze audio data 260 may involve diarization model(s).
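Given transcript and diarization output, the audio features above reduce to simple aggregations. The input shapes here (a word list, per-word speaker labels, and music time spans) are assumptions about upstream model output, not the patent's interfaces.

```python
def audio_features(transcript_words, speaker_labels, music_spans, scene_start, scene_end):
    """Compute per-scene audio features: speech rate, speaker count,
    and seconds of music overlapping the scene's time window.

    Assumed inputs: transcript_words is a list of words spoken in the scene,
    speaker_labels is a per-word diarization label list, and music_spans is
    a list of (start, end) times in seconds."""
    duration = scene_end - scene_start
    words_per_second = len(transcript_words) / duration if duration else 0.0
    num_speakers = len(set(speaker_labels))
    # Clip each music span to the scene window and sum the overlap.
    music_seconds = sum(
        min(end, scene_end) - max(start, scene_start)
        for start, end in music_spans
        if end > scene_start and start < scene_end
    )
    return {
        "speech_rate": words_per_second,
        "speaker_count": num_speakers,
        "music_seconds": music_seconds,
    }
```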
  • The audio and visual data together 270 may be analyzed by Feature Extraction Model(s) 210, such as a scenes creation model, to determine extracted scene feature(s) 220, including, for example, the duration of the scene, the number of shots comprising the scene, and the scene location within multimedia file 100.
  • An opening song is typically at or near the beginning of a multimedia file.
  • The results of the feature analyses may be aggregated and used to classify each scene 120 with a probability that the scene is a part of the opening song 130 (as shown in FIG. 3).
  • The extracted scene feature(s) 220 data may be analyzed using Scene Classification Model 230 to determine classified scene probability data 310, wherein each scene 120 has been assigned a probability 410 (shown in FIG. 4) that the scene 120 is part of the opening song 130.
  • The opening song 130 may be determined using, for example, algorithmic heuristics which locate a sequence of scene(s) 120 that have been assigned sufficiently high probabilities 410 of being part of the opening song 130.
  • Scene Classification Model 230 may be trained using a Scene Correction Model 320 , which determines the success of each scene feature 220 in predicting the probability that scene 120 is part of the opening song 130 . The result would be that in subsequent applications of Scene Classification Model 230 , certain feature(s) 220 may be given more classification weight and other feature(s) 220 may be given less classification weight when determining the probability that a scene from scene(s) 120 is part of the opening song 130 , based on the past success (or failure) of those feature(s) 220 to correctly predict that the scene was part of the opening song 130 . For example, different text (either from the audio or visual data) associated with a scene may be given different classification weights by the Scene Classification Model 230 .
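One minimal way to picture the weighted classification described above is a logistic score over the extracted features. The feature names, weights, and bias below are illustrative assumptions; in the described system the weights would be learned and adjusted by Scene Correction Model 320 based on past prediction success.

```python
import math

# Illustrative weights (assumptions), stand-ins for learned values.
WEIGHTS = {
    "known_credit_words": 0.9,
    "people_name_count": 0.6,
    "black_frame_nearby": 0.4,
    "music_seconds": 0.05,
    "speech_rate": -0.3,   # dialogue-heavy scenes are less intro-like
}
BIAS = -2.0

def opening_song_probability(features):
    """Combine weighted scene features into a probability via a sigmoid."""
    score = BIAS + sum(WEIGHTS.get(name, 0.0) * value
                       for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))
```

Under these weights, a credits-heavy scene scores above 0.5 while a dialogue-heavy scene scores well below it, mirroring the intent of giving more classification weight to features that predicted the opening song successfully in the past.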
  • Classified scene probability data 310 provides the probabilities 410 that each scene 120 is a part of the opening song 130 and can be analyzed to find the sequence of scene(s) 120 with sufficiently high probabilities 410 so as to constitute the opening song 130.
  • Knowledge of adjacent scenes 120 may be combined to detect the full opening song 130.
  • In the example of FIG. 4, scenes 2-4 and 6-8 have been assigned a fairly high probability 410 of being part of the opening song, while scene 5 has been assigned a lower probability 410 of being part of the opening song.
  • In this case, scene 5 will be designated as part of the opening song, along with scenes 2-4 and 6-8, because of the high probability 410 of its neighboring scenes.
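The neighbor-aware heuristic above can be sketched as a span finder that fills isolated low-probability gaps between high-probability neighbors; the 0.5 threshold and one-scene gap size are assumed values, not values from the patent.

```python
def find_opening_span(probabilities, threshold=0.5, max_gap=1):
    """Return (first, last) scene indices of the longest high-probability run,
    after filling gaps of up to max_gap low scenes between high neighbors."""
    high = [p >= threshold for p in probabilities]
    # Fill short interior gaps (the scene-5 case): low scenes flanked by highs.
    i = 0
    while i < len(high):
        if not high[i]:
            j = i
            while j < len(high) and not high[j]:
                j += 1
            if 0 < i and j < len(high) and (j - i) <= max_gap:
                for k in range(i, j):
                    high[k] = True
            i = j
        else:
            i += 1
    # Pick the longest contiguous run of high scenes.
    best, start = None, None
    for idx, flag in enumerate(high + [False]):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            if best is None or (idx - start) > (best[1] - best[0] + 1):
                best = (start, idx - 1)
            start = None
    return best
```

With probabilities shaped like the FIG. 4 example (scenes 2-4 and 6-8 high, scene 5 low), the low scene 5 is absorbed and the detected span covers scenes 2-8, while the low opening scene 1 stays excluded.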
  • The machine learning model is tunable in determining how many more scenes should be analyzed, after a scene is predicted to have a low probability score, before refraining from analyzing the rest of the file.
  • In other instances, all of the scenes are analyzed.
  • The scenes which correspond to the opening song can then be tagged. These tags can be used in post-editing to insert “intro skip” functions that let a user skip over the opening song while streaming the media file.
  • In some instances, the system analyzes the relative relationship between probability scores. For example, if a low probability score is followed by a high probability score, with a score difference that meets or exceeds a predetermined threshold, the scene with the high probability score is likely the beginning of the opening song. If a high probability score is followed by a low probability score, with a score difference that meets or exceeds a predetermined threshold, the system may predict that the scene with the high probability score is the end of the opening song.
  • The system may determine that there is no opening song included in the media file, for example, if there is not a large enough difference between the probability scores of different scenes, or if the probability scores of the scenes do not meet or exceed a predetermined threshold value.
  • The threshold value can be predetermined by a user or learned and set automatically by the computing system. In some instances, a different threshold value is chosen for different types or categories of media files, or the threshold can be dynamically updated based on identifying the type of media file or based on certain features extracted for one or more scenes.
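The relative-difference heuristic above can be sketched as a scan for large probability jumps; the 0.4 jump threshold is an assumed value standing in for the predetermined threshold.

```python
def detect_boundaries(probabilities, jump=0.4):
    """Mark a large upward jump as the candidate start of the opening song
    and a large downward jump as its candidate end. Returns (start, end)
    scene indices, or None when no opening song is detected."""
    start = end = None
    for i in range(1, len(probabilities)):
        delta = probabilities[i] - probabilities[i - 1]
        if delta >= jump and start is None:
            start = i          # low-to-high jump: likely beginning
        elif delta <= -jump and start is not None and end is None:
            end = i - 1        # high-to-low jump: likely end
    if start is None or end is None:
        return None            # differences too small: no opening song
    return start, end
```

Note that flat probability curves fall through to `None`, matching the described case where no opening song is included in the media file.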
  • FIG. 5 illustrates a flow diagram that includes various acts (act 510, act 520, act 530, act 540, act 550, and act 560) associated with example methods that can be implemented by computing system 600 for performing opening song detection using a machine learning model.
  • The disclosed methods for automatically detecting an opening song 130 in a multimedia file 100 may include, for example, the initial act of accessing a multimedia file 100 which contains multimedia content (act 510).
  • The multimedia file 100 may then be analyzed (as described above with regard to FIG. 1) to identify a scene (e.g., scene(s) 120) in the multimedia file 100 (act 520).
  • Feature(s) 220 of the multimedia content included in the multimedia file may be determined relative to their association with each scene 120 (act 530), as described above with regard to FIGS. 2A and 2B.
  • Each scene 120 may then be scored with a probability 410 that the scene corresponds to a predefined opening song 130 (act 540), as described above regarding FIGS. 3-4. Based on the probability 410, the scene is then classified as correlating, or not correlating, to the opening song 130 (act 550).
  • The classifying weight of at least one feature 220 may be modified by Scene Correction Model 320 when determining a probability 310 that a new scene 120b is part of the opening song 130, based on the success of the machine learning model in accurately predicting whether the first scene 120 was part of the opening song 130 (act 560).
  • The temporal location of the opening song 130 can be stored as index data that is associated with the multimedia file 100.
  • The index data may be associated with the multimedia file 100 by adding the index data as new metadata to the multimedia file 100, or by adding the index data to an index that is stored separately from the multimedia file 100.
  • The index data may be associated with the multimedia file 100 in such a way as to enable a trick play function to be performed on the multimedia file 100, during which the index data is referenced for skipping or fast-forwarding the opening scene(s) of the multimedia file 100 during the trick play function.
  • The method further comprises generating index data that identifies a temporal location of the opening song in the media file and associating the index data with the media file.
  • The system can associate the index data with the media file according to several different techniques. For example, the system performs the association by adding the index data as metadata to the media file, or by adding the index data to an index that is stored separately from the media file. Additionally, or alternatively, the index data is associated with the media file in such a manner as to enable a trick play function to be performed on the media file, during which the index data is referenced for skipping or fast-forwarding the opening song and corresponding scenes of the media file.
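The separately-stored-index option above can be sketched as a small record, serializable as a sidecar, plus a trick-play helper that consults it. The schema (field names, a `segments` list) is an illustrative assumption, not a format from the patent.

```python
def build_index_data(media_path, opening_start_s, opening_end_s):
    """Build index data recording the detected temporal location of the
    opening song; the dict is ready to serialize (e.g., as JSON) either
    into the media file's metadata or into a separate index."""
    return {
        "media_file": media_path,
        "segments": [
            {"label": "opening_song",
             "start_seconds": opening_start_s,
             "end_seconds": opening_end_s},
        ],
    }

def skip_intro_position(index_data, current_s):
    """Trick-play helper: if playback sits inside the opening song,
    return the position just past it; otherwise return the current position."""
    for seg in index_data["segments"]:
        if (seg["label"] == "opening_song"
                and seg["start_seconds"] <= current_s < seg["end_seconds"]):
            return seg["end_seconds"]
    return current_s
```

A player's skip-intro button would call `skip_intro_position` with the current playback time and seek to the returned position.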
  • In some instances, identifying the feature includes identifying text in the scene by applying an OCR model to perform character recognition on the text identified in the scene, wherein different text is associated with different classification weights.
  • In other instances, identifying the feature includes identifying a black frame within a predetermined proximity to the scene. Additionally, or alternatively, the system identifies one or more features by identifying a language spoken in the scene and by applying a speech-to-text model for the recognition and translation of spoken language into text, wherein different text is associated with different classification weights.
  • The feature may also be identified using a music detector to detect any music that plays during the scene and to recognize the duration of time that the music plays during the scene, or using a silence detector to detect audio silence of a predetermined duration that occurs either within the scene or within a predetermined duration after the scene ends.
  • Some example features that are identified are obtained from visual data and include font characteristics of text (e.g., font size). Other features obtained from visual data include: particular words; names of people; terms associated with production titles or credit attribution; OCR data; a size of text that appears on the screen; media assets; or a black frame within a scene or within a predetermined proximity to the scene. Some example features that are identified are obtained from audio data, including: a rate at which words are spoken during the scene; a quantity of unique speakers in the scene; a duration of time within the scene that music is played; or a predetermined duration of silence after the scene ends.
  • Some example features that are identified are obtained from both visual and audio data, including: a scene duration, a quantity of camera shots in the scene, or a location of the scene within the media file.
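Taken together, the visual, audio, and combined features above can be collected into a per-scene record; the field names and values below are illustrative placeholders, not a schema from the disclosure:

```python
from dataclasses import dataclass, asdict

# Hypothetical per-scene feature record combining the visual, audio, and
# combined cues listed above. All names are illustrative.
@dataclass
class SceneFeatures:
    credit_word_count: int      # visual: words like "producer", "directed by"
    person_name_count: int      # visual: names recognized in OCR output
    has_black_frame: bool       # visual: black frame within/near the scene
    words_per_second: float     # audio: rate at which words are spoken
    unique_speakers: int        # audio: quantity of unique speakers
    music_seconds: float        # audio: duration of music in the scene
    scene_duration: float       # visual+audio: scene duration
    shot_count: int             # visual+audio: quantity of camera shots
    position_in_file: float     # visual+audio: 0.0 = start of file, 1.0 = end

    def as_vector(self):
        # Flatten to a numeric vector for a classification model.
        return [float(v) for v in asdict(self).values()]

fx = SceneFeatures(12, 8, True, 0.2, 1, 80.0, 90.0, 14, 0.02)
```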
  • the system is able to classify the scene as either correlating, or not correlating, to the opening song.
  • the system is able to classify the scene based at least in part on knowledge of neighboring scenes, in addition to features corresponding to the scene being classified. Additionally, the system is able to learn which features from a scene are associated with a higher probability that the scene correlates to an opening song and which features from a scene are associated with a lower probability that the scene correlates to the opening song.
  • the system accesses a media file containing multimedia content and applies the trained machine learning model (i.e., classification model) to the media file to identify a temporal location of an opening song identified in the multimedia content of the media file. The system then generates index data that identifies the temporal location of the opening song in the multimedia content, based on the identified temporal location, and associates the index data with the media file.
  • FIG. 6 illustrates components of a computing system 600 which may include and/or be used to implement aspects of the disclosed methods.
  • the computing system 600 comprises a plurality of AI models—for example, a Feature Extraction Model 210 , a Scene Classification Model 230 , and a Scene Correction Model 320 .
  • Computing system 600 is able to utilize different AI models, and/or different types of AI models.
  • Scene Classification Model 230 is configured as a machine learning model, such as a classification model.
  • a machine learning model is a particular type of AI model which is configured to recognize and learn from patterns identified in datasets and utilize that learning to improve itself for a particular task for which the machine learning model is trained.
  • a classification model is a type of machine learning model trained to perform one or more different classification tasks. Classification tasks are a type of predictive modeling problem in which the machine learning model is trained to predict a class label for a particular set or subset (e.g., a scene) of input data (e.g., multimedia content of a multimedia file).
  • FIG. 6 illustrates how the computing system 600 is one part of a distributed computing environment that also includes remote (e.g., third party) system(s) 660 in communication (via a network 650 ) with the computing system 600 .
  • the computing system 600 is configured to train a plurality of AI models for automatically detecting scene feature(s) and probabilities that scene(s) are part of a predefined opening song.
  • the computing system 600 is also configured to generate training data configured for training the AI models.
  • the computing system 600 includes a processing system with one or more processor(s) 610 (such as one or more hardware processor(s)) and storage (e.g., hardware storage device(s) 630 ) storing computer-executable instructions. One or more of the hardware storage device(s) 630 is able to house any number of data types and any number of computer-executable instructions by which the computing system 600 is configured to implement one or more aspects of the disclosed embodiments when the computer-executable instructions are executed by the one or more processor(s) 610 .
  • the computing system 600 is also shown including input/output (I/O) device(s) 620 .
  • hardware storage device(s) 630 is shown as a single storage unit. However, it will be appreciated that the hardware storage device(s) 630 is configurable as distributed storage spread across several separate, and sometimes remote and/or third-party, system(s) 660 , such as remote client system 660 A and remote client system 660 B.
  • Remote client system 660 A comprises at least a processor 670 A and hardware storage device 680 A.
  • Remote client system 660 B comprises at least a processor 670 B and hardware storage device 680 B.
  • the computing system 600 can also comprise a distributed system with one or more of the components of computing system 600 being maintained/run by different discrete systems that are remote from each other and that each performs different tasks. In some instances, a plurality of distributed systems performs similar and/or shared tasks for implementing the disclosed functionality, such as in a distributed cloud environment.
  • the hardware storage device(s) 630 are configured to store the different data types including multimedia file(s) 100 and index data 640 . Once the location of the opening song 130 has been determined, the location may be added as index data 640 associated with the multimedia file 100 .
  • the index data 640 may be associated with the multimedia file 100 by adding the index data 640 as new metadata to the multimedia file 100 . Alternatively, the index data 640 may be associated with the multimedia file 100 by adding the index data 640 to an index that is stored separately from the multimedia file 100 .
  • the storage (e.g., hardware storage device(s) 630 ) includes computer-executable instructions for instantiating or executing one or more of the models and/or engines shown in computing system 600 .
  • the models (for example, Feature Extraction Model 210 , Scene Classification Model 230 , and Scene Correction Model 320 ) are configured as AI models.
  • the one or more models are configured as engines or processing systems (e.g., computing systems integrated within computing system 600 ), wherein each engine (e.g., model) comprises one or more processors (e.g., hardware processor(s) 610 ) and computer-executable instructions corresponding to the computing system 600 .
  • the computing system 600 is provided for training and/or utilizing a machine learning model (e.g., a trained classification model) that is trained to classify different portions of multimedia content included in a media file. For example, the computing system 600 identifies a particular portion (e.g., a frame, a shot, a scene, or other predefined subset of multimedia content) in a multimedia content of a media file. The computing system 600 then identifies a feature associated with the particular portion and scores the particular portion for a probability that the particular portion corresponds to a particular classification based on a classification weight of the machine learning model that is assigned to the feature.
  • classifications include an opening scene, an opening song, a closing scene, a closing song, a recap of a previous episode or season of a television series, an opening credit, a closing credit, or other particular classification associated with multimedia or media content of a media file.
  • the computing system 600 classifies the particular portion as correlating to the particular classification, or alternatively, classifies the particular portion as not correlating to the particular classification. Based on the classification for the particular portion, the computing system 600 modifies the classification weight of the machine learning model to generate a trained classification model.
  • the computing system 600 is then able to apply the trained classification model to a new media file to identify a temporal location of the particular classification in the new multimedia content included in the new media file.
  • Computing system 600 generates index data that identifies the temporal location of the particular classification in the multimedia content based on the identified temporal location and associates the index data with the new media file.
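The classify-update-apply loop described in the preceding items might be sketched with a simple perceptron-style weight update; the feature vectors, learning rate, and decision rule are assumptions for illustration, not the disclosed training procedure:

```python
# Perceptron-style sketch of modifying classification weights based on
# labeled scenes, then locating the classified span in a new file.
def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def update_weights(weights, features, label, lr=0.1):
    """Nudge weights toward the correct classification (label: 1 = opening song)."""
    predicted = 1 if score(weights, features) > 0 else 0
    error = label - predicted
    return [w + lr * error * f for w, f in zip(weights, features)]

def locate_opening_song(weights, scene_feature_rows, scene_bounds):
    """Return (start, end) seconds of the first run of positively scored scenes."""
    hits = [i for i, f in enumerate(scene_feature_rows) if score(weights, f) > 0]
    if not hits:
        return None
    start = end = hits[0]
    while end + 1 in hits:
        end += 1
    return (scene_bounds[start][0], scene_bounds[end][1])

# One training step on a scene known to belong to the opening song.
weights = update_weights([0.0, 0.0], [1.0, 0.0], label=1)
```

A real system would of course train over many labeled files and features; the point is only that the classification outcome feeds back into the weights, which are then applied to new media files to generate index data.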
  • system 600 of FIG. 6 , which is configured with one or more hardware processors and computer storage that stores computer-executable instructions that, when executed by the one or more processors, cause various functions to be performed, such as the acts recited above.
  • Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures.
  • Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system.
  • Computer-readable media that store computer-executable instructions are physical storage media.
  • Computer-readable media that carry computer-executable instructions are transmission media.
  • embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: physical computer-readable storage media and transmission computer-readable media.
  • Physical computer-readable storage media includes random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disk read-only memory (CD-ROM) or other optical disk storage (such as DVDs, etc.), magnetic disk storage or other magnetic storage devices, or any other hardware storage devices which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
  • a “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices.
  • a network or another communications connection can include a network and/or data links which can be used to carry or store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above are also included within the scope of computer-readable media.
  • program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission computer-readable media to physical computer-readable storage media (or vice versa).
  • program code means in the form of computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer-readable physical storage media at a computer system.
  • computer-readable physical storage media can be included in computer system components that also (or even primarily) utilize transmission media.
  • Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
  • the computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
  • the functionality described herein can be performed, at least in part, by one or more hardware logic components.
  • illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Disclosed is a method for automatically detecting an introduction/opening song within a multimedia file. The method includes designating sequential blocks of time in the multimedia file as scene(s) and detecting certain feature(s) associated with each scene. The extracted scene feature(s) may be analyzed and used to assign to each scene a probability that the scene is part of the introduction/opening song. The probabilities may be used to classify each scene as either correlating to, or not correlating to, the introduction/opening song. The temporal location of the opening song may be saved as index data associated with the multimedia file.

Description

BACKGROUND
Multimedia files—for example, series episodes and movies—often include an introduction with an opening song (hereinafter “opening song”). The characteristics of an opening song can vary, such as with regard to length, audio/visual (AV) content, and temporal location of the opening song within the video. By way of example, some opening songs will play concomitantly with the opening credits, while others do not. Likewise, some opening songs will play at the very beginning of the episode or movie, while others may play after one or two scenes of the episode or movie have already transpired.
The ability to detect the placement of an opening song in a media file can be important for facilitating playback functionality, as well as for post-production editing of the content. By way of example, the ability to automatically detect an opening song can facilitate a ‘skip intro’ capability—where a viewer can jump right to the main multimedia content and pass over the opening song. However, for such a capability, precision is required—one must be able to detect the exact beginning and end of the opening song. Otherwise, a portion of the main content may be incidentally skipped, rather than just the opening song.
The detection of an opening song can also facilitate selective processing of the media content (e.g., editing of the introduction in different languages).
Unfortunately, conventional systems for indexing the opening song and introduction of a media file, as well as for indexing the other portions of a media file, are limited to manual review and tagging of the different media segments. This process can be very cumbersome and expensive, as well as subjective and inconsistent.
In view of the foregoing, it will be appreciated that there is an ongoing need for improved systems and methods for detecting opening songs in different multimedia productions.
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one example technology area where some embodiments described herein may be practiced.
BRIEF SUMMARY
Disclosed embodiments are directed to systems and methods for classifying portions of multimedia content included in a media file. In particular, systems and methods are provided for facilitating the automatic detection of opening scene(s) (e.g., a predefined introduction or opening song) in multimedia content of a media file.
As described, the disclosed embodiments may include or be practiced on computing systems configured with modules for implementing the disclosed methods.
The disclosed methods include acts for designating sequential blocks of time in the multimedia content as scene(s), then detecting certain feature(s) of those scene(s). The extracted scene feature(s) may be analyzed by machine learning model(s), or other type of artificial intelligence (AI) model(s), to classify those scenes as either part of, or not part of, the introduction/opening song, based on a probability derived from the scene feature(s). The machine learning model(s) may be trained so as to give higher or lower weight to certain scene feature(s), based on the past success of those feature(s) to accurately predict whether a scene is part of the introduction/opening song.
This Summary is provided to introduce a selection of concepts in a simplified form that is further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Additional features and advantages will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the teachings herein. Features and advantages of the disclosed systems and methods may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. Features of the present disclosed systems and methods will become more fully apparent from the following description and appended claims or may be learned by the practice of the disclosed systems and methods as set forth hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to describe the manner in which the above-recited and other advantages and features can be obtained, a more particular description of the subject matter briefly described above will be rendered by reference to specific embodiments which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments and are not therefore to be considered to be limiting in scope, embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
FIG. 1 illustrates an example of how multimedia content of a multimedia file can be broken up into sequential blocks of time designated as scenes, in order to detect the location of a predefined introduction/opening song.
FIG. 2A illustrates an example flow chart for how scene(s) in a multimedia file can be analyzed for certain feature(s), and those feature(s) used to classify the scene(s) as either being part of, or not being part of, a predefined introduction/opening song.
FIG. 2B illustrates an example flow chart for how different types of data from multimedia content included in a multimedia file—for example, visual and/or audio data—can be analyzed for certain scene feature(s).
FIG. 3 illustrates an example flow chart for how scene feature(s) can be analyzed using a scene classification model to determine a probability that each scene is part of a predefined introduction/opening song, and to classify each scene as either part of, or not part of, the introduction/opening song.
FIG. 4 illustrates an example graph showing how a probability assigned to each scene in a multimedia file may be used to determine which scene, or series of scenes, constitutes a predefined introduction/opening song.
FIG. 5 illustrates an example flow chart for automatically detecting a predefined introduction/opening song in a multimedia file.
FIG. 6 illustrates an example computing environment in which a computing system incorporates and/or is utilized to perform disclosed aspects of the disclosed embodiments.
DETAILED DESCRIPTION
Some of the disclosed embodiments are directed toward systems and methods for detecting a particular portion of multimedia files based on features extracted from the multimedia content of a multimedia file, as well as for tagging or otherwise indexing the detected portions of the multimedia files.
A multimedia file, or media file, comprises multimedia content with associated metadata about the multimedia content. Additionally, the multimedia content can be formatted in various different file formats. In some instances, the multimedia file comprises the raw multimedia content or a compressed or compiled version of the multimedia content. Multimedia content refers to electronic content which comprises multiple different media types, for example, audio content and visual content. Features can be extracted from the multimedia content wherein certain features correspond to each type of media represented in the multimedia content. It should be appreciated that the systems and methods disclosed, while described in application to multimedia content, may also be applied to media files comprising a single type of media.
Some of the disclosed embodiments are specifically directed to improved systems and methods for automatically detecting opening scene(s) (e.g., an opening song) that are included in the multimedia file. This can be beneficial, particularly when considering conventional systems, for enabling playback and post-production editing of the media files without requiring manual review and editing of each file being processed. For at least this reason, the disclosed embodiments may be implemented to provide many technical advantages over existing media processing systems, as will now be described in more detail.
Conventional media processing systems have largely employed manual tagging of the opening song in a multimedia file. Furthermore, these systems have typically analyzed the multimedia file according to the different frames that are included in the file. A frame refers to any temporal unit associated with the multimedia file. A frame is selected based on structural and semantic properties associated with the temporal unit. In some instances, a frame refers to a temporal unit comprising a still image associated with the multimedia file. In such instances, a plurality of frames is combined to form a moving picture. In some instances, a frame refers to a temporal unit comprising a limited portion of an audio file. In such instances, a plurality of frames is combined to form continuous audio. Thus, even a multimedia file comprising only a few minutes of multimedia content can contain thousands of frames. This results in a high computational cost, either manually, or using a computing system, to process the frames, identify which frames correspond to the opening song, and then tag the frames that have been identified.
Disclosed embodiments are directed to improved systems and methods for detection of an opening song to overcome the disadvantages of current detection solutions. For example, the system of the present disclosure differs from prior systems in that it allows for the automatic detection of the opening song and automatic tagging of the multimedia file, without the need for manual/human detection or tagging. This automation significantly reduces the time and cost it takes to process and edit a multimedia file with segment tagging. It can also improve the consistency in which segment boundaries are identified, at least as compared to subjective/arbitrary tagging that is sometimes caused by human error and variations in human perception.
The disclosed embodiments are able to achieve these aforementioned benefits of automatic detection by segmenting the multimedia file into scenes, as opposed to frames, as the building blocks for analyzing the multimedia file. Each segment of the multimedia file comprises a particular portion of multimedia content included in the multimedia file. The technical advantage of this is that there are far fewer scenes than frames in a multimedia file. This significantly reduces the computational expense of analyzing a limited number of scenes, instead of thousands of frames.
Herein, a scene refers to a particular portion of the multimedia content which is characterized by having continuous and distinct features from an adjacent portion of the multimedia content. In some instances, a scene is a multi-modality object which is extracted from electronic content. Scenes can be extracted based on visual and/or audio features. For example, a scene is typically associated with a particular set or environment in which the characters of the story are interacting. When one or more characters begin interacting in a different location of the story (e.g., set or geolocation), typically, a new scene has begun. In some instances, a scene involves the same set of characters or at least one or more same characters for some continuous length of time in the same environment. Because of the ability to detect a scene, features associated with a particular scene can be extracted and analyzed to determine which scenes are associated with the opening song.
Additional technical benefits include improved training of machine learning models used to automatically detect the opening song, resulting in improved machine learning models which are more accurate, consistent, and transparent. Because multiple different features are extractable from the different scenes of the multimedia file, the machine learning model can be trained on different sets of features which help it to detect opening songs of new multimedia files. Each scene corresponds to a particular subset of all features that are or can be extracted from the multimedia content included in multimedia file. Features can also be extracted from metadata included in the multimedia file which corresponds to the multimedia content. These features contribute to the model, both in training and during run-time.
Additionally, each feature can be assigned, either manually or by a machine learning model, a particular weight that predicts how much that feature will contribute to the prediction that the scene corresponds to the opening song. Some features may be more indicative or distinctive of an opening song than other features. For example, some features, like a series of written names appearing on the scene, may correlate to the opening song more than other features, like a background melody playing, which may appear more frequently throughout the entire multimedia content instead of exclusively in the opening scene.
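The per-feature weighting described above can be illustrated with a logistic scoring sketch, in which on-screen name credits carry more weight than background music; the weight values and bias are made-up numbers for the sketch, not values from the disclosure:

```python
import math

# Illustrative weights: a series of written names on screen is strongly
# indicative of the opening song, while background music appears throughout
# the file and so contributes less. All values are assumptions.
WEIGHTS = {"names_on_screen": 2.0, "background_music": 0.3, "black_frame_nearby": 1.0}
BIAS = -2.0

def opening_song_probability(features):
    """Weighted sum of feature values squashed into a probability."""
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

credits_scene = {"names_on_screen": 1.0, "background_music": 1.0, "black_frame_nearby": 1.0}
dialog_scene = {"names_on_screen": 0.0, "background_music": 1.0, "black_frame_nearby": 0.0}
```

Because the weights are explicit, a user can inspect exactly how much each identified feature contributed to a scene's score, which is the transparency property discussed next.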
Because of this weighting system, the results of the machine learning model are more transparent to users. For example, a user is able to understand why the machine learning model returned a particular result (e.g., why a particular scene or set of scenes was detected as the opening song portion) because the user is able to see which features were identified and how important each feature was (e.g., the weight applied to each feature) in predicting whether the scene(s) corresponded to the opening song.
The disclosed embodiments also achieve additional technical benefits over the prior art, in that the systems and methods described herein are flexible and scalable. In some instances, a user can tune how many features are to be extracted and processed. For example, if a user wants to improve the accuracy of the results, the user or the machine learning model can select a higher or total number of features available.
Alternatively, if a user wants to reduce the computational time of processing the file, the user or the machine learning model can select a lower or limited number of features available (e.g., the categories of features that have the highest weights). Additionally, if a new feature module is developed for identifying and extracting a new type of feature, the machine learning model is able to be trained on that new feature in isolation, or in combination with existing features, to update the model to be able to extract the new feature. The model can then use this new feature to augment and improve the detection of the opening song. The machine learning model is also configured to learn which feature, or combination of features, results in more accurate detection of the opening song for a particular type of media file. Accordingly, the machine learning model can add or omit certain features dynamically upon determining a particular type of media file.
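The feature-count tuning described above amounts to ranking feature categories by weight and keeping only the top k; a minimal sketch, with illustrative weight values:

```python
# Keep only the k highest-weight feature categories to trade accuracy for
# processing time. The weight values are illustrative assumptions.
def select_top_features(weights, k):
    """Return the k feature names with the largest absolute weights."""
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    return ranked[:k]

feature_weights = {"credit_words": 1.8, "person_names": 1.5,
                   "music_duration": 0.9, "black_frame": 0.7, "speech_rate": -0.4}

# A fast profile extracts only the two most predictive feature categories.
fast_profile = select_top_features(feature_weights, 2)
```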
It will be appreciated that the disclosed systems and methods can also be applied to detecting other portions of the multimedia content, such as closing credits, an intermission, or other distinct portions of the file content. It should also be appreciated that the systems and methods can be used to analyze single media files, such as visual-only files, audio-only files, or other multimedia files including virtual reality and augmented reality content files. Attention will first be directed to FIG. 1 , which illustrates, in one embodiment, a multimedia file 100 which is separated into sequential scene(s) 120 (e.g., scene 1, scene 2, scene 3, scene 4, scene 5, scene 6, scene 7, scene 8, scene 9, scene 10, scene 11, scene 12, scene 13, scene 14, scene 15, scene 16, scene 17, and so forth). This may be done manually, or by an AI model, such as a machine learning model or other type of AI model.
For example, an AI model may recognize that the people/characters in a series of sequential shots in the multimedia file 100 do not change, or that the background scenery does not change. A shot is an inner unit of the scene, wherein a plurality of shots is identified in a media file. For example, a shot is a contiguous sequence of frames with the same or similar camera angle. A sequential subset of the plurality of shots is then aggregated into a scene. In some instances, multiple different subsets, where a subset comprises a certain number of shots, are aggregated into different scenes. For example, the model may classify a group of shots, frames, or blocks of sequential time 110 in the multimedia file, as individual scene(s) 120. As shown in FIG. 1 , if the multimedia file 100 is viewed temporally (e.g., see Time 110), the file may be broken up into blocks of time 110 that are designated as sequential scene(s) 120. Scene(s) 120 may be determined by analyzing both audio and visual data from multimedia content of the multimedia file 100.
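The aggregation of shots into scenes might be sketched as follows, assuming each shot has already been labeled with its set or environment by an upstream visual model; the labels below are placeholders:

```python
# Group contiguous runs of identically-labeled shots into scenes.
# Shot labels (set/environment) are assumed to come from an upstream model.
def shots_to_scenes(shot_labels):
    scenes = []
    for i, label in enumerate(shot_labels):
        if scenes and scenes[-1]["set"] == label:
            scenes[-1]["shots"].append(i)   # same environment: extend the scene
        else:
            scenes.append({"set": label, "shots": [i]})  # environment changed
    return scenes

scenes = shots_to_scenes(["kitchen", "kitchen", "street", "street", "street", "kitchen"])
```

In practice the grouping signal would combine character continuity, background similarity, and audio cues rather than a single precomputed label.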
The disclosed systems also analyze each scene, extract one or more features for each scene, and predict how likely it is that each scene corresponds to the opening song, such that the system is able to detect which scene(s) correspond to the opening song. In this example, scenes 1-8 constitute the opening song 130 . In some instances, all of the scenes in the file are analyzed. Alternatively, in some instances, once a set of scenes is predicted to correspond to an opening song, the system refrains from analyzing the rest of the scenes, which reduces the computational expense of processing the file.
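The early-stop behavior described above can be sketched as a scan over per-scene probabilities that returns as soon as the first predicted opening-song run ends; the 0.5 threshold is an assumed value:

```python
# Scan scene probabilities in order and stop once the first run of scenes
# predicted to be the opening song has ended, skipping later scenes.
def find_opening_song(scene_scores, threshold=0.5):
    """Return (first, last) scene indices of the first run above threshold."""
    start = None
    for i, p in enumerate(scene_scores):
        if p >= threshold and start is None:
            start = i                      # run begins
        elif p < threshold and start is not None:
            return (start, i - 1)          # run ended: remaining scenes skipped
    return (start, len(scene_scores) - 1) if start is not None else None
```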
Referring now to FIG. 2A, after the multimedia file 100 has been broken up into sequential scene(s) 120, each scene 120 may be characterized by certain feature(s) 220. As shown in FIG. 2A, the multimedia file 100—as a series of sequential scene(s) 120—is processed using a Feature Extraction Model 210 to determine whether a particular scene possesses certain feature(s) 220. Based on the feature(s) 220 associated with the particular scene, the particular scene is classified by a Scene Classification Model 230 to determine the probability 410 (shown in FIG. 4 ) that the scene is associated with the opening song 130. This process is repeated for some or all of the scenes in the multimedia file.
As shown in more detail in FIG. 2B, different types of data from multimedia content of multimedia file 100 or from metadata included in multimedia file 100 are analyzed by the Feature Extraction Model(s) 210 to determine the feature(s) 220 associated with each scene 120. For example, visual data 250 from multimedia content of multimedia file 100 may be analyzed by Feature Extraction Model(s) 210, such as an OCR Model or a Black Frame Detector model, to determine extracted scene feature(s) 220. An OCR model may be applied to text identified in the scene, in order to perform character recognition on the text. The Black Frame Detector model may be used to detect a black or blank frame within a predetermined proximity to the scene (e.g., before or after a particular scene).
The extracted scene feature(s) 220 taken from visual data 250 may include, for example: the number of known words (detecting words commonly used in opening credits, such as “introducing,” “producer,” or “produced by,” etc.); people names (because opening credits usually list the names of the actors, producers, etc.); and/or the existence of a black frame within a predetermined proximity of that scene. The extracted scene feature(s) 220 taken from visual data 250 may also include font characteristics of text that appears on the screen, including font size. The extracted scene feature(s) 220 taken from visual data 250 may also include known media assets, such as known TV show or movie names.
As shown in FIG. 2B, audio data 260 may also be analyzed by Feature Extraction Model(s) 210, such as a speech-to-text model, a music detector model, and/or a silence detector model. The Feature Extraction Model(s) 210 may identify a language spoken in the scene and employ a speech-to-text function for the recognition and translation of a spoken language into text. The Feature Extraction Model(s) 210 may detect any music that plays during a scene and recognize the duration of time that the music plays during the scene. The feature extraction may also detect a particular volume associated with the music, or a volume change from one scene to the next, or volume contrast from music relative to spoken words. For example, in some instances, background music is typically at a lower volume while characters are speaking, while an opening song is usually at a higher volume than the background music. The Feature Extraction Model(s) 210 may detect silence of a predetermined duration that occurs either within a scene or within a predetermined time after a scene ends.
Additionally, the extracted scene feature(s) 220 taken from audio data 260 may include, for example: the rate of words spoken during a scene, the number of speakers participating in the scene, the amount of time that music plays in the scene, and/or the presence of silence that occurs near the end of a scene or just after the scene. The Feature Extraction Model(s) 210 used to analyze audio data 260 may involve diarization model(s).
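The visual and audio signals described above might be collected into a single per-scene feature vector along the following lines. This is a sketch under assumptions: the raw `scene` dictionary stands in for real detector outputs (OCR text, diarization, music/silence detection), and every field name is a hypothetical choice, not the patent's schema.

```python
# Hypothetical credit-vocabulary list for the "known words" feature.
CREDIT_WORDS = {"introducing", "producer", "produced", "by", "directed"}

def extract_scene_features(scene):
    """Assemble a feature vector from assumed detector outputs for one scene."""
    words = scene["ocr_text"].lower().split()
    return {
        # visual: credit-style vocabulary and black-frame proximity
        "credit_word_count": sum(w in CREDIT_WORDS for w in words),
        "black_frame_nearby": scene["black_frame_nearby"],
        # audio: speech rate, speaker count, music coverage, trailing silence
        "words_per_second": scene["spoken_words"] / scene["duration_s"],
        "num_speakers": scene["num_speakers"],
        "music_fraction": scene["music_s"] / scene["duration_s"],
        "trailing_silence_s": scene["silence_after_s"],
    }

scene = {
    "ocr_text": "Produced by Jane Doe Introducing John Roe",
    "black_frame_nearby": True,
    "spoken_words": 12, "duration_s": 60.0,
    "num_speakers": 1, "music_s": 55.0, "silence_after_s": 1.5,
}
features = extract_scene_features(scene)
# features["credit_word_count"] -> 3 ("produced", "by", "introducing")
```

A music-heavy, credit-laden, low-speech scene like this one exhibits exactly the profile the disclosure associates with an opening song.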
As shown in FIG. 2B, the audio and visual data together 270 may be analyzed by Feature Extraction Model(s) 210, such as a scenes creation model, to determine extracted scene feature(s) 220—including, for example, the duration of the scene, the number of shots comprising the scene, and the scene location within multimedia file 100. An opening song is typically at or near the beginning of a multimedia file.
Attention will now be directed to FIG. 3. Once the multimedia file 100 has been broken into scene(s) 120 (as shown in FIG. 1), and feature(s) 220 have been extracted from each scene 120 (as shown in FIGS. 2A and 2B), the results of the feature analyses may be aggregated and used to classify each scene 120 with a probability that the scene is a part of the opening song 130 (as shown in FIG. 3). For example, as shown in FIG. 3, the extracted scene feature(s) 220 data may be analyzed using Scene Classification Model 230 to determine classified scene probability data 310, wherein each scene 120 has been assigned a probability 410 (shown in FIG. 4) that the scene 120 is part of the opening song 130. Using the classified scene probability data 310, the opening song 130 may be determined using, for example, algorithmic heuristics which locate a sequence of scene(s) 120 that have been assigned sufficiently high probabilities 410 of being part of the opening song 130.
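One such heuristic, sketched under the assumption that the classified scene probability data is a simple list of per-scene scores, is to locate the longest contiguous run of scenes whose probability meets a threshold. The 0.5 threshold is an illustrative value, not one specified by the disclosure.

```python
def find_opening_song(probs, threshold=0.5):
    """Return (start, end) indices of the longest run with prob >= threshold,
    or None if no scene qualifies."""
    best = None
    run_start = None
    for i, p in enumerate(probs + [0.0]):   # sentinel closes a trailing run
        if p >= threshold and run_start is None:
            run_start = i
        elif p < threshold and run_start is not None:
            run = (run_start, i - 1)
            if best is None or (run[1] - run[0]) > (best[1] - best[0]):
                best = run
            run_start = None
    return best

probs = [0.1, 0.9, 0.8, 0.85, 0.7, 0.95, 0.9, 0.88, 0.05, 0.1]
find_opening_song(probs)  # -> (1, 7): the second through eighth scenes
```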
Scene Classification Model 230 may be trained using a Scene Correction Model 320, which determines the success of each scene feature 220 in predicting the probability that scene 120 is part of the opening song 130. The result would be that in subsequent applications of Scene Classification Model 230, certain feature(s) 220 may be given more classification weight and other feature(s) 220 may be given less classification weight when determining the probability that a scene from scene(s) 120 is part of the opening song 130, based on the past success (or failure) of those feature(s) 220 to correctly predict that the scene was part of the opening song 130. For example, different text (either from the audio or visual data) associated with a scene may be given different classification weights by the Scene Classification Model 230.
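One way such a correction step could work, sketched here as a perceptron-style update, is to nudge per-feature weights whenever a prediction disagrees with ground truth: features that supported a wrong answer lose classification weight, and features that would have supported the right answer gain it. The linear scoring assumption and the 0.1 learning rate are hypothetical choices for this sketch.

```python
def update_weights(weights, features, predicted_opening, actually_opening, lr=0.1):
    """Adjust per-feature weights after comparing a prediction to ground truth."""
    if predicted_opening == actually_opening:
        return weights                      # prediction correct: no change
    sign = 1.0 if actually_opening else -1.0
    return {name: w + sign * lr * features.get(name, 0.0)
            for name, w in weights.items()}

weights = {"credit_word_count": 0.2, "music_fraction": 0.3}
features = {"credit_word_count": 3.0, "music_fraction": 0.9}
# The model said "not opening song" but the scene actually was part of it, so
# both features (which were present) gain weight:
updated = update_weights(weights, features, predicted_opening=False,
                         actually_opening=True)
```

In a production system this role would more likely be played by standard gradient-based training of the classification model; the point of the sketch is only that past success or failure feeds back into the classification weights.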
As shown in FIG. 4 , classified scene probability data 310 provides the probabilities 410 that each scene 120 is a part of the opening song 130 and can be analyzed to find the sequence of scene(s) 120 with sufficiently high probability 410 so as to constitute the opening song 130. While analyzing the classified scene probability data 310, knowledge of adjacent scenes 120 may be combined to detect the full opening song 130. For example, as shown in FIG. 4 , scenes 2-4 and 6-8 have been assigned a fairly high probability 410 of being part of the opening song, while scene 5 has been assigned a lower probability 410 of being part of the opening song. However, if knowledge of adjacent scenes is combined, scene 5 will be designated as part of the opening song, along with scenes 2-4 and 6-8, because of the high probability 410 of its neighboring scenes.
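The neighbor-knowledge step described above can be sketched as a smoothing pass over the per-scene probabilities: a scene whose own score is low is still folded into the opening song when both of its neighbors score high. The 0.7 threshold is an illustrative assumption.

```python
def smooth_with_neighbors(probs, high=0.7):
    """Return a boolean mask marking scenes designated part of the opening song."""
    mask = [p >= high for p in probs]
    for i in range(1, len(probs) - 1):
        # low-scoring scene sandwiched between high-scoring neighbors
        if not mask[i] and mask[i - 1] and mask[i + 1]:
            mask[i] = True
    return mask

# Mirroring FIG. 4: scenes 2-4 and 6-8 score high, scene 5 scores low but is
# bridged by its neighbors (indices here are 0-based, so scene 5 is index 4).
probs = [0.2, 0.9, 0.85, 0.8, 0.3, 0.9, 0.95, 0.8, 0.1]
smooth_with_neighbors(probs)
# -> [False, True, True, True, True, True, True, True, False]
```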
It should be appreciated that the machine learning model is tunable in determining how many more scenes should be analyzed after a scene is predicted to have a low probability score before refraining from analyzing the rest of the file. In other instances, all of the scenes are analyzed. The scenes which correspond to the opening song can then be tagged. These tags can be used in post-editing to insert “intro skip” functions for a user to skip over the opening song while streaming the media file. Additionally, it should be appreciated that in some instances, the system analyzes a relative relationship between probability scores. For example, if a low probability score is followed by a high probability score with a score difference that meets or exceeds a pre-determined threshold, the scene with the high probability score is likely the beginning of the opening song. If a high probability score is followed by a low probability score with a score difference that meets or exceeds a pre-determined threshold, the system may predict that the scene with the high probability score is the end of the opening song.
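The relative-score heuristic above can be sketched as scanning for score differences: a jump that meets the threshold marks the likely start of the opening song, and a comparable drop marks its likely end. The 0.5 difference threshold is a hypothetical value.

```python
def find_boundaries(probs, jump=0.5):
    """Locate likely start/end scene indices of the opening song from
    scene-to-scene probability differences; None if no boundary is found."""
    start = end = None
    for i in range(1, len(probs)):
        diff = probs[i] - probs[i - 1]
        if diff >= jump and start is None:
            start = i              # low -> high: opening song likely begins
        elif diff <= -jump and start is not None and end is None:
            end = i - 1            # high -> low: previous scene likely ends it
    return start, end

probs = [0.1, 0.9, 0.85, 0.9, 0.2, 0.15]
find_boundaries(probs)  # -> (1, 3)
```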
In some instances, the system may determine that there is no opening song included in the media file, for example, if there is not a big enough difference between the probability scores of different scenes, or if the probability scores of the scenes do not meet or exceed a pre-determined threshold value. The threshold value can be pre-determined by a user or learned and set automatically by the computing system. In some instances, a different threshold value is chosen for different types or categories of media files or can be dynamically updated based on identifying the type of media file or based on certain features which have been extracted for one or more scenes.
Attention will now be directed to FIG. 5 , which illustrates a flow diagram that includes various acts (act 510, act 520, act 530, act 540, act 550, and act 560) associated with example methods that can be implemented by computing system 600 for performing opening song detection using a machine learning model.
For example, the disclosed methods for automatically detecting an opening song 130 in a multimedia file 100 may include the initial act of accessing a multimedia file 100, which contains multimedia content (act 510). The multimedia file 100 may then be analyzed (as described above with regard to FIG. 1) to identify a scene (e.g., scene(s) 120) in the multimedia file 100 (act 520). Then, feature(s) 220 of the multimedia content included in the multimedia file may be determined relative to their association with each scene 120 (act 530), as described above with regard to FIGS. 2A and 2B.
Each scene 120 may then be scored with a probability 410 that the scene corresponds to a predefined opening song 130 (act 540), as described above regarding FIGS. 3-4 . Based on the probability 410, the scene will then be classified as correlating, or not correlating, to the opening song 130 (act 550).
In determining the probability that a different scene 120b is part of the opening song 130, the classifying weight of at least one feature 220 may be modified by Scene Correction Model 320 when determining a probability 410 that the new scene 120b is part of the opening song 130, based on the success of the machine learning model in accurately predicting whether the first scene 120 was part of the opening song 130 (act 560).
Once the opening song 130 has been identified, the temporal location of the opening song 130 can be stored as index data that is associated with the multimedia file 100. The index data may be associated with the multimedia file 100 by adding the index data as new metadata to the multimedia file 100, or by adding the index data to an index that is stored separately from the multimedia file 100. The index data may be associated with the multimedia file 100 in such a way as to enable a trick play function to be performed to the multimedia file 100, during which the index data is referenced for skipping or fast-forwarding the opening scene(s) of the multimedia file 100 during the trick play function.
In some instances, the method further comprises generating index data that identifies a temporal location of the opening song in the media file and associating the index data with the media file. The system can associate the index data with the media file according to several different techniques. For example, the system performs the association by adding the index data as metadata to the media file or by adding the index data to an index that is stored separately from the media file. Additionally, or alternatively, the index data is associated with the media file in such a manner as to enable a trick play function to be performed to the media file during which the index data is referenced for skipping or fast-forwarding the opening song and corresponding scenes of the media file during the trick play function.
The system is also able to identify features in different ways. In some instances, identifying the feature includes identifying text in the scene by applying an OCR model to perform character recognition on the text, wherein different text is associated with different classification weights. As another example, identifying the feature includes identifying a black frame within a predetermined proximity to the scene. Additionally, or alternatively, the system identifies one or more features by identifying a language spoken in the scene and by applying a speech-to-text model for the recognition and translation of spoken language into text, wherein different text is associated with different classification weights. The feature may be identified using a music detector to detect any music that plays during the scene and to recognize a duration of time that the music plays during the scene, or using a silence detector to detect audio silence of a predetermined duration that occurs either within the scene or within a predetermined duration after the scene ends.
Some example features that are identified are obtained from visual data and include font characteristics of text (e.g., font size). Some features that are identified are obtained from visual data and include: particular words, names of people, terms associated with production titles or credit attribution, OCR data, a size of text that appears on the screen, media assets, or a black frame within a scene or within a predetermined proximity to the scene. Some example features that are identified are obtained from audio data, including: a rate at which words are spoken during the scene, a quantity of unique speakers in the scene, a duration of time within the scene that music is played, or a predetermined duration of silence after the scene ends.
Some example features that are identified are obtained from both visual and audio data, including: a scene duration, a quantity of camera shots in the scene, or a location of the scene within the media file. Using any of the aforementioned features, or other features, the system is able to classify the scene as either correlating, or not correlating, to the opening song. In some instances, the system is able to classify the scene based at least in part on knowledge of neighboring scenes, in addition to features corresponding to the scene being classified. Additionally, the system is able to learn which features from a scene are associated with a higher probability that the scene correlates to an opening song and which features from a scene are associated with a lower probability that the scene correlates to the opening song.
After the machine learning model is trained, the system accesses a media file containing multimedia content and applies the trained machine learning model (i.e., classification model) to the media file to identify a temporal location of an opening song identified in the multimedia content of the media file. This is done by generating index data that identifies the temporal location of the opening song in the multimedia content based on the identified temporal location; and associating the index data with the media file.
FIG. 6 illustrates components of a computing system 600 which may include and/or be used to implement aspects of the disclosed methods. As shown, the computing system 600 comprises a plurality of AI models—for example, a Feature Extraction Model 210, a Scene Classification Model 230, and a Scene Correction Model 320.
Computing system 600 is able to utilize different AI models, and/or different types of AI models. For example, Scene Classification Model 230 is configured as a machine learning model, such as a classification model. A machine learning model is a particular type of AI model which is configured to recognize and learn from patterns identified in datasets and utilize that learning to improve itself for a particular task for which the machine learning model is trained. A classification model is a type of machine learning model trained to perform one or more different classification tasks. Classification tasks are a type of predictive modeling problem in which the machine learning model is trained to predict a class label for a particular set or subset (e.g., a scene) of input data (e.g., multimedia content of a multimedia file). FIG. 6 illustrates how the computing system 600 is one part of a distributed computing environment that also includes remote (e.g., third party) system(s) 660 in communication (via a network 650) with the computing system 600.
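A minimal sketch of a classification model in the sense used here is a linear scorer over the extracted scene features, squashed to a probability with a logistic sigmoid. The weights, feature names, and linear form are assumptions for illustration; the disclosure does not prescribe a particular model architecture.

```python
import math

def score_scene(features, weights, bias=0.0):
    """Probability that a scene belongs to the target class (e.g., opening song)."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))   # logistic sigmoid

weights = {"music_fraction": 3.0, "credit_word_count": 0.8, "words_per_second": -1.5}
features = {"music_fraction": 0.9, "credit_word_count": 4, "words_per_second": 0.2}
score_scene(features, weights)  # high score: a music-heavy, credit-laden scene
```

Features absent from the weight dictionary simply contribute nothing, which mirrors how a trained model can learn to ignore uninformative signals.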
As described herein, the computing system 600 is configured to train a plurality of AI models for automatically detecting scene feature(s) and probabilities that scene(s) are part of a predefined opening song. The computing system 600 is also configured to generate training data configured for training the AI models.
The computing system 600, for example, includes a processing system with one or more processor(s) 610 (such as one or more hardware processor(s)) and storage (e.g., hardware storage device(s) 630) storing computer-executable instructions. One or more of the hardware storage device(s) 630 is able to house any number of data types and any number of computer-executable instructions by which the computing system 600 is configured to implement one or more aspects of the disclosed embodiments when the computer-executable instructions are executed by the one or more processor(s) 610. The computing system 600 is also shown including input/output (I/O) device(s) 620.
As shown in FIG. 6 , hardware storage device(s) 630 is shown as a single storage unit. However, it will be appreciated that the hardware storage device(s) 630 is configurable as a distributed storage that is distributed to several separate and sometimes remote and/or third-party system(s) 660, such as remote client system 660A and remote client system 660B. Remote client system 660A comprises at least a processor 670A and hardware storage device 680A. Remote client system 660B comprises at least a processor 670B and hardware storage device 680B. The computing system 600 can also comprise a distributed system with one or more of the components of computing system 600 being maintained/run by different discrete systems that are remote from each other and that each performs different tasks. In some instances, a plurality of distributed systems performs similar and/or shared tasks for implementing the disclosed functionality, such as in a distributed cloud environment.
The hardware storage device(s) 630 are configured to store the different data types including multimedia file(s) 100 and index data 640. Once the location of the opening song 130 has been determined, the location may be added as index data 640 associated with the multimedia file 100. The index data 640 may be associated with the multimedia file 100 by adding index data 640 as new metadata to the multimedia file 100. Or the index data 640 may be associated with the multimedia file 100 by adding the index data 640 to an index that is stored separately from the multimedia file 100.
The storage (e.g., hardware storage device(s) 630) includes computer-executable instructions for instantiating or executing one or more of the models and/or engines shown in computing system 600. The models—for example, Feature Extraction Model 210, Scene Classification AI Model 230, and Scene Correction Model 320—are configured as AI models. In some instances, the one or more models are configured as engines or processing systems (e.g., computing systems integrated within computing system 600), wherein each engine (e.g., model) comprises one or more processors (e.g., hardware processor(s) 610) and computer-executable instructions corresponding to the computing system 600.
In some instances, the computing system 600 is provided for training and/or utilizing a machine learning model (e.g., a trained classification model) that is trained to classify different portions of multimedia content included in a media file. For example, the computing system 600 identifies a particular portion (e.g., a frame, a shot, a scene, or other predefined subset of multimedia content) in a multimedia content of a media file. The computing system 600 then identifies a feature associated with the particular portion and scores the particular portion for a probability that the particular portion corresponds to a particular classification based on a classification weight of the machine learning model that is assigned to the feature. Examples of different classifications include an opening scene, an opening song, a closing scene, a closing song, a recap of a previous episode or season of a television series, an opening credit, a closing credit, or other particular classification associated with multimedia or media content of a media file.
Based at least in part on the probability that the particular portion corresponds to the particular classification, the computing system 600 classifies the particular portion as correlating to the particular classification, or alternatively, classifies the particular portion as not correlating to the particular classification. Based on the classification for the particular portion, the computing system 600 modifies the classification weight of the machine learning model to generate a trained classification model.
Subsequently, the computing system 600 is then able to apply the trained classification model to a new media file to identify a temporal location of the particular classification in the new multimedia content included in the new media file. Computing system 600 generates index data that identifies the temporal location of the particular classification in the multimedia content based on the identified temporal location and associates the index data with the new media file.
With regard to all of the foregoing, it will be appreciated that the disclosed embodiments may include or be practiced by or implemented by a computer system, such as system 600 of FIG. 6 , which is configured with one or more hardware processors and computer storage that stores computer-executable instructions that, when executed by one or more processors, cause various functions to be performed, such as the acts recited above.
Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are physical storage media. Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: physical computer-readable storage media and transmission computer-readable media.
Physical computer-readable storage media includes random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disk read-only memory (CD-ROM) or other optical disk storage (such as DVDs, etc.), magnetic disk storage or other magnetic storage devices, or any other hardware storage devices which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above are also included within the scope of computer-readable media.
Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission computer-readable media to physical computer-readable storage media (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer-readable physical storage media at a computer system. Thus, computer-readable physical storage media can be included in computer system components that also (or even primarily) utilize transmission media.
Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like. The disclosed systems and methods may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
The present invention may be embodied in other specific forms without departing from its spirit or characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (21)

What is claimed is:
1. A method implemented by a computing system for training a machine learning model to classify scenes in multimedia content, the method comprising:
identifying a scene in the multimedia content of a media file;
identifying a feature associated with the scene;
scoring the scene for a probability that the scene corresponds to an opening song based on a classification weight of the model that is assigned to the feature;
based at least in part on the probability that the scene corresponds to the opening song, classifying the scene as correlating to the opening song, or alternatively, classifying the scene as not correlating to the opening song; and
based on the classification for the scene, modifying a classification weight of the machine learning model.
2. The method of claim 1, further comprising:
generating index data that identifies a temporal location of the opening song in the media file; and
associating the index data with the media file.
3. The method of claim 2, wherein the associating is performed by adding the index data as metadata to the media file.
4. The method of claim 2, wherein the associating is performed by adding the index data to an index that is stored separately from the media file.
5. The method of claim 2, wherein the index data is associated with the media file in such a manner as to enable a trick play function to be performed to the media file during which the index data is referenced for skipping or fast-forwarding the opening song and corresponding scenes of the media file during the trick play function.
6. The method of claim 1, wherein the identifying the feature includes identifying text in the scene by applying an OCR model to perform character recognition on text identified in the scene and wherein different text is associated with different classification weights.
7. The method of claim 1, wherein the identifying the feature includes identifying a black frame within a predetermined proximity to the scene.
8. The method of claim 1, wherein the identifying the feature includes identifying language spoken in the scene and by applying a speech to text model for the recognition and translation of spoken language into text and wherein different text is associated with different classification weights.
9. The method of claim 1, wherein the identifying the feature includes using a music detector to detect any music that plays during the scene and to recognize a duration of time that the music plays during the scene.
10. The method of claim 1, wherein the identifying the feature includes using a silence detector to detect audio silence of a predetermined duration that occurs either within the scene or within a predetermined duration after the scene ends.
11. The method of claim 1, wherein the features that are identified are obtained from visual data and include font characteristics of text.
12. The method of claim 11, wherein the font characteristics include a font size.
13. The method of claim 1, wherein the features that are identified are obtained from visual data and include:
(i) particular words,
(ii) names of people,
(iii) terms associated with production titles or credit attribution,
(iv) OCR data;
(v) a size of text that appears on the screen,
(vi) media assets, or
(vii) a black frame within a scene or within a predetermined proximity to the scene.
14. The method of claim 1, wherein the features that are identified are obtained from audio data, including:
(i) a rate at which words are spoken during the scene,
(ii) a quantity of unique speakers in the scene,
(iii) a duration of time within the scene that music is played, or
(iv) a predetermined duration of silence after the scene ends.
15. The method of claim 1, wherein the features that are identified are obtained from both visual and audio data, including:
(i) a scene duration,
(ii) a quantity of camera shots in the scene, or
(iii) a location of the scene within the media file.
16. The method of claim 1, further comprising:
classifying the scene as either correlating, or not correlating, to the opening song based at least in part on knowledge of neighboring scenes.
17. A method implemented by a computing system for associating index data of an opening song with a media file, including:
accessing a media file containing multimedia content;
applying a trained classification model to the media file to identify a temporal location of a scene associated with an opening song in the multimedia content, the trained classification model identifying features of the scene and applying weights to the identified features of the scene to determine whether the scene meets a selected threshold of probability for being associated with the opening song;
generating index data that identifies the temporal location of the opening song in the multimedia content based on the identified temporal location of the scene associated with the opening song; and
associating the index data with the media file.
18. The method of claim 17, wherein the associating is performed by adding the index data as metadata to the media file.
19. The method of claim 17, wherein the associating is performed by adding the index data to an index that is stored separately from the media file.
20. The method of claim 17, wherein the selected threshold of probability for being associated with the opening song is one of a plurality of different thresholds of probability, wherein each threshold of the plurality of thresholds corresponds to a different type or category of media file, and wherein the selected threshold of probability is based on a type or category corresponding to the accessed media file containing the multimedia content.
21. A computing system for training and utilizing a machine learning model to classify a portion of multimedia content, the computing system comprising:
a processor; and
a hardware storage device storing computer-executable instructions that are executable by the processor for causing the computing system to:
identify a particular portion in the multimedia content of a media file;
identify a feature associated with the particular portion;
score the particular portion for a probability that the particular portion corresponds to a particular classification based on a classification weight of the machine learning model that is assigned to the feature;
based at least in part on the probability that the particular portion corresponds to the particular classification, classify the particular portion as correlating to the particular classification, or alternatively, classify the particular portion as not correlating to the particular classification;
based on the classification for the particular portion, modify the classification weight of the machine learning model to generate a trained classification model;
apply the trained classification model to a new media file to identify a temporal location of the particular classification in the new multimedia content included in the new media file;
generate index data that identifies the temporal location of the particular classification in the multimedia content based on the identified temporal location; and
associate the index data with the new media file.
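Claim 21 describes a train-then-apply loop: classify labeled portions, modify the classification weight based on the outcome, then apply the trained model to a new media file to locate the classification and generate index data. The patent does not specify a learning rule; the following perceptron-style update on a single feature weight is an illustrative stand-in, with all names and values assumed:

```python
def train_weight(samples, weight=0.0, lr=0.1, threshold=0.5, epochs=20):
    """Perceptron-style sketch: samples are (feature_value, is_opening_song)
    pairs; nudge the classification weight whenever the weighted score
    misclassifies a labeled portion."""
    for _ in range(epochs):
        for feature_value, is_opening in samples:
            predicted = (feature_value * weight) >= threshold
            if predicted != is_opening:
                weight += lr * feature_value * (1.0 if is_opening else -1.0)
    return weight

def locate_in_new_file(portions, weight, threshold=0.5):
    """Apply the trained weight to a new file's (start, end, feature) portions;
    return index data for the first portion meeting the threshold, or None."""
    for start_s, end_s, feature_value in portions:
        if feature_value * weight >= threshold:
            return {"label": "opening_song",
                    "start_seconds": start_s,
                    "end_seconds": end_s}
    return None
```

After training on a positive portion with a strong feature response and a negative portion with a weak one, the learned weight separates the two, and applying it to a new file yields the temporal location of the matching portion as index data.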
US18/090,843 2022-12-29 2022-12-29 Combining visual and audio insights to detect opening scenes in multimedia files Active US12266175B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/090,843 US12266175B2 (en) 2022-12-29 2022-12-29 Combining visual and audio insights to detect opening scenes in multimedia files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US18/090,843 US12266175B2 (en) 2022-12-29 2022-12-29 Combining visual and audio insights to detect opening scenes in multimedia files

Publications (2)

Publication Number Publication Date
US20240221379A1 US20240221379A1 (en) 2024-07-04
US12266175B2 true US12266175B2 (en) 2025-04-01

Family

ID=91665873

Family Applications (1)

Application Number Title Priority Date Filing Date
US18/090,843 Active US12266175B2 (en) 2022-12-29 2022-12-29 Combining visual and audio insights to detect opening scenes in multimedia files

Country Status (1)

Country Link
US (1) US12266175B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20250217061A1 (en) * 2024-01-02 2025-07-03 Rivian Ip Holdings, Llc Electric vehicle data based storage control

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120233642A1 (en) * 2011-03-11 2012-09-13 At&T Intellectual Property I, L.P. Musical Content Associated with Video Content
US8386506B2 (en) * 2008-08-21 2013-02-26 Yahoo! Inc. System and method for context enhanced messaging
US20150279344A1 (en) * 2014-03-28 2015-10-01 Than Van Nguyen Bird Whistle
US20170025152A1 (en) * 2014-03-17 2017-01-26 Manuel Jaime Media clip creation and distribution systems, apparatus, and methods
US20170193094A1 (en) * 2015-12-31 2017-07-06 Le Holdings (Beijing) Co., Ltd. Method and electronic device for obtaining and sorting associated information
US20180152767A1 (en) * 2016-11-30 2018-05-31 Alibaba Group Holding Limited Providing related objects during playback of video data
US20210295148A1 (en) * 2020-03-20 2021-09-23 Avid Technology, Inc. Adaptive Deep Learning For Efficient Media Content Creation And Manipulation
US20230108579A1 (en) * 2021-10-05 2023-04-06 Deepmind Technologies Limited Dynamic entity representations for sequence generation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8386506B2 (en) * 2008-08-21 2013-02-26 Yahoo! Inc. System and method for context enhanced messaging
US20120233642A1 (en) * 2011-03-11 2012-09-13 At&T Intellectual Property I, L.P. Musical Content Associated with Video Content
US20170025152A1 (en) * 2014-03-17 2017-01-26 Manuel Jaime Media clip creation and distribution systems, apparatus, and methods
US20150279344A1 (en) * 2014-03-28 2015-10-01 Than Van Nguyen Bird Whistle
US20170193094A1 (en) * 2015-12-31 2017-07-06 Le Holdings (Beijing) Co., Ltd. Method and electronic device for obtaining and sorting associated information
US20180152767A1 (en) * 2016-11-30 2018-05-31 Alibaba Group Holding Limited Providing related objects during playback of video data
US20210295148A1 (en) * 2020-03-20 2021-09-23 Avid Technology, Inc. Adaptive Deep Learning For Efficient Media Content Creation And Manipulation
US20230108579A1 (en) * 2021-10-05 2023-04-06 Deepmind Technologies Limited Dynamic entity representations for sequence generation

Also Published As

Publication number Publication date
US20240221379A1 (en) 2024-07-04

Similar Documents

Publication Publication Date Title
Sundaram et al. A utility framework for the automatic generation of audio-visual skims
US10528821B2 (en) Video segmentation techniques
US10108709B1 (en) Systems and methods for queryable graph representations of videos
KR100828166B1 (en) Metadata extraction method using voice recognition and subtitle recognition of video, video search method using metadata, and recording media recording the same
WO2023011094A1 (en) Video editing method and apparatus, electronic device, and storage medium
US20020157116A1 (en) Context and content based information processing for multimedia segmentation and indexing
CN111797272A (en) Video content segmentation and search
US20250156642A1 (en) Semantic text segmentation based on topic recognition
US20240370661A1 (en) Generating summary prompts with visual and audio insights and using summary prompts to obtain multimedia content summaries
EP4550274A1 (en) Processing and contextual understanding of video segments
Nandzik et al. CONTENTUS—technologies for next generation multimedia libraries: Automatic multimedia processing for semantic search
Bost A storytelling machine?: automatic video summarization: the case of TV series
US12266175B2 (en) Combining visual and audio insights to detect opening scenes in multimedia files
Dumont et al. Automatic story segmentation for tv news video using multiple modalities
US11386163B2 (en) Data search method and data search system thereof for generating and comparing strings
CN113537215A (en) Method and device for labeling video label
Bretti et al. Find the cliffhanger: Multi-modal trailerness in soap operas
US20250139942A1 (en) Contextual understanding of media content to generate targeted media content
US20250142183A1 (en) Scene break detection
AU2024238949A1 (en) Systems and methods for automatically identifying digital video clips that respond to abstract search queries
Valdes et al. On-line video abstract generation of multimedia news
KR20200056724A (en) Method for analysis interval of media contents and service device supporting the same
Stein et al. From raw data to semantically enriched hyperlinking: Recent advances in the LinkedTV analysis workflow
US12439108B2 (en) Video clip learning model
Bost A storytelling machine?

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HOFFMAN, YONIT;KADOSH, MORDECHAI;FIGOV, ZVI;AND OTHERS;REEL/FRAME:062236/0001

Effective date: 20221229


FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE