US12266175B2 - Combining visual and audio insights to detect opening scenes in multimedia files - Google Patents
Combining visual and audio insights to detect opening scenes in multimedia files
- Publication number
- US12266175B2 (U.S. application Ser. No. 18/090,843)
- Authority
- US
- United States
- Prior art keywords
- scene
- media file
- classification
- index data
- opening song
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/61—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/24—Character recognition characterised by the processing or recognition method
- G06V30/242—Division of the character sequences into groups prior to recognition; Selection of dictionaries
- G06V30/244—Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font
- G06V30/245—Font recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/102—Programmed access in sequence to addressed parts of tracks of operating record carriers
Definitions
- Multimedia files—for example, series episodes and movies—often include an introduction with an opening song (hereinafter “opening song”).
- The characteristics of an opening song can vary, such as its length, its audio/visual (AV) content, and its temporal location within the video.
- Some opening songs will play concomitantly with the opening credits, while others will not.
- Some opening songs will play at the very beginning of the episode or movie, while others may play after one or two scenes of the episode or movie have already transpired.
- the ability to detect the placement of an opening song in a media file can be important for facilitating playback functionality, as well as for post-production editing of the content.
- the ability to automatically detect an opening song can facilitate a ‘skip intro’ capability—where a viewer can jump right to the main multimedia content and pass over the opening song.
- Precision is required: the system must detect the exact beginning and end of the opening song. Otherwise, a portion of the main content may be skipped incidentally, rather than just the opening song.
- the detection of an opening song can also facilitate selective processing of the media content (e.g., editing of the introduction in different languages).
- Disclosed embodiments are directed to systems and methods for classifying portions of multimedia content included in a media file.
- systems and methods are provided for facilitating the automatic detection of opening scene(s) (e.g., a predefined introduction or opening song) in multimedia content of a media file.
- the disclosed embodiments may include or be practiced on computing systems configured with modules for implementing the disclosed methods.
- the disclosed methods include acts for designating sequential blocks of time in the multimedia content as scene(s), then detecting certain feature(s) of those scene(s).
- the extracted scene feature(s) may be analyzed by machine learning model(s), or other type of artificial intelligence (AI) model(s), to classify those scenes as either part of, or not part of, the introduction/opening song, based on a probability derived from the scene feature(s).
- the machine learning model(s) may be trained so as to give higher or lower weight to certain scene feature(s), based on the past success of those feature(s) to accurately predict whether a scene is part of the introduction/opening song.
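The weighting scheme described above can be sketched as a simple logistic scoring function. The feature names, values, and weights below are illustrative assumptions, not values from the disclosure:

```python
import math

# Illustrative feature weights; in practice a trained model learns these.
WEIGHTS = {"credit_words": 2.0, "music_seconds": 0.05, "speech_rate": -0.5}

def score_scene(features, weights, bias=0.0):
    """Combine extracted scene features into an opening-song probability
    via a logistic function; higher-weight features contribute more."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# A scene with several credit words and sustained music scores high.
p = score_scene({"credit_words": 4, "music_seconds": 40, "speech_rate": 1.0},
                WEIGHTS)
```

Training (e.g., via the Scene Correction Model described in the disclosure) would then raise or lower individual weights based on how well each feature predicted opening-song membership.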
- FIG. 1 illustrates an example of how multimedia content of a multimedia file can be broken up into sequential blocks of time designated as scenes, in order to detect the location of a predefined introduction/opening song.
- FIG. 2 A illustrates an example flow chart for how scene(s) in a multimedia file can be analyzed for certain feature(s), and those feature(s) used to classify the scene(s) as either being part of, or not being part of, a predefined introduction/opening song.
- FIG. 2 B illustrates an example flow chart for how different types of data from multimedia content included in a multimedia file—for example, visual and/or audio data—can be analyzed for certain scene feature(s).
- FIG. 3 illustrates an example flow chart for how scene feature(s) can be analyzed using a scene classification model to determine a probability that each scene is part of a predefined introduction/opening song, and to classify each scene as either part of, or not part of, the introduction/opening song.
- FIG. 4 illustrates an example graph showing how a probability assigned to each scene in a multimedia file may be used to determine which scene, or series of scenes, constitutes a predefined introduction/opening song.
- FIG. 5 illustrates an example flow chart for automatically detecting a predefined introduction/opening song in a multimedia file.
- FIG. 6 illustrates an example computing environment in which a computing system incorporates and/or is utilized to perform disclosed aspects of the disclosed embodiments.
- Some of the disclosed embodiments are directed toward systems and methods for detecting a particular portion of multimedia files based on features extracted from the multimedia content of a multimedia file, as well as for tagging or otherwise indexing the detected portions of the multimedia files.
- A multimedia file, or media file, comprises multimedia content with associated metadata about the multimedia content. Additionally, the multimedia content can be formatted in various different file formats. In some instances, the multimedia file comprises the raw multimedia content or a compressed or compiled version of the multimedia content. Multimedia content refers to electronic content which comprises multiple different media types, for example, audio content and visual content. Features can be extracted from the multimedia content, wherein certain features correspond to each type of media represented in the multimedia content. It should be appreciated that the systems and methods disclosed, while described in application to multimedia content, may also be applied to media files comprising a single type of media.
- Some of the disclosed embodiments are specifically directed to improved systems and methods for automatically detecting opening scene(s) (e.g., an opening song) that are included in the multimedia file. This can be beneficial, particularly when considering conventional systems, for enabling playback and post-production editing of the media files without requiring manual review and editing of each file being processed. For at least this reason, the disclosed embodiments may be implemented to provide many technical advantages over existing media processing systems, as will now be described in more detail.
- a frame refers to any temporal unit associated with the multimedia file.
- a frame is selected based on structural and semantic properties associated with the temporal unit.
- a frame refers to a temporal unit comprising a still image associated with the multimedia file. In such instances, a plurality of frames is combined to form a moving picture.
- a frame refers to a temporal unit comprising a limited portion of an audio file. In such instances, a plurality of frames is combined to form continuous audio.
- a multimedia file comprising only a few minutes of multimedia content can contain thousands of frames. Processing those frames, identifying which of them correspond to the opening song, and then tagging the identified frames therefore carries a high cost, whether performed manually or by a computing system.
- Disclosed embodiments are directed to improved systems and methods for detection of an opening song to overcome the disadvantages of current detection solutions.
- the system of the present disclosure differs from prior systems in that it allows for the automatic detection of the opening song and automatic tagging of the multimedia file, without the need for manual/human detection or tagging.
- This automation significantly reduces the time and cost it takes to process and edit a multimedia file with segment tagging. It can also improve the consistency in which segment boundaries are identified, at least as compared to subjective/arbitrary tagging that is sometimes caused by human error and variations in human perception.
- the disclosed embodiments are able to achieve these aforementioned benefits of automatic detection by segmenting the multimedia file into scenes, as opposed to frames, as the building blocks for analyzing the multimedia file.
- Each segment of the multimedia file comprises a particular portion of multimedia content included in the multimedia file.
- the technical advantage of this is that there are far fewer scenes than frames in a multimedia file. This significantly reduces the computational expense of analyzing a limited number of scenes, instead of thousands of frames.
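The scale of the reduction is easy to illustrate with rough arithmetic; the frame rate and scene count below are illustrative assumptions:

```python
# Rough arithmetic behind the scenes-vs-frames claim; the frame rate
# and scene count are illustrative assumptions, not from the patent.
fps = 24
minutes = 30
frames = fps * 60 * minutes   # frames to inspect one-by-one
scenes = 30                   # an episode may contain only tens of scenes
reduction = frames / scenes   # how many fewer units need classifying
```

Under these assumptions, scene-level analysis classifies on the order of a thousand times fewer units than frame-level analysis.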
- a scene refers to a particular portion of the multimedia content which is characterized by having continuous and distinct features from an adjacent portion of the multimedia content.
- a scene is a multi-modality object which is extracted from electronic content. Scenes can be extracted based on visual and/or audio features. For example, a scene is typically associated with a particular set or environment in which the characters of the story are interacting. When one or more characters begin interacting in a different location of the story (e.g., set or geolocation), typically, a new scene has begun. In some instances, a scene involves the same set of characters or at least one or more same characters for some continuous length of time in the same environment. Because of the ability to detect a scene, features associated with a particular scene can be extracted and analyzed to determine which scenes are associated with the opening song.
- Additional technical benefits include improved training of machine learning models used to automatically detect the opening song, resulting in improved machine learning models which are more accurate, consistent, and transparent. Because multiple different features are extractable from the different scenes of the multimedia file, the machine learning model can be trained on different sets of features which help it to detect opening songs of new multimedia files. Each scene corresponds to a particular subset of all features that are or can be extracted from the multimedia content included in multimedia file. Features can also be extracted from metadata included in the multimedia file which corresponds to the multimedia content. These features contribute to the model, both in training and during run-time.
- each feature can be assigned, either manually or by a machine learning model, a particular weight that predicts how much that feature will contribute to the prediction that the scene corresponds to the opening song.
- Some features may be more indicative or distinctive of an opening song than other features. For example, some features, like a series of written names appearing on the scene, may correlate to the opening song more than other features, like a background melody playing, which may appear more frequently throughout the entire multimedia content instead of exclusively in the opening scene.
- the results of the machine learning model are more transparent to users. For example, a user is able to understand why the machine learning model returned a particular result (e.g., why a particular scene or set of scenes was detected as the opening song portion) because the user is able to see which features were identified and how important each feature was (e.g., the weight applied to each feature) in predicting whether the scene(s) corresponded to the opening song.
- a user can tune how many features are to be extracted and processed. For example, if a user wants to improve the accuracy of the results, the user or the machine learning model can select a higher or total number of features available.
- the user or the machine learning model can select a lower or limited number of features available (e.g., the categories of features that have the highest weights).
- when a new feature is introduced, the machine learning model is able to be trained on that new feature in isolation, or in combination with existing features, to update the model to be able to extract the new feature. The model can then use this new feature to augment and improve the detection of the opening song.
- the machine learning model is also configured to learn which feature, or combination of features, results in more accurate detection of the opening song for a particular type of media file. Accordingly, the machine learning model can add or omit certain features dynamically upon determining a particular type of media file.
- FIG. 1 illustrates, in one embodiment, a multimedia file 100 which is separated into sequential scene(s) 120 (e.g., scene 1, scene 2, scene 3, and so forth through scene 17). This may be done manually, or by an AI model, such as a machine learning model or other type of AI model.
- an AI model may recognize that the people/characters in a series of sequential shots in the multimedia file 100 do not change, or that the background scenery does not change.
- a shot is an inner unit of the scene, wherein a plurality of shots is identified in a media file. For example, a shot is a contiguous sequence of frames with the same or similar camera angle.
- a sequential subset of the plurality of shots is then aggregated into a scene. In some instances, multiple different subsets, where a subset comprises a certain number of shots, are aggregated into different scenes.
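A minimal sketch of the shot-aggregation step, assuming each shot has already been labeled with an environment; the labels and the change-of-environment rule are illustrative assumptions:

```python
def shots_to_scenes(shot_envs):
    """Aggregate sequential shots into scenes: a new scene begins
    whenever the (hypothetical) environment label changes."""
    scenes = []
    for i, env in enumerate(shot_envs):
        if not scenes or scenes[-1][0] != env:
            scenes.append((env, [i]))       # start a new scene
        else:
            scenes[-1][1].append(i)         # extend the current scene
    return scenes

grouped = shots_to_scenes(
    ["kitchen", "kitchen", "street", "street", "street", "kitchen"])
```

A production system would compare visual and audio features of adjacent shots rather than ready-made labels, but the aggregation logic is the same.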
- the model may classify a group of shots, frames, or blocks of sequential time 110 in the multimedia file, as individual scene(s) 120 .
- As shown in FIG. 1 , the file may be broken up into blocks of time 110 that are designated as sequential scene(s) 120 .
- Scene(s) 120 may be determined by analyzing both audio and visual data from multimedia content of the multimedia file 100 .
- the disclosed systems also analyze each scene, extract one or more features for each scene, and predict how likely each scene is to correspond to the opening song, such that the system is able to detect which scene(s) correspond to the opening song.
- scenes 1-8 constitute the opening song 130 .
- all of the scenes in the file are analyzed.
- the system refrains from analyzing the rest of the scenes, which reduces the computational expense of processing the file.
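The early-exit behavior can be sketched as a sequential scan with a patience parameter; the threshold and patience values are illustrative assumptions:

```python
def detect_opening(scene_probs, threshold=0.5, patience=2):
    """Scan per-scene probabilities from the start of the file and stop
    once `patience` consecutive scenes fall below the threshold;
    returns the indices of scenes kept as part of the opening."""
    opening, misses = [], 0
    for i, p in enumerate(scene_probs):
        if p >= threshold:
            opening.append(i)
            misses = 0
        else:
            misses += 1
            if misses >= patience:
                break   # refrain from analyzing the rest of the file
    return opening

kept = detect_opening([0.9, 0.8, 0.3, 0.85, 0.2, 0.1, 0.9])
```

Because the scan stops after two consecutive low scores, the final high-scoring scene in the example is never analyzed, which is the computational saving described above.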
- each scene 120 may be characterized by certain feature(s) 220 .
- the multimedia file 100 is processed using a Feature Extraction Model 210 to determine whether a particular scene possesses certain feature(s) 220 .
- the particular scene is classified by a Scene Classification Model 230 to determine the probability 410 (shown in FIG. 4 ) that the scene is associated with the opening song 130 . This process is repeated for some or all of the scenes in the multimedia file.
- different types of data from the multimedia content of multimedia file 100 , or from metadata included in multimedia file 100 , are analyzed by the Feature Extraction Model(s) 210 to determine the feature(s) 220 associated with each scene 120 .
- visual data 250 from the multimedia content of multimedia file 100 may be analyzed by Feature Extraction Model(s) 210 , such as an OCR Model or a Black Frame Detector model, to determine extracted scene feature(s) 220 .
- An OCR model may be applied to text identified in the scene, in order to perform character recognition on the text.
- the Black Frame Detector model may be used to detect a black or blank frame within a predetermined proximity to the scene (e.g., before or after a particular scene).
- the extracted scene feature(s) 220 taken from visual data 250 may include, for example: the number of known words (detecting words commonly used in opening credits, such as “introducing,” “producer,” or “produced by,” etc.); people names (because opening credits usually list the names of the actors, producers, etc.); and/or the existence of a black frame within a predetermined proximity of that scene.
- the extracted scene feature(s) 220 taken from visual data 250 may also include font characteristics of text that appears on the screen, including font size.
- the extracted scene feature(s) 220 taken from visual data 250 may also include known media assets, such as, for example, known TV show or movie names.
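A minimal sketch of the known-words feature, assuming OCR output is available as a list of text lines; the credit-term vocabulary is an illustrative assumption:

```python
# Illustrative vocabulary of credit-style terms; not from the patent.
CREDIT_TERMS = {"introducing", "producer", "produced", "directed", "starring"}

def visual_text_features(ocr_lines):
    """Count credit-style terms in OCR output, one of the visual
    features described above."""
    words = [w.lower().strip(".,:") for line in ocr_lines
             for w in line.split()]
    return {"known_credit_words": sum(w in CREDIT_TERMS for w in words),
            "total_words": len(words)}

feats = visual_text_features(["Produced by Jane Doe", "Starring John Roe"])
```

A real pipeline would feed the OCR model's per-frame output into this counter; a high credit-word count raises the scene's opening-song probability.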
- audio data 260 may also be analyzed by Feature Extraction Model(s) 210 , such as a speech-to-text model, a music detector model, and/or a silence detector model.
- the Feature Extraction Model(s) 210 may identify a language spoken in the scene and employ a speech-to-text function for the recognition and translation of a spoken language into text.
- the Feature Extraction Model(s) 210 may detect any music that plays during a scene and recognize the duration of time that the music plays during the scene.
- the feature extraction may also detect a particular volume associated with the music, or a volume change from one scene to the next, or volume contrast from music relative to spoken words.
- the Feature Extraction Model(s) 210 may detect silence of a predetermined duration that occurs either within a scene or within a predetermined time after a scene ends.
- the extracted scene feature(s) 220 taken from audio data 260 may include, for example: the rate of words spoken during a scene, the number of speakers participating in the scene, the amount of time that music plays in the scene, and/or the presence of silence that occurs near the end of a scene or just after the scene.
- the Feature Extraction Model(s) 210 used to analyze audio data 260 may involve diarization model(s).
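Two of these audio features can be sketched from hypothetical diarized transcript segments of the form (speaker, start_sec, end_sec, text); the segment layout is an illustrative assumption:

```python
def audio_features(segments, scene_start, scene_end):
    """Derive a word rate and speaker count for one scene from
    diarized transcript segments (speaker, start_sec, end_sec, text)."""
    words, speakers = 0, set()
    for speaker, start, end, text in segments:
        if start < scene_end and end > scene_start:  # overlaps the scene
            words += len(text.split())
            speakers.add(speaker)
    duration = scene_end - scene_start
    return {"words_per_sec": words / duration,
            "num_speakers": len(speakers)}

feats = audio_features(
    [("A", 0, 5, "hello there everyone"),
     ("B", 4, 9, "hi"),
     ("C", 20, 25, "much later dialogue")],
    scene_start=0, scene_end=10)
```

An opening song typically shows a low word rate and few distinct speakers relative to dialogue scenes, which is what makes these features discriminative.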
- the audio and visual data together 270 may be analyzed by Feature Extraction Model(s) 210 , such as a scenes creation model, to determine extracted scene feature(s) 220 —including, for example, the duration of the scene, the number of shots comprising the scene, and the scene location within multimedia file 100 .
- An opening song is typically at or near the beginning of a multimedia file.
- the results of the features analyses may be aggregated and used to classify each scene 120 with a probability that the scene is a part of the opening song 130 (as shown in FIG. 3 ).
- the extracted scene feature(s) 220 data may be analyzed using Scene Classification Model 230 to determine classified scene probability data 310 , wherein each scene 120 has been assigned a probability 410 (shown in FIG. 4 ) that the scene 120 is part of the opening song 130 .
- the opening song 130 may be determined using, for example, algorithmic heuristics which locate a sequence of scene(s) 120 that have been assigned sufficiently high probabilities 410 of being part of the opening song 130 .
- Scene Classification Model 230 may be trained using a Scene Correction Model 320 , which determines the success of each scene feature 220 in predicting the probability that scene 120 is part of the opening song 130 . The result would be that in subsequent applications of Scene Classification Model 230 , certain feature(s) 220 may be given more classification weight and other feature(s) 220 may be given less classification weight when determining the probability that a scene from scene(s) 120 is part of the opening song 130 , based on the past success (or failure) of those feature(s) 220 to correctly predict that the scene was part of the opening song 130 . For example, different text (either from the audio or visual data) associated with a scene may be given different classification weights by the Scene Classification Model 230 .
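The corrective re-weighting can be sketched as a perceptron-style update; the learning rate and error formulation are illustrative assumptions, not the disclosed training procedure:

```python
def update_weights(weights, features, predicted, actual, lr=0.1):
    """Perceptron-style correction: shift each feature's weight toward
    the true label in proportion to that feature's value, so features
    that contributed to a wrong prediction lose influence."""
    error = actual - predicted   # actual in {0, 1}, predicted in [0, 1]
    return {name: w + lr * error * features.get(name, 0.0)
            for name, w in weights.items()}

# The model under-predicted a true opening-song scene, so the weight
# of the active feature is nudged upward.
new_w = update_weights({"credit_words": 1.0}, {"credit_words": 2.0},
                       predicted=0.8, actual=1)
```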
- classified scene probability data 310 provides the probabilities 410 that each scene 120 is a part of the opening song 130 and can be analyzed to find the sequence of scene(s) 120 with sufficiently high probability 410 so as to constitute the opening song 130 .
- knowledge of adjacent scenes 120 may be combined to detect the full opening song 130 .
- scenes 2-4 and 6-8 have been assigned a fairly high probability 410 of being part of the opening song, while scene 5 has been assigned a lower probability 410 of being part of the opening song.
- scene 5 will be designated as part of the opening song, along with scenes 2-4 and 6-8, because of the high probability 410 of its neighboring scenes.
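The neighbor-based designation can be sketched as a gap-filling pass over the per-scene probabilities; the threshold is an illustrative assumption:

```python
def fill_gaps(scene_probs, high=0.7):
    """Mark scenes as part of the opening when they score high, then
    also mark any low-scoring scene whose immediate neighbors are
    both part of the opening (like scene 5 in the example above)."""
    flags = [p >= high for p in scene_probs]
    for i in range(1, len(flags) - 1):
        if not flags[i] and flags[i - 1] and flags[i + 1]:
            flags[i] = True
    return flags

# Index 4 scores low but sits between high-scoring neighbors.
flags = fill_gaps([0.1, 0.9, 0.8, 0.9, 0.3, 0.8, 0.9, 0.85, 0.1])
```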
- the machine learning model is tunable in determining how many more scenes should be analyzed after a scene is predicted to have a low probability score before refraining from analyzing the rest of the file.
- all of the scenes are analyzed.
- the scenes which correspond to the opening song can then be tagged. These tags can be used in post-editing to insert “intro skip” functions for a user to skip over the opening song while streaming the media file.
- the system analyzes a relative relationship between probability scores. For example, if a low probability score is followed by a high probability score with a score difference that meets or exceeds a pre-determined threshold, the scene with the high probability score is likely the beginning of the opening song. If a high probability score is followed by a low probability score with a score difference that meets or exceeds a pre-determined threshold, the system may predict that the scene with the high probability score is the end of the opening song.
- the system may determine that there is no opening song included in the media file, for example, if there is not a big enough difference between the probability scores of different scenes, or if the probability scores of the scenes do not meet or exceed a pre-determined threshold value.
- the threshold value can be pre-determined by a user or learned and set automatically by the computing system. In some instances, a different threshold value is chosen for different types or categories of media files or can be dynamically updated based on identifying the type of media file or based on certain features which have been extracted for one or more scenes.
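The relative-relationship heuristic can be sketched as a scan for large jumps between consecutive scores; the jump threshold is an illustrative assumption:

```python
def find_boundaries(scene_probs, jump=0.5):
    """Locate the opening song from large jumps between consecutive
    probability scores: a rise marks the start, a later drop marks the
    end. Returns (None, None) when no qualifying jump exists, i.e. the
    file likely contains no opening song."""
    start = end = None
    for i in range(1, len(scene_probs)):
        if start is None and scene_probs[i] - scene_probs[i - 1] >= jump:
            start = i
        elif start is not None and scene_probs[i - 1] - scene_probs[i] >= jump:
            end = i - 1
            break
    return start, end

bounds = find_boundaries([0.1, 0.15, 0.9, 0.85, 0.9, 0.2, 0.1])
```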
- FIG. 5 illustrates a flow diagram that includes various acts (act 510 , act 520 , act 530 , act 540 , act 550 , and act 560 ) associated with example methods that can be implemented by computing system 600 for performing opening song detection using a machine learning model.
- the disclosed methods for automatically detecting an opening song 130 in a multimedia file 100 may include, for example, the initial act of accessing a multimedia file 100 , which contains multimedia content (act 510 ).
- the multimedia file 100 may then be analyzed (as described above with regards to FIG. 1 ) to identify a scene (e.g., scene(s) 120 ) in the multimedia file 100 (act 520 ).
- feature(s) 220 of the multimedia content included in the multimedia file may be determined relative to their association with each scene 120 (act 530 ), as described above in regard to FIGS. 2 A and 2 B .
- Each scene 120 may then be scored with a probability 410 that the scene corresponds to a predefined opening song 130 (act 540 ), as described above regarding FIGS. 3 - 4 . Based on the probability 410 , the scene will then be classified as correlating, or not correlating, to the opening song 130 (act 550 ).
- the classifying weight of at least one feature 220 may be modified by Scene Correction Model 320 when determining a probability 310 that the new scene 120 b is part of the opening song 130 , based on the success of the machine learning model in accurately predicting whether the first scene 120 was part of the opening song 130 (act 560 ).
- the temporal location of the opening song 130 can be stored as index data that is associated with the multimedia file 100 .
- the index data may be associated with the multimedia file 100 by adding the index data as new metadata to the multimedia file 100 , or by adding the index data to an index that is stored separately from the multimedia file 100 .
- the index data may be associated with the multimedia file 100 in such a way as to enable a trick play function to be performed to the multimedia file 100 , during which the index data is referenced for skipping or fast-forwarding the opening scene(s) of the multimedia file 100 during the trick play function.
- the method further comprises generating index data that identifies a temporal location of the opening song in the media file and associating the index data with the media file.
- the system can associate the index data with the media file according to several different techniques. For example, the system performs the association by adding the index data as metadata to the media file or by adding the index data to an index that is stored separately from the media file. Additionally, or alternatively, the index data is associated with the media file in such a manner as to enable a trick play function to be performed to the media file during which the index data is referenced for skipping or fast-forwarding the opening song and corresponding scenes of the media file during the trick play function.
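A minimal sketch of index data supporting a skip-intro trick play function; the record layout and field names are illustrative assumptions:

```python
def build_index_entry(media_id, start_sec, end_sec):
    """Hypothetical index record storing the opening song's temporal
    location, kept as metadata or in a separate index."""
    return {"media_id": media_id,
            "opening_song": {"start": start_sec, "end": end_sec}}

def skip_intro(position_sec, entry):
    """Trick play: when playback sits inside the indexed opening song,
    jump to its end; otherwise leave the position unchanged."""
    segment = entry["opening_song"]
    if segment["start"] <= position_sec < segment["end"]:
        return segment["end"]
    return position_sec

entry = build_index_entry("episode-01", start_sec=5, end_sec=95)
```

A player would invoke the skip when the viewer presses a “skip intro” control while the current position falls inside the indexed interval.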
- identifying the feature includes identifying text in the scene by applying an OCR model to perform character recognition on text identified in the scene and wherein different text is associated with different classification weights.
- identifying the feature includes identifying a black frame within a predetermined proximity to the scene. Additionally, or alternatively, the system identifies one or more features by identifying a language spoken in the scene and by applying a speech to text model for the recognition and translation of spoken language into text and wherein different text is associated with different classification weights.
- the feature may be identified using a music detector to detect any music that plays during the scene and to recognize a duration of time that the music plays during the scene or using a silence detector to detect audio silence of a predetermined duration that occurs either within the scene or within a predetermined duration after the scene ends.
- Some example features that are identified are obtained from visual data and include font characteristics of text (e.g., font size). Some features that are identified are obtained from visual data and include: particular words, names of people, terms associated with production titles or credit attribution, OCR data, a size of text that appears on the screen, media assets, or a black frame within a scene or within a predetermined proximity to the scene. Some example features that are identified are obtained from audio data, including: a rate at which words are spoken during the scene, a quantity of unique speakers in the scene, a duration of time within the scene that music is played, or a predetermined duration of silence after the scene ends.
- Some example features that are identified are obtained from both visual and audio data, including: a scene duration, a quantity of camera shots in the scene, or a location of the scene within the media file.
- the system is able to classify the scene as either correlating, or not correlating, to the opening song.
- the system is able to classify the scene based at least in part on knowledge of neighboring scenes, in addition to features corresponding to the scene being classified. Additionally, the system is able to learn which features from a scene are associated with a higher probability that the scene correlates to an opening song and which features from a scene are associated with a lower probability that the scene correlates to the opening song.
- the system accesses a media file containing multimedia content and applies the trained machine learning model (i.e., classification model) to the media file to identify a temporal location of an opening song in the multimedia content of the media file. This is done by generating index data that identifies the temporal location of the opening song in the multimedia content based on the identified temporal location and by associating the index data with the media file.
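The flow just described can be sketched as applying a per-scene classifier in temporal order, merging the first contiguous run of positively classified scenes into one temporal location, and emitting index data for it. All names and record shapes below are illustrative assumptions, not the patent's own interfaces.

```python
def locate_opening_song(scenes, classify):
    """scenes: list of (start_s, end_s, features) in temporal order.
    Returns (start_s, end_s) of the first contiguous run of scenes the
    classifier marks as the opening song, or None if no scene qualifies."""
    run_start = run_end = None
    for start, end, features in scenes:
        if classify(features):
            if run_start is None:
                run_start = start            # the run begins here
            run_end = end                    # extend the run
        elif run_start is not None:
            break                            # the opening song is one contiguous run
    return (run_start, run_end) if run_start is not None else None

def build_index_data(media_file_id, location):
    """Index data identifying the temporal location of the opening song."""
    return {"media_file": media_file_id,
            "opening_song": {"start_s": location[0], "end_s": location[1]}}
```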
- FIG. 6 illustrates components of a computing system 600 which may include and/or be used to implement aspects of the disclosed methods.
- the computing system 600 comprises a plurality of AI models—for example, a Feature Extraction Model 210 , a Scene Classification Model 230 , and a Scene Correction Model 320 .
- Computing system 600 is able to utilize different AI models, and/or different types of AI models.
- Scene Classification Model 230 is configured as a machine learning model, such as a classification model.
- a machine learning model is a particular type of AI model which is configured to recognize and learn from patterns identified in datasets and utilize that learning to improve itself for a particular task for which the machine learning model is trained.
- a classification model is a type of machine learning model trained to perform one or more different classification tasks. Classification tasks are a type of predictive modeling problem in which the machine learning model is trained to predict a class label for a particular set or subset (e.g., a scene) of input data (e.g., multimedia content of a multimedia file).
- FIG. 6 illustrates how the computing system 600 is one part of a distributed computing environment that also includes remote (e.g., third party) system(s) 660 in communication (via a network 650 ) with the computing system 600 .
- the computing system 600 is configured to train a plurality of AI models for automatically detecting scene feature(s) and probabilities that scene(s) are part of a predefined opening song.
- the computing system 600 is also configured to generate training data configured for training the AI models.
- the computing system 600 includes a processing system with one or more processor(s) 610 (such as one or more hardware processor(s)) and storage (e.g., hardware storage device(s) 630) storing computer-executable instructions. One or more of the hardware storage device(s) 630 is able to house any number of data types and any number of computer-executable instructions by which the computing system 600 is configured to implement one or more aspects of the disclosed embodiments when those instructions are executed by the one or more processor(s) 610.
- the computing system 600 is also shown including input/output (I/O) device(s) 620 .
- hardware storage device(s) 630 is shown as a single storage unit. However, it will be appreciated that the hardware storage device(s) 630 is configurable as distributed storage, spread across several separate and sometimes remote and/or third-party system(s) 660, such as remote client system 660 A and remote client system 660 B.
- Remote client system 660 A comprises at least a processor 670 A and hardware storage device 680 A.
- Remote client system 660 B comprises at least a processor 670 B and hardware storage device 680 B.
- the computing system 600 can also comprise a distributed system with one or more of the components of computing system 600 being maintained/run by different discrete systems that are remote from each other and that each performs different tasks. In some instances, a plurality of distributed systems performs similar and/or shared tasks for implementing the disclosed functionality, such as in a distributed cloud environment.
- the hardware storage device(s) 630 are configured to store the different data types including multimedia file(s) 100 and index data 640 . Once the location of the opening song 130 has been determined, the location may be added as index data 640 associated with the multimedia file 100 .
- the index data 640 may be associated with the multimedia file 100 by adding the index data 640 as new metadata to the multimedia file 100. Alternatively, the index data 640 may be associated with the multimedia file 100 by adding the index data 640 to an index that is stored separately from the multimedia file 100.
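The two association options above can be sketched as follows: embedding the index data as new metadata on the media file's record, or recording it in an index kept separately from the file. The record and store shapes are illustrative assumptions.

```python
def attach_as_metadata(media_record, index_data):
    """Option 1: add the index data as new metadata on the file itself."""
    media_record.setdefault("metadata", {})["opening_song_index"] = index_data
    return media_record

def attach_to_separate_index(index_store, media_file_id, index_data):
    """Option 2: record the index data in a separately stored index,
    keyed by the media file's identifier."""
    index_store[media_file_id] = index_data
    return index_store
```

The separate-index option leaves the media file untouched, which matters when the file is immutable or served from a read-only store.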
- the storage (e.g., hardware storage device(s) 630 ) includes computer-executable instructions for instantiating or executing one or more of the models and/or engines shown in computing system 600 .
- the models (for example, Feature Extraction Model 210, Scene Classification AI Model 230, and Scene Correction Model 320) are configured as AI models.
- the one or more models are configured as engines or processing systems (e.g., computing systems integrated within computing system 600 ), wherein each engine (e.g., model) comprises one or more processors (e.g., hardware processor(s) 610 ) and computer-executable instructions corresponding to the computing system 600 .
- the computing system 600 is provided for training and/or utilizing a machine learning model (e.g., a trained classification model) that is trained to classify different portions of multimedia content included in a media file. For example, the computing system 600 identifies a particular portion (e.g., a frame, a shot, a scene, or other predefined subset of multimedia content) in a multimedia content of a media file. The computing system 600 then identifies a feature associated with the particular portion and scores the particular portion for a probability that the particular portion corresponds to a particular classification based on a classification weight of the machine learning model that is assigned to the feature.
- classifications include an opening scene, an opening song, a closing scene, a closing song, a recap of a previous episode or season of a television series, an opening credit, a closing credit, or other particular classification associated with multimedia or media content of a media file.
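The example classifications listed above can be represented as an enumeration over which the scoring step predicts. This is purely illustrative; the patent does not fix a label set.

```python
from enum import Enum, auto

class SceneClass(Enum):
    OPENING_SCENE = auto()
    OPENING_SONG = auto()
    OPENING_CREDIT = auto()
    CLOSING_SCENE = auto()
    CLOSING_SONG = auto()
    CLOSING_CREDIT = auto()
    RECAP = auto()   # recap of a previous episode or season
    OTHER = auto()   # any other classification of media content
```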
- the computing system 600 classifies the particular portion as correlating to the particular classification, or alternatively, classifies the particular portion as not correlating to the particular classification. Based on the classification for the particular portion, the computing system 600 modifies the classification weight of the machine learning model to generate a trained classification model.
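The weight-modification step can be illustrated with a perceptron-style update: when the classification disagrees with the training label, each contributing feature's weight is nudged toward the labeled outcome. The learning rate and names are illustrative; the actual training procedure is not specified at this level of detail.

```python
def update_weights(weights, features, predicted, label, lr=0.1):
    """Modify each feature's classification weight based on whether the
    predicted classification matched the labeled classification."""
    error = (1.0 if label else 0.0) - (1.0 if predicted else 0.0)
    for name, value in features.items():
        # No change when predicted == label (error is 0.0).
        weights[name] = weights.get(name, 0.0) + lr * error * value
    return weights
```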
- the computing system 600 is then able to apply the trained classification model to a new media file to identify a temporal location of the particular classification in the new multimedia content included in the new media file.
- Computing system 600 generates index data that identifies the temporal location of the particular classification in the multimedia content based on the identified temporal location and associates the index data with the new media file.
- system 600 of FIG. 6 , which is configured with one or more hardware processors and computer storage that stores computer-executable instructions that, when executed by the one or more processors, cause various functions to be performed, such as the acts recited above.
- Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures.
- Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system.
- Computer-readable media that store computer-executable instructions are physical storage media.
- Computer-readable media that carry computer-executable instructions are transmission media.
- embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: physical computer-readable storage media and transmission computer-readable media.
- Physical computer-readable storage media includes random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disk read-only memory (CD-ROM) or other optical disk storage (such as DVDs, etc.), magnetic disk storage or other magnetic storage devices, or any other hardware storage devices which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
- a “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices.
- a network or another communications connection can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above are also included within the scope of computer-readable media.
- program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission computer-readable media to physical computer-readable storage media (or vice versa).
- program code means in the form of computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer-readable physical storage media at a computer system.
- computer-readable physical storage media can be included in computer system components that also (or even primarily) utilize transmission media.
- Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
- the computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
- the functionality described herein can be performed, at least in part, by one or more hardware logic components.
- illustrative types of hardware logic components include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
Claims (21)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/090,843 US12266175B2 (en) | 2022-12-29 | 2022-12-29 | Combining visual and audio insights to detect opening scenes in multimedia files |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20240221379A1 US20240221379A1 (en) | 2024-07-04 |
| US12266175B2 true US12266175B2 (en) | 2025-04-01 |
Family
ID=91665873
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/090,843 Active US12266175B2 (en) | 2022-12-29 | 2022-12-29 | Combining visual and audio insights to detect opening scenes in multimedia files |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US12266175B2 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20250217061A1 (en) * | 2024-01-02 | 2025-07-03 | Rivian Ip Holdings, Llc | Electric vehicle data based storage control |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120233642A1 (en) * | 2011-03-11 | 2012-09-13 | At&T Intellectual Property I, L.P. | Musical Content Associated with Video Content |
| US8386506B2 (en) * | 2008-08-21 | 2013-02-26 | Yahoo! Inc. | System and method for context enhanced messaging |
| US20150279344A1 (en) * | 2014-03-28 | 2015-10-01 | Than Van Nguyen | Bird Whistle |
| US20170025152A1 (en) * | 2014-03-17 | 2017-01-26 | Manuel Jaime | Media clip creation and distribution systems, apparatus, and methods |
| US20170193094A1 (en) * | 2015-12-31 | 2017-07-06 | Le Holdings (Beijing) Co., Ltd. | Method and electronic device for obtaining and sorting associated information |
| US20180152767A1 (en) * | 2016-11-30 | 2018-05-31 | Alibaba Group Holding Limited | Providing related objects during playback of video data |
| US20210295148A1 (en) * | 2020-03-20 | 2021-09-23 | Avid Technology, Inc. | Adaptive Deep Learning For Efficient Media Content Creation And Manipulation |
| US20230108579A1 (en) * | 2021-10-05 | 2023-04-06 | Deepmind Technologies Limited | Dynamic entity representations for sequence generation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HOFFMAN, YONIT;KADOSH, MORDECHAI;FIGOV, ZVI;AND OTHERS;REEL/FRAME:062236/0001 Effective date: 20221229 Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNOR'S INTEREST;ASSIGNORS:HOFFMAN, YONIT;KADOSH, MORDECHAI;FIGOV, ZVI;AND OTHERS;REEL/FRAME:062236/0001 Effective date: 20221229 |
|
| FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
| STCF | Information on status: patent grant |
Free format text: PATENTED CASE |