CN116170651A - Method, system and storage medium for generating highlight moment video from video and text input - Google Patents

Method, system and storage medium for generating highlight moment video from video and text input

Info

Publication number
CN116170651A
Authority
CN
China
Prior art keywords
time
video
event
text
input
Prior art date
Legal status
Pending
Application number
CN202210979659.0A
Other languages
Chinese (zh)
Inventor
周昕
亢乐
程治宇
田浩
卢大明
李大鹏
荀镜雅
王健宇
陈曦
李幸
Current Assignee
Baidu USA LLC
Original Assignee
Baidu USA LLC
Priority date
Filing date
Publication date
Priority claimed from US 17/533,769 (published as US 2022/0189173 A1)
Application filed by Baidu USA LLC filed Critical Baidu USA LLC
Publication of CN116170651A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/44 Event detection

Abstract

Systems, methods, and datasets for automatically and accurately generating a highlight moment video or summary video of content are provided herein. In one or more embodiments, the inputs are text (e.g., an article) describing key events (e.g., goals, player actions, etc.) in an activity (e.g., a game, a concert, etc.) and one or more videos of the activity. In one or more embodiments, the output is a short video of one or more events in the text, where the video may include commentary and/or other audio (e.g., music) for the highlight moment event, which may also be automatically synthesized.

Description

Method, system and storage medium for generating highlight moment video from video and text input
Cross Reference to Related Applications
This patent application is a continuation-in-part of, and claims priority to, co-pending and commonly-owned U.S. patent application No. 17/393,373, filed in 2021, entitled "AUTOMATICALLY AND PRECISELY GENERATING HIGHLIGHT VIDEOS WITH ARTIFICIAL INTELLIGENCE," and listing Zhiyu Cheng, Le Kang, Xin Zhou, Hao Tian, and Xing Li as inventors (attorney docket No. 28888-2450 (BN201118USN1)), which in turn claims priority to co-pending and commonly-owned U.S. provisional patent application No. 63/124,832, filed in 2020 and entitled "AUTOMATICALLY AND PRECISELY GENERATING HIGHLIGHT VIDEOS WITH ARTIFICIAL INTELLIGENCE." Each of the aforementioned patent documents is incorporated by reference herein in its entirety.
Technical Field
The present disclosure relates generally to systems and methods for computer learning that may provide improved computer performance, features, and uses. More particularly, the present disclosure relates to systems and methods for automatically generating summaries or highlights of content.
Background
With the rapid development of Internet technology and emerging tools, online video content (such as sports-related or other event video) is growing at an unprecedented rate. Especially during the COVID-19 pandemic, the amount of online video viewing has surged because fans have not been allowed to attend events in person (e.g., sporting or theatrical events). Creating highlight moment videos or other event-related videos typically involves manual work to edit the original untrimmed videos. For example, the most popular sports videos are often short clips of a few seconds, yet it is very challenging for machines to accurately understand videos and locate key events. Combined with the sheer volume of raw content, breaking the raw content down into appropriate highlight moment videos is very time consuming and costly. Moreover, given the limited time viewers have to watch content, it is important that they be able to obtain condensed content that properly captures the salient elements or events.
Accordingly, there is a need for systems and methods that automatically and accurately generate refined or condensed video content, such as highlight moment videos.
Disclosure of Invention
An aspect of the present disclosure provides a computer-implemented method comprising: given input text referring to an event in an activity, parsing the input text using a text parsing module to identify the event referred to in the input text; and converting the input text to TTS generated audio using a text-to-speech TTS module; given at least a portion of the activity and an input video of the identified event: performing time anchoring to associate a runtime of the input video with a runtime of the activity; identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, thereby generating an initial clip from the input video including the event; extracting features from the initial video clip; obtaining a final time value of the event in the initial video clip using the extracted features and the trained neural network model; generating a final video clip by editing the initial video clip to have a run-time consistent with the run-time of the TTS generated audio in response to the run-time of the initial video clip not consistent with the run-time of the TTS generated audio; and responsive to the run time of the initial video clip coinciding with the run time of the TTS-generated audio, using the initial video clip as the final video clip; and combining the TTS generated audio with the final video clip to generate an event highlight moment video.
Another aspect of the present disclosure provides a system, comprising: one or more processors; and a non-transitory computer-readable medium comprising one or more sets of instructions that, when executed by at least one of the one or more processors, cause the following steps to be performed, the steps comprising: given input text referring to an event in an activity, parsing the input text using a text parsing module to identify the event referred to in the input text; and converting the input text to TTS generated audio using a text-to-speech TTS module; given at least a portion of the activity and an input video of the identified event: performing time anchoring to associate a runtime of the input video with a runtime of the activity; identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, thereby generating an initial clip from the input video including the event; extracting features from the initial video clip; obtaining a final time value of the event in the initial video clip using the extracted features and the trained neural network model; generating a final video clip by editing the initial video clip to have a run-time consistent with the run-time of the TTS generated audio in response to the run-time of the initial video clip not consistent with the run-time of the TTS generated audio; and responsive to the run time of the initial video clip coinciding with the run time of the TTS-generated audio, using the initial video clip as the final video clip; and combining the TTS generated audio with the final video clip to generate an event highlight moment video.
Yet another aspect of the disclosure provides a non-transitory computer-readable medium comprising one or more sequences of instructions which, when executed by at least one processor, cause the following steps to be performed, the steps comprising: given input text referring to an event in an activity, parsing the input text using a text parsing module to identify the event referred to in the input text; and converting the input text to TTS generated audio using a text-to-speech TTS module; given at least a portion of the activity and an input video of the identified event: performing time anchoring to associate a runtime of the input video with a runtime of the activity; identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, thereby generating an initial clip from the input video including the event; extracting features from the initial video clip; obtaining a final time value of the event in the initial video clip using the extracted features and the trained neural network model; generating a final video clip by editing the initial video clip to have a run-time consistent with the run-time of the TTS generated audio in response to the run-time of the initial video clip not consistent with the run-time of the TTS generated audio; and responsive to the run time of the initial video clip coinciding with the run time of the TTS-generated audio, using the initial video clip as the final video clip; and combining the TTS generated audio with the final video clip to generate an event highlight moment video.
Drawings
Reference will be made to embodiments of the present disclosure, examples of which may be illustrated in the accompanying drawings. The drawings are illustrative, rather than limiting. While the present disclosure is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the disclosure to these particular embodiments. The items in the drawings may not be to scale.
FIG. 1 depicts an overview of a high-light moment generation system according to an embodiment of the present disclosure;
FIG. 2 depicts an overview method for a training generation system in accordance with an embodiment of the present disclosure;
FIG. 3 depicts a general overview of a dataset generation process according to an embodiment of the present disclosure;
FIG. 4 summarizes comments and tags of some crowd-sourced text data in accordance with embodiments of the present disclosure;
FIG. 5 summarizes the collected untrimmed video of a game in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates an embodiment of a user interface designed for human annotation of event times in a video in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates a method for event time and video runtime association according to an embodiment of the present disclosure;
FIG. 8 illustrates an example of identifying timer numbers in a game video in accordance with an embodiment of the present disclosure;
FIG. 9 depicts a method for generating clips from input video in accordance with an embodiment of the present disclosure;
FIG. 10 illustrates feature extraction according to an embodiment of the present disclosure;
FIG. 11 illustrates a pipeline for extracting features according to an embodiment of the present disclosure;
FIG. 12 illustrates a neural network model that may be used to extract features in accordance with an embodiment of the present disclosure;
FIG. 13 depicts feature extraction using a slow-fast neural network model according to an embodiment of the present disclosure;
FIG. 14 depicts a method for audio feature extraction and event of interest temporal prediction in accordance with an embodiment of the present disclosure;
FIG. 15A shows an example of an original audio waveform, and FIG. 15B shows its corresponding mean absolute value feature, according to an embodiment of the present disclosure;
FIG. 16 illustrates a method for predicting time of an event of interest in a video in accordance with an embodiment of the present disclosure;
FIG. 17 illustrates a pipeline for time positioning according to an embodiment of the present disclosure;
FIG. 18 depicts a method for predicting a likelihood of an event of interest in a video clip in accordance with an embodiment of the present disclosure;
FIG. 19 illustrates a pipeline for action location prediction according to an embodiment of the present disclosure;
FIG. 20 depicts a method for predicting a likelihood of an event of interest in a video clip in accordance with an embodiment of the present disclosure;
FIG. 21 illustrates a pipeline for final temporal prediction using an integrated neural network model, according to an embodiment of the present disclosure;
FIG. 22 depicts a goal positioning result compared to another method in accordance with an embodiment of the present disclosure;
FIG. 23 shows the goal positioning results of 3 clips, with the integrated learning achieving the best results, according to embodiments of the present disclosure;
FIGS. 24A and 24B depict a system for generating summary or highlight moment video from video input and text summary input in accordance with an embodiment of the present disclosure;
FIG. 25 depicts a method for extracting information from input text in accordance with an embodiment of the present disclosure;
FIG. 26 illustrates a method for generating a player database according to an embodiment of the present disclosure;
FIG. 27 depicts a method for combining video clips and corresponding audio clips to create a summary video in accordance with an embodiment of the present disclosure;
FIG. 28 depicts a simplified block diagram of a computing device/information handling system according to an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the present disclosure. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without these details. Furthermore, those skilled in the art will recognize that the embodiments of the present disclosure described below can be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer readable medium.
The components or modules shown in the figures are examples of exemplary embodiments of the present disclosure and are intended to avoid obscuring the present disclosure. It should also be understood that throughout the discussion, components may be described as separate functional units, which may include sub-units, but those skilled in the art will recognize that various components or portions thereof may be divided into separate components or may be integrated together, including, for example, in a single system or component. It should be noted that the functions or operations discussed herein may be implemented as components. The components may be implemented in software, hardware, or a combination thereof.
Furthermore, the connections between components or systems in the figures are not limited to direct connections. Rather, the data between these components may be modified, reformatted, or otherwise changed by intermediate components. Further, additional or fewer connections may be used. It should also be noted that the terms "coupled," "connected," "communicatively coupled," "joined," "interface," or any derivatives thereof should be construed as including direct connection, indirect connection via one or more intermediary devices, and wireless connection. It should also be noted that any communication such as a signal, response, acknowledgement, message, query, etc. may include one or more exchanges of information.
Reference in the specification to "one or more embodiments," "a preferred embodiment," "one embodiment," "an embodiment," etc., means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the disclosure and may be included in multiple embodiments. Furthermore, the appearances of the above-identified phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
Certain terminology is used throughout this description for the sake of illustration and should not be taken as limiting. A service, function, or resource is not limited to a single service, function, or resource; use of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. The terms "include," "comprising," "includes," and "including" are to be construed as open-ended terms, and any list that follows is intended to be exemplary and not limiting of the listed items. A "layer" may include one or more operations. The words "optimal," "optimizing," and the like refer to an improvement in a result or process and do not require that a given result or process have reached an "optimal" or peak state. The terms memory, database, information store, table, hardware, cache, and the like may be used herein to refer to one or more system components into which information may be entered or otherwise recorded.
In one or more embodiments, the stop condition may include: (1) a set number of iterations have been performed; (2) an amount of processing time has been reached; (3) Convergence (e.g., the difference between successive iterations is less than a first threshold); (4) divergence (e.g., performance degradation); and (5) acceptable results have been achieved.
Those skilled in the art will recognize that: (1) optionally performing certain steps; (2) steps may not be limited to the specific order described herein; (3) certain steps may be performed in a different order; and (4) certain steps may be performed simultaneously.
Any headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. Each reference/document mentioned in this patent document is incorporated herein by reference in its entirety.
It should be noted that any experiments and results provided herein are provided by way of illustration and were performed under specific conditions using one or more specific embodiments; therefore, none of these experiments nor their results should be used to limit the scope of the disclosure of this patent document.
It should also be noted that while the embodiments described herein may be in the context of a sporting event (e.g., football), aspects of the present disclosure are not so limited. Accordingly, aspects of the present disclosure may be applied to or adapted for use in other environments.
A. General description of the invention
1. General overview
Presented herein are embodiments for automatically, accurately, and at scale generating highlight moment videos. For illustration, a football match will be used. It should be noted, however, that embodiments herein may be used or adapted for use with other sports and non-sports events, such as concerts, performances, talk shows, demonstrations, news, shows, video games, sports events, animations, social media postings, movies, and the like. Each of these activities may be referred to as an occurrence or event, and the time at which a key event occurs may be referred to as an event of interest, an occurrence, or a time of occurrence.
Using a large-scale multi-modal dataset, state-of-the-art deep learning models are created and trained to detect one or more events in a game (e.g., goals), although other events of interest (e.g., injuries, fights, red cards, corner kicks, penalty kicks, etc.) may also be used. Embodiments of an integrated learning module are also provided herein to improve the performance of event-of-interest localization.
Fig. 1 depicts an overview of a highlight moment generation system according to an embodiment of the present disclosure. In one or more embodiments, large-scale crowd-sourced text data and untrimmed football game videos are collected and fed into a series of data processing tools to generate candidate long clips (e.g., 70 seconds, although other lengths of time may be used) containing major game events of interest (e.g., goal events). In one or more embodiments, the novel event-of-interest localization pipeline precisely locates the time of the event within the clip. Finally, embodiments may construct one or more customized highlight moment videos/stories around the detected highlight moments.
FIG. 2 depicts an overview method for a training generation system in accordance with an embodiment of the present disclosure. In order to train the generation system, a large-scale multi-modal dataset of event-related data must be generated or obtained for use as training data (205). Because the video runtime may not correspond to a time in an event, in one or more embodiments, time anchoring is performed for each video in a set of training videos to associate the video runtime with the event time (210). The metadata (e.g., comments and/or tags) and the associated time obtained by the time anchor may then be used to identify an approximate time for the event of interest to generate a clip from the video that includes the event of interest (215). By using clips instead of the entire video, the processing requirements are greatly reduced. For each clip, features are extracted (220). In one or more embodiments, a set of pre-trained models may be used to obtain the extracted features, which may be multi-modal.
In one or more embodiments, for each clip, a final time value of the event of interest is obtained using a neural network model (225). In an embodiment, the neural network model may be an integrated module that receives features from a set of models and outputs final time values. Given a predicted final time value for each clip, comparing the predicted final time value with its corresponding true value to obtain a loss value (230); and the model may be updated with the loss value (235).
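For concreteness, a minimal sketch of this training loop is shown below in PyTorch. The feature extractor, the time-prediction model, and the use of a mean-squared-error loss are stand-in assumptions for the modules and loss described above, not the exact implementation of the embodiments.

```python
import torch
import torch.nn as nn

def train_time_localizer(feature_extractor, time_model, clips, gt_times,
                         epochs=10, lr=1e-4):
    """Sketch of steps 220-235: extract features for each clip, predict the
    event time, compare against the ground-truth time, and update the model.
    gt_times: ground-truth event times as 1-element float tensors."""
    optimizer = torch.optim.Adam(time_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # assumed L2 loss between predicted and true event times
    for _ in range(epochs):
        for clip, gt_time in zip(clips, gt_times):
            feats = feature_extractor(clip)      # multi-modal features (step 220)
            pred_time = time_model(feats)        # predicted event time (step 225)
            loss = loss_fn(pred_time, gt_time)   # compare to true value (step 230)
            optimizer.zero_grad()
            loss.backward()                      # update model parameters (step 235)
            optimizer.step()
    return time_model
```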
Once trained, the generation system may be output and, given an input event video, a highlight moment video is generated using the generation system.
2. Related work
In recent years, artificial intelligence has been applied to the analysis of video content and the generation of videos. In sports analytics, many computer vision techniques have been developed to understand sports broadcasts. In particular, in soccer, researchers have proposed algorithms that identify key game events and player actions, use player body orientation to analyze pass feasibility, detect events by combining audio and video streams, use broadcast streams and trajectory data to identify group activities on the field, aggregate deep frame features to locate major game events, and process the inherent temporal patterns representing these actions using temporal context information around the actions.
For various video understanding tasks, deep neural networks are trained with large-scale data sets. Recent challenges include finding time boundaries for an activity or locating events in the time domain. In football video understanding, some define a goal event as the moment when a ball crosses the goal line.
In one or more embodiments, this definition of a goal is adopted, state-of-the-art deep learning models and methods and audio stream processing techniques are utilized, and, in embodiments, an integrated learning module is employed to accurately locate events in a football video clip.
3. Some contributions of the embodiments
In this patent document, embodiments of an automatic highlight moment generation system capable of accurately recognizing the occurrence of an event in a video are proposed. In one or more embodiments, the system can be used to generate highlight moment videos at scale without the need for conventional manual editing work. Some of the contributions provided by one or more embodiments include, but are not limited to, the following:
-creating a large-scale multi-modal football dataset comprising crowd-sourced text data and high-definition videos. Moreover, in one or more embodiments, various data processing mechanisms are applied to parse, clean, and annotate the collected data.
-aligning multi-modal data from multiple sources and generating candidate long video clips by cutting the original videos into 70-second clips using parsed tags from the crowd-sourced comment data.
Embodiments of an event localization pipeline are presented herein. Embodiments extract high-level feature representations from multiple perspectives and apply a temporal localization method to help locate events in a clip. In addition, embodiments are designed with an integrated learning module to improve event localization performance. It should be noted that although the activity may be a football match and the event of interest may be a goal, embodiments may be used or adapted for other activities and other events of interest.
Experimental results show that, with respect to locating a goal event in the clip, the tested embodiment reached an accuracy of approximately 1 (0.984) with a tolerance of 5 seconds, which is superior to existing work and sets a new state of the art. This result helps to accurately capture the goal time and accurately generate highlight moment videos.
4. Patent literature layout
The patent document is organized as follows: section B describes creating a dataset and how the data is collected and annotated. Embodiments of methods for constructing embodiments of the highlight moment generation system, and of methods for how to accurately locate a goal event in a football video clip using the proposed methods are presented in section C. Experimental results are summarized and discussed in section D. It should be reiterated that the use of a football game as overall content and the use of a goal as an event within that content are presented by way of example only, and those skilled in the art will recognize that aspects herein may be applied to other content areas (including areas other than the game area) and other events.
B. Data processing embodiment
To train and develop system embodiments, a large-scale multi-modal dataset is created. FIG. 3 depicts a general overview of a dataset generation process according to an embodiment of the present disclosure. In one or more embodiments, one or more comments and/or tags associated with a video of an activity are collected (305). For example, football game comments and tags (e.g., corner kicks, goal kicks, blocks, headers, etc.) from a website or other source may be crawled (see, e.g., tags and comments 105 in FIG. 1) to obtain data. In addition, videos associated with the metadata (i.e., comments and/or tags) are also collected (305). For the embodiments herein, high-definition (HD) untrimmed football game videos from various sources are collected. The start of the game is annotated in the raw untrimmed videos using Amazon Mechanical Turk (AMT) (315). In one or more embodiments, the metadata (e.g., comment and/or tag information) may be used to help identify the approximate time of an event of interest to generate clips that include the event of interest (e.g., clips of goals) from the video (320). Finally, the precise time of the event of interest (e.g., a goal) in the processed video clip is identified using Amazon Mechanical Turk (AMT). The annotated goal time may be used as a ground-truth value during the training of embodiments of the goal positioning model.
1. Data collection embodiment
In one or more embodiments, sports websites are crawled for over 1,000,000 comments and tags covering over 10,000 football matches from various tournaments from the 2015 to 2020 seasons. FIG. 4 summarizes comments and tags in some crowd-sourced text data according to embodiments of the present disclosure.
The comments and tags provide a large amount of information for each game. For example, they include the game date, team names, tournament, game event times (e.g., in minutes), event tags (such as goal, corner kick, foul, etc.), and associated player names. These comments and tags from the crowd-sourced data can be translated into, or can be considered as, rich metadata for the raw video processing embodiments as well as the highlight moment video generation embodiments.
Over 2,600 high-definition (720P or higher) untrimmed football game videos were also collected from various online sources. The games come from various tournaments from 2014 to 2020. FIG. 5 summarizes the collected untrimmed game videos in accordance with an embodiment of the present disclosure.
2. Data annotation embodiment
In one or more embodiments, the raw videos are first sent to Amazon Mechanical Turk (AMT) workers to mark the start time of the game (defined as the time when the referee blows the whistle to start the game), and then the crowd-sourced game comments and tags are parsed to obtain the goal times, in minutes, for each game. By combining the goal minute tag with the game start time in the video, a candidate 70-second clip containing a goal event is generated. Next, in one or more embodiments, the candidate clips are sent to AMT for marking the goal times in seconds. FIG. 6 illustrates a user interface embodiment designed for AMT goal time tagging in accordance with an embodiment of the present disclosure.
For the goal time annotation on AMTs, each HIT (human intelligence task, one worker task) contains one (1) candidate clip. Each HIT is assigned to five (5) AMT workers and the median timestamp value is collected as the true value tag.
C. Method embodiment
In this section, details of an embodiment of each of the five modules of the highlight moment generation system are given. As a brief overview, the first module embodiment, in Section C.1, is a game time anchoring embodiment that examines the temporal integrity of the video and maps any time in the game to a time in the video.
The second module embodiment, in Section C.2, is a coarse interval extraction embodiment. This module is the main distinction with respect to commonly studied event localization pipelines. In an embodiment of this module, a 70-second interval in which a particular event is located is extracted by utilizing text metadata (although other interval sizes may be used). This approach is advantageous for at least three reasons compared to common end-to-end visual event localization pipelines. First, clips extracted with metadata contain more context information and can be used in different dimensions. With metadata, a clip may be used on its own (such as in a game highlight video) or may be used with other clips of the same team or player to generate team, player, and/or season highlight videos. The second reason is robustness, which results from the low event ambiguity of the text data. Third, by analyzing short clips around the events of interest rather than the entire video, significant resources (processing, processing time, memory, power consumption, etc.) are saved.
An embodiment of the third module in the system embodiment is multi-modal feature extraction. Video features are extracted from multiple perspectives.
The embodiment of the fourth module is precise temporal localization. Detailed discussions of how to design and implement embodiments of feature extraction and temporal localization are provided in Sections C.3 and C.4, respectively.
Finally, an embodiment of an integrated learning module is described in section c.5.
1. Game time anchoring embodiment
Game clocks in event videos are sometimes irregular. The main reason appears to be that at least some event video files collected from the Internet contain corrupted timestamps or frames. It is observed that, in the video collection, about 10% of the video files contain time corruption that shifts a portion of the video in time, sometimes by more than 10 seconds. Some of the severe corruptions observed include lost segments of more than 100 seconds. In addition to errors in the video files, some unexpected rare events may occur during the activity, and the game clock must be stopped for several minutes before it resumes. Whether the video content is corrupted or the game is interrupted, the temporal irregularity may be regarded as a forward or backward time jump. In order to accurately locate a clip of any event specified by the metadata, in one or more embodiments, time jumps are detected and calibrated accordingly. Thus, in one or more embodiments, an anchoring mechanism is designed and used.
Fig. 7 illustrates a method for event time and video runtime association according to an embodiment of the present disclosure. In one or more embodiments, OCR (optical character recognition) is performed on the video frames at 5-second intervals (although other intervals may be used) to read the game clock displayed in the video (705). A game start time in the video may be derived from the identified game clock (710). Whenever a time jump occurs, in one or more embodiments, a record of the game time after the time jump is maintained; this record is referred to as a time anchor (710). With time anchors, in one or more embodiments, any time in the game can be mapped to a time in the video (i.e., video runtime) (715), and any clip specified by the metadata can be extracted accurately. Fig. 8 illustrates an example of identifying timer digits in a game video according to an embodiment of the present disclosure.
As shown in fig. 8, timer numbers 805-820 may be identified and associated with video run times. Embodiments may collect multiple recognition results over time and may be self-correcting based on spatial stationarity and temporal continuity.
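A minimal sketch of such a time-anchoring mechanism is shown below. It assumes that OCR readings have already been collected as (video second, game-clock second) pairs at roughly 5-second intervals; the jump threshold and the linear interpolation between anchors are illustrative assumptions.

```python
def build_time_anchors(ocr_readings, jump_threshold=10):
    """ocr_readings: list of (video_sec, game_sec) pairs sampled every ~5 s.
    Returns anchors as (video_sec, game_sec) pairs; a new anchor is recorded
    whenever the game clock jumps relative to the video runtime."""
    anchors = [ocr_readings[0]]
    prev_video, prev_game = ocr_readings[0]
    for video_sec, game_sec in ocr_readings[1:]:
        expected_game = prev_game + (video_sec - prev_video)
        if abs(game_sec - expected_game) > jump_threshold:  # forward/backward jump
            anchors.append((video_sec, game_sec))
        prev_video, prev_game = video_sec, game_sec
    return anchors

def game_time_to_video_time(game_sec, anchors):
    """Map a game time (seconds) to video runtime using the latest anchor
    whose game time does not exceed the requested time."""
    video_anchor, game_anchor = anchors[0]
    for v, g in anchors:
        if g <= game_sec:
            video_anchor, game_anchor = v, g
    return video_anchor + (game_sec - game_anchor)
```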
2. Coarse interval extraction embodiment
Fig. 9 depicts a method for generating clips from input video in accordance with an embodiment of the present disclosure. In one or more embodiments, metadata from the crowd-sourced game comments and tags is parsed (905), the metadata including a timestamp, in minutes, for a goal event. In conjunction with the game start time detected by an embodiment of the OCR tool (discussed above), the original video may be edited to generate x-second (e.g., 70-second) candidate clips containing the events of interest. In one or more embodiments, the extraction rules may be described by the following equations:
t_clipStart = t_gameStart + 60 * t_goalMinute - tolerance (1)
t_clipEnd = t_clipStart + (base clip length + 2 * tolerance) (2)
In one or more embodiments, given a goal minute t_goalMinute and a game start time t_gameStart, the clip is extracted from the video starting at second t_clipStart. In one or more embodiments, the duration of the candidate clip may be set to 70 seconds (where the base clip length is 60 seconds and the tolerance is 5 seconds, although it should be noted that different values and different formulas may be used), because this covers the extreme cases when the event of interest happens very close to the boundaries of the goal minute, and it can also tolerate small deviations in the game start time detected by OCR. In the next section, an embodiment of a method for locating the goal second (the moment when the ball crosses the goal line) in a candidate clip is presented.
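Equations (1) and (2) translate directly into a short clip-extraction routine. The sketch below assumes the ffmpeg command-line tool is available and that the goal minute and game start time (in seconds of video runtime) come from the parsed metadata and the OCR-based time anchoring, respectively.

```python
import subprocess

def extract_candidate_clip(video_path, out_path, game_start_sec, goal_minute,
                           base_clip_len=60, tolerance=5):
    """Cut a candidate clip around the goal minute per equations (1)-(2)."""
    clip_start = game_start_sec + 60 * goal_minute - tolerance        # eq. (1)
    clip_len = base_clip_len + 2 * tolerance                          # eq. (2)
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(clip_start),   # seek to clip start (video runtime, seconds)
        "-i", video_path,
        "-t", str(clip_len),      # 70-second candidate clip by default
        "-c", "copy",             # stream copy; re-encode if frame-exact cuts are needed
        out_path,
    ], check=True)
    return clip_start, clip_start + clip_len
```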
3. Multi-modal feature extraction embodiment
In this section, three embodiments are disclosed for obtaining high-level feature representations from the candidate clips.
a) Feature extraction embodiment using a pre-trained model
Fig. 10 illustrates feature extraction according to an embodiment of the present disclosure. Given video data, in one or more embodiments, time frames are extracted (1005) and, if needed to match the input size, are resized in the spatial domain (1010), then fed to a deep neural network model to obtain a high-level feature representation. In one or more embodiments, a ResNet-152 model pre-trained on image datasets is used, but other networks may be used. In one or more embodiments, the time frames are extracted at the native frames per second (fps) of the original video and then downsampled to 2 fps, i.e., ResNet-152 representations are obtained for 2 frames per second of the original video. ResNet-152 is a very deep neural network; before its fully connected 1000-way classification layer, it outputs a 2048-dimensional feature representation per frame. In one or more embodiments, the output of the layer preceding the softmax layer may be used as the extracted high-level features. Note that ResNet-152 extracts high-level features from a single image; it does not inherently embed temporal context information. Fig. 11 illustrates a pipeline 1100 for extracting high-level features according to an embodiment of the present disclosure.
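A minimal sketch of this per-frame feature extraction using a torchvision ResNet-152 is shown below; the preprocessing values and batching are standard ImageNet conventions and are assumptions rather than values specified in this document. Frames are assumed to have already been sampled at 2 fps.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet-152 pre-trained on ImageNet; drop the final 1000-way fully connected
# layer so the output is the 2048-dimensional penultimate representation.
resnet = models.resnet152(pretrained=True)  # or weights=... on newer torchvision
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),          # resize in the spatial domain
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_frame_features(frames):
    """frames: list of HxWx3 uint8 arrays sampled at 2 fps from the clip.
    Returns an (N, 2048) tensor of per-frame high-level features."""
    batch = torch.stack([preprocess(f) for f in frames])
    feats = backbone(batch)                    # (N, 2048, 1, 1)
    return feats.flatten(1)
```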
b) SlowFast feature extractor embodiments
As part of the video feature extractor, in one or more embodiments, a SlowFast network architecture such as that proposed by Feichtenhofer et al. (Feichtenhofer, C., Fan, H., Malik, J., & He, K., "SlowFast Networks for Video Recognition," Proceedings of the IEEE International Conference on Computer Vision, pp. 6202-6211 (2019), the entire contents of which are incorporated herein by reference), or an audiovisual SlowFast network architecture (Xiao et al., "Audiovisual SlowFast Networks for Video Recognition," arxiv.org/abs/2001.08740v1 (2020), the entire contents of which are incorporated herein by reference), may be used; although it should be noted that other network architectures may be used. FIG. 12 graphically depicts a neural network model that may be used to extract features, according to an embodiment of the present disclosure.
Fig. 13 depicts feature extraction using a SlowFast neural network model according to an embodiment of the present disclosure. In one or more embodiments, a SlowFast network is initialized with pre-trained weights using a training dataset (1305). The network may be fine-tuned as a classifier (1310). The second column in Table 1 below shows the event classification results on the test dataset with the baseline network. In one or more embodiments, the feature extractor is used to classify 4-second clips into 4 categories: 1) far from the event of interest (e.g., a goal), 2) just before the event of interest, 3) the event of interest, and 4) just after the event of interest.
Several techniques may be implemented to find the best classifier, which is evaluated by the top-1 error percentage. First, a network as constructed in FIG. 12 is applied, which adds audio as an additional pathway to the SlowFast network (AVSlowFast). The visual portion of the network may be initialized with the same weights. It can be seen that direct joint training of visual and audio features can harm performance. This is a common problem found when training multi-modal networks. In one or more embodiments, techniques are applied to add different loss functions for the visual and audio modalities, respectively, and the entire network is trained with multiple task losses. In one or more embodiments, a linear combination of the cross-entropy losses on the audiovisual result and on each audio/visual branch may be used. The linear combination may be a weighted combination, where the weights may be learned or may be selected as hyperparameters. The best top-1 error result, shown in the bottom row of Table 1, is thereby obtained.
TABLE 1. Results of event classification
Algorithm | Top-1 error (%)
SlowFast | 33.27
Audio only | 60.01
AVSlowFast | 40.84
AVSlowFast multi-task | 31.82
In one or more embodiments of the goal positioning pipeline, the feature extractor portion of this network (AVSlowFast with multi-task loss) may be utilized. The aim is therefore to reduce the top-1 error, which corresponds to stronger features.
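The weighted combination of cross-entropy losses described above may be written as a small loss module, as sketched below; the weight values are illustrative hyperparameters, not values given in this document.

```python
import torch.nn as nn

class AVMultiTaskLoss(nn.Module):
    """Linear combination of cross-entropy losses on the fused audiovisual
    logits and on the visual-only and audio-only branch logits."""
    def __init__(self, w_av=1.0, w_visual=0.5, w_audio=0.5):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.w_av, self.w_visual, self.w_audio = w_av, w_visual, w_audio

    def forward(self, av_logits, visual_logits, audio_logits, labels):
        # labels: 4-way event classes (far from / just before / during / just after)
        return (self.w_av * self.ce(av_logits, labels)
                + self.w_visual * self.ce(visual_logits, labels)
                + self.w_audio * self.ce(audio_logits, labels))
```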
c) Average absolute value audio feature embodiment
By listening to the sound track of an event (e.g., a game without live commentary), one can typically determine when an event of interest occurs simply from the volume of the audience. Inspired by this observation, a simple method of directly extracting key information about an event of interest from audio was developed.
Fig. 14 depicts a method for audio feature extraction and event of interest temporal prediction in accordance with an embodiment of the present disclosure. In one or more embodiments, the absolute value of the audio waveform is taken and downsampled to 1 hertz (Hz) (1405). This feature representation may be referred to as an average absolute value feature because it represents the average sound amplitude per second. Fig. 15A and 15B show examples of an original audio waveform of one clip and its average absolute value feature, respectively, according to an embodiment of the present disclosure.
For each clip, the maximum value 1505 of the average absolute value audio feature 1500B may be located (1410). By locating the maximum value of the average absolute audio feature (e.g., maximum value 1505) and its corresponding time (e.g., time 1510) for the clip in the test dataset, 79% accuracy (with a 5 second tolerance) is achieved with respect to time location.
In one or more embodiments, the average absolute value audio feature (e.g., 1500B in FIG. 15B) may be treated as a likelihood prediction, over time, of the event of interest in the clip. As will be discussed below, the average absolute value audio feature may be one of the features input into an integrated model that predicts the final time at which the event of interest occurs within the clip.
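A minimal sketch of the average absolute value audio feature and the corresponding audio-only time prediction is shown below; it assumes a mono waveform and its sample rate have already been loaded for the clip.

```python
import numpy as np

def mean_abs_audio_feature(waveform, sample_rate):
    """Downsample |waveform| to 1 Hz by averaging the absolute amplitude
    over each one-second window (step 1405)."""
    waveform = np.abs(np.asarray(waveform, dtype=np.float32))
    n_seconds = len(waveform) // sample_rate
    windows = waveform[: n_seconds * sample_rate].reshape(n_seconds, sample_rate)
    return windows.mean(axis=1)               # one value per second of the clip

def audio_time_prediction(waveform, sample_rate):
    """Predict the event second as the location of the loudest second (step 1410)."""
    feature = mean_abs_audio_feature(waveform, sample_rate)
    return int(np.argmax(feature)), feature   # (predicted second, likelihood curve)
```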
4. Action localization embodiment
In one or more embodiments, to accurately locate the time of a goal in a video of a football match, what happens in the video is learned in conjunction with the temporal context information surrounding that time. For example, before a goal event occurs, a player will take a shot (or header) and the ball will move toward the goal. In some cases, the attacking and defending players gather in the penalty box, not far from the goal. After a goal event, the scoring player will typically run to the sideline, hug teammates, and celebrate with spectators and coaches. Intuitively, these patterns in the video can help the model learn what happens and locate the moment of the goal event.
FIG. 16 depicts a method for predicting the likelihood of an event of interest in a video clip in accordance with an embodiment of the present disclosure. In one or more embodiments, to construct the temporal localization model, a temporal convolutional neural network is used that takes the extracted visual features as input (1605). For each frame, it outputs a set of intermediate features that mix temporal information across frames. Then, in one or more embodiments, the intermediate features are input into a segmentation module (1610) that generates segmentation scores, which are evaluated by a segmentation loss function. The cross-entropy loss function may be used as the segmentation loss function:
L_seg = -Σ_i t_i · log(p_i)
where t_i is the ground-truth label and p_i is the softmax probability for the i-th class.
In one or more embodiments, the segmentation scores and the intermediate features are concatenated and fed to an action localization module (1615) that generates a localization prediction (e.g., a prediction of the likelihood of occurrence of the event of interest at each point in time within the range of the clip) (1620), which can be evaluated by a YOLO-like action localization loss function. The L2 loss function may be used for the action localization loss function:
L_loc = Σ_i (ŷ_i - y_i)^2, where ŷ_i is the predicted localization score and y_i is the corresponding target value.
FIG. 17 illustrates a pipeline for time positioning according to an embodiment of the present disclosure. In one or more embodiments, the time CNN may include a convolution layer, the segmentation module may include a convolution layer and a batch normalization layer, and the action localization model may include a pooling layer and a convolution layer.
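A minimal sketch of such a temporal localization network is shown below; the layer sizes, kernel sizes, and number of segmentation classes are illustrative assumptions consistent with the layer types mentioned above, not the exact architecture of the embodiments.

```python
import torch
import torch.nn as nn

class TemporalLocalizer(nn.Module):
    """Per-frame features -> temporal CNN -> segmentation scores ->
    action localization scores over the clip (illustrative layer sizes)."""
    def __init__(self, feat_dim=2048, hidden=128, num_classes=2):
        super().__init__()
        # Temporal CNN: 1-D convolution over time mixes information across frames.
        self.temporal_cnn = nn.Conv1d(feat_dim, hidden, kernel_size=9, padding=4)
        # Segmentation module: per-frame class scores.
        self.segmentation = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Conv1d(hidden, num_classes, kernel_size=1),
        )
        # Action localization module: likelihood of the event at each time step.
        self.localization = nn.Sequential(
            nn.AvgPool1d(kernel_size=3, stride=1, padding=1),
            nn.Conv1d(hidden + num_classes, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats):                      # feats: (batch, feat_dim, time)
        inter = torch.relu(self.temporal_cnn(feats))
        seg = self.segmentation(inter)             # (batch, num_classes, time)
        fused = torch.cat([inter, seg], dim=1)     # concatenate scores and features
        loc = self.localization(fused)             # (batch, 1, time)
        return seg, loc.squeeze(1)
```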
In one or more embodiments, the model embodiments are trained with the segmentation and action localization loss functions as described in Cioppa et al. (Cioppa, A., Deliège, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M., Gade, R., & Moeslund, T., "A Context-Aware Loss Function for Action Spotting in Soccer Videos," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13123-13133, the entire contents of which are incorporated herein by reference), taking temporal context information into account. In one or more embodiments, the segmentation module is trained using the segmentation loss, wherein each frame is associated with a score representing the likelihood that the frame belongs to an action class, and the action localization module is trained using the action localization loss, wherein the temporal location of the action class is predicted.
At least one major difference between the embodiments herein and the method of Cioppa et al is that the embodiments herein deal with short clips, whereas Cioppa et al takes the entire game video as input, thus requiring much longer time to process the video and extract features when implemented in real-time.
In one or more embodiments, the extracted feature input can be the features extracted from the ResNet model discussed above or from the AVSlowFast multi-task model discussed above. Alternatively, for the AVSlowFast multi-task model, the segmentation portion of the action localization pipeline may be removed. FIG. 18 depicts a method for predicting the likelihood of an event of interest in a video clip in accordance with an embodiment of the present disclosure. In one or more embodiments, the temporal convolutional neural network receives the features extracted from the AVSlowFast multi-task model as input (1805). For each frame, it outputs a set of intermediate features that mix temporal information across frames. Then, in one or more embodiments, the intermediate features are input to an action localization module (1810) that generates a localization prediction (e.g., a prediction of the likelihood of occurrence of the event of interest at each point in time within the range of the clip) (1815), which can be evaluated by an action localization loss function. FIG. 19 illustrates a pipeline for action localization prediction according to an embodiment of the present disclosure.
5. Integrated learning embodiment
In one or more embodiments, a single predicted time for the event of interest in the clip (e.g., the time of the selected maximum value) may be obtained from each of the three models described above. One of the predictions may be used, or the predictions may be combined (e.g., averaged). Alternatively, the information from each model may be combined using an integrated model to obtain a final prediction of the event of interest in the clip.
FIG. 20 depicts a method for predicting the likelihood of an event of interest in a video clip, according to an embodiment of the present disclosure, and FIG. 21 illustrates a pipeline for final temporal prediction, according to an embodiment of the present disclosure. In one or more embodiments, the final accuracy may be enhanced in an integrated manner that aggregates the outputs of the three models/features described in the subsections above. In one or more embodiments, the outputs of all three previous models may be combined with a position-encoding vector as the input to the integration module (2005). The combining can be done using concatenation, e.g., four d-dimensional vectors become a 4 x d matrix. For the ResNet and AVSlowFast multi-task models, the inputs may be the likelihood prediction outputs from their action localization models in Section C.4 above. For the audio, the input may be the average absolute value audio feature of the clip (e.g., FIG. 15B). In one or more embodiments, the position-encoding vector is a 1-D vector representing the time indices over the length of the clip.
In one or more embodiments, the core of the integration module is an 18-layer 1-D ResNet network with a regression head. In essence, the integration module learns a mapping from multi-dimensional input features comprising multiple modalities to the final temporal location of the event of interest in the clip. In one or more embodiments, a final time value prediction is output from the integrated model (2010) and may be compared to the ground-truth time to calculate a loss. The losses over the various clips can be used to update the parameters of the integrated model.
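A minimal sketch of the input assembly and a simplified stand-in for the integration module is shown below; the stacking of the three likelihood curves with a position-encoding vector follows the description above, while the small convolutional backbone is an illustrative substitute for the 18-layer 1-D ResNet.

```python
import torch
import torch.nn as nn

def build_ensemble_input(resnet_likelihood, avslowfast_likelihood, audio_feature):
    """All inputs: 1-D tensors of length d (one value per second of the clip).
    Stacks them with a position-encoding vector into a (4, d) matrix."""
    d = resnet_likelihood.shape[-1]
    position = torch.arange(d, dtype=torch.float32) / d      # 1-D index encoding
    return torch.stack([resnet_likelihood, avslowfast_likelihood,
                        audio_feature, position])             # (4, d)

class EnsembleRegressor(nn.Module):
    """Simplified stand-in for the 18-layer 1-D ResNet with a regression head:
    maps the (4, d) input to a single predicted event time within the clip."""
    def __init__(self, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(4, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.regressor = nn.Linear(hidden, 1)   # regression head: final time value

    def forward(self, x):                       # x: (batch, 4, d); use .unsqueeze(0) for one clip
        h = self.backbone(x).squeeze(-1)        # (batch, hidden)
        return self.regressor(h).squeeze(-1)    # predicted second within the clip
```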
6. Inference embodiment
Once trained, the entire highlight moment generation system as shown in FIG. 1 can be deployed. In one or more embodiments, the system may further include an input that allows a user to select one or more parameters for the generated clips. For example, the user may select a particular player, a range of games, one or more events of interest (e.g., goals and penalty kicks), and the number of clips that make up the highlight video (or the length of each clip and/or of the entire highlight compilation video). The highlight moment generation system may then access the videos and metadata and generate a highlight moment compilation video by concatenating clips. For example, the user may want 10 seconds per event-of-interest clip. Thus, in one or more embodiments, the customized highlight moment video generation module may select 8 seconds before and 2 seconds after the event of interest based on the final predicted time for the clip. Alternatively, as shown in FIG. 1, key events in a player's career may be the events of interest, and they may be automatically identified and compiled into a "story" of the player's career. Audio and other multimedia features that can be selected by the user can be added to the video by the customized highlight moment video generation module. Those skilled in the art will recognize other applications of the highlight moment generation system.
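A minimal sketch of this clip trimming and concatenation step is shown below, again assuming the ffmpeg tool is available; the 8-seconds-before/2-seconds-after split follows the example above.

```python
import subprocess

def trim_and_concat(clips, out_path, before=8, after=2):
    """clips: list of (clip_path, predicted_event_sec) pairs.  Trims each clip
    around its predicted event time and concatenates the pieces into one video."""
    parts = []
    for i, (clip_path, event_sec) in enumerate(clips):
        part = f"part_{i}.mp4"
        start = max(0, event_sec - before)
        subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-i", clip_path,
                        "-t", str(before + after), "-c", "copy", part], check=True)
        parts.append(part)
    with open("concat_list.txt", "w") as f:
        f.writelines(f"file '{p}'\n" for p in parts)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "concat_list.txt", "-c", "copy", out_path], check=True)
```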
D. Experimental results
It should be noted that these experiments and results are provided by way of illustration and were performed under specific conditions using one or more specific examples; therefore, neither these experiments nor their results should be used to limit the scope of disclosure of this patent document.
1. Goal positioning
For fair comparison with existing work, test model embodiments are trained with candidate clips containing goals extracted from games in a training set of datasets and validated/tested with candidate clips containing goals extracted from games in a validation/test set of datasets.
Fig. 22 shows the main results: with respect to locating a goal in a 70-second clip, the tested embodiment 2205 is significantly better than the prior art method 2210, the Context-Aware method for locating a goal in a soccer video.
Intermediate prediction results obtained by using the three different features described in Sections C.3 and C.4 are also shown, and the final results are predicted by the integrated learning module described in Section C.5. The goal positioning results for 3 clips are stacked in FIG. 23. As shown in FIG. 23, the final predicted output of the integrated learning module embodiment is the best in terms of its proximity to the ground-truth markers (shown by the dashed ellipses).
2. Some remarks
As shown in FIG. 22, embodiments can achieve an accuracy of approximately 1 (0.984) with a tolerance of 5 seconds. This result is phenomenal in that it can be used to correct erroneous labels from the text and to synchronize with customized audio commentary. It also helps to accurately generate highlight moments and thus gives the user/editor the option to customize their video around the exact goal moment. The pipeline embodiments may naturally be extended to capture the times of other events (e.g., corner kicks, free kicks, and penalty kicks).
Again, the use of football games as the overall content and the use of goals as events in that content is merely exemplary, and those skilled in the art will recognize that aspects herein may be applied to other content areas (including other content areas outside of the game area) as well as other events.
E. Alternate embodiments for generating highlight moment video from text and video input
As previously mentioned, one application of the above-disclosed systems and methods is the ability to generate highlight moment videos. For example, it would be highly beneficial to be able to automatically generate event highlight videos, such as sports highlight videos with commentary. As described above, the demand for video content is increasing, and it takes longer to generate video content than to generate article-based content. Previously, generating such videos was a manual process in which video editing took a significant amount of time. Embodiments herein make it easier to generate highlight moment videos, where input text, e.g., a match summary article, is used in generating the corresponding video content. In one or more embodiments, artificial intelligence/machine learning is used to find, match, or generate video clips corresponding to the text that relates to portions of the video; thus, once a text article is written, a video may be generated, wherein the process of generating such a highlight moment video is simplified to writing the article and editing the original video using an embodiment of the automated system.
FIGS. 24A and 24B depict a system for generating highlight moment video of one or more events from video and text in accordance with an embodiment of the present disclosure. As shown, system 2400 can receive as input an article or text segment 2408 describing the game and one or more videos 2402 of the entire game. Note that, for purposes of illustration, the activity is a sports game, but it should be noted that other activities (e.g., lectures, gatherings, news broadcasts, concerts, etc.) may also be used.
Returning to FIG. 24A, one task of the disclosed system is to identify one or more correct portions of the video 2402 that correspond to elements indicated in the input text 2408. FIG. 25 depicts a method for extracting information from input text in accordance with an embodiment of the present disclosure. As shown, input text (e.g., text 2408 in FIG. 24A) including text related to one or more highlight moment events in an activity is received (2505). For example, the input text may be an article summarizing the game, and the article will be the basis for the final summary video output by the system. To aid in video segment selection, the input text is parsed to identify events of interest and related data, if any (2510). In one or more embodiments, the parsing may be based on rules (e.g., pattern matching, keyword matching, etc.), may employ machine learning models (e.g., neural network models trained for extraction), or both, to extract and classify key data, such as: the number of events, the minutes/times at which the events occurred, the type of event/action, the player, and other identifying information (who, what, where, when, etc.). Examples of some template matching are provided below:
"Corner for Arsenal, conceded by Adam Weber": the keyword "corner" is present, so the sentence is classified as a corner-kick action, and the extracted player is Adam Weber.
"Paul Grom (Brighton) wins a free kick in the defensive half": this is classified as a free-kick action, and the player is Paul Grom.
"Goal! Arsenal 1, Brighton 0. Nicolas Pem (Arsenal) right-footed shot from the center of the penalty area to the bottom right corner, assisted by Clark Kamers with a cross": this includes the keywords "goal" and "shot," so it is categorized as both a goal and a shot event. The player is Nicolas Pem.
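By way of a non-limiting illustration only, such rule-based parsing might be sketched as follows in Python, where the keyword list, the action labels, and the player-name pattern are hypothetical placeholders rather than the actual parsing module, which may also rely on trained neural network models:

import re

# Hypothetical keyword-to-action mapping; the real module may use richer
# patterns or a trained classifier.
ACTION_KEYWORDS = {
    "corner": "corner_kick",
    "free kick": "free_kick",
    "penalty": "penalty_kick",
    "goal": "goal",
    "shot": "shot",
}

# Very rough heuristic for player names: two or more capitalized words in a row.
PLAYER_PATTERN = re.compile(r"\b([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)\b")

def classify_sentence(sentence):
    """Return the action labels and candidate player names found in a sentence."""
    lowered = sentence.lower()
    actions = [label for keyword, label in ACTION_KEYWORDS.items() if keyword in lowered]
    players = PLAYER_PATTERN.findall(sentence)
    return {"sentence": sentence, "actions": actions, "players": players}

print(classify_sentence("Corner for Arsenal, conceded by Adam Weber."))
# -> actions: ['corner_kick'], players: ['Adam Weber']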
In one or more embodiments, the extracted information including the critical events and related data (if any) is provided to a dataset (e.g., dataset 2424 in fig. 24A). This dataset is used by a video model (e.g., video deep learning model 2428 in fig. 24A) to generate a video clip for a video summary.
As shown in fig. 24A, this data may be combined with text data extracted from other sources. For example, text data 2404 may be collected from online postings, social media (e.g., Facebook, Twitter, etc.), high-frequency or real-time live text, news feeds, forums, user groups, etc. In one or more embodiments, the text data may be parsed and classified using the same or a similar rule-based model or neural network model, or it may be parsed and classified using a different rule-based model or neural network model that is more closely tailored to that text data. Typically, web content or user-generated content comprising high-frequency (live) text data describing the game action includes time information, which may also be extracted and used to help identify the correct time in the video. The additional text data 2404 may be parsed and categorized into a set of sentence groups that contain minutes, actions, and possibly other data (e.g., player data, time, etc.). Each sentence or group of sentences describes an event during the game. Thus, the text parsing/action classification models 2412 and 2416 may be the same or different models. Note that the information source 2404 provides additional information to help extract the correct event from the input summary text 2408.
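As a hedged sketch of how such live text might be grouped by minute, assuming (hypothetically) that each live line begins with a minute'second" stamp followed by a description, the parsing could look like the following; the actual feed format and parser may differ:

import re

# Assumes live-text lines begin with a time stamp such as 4'37" followed by
# the event description; the actual feed format may differ.
LIVE_LINE = re.compile(r"""^\s*(\d+)'(\d+)"?\s+(.*)$""")

def group_live_text_by_minute(lines):
    """Parse live commentary lines and group them by the minute they mention."""
    groups = {}
    for line in lines:
        match = LIVE_LINE.match(line)
        if not match:
            continue
        minute, second, description = int(match.group(1)), int(match.group(2)), match.group(3)
        groups.setdefault(minute, []).append({"second": second, "description": description})
    return groups

feed = ["4'05\" Adam Weber pass", "4'37\" Corner for Arsenal, conceded by Adam Weber"]
print(group_live_text_by_minute(feed))
# -> {4: [{'second': 5, 'description': 'Adam Weber pass'},
#         {'second': 37, 'description': 'Corner for Arsenal, conceded by Adam Weber'}]}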
Note that additional data inputs may also be used to aid in parsing and/or video summary generation. As an example, consider that system 2400 can use a player data database 2406 (where the system can web-crawl player information or use user-generated player information content). This information may also be parsed using a parser 2414, which may use the same or a similar rule-based model or neural network model, or may use a different rule-based model or neural network model that is more closely tailored to the input text data. Consider, for example, the method shown in fig. 26.
In one or more embodiments, the parser module 2414 may normalize table information and store entity information (e.g., team, player, actor, band, organization, etc.). For example, a roster for a team (which may be obtained from the team's website) may list players by number and name (2605). The parser may extract each row and obtain the player name and jersey number (2610) to build a database of the numbers and names of the players on the team. The output is a player dataset 2418 that can be used to help supplement the parsed data (i.e., the text data with seconds classified by action and possible player data 2420, and the data collection (e.g., minutes, actions, possibly other related data) 2422). It should be noted that similar processing may also be used for other entities (e.g., performers, actors, moderators, etc.).
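A minimal sketch of this roster normalization, using player names from the examples above and hypothetical jersey numbers, might be:

# Hypothetical roster rows scraped from a team page: (jersey number, player name).
roster_rows = [
    ("4", "Adam Weber"),
    ("9", "Nicolas Pem"),
    ("21", "Clark Kamers"),
]

def build_player_dataset(rows):
    """Normalize roster rows into a lookup of player name -> jersey number."""
    players = {}
    for number, name in rows:
        players[name.strip()] = int(number)
    return players

player_dataset = build_player_dataset(roster_rows)
# e.g., player_dataset["Nicolas Pem"] == 9

Such a dataset can then be used to confirm or disambiguate player names found in the parsed text.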
Returning to fig. 24A, given an input video 2402, the time detection module 2410 associates the run time of the video with the game time. In one or more embodiments, time detection may be performed as discussed above with respect to time anchoring, although other methods may be used.
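Time anchoring is described in detail in the preceding sections; as a rough sketch only, and assuming OCR has already produced a few (video run time, game clock) anchor pairs in seconds, a simple offset-based mapping from game-clock time to video run time could be built as follows (the numeric anchors below are hypothetical):

def build_time_map(anchors):
    """Given (video_seconds, game_clock_seconds) anchor pairs from OCR,
    return a function mapping a game-clock time to a video run time.
    Assumes the clock advances at the same rate as the video between anchors."""
    anchors = sorted(anchors, key=lambda a: a[1])

    def clock_to_video(clock_seconds):
        # Use the latest anchor at or before the requested clock time.
        video_t, clock_t = anchors[0]
        for v, c in anchors:
            if c <= clock_seconds:
                video_t, clock_t = v, c
        return video_t + (clock_seconds - clock_t)

    return clock_to_video

# Hypothetical anchors: kickoff appears 120 s into the video (clock 0'00"),
# and a later OCR reading at 1020 s of video shows 15'00" on the clock.
to_video = build_time_map([(120, 0), (1020, 900)])
print(to_video(4 * 60 + 37))  # game clock 4'37" -> 397 seconds of video run time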
As shown in the embodiment of fig. 24A, system 2400 can include a minute-and-action matching module 2426 that can also help identify time information for an event. For example, for each sentence group describing an event, the module 2426 may match the sentence against the high-frequency (live) text data to obtain the second at which the action occurred. In one or more embodiments, the module may also correlate the attached related data, such as the player. Information from the other datasets (i.e., dataset 2420 and dataset 2422) may be provided to module 2426, which module 2426 uses for matching.
For example, suppose the match summary states that, in minute 4, Arsenal won a corner kick, and the live stream data includes the following:
3'30 "by Bukayo foul
4'5"adam Weber pass
4 '12' Nicolas Pem goal scoring
4'25"Pascal was fouled by Adam
4'37 "corner ball of the Absonna team, lost ball by Adam Weber
5'10 "replacement of EmilBowe by Arsenal Sam Bukayo
In this case, the match summary data of interest occurs at minute 4, and the action is a corner kick. In one or more embodiments, the matching model embodiment filters the live data and retains all the minute-4 entries, namely:
4'5"adam Weber pass
4 '12' Nicolas Pem goal scoring
4'25"Pascal was fouled by Adam
4'37 "corner ball of the Absonna team, lost ball by Adam Weber
The text parsing module described above identifies the actions in these entries: pass, goal, foul, and corner kick. Thus, if the corner kick is the event of interest, module 2426 matches Arsenal's corner kick to the 4'37" entry, conceded by Adam Weber. In one or more embodiments, this information may also be provided to action module 2430, where the temporal information is used to generate the corresponding video clip.
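A minimal sketch of this minute filtering and action matching, assuming the live feed has already been parsed into (minute, second, description) entries as illustrated earlier, might be:

# Parsed live entries; values taken from the example above.
live_entries = [
    {"minute": 3, "second": 30, "description": "Foul by Bukayo"},
    {"minute": 4, "second": 5,  "description": "Adam Weber pass"},
    {"minute": 4, "second": 12, "description": "Nicolas Pem goal"},
    {"minute": 4, "second": 25, "description": "Pascal fouled by Adam"},
    {"minute": 4, "second": 37, "description": "Corner for Arsenal, conceded by Adam Weber"},
    {"minute": 5, "second": 10, "description": "Arsenal substitution: Emil Bowe replaces Sam Bukayo"},
]

def match_minute_and_action(minute, action_keyword, entries):
    """Keep only entries from the given minute, then return the first one
    whose description mentions the action keyword (e.g., 'corner')."""
    candidates = [e for e in entries if e["minute"] == minute]
    for entry in candidates:
        if action_keyword.lower() in entry["description"].lower():
            return entry
    return None

print(match_minute_and_action(4, "corner", live_entries))
# -> the 4'37" entry: Corner for Arsenal, conceded by Adam Weber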
As illustrated in the example of fig. 24A, a data set 2424 is assembled that includes the approximate minute at which one or more events occurred and the classification of those events. In the illustrated embodiment, the data set 2424 may be a compilation of information from the time detection module 2410, the text data information 2420, and the data collection information 2422. This information may then be used by the video deep learning model 2428 to determine more accurate times in the video in order to generate corresponding video clips containing the events of interest. Model 2428 may be one or more of the models discussed above or an integration of one or more of the models discussed above.
For example, in one or more embodiments, for each sentence group describing an event, the sentence may contain the minute at which the action occurred. The system may extract one minute (or more) of video from the input video and, in one or more embodiments, use one or more deep learning video understanding models to identify the second at which the event/action occurred. As described above, the model may be one or more of the models discussed above or an integration of all or a subset of those models.
In one or more embodiments, the output of module 2428 is a collection 2430 of identified actions/events and the times (in seconds) at which those events/actions occur in the video. This information may be combined with information from the matching module 2426 (with redundant events and times removed). The final set of time and event information 2430 may be used to generate a video clip 2434 (fig. 24B) that includes the action/event of interest. The clip may span a set amount of time (e.g., from x seconds before the event to y seconds after the event, where the span may vary depending on the type of action or the length of the event detected by the video deep learning model 2428) and/or may have a length corresponding to the length of the audio produced by the text-to-speech module for the text related to the event in the video clip.
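One way to choose the clip boundaries, sketched here under the assumption that the event time is available in video seconds and the TTS narration length is known (the padding defaults are hypothetical), is:

def clip_bounds(event_video_time, audio_seconds, pre_pad=5.0, post_pad=3.0):
    """Choose start/end times (in video seconds) for a highlight clip.
    The clip is made at least long enough to hold the TTS narration;
    pre_pad and post_pad are illustrative paddings around the event."""
    start = max(0.0, event_video_time - pre_pad)
    end = event_video_time + post_pad
    if end - start < audio_seconds:
        end = start + audio_seconds  # stretch so the narration fits
    return start, end

# Event detected at 397 s into the video, narration lasting 12 s.
print(clip_bounds(397.0, 12.0))  # -> (392.0, 404.0)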
In one or more embodiments, the input text 2408, or a set of one or more sentences describing the event/action, may be input into a text-to-speech (TTS) system 2432, which converts the input text into audio. In one or more embodiments, the audio may be a compilation of audio segments generated by converting the sentences into audio with TTS.
Those skilled in the art will recognize that any of a number of TTS systems may be used. For example, several works address the problem of synthesizing speech from a given input text using neural networks, including but not limited to:
Deep Voice 1 (disclosed in commonly assigned U.S. patent application 15/882,926 (docket No. 28888-2105), entitled "SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH," filed in 2018, and U.S. provisional patent application 62/463,482 (docket No. 28888-2105), entitled "SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH," filed in 2017, each of which is incorporated herein by reference in its entirety; for convenience, these disclosures may be referred to as "Deep Voice 1" or "DV1");
Deep Voice 2 (disclosed in commonly assigned U.S. patent application 15/974,397 (docket No. 28888-2144), entitled "SYSTEMS AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH," filed on May 8, 2018, and U.S. provisional patent application 62/508,579 (docket No. 28888-2144P), entitled "SYSTEMS AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH," filed in 2017, each of which is incorporated herein by reference in its entirety; for convenience, these disclosures may be referred to as "Deep Voice 2" or "DV2");
Deep Voice 3 (disclosed in commonly assigned U.S. patent application 16/058,265 (docket No. 28888-2175), entitled "SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING," filed in 2018, and U.S. provisional patent application 62/574,382 (docket No. 28888-2175P), entitled "SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING," filed on October 19, 2017, listing as inventors Sercan Ö. Arık, Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, and John Miller, each of the foregoing patent documents being incorporated herein by reference in its entirety; for convenience, these disclosures may be referred to as "Deep Voice 3" or "DV3");
the embodiments disclosed in commonly owned U.S. patent 10,872,596 (docket No. 28888-2269), issued on December 22, 2020, which patent document is incorporated herein by reference in its entirety; and
the embodiments disclosed in commonly owned U.S. patent 11,017,761 (docket No. 28888-2326), issued in 2021, which is incorporated herein by reference in its entirety.
Returning to fig. 24B, given a video clip 2434 and corresponding audio 2436, the video and audio clips can be combined into a game highlight moment video 2440. Fig. 27 depicts a method for generating a combined video in accordance with an embodiment of the present disclosure.
As described above, for each event, audio may be generated using TTS and a set of one or more sentences for the event (2705), where the TTS converts the set of one or more sentences into audio of a particular length. Having determined the exact or approximate time at which an event occurs in the video, a video clip that includes the event may be extracted from the complete video (2710), and its length may be selected so that it is long enough for the correspondingly generated audio. For example, if the audio generated by TTS for a sentence or sentence group describing an event (e.g., a corner kick) requires a certain amount of time, the corresponding video may be edited to match the length of that audio (e.g., the video may include a few seconds before the audio and a few seconds after it). Finally, the video clips and the corresponding audio may be combined into a multimedia video for the event. In one or more embodiments, tools (such as FFmpeg tools) may be used to combine video clips and audio.
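As an illustrative sketch only, such a combination might be performed with FFmpeg invoked from Python, where the temporary file name and codec choices are assumptions rather than requirements of the system:

import subprocess

def make_highlight_clip(source_video, start, end, narration_audio, output_path):
    """Cut [start, end] from the source video and lay the TTS narration over it.
    Uses two FFmpeg invocations: one to extract the clip, one to replace the
    clip's audio track with the generated narration."""
    clip_path = "clip_tmp.mp4"  # hypothetical temporary file name
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-i", source_video,
         "-t", str(end - start), "-c", "copy", clip_path],
        check=True,
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", clip_path, "-i", narration_audio,
         "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-c:a", "aac",
         "-shortest", output_path],
        check=True,
    )

# make_highlight_clip("match.mp4", 392.0, 404.0, "corner_tts.wav", "corner_highlight.mp4")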
It should be noted that the input text 2408 may mention a plurality of events/actions, and that a plurality of video clips with corresponding audio may be associated and combined. For example, in one or more embodiments, video clips are concatenated into a single final video, and audio is synchronized with the corresponding event. In one or more embodiments, tools (such as FFmpeg tools) may be used to combine video clips and audio.
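Where multiple clips are produced, one possible way to join them is FFmpeg's concat demuxer, again sketched here under the assumption that all clips share the same codec and resolution; the file names are illustrative:

import subprocess

def concatenate_clips(clip_paths, output_path):
    """Concatenate per-event highlight clips into one summary video using
    FFmpeg's concat demuxer."""
    list_file = "clips.txt"  # hypothetical list-file name
    with open(list_file, "w") as f:
        for path in clip_paths:
            f.write(f"file '{path}'\n")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_file, "-c", "copy", output_path],
        check=True,
    )

# concatenate_clips(["corner_highlight.mp4", "goal_highlight.mp4"], "match_summary.mp4")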
F. Computing system embodiments
In one or more embodiments, aspects of the present patent document may be directed to, may include, or may be implemented on one or more information handling systems (or computing systems). An information handling system/computing system may include any instrumentality or aggregate of instrumentalities operable to compute, evaluate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data. For example, a computing system may be or include a personal computer (e.g., a laptop computer), a tablet computer, a mobile device (e.g., a Personal Digital Assistant (PDA), a smart phone, a tablet, etc.), a smart card, a server (e.g., a blade server or a rack server), a network storage device, a camera, or any other suitable device, and may vary in size, shape, performance, functionality, and price. The computing system may include Random Access Memory (RAM), one or more processing resources such as a Central Processing Unit (CPU) or hardware or software control logic, Read Only Memory (ROM), and/or other types of memory. Additional components of the computing system may include one or more drives (e.g., hard disk drives, solid state drives, or both), one or more network ports for communicating with external devices, and various input and output (I/O) devices, such as a keyboard, a mouse, a stylus, a touch screen, and/or a video display. The computing system may also include one or more buses operable to transmit communications between the various hardware components.
Fig. 28 depicts a simplified block diagram of an information handling system (or computing system) according to an embodiment of the present disclosure. It should be appreciated that the functionality illustrated by system 2800 may be used to support various embodiments of the computing system, although it should be understood that the computing system may be configured differently and include different components, including having fewer or more components as illustrated in fig. 28.
As shown in fig. 28, computing system 2800 includes one or more Central Processing Units (CPUs) 2801, which provide computing resources and control computers. CPU 2801 may be implemented with a microprocessor or the like and computing system 2800 may also include one or more Graphics Processing Units (GPUs) 2802 and/or floating point coprocessors for mathematical computations. In one or more embodiments, one or more GPUs 2802, such as a portion of a graphics card, may be incorporated within a display controller 2809. The system 2800 may also include a system memory 2819, which may include RAM, ROM, or both.
As shown in fig. 28, a plurality of controllers and peripheral devices may also be provided. Input controller 2803 represents an interface to various input devices 2804 such as a keyboard, mouse, touch screen, and/or stylus. The computing system 2800 may also include a storage controller 2807 for interfacing with one or more storage devices 2808, each of the one or more storage devices 2808 including a storage medium or optical medium such as a tape or disk, which may record programs of instructions for operating the system, utilities and applications, which may include embodiments of programs that implement various aspects of the present disclosure. The storage device 2808 may also be used to store processed data or data to be processed in accordance with the present disclosure. The system 2800 may also include a display controller 2809 for providing an interface to a display device 2811, where the display device 2811 may be a Cathode Ray Tube (CRT) display, a Thin Film Transistor (TFT) display, an organic light emitting diode, an electroluminescent panel, a plasma panel, or any other type of display. Computing system 2800 may also include one or more peripheral controller or interface 2805 for one or more peripheral devices 2806. Examples of peripheral devices may include one or more printers, scanners, input devices, output devices, sensors, and so forth. Communication controller 2814 may be connected to one or more communication devices 2815 that enable system 2800 to connect to remote devices through any of a variety of networks including the internet, cloud resources (e.g., ethernet cloud, fibre channel over ethernet (FCoE)/Data Center Bridge (DCB) cloud, etc.), local Area Network (LAN), wide Area Network (WAN), storage Area Network (SAN), or through any suitable electromagnetic carrier signal including infrared signals. As shown in the depicted embodiment, computing system 2800 includes one or more fans or fan trays 2818 and one or more cooling subsystem controllers 2817, the cooling subsystem controllers 2817 monitor the thermal temperature of the system 2800 (or components thereof) and operate the fans/fan trays 2818 to help regulate the temperature.
In the system shown, all major system components may be connected to bus 2816, and bus 2816 may represent more than one physical bus. However, the various system components may or may not be physically proximate to each other. For example, the input data and/or the output data may be remotely transmitted from one physical location to another. Further, programs embodying aspects of the present disclosure may be accessed from a remote location (e.g., server) over a network. Such data and/or programs may be transmitted by any of a variety of machine-readable media, including for example: magnetic media (such as hard disks, floppy disks, and magnetic tape); optical media such as Compact Discs (CDs) and holographic devices; a magneto-optical medium; and hard-wired devices (such as Application Specific Integrated Circuits (ASICs), programmable Logic Devices (PLDs), flash memory devices, other non-volatile memory (NVM) devices (such as 3D XPoint-based devices), and ROM and RAM devices) that are specially configured to store or execute program code.
Aspects of the present disclosure may be encoded on one or more non-transitory computer-readable media having instructions for one or more processors or processing units to perform the steps. It should be noted that one or more non-transitory computer-readable media should include volatile and/or nonvolatile memory. It should be noted that alternative embodiments are possible, including hardware embodiments or software/hardware embodiments. The hardware-implemented functions may be implemented using an ASIC, a programmable array, digital signal processing circuitry, or the like. Accordingly, the term "apparatus" in any claim is intended to cover both software and hardware implementations. Similarly, the term "computer-readable medium" as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. In view of these implementation alternatives, it should be appreciated that the figures and accompanying description provide the functional information required by one skilled in the art in writing program code (i.e., software) and/or fabricating circuitry (i.e., hardware) to perform the desired processing.
It should be noted that embodiments of the present disclosure may also relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer readable media include, for example: magnetic media (such as hard disks, floppy disks, and magnetic tape); optical media (e.g., compact Discs (CDs) and holographic devices); a magneto-optical medium; and hardware devices that are specially configured to store or execute program code, such as Application Specific Integrated Circuits (ASICs), programmable Logic Devices (PLDs), flash memory devices, other non-volatile memory (NVM) devices, such as 3D XPoint-based devices, and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the present disclosure may be implemented in whole or in part as machine-executable instructions, which may be located in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In a distributed computing environment, program modules may be physically located in a local, remote, or both arrangement.
Those skilled in the art will recognize that the computing system or programming language is not critical to the practice of the present disclosure. Those skilled in the art will also recognize that the various elements described above may be physically and/or functionally separated into modules and/or sub-modules or combined together.
Those skilled in the art will appreciate that the foregoing examples and embodiments are exemplary and are not intended to limit the scope of the disclosure. All permutations, enhancements, equivalents, combinations and modifications as will become apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the spirit and scope of the present disclosure. It should also be noted that the elements of any claim may be arranged differently, including having a variety of dependencies, configurations, and combinations.

Claims (20)

1. A video generation method, comprising:
given the input text referring to an event in an activity,
parsing the input text using a text parsing module to identify the event referenced in the input text; and
converting the input text to TTS generated audio using a text-to-speech TTS module;
given at least a portion of the activity and an input video of the identified event:
Performing time anchoring to associate a runtime of the input video with a runtime of the activity;
identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, thereby generating an initial clip from the input video including the event;
extracting features from the initial video clip;
obtaining a final time value of the event in the initial video clip using the extracted features and the trained neural network model;
generating a final video clip by editing the initial video clip to have a run-time consistent with the run-time of the TTS generated audio in response to the run-time of the initial video clip not consistent with the run-time of the TTS generated audio; and
responsive to the run time of the initial video clip coinciding with the run time of the TTS-generated audio, using the initial video clip as the final video clip; and
combining the TTS generated audio with the final video clip to generate an event highlight moment video.
2. The video generation method of claim 1, wherein the step of performing time anchoring to associate a runtime of the input video with a runtime of the activity comprises:
using optical character recognition on a set of video frames of the input video to read a time of a clock displayed in the input video;
generating a set of time anchors, including a start time of the activity and any time offset, given the identified time of the clock; and
a time map is generated using at least some of the set of time anchors, the time map mapping the time of the clock with the run time of the input video.
3. The video generation method of claim 2, wherein the step of identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, thereby generating an initial clip from the input video including the event comprises:
parsing data from the metadata to obtain an approximate time of the event; and
The approximate time and the time map of the event in the input video are used to generate the initial video clip that includes the event.
4. The video generation method of claim 1, wherein the steps of extracting features from the initial video clip and obtaining a final time value for the event in the initial video clip using the extracted features and trained neural network model comprise:
feature extracting the initial video clip using a set of two or more models; and
the final time value for the event in the initial video clip is obtained using an integrated neural network model that receives input related to the feature from the set of two or more models and outputs the final time value.
5. The video generation method of claim 4, wherein the set of two or more models comprises:
a neural network model extracting video features from the initial video clip;
a multi-mode feature neural network model that generates multi-mode based features using video and audio information in the initial video clip; and
an audio feature extractor generates features of the initial video clip based on audio levels in the initial video clip.
6. The video generation method of claim 1, further comprising:
generating a dataset comprising one or more entities related to the activity; and
text parsed from the input text is filtered using the dataset to help identify which text parsed from the input text includes information about the event.
7. The video generation method of claim 1, further comprising:
given supplementary text about the activity or event:
parsing the supplemental text using a text parsing module to identify information related to the event mentioned in the supplemental text; and
the information extracted from the supplemental text is used to help identify the time of occurrence of the event.
8. The video generation method of claim 1, wherein the parsing is neural network-based, rule-based, or both.
9. A video generation system, comprising:
one or more processors; and
a non-transitory computer-readable medium comprising one or more sets of instructions that, when executed by at least one of the one or more processors, cause performance of steps comprising:
Given input text referring to an event in an activity:
parsing the input text using a text parsing module to identify the event referenced in the input text; and
converting the input text to TTS generated audio using a text-to-speech TTS module;
given at least a portion of the activity and an input video of the identified event:
performing time anchoring to associate a runtime of the input video with a runtime of the activity;
identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, thereby generating an initial clip from the input video including the event;
extracting features from the initial video clip;
obtaining a final time value of the event in the initial video clip using the extracted features and the trained neural network model;
generating a final video clip by editing the initial video clip to have a run-time consistent with the run-time of the TTS generated audio in response to the run-time of the initial video clip not consistent with the run-time of the TTS generated audio; and
Responsive to the run time of the initial video clip coinciding with the run time of the TTS-generated audio, using the initial video clip as the final video clip; and
combining the TTS generated audio with the final video clip to generate an event highlight moment video.
10. The video generation system of claim 9, wherein the step of performing time anchoring to associate a runtime of the input video with a runtime of the activity comprises:
using optical character recognition on a set of video frames of the input video to read a time of a clock displayed in the input video;
generating a set of time anchors, including a start time of the activity and any time offset, given the identified time of the clock; and
a time map is generated using at least some of the set of time anchors, the time map mapping the time of the clock with the run time of the input video.
11. The video generation system of claim 10, wherein the step of identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, further generating an initial clip from the input video including the event comprises:
Parsing data from the metadata to obtain an approximate time of the event; and
the approximate time and the time map of the event in the input video are used to generate the initial video clip that includes the event.
12. The video generation system of claim 9, wherein the steps of extracting features from the initial video clip and obtaining a final time value for the event in the initial video clip using the extracted features and trained neural network model comprise:
feature extracting the initial video clip using a set of two or more models; and
the final time value for the event in the initial video clip is obtained using an integrated neural network model that receives input related to the feature from the set of two or more models and outputs the final time value.
13. The video generation system of claim 12, wherein the set of two or more models comprises:
a neural network model extracting video features from the initial video clip;
a multi-mode feature neural network model that generates multi-mode based features using video and audio information in the initial video clip; and
An audio feature extractor generates features of the initial video clip based on audio levels in the initial video clip.
14. The video generation system of claim 9, wherein the one or more non-transitory computer-readable media further comprise one or more sets of instructions that, when executed by at least one of the one or more processors, cause the following steps to be performed, the steps comprising:
generating a dataset comprising one or more entities related to the activity; and
text parsed from the input text is filtered using the dataset to help identify which text parsed from the input text includes information about the event.
15. The video generation system of claim 9, wherein the one or more non-transitory computer-readable media further comprise one or more sets of instructions that, when executed by at least one of the one or more processors, cause the following steps to be performed, the steps comprising:
given supplementary text about the activity or event:
parsing the supplemental text using a text parsing module to identify information related to the event mentioned in the supplemental text; and
The information extracted from the supplemental text is used to help identify the time of occurrence of the event.
16. A non-transitory computer-readable medium comprising one or more sequences of instructions which, when executed by at least one processor, cause the following steps to be performed, the steps comprising:
given the input text referring to an event in an activity,
parsing the input text using a text parsing module to identify the event referenced in the input text; and
converting the input text to TTS generated audio using a text-to-speech TTS module;
given at least a portion of the activity and an input video of the identified event:
performing time anchoring to associate a runtime of the input video with a runtime of the activity;
identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, thereby generating an initial clip from the input video including the event;
extracting features from the initial video clip;
Obtaining a final time value of the event in the initial video clip using the extracted features and the trained neural network model;
generating a final video clip by editing the initial video clip to have a run-time consistent with the run-time of the TTS generated audio in response to the run-time of the initial video clip not consistent with the run-time of the TTS generated audio; and
responsive to the run time of the initial video clip coinciding with the run time of the TTS-generated audio, using the initial video clip as the final video clip; and
combining the TTS generated audio with the final video clip to generate an event highlight moment video.
17. The non-transitory computer-readable medium or media of claim 16, wherein the step of time anchoring to associate the runtime of the input video with the runtime of the activity comprises:
using optical character recognition on a set of video frames of the input video to read a time of a clock displayed in the input video;
generating a set of time anchors, including a start time of the activity and any time offset, given the identified time of the clock; and
A time map is generated using at least some of the set of time anchors, the time map mapping the time of the clock with the run time of the input video.
18. The non-transitory computer readable medium or media of claim 17, wherein identifying an approximate time at which the event occurred during the activity by using time information parsed from the input text, from additional sources related to the activity, or from both, and a related time obtained by time anchoring, further generating an initial clip from the input video including the event comprises:
parsing data from the metadata to obtain an approximate time of the event; and
the approximate time and the time map of the event in the input video are used to generate the initial video clip that includes the event.
19. The non-transitory computer readable medium or media of claim 16, wherein extracting features from the initial video clip and obtaining final time values for the events in the initial video clip using the extracted features and trained neural network model comprises:
Feature extracting the initial video clip using a set of two or more models; and
the final time value for the event in the initial video clip is obtained using an integrated neural network model that receives input related to the feature from the set of two or more models and outputs the final time value.
20. The non-transitory computer-readable medium or medium of claim 16, further comprising one or more sequences of instructions which, when executed by at least one processor, cause the steps to be performed comprising:
given supplementary text about the activity or event:
parsing the supplemental text using a text parsing module to identify information related to the event mentioned in the supplemental text; and
the information extracted from the supplemental text is used to help identify the time of occurrence of the event.
