WO2021112419A1 - Method and electronic device for automatically editing a video


Info

Publication number
WO2021112419A1
Authority
WO
WIPO (PCT)
Prior art keywords
text block
video
electronic device
frame
text
Application number
PCT/KR2020/015648
Other languages
English (en)
Inventor
Pramesh DAHIYA
Aayush VIJAYVARGIYA
Vijay Kumar Raju NADIMPALLI
Sai Hemanth KASARANENI
Sony Venkatesh VADLAMUDI
Garima CHADHA
Nitin Jain
Original Assignee
Samsung Electronics Co., Ltd.
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2021112419A1


Classifications

    • H04N21/8547 - Content authoring involving timestamps for synchronizing content
    • G11B27/031 - Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/28 - Indexing; addressing; timing or synchronising by using information signals recorded by the same method as the main recording
    • H04N21/4394 - Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/440236 - Reformatting of video signals by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H04N21/8456 - Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • H04N21/8549 - Creating video summaries, e.g. movie trailer
    • G10L15/26 - Speech to text systems

Definitions

  • the present disclosure relates to electronic devices, and more specifically to a method and electronic device for automatically editing a video.
  • Social networking apps limit the length of a video when publishing the video to social networks.
  • the social networking apps trim the video to a limited length (e.g. 30 seconds) for publishing, when the length of the video exceeds that limit.
  • the trimmed portion in the video may include scenes which are less appealing or not appealing to the user, which hampers user experience.
  • the principal object of the embodiments herein is to provide a method and electronic device for automatically editing a video.
  • Another object of the embodiments herein is to segment the video based on check points in the video, and generate text blocks for image frames and the audio frames in each segment.
  • Another object of the embodiments herein is to refine the text blocks based on a relevant object in the text blocks and generate a context score for each text block.
  • Another object of the embodiments herein is to predict a candidate text block in the text blocks based on context scores of remaining text blocks, match the predicted text block with the candidate text block and determine an inflexion between the predicted text block and the candidate text block.
  • Another object of the embodiments herein is to edit the text blocks based on editing parameters, the relevant object in the text blocks, timestamps associated with the image frames and the audio frames and the inflexion, where the editing parameter includes a context of each frame of the video, a context of at least one object displayed in the each frame, a relationship among the frames, a sequence of actions displayed in the each frame, a duplicate portion of text in each text block, a user preference to delete at least one portion of the text in the each text block, a genre of the each text block, a genre of the video and a user behavior.
  • Another object of the embodiments herein is to map the edited text block to the segment of the video and produce an edited video by combining the mapped segments.
  • Another object of the embodiments herein is to automatically edit a recorded video or a live video without a manual intervention in editing the video.
  • Another object of the embodiments herein is to automatically edit a recorded video or a live video, without manual intervention, to the maximum video length supported by an application for publishing.
  • the embodiments herein provide a method for automatically editing a video.
  • the method includes receiving, by the electronic device, a plurality of frames of a video. Further, the method includes generating, by the electronic device, a plurality of text block corresponding to the plurality of frames of the video. Further, the method includes editing, by the electronic device, at least one text block from the plurality of text block based on at least one editing parameter, where the editing parameter includes at least one of a contextual parameter and a user related parameter. Further, the method includes mapping, by the electronic device, the at least one edited text block to at least one frame from the plurality of frames. Further, the method includes producing, by the electronic device, an edited video comprising the at least one mapped frame of the video.
  • the edited video is produced by combining the at least one mapped frame of the video in a sequence based on a timestamp of the at least one mapped frame.
  • generating, by the electronic device, the plurality of text block corresponding to the plurality of frames of the video includes identifying, by the electronic device, at least one check point in the video, segmenting, by the electronic device, the video into a plurality of image segment and a plurality of audio segment with associated timestamps based on the at least one check point, wherein each audio segment includes a plurality of audio frames, and each video segment includes a plurality of image frames, generating, by the electronic device, a first text block for the each image segment in the plurality of image segment, generating, by the electronic device, a second text block for the each audio segment in the plurality of audio segment, determining, by the electronic device, a relevant object in the plurality of text block, wherein the plurality of text block includes the first text block and the second text block, modifying, by the electronic device, the first text block and the second text block with respect to the relevant object, and collating, by the electronic device, the first text block and the second text block based on the timestamps associated with the plurality of image segment and the plurality of audio segment.
  • editing, by the electronic device, the at least one text block from the plurality of text block based on at least one editing parameter includes determining, by the electronic device, a context score for the at least one text block of the plurality of text block, predicting, by the electronic device, a candidate text block in the plurality of text block based on the context score of remaining text blocks in the plurality of text block, matching, by the electronic device, a predicted text block with the candidate text block in the plurality of text block, determining, by the electronic device, an inflexion between the predicted text block and the candidate text block based on the match, and editing, by the electronic device, the at least one text block from the plurality of text block based on the inflexion between the predicted text block and the candidate text block and the at least one editing parameter.
  • the contextual parameter includes a context of each frame of the video, a context of at least one object displayed in the each frame, a relationship among the frames, a sequence of actions displayed in the each frame, a duplicate portion of text in each text block, a genre of the each text block, and a genre of the video.
  • the user related parameter includes a user preference to delete at least one portion of the text in the each text block, and a user behavior.
  • in an embodiment, the user behavior includes user reactions while watching at least one frame of the video, a change in vital parameters of the user while watching the at least one frame of the video, and a user preference for the at least one frame of the video.
  • the electronic device modifies the user preference for the at least one frame of the video based on the user reactions, the change in vital parameters and the user preference to delete the at least one portion of the text in the text block.
  • the text block includes at least one sentence with an associated timestamp, wherein each text block starts or ends at a timestamp of a check point in the video.
  • the embodiments herein provide an electronic device for automatically editing a video.
  • the electronic device includes a memory and a processor, where the processor is coupled to the memory.
  • the processor is configured to receive a plurality of frames of a video.
  • the processor is configured to generate a plurality of text block corresponding to the plurality of frames of the video.
  • the processor is configured to edit at least one text block from the plurality of text block based on at least one editing parameter, where the editing parameter includes at least one of a contextual parameter and a user related parameter.
  • the processor is configured to map the at least one edited text block to at least one frame from the plurality of frames.
  • the processor is configured to produce an edited video including the at least one mapped frame of the video.
  • FIG. 1 is a block diagram of an electronic device for automatically editing a video, according to an embodiment as disclosed herein;
  • FIG. 2 is a flow diagram illustrating a method for editing the video using the electronic device, according to an embodiment as disclosed herein;
  • FIG. 3 is an architectural diagram of the electronic device illustrating operations performed by hardware components of the electronic device for automatically editing the video, according to an embodiment as disclosed herein;
  • FIG. 4 illustrates an example scenario of managing duplicity in sentences by a text block editor for automatically editing the video, according to an embodiment as disclosed herein;
  • FIG. 5 is an example scenario illustrating operations performed by the hardware components of the electronic device for automatically editing the video, according to an embodiment as disclosed herein;
  • FIG. 6A-6B illustrates an example scenario of automatically editing the video for two applications by the electronic device, according to an embodiment as disclosed herein;
  • FIG. 7A-7C illustrates an example scenario of producing an edited video including relevant image frames from a stored video by the electronic device, according to an embodiment as disclosed herein;
  • FIG. 8 illustrates an example scenario of producing the edited video including the relevant image frames from a live video by the electronic device, according to an embodiment as disclosed herein;
  • FIG. 9A-9B illustrates an example scenario of producing the edited video including the relevant image frames from multiple videos by the electronic device, according to an embodiment as disclosed herein.
  • circuits may, for example, be embodied in one or more semiconductor chips, or on substrate supports such as printed circuit boards and the like.
  • circuits constituting a block may be implemented by dedicated hardware, or by a processor (e.g., one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware to perform some functions of the block and a processor to perform other functions of the block.
  • Each block of the embodiments may be physically separated into two or more interacting and discrete blocks without departing from the scope of the disclosure.
  • the blocks of the embodiments may be physically combined into more complex blocks without departing from the scope of the disclosure.
  • the embodiments herein provide a method for automatically editing a video.
  • the method includes receiving, by the electronic device, a plurality of frames of a video. Further, the method includes generating, by the electronic device, a plurality of text block corresponding to the plurality of frames of the video. Further, the method includes editing, by the electronic device, at least one text block from the plurality of text block based on at least one editing parameter. Further, the method includes mapping, by the electronic device, the at least one edited text block to at least one frame from the plurality of frames. Further, the method includes producing, by the electronic device, an edited video comprising the at least one mapped frame of the video.
  • the proposed method allows the electronic device to produce the edited video by combining the at least one mapped frame of the video.
  • the at least one mapped frame of the video corresponds to scenes in the video that are interesting to the user.
  • the electronic device produces the edited video with a length within the maximum video length supported by social networking apps for publishing the video to social networks. Therefore, the proposed method saves the time spent by the user in analyzing the video to find the interesting scenes in the video, reduces the manual effort in finding the interesting scenes in the video and improves the user experience in producing the edited video.
  • Referring to FIGS. 1 through 9B, there are shown preferred embodiments.
  • FIG. 1 is a block diagram of an electronic device 100 for automatically editing a video, according to an embodiment as disclosed herein.
  • the video is one of a recorded video or a live video.
  • Examples for the electronic device 100 are, but not limited to a smart phone, a tablet computer, a personal computer, a desktop computer, a personal digital assistant (PDA), a multimedia device, and the like.
  • the electronic device 100 includes a processor 120, a memory 140 and a communicator 160.
  • the electronic device 100 includes a camera (not shown) and a microphone (not shown).
  • the processor 120 includes a frame engine 121, a text block generator 122, a text block editor 123, a mapping engine 124, a text block refining engine 125, a context score generator 126, a user behavior engine 127, a text block prediction engine 128, a text block comparator 129, and a genre determiner 130.
  • the processor 120 is configured to execute instructions stored in the memory 140.
  • the memory 140 includes a user preference database, where a user behavior is recorded in the user preference database.
  • the user behavior includes user reactions while watching at least one frame of the video, a change in vital parameters of the user while watching the at least one frame of the video, and a user preference for the at least one frame of the video.
  • Examples for the user reactions are, but not limited to expressions of the user, comments of the user, scenes in the video skipped by the user, scenes in the video repeatedly watched by the user and the like.
  • Examples for the vital parameters are, but not limited to blood pressure, breathing rate, pulse rate, heart beat rate, sweating rate and the like.
  • Examples for the user preference for at least one frame of the video are, but not limited to a user preferred genre (e.g. horror, romantic, comedy, etc.) of the video, a user preferred character in the video, a user preferred artist, a user preferred audio, and the like.
  • the memory 140 stores applications (e.g. camera application, instant messaging application, social networking application, video playing application, etc.) installed in the electronic device 100 and a recorded video.
  • the memory 140 may include non-volatile storage elements. Examples of such non-volatile storage elements may include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of an Electrically Programmable Memory (EPROM) or an Electrically Erasable and Programmable Memory (EEPROM).
  • the memory 140 may, in some examples, be considered a non-transitory storage medium.
  • the term “non-transitory” may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term “non-transitory” should not be interpreted that the memory 140 is non-movable.
  • the memory 140 can, in certain examples, be configured to store larger amounts of information.
  • a non-transitory storage medium may store data that can, over time, change (e.g., in Random Access Memory (RAM) or cache).
  • the processor 120 is configured to receive a plurality of frames of the video, where each frame of the video includes an image frame and an audio frame.
  • the frame engine 121 receives the plurality of frames of the video.
  • the genre determiner 130 determines the genre of the video, in response to receiving the plurality of frames of the video. Further, the genre determiner 130 sends the genre of the video to the text block editor 123.
  • the user behavior engine 127 determines a user behavior based on a video watch history of the user, and records the user behavior to the user preference database. In an embodiment, the user behavior engine 127 analyzes a user experience in watching the video while progressing a playback of the video, for determining the user behavior. In an embodiment, the user behavior engine 127 analyzes the user experience using Long Short-Term Memory (LSTM) based neural network.
  • the user behavior engine 127 determines a user preference to delete at least one portion of the text in the each text block based on a user input.
  • the text block includes at least one sentence with an associated timestamp.
  • the user behavior engine 127 fetches the recorded user behavior from the user preference database, in response to receiving the recorded video for editing.
  • the user behavior engine 127 dynamically determines the user behavior in watching the live video and fetches the recorded user behavior from the user preference database, in response to receiving the live video for editing. Further, the user behavior engine 127 updates the user behavior in the user preference database based on the dynamically determined user behavior.
  • the processor 120 is configured to generate a plurality of text block corresponding to the plurality of frames of the video.
  • the text block is the at least one sentence which defines a scene and an audio playing in at least one frame of the video.
  • the text block generator 122 generates the plurality of text block corresponding to the plurality of frames of the video.
  • the frame engine 121 identifies at least one check point in the video based on change in a scene in the video.
  • the frame engine 121 uses a Scene Transition Graphs (STG) method for identifying the at least one check point in the video.
  • the frame engine 121 segments the video into a plurality of image segment and a plurality of audio segment with associated timestamps based on the at least one check point.
  • each audio segment includes a plurality of audio frames
  • each video segment includes a plurality of image frames.
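As an illustration of the check-point and segmentation step above, the following Python sketch detects scene changes with a simple OpenCV colour-histogram heuristic and turns the resulting check points into segment time ranges. The disclosure names Scene Transition Graphs for this step; the histogram threshold and the helper names below are assumptions used only to make the idea concrete.

```python
import cv2

def find_checkpoints(video_path, hist_threshold=0.5):
    """Return timestamps (seconds) where the scene changes noticeably.

    A stand-in for the Scene Transition Graph method mentioned in the
    disclosure: consecutive frames whose colour-histogram correlation drops
    below `hist_threshold` are treated as check points.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    checkpoints, prev_hist, frame_idx = [0.0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < hist_threshold:
                checkpoints.append(frame_idx / fps)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return checkpoints

def segment_ranges(checkpoints, duration):
    """Turn check-point timestamps into (start, end) ranges for segments."""
    bounds = checkpoints + [duration]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
```

The same (start, end) ranges delimit both the image segments and the audio segments, since both are cut at the check points.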
  • the text block generator 122 generates a first text block for the each image segment in the plurality of image segment.
  • the text block generator 122 extracts visual features in the image segment using Convolutional Neural Network (CNN), generates words and rearranges the words using the LSTM based neural network for generating the first text block.
  • each text block starts or ends at a timestamp of a check point in the video.
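The disclosure describes the first text block as being generated by a CNN that extracts visual features of an image segment and an LSTM that generates and rearranges the words. The sketch below is a minimal, untrained PyTorch model in that spirit; the ResNet-18 backbone, vocabulary size and embedding/hidden sizes are illustrative assumptions, not part of the disclosure.

```python
import torch
import torch.nn as nn
import torchvision.models as models  # torchvision >= 0.13

class FrameCaptioner(nn.Module):
    """Toy CNN encoder + LSTM decoder: the CNN extracts visual features of an
    image segment and the LSTM generates the words of the first text block."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)        # CNN feature extractor
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids of the sentence
        feats = self.encoder(images).unsqueeze(1)       # (B, 1, E)
        tokens = self.embed(captions)                   # (B, T, E)
        inputs = torch.cat([feats, tokens], dim=1)      # image feature first
        hidden, _ = self.decoder(inputs)
        return self.head(hidden)                        # (B, T+1, vocab)

# Example instantiation (random weights; for illustration only):
model = FrameCaptioner(vocab_size=5000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
```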
  • the text block generator 122 generates a second text block for the each audio segment in the plurality of audio segment.
  • the text block generator 122 converts audio in the each audio segment to sentences by using Automatic Speech Recognition (ASR) for generating the second text block.
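For the second text block, the disclosure only specifies that Automatic Speech Recognition converts each audio segment into sentences. A hedged sketch using the open-source Whisper model, which is one possible ASR engine and not one named in the patent, could look like this:

```python
import whisper  # pip install openai-whisper (one possible ASR engine)

def audio_segment_to_text_blocks(audio_path, segment_start, segment_end):
    """Transcribe one audio segment into second text blocks with timestamps.

    Whisper is used here purely for illustration; any ASR engine would do.
    Returns the transcribed sentences that start inside the given segment.
    """
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    blocks = []
    for seg in result["segments"]:
        if segment_start <= seg["start"] < segment_end:
            blocks.append({"text": seg["text"].strip(), "timestamp": seg["start"]})
    return blocks
```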
  • the text block refining engine 125 determines a relevant object in the plurality of text block, where the plurality of text block includes the first text block and the second text block. Examples for the relevant object are, but not limited to an important character in the video, an important location in the video, an artist in the video, a character that mostly appears in the video, and the like.
  • the relevant object in the plurality of text block is the relevant object in the video.
  • the text block refining engine 125 modifies the first text block and the second text block with respect to the relevant object by mapping each text block in the plurality of text block to the relevant object.
  • the text block generator 122 generates the first text block "Man is riding the cycle" for one of the image segment.
  • the text block refining engine 125 recognizes "Arthur" as the relevant object in the video.
  • the text block refining engine 125 modifies the first text block to "Arthur is riding the cycle", in response to detecting the relevant object.
  • the text block refining engine 125 collates the first text block and the second text block based on the timestamps associated with the plurality of image segment and the plurality of audio segment.
  • a timestamp of one of the first text block is the same as the timestamp associated with a corresponding image segment of the one of the first text block.
  • a timestamp of one of the second text block is the same as the timestamp associated with a corresponding audio segment of the one of the second text block.
  • the first text block and the second text block are collated when the timestamp of the first text block and the timestamp of the second text block are the same.
  • the text block refining engine 125 modifies the first text block to "Arthur is riding the cycle" where the time stamp of the first text block is "11:05 second”.
  • the text block refining engine 125 modifies the second text block to "Arthur says that he will reach home by 10" where the time stamp of the second text block is "11:05 second”.
  • the text block refining engine 125 collates the first text block and the second text block having the same timestamp to form a single text block "Arthur is riding the cycle and Arthur says that he will reach home by 10".
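The refinement and collation steps above can be illustrated with plain Python. In the sketch below the relevant object is taken to be the most frequent capitalised name across the text blocks, generic subjects are rewritten to that name (e.g. "Man is riding the cycle" becomes "Arthur is riding the cycle"), and image-derived and audio-derived sentences that share a timestamp are joined into a single text block. The regex, the list of generic subjects and the dictionary-based block format are assumptions.

```python
import re
from collections import Counter

GENERIC_SUBJECTS = ("A man", "The man", "Man", "He")   # assumed placeholder subjects

def find_relevant_object(text_blocks):
    """Crude stand-in for relevant-object detection: pick the most frequent
    capitalised name appearing across all text blocks."""
    names = []
    for block in text_blocks:
        names += re.findall(r"\b[A-Z][a-z]+\b", block["text"])
    names = [n for n in names if n not in ("A", "The", "Man", "He", "Finally")]
    return Counter(names).most_common(1)[0][0] if names else None

def refine_blocks(text_blocks, relevant_object):
    """Rewrite generic subjects with the relevant object."""
    if relevant_object is None:
        return text_blocks
    for block in text_blocks:
        for generic in GENERIC_SUBJECTS:
            block["text"] = block["text"].replace(generic, relevant_object)
    return text_blocks

def collate(first_blocks, second_blocks):
    """Join image-derived and audio-derived sentences sharing a timestamp."""
    audio_by_ts = {b["timestamp"]: b["text"] for b in second_blocks}
    collated = []
    for b in first_blocks:
        text = b["text"]
        if b["timestamp"] in audio_by_ts:
            text = f'{text} and {audio_by_ts[b["timestamp"]]}'
        collated.append({"text": text, "timestamp": b["timestamp"]})
    return collated
```

Applied to the example above, collate() produces the single text block "Arthur is riding the cycle and Arthur says that he will reach home by 10" at the shared timestamp.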
  • the processor 120 is configured to edit at least one text block from the plurality of text block based on at least one editing parameter, where the editing parameter includes a contextual parameter and a user related parameter.
  • the contextual parameter includes a context of each frame of the video, a context of at least one object displayed in the each frame, a relationship among the frames, a sequence of actions displayed in the each frame, a duplicate portion of text in each text block, a genre of the each text block, and a genre of the video.
  • the user related parameter includes a user preference to delete the at least one portion of the text in the each text block, and the user behavior.
  • examples for the context of each frame of the video are, but not limited to an activity performed by the relevant object in the video, a time of a day shown in the each frame of the video, a location shown in the each frame of the video, a relevancy of the each frame of the video, a usefulness of the each frame of the video and the like.
  • the text block editor 123 edits the at least one text block from the plurality of text block based on the at least one editing parameter.
  • the context score generator 126 determines a context score for the at least one text block of the plurality of text block based on the context of each frame of the video, the context of the at least one object displayed in the each frame, the relationship among the frames, the sequence of actions displayed in the each frame, the duplicate portion of text in each text block and the genre of the each text block. Further, the context score generator 126 provides the context score of the at least one text block and the plurality of text block to the user behavior engine 127, the text block prediction engine 128 and the text block comparator 129.
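The disclosure lists the inputs to the context score but gives no formula. Purely as an illustration, a context score could be a weighted sum of per-parameter scores; the feature names, the [0, 1] ranges and the weights below are assumptions, not the disclosed method.

```python
def context_score(block, weights=None):
    """Illustrative weighted-sum context score over the contextual parameters
    (frame context, object context, inter-frame relationship, action sequence,
    duplicate text, genre). Per-feature scores are assumed to be in [0, 1]."""
    weights = weights or {
        "frame_context": 0.25,
        "object_context": 0.25,
        "frame_relationship": 0.15,
        "action_sequence": 0.15,
        "non_duplicate": 0.10,
        "genre_match": 0.10,
    }
    return sum(weights[k] * block["features"].get(k, 0.0) for k in weights)

# Example: a block with strong frame and object context scores highest.
example = {"features": {"frame_context": 0.9, "object_context": 0.8,
                        "non_duplicate": 1.0, "genre_match": 0.5}}
print(round(context_score(example), 3))   # 0.575
```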
  • the text block prediction engine 128 predicts a candidate text block in the plurality of text block based on the context score of remaining text blocks in the plurality of text block.
  • the text block prediction engine 128 uses a Bidirectional Encoder Representations from Transformers (BERT) language model for predicting the candidate text block in the plurality of text block.
  • the text block prediction engine 128 predicts the Text block 1 using the context score of the Text block 2, the Text block 3 and the Text block 4, where the Text block 1 is the candidate text block in a first prediction.
  • the text block prediction engine 128 predicts the Text block 2 using the context score of the Text block 1, the Text block 3 and the Text block 4, where the Text block 2 is the candidate text block in a second prediction.
  • the text block prediction engine 128 predicts the Text block 3 using the context score of the Text block 1, the Text block 2 and the Text block 4, where the Text block 3 is the candidate text block in a third prediction.
  • the text block prediction engine 128 predicts the Text block 4 using the context score of the Text block 1, the Text block 2 and the Text block 3, where the Text block 4 is the candidate text block in a fourth prediction.
  • the text block prediction engine 128 provides each predicted text block to the text block comparator 129.
  • the text block comparator 129 matches the predicted text block with the candidate text block in the plurality of text block.
  • the text block comparator 129 matches the predicted text block with the candidate text block in the plurality of text block by contextually comparing the predicted text block with the candidate text block using semantic textual similarity and highlighting differences between the predicted text block and the candidate text block. Further, the text block comparator 129 determines an inflexion between the predicted text block and the candidate text block based on the match. In an embodiment, the text block comparator 129 generates an inflexion score for the at least one text block in the plurality of text block based on the inflexion between the predicted text block and the candidate text block. Further, the text block comparator 129 provides the inflexion score to the text block editor 123.
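The predict-and-match step can be approximated as follows. The disclosure uses a BERT language model for prediction and semantic textual similarity for matching; the sketch below substitutes a leave-one-out embedding average (via sentence-transformers) for the BERT prediction and defines the inflexion score as one minus the cosine similarity. The model name and the averaging scheme are assumptions, not the disclosed algorithm.

```python
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

def inflexion_scores(text_blocks):
    """Leave-one-out approximation of the predict-and-match step.

    For each candidate block, the remaining blocks are embedded and averaged
    to stand in for the predicted block; the inflexion score is
    1 - cosine_similarity(prediction, candidate). High scores mark blocks that
    break the surrounding context. Assumes at least two text blocks.
    """
    texts = [b["text"] for b in text_blocks]
    embeddings = _model.encode(texts, convert_to_tensor=True)
    scores = []
    for i in range(len(texts)):
        others = [e for j, e in enumerate(embeddings) if j != i]
        predicted = sum(others) / len(others)          # mean of remaining blocks
        similarity = util.cos_sim(predicted, embeddings[i]).item()
        scores.append(1.0 - similarity)
    return scores
```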
  • the user behavior engine 127 generates the user behavior score for the at least one text block based on the user behavior, in response to receiving the context score and the plurality of text block. Further, the user behavior engine 127 sends the user behavior score for the at least one text block, the context score of the at least one text block and the plurality of text block to the text block editor 123.
  • the processor 120 is configured to modify the user preference for the at least one frame of the video based on the user reactions while watching the live video or a new video (i.e. a new live video or a new recorded video.), the change in vital parameters while watching the live video or the new video, and the user preference to delete the at least one portion of the text in the text block while watching the live video or the new video.
  • the user behavior engine 127 modifies the user preference for the at least one frame of the video based on the user reactions while watching the live video or the new video, the change in vital parameters while watching the live video or the new video and the user preference to delete the at least one portion of the text in the text block while watching the live video or the new video.
  • the text block editor 123 edits the at least one text block from the plurality of text block based on the inflexion between the predicted text block and the candidate text block and the at least one editing parameter. In an embodiment, the text block editor 123 edits the at least one text block from the plurality of text block based on the inflexion score of the at least one text block, the user behavior score for the at least one text block and the genre of the video. In an embodiment, editing of the at least one text block includes setting a limit on a length of the timestamp associated with one of a duplicate text block in the plurality of text block and deleting remaining duplicate text blocks in the plurality of text block.
  • editing of the at least one text block in the plurality of text block includes rearranging a sequence of the at least one text block in the plurality of text block. In an embodiment, editing of the at least one text block in the plurality of text block includes selecting at least one relevant text block from the plurality of text block.
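A minimal sketch of the editing step described above, under the assumption that the inflexion score and the user behavior score are combined with equal weights and that blocks are selected greedily until the target length is reached (the disclosure does not specify either choice):

```python
def edit_text_blocks(blocks, inflexions, behavior_scores, max_length_sec):
    """Illustrative editing step: drop duplicate text blocks, rank the rest by
    a combined relevance score, keep the top blocks that fit within the target
    video length, then restore playback order.

    `blocks` are collated text blocks with "text", "timestamp" and "duration";
    the equal weighting and the greedy cut-off are assumptions.
    """
    seen, kept = set(), []
    for block, inflexion, behavior in zip(blocks, inflexions, behavior_scores):
        if block["text"] in seen:                 # duplicate text block: delete
            continue
        seen.add(block["text"])
        block["score"] = 0.5 * inflexion + 0.5 * behavior
        kept.append(block)

    kept.sort(key=lambda b: b["score"], reverse=True)   # most relevant first
    selected, total = [], 0.0
    for block in kept:
        if total + block["duration"] <= max_length_sec:
            selected.append(block)
            total += block["duration"]
    return sorted(selected, key=lambda b: b["timestamp"])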
  • the processor 120 is configured to map the at least one edited text block to the at least one frame from the plurality of frames.
  • the mapping engine 124 maps the at least one edited text block to the at least one frame from the plurality of frames.
  • the processor 120 is configured to produce an edited video, where the edited video includes the at least one mapped frame of the video.
  • the frame engine 121 produces the edited video by combining the at least one mapped frame of the video.
  • the at least one mapped frame is combined in the sequence based on the timestamp of the at least one mapped frame.
  • the at least one mapped frame is combined based on the sequence rearranged by the text block editor 123.
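The mapping and combining steps can be sketched with moviepy, which is used here only as one convenient way to cut and concatenate clips and is not named in the disclosure. Each selected text block is assumed to carry the timestamp and duration of the segment it was generated from.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips  # moviepy 1.x

def produce_edited_video(video_path, selected_blocks, output_path="edited.mp4"):
    """Map edited text blocks back to their video segments and join them.

    The clips are cut at each block's (timestamp, timestamp + duration) range
    and concatenated in timestamp order to produce the edited video.
    """
    source = VideoFileClip(video_path)
    clips = [source.subclip(b["timestamp"], b["timestamp"] + b["duration"])
             for b in sorted(selected_blocks, key=lambda b: b["timestamp"])]
    edited = concatenate_videoclips(clips)
    edited.write_videofile(output_path)
    source.close()
```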
  • FIG. 1 shows the hardware components of the electronic device 100 but it is to be understood that other embodiments are not limited thereon.
  • the electronic device 100 may include fewer or more components.
  • the labels or names of the components are used only for illustrative purposes and do not limit the scope of the invention.
  • One or more components can be combined together to perform same or substantially similar function for automatically editing the video.
  • FIG. 2 is a flow diagram 200 illustrating a method for editing the video using the electronic device 100, according to an embodiment as disclosed herein.
  • the method includes receiving the plurality of frames of the video.
  • the method allows the frame engine 121 to receive the plurality of frames of the video.
  • the method includes generating the plurality of text block corresponding to the plurality of frames of the video.
  • the method allows the text block generator 122 to generate the plurality of text block corresponding to the plurality of frames of the video.
  • the method includes editing the at least one text block from the plurality of text block based on the at least one editing parameter.
  • the method allows the text block editor 123 to edit the at least one text block from the plurality of text block based on the at least one editing parameter.
  • the method includes mapping the at least one edited text block to the at least one frame from the plurality of frames. In an embodiment, the method allows the mapping engine 124 to map the at least one edited text block to the at least one frame from the plurality of frames. At 205, the method includes producing the edited video including the at least one mapped frame of the video. In an embodiment, the method allows the frame engine 121 to produce the edited video including the at least one mapped frame of the video.
  • FIG. 3 is an architectural diagram of the electronic device 100 illustrating operations performed by the hardware components of the electronic device 100 for automatically editing the video, according to an embodiment as disclosed herein.
  • the camera application in the electronic device 100 uses the microphone and the camera to generate the live video of a scene.
  • the frame engine 121 and the genre determiner 130 receive (301A) the live video from the camera application for editing the video.
  • the frame engine 121 and the genre determiner 130 fetch (301B) the video stored in the memory 140 or a server (not shown) for editing the video.
  • the genre determiner 130 determines the genre of the video and provides the genre of the video to the text block editor 123.
  • the user behavior engine 127 fetches (303) the user behavior recorded at the user preference database from the memory 140, when the stored video is received at the frame engine 121 for editing. In an embodiment, the user behavior engine 127 fetches the user behavior recorded at the user preference database from the memory 140 and dynamically determines the user behavior in watching the live video, when the live video is received at the frame engine 121 for editing. The user behavior engine 127 dynamically determines the user behavior in watching the live video based on the expressions of the user while watching the video, the change in vital parameters of the user while watching the video and the comments of the user while watching the video. Further, the user behavior engine 127 updates the user behavior in the user preference database based on the dynamically determined user behavior.
  • the frame engine 121 identifies (305) the check points in the video. Further, the frame engine 121 segments the video into multiple video blocks based on the check points.
  • the text block generator 122 generates (306) the text block for the image frames and the audio frames in each video block.
  • the text block refining engine 125 determines (307) the relevant object in the plurality of text block, where the plurality of text block includes the first text block and the second text block. Further, the text block refining engine 125 modifies the first text block and the second text block with respect to the relevant object by mapping each text block in the plurality of text block to the relevant object. Further, the text block refining engine 125 collates the first text block and the second text block based on the timestamps associated with the plurality of image segment and the plurality of audio segment.
  • the context score generator 126 determines (308) the context score for the at least one text block of the plurality of text block. Further, the context score generator 126 provides the context score of the at least one text block and the plurality of text block to the user behavior engine 127, the text block prediction engine 128 and the text block comparator 129. The text block prediction engine 128 predicts (309) the candidate text block in the plurality of text block based on the context score of remaining text blocks in the plurality of text block. Further, the block prediction engine 128 provides each predicted text block to the text block comparator 129. The text block comparator 129 matches (310) the predicted text block with the candidate text block in the plurality of text block.
  • the text block comparator 129 determines the inflexion between the predicted text block and the candidate text block based on the match. Further, the text block comparator 129 generates the inflexion score for the at least one text block in the plurality of text block based on the inflexion between the predicted text block and the candidate text block. Further, the text block comparator 129 provides the inflexion score to the text block editor 123.
  • the user behavior engine 127 generates (311) the user behavior score for the at least one text block based on the user behavior, in response to receiving the context score and the plurality of text block. Further, the user behavior engine 127 sends the user behavior score for the at least one text block, the context score of the at least one text block and the plurality of text block to the text block editor 123.
  • the text block editor 123 edits (312) the at least one text block from the plurality of text block based on the inflexion score of the at least one text block, the user behavior score for the at least one text block and the genre of the video.
  • the mapping engine 124 maps (313) the at least one edited text block to the at least one frame from the plurality of frames. Further, the frame engine 121 produces (314) the edited video by combining the at least one mapped frame of the video.
  • FIG. 4 illustrates an example scenario of managing duplicity in sentences by the text block editor 123 for automatically editing the video, according to an embodiment as disclosed herein.
  • the electronic device 100 receives a video of a man riding a cycle, where the length of the video is 18 seconds and the video includes 1080 image frames.
  • the frame engine 121 segments the video into three video blocks, video block-a, video block-b and video block-c, based on check points in the video, where the length of each video block is 6 seconds and each video block includes 360 frames.
  • the timestamp of the video block-a starts at 0 seconds.
  • the timestamp of the video block-b starts at 7 seconds.
  • the timestamp of the video block-c starts at 13 seconds.
  • in response to receiving each video block, the text block generator 122 generates the sentence for each video block as "A man is riding a cycle". Further, the text block generator 122 assigns the timestamp at which each video block starts to the sentence corresponding to each video block. In response to determining a duplicity in the sentences, the text block editor 123 sets the limit on the length from the timestamp associated with the sentence corresponding to the video block-a, where the length from the timestamp is limited to 6 seconds. Further, the text block editor 123 deletes the sentences corresponding to the video block-b and the video block-c.
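The FIG. 4 duplicity handling can be reproduced with a few lines of Python; the dictionary-based block format is an assumption.

```python
# Worked version of the FIG. 4 scenario: three video blocks produce the same
# sentence, so only the first occurrence is kept and its length is capped.
blocks = [
    {"text": "A man is riding a cycle", "timestamp": 0,  "duration": 6},
    {"text": "A man is riding a cycle", "timestamp": 7,  "duration": 6},
    {"text": "A man is riding a cycle", "timestamp": 13, "duration": 6},
]

def drop_duplicates(blocks, max_keep_sec=6):
    kept, seen = [], set()
    for block in blocks:
        if block["text"] in seen:
            continue                                  # delete duplicate sentences
        seen.add(block["text"])
        block["duration"] = min(block["duration"], max_keep_sec)  # cap the length
        kept.append(block)
    return kept

print(drop_duplicates(blocks))   # only the block starting at 0 s survives
```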
  • FIG. 5 is an example scenario illustrating operations performed by the hardware components of the electronic device 100 for automatically editing the video, according to an embodiment as disclosed herein.
  • the frame engine 121 receives (501) the video of the man riding the cycle for automatically editing the video, where the video includes four frames named as frame 1, frame 2, frame 3 and frame 4.
  • the timestamp of the frame 1, the frame 2, the frame 3 and the frame 4 of the video is 0 sec, 1 sec, 2 sec and 3 sec respectively.
  • the image frame in the frame 1 shows the man riding the cycle and the audio in the audio frame of the frame 1 is "I am going to home".
  • the image frame in the frame 2 of the video shows the man riding the cycle and the audio in the audio frame of the frame 2 is "I will reach home by 10 AM".
  • the image frame in the frame 3 of the video shows the man riding the cycle and the audio in the audio frame of the frame 3 is "Finally, I reached my home".
  • the image frame in the frame 4 of the video shows the man getting out of the cycle and the audio in the audio frame of the frame 4 is "Arthur is moving towards the home".
  • the user behavior engine 127 determines the user behavior of skipping the scenes based on the user reactions and records it to the user preference database.
  • the frame engine 121 provides the image frame and the audio frame in each frame of the video to the text block generator 122.
  • the text block generator 122 generates the first text block (503) for each image frame and associates the timestamp of each image frame to corresponding text block as: Man is riding the cycle (timestamp: 0 sec), Man is riding the cycle (timestamp: 1 sec), Man is riding the cycle (timestamp: 2 sec), Man gets out of the cycle (timestamp: 3 sec).
  • the text block generator 122 generates the second text block for each audio frame and associates the timestamp of each audio frame to corresponding text block as: Man says that he is going to home (timestamp: 0 sec), Man says that he will reach home by 10 AM (timestamp: 1 sec), Man says that he reached the home (timestamp: 2 sec), A background audio plays that Arthur is moving towards the home (timestamp: 3 sec)
  • the text block refining engine 125 detects (504) "Arthur" as the relevant object from the text blocks.
  • the text block refining engine 125 modifies the first text block as: Arthur is riding the cycle (timestamp: 0 sec), Arthur is riding the cycle (timestamp: 1 sec), Arthur is riding the cycle (timestamp: 2 sec), Arthur gets out of the cycle (timestamp: 3 sec). Further, the text block refining engine 125 modifies the second text block with respect to Arthur as: Arthur says that he is going to home (timestamp: 0 sec), Arthur says that he will reach home by 10 AM (timestamp: 1 sec), Arthur says that he reached the home (timestamp: 2 sec), A background audio plays that Arthur is moving towards the home (timestamp: 3 sec).
  • the text block refining engine 125 collates the text blocks in the first text block and the second text block with the same timestamp to form a single text block as: Text block-1: Arthur is riding the cycle and Arthur says that he is going to home (timestamp: 0 sec), Text block-2: Arthur is riding the cycle and Arthur says that he will reach home by 10 AM (timestamp: 1 sec), Text block-3: Arthur is riding the cycle and Arthur says that he reached the home (timestamp: 2 sec), Text block-4: Arthur gets out of the cycle and moves towards the home (timestamp: 3 sec).
  • the user behavior engine 127 provides the user behavior to the text block editor 123.
  • the text block editor 123 determines that the text block-2 and the text block-3 are similar to the text block-1.
  • the text block editor 123 determines that the text block-2 does not contain relevant information that is not already in the text block-1.
  • the text block editor 123 determines that the text block-3 contains relevant information, unlike the text block-2. Therefore, the text block editor 123 deletes the text block-2. Further, the text block editor 123 deletes the text block-4 based on the user behavior.
  • the text block editor 123 provides the text block-1 and the text block-3 to the mapping engine 124.
  • the mapping engine 124 maps the text block-1 and the text block-3 to the frame 1 and the frame 3 of the video respectively based on the timestamp. Further, the frame engine 121 combines the frame 1 and frame 3 to produce the edited video.
  • FIG. 6A-6B illustrates an example scenario of automatically editing the video for two applications by the electronic device 100, according to an embodiment as disclosed herein.
  • the electronic device 100 includes the two applications, where the first application allows to publish the video of length 3 seconds and the second application allows to publish the video of length 2 seconds.
  • the electronic device 100 receives a recorded video of a concert for automatically editing the video, where the video includes six image frames named as image frame 1, image frame 2, image frame 3, image frame 4, image frame 5 and image frame 6 as shown in the FIG. 6A.
  • the timestamp of the image frame 1, the image frame 2, the image frame 3, the image frame 4, the image frame 5 and the image frame 6 of the video are 0 sec, 1 sec, 2 sec, 3 sec, 4 sec, and 5 sec respectively, where the length of each image frame is 1 sec.
  • the image frame 1 and the image frame 3 show only artists of the concert.
  • the image frame 2, the image frame 4 and the image frame 6 show only viewers of the concert.
  • the image frame 5 shows an artist and the viewers of the concert.
  • the user behavior engine 127 determines the user behavior of liking to watch the artists of the concert from the user reactions.
  • the electronic device 100 generates the first text block for each image frame and associates the timestamp of each image frame to corresponding text block. Further, the electronic device 100 detects the artists as the relevant object from the text blocks. Further, the electronic device 100 modifies the first text block with respect to the relevant object.
  • the electronic device 100 determines, based on the user behavior, that the text blocks corresponding to the image frame 1, the image frame 3 and the image frame 5 represent the relevant image frames in the video.
  • the electronic device 100 determines the text block corresponding to the image frame 3 as a high-priority text block due to the presence of a greater number of artists in the image frame 3.
  • the electronic device 100 determines the text block corresponding to the image frame 5 as a low-priority text block due to the presence of viewers along with the artist in the image frame 5.
  • the electronic device 100 determines the text block corresponding to the image frame 1 as a medium-priority text block due to the presence of fewer artists in the image frame 1.
  • the electronic device 100 selects the text blocks corresponding to the image frame 1, the image frame 3 and the image frame 5 based on the video length allowed by the first application for publishing and the priority of the text blocks. Further, the electronic device 100 produces the edited video by combining the image frame 1, the image frame 3 and the image frame 5 based on the timestamps as shown in the FIG. 6B. The electronic device 100 produces the edited video with the length of 3 seconds for publishing the edited video using the first application.
  • the electronic device 100 selects the text blocks corresponding to the image frame 1 and the image frame 3 based on the video length allowed by the second application for publishing and the priority of the text blocks. Further, the electronic device 100 produces the edited video by combining the image frame 1 and the image frame 3 based on the timestamps as shown in the FIG. 6B. The electronic device 100 produces the edited video with the length of 2 seconds for publishing the edited video using the second application.
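The per-application selection in FIGS. 6A-6B amounts to keeping the highest-priority relevant frames that fit within the length allowed by each application. A small sketch, with the frame priorities and 1-second lengths taken from the example above:

```python
PRIORITY = {"high": 3, "medium": 2, "low": 1}

def select_for_app(frames, max_length_sec):
    """FIG. 6 style selection: keep the highest-priority relevant frames that
    fit the length allowed by the target application, then restore timestamp
    order for the edited video."""
    ranked = sorted(frames, key=lambda f: PRIORITY[f["priority"]], reverse=True)
    chosen, total = [], 0
    for frame in ranked:
        if total + frame["length"] <= max_length_sec:
            chosen.append(frame)
            total += frame["length"]
    return sorted(chosen, key=lambda f: f["timestamp"])

relevant = [
    {"id": 1, "timestamp": 0, "length": 1, "priority": "medium"},
    {"id": 3, "timestamp": 2, "length": 1, "priority": "high"},
    {"id": 5, "timestamp": 4, "length": 1, "priority": "low"},
]
print([f["id"] for f in select_for_app(relevant, 3)])   # [1, 3, 5] for the first app
print([f["id"] for f in select_for_app(relevant, 2)])   # [1, 3] for the second app
```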
  • FIG. 7A-7C illustrates an example scenario of producing the edited video including the relevant image frames from the stored video by the electronic device 100, according to an embodiment as disclosed herein.
  • a ground in front of a house is shown in notation (a) of the FIG. 7A. Later, a person enters the ground and walks towards the house as shown in notation (b) of the FIG. 7A.
  • a surveillance camera mounted on the house records all these scenes.
  • the electronic device 100 receives the surveillance camera footage of the scenes recorded by the surveillance camera for editing.
  • the video footage includes six image frames named as image frame 1, image frame 2, image frame 3, image frame 4, image frame 5 and image frame 6 as shown in the FIG. 7B.
  • the timestamp of the image frame 1, image frame 2, image frame 3, image frame 4, image frame 5 and image frame 6 of the video are 0 sec, 1 sec, 2 sec, 3 sec, 4 sec, and 5 sec respectively, where the length of each image frame is 1 sec.
  • the image frame 1, the image frame 2 and the image frame 6 show the ground in front of the home.
  • the image frame 3, the image frame 4 and the image frame 5 show the person walking towards the home.
  • the electronic device 100 generates the first text block for each image frame and associates the timestamp of each image frame to the corresponding text block. Further, the electronic device 100 detects the person as the relevant object from the text blocks. Further, the electronic device 100 modifies the first text block with respect to the relevant object. The electronic device 100 selects the text blocks corresponding to the image frame 3, the image frame 4 and the image frame 5 as the relevant image frames in the video based on the relevant object present in the text blocks. Further, the electronic device 100 combines the image frame 3, the image frame 4 and the image frame 5 based on the timestamps as shown in the FIG. 7C for generating the edited video.
  • FIG. 8 illustrates an example scenario of producing the edited video including the relevant frames from the live video by the electronic device 100, according to an embodiment as disclosed herein.
  • the electronic device 100 receives the live video of the concert for automatically editing the video, where the video includes three image frames named as image frame 1, image frame 2, and image frame 3.
  • the timestamp of the image frame 1, the image frame 2, and the image frame 3 are 0 sec, 1 sec and 2 sec respectively, where the length of each image frame is 1 sec.
  • the image frame 1 and the image frame 3 show only artists of the concert, and the image frame 2 shows a horror scene in the concert as shown in notation (a) of the FIG. 8.
  • the electronic device 100 determines the user behavior in watching horror scenes from the video watching history of the user, where the user does not like to watch horror scenes.
  • the electronic device 100 generates the first text block for each image frame and associates the timestamp of each image frame to the corresponding text block. Further, the electronic device 100 detects the artists as the relevant object from the text blocks. Further, the electronic device 100 modifies the first text block with respect to the relevant object. The electronic device 100 selects the text blocks corresponding to the image frame 1 and the image frame 3 as the relevant image frames in the video based on the user behavior. Further, the electronic device 100 combines the image frame 1 and the image frame 3 based on the timestamps as shown in notation (b) of the FIG. 8, for generating the edited video.
  • FIG. 9A-9B illustrates an example scenario of producing the edited video including the relevant image frames from multiple videos by the electronic device 100, according to an embodiment as disclosed herein.
  • the electronic device 100 receives the recorded video of the concert at a first instant, the surveillance camera footage at a second instant, and the video of the man riding the cycle at a third instant as shown in notation (a) of the FIG. 9B, for automatically editing.
  • the recorded video of the concert includes three image frames named as image frame 1, image frame 2, and image frame 3, where the timestamp of the image frame 1, the image frame 2, and the image frame 3 are 0 sec, 1 sec and 2 sec respectively.
  • the length of each image frame of the recorded video of the concert is 1 sec.
  • the image frame 2 of the recorded video of the concert shows the artist of the concert, where the image frame 1 and the image frame 3 of the recorded video of the concert show the viewers of the concert as shown in the FIG. 9A.
  • the surveillance camera footage includes three image frames named as image frame 1, image frame 2, and image frame 3, where the timestamp of the image frame 1, the image frame 2, and the image frame 3 are 0 sec, 1 sec and 2 sec respectively.
  • the length of each image frame of the surveillance camera footage is 1 sec.
  • the image frame 2 and the image frame 3 of the surveillance camera footage show the person walking on the ground, where the image frame 1 shows the ground as shown in the FIG. 9A.
  • the video of the man riding the cycle includes three image frames named as image frame 1, image frame 2, and image frame 3, where the timestamp of the image frame 1, the image frame 2, and the image frame 3 are 0 sec, 1 sec and 2 sec respectively.
  • the length of each image frame of the video of the man riding the cycle is 1 sec.
  • All the image frames of the video of the man riding the cycle show the man riding the cycle as shown in FIG. 9A.
  • the user behavior engine 127 determines, from the user reactions, that the user likes to watch the artists of the concert.
  • the electronic device 100 generates the first text block for each image frame of the recorded video of the concert and associates the timestamp of each image frame of the recorded video of the concert to the corresponding text block.
  • the electronic device 100 generates the first text block for each image frame of the surveillance camera footage and associates the timestamp of each image frame of the surveillance camera footage to the corresponding text block.
  • the electronic device 100 generates the first text block for each image frame of the video of the man riding the cycle and associates the timestamp of each image frame of the video of the man riding the cycle to the corresponding text block.
  • the electronic device 100 detects the artists as the relevant object from the text blocks generated for the recorded video of the concert. Further, the electronic device 100 modifies the first text block for the recorded video of the concert with respect to the relevant object. The electronic device 100 selects the text block corresponding to the image frame 2 as the relevant image frame in the recorded video of the concert based on the user behavior.
  • the electronic device 100 detects the person walking on the ground as the relevant object from the text blocks generated for the surveillance camera footage. Further, the electronic device 100 modifies the first text block for the surveillance camera footage with respect to the relevant object. The electronic device 100 selects the text blocks corresponding to the image frame 2 and the image frame 3 as the relevant image frames in the surveillance camera footage based on the relevant object in the surveillance camera footage.
  • the electronic device 100 detects the man as the relevant object from the text blocks generated for the video of the man riding the cycle. Further, the electronic device 100 modifies the first text block for the video of the man riding the cycle with respect to the relevant object. The electronic device 100 detects duplicate text blocks among the text blocks generated for the video of the man riding the cycle, where all the image frames in the video of the man riding the cycle are similar. The electronic device 100 selects the text block with the timestamp 0 sec from the text blocks generated for the video of the man riding the cycle, in response to detecting the duplicate text blocks (the third example following this list gives a minimal sketch of this duplicate handling and of the multi-video combination described in the next item).
  • the electronic device 100 combines the image frame 2 of the recorded video of the concert, the image frame 2 and the image frame 3 of the surveillance camera footage, and the image frame 1 of the video of the man riding the cycle as shown in notation (b) of the FIG. 9B, for generating the edited video.
  • the steps in the claimed method and the steps performed by the claimed electronic device 100 disclosed herein may be accomplished via one or more techniques in artificial intelligence, machine learning, statistical and/or probabilistic machine learning, unsupervised learning, supervised learning, semi-supervised learning, other learning and classification techniques, and/or some combination thereof.
  • the embodiments disclosed herein can be implemented using at least one software program running on at least one hardware device and performing network management functions to control the elements.
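  • First example: the following is a minimal sketch, in Python, of the text-block generation, relevant-object selection and frame-combination step described for the surveillance camera footage of FIGS. 7A-7C. The TextBlock structure, the function names (generate_text_blocks, select_relevant_blocks, combine_frames), the describe_frame callable and the keyword-based relevance test are illustrative assumptions only, not the claimed implementation.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class TextBlock:
        timestamp: float      # timestamp of the source image frame, in seconds
        text: str             # generated textual description of the frame
        frame_index: int      # position of the frame in the input video

    def generate_text_blocks(frames: List[object],
                             describe_frame: Callable[[object], str],
                             frame_length: float = 1.0) -> List[TextBlock]:
        # One text block per image frame, with the frame timestamp attached.
        return [TextBlock(timestamp=i * frame_length,
                          text=describe_frame(frame),
                          frame_index=i)
                for i, frame in enumerate(frames)]

    def select_relevant_blocks(blocks: List[TextBlock],
                               relevant_object: str) -> List[TextBlock]:
        # Keep only the text blocks that mention the relevant object, e.g. "person".
        return [b for b in blocks if relevant_object.lower() in b.text.lower()]

    def combine_frames(frames: List[object],
                       selected: List[TextBlock]) -> List[object]:
        # Order the selected frames by timestamp to produce the edited video.
        return [frames[b.frame_index]
                for b in sorted(selected, key=lambda b: b.timestamp)]

In the surveillance example, a description such as "a person walking towards the home" would be generated for the image frame 3, the image frame 4 and the image frame 5 only, so only those frames survive the selection and are concatenated in timestamp order.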
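  • Second example: a minimal sketch of the user-behavior-based filtering used in the concert scenario of FIG. 8. It reuses the TextBlock structure of the previous sketch and assumes that each frame has already been tagged with scene labels such as "artist" or "horror"; the labelling step and the derivation of the disliked labels from the watching history are outside this sketch.

    from typing import Dict, List, Set

    def filter_by_user_behavior(blocks: List["TextBlock"],
                                scene_labels: Dict[int, Set[str]],
                                disliked: Set[str]) -> List["TextBlock"]:
        # Drop every text block whose frame carries a scene label the user dislikes.
        return [b for b in blocks
                if not (scene_labels.get(b.frame_index, set()) & disliked)]

    # Concert example: the frame at index 1 (image frame 2) is labelled "horror"
    # and the watching history indicates the user avoids horror scenes, so only
    # image frame 1 and image frame 3 remain in the edited video.
    # selected = filter_by_user_behavior(
    #     blocks,
    #     scene_labels={0: {"artist"}, 1: {"horror"}, 2: {"artist"}},
    #     disliked={"horror"})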
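  • Third example: a minimal sketch, under the same assumptions, of collapsing duplicate text blocks to a single representative frame (the cycling video of FIG. 9A) and of concatenating the relevant frames selected from several videos into one edited video (notation (b) of FIG. 9B). The duplicate test used here, equality of normalised description text, is an illustrative choice rather than a requirement of the disclosure.

    from typing import List, Tuple

    def drop_duplicate_blocks(blocks: List["TextBlock"]) -> List["TextBlock"]:
        # Keep the earliest text block of each group of identical descriptions.
        seen = set()
        kept = []
        for block in sorted(blocks, key=lambda b: b.timestamp):
            key = " ".join(block.text.lower().split())
            if key not in seen:
                seen.add(key)
                kept.append(block)
        return kept

    def combine_videos(per_video: List[Tuple[List[object], List["TextBlock"]]]) -> List[object]:
        # Concatenate the selected frames of each input video in the order the
        # videos were received, keeping timestamp order inside each video.
        edited = []
        for frames, selected in per_video:
            edited.extend(frames[b.frame_index]
                          for b in sorted(selected, key=lambda b: b.timestamp))
        return edited

In the cycling video, all three text blocks carry the same description, so only the block with the timestamp 0 sec is kept before the three videos are combined.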

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments herein provide a method for automatically editing a video. The method includes receiving, by the electronic device (100), a plurality of frames of a video. Further, the method includes generating, by the electronic device (100), a plurality of text blocks corresponding to the plurality of frames of the video. Further, the method includes modifying, by the electronic device (100), at least one text block of the plurality of text blocks based on at least one editing parameter, the editing parameter comprising a contextual parameter and/or a user-related parameter. Further, the method includes matching, by the electronic device (100), the at least one modified text block with at least one frame of the plurality of frames. Further, the method includes producing, by the electronic device (100), an edited video comprising the at least one matched frame of the video.
PCT/KR2020/015648 2019-12-04 2020-11-09 Procédé et dispositif électronique pour modification automatique de vidéo WO2021112419A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN201941050008 2019-12-04
IN201941050008 2019-12-04

Publications (1)

Publication Number Publication Date
WO2021112419A1 true WO2021112419A1 (fr) 2021-06-10

Family

ID=76222528

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/015648 WO2021112419A1 (fr) 2019-12-04 2020-11-09 Procédé et dispositif électronique pour modification automatique de vidéo

Country Status (1)

Country Link
WO (1) WO2021112419A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130346144A1 (en) * 2010-08-27 2013-12-26 Intel Corporation Technique and apparatus for analyzing video and dialog to build viewing context
WO2014137201A2 (fr) * 2013-03-07 2014-09-12 (주)에이엔티홀딩스 Système pour fournir des services de conservation de contenu sur la base de contextes et procédé correspondant
US20170032823A1 (en) * 2015-01-15 2017-02-02 Magisto Ltd. System and method for automatic video editing with narration
US20190311743A1 (en) * 2018-04-05 2019-10-10 Tvu Networks Corporation Methods, apparatus, and systems for ai-assisted or automatic video production

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FENG MENGFEI; CHEN YISHUAI; GUO YUCHUN; ZHAO YONGXIANG; FU GUOWEI: "Learning Text Representations for Finding Similar Exercises", 2019 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - TAIWAN (ICCE-TW), IEEE, 20 May 2019 (2019-05-20), pages 1 - 2, XP033712915, DOI: 10.1109/ICCE-TW46550.2019.8992012 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115150660A (zh) * 2022-06-09 2022-10-04 深圳市大头兄弟科技有限公司 一种基于字幕的视频编辑方法和相关设备
CN115150660B (zh) * 2022-06-09 2024-05-10 深圳市闪剪智能科技有限公司 一种基于字幕的视频编辑方法和相关设备
CN115278078A (zh) * 2022-07-27 2022-11-01 深圳市天和荣科技有限公司 一种拍摄方法、终端及拍摄系统
CN117152692A (zh) * 2023-10-30 2023-12-01 中国市政工程西南设计研究总院有限公司 基于视频监控的交通目标检测方法及系统
CN117152692B (zh) * 2023-10-30 2024-02-23 中国市政工程西南设计研究总院有限公司 基于视频监控的交通目标检测方法及系统

Similar Documents

Publication Publication Date Title
WO2021112419A1 (fr) Procédé et dispositif électronique pour modification automatique de vidéo
CN107918653B (zh) 一种基于喜好反馈的智能播放方法和装置
WO2015122691A1 (fr) Modification dynamique d'éléments d'une interface d'utilisateur d'après un graphe de connaissance
WO2013022156A1 (fr) Journalisation de vie et partage de mémoire
WO2023011094A1 (fr) Procédé et appareil de montage vidéo, dispositif électronique et support de stockage
US9311395B2 (en) Systems and methods for manipulating electronic content based on speech recognition
Smith et al. Parsing Movies in Context.
US8713008B2 (en) Apparatus and method for information processing, program, and recording medium
WO2017138766A1 (fr) Procédé de regroupement d'image à base hybride et serveur de fonctionnement associé
CN112732949B (zh) 一种业务数据的标注方法、装置、计算机设备和存储介质
TW200849030A (en) System and method of automated video editing
Vryzas et al. Speech emotion recognition adapted to multimodal semantic repositories
WO2019093599A1 (fr) Appareil permettant de générer des informations d'intérêt d'un utilisateur et procédé correspondant
WO2016093630A1 (fr) Enrichissement sémantique de données de trajectoire
JP2005025413A (ja) コンテンツ処理装置、コンテンツ処理方法及びプログラム
WO2021118050A1 (fr) Programme informatique d'édition automatique de vidéo de mises en évidence
WO2019088725A1 (fr) Procédé d'étiquetage automatique de métadonnées de contenu musical à l'aide d'un apprentissage automatique
WO2019228140A1 (fr) Procédé et appareil d'exécution d'instruction, support d'informations et dispositif électronique
CN114503100A (zh) 将情绪相关元数据标注到多媒体文件的方法和装置
KR102486806B1 (ko) 인공지능에 기반하여 시놉시스 텍스트를 분석하고 시청률을 예측하는 서버
US11152031B1 (en) System and method to compress a time frame of one or more videos
Kim et al. PERSONE: personalized experience recoding and searching on networked environment
CN115438633A (zh) 跨文档在线研讨处理方法、互动方法、装置和设备
CN114580790A (zh) 生命周期阶段预测和模型训练方法、装置、介质及设备
Khan Innovations in Multimedia Services: Deep-Learning Algorithm Integration

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20896194

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20896194

Country of ref document: EP

Kind code of ref document: A1