WO2023284316A1 - Video editing method and apparatus, and electronic device and readable storage medium - Google Patents

Video editing method and apparatus, and electronic device and readable storage medium Download PDF

Info

Publication number
WO2023284316A1
WO2023284316A1 · PCT/CN2022/080976 · CN2022080976W
Authority
WO
WIPO (PCT)
Prior art keywords
video
frame
target frame
image
edited
Prior art date
Application number
PCT/CN2022/080976
Other languages
French (fr)
Chinese (zh)
Inventor
陈妙
廖玺举
贠挺
李远杭
田颖
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2023284316A1 publication Critical patent/WO2023284316A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to artificial intelligence technologies such as image processing and deep learning.
  • As an information medium, video, especially short video, has attracted more and more attention.
  • Long-form videos, such as TV series, movies, and live entertainment streams, have existed for a long time.
  • These video resources are relatively long, and watching them in full is time-consuming.
  • Short videos, by contrast, are brief and are sought after because they fit fragmented time and concentrate information densely.
  • In the prior art, video resources are generally edited according to manually input editing operations, but because the editing timing and clip length are difficult to control, the accuracy and efficiency of video editing are low.
  • A video editing method is provided, comprising: acquiring a video to be edited and determining at least one target frame in the video to be edited; extracting an initial video for each target frame from the video to be edited; determining, according to the image and audio data of the image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video; and generating a clipped video for each target frame according to each target frame and its corresponding start frame and end frame.
  • A video editing apparatus is provided, including: an acquisition unit configured to acquire a video to be edited and determine at least one target frame in the video to be edited; an extraction unit configured to extract an initial video for each target frame from the video to be edited; a processing unit configured to determine, according to the image and audio data of the image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video; and a generation unit configured to generate a clipped video for each target frame according to each target frame and its corresponding start frame and end frame.
  • An electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.
  • A non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to execute the method described above.
  • A computer program product is provided, comprising a computer program that, when executed by a processor, implements the method described above.
  • After determining at least one target frame in the video to be edited, this embodiment first extracts the initial video corresponding to each target frame from the video to be edited, then determines the start frame and end frame corresponding to each target frame according to the initial videos, and finally generates a clipped video for each target frame according to each target frame and its corresponding start frame and end frame, thereby realizing automatic video editing and improving the accuracy and efficiency of video editing.
  • FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
  • FIG. 4 is a block diagram of an electronic device for implementing the video editing method of the embodiments of the present disclosure.
  • FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in FIG. 1, the video editing method of this embodiment may specifically include the following steps:
  • S101. Acquire a video to be edited, and determine at least one target frame in the video to be edited.
  • S102. Extract the initial video of each target frame from the video to be edited.
  • S103. Determine, according to the image and audio data of the image frames contained in each initial video, the start frame and end frame corresponding to the target frame in each initial video.
  • S104. Generate a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  • In the video editing method of this embodiment, after at least one target frame in the video to be edited is determined, the initial video corresponding to each target frame is first extracted from the video to be edited; the start frame and end frame corresponding to each target frame are then determined according to the extracted initial videos; and finally the clipped video of each target frame is generated according to each target frame and its corresponding start frame and end frame. This realizes automatic video editing and can improve the accuracy and efficiency of video editing.
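Taken together, S101 through S104 form a simple pipeline. The following Python sketch illustrates the control flow only; every function in it is a hypothetical stand-in for the corresponding step (the real components are detailed in the remainder of this section), not an API from the publication.

```python
from typing import List

def detect_target_frames(frames: List[dict]) -> List[int]:
    # Stand-in for S101: here, frames pre-flagged as highlights.
    return [i for i, f in enumerate(frames) if f.get("highlight")]

def extract_initial_video(frames: List[dict], t: int, window: int = 2):
    # Stand-in for S102: a fixed-length window containing the target frame.
    lo = max(0, t - window)
    return frames[lo : lo + 2 * window], t - lo

def locate_boundaries(initial: List[dict]):
    # Stand-in for S103: the whole window; a classifier refines this in practice.
    return 0, len(initial) - 1

def build_clip(initial: List[dict], start: int, end: int, target: int):
    # Stand-in for S104 (the freeze-frame detail is sketched later).
    return initial[start : end + 1]

def clip_video(frames: List[dict]) -> List[List[dict]]:
    clips = []
    for t in detect_target_frames(frames):                      # S101
        initial, t_local = extract_initial_video(frames, t)     # S102
        start, end = locate_boundaries(initial)                 # S103
        clips.append(build_clip(initial, start, end, t_local))  # S104
    return clips

frames = [{"highlight": i in (3, 6)} for i in range(8)]
print([len(c) for c in clip_video(frames)])  # two clips, one per target frame
```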
  • The video to be edited obtained in S101 may be a game video; for example, a live game video is obtained as the video to be edited. The game type of the game video may be a role-playing game, a sports game, a multiplayer online competitive game, and so on, which is not limited in this embodiment.
  • After the video to be edited is obtained in S101, at least one target frame in the video to be edited is determined; the determined target frame is an image frame corresponding to a highlight moment in the video to be edited.
  • Specifically, when performing S101 to determine at least one target frame in the video to be edited, an optional implementation is: obtaining first text information of each image frame from the image of that frame, for example by optical character recognition (OCR), and obtaining second text information of each image frame from its audio data, for example by automatic speech recognition (ASR); and then determining at least one target frame in the acquired video to be edited according to the first text information and the second text information of each image frame.
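As a concrete illustration of this step, the sketch below extracts the two pieces of text information for one image frame. It assumes the pytesseract OCR binding merely as one possible choice, and the ASR call is a placeholder for any speech-recognition backend; neither tool is specified by the publication.

```python
import pytesseract  # one possible OCR engine; not mandated by the source

def frame_text(image, audio, asr):
    """Return (first_text, second_text) for one image frame.

    image: the frame picture (e.g. a PIL Image or numpy array);
    audio: the audio chunk aligned with the frame;
    asr:   a placeholder callable mapping an audio chunk to a transcript.
    """
    first_text = pytesseract.image_to_string(image)  # on-screen text (OCR)
    second_text = asr(audio)                         # spoken commentary (ASR)
    return first_text, second_text
```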
  • When performing S101 to determine at least one target frame, the video to be edited may also be divided into multiple video segments of equal length, after which the target frames in each segment are determined separately.
  • That is, this embodiment determines the target frames in the video to be edited from two parts of text information, obtained from each image frame and from its audio data respectively, thereby improving the accuracy of the determined target frames.
  • When determining at least one target frame according to the first text information and the second text information of each image frame, an optional implementation is: inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain the classification result output by the model for each image frame, and taking the image frames whose classification results meet a preset requirement as target frames, for example the image frames whose classification result is 1.
  • The first classification model used in S101 of this embodiment outputs, from the input text information, a classification result indicating whether an image frame is a target frame: a result of 1 indicates that the image frame is a target frame, and a result of 0 indicates that it is not.
  • Alternatively, when determining at least one target frame according to the first text information and the second text information of each image frame, the two pieces of text information of each image frame may be concatenated, the similarity between the concatenation result and preset information calculated, and the image frames whose similarity exceeds a preset similarity threshold taken as target frames.
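A minimal sketch of this similarity-based variant follows, using Python's standard-library SequenceMatcher as the similarity measure; the threshold value and the preset phrases are illustrative assumptions, not values from the publication.

```python
from difflib import SequenceMatcher

def select_target_frames(texts, preset_info, threshold=0.5):
    """texts: list of (first_text, second_text) per image frame.

    Frames whose concatenated text is sufficiently similar to the preset
    information are kept as target frames; 0.5 is an illustrative threshold.
    """
    targets = []
    for i, (first, second) in enumerate(texts):
        joined = first + " " + second
        if SequenceMatcher(None, joined, preset_info).ratio() > threshold:
            targets.append(i)
    return targets

print(select_target_frames([("double kill", "what a play"), ("shop menu", "")],
                           preset_info="double kill what a save"))  # -> [0]
```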
  • Specifically, when performing S102 to extract the initial video of each target frame from the video to be edited, an optional implementation is: for each target frame, taking the video extracted from the video to be edited that contains the target frame and has a preset duration as the initial video of that target frame.
  • The preset duration in this embodiment is obtained by counting the durations of existing highlight videos; for example, the average duration of existing game highlight videos may be used as the preset duration.
  • The initial video extracted in S102 from the video to be edited is specifically a video segment that contains the target frame and whose duration is the preset duration.
  • The number of image frames located before and/or after the target frame in the initial video is not limited.
  • For example, if the acquired video to be edited contains image frames 1 to 6, the determined target frame is image frame 3, and the preset duration is 4 s, the initial video extracted in S102 may consist of image frames 1, 2, 3, and 4, or of image frames 2, 3, 4, and 5.
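The window extraction itself reduces to index arithmetic. In the sketch below, centering the window on the target frame is an arbitrary placement choice, since the embodiment deliberately leaves the before/after split open:

```python
def initial_video_range(num_frames, target_idx, fps, preset_duration_s):
    """Return the [start, end) frame-index range of the initial video:
    a window of the preset duration that contains the target frame."""
    window = int(fps * preset_duration_s)
    start = max(0, target_idx - window // 2)
    end = min(num_frames, start + window)
    start = max(0, end - window)  # re-anchor if the window hit the tail
    return start, end

# The 6-frame example above at 1 frame/s with a 4 s preset duration:
print(initial_video_range(num_frames=6, target_idx=2, fps=1, preset_duration_s=4))
# -> (0, 4), i.e. image frames 1-4 (the 2-5 variant is an equally valid split)
```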
  • That is, this embodiment screens the image frames contained in each initial video to determine the exact start frame and exact end frame corresponding to the target frame, thereby updating the time range of the video and further improving the accuracy of the generated clipped video.
  • When performing S103 to determine the start frame and end frame corresponding to the target frame in each initial video, an optional implementation is: for each initial video, obtaining the multimodal features of each image frame in the initial video according to the image and audio data of the image frames it contains, the obtained multimodal features including the image features and audio features of the image frame; concatenating the multimodal features of the image frames in the initial video and inputting the concatenation result into a pre-trained second classification model; and determining the start frame and end frame in the initial video according to the output result of the second classification model.
  • Among the obtained multimodal features, the image features are game-character attribute features extracted from the image of the image frame, used to indicate whether the activity of the game character in the image frame is intense; they may include health-point features of the game character (position, quantity, and change information of the health-bar indicator template, etc.), mana-point features (position, quantity, and change information of the mana-bar indicator template, etc.), and action special-effect features (whether action special effects are present).
  • The audio features are game audio features extracted from the audio data of the image frame, used to indicate whether competitive voice activity is present in the image frame.
  • That is, this embodiment can combine the image features and audio features of the image frames to determine the accurate start frame and end frame in each initial video, thereby obtaining the start frame and end frame corresponding to each target frame in the video to be edited and further improving the accuracy of the generated clipped video.
  • The second classification model used in S103 of this embodiment includes a convolutional layer and a fully connected layer. After the concatenation result of the multimodal features of the image frames is input into the second classification model, the convolutional layer first performs convolution on the concatenation result, and the convolution result is then input into the fully connected layer for classification, so as to obtain the start frame and end frame output by the fully connected layer.
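The publication specifies only the convolution-plus-fully-connected structure; the PyTorch sketch below fills in the remaining choices (feature dimension, channel count, kernel size, and a two-way start/end head) as assumptions:

```python
import torch
from torch import nn

class BoundaryClassifier(nn.Module):
    """Sketch of the second classification model: a convolutional layer over
    the concatenated per-frame multimodal features followed by a fully
    connected layer. Dimensions and the start/end head are assumptions."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, 2)  # per-frame scores: (start, end)

    def forward(self, feats):  # feats: (batch, frames, feat_dim)
        x = self.conv(feats.transpose(1, 2)).relu()  # (batch, hidden, frames)
        logits = self.fc(x.transpose(1, 2))          # (batch, frames, 2)
        return logits.argmax(dim=1)                  # (batch, 2) frame indices

model = BoundaryClassifier()
print(model(torch.randn(1, 16, 128)))  # e.g. tensor([[start_idx, end_idx]])
```

The argmax over the frame axis picks one start position and one end position per initial video, matching the per-video start/end output described in the text.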
  • The second classification model can be pre-trained as follows: obtain training data, the training data including multiple training videos and their annotation results, each annotation result including a start frame annotation and an end frame annotation; for each training video, obtain the multimodal features of each of its image frames according to the image and audio data of the image frames it contains; concatenate the multimodal features of the image frames in the training video and input the concatenation result into a neural network model to obtain the prediction result output by the model for the training video, the prediction result including a start frame prediction and an end frame prediction; calculate the loss function value from the annotation result and the prediction result of the training video, and adjust the parameters of the neural network model according to the calculated loss function value until the model converges, thereby obtaining the second classification model.
  • This embodiment can calculate the loss function value using a formula (presented as an image in the original publication and not reproduced here) in which w is the loss function value, gt is the time of the annotation result, pt is the time of the prediction result, and the remaining constant takes a value of 1.
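For the training procedure, the sketch below runs one gradient step on the same conv + fully-connected structure. Since the publication's loss formula is not reproduced in this text, a plain L1 distance between the predicted time pt (a soft-argmax over the per-frame scores, kept differentiable) and the annotated time gt stands in for it; this substitution, like the hyperparameters, is an assumption.

```python
import torch
from torch import nn

# Minimal stand-in model: conv over time + fully connected head, as in the text.
conv = nn.Conv1d(128, 64, kernel_size=3, padding=1)
fc = nn.Linear(64, 2)  # per-frame (start, end) scores
optimizer = torch.optim.Adam(list(conv.parameters()) + list(fc.parameters()),
                             lr=1e-3)

def train_step(feats, gt_times):
    """feats: (batch, frames, 128); gt_times: (batch, 2) annotated start/end.

    The patent's loss relates the predicted time pt to the annotated time gt
    with a constant taking a value of 1; as its exact formula is not given
    here, a plain L1 distance between pt and gt is used as a stand-in.
    """
    logits = fc(conv(feats.transpose(1, 2)).relu().transpose(1, 2))
    probs = logits.softmax(dim=1)                        # soft position per head
    positions = torch.arange(feats.shape[1], dtype=torch.float)
    pt = (probs * positions[None, :, None]).sum(dim=1)   # expected frame index
    loss = nn.functional.l1_loss(pt, gt_times.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(4, 16, 128), torch.randint(0, 16, (4, 2))))
```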
  • When performing S104 to generate the clipped video of each target frame according to the target frame and its corresponding start frame and end frame, an optional implementation is: for each target frame, extracting from the video to be edited a first video between the start frame and the frame preceding the target frame, and a second video between the frame following the target frame and the end frame; generating a freeze-frame video of the target frame, that is, extending the picture of the target frame for several seconds; and splicing the first video, the freeze-frame video of the target frame, and the second video in sequence to generate the clipped video of the target frame.
  • That is, the freeze-frame video of the target frame is obtained by extending the target frame for several seconds and is then used in generating the clipped video, which further highlights the target frame in the clipped video and thereby improves the display effect of the clipped video.
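Representing videos as frame lists, the splicing step then looks roughly like the sketch below; the 3-second freeze is an illustrative value for the "several seconds" in the text:

```python
def build_clip(frames, start, end, target, fps=30, freeze_seconds=3):
    """Assemble the clipped video: first video (start frame up to the frame
    before the target), a freeze-frame segment repeating the target frame,
    then the second video (frame after the target up to the end frame)."""
    first = frames[start:target]
    freeze = [frames[target]] * (fps * freeze_seconds)
    second = frames[target + 1 : end + 1]
    return first + freeze + second

clip = build_clip(list("abcdefgh"), start=1, end=6, target=3,
                  fps=1, freeze_seconds=3)
print("".join(clip))  # -> "bcdddefg": frames b-c, target d frozen 3x, e-g
```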
  • When performing S104 to generate the clipped video of a target frame, preset music may be added to the generated clipped video, and special effects such as opening and closing credits may also be added, thereby further improving the quality of the generated clipped video.
  • FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in FIG. 2, after S104 "generating the clipped video of each target frame" is performed, this embodiment may further include the following steps:
  • S201. Determine, according to the image frames contained in the clipped video of each target frame, multiple clipped videos having overlapping image frames.
  • S202. Merge the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
  • That is, this embodiment can merge multiple clipped videos that have overlapping image frames while retaining the freeze-frame video of the last target frame, thereby generating a merged clipped video; this ensures the continuity of the highlight events in the generated merged clipped video and further improves the accuracy of video editing.
  • When performing S202 to merge the determined multiple clipped videos, since each clipped video corresponds to a different target frame, this embodiment retains only the freeze-frame video of the last target frame in the merged clipped video and restores the freeze-frame videos of the other target frames to the target frames themselves.
  • For example, suppose the acquired video to be edited contains image frames 1 through 8, the determined target frames are image frames 3 and 6, and the preset duration is 4 s. If the generated clipped video of image frame 3 contains image frames 2, 3, 4, and 5, and the generated clipped video of image frame 6 contains image frames 4, 5, 6, and 7, the two clipped videos overlap at image frames 4 and 5. This embodiment then merges the two clipped videos, retaining only the freeze-frame video of image frame 6, and the generated merged clipped video contains image frames 2, 3, 4, 5, 6, and 7.
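Treating each clipped video as a (start, end, target) index range over the original video, the merging rule can be sketched as follows; the tuple representation is an assumption made for illustration:

```python
def merge_overlapping(clips):
    """Merge clipped videos whose frame ranges overlap. Each clip is
    (start_idx, end_idx, target_idx) over the original frame indices;
    overlapping clips are fused into one range, and only the last target
    frame keeps its freeze-frame treatment."""
    clips = sorted(clips)
    merged = [list(clips[0])]
    for start, end, target in clips[1:]:
        if start <= merged[-1][1]:  # frame ranges overlap
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2] = target  # keep only the last freeze frame
        else:
            merged.append([start, end, target])
    return [tuple(m) for m in merged]

# Clips of image frames 2-5 (target 3) and 4-7 (target 6) share frames 4 and 5:
print(merge_overlapping([(2, 5, 3), (4, 7, 6)]))  # -> [(2, 7, 6)]
```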
  • When generating the merged clipped video, preset music may be added to it, and special effects such as lead-in opening titles and closing credits may also be added, thereby further improving the quality of the generated merged clipped video.
  • FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in FIG. 3, the video editing apparatus 300 of this embodiment includes:
  • an acquisition unit 301, configured to acquire a video to be edited and determine at least one target frame in the video to be edited;
  • an extraction unit 302, configured to extract the initial video of each target frame from the video to be edited;
  • a processing unit 303, configured to determine, according to the image and audio data of the image frames contained in each initial video, the start frame and end frame corresponding to the target frame in each initial video; and
  • a generation unit 304, configured to generate the clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  • The video to be edited acquired by the acquisition unit 301 may be a game video; for example, a live game video is acquired as the video to be edited. The game type of the game video may be a role-playing game, a sports game, a multiplayer online competitive game, and so on, which is not limited in this embodiment.
  • After acquiring the video to be edited, the acquisition unit 301 determines at least one target frame in it; the determined target frame is an image frame corresponding to a highlight moment in the video to be edited.
  • When the acquisition unit 301 determines at least one target frame in the video to be edited, an optional implementation is: obtaining first text information of each image frame from the image of that frame and second text information of each image frame from its audio data, and determining at least one target frame in the acquired video to be edited according to the first text information and the second text information of each image frame.
  • The acquisition unit 301 may also divide the video to be edited into multiple video segments of equal length and then determine the target frames in each segment separately.
  • That is, the acquisition unit 301 can determine the target frames in the video to be edited from the two parts of text information obtained from each image frame and its audio data, thereby improving the accuracy of the determined target frames.
  • When the acquisition unit 301 determines at least one target frame according to the first text information and the second text information of each image frame, an optional implementation is: inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain the classification result output by the model for each image frame, and taking the image frames whose classification results meet a preset requirement as target frames.
  • The first classification model used by the acquisition unit 301 outputs, from the input text information, a classification result indicating whether an image frame is a target frame: a result of 1 indicates that the image frame is a target frame, and a result of 0 indicates that it is not.
  • Alternatively, the acquisition unit 301 may concatenate the first text information and the second text information of each image frame, calculate the similarity between the concatenation result and preset information, and take the image frames whose similarity exceeds a preset similarity threshold as target frames.
  • After the acquisition unit 301 determines at least one target frame, the extraction unit 302 extracts the initial video of each target frame from the video to be edited.
  • When the extraction unit 302 extracts the initial video of each target frame, an optional implementation is: for each target frame, taking the video extracted from the video to be edited that contains the target frame and has a preset duration as the initial video of that target frame.
  • The preset duration in this embodiment is obtained by counting the durations of existing highlight videos; for example, the average duration of existing game highlight videos may be used as the preset duration.
  • The initial video extracted by the extraction unit 302 from the video to be edited is specifically a video segment that contains the target frame and whose duration is the preset duration; the number of image frames located before and/or after the target frame in the initial video is not limited.
  • After the extraction unit 302 extracts the initial video of each target frame, the processing unit 303 determines, according to the image and audio data of the image frames contained in each initial video, the start frame and end frame corresponding to the target frame in each initial video.
  • That is, the processing unit 303 screens the image frames contained in each initial video to determine the exact start frame and exact end frame corresponding to the target frame, thereby updating the time range of the video and further improving the accuracy of the generated clipped video.
  • When the processing unit 303 determines the start frame and end frame corresponding to the target frame in each initial video, an optional implementation is: for each initial video, obtaining the multimodal features of each image frame in the initial video according to the image and audio data of the image frames it contains; concatenating the multimodal features of the image frames in the initial video and inputting the concatenation result into a pre-trained second classification model; and determining the start frame and end frame in the initial video according to the output result of the second classification model.
  • Among the obtained multimodal features, the image features are game-character attribute features extracted from the image of the image frame, used to indicate whether the activity of the game character in the image frame is intense; they may include health-point features of the game character (position, quantity, and change information of the health-bar indicator template, etc.), mana-point features (position, quantity, and change information of the mana-bar indicator template, etc.), and action special-effect features (whether action special effects are present). The audio features are game audio features extracted from the audio data of the image frame, used to indicate whether competitive voice activity is present in the image frame.
  • That is, the processing unit 303 can combine the image features and audio features of the image frames to determine the accurate start frame and end frame in each initial video, thereby obtaining the start frame and end frame corresponding to each target frame in the video to be edited and further improving the accuracy of the generated clipped video.
  • The second classification model used by the processing unit 303 includes a convolutional layer and a fully connected layer. After the concatenation result of the multimodal features of the image frames is input into the second classification model, the convolutional layer first performs convolution on the concatenation result, and the convolution result is then input into the fully connected layer for classification, so as to obtain the start frame and end frame output by the fully connected layer.
  • The second classification model can be pre-trained as follows: obtain training data, the training data including multiple training videos and their annotation results, each annotation result including a start frame annotation and an end frame annotation; for each training video, obtain the multimodal features of each of its image frames according to the image and audio data of the image frames it contains; concatenate the multimodal features of the image frames in the training video and input the concatenation result into a neural network model to obtain the prediction result output by the model for the training video, the prediction result including a start frame prediction and an end frame prediction; calculate the loss function value from the annotation result and the prediction result of the training video, and adjust the parameters of the neural network model according to the calculated loss function value until the model converges, thereby obtaining the second classification model.
  • After the processing unit 303 determines the start frame and end frame corresponding to each target frame in the video to be edited, the generation unit 304 generates the clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  • When the generation unit 304 generates the clipped video of each target frame, an optional implementation is: for each target frame, extracting from the video to be edited a first video between the start frame and the frame preceding the target frame, and a second video between the frame following the target frame and the end frame; generating a freeze-frame video of the target frame; and splicing the first video, the freeze-frame video of the target frame, and the second video in sequence to generate the clipped video of the target frame.
  • That is, the generation unit 304 obtains the freeze-frame video of the target frame by extending the target frame for several seconds and then uses it to generate the clipped video, which further highlights the target frame in the clipped video and thereby improves the display effect of the clipped video.
  • When the generation unit 304 generates the clipped video of a target frame, preset music may be added to the generated clipped video, and special effects such as opening and closing credits may also be added, thereby further improving the quality of the generated clipped video.
  • The video editing apparatus 300 of this embodiment may further include a merging unit 305, configured to perform the following after the generation unit 304 generates the clipped video of each target frame: determining, according to the image frames contained in the clipped video of each target frame, multiple clipped videos having overlapping image frames; and merging the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
  • That is, the merging unit 305 can merge multiple clipped videos that have overlapping image frames while retaining the freeze-frame video of the last target frame, thereby generating a merged clipped video; this ensures the continuity of the highlight events in the generated merged clipped video and further improves the accuracy of video editing.
  • When merging the clipped videos, this embodiment retains only the freeze-frame video of the last target frame in the merged clipped video and restores the freeze-frame videos of the other target frames to the target frames themselves.
  • When the merging unit 305 generates the merged clipped video, preset music may be added to it, and special effects such as lead-in opening titles and closing credits may also be added, thereby further improving the quality of the generated merged clipped video.
  • In the technical solutions of the present disclosure, the acquisition, storage, and application of the user's personal information involved comply with relevant laws and regulations and do not violate public order and good customs.
  • According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 4 is a block diagram of an electronic device for implementing the video editing method of the embodiments of the present disclosure.
  • The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit the implementations of the disclosure described and/or claimed herein.
  • The device 400 includes a computing unit 401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 402 or loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 can also store various programs and data necessary for the operation of the device 400.
  • The computing unit 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to the bus 404.
  • Multiple components in the device 400 are connected to the I/O interface 405, including: an input unit 406, such as a keyboard or a mouse; an output unit 407, such as various types of displays and speakers; a storage unit 408, such as a magnetic disk or an optical disk; and a communication unit 409, such as a network card, a modem, or a wireless communication transceiver.
  • The communication unit 409 allows the device 400 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 401 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, and so on.
  • The computing unit 401 executes the various methods and processes described above, such as the video editing method.
  • For example, in some embodiments, the video editing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408.
  • In some embodiments, part or all of the computer program may be loaded and/or installed on the device 400 via the ROM 402 and/or the communication unit 409.
  • When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the video editing method described above can be performed.
  • Alternatively, in other embodiments, the computing unit 401 may be configured to execute the video editing method in any other suitable manner (for example, by means of firmware).
  • Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOC), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • The programmable processor may be a special-purpose or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and of transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing devices, so that when the program codes are executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • A machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LAN), wide area networks (WAN), and the Internet.
  • A computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
  • The server can also be a server of a distributed system, or a server combined with a blockchain.
  • It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above.
  • Each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present disclosure relates to the technical field of artificial intelligence such as image processing and deep learning. Provided are a video editing method and apparatus, and an electronic device and a readable storage medium. The video editing method comprises: acquiring a video to be edited, and determining at least one target frame in the video to be edited; extracting an initial video of each target frame from the video to be edited; according to image and audio data of image frames included in each initial video, determining, from each initial video, a start frame and an end frame which correspond to the target frame; and generating an edited video of each target frame according to each target frame and the start frame and end frame corresponding thereto. By means of the present disclosure, automatic video editing is realized, such that the accuracy and efficiency of video editing can be improved.

Description

Video editing method, apparatus, electronic device and readable storage medium
This application claims priority to the Chinese patent application filed on July 13, 2021 with application number 202110790261.8, entitled "Video Editing Method, Apparatus, Electronic Device, and Readable Storage Medium".
Technical Field
The present disclosure relates to the field of computer technology, and in particular to artificial intelligence technologies such as image processing and deep learning. Provided are a video editing method, apparatus, electronic device, and readable storage medium.
Background
As an information medium, video, especially short video, has attracted more and more attention. Long videos, such as TV series, movies, and live entertainment streams, have existed for a long time; these video resources are relatively long, and watching them in full is time-consuming. Short videos, by contrast, are brief and are sought after because they fit fragmented time and concentrate information densely.
In the prior art, video resources are generally edited according to manually input editing operations; because the editing timing and clip length are difficult to control, the accuracy and efficiency of video editing are low.
发明内容Contents of the invention
根据本公开的第一方面,提供了一种视频剪辑方法,包括:获取待剪辑视频,确定所述待剪辑视频中的至少一个目标帧;从所述待剪辑视频中提取每个目标帧的初始视频;根据每个初始视频所包含图像帧的图像与音频数据,确定每个初始视频中与目标帧对应的开始帧与结束帧;根据每个目标帧及其对应的开始帧与结束帧,生成每个目标帧的剪辑视频。According to the first aspect of the present disclosure, there is provided a video editing method, comprising: acquiring a video to be edited, determining at least one target frame in the video to be edited; extracting the initial frame of each target frame from the video to be edited Video; according to the image and audio data of the image frame contained in each initial video, determine the start frame and end frame corresponding to the target frame in each initial video; according to each target frame and its corresponding start frame and end frame, generate Clipped video for each target frame.
根据本公开的第二方面,提供了一种视频剪辑装置,包括:获取单元,用于获取待剪辑视频,确定所述待剪辑视频中的至少一个目标帧;提取单元,用于从所述待剪辑视频中提取每个目标帧的初始视频;处理单元,用于根据每个初始视频所包含图像帧的图像与音频数据,确定每 个初始视频中与目标帧对应的开始帧与结束帧;生成单元,用于根据每个目标帧及其对应的开始帧与结束帧,生成每个目标帧的剪辑视频。According to a second aspect of the present disclosure, there is provided a video clipping device, including: an acquisition unit, configured to acquire a video to be trimmed, and determine at least one target frame in the video to be trimmed; an extraction unit, configured to extract from the video to be trimmed The initial video of each target frame is extracted from the clip video; the processing unit is used to determine the start frame and the end frame corresponding to the target frame in each initial video according to the image and audio data of the image frame contained in each initial video; generate The unit is configured to generate a video clip of each target frame according to each target frame and its corresponding start frame and end frame.
根据本公开的第三方面,提供了一种电子设备,包括:至少一个处理器;以及与所述至少一个处理器通信连接的存储器;其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行如上所述的方法。According to a third aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory connected to the at least one processor in communication; wherein, the memory stores information that can be used by the at least one processor Instructions executed by the at least one processor to enable the at least one processor to perform the method as described above.
根据本公开的第四方面,提供了一种存储有计算机指令的非瞬时计算机可读存储介质,其中,所述计算机指令用于使所述计算机执行如上所述的方法。According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method as described above.
根据本公开的第五方面,提供了一种计算机程序产品,包括计算机程序,所述计算机程序在被处理器执行时实现如上所述的方法。According to a fifth aspect of the present disclosure there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
由以上技术方案可以看出,本实施例在确定待剪辑视频中的至少一个目标帧之后,首先从待剪辑视频中提取对应每个目标帧的初始视频,然后再根据初始视频确定待剪辑视频中对应每个目标帧的开始帧与结束帧,最后根据每个目标帧及其对应的开始帧与结束帧,生成每个目标帧的剪辑视频,从而实现了视频的自动剪辑,能够提升视频剪辑的准确性与效率。It can be seen from the above technical solutions that after determining at least one target frame in the video to be edited, this embodiment first extracts the initial video corresponding to each target frame from the video to be edited, and then determines the target frame in the video to be edited according to the initial video. Corresponding to the start frame and end frame of each target frame, and finally according to each target frame and its corresponding start frame and end frame, the edited video of each target frame is generated, thereby realizing automatic video editing and improving the quality of video editing. Accuracy and efficiency.
应当理解,本部分所描述的内容并非旨在标识本公开的实施例的关键或重要特征,也不用于限制本公开的范围。本公开的其它特征将通过以下的说明书而变得容易理解。It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood through the following description.
Brief Description of the Drawings
The accompanying drawings are used for a better understanding of the solution and do not constitute a limitation of the present disclosure. In the drawings:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a block diagram of an electronic device for implementing the video editing method of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to facilitate understanding; they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in FIG. 1, the video editing method of this embodiment may specifically include the following steps:
S101. Acquire a video to be edited, and determine at least one target frame in the video to be edited.
S102. Extract the initial video of each target frame from the video to be edited.
S103. Determine, according to the image and audio data of the image frames contained in each initial video, the start frame and end frame corresponding to the target frame in each initial video.
S104. Generate a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
In the video editing method of this embodiment, after at least one target frame in the video to be edited is determined, the initial video corresponding to each target frame is first extracted from the video to be edited; the start frame and end frame corresponding to each target frame are then determined according to the extracted initial videos; and finally the clipped video of each target frame is generated according to each target frame and its corresponding start frame and end frame. This realizes automatic video editing and can improve the accuracy and efficiency of video editing.
The video to be edited acquired in S101 of this embodiment may be a game video; for example, a live game video is acquired as the video to be edited. The game type of the game video may be a role-playing game, a sports game, a multiplayer online competitive game, and so on, which is not limited in this embodiment.
After the video to be edited is acquired in S101, at least one target frame in the video to be edited is determined; the determined target frame is an image frame corresponding to a highlight moment in the video to be edited.
Specifically, when S101 is performed to determine at least one target frame in the video to be edited, an optional implementation is: obtaining first text information of each image frame from the image of that frame, for example by optical character recognition (OCR), and obtaining second text information of each image frame from its audio data, for example by automatic speech recognition (ASR); and determining at least one target frame in the acquired video to be edited according to the first text information and the second text information of each image frame.
When S101 is performed to determine at least one target frame in the video to be edited, the video to be edited may also be divided into multiple video segments of equal length, after which the target frames in each segment are determined separately.
That is, this embodiment can determine the target frames in the video to be edited from the two parts of text information obtained from each image frame and its audio data, thereby improving the accuracy of the determined target frames.
When S101 is performed to determine at least one target frame according to the first text information and the second text information of each image frame, an optional implementation is: inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain the classification result output by the model for each image frame, and taking the image frames whose classification results meet a preset requirement as target frames, for example the image frames whose classification result is 1.
The first classification model used in S101 of this embodiment can output, from the input text information, a classification result indicating whether an image frame is a target frame: a result of 1 indicates that the image frame is a target frame, and a result of 0 indicates that it is not.
Alternatively, when S101 is performed to determine at least one target frame according to the first text information and the second text information of each image frame, the two pieces of text information of each image frame may be concatenated, the similarity between the concatenation result and preset information calculated, and the image frames whose similarity exceeds a preset similarity threshold taken as target frames.
After at least one target frame in the video to be edited is determined in S101, S102 is performed to extract the initial video of each target frame from the video to be edited.
Specifically, when S102 is performed to extract the initial video of each target frame from the video to be edited, an optional implementation is: for each target frame, taking the video extracted from the video to be edited that contains the target frame and has a preset duration as the initial video of that target frame.
本实施例中的预设时长,具体为通过对已有高光视频的时长进行统计得到的,例如可以将已有游戏高光视频的时长的平均值作为预设时长。The preset duration in this embodiment is specifically obtained by counting the duration of existing highlight videos, for example, the average duration of existing game highlight videos may be used as the preset duration.
本实施例执行S102从待剪辑视频中所提取的初始视频,具体为包含目标帧且视频时长为预设时长的视频片段,本实施例对初始视频中位于目标帧之前和/或之后的图像帧的数量不进行限定。This embodiment executes S102 to extract the initial video from the video to be edited, specifically a video segment that contains the target frame and has a video duration of a preset duration. In this embodiment, the image frames located before and/or after the target frame in the initial video The number of is not limited.
For example, if the acquired video to be edited contains image frame 1, image frame 2, image frame 3, image frame 4, image frame 5 and image frame 6, the determined target frame is image frame 3, and the preset duration is 4s, then the initial video extracted by performing S102 in this embodiment may consist of image frame 1, image frame 2, image frame 3 and image frame 4, or of image frame 2, image frame 3, image frame 4 and image frame 5.
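A sketch of this extraction step, under the assumption that the preset duration has already been converted into a window of `window` frames; centering the window on the target frame is a design choice made here, since the publication leaves the placement of the target frame inside the window open:

```python
def extract_initial_video(num_frames: int, target: int, window: int):
    """Return one valid (start, end) frame range of length `window` that
    contains `target` (all indices 0-based). Any placement is acceptable,
    since the publication does not fix how many frames fall before or after
    the target; here the window is centered and clamped to the video bounds."""
    start = max(0, min(target - window // 2, num_frames - window))
    return start, start + window - 1

# With 6 frames (indices 0..5), target index 2 and a 4-frame window, this
# yields (0, 3), i.e. frames 1..4 in the publication's 1-based numbering.
print(extract_initial_video(6, 2, 4))
```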
In this embodiment, after performing S102 to extract the initial video of each target frame from the video to be edited, S103 is performed to determine, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video.
That is to say, this embodiment screens the image frames contained in each initial video and determines the exact start frame and exact end frame corresponding to the target frame in each initial video, thereby updating the time range of the video and further improving the accuracy of the generated clipped video.
Specifically, when performing S103 to determine, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video, an optional implementation is: for each initial video, obtain the multimodal features of each image frame in the initial video according to the images and audio data of the image frames contained in the initial video, the obtained multimodal features including the image features and audio features of the image frames; concatenate the multimodal features of the image frames in the initial video and input the concatenation result into a pre-trained second classification model; and determine the start frame and the end frame in the initial video according to the output result of the second classification model.
Among the multimodal features obtained in S103 of this embodiment, the image features are game-character attribute features extracted from the image of an image frame, used to indicate whether the activity of the game characters in the image frame is intense; they may include health-point features (the position, quantity and change information of the health-bar indicator template, etc.), mana-point features (the position, quantity and change information of the mana-bar indicator template, etc.), and action special-effect features (whether action special effects are present). The audio features are game audio features extracted from the audio data of the image frame, used to indicate whether competitive voice activity is present in the image frame.
That is to say, this embodiment can combine the image features and audio features of the image frames to determine the accurate start frame and end frame in each initial video, thereby obtaining the start frame and end frame corresponding to each target frame in the video to be edited, further improving the accuracy of the generated clipped video.
The second classification model used in S103 of this embodiment contains a convolutional layer and a fully connected layer. After the concatenation result of the multimodal features of the image frames is input into the second classification model, the convolutional layer first performs convolution on the concatenation result, and the convolution result is then input into the fully connected layer for classification, so that the start frame and the end frame output by the fully connected layer are obtained.
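The publication only states that the model contains a convolutional layer and a fully connected layer; the following PyTorch sketch fills in the remaining choices (feature size, channel count, a fixed frame count, and per-frame start/end logits) as assumptions:

```python
import torch
from torch import nn

class BoundaryClassifier(nn.Module):
    """A minimal sketch of the second classification model: one convolutional
    layer over the concatenated per-frame multimodal features, followed by a
    fully connected layer. The dimensions and the output parameterization
    are assumptions, not disclosed in the publication."""

    def __init__(self, feat_dim: int = 128, hidden: int = 64, num_frames: int = 100):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden * num_frames, 2 * num_frames)  # start/end logits

    def forward(self, x: torch.Tensor) -> "tuple[torch.Tensor, torch.Tensor]":
        # x: (batch, num_frames, feat_dim), the concatenated multimodal features
        h = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, hidden, num_frames)
        logits = self.fc(h.flatten(1))                # (batch, 2 * num_frames)
        start_logits, end_logits = logits.chunk(2, dim=1)
        return start_logits, end_logits
```

The start frame and end frame can then be read off as the argmax over the per-frame start logits and end logits, respectively.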
In this embodiment, the second classification model may be pre-trained in the following way: acquire training data, the acquired training data containing multiple training videos and the annotation results of the multiple training videos, the annotation results containing a start-frame annotation result and an end-frame annotation result; for each training video, obtain the multimodal features of each image frame in the training video according to the images and audio data of the image frames contained in the training video; concatenate the multimodal features of the image frames in the training video and input the concatenation result into a neural network model to obtain the prediction result output by the neural network model for each training video, the prediction result containing a start-frame prediction result and an end-frame prediction result; and compute a loss function value using the annotation result and the prediction result of the training video, adjusting the parameters of the neural network model according to the computed loss function value until the neural network model converges, so as to obtain the second classification model.
In this embodiment, the loss function value may be computed with the following formula:
[The formula appears only as an image (PCTCN2022080976-appb-000001) in the original publication; it expresses the loss function value w in terms of gt, pt and δ.]
In the formula, w is the loss function value; gt is the time of the annotation result and pt is the time of the prediction result (if the annotation result is the start-frame annotation result, the prediction result is the start-frame prediction result; if the annotation result is the end-frame annotation result, the prediction result is the end-frame prediction result); δ takes the value 1.
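Because the formula is only available as an image, the following sketch assumes the Huber (smooth-L1) form that is common for temporal regression with a δ parameter; it is consistent with the stated variables (w, gt, pt, δ = 1) but is not the publication's confirmed definition:

```python
def boundary_loss(gt: float, pt: float, delta: float = 1.0) -> float:
    """Assumed Huber-style loss between the annotated time gt and the
    predicted time pt. The publication gives the formula only as an image,
    so this concrete form is a guess consistent with the stated variables
    (w, gt, pt, delta = 1), not the confirmed definition."""
    diff = abs(gt - pt)
    if diff <= delta:
        return 0.5 * diff * diff
    return delta * (diff - 0.5 * delta)
```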
In this embodiment, after performing S103 to determine the start frame and end frame corresponding to each target frame in the video to be edited, S104 is performed to generate the clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
Specifically, when performing S104 to generate the clipped video of each target frame according to each target frame and its corresponding start frame and end frame, an optional implementation is: for each target frame, extract from the video to be edited a first video spanning from the start frame to the frame immediately before the target frame, and a second video spanning from the frame immediately after the target frame to the end frame; generate a freeze-frame video of the target frame, i.e. extend the picture of the target frame for several seconds; and splice the first video, the freeze-frame video of the target frame and the second video in sequence to generate the clipped video of the target frame.
That is to say, this embodiment obtains the freeze-frame video of the target frame by extending the target frame for several seconds, and then uses the freeze-frame video of the target frame to generate the clipped video, which can further highlight the target frame in the clipped video, thereby improving the display effect of the clipped video.
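A sketch of this splicing step using the moviepy 1.x API; the conversion from frame indices to timestamps, the 3-second freeze duration, and the output file name are assumptions made for illustration:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def make_highlight_clip(path: str, start_t: float, target_t: float,
                        end_t: float, freeze_s: float = 3.0) -> None:
    """Splice [start .. just before target] + a freeze-frame of the target
    + [just after target .. end]. Times are in seconds; the publication
    works in frames, so converting frame indices to timestamps is assumed
    to have been done by the caller."""
    video = VideoFileClip(path)
    first = video.subclip(start_t, target_t)                  # first video
    freeze = video.to_ImageClip(target_t).set_duration(freeze_s)
    second = video.subclip(target_t, end_t)                   # second video
    clip = concatenate_videoclips([first, freeze, second])
    clip.write_videofile("clip.mp4", fps=video.fps)
```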
When performing S104 to generate the clipped video of a target frame, this embodiment may add preset music to the generated clipped video, and may also add special effects such as an audience-attracting opening and closing sequence, thereby further improving the quality of the generated clipped video.
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in FIG. 2, after performing S104 of generating the clipped video of each target frame, this embodiment may further include the following:
S201: according to the image frames contained in the clipped video of each target frame, determining multiple clipped videos whose image frames overlap;
S202: merging the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
That is to say, this embodiment can also merge multiple clipped videos whose image frames overlap and retain the freeze-frame video of the last target frame, thereby generating a merged clipped video and ensuring that the highlight events in the generated merged clipped video are continuous, which further improves the accuracy of video editing.
When performing S202 to merge the determined multiple clipped videos, since each clipped video corresponds to a different target frame, this embodiment retains only the freeze-frame video of the last target frame in the merged clipped video and restores the freeze-frame videos of the other target frames to the target frames themselves.
For example, if the acquired video to be edited contains image frame 1 through image frame 8, the determined target frames are image frame 3 and image frame 6, the preset duration is 4s, the generated clipped video of image frame 3 contains image frames 2, 3, 4 and 5, and the generated clipped video of image frame 6 contains image frames 4, 5, 6 and 7, then image frame 4 and image frame 5 overlap between the two clipped videos; this embodiment therefore merges the two clipped videos, retaining only the freeze-frame video of image frame 6, and the generated merged clipped video contains image frames 2, 3, 4, 5, 6 and 7.
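A sketch of this merging rule; the representation of each clip as a 1-based (start, end, target) frame range is chosen here for illustration:

```python
def merge_overlapping(clips: list) -> list:
    """Merge clipped videos whose frame ranges overlap. Each clip is a dict
    {'start': int, 'end': int, 'target': int} of 1-based frame numbers.
    Only the last target frame of a merged group keeps its freeze-frame;
    the others revert to plain frames, as in S202."""
    merged = []
    for clip in sorted(clips, key=lambda c: c["start"]):
        if merged and clip["start"] <= merged[-1]["end"]:
            last = merged[-1]
            last["end"] = max(last["end"], clip["end"])
            last["target"] = clip["target"]  # keep only the last target's freeze
        else:
            merged.append(dict(clip))
    return merged

# The example from the text: the clips around frames 3 and 6 overlap on 4 and 5.
print(merge_overlapping([
    {"start": 2, "end": 5, "target": 3},
    {"start": 4, "end": 7, "target": 6},
]))  # -> [{'start': 2, 'end': 7, 'target': 6}]
```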
Similarly, when performing S202 to generate the merged clipped video, this embodiment may add preset music to the generated merged clipped video, and may also add special effects such as an audience-attracting opening and closing sequence, thereby further improving the quality of the generated clipped video.
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in FIG. 3, the video editing apparatus 300 of this embodiment includes:
an acquiring unit 301, configured to acquire a video to be edited and determine at least one target frame in the video to be edited;
an extracting unit 302, configured to extract an initial video of each target frame from the video to be edited;
a processing unit 303, configured to determine, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video; and
a generating unit 304, configured to generate a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
The video to be edited acquired by the acquiring unit 301 may be a game video; for example, a live game video is acquired as the video to be edited. The game type of the game video may be a role-playing game, a sports game, a multiplayer online competitive game, etc., which is not limited in this embodiment.
After acquiring the video to be edited, the acquiring unit 301 determines at least one target frame in the video to be edited, the determined target frame being an image frame corresponding to a highlight moment in the video to be edited.
Specifically, when determining at least one target frame in the video to be edited, an optional implementation that the acquiring unit 301 can adopt is: obtain the first text information of each image frame according to the image of the image frame in the acquired video to be edited, and obtain the second text information of the image frame according to the audio data of the image frame; and determine at least one target frame in the acquired video to be edited according to the first text information and the second text information of each image frame.
When determining at least one target frame in the video to be edited, the acquiring unit 301 may also first divide the video to be edited into multiple video clips of equal length, and then determine the target frames in each video clip separately.
That is to say, the acquiring unit 301 can determine the target frames in the video to be edited according to the two pieces of text information obtained from each image frame and its audio data, thereby improving the accuracy of the determined target frames.
When determining at least one target frame in the video to be edited according to the first text information and the second text information of each image frame, an optional implementation that the acquiring unit 301 can adopt is: input the first text information and the second text information of each image frame into a pre-trained first classification model to obtain the classification result output by the first classification model for each image frame; and take the image frames whose classification results meet a preset requirement as target frames.
The first classification model used by the acquiring unit 301 can output, according to the input text information, a classification result indicating whether an image frame is a target frame: a classification result of 1 indicates that the image frame is a target frame, and a classification result of 0 indicates that it is not.
When determining at least one target frame in the video to be edited according to the first text information and the second text information of each image frame, the acquiring unit 301 may also concatenate the first text information and the second text information of each image frame, compute the similarity between the concatenation result and preset information, and then take the image frames whose similarity exceeds a preset similarity threshold as target frames.
In this embodiment, after the acquiring unit 301 determines at least one target frame in the video to be edited, the extracting unit 302 extracts the initial video of each target frame from the video to be edited.
Specifically, when extracting the initial video of each target frame from the video to be edited, an optional implementation that the extracting unit 302 can adopt is: for each target frame, take a video that is extracted from the video to be edited, contains the target frame, and has a preset duration as the initial video of that target frame.
The preset duration in this embodiment is obtained by collecting statistics on the durations of existing highlight videos; for example, the average duration of existing game highlight videos may be used as the preset duration.
The initial video extracted by the extracting unit 302 from the video to be edited is a video clip that contains the target frame and whose duration is the preset duration; this embodiment does not limit the number of image frames located before and/or after the target frame in the initial video.
In this embodiment, after the extracting unit 302 extracts the initial video of each target frame from the video to be edited, the processing unit 303 determines, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video.
That is to say, the processing unit 303 screens the image frames contained in each initial video and determines the exact start frame and exact end frame corresponding to the target frame in each initial video, thereby updating the time range of the video and further improving the accuracy of the generated clipped video.
Specifically, when determining, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video, an optional implementation that the processing unit 303 can adopt is: for each initial video, obtain the multimodal features of each image frame in the initial video according to the images and audio data of the image frames contained in the initial video; concatenate the multimodal features of the image frames in the initial video and input the concatenation result into a pre-trained second classification model; and determine the start frame and the end frame in the initial video according to the output result of the second classification model.
Among the multimodal features obtained by the processing unit 303, the image features are game-character attribute features extracted from the image of an image frame, used to indicate whether the activity of the game characters in the image frame is intense; they may include health-point features (the position, quantity and change information of the health-bar indicator template, etc.), mana-point features (the position, quantity and change information of the mana-bar indicator template, etc.), and action special-effect features (whether action special effects are present). The audio features are game audio features extracted from the audio data of the image frame, used to indicate whether competitive voice activity is present in the image frame.
That is to say, the processing unit 303 can combine the image features and audio features of the image frames to determine the accurate start frame and end frame in each initial video, thereby obtaining the start frame and end frame corresponding to each target frame in the video to be edited, further improving the accuracy of the generated clipped video.
The second classification model used by the processing unit 303 contains a convolutional layer and a fully connected layer. After the concatenation result of the multimodal features of the image frames is input into the second classification model, the convolutional layer first performs convolution on the concatenation result, and the convolution result is then input into the fully connected layer for classification, so that the start frame and the end frame output by the fully connected layer are obtained.
In this embodiment, the second classification model may be pre-trained in the following way: acquire training data, the acquired training data containing multiple training videos and the annotation results of the multiple training videos, the annotation results containing a start-frame annotation result and an end-frame annotation result; for each training video, obtain the multimodal features of each image frame in the training video according to the images and audio data of the image frames contained in the training video; concatenate the multimodal features of the image frames in the training video and input the concatenation result into a neural network model to obtain the prediction result output by the neural network model for each training video, the prediction result containing a start-frame prediction result and an end-frame prediction result; and compute a loss function value using the annotation result and the prediction result of the training video, adjusting the parameters of the neural network model according to the computed loss function value until the neural network model converges, so as to obtain the second classification model.
In this embodiment, after the processing unit 303 determines the start frame and end frame corresponding to each target frame in the video to be edited, the generating unit 304 generates the clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
Specifically, when generating the clipped video of each target frame according to each target frame and its corresponding start frame and end frame, an optional implementation that the generating unit 304 can adopt is: for each target frame, extract from the video to be edited a first video spanning from the start frame to the frame immediately before the target frame, and a second video spanning from the frame immediately after the target frame to the end frame; generate a freeze-frame video of the target frame; and splice the first video, the freeze-frame video of the target frame and the second video in sequence to generate the clipped video of the target frame.
That is to say, the generating unit 304 obtains the freeze-frame video of the target frame by extending the target frame for several seconds, and then uses the freeze-frame video of the target frame to generate the clipped video, which can further highlight the target frame in the clipped video, thereby improving the display effect of the clipped video.
When generating the clipped video of a target frame, the generating unit 304 may add preset music to the generated clipped video, and may also add special effects such as an audience-attracting opening and closing sequence, thereby further improving the quality of the generated clipped video.
The video editing apparatus 300 of this embodiment may further include a merging unit 305, configured to perform the following after the generating unit 304 generates the clipped video of each target frame: according to the image frames contained in the clipped video of each target frame, determine multiple clipped videos whose image frames overlap; and merge the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
That is to say, the merging unit 305 can merge multiple clipped videos whose image frames overlap and retain the freeze-frame video of the last target frame, thereby generating a merged clipped video and ensuring that the highlight events in the generated merged clipped video are continuous, which further improves the accuracy of video editing.
When merging the determined multiple clipped videos, since each clipped video corresponds to a different target frame, the merging unit 305 retains only the freeze-frame video of the last target frame in the merged clipped video and restores the freeze-frame videos of the other target frames to the target frames themselves.
When generating the merged clipped video, the merging unit 305 may add preset music to the generated merged clipped video, and may also add special effects such as an audience-attracting opening and closing sequence, thereby further improving the quality of the generated clipped video.
In the technical solution of the present disclosure, the acquisition, storage and application of the user's personal information involved are all in compliance with the provisions of relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium and a computer program product.
FIG. 4 is a block diagram of an electronic device for the video editing method according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
As shown in FIG. 4, the device 400 includes a computing unit 401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 402 or a computer program loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 can also store the various programs and data required for the operation of the device 400. The computing unit 401, the ROM 402 and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Multiple components in the device 400 are connected to the I/O interface 405, including: an input unit 406, such as a keyboard or a mouse; an output unit 407, such as various types of displays and speakers; a storage unit 408, such as a magnetic disk or an optical disc; and a communication unit 409, such as a network card, a modem or a wireless communication transceiver. The communication unit 409 allows the device 400 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The computing unit 401 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The computing unit 401 performs the methods and processing described above, such as the video editing method. For example, in some embodiments, the video editing method may be implemented as a computer software program that is tangibly contained in a machine-readable medium, such as the storage unit 408.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the video editing method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the video editing method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein can be realized in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has: a display device (for example, a CRT (cathode-ray tube) or LCD (liquid-crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input or tactile input).
The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system and solves the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (17)

  1. A video editing method, comprising:
    acquiring a video to be edited, and determining at least one target frame in the video to be edited;
    extracting an initial video of each target frame from the video to be edited;
    determining, according to images and audio data of image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video; and
    generating a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  2. The method according to claim 1, wherein the determining at least one target frame in the video to be edited comprises:
    obtaining first text information of each image frame according to the image of the image frame in the video to be edited, and obtaining second text information of the image frame according to the audio data of the image frame; and
    determining the at least one target frame in the video to be edited according to the first text information and the second text information of each image frame.
  3. The method according to claim 2, wherein the determining the at least one target frame in the video to be edited according to the first text information and the second text information of each image frame comprises:
    inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain a classification result output by the first classification model for each image frame; and
    taking image frames whose classification results meet a preset requirement as target frames.
  4. The method according to claim 1, wherein the extracting an initial video of each target frame from the video to be edited comprises:
    for each target frame, taking a video that is extracted from the video to be edited, contains the target frame, and has a preset duration as the initial video of the target frame.
  5. The method according to claim 1, wherein the determining, according to images and audio data of image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video comprises:
    for each initial video, obtaining multimodal features of each image frame in the initial video according to the images and audio data of the image frames contained in the initial video;
    concatenating the multimodal features of the image frames and inputting the concatenation result into a pre-trained second classification model; and
    determining, according to an output result of the second classification model, the start frame and the end frame corresponding to the target frame in the initial video.
  6. The method according to claim 1, wherein the generating a clipped video of each target frame according to each target frame and its corresponding start frame and end frame comprises:
    for each target frame, extracting from the video to be edited a first video between the start frame and a frame immediately before the target frame, and a second video between a frame immediately after the target frame and the end frame;
    generating a freeze-frame video of the target frame; and
    splicing the first video, the freeze-frame video of the target frame and the second video in sequence to generate the clipped video of the target frame.
  7. The method according to claim 1, further comprising:
    after generating the clipped video of each target frame, determining, according to the image frames contained in the clipped video of each target frame, multiple clipped videos whose image frames overlap; and
    merging the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
  8. A video editing apparatus, comprising:
    an acquiring unit, configured to acquire a video to be edited and determine at least one target frame in the video to be edited;
    an extracting unit, configured to extract an initial video of each target frame from the video to be edited;
    a processing unit, configured to determine, according to images and audio data of image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video; and
    a generating unit, configured to generate a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  9. The apparatus according to claim 8, wherein, when determining at least one target frame in the video to be edited, the acquiring unit specifically performs:
    obtaining first text information of each image frame according to the image of the image frame in the video to be edited, and obtaining second text information of the image frame according to the audio data of the image frame; and
    determining the at least one target frame in the video to be edited according to the first text information and the second text information of each image frame.
  10. The apparatus according to claim 9, wherein, when determining the at least one target frame in the video to be edited according to the first text information and the second text information of each image frame, the acquiring unit specifically performs:
    inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain a classification result output by the first classification model for each image frame; and
    taking image frames whose classification results meet a preset requirement as target frames.
  11. The apparatus according to claim 8, wherein, when extracting the initial video of each target frame from the video to be edited, the extracting unit specifically performs:
    for each target frame, taking a video that is extracted from the video to be edited, contains the target frame, and has a preset duration as the initial video of the target frame.
  12. The apparatus according to claim 8, wherein, when determining, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video, the processing unit specifically performs:
    for each initial video, obtaining multimodal features of each image frame in the initial video according to the images and audio data of the image frames contained in the initial video;
    concatenating the multimodal features of the image frames and inputting the concatenation result into a pre-trained second classification model; and
    determining, according to an output result of the second classification model, the start frame and the end frame corresponding to the target frame in the initial video.
  13. The apparatus according to claim 8, wherein, when generating the clipped video of each target frame according to each target frame and its corresponding start frame and end frame, the generating unit specifically performs:
    for each target frame, extracting from the video to be edited a first video between the start frame and a frame immediately before the target frame, and a second video between a frame immediately after the target frame and the end frame;
    generating a freeze-frame video of the target frame; and
    splicing the first video, the freeze-frame video of the target frame and the second video in sequence to generate the clipped video of the target frame.
  14. The apparatus according to claim 8, further comprising a merging unit, which specifically performs:
    after the generating unit generates the clipped video of each target frame, determining, according to the image frames contained in the clipped video of each target frame, multiple clipped videos whose image frames overlap; and
    merging the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
  15. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.
  16. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-7.
  17. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
PCT/CN2022/080976 2021-07-13 2022-03-15 Video editing method and apparatus, and electronic device and readable storage medium WO2023284316A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110790261.8 2021-07-13
CN202110790261.8A CN113691864A (en) 2021-07-13 2021-07-13 Video clipping method, video clipping device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
WO2023284316A1 true WO2023284316A1 (en) 2023-01-19

Family

ID=78577191

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/080976 WO2023284316A1 (en) 2021-07-13 2022-03-15 Video editing method and apparatus, and electronic device and readable storage medium

Country Status (2)

Country Link
CN (1) CN113691864A (en)
WO (1) WO2023284316A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113691864A (en) * 2021-07-13 2021-11-23 北京百度网讯科技有限公司 Video clipping method, video clipping device, electronic equipment and readable storage medium
CN114339075A (en) * 2021-12-20 2022-04-12 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN115022732B (en) * 2022-05-25 2023-11-03 阿里巴巴(中国)有限公司 Video generation method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190354762A1 (en) * 2018-05-17 2019-11-21 Chandru Bolaki Method and device for time lapsed digital video recording and navigation through the same
CN111428660A (en) * 2020-03-27 2020-07-17 腾讯科技(深圳)有限公司 Video editing method and device, storage medium and electronic device
CN111726525A (en) * 2020-06-19 2020-09-29 维沃移动通信有限公司 Video recording method, video recording device, electronic equipment and storage medium
CN111988638A (en) * 2020-08-19 2020-11-24 北京字节跳动网络技术有限公司 Method and device for acquiring spliced video, electronic equipment and storage medium
CN112380929A (en) * 2020-10-30 2021-02-19 北京字节跳动网络技术有限公司 Highlight segment obtaining method and device, electronic equipment and storage medium
CN112929744A (en) * 2021-01-22 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN113691864A (en) * 2021-07-13 2021-11-23 北京百度网讯科技有限公司 Video clipping method, video clipping device, electronic equipment and readable storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105681894A (en) * 2016-01-04 2016-06-15 努比亚技术有限公司 Device and method for displaying video file
CN106162223B (en) * 2016-05-27 2020-06-05 北京奇虎科技有限公司 News video segmentation method and device
CN107172487A (en) * 2017-06-09 2017-09-15 成都索贝数码科技股份有限公司 A kind of method that Highlight is extracted by camera lens playback feature
CN108833969A (en) * 2018-06-28 2018-11-16 腾讯科技(深圳)有限公司 A kind of clipping method of live stream, device and equipment
CN109089128A (en) * 2018-07-10 2018-12-25 武汉斗鱼网络科技有限公司 A kind of method for processing video frequency, device, equipment and medium
CN109922373B (en) * 2019-03-14 2021-09-28 上海极链网络科技有限公司 Video processing method, device and storage medium
CN110505519B (en) * 2019-08-14 2021-12-03 咪咕文化科技有限公司 Video editing method, electronic equipment and storage medium
CN111343496A (en) * 2020-02-21 2020-06-26 北京字节跳动网络技术有限公司 Video processing method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190354762A1 (en) * 2018-05-17 2019-11-21 Chandru Bolaki Method and device for time lapsed digital video recording and navigation through the same
CN111428660A (en) * 2020-03-27 2020-07-17 腾讯科技(深圳)有限公司 Video editing method and device, storage medium and electronic device
CN111726525A (en) * 2020-06-19 2020-09-29 维沃移动通信有限公司 Video recording method, video recording device, electronic equipment and storage medium
CN111988638A (en) * 2020-08-19 2020-11-24 北京字节跳动网络技术有限公司 Method and device for acquiring spliced video, electronic equipment and storage medium
CN112380929A (en) * 2020-10-30 2021-02-19 北京字节跳动网络技术有限公司 Highlight segment obtaining method and device, electronic equipment and storage medium
CN112929744A (en) * 2021-01-22 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN113691864A (en) * 2021-07-13 2021-11-23 北京百度网讯科技有限公司 Video clipping method, video clipping device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113691864A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
WO2023284316A1 (en) Video editing method and apparatus, and electronic device and readable storage medium
US20220147822A1 (en) Training method and apparatus for target detection model, device and storage medium
US20210201886A1 (en) Method and device for dialogue with virtual object, client end, and storage medium
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN111145732B (en) Processing method and system after multi-task voice recognition
US11816891B2 (en) Video recognition method and apparatus, electronic device and storage medium
US11800042B2 (en) Video processing method, electronic device and storage medium thereof
US20220301108A1 (en) Image quality enhancing
JP7267379B2 (en) Image processing method, pre-trained model training method, device and electronic equipment
WO2023005253A1 (en) Method, apparatus and system for training text recognition model framework
CN113055751B (en) Data processing method, device, electronic equipment and storage medium
CN113641807A (en) Training method, device, equipment and storage medium of dialogue recommendation model
WO2023159819A1 (en) Visual processing and model training methods, device, storage medium and program product
US20220207427A1 (en) Method for training data processing model, electronic device and storage medium
CN116935287A (en) Video understanding method and device
JP2022116231A (en) Training method of organism detection model, device, electronic apparatus and storage medium
US10936823B2 (en) Method and system for displaying automated agent comprehension
CN116778040B (en) Face image generation method based on mouth shape, training method and device of model
WO2023109103A1 (en) Video editing method and apparatus, electronic device, and medium
US20220335316A1 (en) Data annotation method and apparatus, electronic device and readable storage medium
CN113873323B (en) Video playing method, device, electronic equipment and medium
KR20210081308A (en) Method, device, electronic equipment and storage medium for video processing
CN113327311A (en) Virtual character based display method, device, equipment and storage medium
JP7556063B2 (en) Video editing method, device, electronic device, and medium
US20220358929A1 (en) Voice activity detection method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22840961

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22840961

Country of ref document: EP

Kind code of ref document: A1