WO2023284316A1 - Video editing method and apparatus, and electronic device and readable storage medium - Google Patents

Video editing method and apparatus, and electronic device and readable storage medium Download PDF

Info

Publication number
WO2023284316A1
WO2023284316A1 · PCT/CN2022/080976 · CN2022080976W
Authority
WO
WIPO (PCT)
Prior art keywords
video
frame
target frame
image
edited
Prior art date
Application number
PCT/CN2022/080976
Other languages
French (fr)
Chinese (zh)
Inventor
陈妙
廖玺举
贠挺
李远杭
田颖
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2023284316A1 publication Critical patent/WO2023284316A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to artificial intelligence technologies such as image processing and deep learning.
  • As an information medium, video, especially short video, has attracted more and more attention.
  • Long-form videos, such as TV series, movies, and live entertainment streams, have existed for a long time.
  • These video resources are relatively long, and watching them in full is time-consuming.
  • Short videos, by contrast, are brief and are sought after because they fit fragmented time and concentrate information densely.
  • In the prior art, video resources are generally edited according to manually input editing operations, but because the editing timing and clip length are difficult to control, the accuracy and efficiency of video editing are low.
  • A video editing method is provided, comprising: acquiring a video to be edited and determining at least one target frame in the video to be edited; extracting an initial video for each target frame from the video to be edited; determining, according to the image and audio data of the image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video; and generating a clipped video for each target frame according to each target frame and its corresponding start frame and end frame.
  • A video editing apparatus is provided, including: an acquisition unit configured to acquire a video to be edited and determine at least one target frame in the video to be edited; an extraction unit configured to extract an initial video for each target frame from the video to be edited; a processing unit configured to determine, according to the image and audio data of the image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video; and a generation unit configured to generate a clipped video for each target frame according to each target frame and its corresponding start frame and end frame.
  • An electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.
  • A non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to execute the method described above.
  • A computer program product is provided, comprising a computer program that, when executed by a processor, implements the method described above.
  • After determining at least one target frame in the video to be edited, this embodiment first extracts the initial video corresponding to each target frame from the video to be edited, then determines the start frame and end frame corresponding to each target frame according to the initial videos, and finally generates a clipped video for each target frame according to each target frame and its corresponding start frame and end frame, thereby realizing automatic video editing and improving the accuracy and efficiency of video editing.
  • FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
  • FIG. 4 is a block diagram of an electronic device for implementing the video editing method of the embodiments of the present disclosure.
  • FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in FIG. 1, the video editing method of this embodiment may specifically include the following steps:
  • S101. Acquire a video to be edited, and determine at least one target frame in the video to be edited.
  • S102. Extract the initial video of each target frame from the video to be edited.
  • S103. Determine, according to the image and audio data of the image frames contained in each initial video, the start frame and end frame corresponding to the target frame in each initial video.
  • S104. Generate a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  • In the video editing method of this embodiment, after at least one target frame in the video to be edited is determined, the initial video corresponding to each target frame is first extracted from the video to be edited; the start frame and end frame corresponding to each target frame are then determined according to the extracted initial videos; and finally the clipped video of each target frame is generated according to each target frame and its corresponding start frame and end frame. This realizes automatic video editing and can improve the accuracy and efficiency of video editing.
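Taken together, S101 through S104 form a simple pipeline. The following Python sketch illustrates the control flow only; every function in it is a hypothetical stand-in for the corresponding step (the real components are detailed in the remainder of this section), not an API from the publication.

```python
from typing import List

def detect_target_frames(frames: List[dict]) -> List[int]:
    # Stand-in for S101: here, frames pre-flagged as highlights.
    return [i for i, f in enumerate(frames) if f.get("highlight")]

def extract_initial_video(frames: List[dict], t: int, window: int = 2):
    # Stand-in for S102: a fixed-length window containing the target frame.
    lo = max(0, t - window)
    return frames[lo : lo + 2 * window], t - lo

def locate_boundaries(initial: List[dict]):
    # Stand-in for S103: the whole window; a classifier refines this in practice.
    return 0, len(initial) - 1

def build_clip(initial: List[dict], start: int, end: int, target: int):
    # Stand-in for S104 (the freeze-frame detail is sketched later).
    return initial[start : end + 1]

def clip_video(frames: List[dict]) -> List[List[dict]]:
    clips = []
    for t in detect_target_frames(frames):                      # S101
        initial, t_local = extract_initial_video(frames, t)     # S102
        start, end = locate_boundaries(initial)                 # S103
        clips.append(build_clip(initial, start, end, t_local))  # S104
    return clips

frames = [{"highlight": i in (3, 6)} for i in range(8)]
print([len(c) for c in clip_video(frames)])  # two clips, one per target frame
```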
  • The video to be edited obtained in S101 may be a game video; for example, a live game video is obtained as the video to be edited. The game type of the game video may be a role-playing game, a sports game, a multiplayer online competitive game, and so on, which is not limited in this embodiment.
  • After the video to be edited is obtained in S101, at least one target frame in the video to be edited is determined; the determined target frame is an image frame corresponding to a highlight moment in the video to be edited.
  • Specifically, when performing S101 to determine at least one target frame in the video to be edited, an optional implementation is: obtaining first text information of each image frame from the image of that frame, for example by optical character recognition (OCR), and obtaining second text information of each image frame from its audio data, for example by automatic speech recognition (ASR); and then determining at least one target frame in the acquired video to be edited according to the first text information and the second text information of each image frame.
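As a concrete illustration of this step, the sketch below extracts the two pieces of text information for one image frame. It assumes the pytesseract OCR binding merely as one possible choice, and the ASR call is a placeholder for any speech-recognition backend; neither tool is specified by the publication.

```python
import pytesseract  # one possible OCR engine; not mandated by the source

def frame_text(image, audio, asr):
    """Return (first_text, second_text) for one image frame.

    image: the frame picture (e.g. a PIL Image or numpy array);
    audio: the audio chunk aligned with the frame;
    asr:   a placeholder callable mapping an audio chunk to a transcript.
    """
    first_text = pytesseract.image_to_string(image)  # on-screen text (OCR)
    second_text = asr(audio)                         # spoken commentary (ASR)
    return first_text, second_text
```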
  • When performing S101 to determine at least one target frame, the video to be edited may also be divided into multiple video segments of equal length, after which the target frames in each segment are determined separately.
  • That is, this embodiment determines the target frames in the video to be edited from two parts of text information, obtained from each image frame and from its audio data respectively, thereby improving the accuracy of the determined target frames.
  • When determining at least one target frame according to the first text information and the second text information of each image frame, an optional implementation is: inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain the classification result output by the model for each image frame, and taking the image frames whose classification results meet a preset requirement as target frames, for example the image frames whose classification result is 1.
  • The first classification model used in S101 of this embodiment outputs, from the input text information, a classification result indicating whether an image frame is a target frame: a result of 1 indicates that the image frame is a target frame, and a result of 0 indicates that it is not.
  • Alternatively, when determining at least one target frame according to the first text information and the second text information of each image frame, the two pieces of text information of each image frame may be concatenated, the similarity between the concatenation result and preset information calculated, and the image frames whose similarity exceeds a preset similarity threshold taken as target frames.
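A minimal sketch of this similarity-based variant follows, using Python's standard-library SequenceMatcher as the similarity measure; the threshold value and the preset phrases are illustrative assumptions, not values from the publication.

```python
from difflib import SequenceMatcher

def select_target_frames(texts, preset_info, threshold=0.5):
    """texts: list of (first_text, second_text) per image frame.

    Frames whose concatenated text is sufficiently similar to the preset
    information are kept as target frames; 0.5 is an illustrative threshold.
    """
    targets = []
    for i, (first, second) in enumerate(texts):
        joined = first + " " + second
        if SequenceMatcher(None, joined, preset_info).ratio() > threshold:
            targets.append(i)
    return targets

print(select_target_frames([("double kill", "what a play"), ("shop menu", "")],
                           preset_info="double kill what a save"))  # -> [0]
```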
  • Specifically, when performing S102 to extract the initial video of each target frame from the video to be edited, an optional implementation is: for each target frame, taking the video extracted from the video to be edited that contains the target frame and has a preset duration as the initial video of that target frame.
  • The preset duration in this embodiment is obtained by counting the durations of existing highlight videos; for example, the average duration of existing game highlight videos may be used as the preset duration.
  • The initial video extracted in S102 from the video to be edited is specifically a video segment that contains the target frame and whose duration is the preset duration.
  • The number of image frames located before and/or after the target frame in the initial video is not limited.
  • For example, if the acquired video to be edited contains image frames 1 to 6, the determined target frame is image frame 3, and the preset duration is 4 s, the initial video extracted in S102 may consist of image frames 1, 2, 3, and 4, or of image frames 2, 3, 4, and 5.
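The window extraction itself reduces to index arithmetic. In the sketch below, centering the window on the target frame is an arbitrary placement choice, since the embodiment deliberately leaves the before/after split open:

```python
def initial_video_range(num_frames, target_idx, fps, preset_duration_s):
    """Return the [start, end) frame-index range of the initial video:
    a window of the preset duration that contains the target frame."""
    window = int(fps * preset_duration_s)
    start = max(0, target_idx - window // 2)
    end = min(num_frames, start + window)
    start = max(0, end - window)  # re-anchor if the window hit the tail
    return start, end

# The 6-frame example above at 1 frame/s with a 4 s preset duration:
print(initial_video_range(num_frames=6, target_idx=2, fps=1, preset_duration_s=4))
# -> (0, 4), i.e. image frames 1-4 (the 2-5 variant is an equally valid split)
```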
  • That is, this embodiment screens the image frames contained in each initial video to determine the exact start frame and exact end frame corresponding to the target frame, thereby updating the time range of the video and further improving the accuracy of the generated clipped video.
  • When performing S103 to determine the start frame and end frame corresponding to the target frame in each initial video, an optional implementation is: for each initial video, obtaining the multimodal features of each image frame in the initial video according to the image and audio data of the image frames it contains, the obtained multimodal features including the image features and audio features of the image frame; concatenating the multimodal features of the image frames in the initial video and inputting the concatenation result into a pre-trained second classification model; and determining the start frame and end frame in the initial video according to the output result of the second classification model.
  • Among the obtained multimodal features, the image features are game-character attribute features extracted from the image of the image frame, used to indicate whether the activity of the game character in the image frame is intense; they may include health-point features of the game character (position, quantity, and change information of the health-bar indicator template, etc.), mana-point features (position, quantity, and change information of the mana-bar indicator template, etc.), and action special-effect features (whether action special effects are present).
  • The audio features are game audio features extracted from the audio data of the image frame, used to indicate whether competitive voice activity is present in the image frame.
  • That is, this embodiment can combine the image features and audio features of the image frames to determine the accurate start frame and end frame in each initial video, thereby obtaining the start frame and end frame corresponding to each target frame in the video to be edited and further improving the accuracy of the generated clipped video.
  • The second classification model used in S103 of this embodiment includes a convolutional layer and a fully connected layer. After the concatenation result of the multimodal features of the image frames is input into the second classification model, the convolutional layer first performs convolution on the concatenation result, and the convolution result is then input into the fully connected layer for classification, so as to obtain the start frame and end frame output by the fully connected layer.
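The publication specifies only the convolution-plus-fully-connected structure; the PyTorch sketch below fills in the remaining choices (feature dimension, channel count, kernel size, and a two-way start/end head) as assumptions:

```python
import torch
from torch import nn

class BoundaryClassifier(nn.Module):
    """Sketch of the second classification model: a convolutional layer over
    the concatenated per-frame multimodal features followed by a fully
    connected layer. Dimensions and the start/end head are assumptions."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, 2)  # per-frame scores: (start, end)

    def forward(self, feats):  # feats: (batch, frames, feat_dim)
        x = self.conv(feats.transpose(1, 2)).relu()  # (batch, hidden, frames)
        logits = self.fc(x.transpose(1, 2))          # (batch, frames, 2)
        return logits.argmax(dim=1)                  # (batch, 2) frame indices

model = BoundaryClassifier()
print(model(torch.randn(1, 16, 128)))  # e.g. tensor([[start_idx, end_idx]])
```

The argmax over the frame axis picks one start position and one end position per initial video, matching the per-video start/end output described in the text.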
  • The second classification model can be pre-trained as follows: obtain training data, the training data including multiple training videos and their annotation results, each annotation result including a start frame annotation and an end frame annotation; for each training video, obtain the multimodal features of each of its image frames according to the image and audio data of the image frames it contains; concatenate the multimodal features of the image frames in the training video and input the concatenation result into a neural network model to obtain the prediction result output by the model for the training video, the prediction result including a start frame prediction and an end frame prediction; calculate the loss function value from the annotation result and the prediction result of the training video, and adjust the parameters of the neural network model according to the calculated loss function value until the model converges, thereby obtaining the second classification model.
  • This embodiment can calculate the loss function value using a formula (presented as an image in the original publication and not reproduced here) in which w is the loss function value, gt is the time of the annotation result, pt is the time of the prediction result, and the remaining constant takes a value of 1.
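For the training procedure, the sketch below runs one gradient step on the same conv + fully-connected structure. Since the publication's loss formula is not reproduced in this text, a plain L1 distance between the predicted time pt (a soft-argmax over the per-frame scores, kept differentiable) and the annotated time gt stands in for it; this substitution, like the hyperparameters, is an assumption.

```python
import torch
from torch import nn

# Minimal stand-in model: conv over time + fully connected head, as in the text.
conv = nn.Conv1d(128, 64, kernel_size=3, padding=1)
fc = nn.Linear(64, 2)  # per-frame (start, end) scores
optimizer = torch.optim.Adam(list(conv.parameters()) + list(fc.parameters()),
                             lr=1e-3)

def train_step(feats, gt_times):
    """feats: (batch, frames, 128); gt_times: (batch, 2) annotated start/end.

    The patent's loss relates the predicted time pt to the annotated time gt
    with a constant taking a value of 1; as its exact formula is not given
    here, a plain L1 distance between pt and gt is used as a stand-in.
    """
    logits = fc(conv(feats.transpose(1, 2)).relu().transpose(1, 2))
    probs = logits.softmax(dim=1)                        # soft position per head
    positions = torch.arange(feats.shape[1], dtype=torch.float)
    pt = (probs * positions[None, :, None]).sum(dim=1)   # expected frame index
    loss = nn.functional.l1_loss(pt, gt_times.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(4, 16, 128), torch.randint(0, 16, (4, 2))))
```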
  • When performing S104 to generate the clipped video of each target frame according to the target frame and its corresponding start frame and end frame, an optional implementation is: for each target frame, extracting from the video to be edited a first video between the start frame and the frame preceding the target frame, and a second video between the frame following the target frame and the end frame; generating a freeze-frame video of the target frame, that is, extending the picture of the target frame for several seconds; and splicing the first video, the freeze-frame video of the target frame, and the second video in sequence to generate the clipped video of the target frame.
  • That is, the freeze-frame video of the target frame is obtained by extending the target frame for several seconds and is then used in generating the clipped video, which further highlights the target frame in the clipped video and thereby improves the display effect of the clipped video.
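Representing videos as frame lists, the splicing step then looks roughly like the sketch below; the 3-second freeze is an illustrative value for the "several seconds" in the text:

```python
def build_clip(frames, start, end, target, fps=30, freeze_seconds=3):
    """Assemble the clipped video: first video (start frame up to the frame
    before the target), a freeze-frame segment repeating the target frame,
    then the second video (frame after the target up to the end frame)."""
    first = frames[start:target]
    freeze = [frames[target]] * (fps * freeze_seconds)
    second = frames[target + 1 : end + 1]
    return first + freeze + second

clip = build_clip(list("abcdefgh"), start=1, end=6, target=3,
                  fps=1, freeze_seconds=3)
print("".join(clip))  # -> "bcdddefg": frames b-c, target d frozen 3x, e-g
```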
  • When performing S104 to generate the clipped video of a target frame, preset music may be added to the generated clipped video, and special effects such as opening and closing credits may also be added, thereby further improving the quality of the generated clipped video.
  • FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in FIG. 2, after S104 "generating the clipped video of each target frame" is performed, this embodiment may further include the following steps:
  • S201. Determine, according to the image frames contained in the clipped video of each target frame, multiple clipped videos having overlapping image frames.
  • S202. Merge the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
  • That is, this embodiment can merge multiple clipped videos that have overlapping image frames while retaining the freeze-frame video of the last target frame, thereby generating a merged clipped video; this ensures the continuity of the highlight events in the generated merged clipped video and further improves the accuracy of video editing.
  • When performing S202 to merge the determined multiple clipped videos, since each clipped video corresponds to a different target frame, this embodiment retains only the freeze-frame video of the last target frame in the merged clipped video and restores the freeze-frame videos of the other target frames to the target frames themselves.
  • For example, suppose the acquired video to be edited contains image frames 1 through 8, the determined target frames are image frames 3 and 6, and the preset duration is 4 s. If the generated clipped video of image frame 3 contains image frames 2, 3, 4, and 5, and the generated clipped video of image frame 6 contains image frames 4, 5, 6, and 7, the two clipped videos overlap at image frames 4 and 5. This embodiment then merges the two clipped videos, retaining only the freeze-frame video of image frame 6, and the generated merged clipped video contains image frames 2, 3, 4, 5, 6, and 7.
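Treating each clipped video as a (start, end, target) index range over the original video, the merging rule can be sketched as follows; the tuple representation is an assumption made for illustration:

```python
def merge_overlapping(clips):
    """Merge clipped videos whose frame ranges overlap. Each clip is
    (start_idx, end_idx, target_idx) over the original frame indices;
    overlapping clips are fused into one range, and only the last target
    frame keeps its freeze-frame treatment."""
    clips = sorted(clips)
    merged = [list(clips[0])]
    for start, end, target in clips[1:]:
        if start <= merged[-1][1]:  # frame ranges overlap
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2] = target  # keep only the last freeze frame
        else:
            merged.append([start, end, target])
    return [tuple(m) for m in merged]

# Clips of image frames 2-5 (target 3) and 4-7 (target 6) share frames 4 and 5:
print(merge_overlapping([(2, 5, 3), (4, 7, 6)]))  # -> [(2, 7, 6)]
```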
  • When generating the merged clipped video, preset music may be added to it, and special effects such as lead-in opening titles and closing credits may also be added, thereby further improving the quality of the generated merged clipped video.
  • FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in FIG. 3, the video editing apparatus 300 of this embodiment includes:
  • an acquisition unit 301, configured to acquire a video to be edited and determine at least one target frame in the video to be edited;
  • an extraction unit 302, configured to extract the initial video of each target frame from the video to be edited;
  • a processing unit 303, configured to determine, according to the image and audio data of the image frames contained in each initial video, the start frame and end frame corresponding to the target frame in each initial video; and
  • a generation unit 304, configured to generate the clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  • The video to be edited acquired by the acquisition unit 301 may be a game video; for example, a live game video is acquired as the video to be edited. The game type of the game video may be a role-playing game, a sports game, a multiplayer online competitive game, and so on, which is not limited in this embodiment.
  • After acquiring the video to be edited, the acquisition unit 301 determines at least one target frame in it; the determined target frame is an image frame corresponding to a highlight moment in the video to be edited.
  • When the acquisition unit 301 determines at least one target frame in the video to be edited, an optional implementation is: obtaining first text information of each image frame from the image of that frame and second text information of each image frame from its audio data, and determining at least one target frame in the acquired video to be edited according to the first text information and the second text information of each image frame.
  • The acquisition unit 301 may also divide the video to be edited into multiple video segments of equal length and then determine the target frames in each segment separately.
  • That is, the acquisition unit 301 can determine the target frames in the video to be edited from the two parts of text information obtained from each image frame and its audio data, thereby improving the accuracy of the determined target frames.
  • When the acquisition unit 301 determines at least one target frame according to the first text information and the second text information of each image frame, an optional implementation is: inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain the classification result output by the model for each image frame, and taking the image frames whose classification results meet a preset requirement as target frames.
  • The first classification model used by the acquisition unit 301 outputs, from the input text information, a classification result indicating whether an image frame is a target frame: a result of 1 indicates that the image frame is a target frame, and a result of 0 indicates that it is not.
  • Alternatively, the acquisition unit 301 may concatenate the first text information and the second text information of each image frame, calculate the similarity between the concatenation result and preset information, and take the image frames whose similarity exceeds a preset similarity threshold as target frames.
  • After the acquisition unit 301 determines at least one target frame, the extraction unit 302 extracts the initial video of each target frame from the video to be edited.
  • When the extraction unit 302 extracts the initial video of each target frame, an optional implementation is: for each target frame, taking the video extracted from the video to be edited that contains the target frame and has a preset duration as the initial video of that target frame.
  • The preset duration in this embodiment is obtained by counting the durations of existing highlight videos; for example, the average duration of existing game highlight videos may be used as the preset duration.
  • The initial video extracted by the extraction unit 302 from the video to be edited is specifically a video segment that contains the target frame and whose duration is the preset duration; the number of image frames located before and/or after the target frame in the initial video is not limited.
  • After the extraction unit 302 extracts the initial video of each target frame, the processing unit 303 determines, according to the image and audio data of the image frames contained in each initial video, the start frame and end frame corresponding to the target frame in each initial video.
  • That is, the processing unit 303 screens the image frames contained in each initial video to determine the exact start frame and exact end frame corresponding to the target frame, thereby updating the time range of the video and further improving the accuracy of the generated clipped video.
  • When the processing unit 303 determines the start frame and end frame corresponding to the target frame in each initial video, an optional implementation is: for each initial video, obtaining the multimodal features of each image frame in the initial video according to the image and audio data of the image frames it contains; concatenating the multimodal features of the image frames in the initial video and inputting the concatenation result into a pre-trained second classification model; and determining the start frame and end frame in the initial video according to the output result of the second classification model.
  • Among the obtained multimodal features, the image features are game-character attribute features extracted from the image of the image frame, used to indicate whether the activity of the game character in the image frame is intense; they may include health-point features of the game character (position, quantity, and change information of the health-bar indicator template, etc.), mana-point features (position, quantity, and change information of the mana-bar indicator template, etc.), and action special-effect features (whether action special effects are present). The audio features are game audio features extracted from the audio data of the image frame, used to indicate whether competitive voice activity is present in the image frame.
  • That is, the processing unit 303 can combine the image features and audio features of the image frames to determine the accurate start frame and end frame in each initial video, thereby obtaining the start frame and end frame corresponding to each target frame in the video to be edited and further improving the accuracy of the generated clipped video.
  • The second classification model used by the processing unit 303 includes a convolutional layer and a fully connected layer. After the concatenation result of the multimodal features of the image frames is input into the second classification model, the convolutional layer first performs convolution on the concatenation result, and the convolution result is then input into the fully connected layer for classification, so as to obtain the start frame and end frame output by the fully connected layer.
  • The second classification model can be pre-trained as follows: obtain training data, the training data including multiple training videos and their annotation results, each annotation result including a start frame annotation and an end frame annotation; for each training video, obtain the multimodal features of each of its image frames according to the image and audio data of the image frames it contains; concatenate the multimodal features of the image frames in the training video and input the concatenation result into a neural network model to obtain the prediction result output by the model for the training video, the prediction result including a start frame prediction and an end frame prediction; calculate the loss function value from the annotation result and the prediction result of the training video, and adjust the parameters of the neural network model according to the calculated loss function value until the model converges, thereby obtaining the second classification model.
  • After the processing unit 303 determines the start frame and end frame corresponding to each target frame in the video to be edited, the generation unit 304 generates the clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  • When the generation unit 304 generates the clipped video of each target frame, an optional implementation is: for each target frame, extracting from the video to be edited a first video between the start frame and the frame preceding the target frame, and a second video between the frame following the target frame and the end frame; generating a freeze-frame video of the target frame; and splicing the first video, the freeze-frame video of the target frame, and the second video in sequence to generate the clipped video of the target frame.
  • That is, the generation unit 304 obtains the freeze-frame video of the target frame by extending the target frame for several seconds and then uses it to generate the clipped video, which further highlights the target frame in the clipped video and thereby improves the display effect of the clipped video.
  • When the generation unit 304 generates the clipped video of a target frame, preset music may be added to the generated clipped video, and special effects such as opening and closing credits may also be added, thereby further improving the quality of the generated clipped video.
  • The video editing apparatus 300 of this embodiment may further include a merging unit 305, configured to perform the following after the generation unit 304 generates the clipped video of each target frame: determining, according to the image frames contained in the clipped video of each target frame, multiple clipped videos having overlapping image frames; and merging the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
  • That is, the merging unit 305 can merge multiple clipped videos that have overlapping image frames while retaining the freeze-frame video of the last target frame, thereby generating a merged clipped video; this ensures the continuity of the highlight events in the generated merged clipped video and further improves the accuracy of video editing.
  • When merging the clipped videos, this embodiment retains only the freeze-frame video of the last target frame in the merged clipped video and restores the freeze-frame videos of the other target frames to the target frames themselves.
  • When the merging unit 305 generates the merged clipped video, preset music may be added to it, and special effects such as lead-in opening titles and closing credits may also be added, thereby further improving the quality of the generated merged clipped video.
  • In the technical solutions of the present disclosure, the acquisition, storage, and application of the user's personal information involved comply with relevant laws and regulations and do not violate public order and good customs.
  • According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 4 is a block diagram of an electronic device for implementing the video editing method of the embodiments of the present disclosure.
  • The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit the implementations of the disclosure described and/or claimed herein.
  • The device 400 includes a computing unit 401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 402 or loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 can also store various programs and data necessary for the operation of the device 400.
  • The computing unit 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to the bus 404.
  • Multiple components in the device 400 are connected to the I/O interface 405, including: an input unit 406, such as a keyboard or a mouse; an output unit 407, such as various types of displays and speakers; a storage unit 408, such as a magnetic disk or an optical disk; and a communication unit 409, such as a network card, a modem, or a wireless communication transceiver.
  • The communication unit 409 allows the device 400 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 401 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, and so on.
  • The computing unit 401 executes the various methods and processes described above, such as the video editing method.
  • For example, in some embodiments, the video editing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408.
  • In some embodiments, part or all of the computer program may be loaded and/or installed on the device 400 via the ROM 402 and/or the communication unit 409.
  • When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the video editing method described above can be performed.
  • Alternatively, in other embodiments, the computing unit 401 may be configured to execute the video editing method in any other suitable manner (for example, by means of firmware).
  • Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOC), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • The programmable processor may be a special-purpose or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and of transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing devices, so that when the program codes are executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • A machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LAN), wide area networks (WAN), and the Internet.
  • A computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
  • The server can also be a server of a distributed system, or a server combined with a blockchain.
  • It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above.
  • Each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present disclosure relates to the technical field of artificial intelligence such as image processing and deep learning. Provided are a video editing method and apparatus, and an electronic device and a readable storage medium. The video editing method comprises: acquiring a video to be edited, and determining at least one target frame in the video to be edited; extracting an initial video of each target frame from the video to be edited; according to image and audio data of image frames included in each initial video, determining, from each initial video, a start frame and an end frame which correspond to the target frame; and generating an edited video of each target frame according to each target frame and the start frame and end frame corresponding thereto. By means of the present disclosure, automatic video editing is realized, such that the accuracy and efficiency of video editing can be improved.

Description

Video editing method, apparatus, electronic device and readable storage medium
This application claims priority to the Chinese patent application filed on July 13, 2021 with application number 202110790261.8, entitled "Video Editing Method, Apparatus, Electronic Device, and Readable Storage Medium".
Technical Field
The present disclosure relates to the field of computer technology, and in particular to artificial intelligence technologies such as image processing and deep learning. Provided are a video editing method, apparatus, electronic device, and readable storage medium.
Background
As an information medium, video, especially short video, has attracted more and more attention. Long videos, such as TV series, movies, and live entertainment streams, have existed for a long time; these video resources are relatively long, and watching them in full is time-consuming. Short videos, by contrast, are brief and are sought after because they fit fragmented time and concentrate information densely.
In the prior art, video resources are generally edited according to manually input editing operations; because the editing timing and clip length are difficult to control, the accuracy and efficiency of video editing are low.
发明内容Contents of the invention
根据本公开的第一方面,提供了一种视频剪辑方法,包括:获取待剪辑视频,确定所述待剪辑视频中的至少一个目标帧;从所述待剪辑视频中提取每个目标帧的初始视频;根据每个初始视频所包含图像帧的图像与音频数据,确定每个初始视频中与目标帧对应的开始帧与结束帧;根据每个目标帧及其对应的开始帧与结束帧,生成每个目标帧的剪辑视频。According to the first aspect of the present disclosure, there is provided a video editing method, comprising: acquiring a video to be edited, determining at least one target frame in the video to be edited; extracting the initial frame of each target frame from the video to be edited Video; according to the image and audio data of the image frame contained in each initial video, determine the start frame and end frame corresponding to the target frame in each initial video; according to each target frame and its corresponding start frame and end frame, generate Clipped video for each target frame.
根据本公开的第二方面,提供了一种视频剪辑装置,包括:获取单元,用于获取待剪辑视频,确定所述待剪辑视频中的至少一个目标帧;提取单元,用于从所述待剪辑视频中提取每个目标帧的初始视频;处理单元,用于根据每个初始视频所包含图像帧的图像与音频数据,确定每 个初始视频中与目标帧对应的开始帧与结束帧;生成单元,用于根据每个目标帧及其对应的开始帧与结束帧,生成每个目标帧的剪辑视频。According to a second aspect of the present disclosure, there is provided a video clipping device, including: an acquisition unit, configured to acquire a video to be trimmed, and determine at least one target frame in the video to be trimmed; an extraction unit, configured to extract from the video to be trimmed The initial video of each target frame is extracted from the clip video; the processing unit is used to determine the start frame and the end frame corresponding to the target frame in each initial video according to the image and audio data of the image frame contained in each initial video; generate The unit is configured to generate a video clip of each target frame according to each target frame and its corresponding start frame and end frame.
根据本公开的第三方面,提供了一种电子设备,包括:至少一个处理器;以及与所述至少一个处理器通信连接的存储器;其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行如上所述的方法。According to a third aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory connected to the at least one processor in communication; wherein, the memory stores information that can be used by the at least one processor Instructions executed by the at least one processor to enable the at least one processor to perform the method as described above.
根据本公开的第四方面,提供了一种存储有计算机指令的非瞬时计算机可读存储介质,其中,所述计算机指令用于使所述计算机执行如上所述的方法。According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method as described above.
根据本公开的第五方面,提供了一种计算机程序产品,包括计算机程序,所述计算机程序在被处理器执行时实现如上所述的方法。According to a fifth aspect of the present disclosure there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
由以上技术方案可以看出,本实施例在确定待剪辑视频中的至少一个目标帧之后,首先从待剪辑视频中提取对应每个目标帧的初始视频,然后再根据初始视频确定待剪辑视频中对应每个目标帧的开始帧与结束帧,最后根据每个目标帧及其对应的开始帧与结束帧,生成每个目标帧的剪辑视频,从而实现了视频的自动剪辑,能够提升视频剪辑的准确性与效率。It can be seen from the above technical solutions that after determining at least one target frame in the video to be edited, this embodiment first extracts the initial video corresponding to each target frame from the video to be edited, and then determines the target frame in the video to be edited according to the initial video. Corresponding to the start frame and end frame of each target frame, and finally according to each target frame and its corresponding start frame and end frame, the edited video of each target frame is generated, thereby realizing automatic video editing and improving the quality of video editing. Accuracy and efficiency.
应当理解,本部分所描述的内容并非旨在标识本公开的实施例的关键或重要特征,也不用于限制本公开的范围。本公开的其它特征将通过以下的说明书而变得容易理解。It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood through the following description.
Brief Description of the Drawings
The accompanying drawings are used for a better understanding of the solution and do not constitute a limitation of the present disclosure. In the drawings:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a block diagram of an electronic device for implementing the video editing method of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to facilitate understanding; they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in FIG. 1, the video editing method of this embodiment may specifically include the following steps:
S101. Acquire a video to be edited, and determine at least one target frame in the video to be edited.
S102. Extract the initial video of each target frame from the video to be edited.
S103. Determine, according to the image and audio data of the image frames contained in each initial video, the start frame and end frame corresponding to the target frame in each initial video.
S104. Generate a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
In the video editing method of this embodiment, after at least one target frame in the video to be edited is determined, the initial video corresponding to each target frame is first extracted from the video to be edited; the start frame and end frame corresponding to each target frame are then determined according to the extracted initial videos; and finally the clipped video of each target frame is generated according to each target frame and its corresponding start frame and end frame. This realizes automatic video editing and can improve the accuracy and efficiency of video editing.
The video to be edited acquired in S101 of this embodiment may be a game video; for example, a live game video is acquired as the video to be edited. The game type of the game video may be a role-playing game, a sports game, a multiplayer online competitive game, and so on, which is not limited in this embodiment.
After the video to be edited is acquired in S101, at least one target frame in the video to be edited is determined; the determined target frame is an image frame corresponding to a highlight moment in the video to be edited.
Specifically, when S101 is performed to determine at least one target frame in the video to be edited, an optional implementation is: obtaining first text information of each image frame from the image of that frame, for example by optical character recognition (OCR), and obtaining second text information of each image frame from its audio data, for example by automatic speech recognition (ASR); and determining at least one target frame in the acquired video to be edited according to the first text information and the second text information of each image frame.
When S101 is performed to determine at least one target frame in the video to be edited, the video to be edited may also be divided into multiple video segments of equal length, after which the target frames in each segment are determined separately.
That is, this embodiment can determine the target frames in the video to be edited from the two parts of text information obtained from each image frame and its audio data, thereby improving the accuracy of the determined target frames.
When S101 is performed to determine at least one target frame according to the first text information and the second text information of each image frame, an optional implementation is: inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain the classification result output by the model for each image frame, and taking the image frames whose classification results meet a preset requirement as target frames, for example the image frames whose classification result is 1.
The first classification model used in S101 of this embodiment can output, from the input text information, a classification result indicating whether an image frame is a target frame: a result of 1 indicates that the image frame is a target frame, and a result of 0 indicates that it is not.
Alternatively, when S101 is performed to determine at least one target frame according to the first text information and the second text information of each image frame, the two pieces of text information of each image frame may be concatenated, the similarity between the concatenation result and preset information calculated, and the image frames whose similarity exceeds a preset similarity threshold taken as target frames.
After at least one target frame in the video to be edited is determined in S101, S102 is performed to extract the initial video of each target frame from the video to be edited.
Specifically, when S102 is performed to extract the initial video of each target frame from the video to be edited, an optional implementation is: for each target frame, taking the video extracted from the video to be edited that contains the target frame and has a preset duration as the initial video of that target frame.
本实施例中的预设时长,具体为通过对已有高光视频的时长进行统计得到的,例如可以将已有游戏高光视频的时长的平均值作为预设时长。The preset duration in this embodiment is specifically obtained by counting the duration of existing highlight videos, for example, the average duration of existing game highlight videos may be used as the preset duration.
本实施例执行S102从待剪辑视频中所提取的初始视频,具体为包含目标帧且视频时长为预设时长的视频片段,本实施例对初始视频中位于目标帧之前和/或之后的图像帧的数量不进行限定。This embodiment executes S102 to extract the initial video from the video to be edited, specifically a video segment that contains the target frame and has a video duration of a preset duration. In this embodiment, the image frames located before and/or after the target frame in the initial video The number of is not limited.
For example, if the acquired video to be edited contains image frame 1, image frame 2, image frame 3, image frame 4, image frame 5 and image frame 6, the determined target frame is image frame 3, and the preset duration is 4s, then the initial video extracted by performing S102 in this embodiment may consist of image frame 1, image frame 2, image frame 3 and image frame 4, or of image frame 2, image frame 3, image frame 4 and image frame 5.
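A sketch of this extraction step, under the assumption that the preset duration has already been converted into a window of `window` frames; centering the window on the target frame is a design choice made here, since the publication leaves the placement of the target frame inside the window open:

```python
def extract_initial_video(num_frames: int, target: int, window: int):
    """Return one valid (start, end) frame range of length `window` that
    contains `target` (all indices 0-based). Any placement is acceptable,
    since the publication does not fix how many frames fall before or after
    the target; here the window is centered and clamped to the video bounds."""
    start = max(0, min(target - window // 2, num_frames - window))
    return start, start + window - 1

# With 6 frames (indices 0..5), target index 2 and a 4-frame window, this
# yields (0, 3), i.e. frames 1..4 in the publication's 1-based numbering.
print(extract_initial_video(6, 2, 4))
```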
In this embodiment, after performing S102 to extract the initial video of each target frame from the video to be edited, S103 is performed to determine, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video.
That is to say, this embodiment screens the image frames contained in each initial video and determines the exact start frame and exact end frame corresponding to the target frame in each initial video, thereby updating the time range of the video and further improving the accuracy of the generated clipped video.
Specifically, when performing S103 to determine, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video, an optional implementation is: for each initial video, obtain the multimodal features of each image frame in the initial video according to the images and audio data of the image frames contained in the initial video, the obtained multimodal features including the image features and audio features of the image frames; concatenate the multimodal features of the image frames in the initial video and input the concatenation result into a pre-trained second classification model; and determine the start frame and the end frame in the initial video according to the output result of the second classification model.
Among the multimodal features obtained in S103 of this embodiment, the image features are game-character attribute features extracted from the image of an image frame, used to indicate whether the activity of the game characters in the image frame is intense; they may include health-point features (the position, quantity and change information of the health-bar indicator template, etc.), mana-point features (the position, quantity and change information of the mana-bar indicator template, etc.), and action special-effect features (whether action special effects are present). The audio features are game audio features extracted from the audio data of the image frame, used to indicate whether competitive voice activity is present in the image frame.
That is to say, this embodiment can combine the image features and audio features of the image frames to determine the accurate start frame and end frame in each initial video, thereby obtaining the start frame and end frame corresponding to each target frame in the video to be edited, further improving the accuracy of the generated clipped video.
The second classification model used in S103 of this embodiment contains a convolutional layer and a fully connected layer. After the concatenation result of the multimodal features of the image frames is input into the second classification model, the convolutional layer first performs convolution on the concatenation result, and the convolution result is then input into the fully connected layer for classification, so that the start frame and the end frame output by the fully connected layer are obtained.
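The publication only states that the model contains a convolutional layer and a fully connected layer; the following PyTorch sketch fills in the remaining choices (feature size, channel count, a fixed frame count, and per-frame start/end logits) as assumptions:

```python
import torch
from torch import nn

class BoundaryClassifier(nn.Module):
    """A minimal sketch of the second classification model: one convolutional
    layer over the concatenated per-frame multimodal features, followed by a
    fully connected layer. The dimensions and the output parameterization
    are assumptions, not disclosed in the publication."""

    def __init__(self, feat_dim: int = 128, hidden: int = 64, num_frames: int = 100):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden * num_frames, 2 * num_frames)  # start/end logits

    def forward(self, x: torch.Tensor) -> "tuple[torch.Tensor, torch.Tensor]":
        # x: (batch, num_frames, feat_dim), the concatenated multimodal features
        h = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, hidden, num_frames)
        logits = self.fc(h.flatten(1))                # (batch, 2 * num_frames)
        start_logits, end_logits = logits.chunk(2, dim=1)
        return start_logits, end_logits
```

The start frame and end frame can then be read off as the argmax over the per-frame start logits and end logits, respectively.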
In this embodiment, the second classification model may be pre-trained in the following way: acquire training data, the acquired training data containing multiple training videos and the annotation results of the multiple training videos, the annotation results containing a start-frame annotation result and an end-frame annotation result; for each training video, obtain the multimodal features of each image frame in the training video according to the images and audio data of the image frames contained in the training video; concatenate the multimodal features of the image frames in the training video and input the concatenation result into a neural network model to obtain the prediction result output by the neural network model for each training video, the prediction result containing a start-frame prediction result and an end-frame prediction result; and compute a loss function value using the annotation result and the prediction result of the training video, adjusting the parameters of the neural network model according to the computed loss function value until the neural network model converges, so as to obtain the second classification model.
In this embodiment, the loss function value may be computed with the following formula:
[The formula appears only as an image (PCTCN2022080976-appb-000001) in the original publication; it expresses the loss function value w in terms of gt, pt and δ.]
In the formula, w is the loss function value; gt is the time of the annotation result and pt is the time of the prediction result (if the annotation result is the start-frame annotation result, the prediction result is the start-frame prediction result; if the annotation result is the end-frame annotation result, the prediction result is the end-frame prediction result); δ takes the value 1.
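Because the formula is only available as an image, the following sketch assumes the Huber (smooth-L1) form that is common for temporal regression with a δ parameter; it is consistent with the stated variables (w, gt, pt, δ = 1) but is not the publication's confirmed definition:

```python
def boundary_loss(gt: float, pt: float, delta: float = 1.0) -> float:
    """Assumed Huber-style loss between the annotated time gt and the
    predicted time pt. The publication gives the formula only as an image,
    so this concrete form is a guess consistent with the stated variables
    (w, gt, pt, delta = 1), not the confirmed definition."""
    diff = abs(gt - pt)
    if diff <= delta:
        return 0.5 * diff * diff
    return delta * (diff - 0.5 * delta)
```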
In this embodiment, after performing S103 to determine the start frame and end frame corresponding to each target frame in the video to be edited, S104 is performed to generate the clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
Specifically, when performing S104 to generate the clipped video of each target frame according to each target frame and its corresponding start frame and end frame, an optional implementation is: for each target frame, extract from the video to be edited a first video spanning from the start frame to the frame immediately before the target frame, and a second video spanning from the frame immediately after the target frame to the end frame; generate a freeze-frame video of the target frame, i.e. extend the picture of the target frame for several seconds; and splice the first video, the freeze-frame video of the target frame and the second video in sequence to generate the clipped video of the target frame.
That is to say, this embodiment obtains the freeze-frame video of the target frame by extending the target frame for several seconds, and then uses the freeze-frame video of the target frame to generate the clipped video, which can further highlight the target frame in the clipped video, thereby improving the display effect of the clipped video.
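A sketch of this splicing step using the moviepy 1.x API; the conversion from frame indices to timestamps, the 3-second freeze duration, and the output file name are assumptions made for illustration:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def make_highlight_clip(path: str, start_t: float, target_t: float,
                        end_t: float, freeze_s: float = 3.0) -> None:
    """Splice [start .. just before target] + a freeze-frame of the target
    + [just after target .. end]. Times are in seconds; the publication
    works in frames, so converting frame indices to timestamps is assumed
    to have been done by the caller."""
    video = VideoFileClip(path)
    first = video.subclip(start_t, target_t)                  # first video
    freeze = video.to_ImageClip(target_t).set_duration(freeze_s)
    second = video.subclip(target_t, end_t)                   # second video
    clip = concatenate_videoclips([first, freeze, second])
    clip.write_videofile("clip.mp4", fps=video.fps)
```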
When performing S104 to generate the clipped video of a target frame, this embodiment may add preset music to the generated clipped video, and may also add special effects such as an audience-attracting opening and closing sequence, thereby further improving the quality of the generated clipped video.
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in FIG. 2, after performing S104 of generating the clipped video of each target frame, this embodiment may further include the following:
S201: according to the image frames contained in the clipped video of each target frame, determining multiple clipped videos whose image frames overlap;
S202: merging the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
That is to say, this embodiment can also merge multiple clipped videos whose image frames overlap and retain the freeze-frame video of the last target frame, thereby generating a merged clipped video and ensuring that the highlight events in the generated merged clipped video are continuous, which further improves the accuracy of video editing.
When performing S202 to merge the determined multiple clipped videos, since each clipped video corresponds to a different target frame, this embodiment retains only the freeze-frame video of the last target frame in the merged clipped video and restores the freeze-frame videos of the other target frames to the target frames themselves.
For example, if the acquired video to be edited contains image frame 1 through image frame 8, the determined target frames are image frame 3 and image frame 6, the preset duration is 4s, the generated clipped video of image frame 3 contains image frames 2, 3, 4 and 5, and the generated clipped video of image frame 6 contains image frames 4, 5, 6 and 7, then image frame 4 and image frame 5 overlap between the two clipped videos; this embodiment therefore merges the two clipped videos, retaining only the freeze-frame video of image frame 6, and the generated merged clipped video contains image frames 2, 3, 4, 5, 6 and 7.
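A sketch of this merging rule; the representation of each clip as a 1-based (start, end, target) frame range is chosen here for illustration:

```python
def merge_overlapping(clips: list) -> list:
    """Merge clipped videos whose frame ranges overlap. Each clip is a dict
    {'start': int, 'end': int, 'target': int} of 1-based frame numbers.
    Only the last target frame of a merged group keeps its freeze-frame;
    the others revert to plain frames, as in S202."""
    merged = []
    for clip in sorted(clips, key=lambda c: c["start"]):
        if merged and clip["start"] <= merged[-1]["end"]:
            last = merged[-1]
            last["end"] = max(last["end"], clip["end"])
            last["target"] = clip["target"]  # keep only the last target's freeze
        else:
            merged.append(dict(clip))
    return merged

# The example from the text: the clips around frames 3 and 6 overlap on 4 and 5.
print(merge_overlapping([
    {"start": 2, "end": 5, "target": 3},
    {"start": 4, "end": 7, "target": 6},
]))  # -> [{'start': 2, 'end': 7, 'target': 6}]
```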
Similarly, when performing S202 to generate the merged clipped video, this embodiment may add preset music to the generated merged clipped video, and may also add special effects such as an audience-attracting opening and closing sequence, thereby further improving the quality of the generated clipped video.
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in FIG. 3, the video editing apparatus 300 of this embodiment includes:
an acquiring unit 301, configured to acquire a video to be edited and determine at least one target frame in the video to be edited;
an extracting unit 302, configured to extract an initial video of each target frame from the video to be edited;
a processing unit 303, configured to determine, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video; and
a generating unit 304, configured to generate a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
The video to be edited acquired by the acquiring unit 301 may be a game video; for example, a live game video is acquired as the video to be edited. The game type of the game video may be a role-playing game, a sports game, a multiplayer online competitive game, etc., which is not limited in this embodiment.
After acquiring the video to be edited, the acquiring unit 301 determines at least one target frame in the video to be edited, the determined target frame being an image frame corresponding to a highlight moment in the video to be edited.
Specifically, when determining at least one target frame in the video to be edited, an optional implementation that the acquiring unit 301 can adopt is: obtain the first text information of each image frame according to the image of the image frame in the acquired video to be edited, and obtain the second text information of the image frame according to the audio data of the image frame; and determine at least one target frame in the acquired video to be edited according to the first text information and the second text information of each image frame.
When determining at least one target frame in the video to be edited, the acquiring unit 301 may also first divide the video to be edited into multiple video clips of equal length, and then determine the target frames in each video clip separately.
That is to say, the acquiring unit 301 can determine the target frames in the video to be edited according to the two pieces of text information obtained from each image frame and its audio data, thereby improving the accuracy of the determined target frames.
When determining at least one target frame in the video to be edited according to the first text information and the second text information of each image frame, an optional implementation that the acquiring unit 301 can adopt is: input the first text information and the second text information of each image frame into a pre-trained first classification model to obtain the classification result output by the first classification model for each image frame; and take the image frames whose classification results meet a preset requirement as target frames.
The first classification model used by the acquiring unit 301 can output, according to the input text information, a classification result indicating whether an image frame is a target frame: a classification result of 1 indicates that the image frame is a target frame, and a classification result of 0 indicates that it is not.
When determining at least one target frame in the video to be edited according to the first text information and the second text information of each image frame, the acquiring unit 301 may also concatenate the first text information and the second text information of each image frame, compute the similarity between the concatenation result and preset information, and then take the image frames whose similarity exceeds a preset similarity threshold as target frames.
In this embodiment, after the acquiring unit 301 determines at least one target frame in the video to be edited, the extracting unit 302 extracts the initial video of each target frame from the video to be edited.
Specifically, when extracting the initial video of each target frame from the video to be edited, an optional implementation that the extracting unit 302 can adopt is: for each target frame, take a video that is extracted from the video to be edited, contains the target frame, and has a preset duration as the initial video of that target frame.
The preset duration in this embodiment is obtained by collecting statistics on the durations of existing highlight videos; for example, the average duration of existing game highlight videos may be used as the preset duration.
The initial video extracted by the extracting unit 302 from the video to be edited is a video clip that contains the target frame and whose duration is the preset duration; this embodiment does not limit the number of image frames located before and/or after the target frame in the initial video.
In this embodiment, after the extracting unit 302 extracts the initial video of each target frame from the video to be edited, the processing unit 303 determines, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video.
That is to say, the processing unit 303 screens the image frames contained in each initial video and determines the exact start frame and exact end frame corresponding to the target frame in each initial video, thereby updating the time range of the video and further improving the accuracy of the generated clipped video.
Specifically, when determining, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video, an optional implementation that the processing unit 303 can adopt is: for each initial video, obtain the multimodal features of each image frame in the initial video according to the images and audio data of the image frames contained in the initial video; concatenate the multimodal features of the image frames in the initial video and input the concatenation result into a pre-trained second classification model; and determine the start frame and the end frame in the initial video according to the output result of the second classification model.
Among the multimodal features obtained by the processing unit 303, the image features are game-character attribute features extracted from the image of an image frame, used to indicate whether the activity of the game characters in the image frame is intense; they may include health-point features (the position, quantity and change information of the health-bar indicator template, etc.), mana-point features (the position, quantity and change information of the mana-bar indicator template, etc.), and action special-effect features (whether action special effects are present). The audio features are game audio features extracted from the audio data of the image frame, used to indicate whether competitive voice activity is present in the image frame.
That is to say, the processing unit 303 can combine the image features and audio features of the image frames to determine the accurate start frame and end frame in each initial video, thereby obtaining the start frame and end frame corresponding to each target frame in the video to be edited, further improving the accuracy of the generated clipped video.
The second classification model used by the processing unit 303 contains a convolutional layer and a fully connected layer. After the concatenation result of the multimodal features of the image frames is input into the second classification model, the convolutional layer first performs convolution on the concatenation result, and the convolution result is then input into the fully connected layer for classification, so that the start frame and the end frame output by the fully connected layer are obtained.
In this embodiment, the second classification model may be pre-trained in the following way: acquire training data, the acquired training data containing multiple training videos and the annotation results of the multiple training videos, the annotation results containing a start-frame annotation result and an end-frame annotation result; for each training video, obtain the multimodal features of each image frame in the training video according to the images and audio data of the image frames contained in the training video; concatenate the multimodal features of the image frames in the training video and input the concatenation result into a neural network model to obtain the prediction result output by the neural network model for each training video, the prediction result containing a start-frame prediction result and an end-frame prediction result; and compute a loss function value using the annotation result and the prediction result of the training video, adjusting the parameters of the neural network model according to the computed loss function value until the neural network model converges, so as to obtain the second classification model.
In this embodiment, after the processing unit 303 determines the start frame and end frame corresponding to each target frame in the video to be edited, the generating unit 304 generates the clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
Specifically, when generating the clipped video of each target frame according to each target frame and its corresponding start frame and end frame, an optional implementation that the generating unit 304 can adopt is: for each target frame, extract from the video to be edited a first video spanning from the start frame to the frame immediately before the target frame, and a second video spanning from the frame immediately after the target frame to the end frame; generate a freeze-frame video of the target frame; and splice the first video, the freeze-frame video of the target frame and the second video in sequence to generate the clipped video of the target frame.
That is to say, the generating unit 304 obtains the freeze-frame video of the target frame by extending the target frame for several seconds, and then uses the freeze-frame video of the target frame to generate the clipped video, which can further highlight the target frame in the clipped video, thereby improving the display effect of the clipped video.
When generating the clipped video of a target frame, the generating unit 304 may add preset music to the generated clipped video, and may also add special effects such as an audience-attracting opening and closing sequence, thereby further improving the quality of the generated clipped video.
The video editing apparatus 300 of this embodiment may further include a merging unit 305, configured to perform the following after the generating unit 304 generates the clipped video of each target frame: according to the image frames contained in the clipped video of each target frame, determine multiple clipped videos whose image frames overlap; and merge the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
That is to say, the merging unit 305 can merge multiple clipped videos whose image frames overlap and retain the freeze-frame video of the last target frame, thereby generating a merged clipped video and ensuring that the highlight events in the generated merged clipped video are continuous, which further improves the accuracy of video editing.
When merging the determined multiple clipped videos, since each clipped video corresponds to a different target frame, the merging unit 305 retains only the freeze-frame video of the last target frame in the merged clipped video and restores the freeze-frame videos of the other target frames to the target frames themselves.
When generating the merged clipped video, the merging unit 305 may add preset music to the generated merged clipped video, and may also add special effects such as an audience-attracting opening and closing sequence, thereby further improving the quality of the generated clipped video.
In the technical solution of the present disclosure, the acquisition, storage and application of the user's personal information involved are all in compliance with the provisions of relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium and a computer program product.
FIG. 4 is a block diagram of an electronic device for the video editing method according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
As shown in FIG. 4, the device 400 includes a computing unit 401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 402 or a computer program loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 can also store the various programs and data required for the operation of the device 400. The computing unit 401, the ROM 402 and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Multiple components in the device 400 are connected to the I/O interface 405, including: an input unit 406, such as a keyboard or a mouse; an output unit 407, such as various types of displays and speakers; a storage unit 408, such as a magnetic disk or an optical disc; and a communication unit 409, such as a network card, a modem or a wireless communication transceiver. The communication unit 409 allows the device 400 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The computing unit 401 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The computing unit 401 performs the methods and processing described above, such as the video editing method. For example, in some embodiments, the video editing method may be implemented as a computer software program that is tangibly contained in a machine-readable medium, such as the storage unit 408.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the video editing method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the video editing method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein can be realized in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has: a display device (for example, a CRT (cathode-ray tube) or LCD (liquid-crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input or tactile input).
The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system and solves the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (17)

  1. A video editing method, comprising:
    acquiring a video to be edited, and determining at least one target frame in the video to be edited;
    extracting an initial video of each target frame from the video to be edited;
    determining, according to images and audio data of image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video; and
    generating a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  2. The method according to claim 1, wherein the determining at least one target frame in the video to be edited comprises:
    obtaining first text information of each image frame according to the image of the image frame in the video to be edited, and obtaining second text information of the image frame according to the audio data of the image frame; and
    determining the at least one target frame in the video to be edited according to the first text information and the second text information of each image frame.
  3. The method according to claim 2, wherein the determining the at least one target frame in the video to be edited according to the first text information and the second text information of each image frame comprises:
    inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain a classification result output by the first classification model for each image frame; and
    taking image frames whose classification results meet a preset requirement as target frames.
  4. The method according to claim 1, wherein the extracting an initial video of each target frame from the video to be edited comprises:
    for each target frame, taking a video that is extracted from the video to be edited, contains the target frame, and has a preset duration as the initial video of the target frame.
  5. The method according to claim 1, wherein the determining, according to images and audio data of image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video comprises:
    for each initial video, obtaining multimodal features of each image frame in the initial video according to the images and audio data of the image frames contained in the initial video;
    concatenating the multimodal features of the image frames and inputting the concatenation result into a pre-trained second classification model; and
    determining, according to an output result of the second classification model, the start frame and the end frame corresponding to the target frame in the initial video.
  6. The method according to claim 1, wherein the generating a clipped video of each target frame according to each target frame and its corresponding start frame and end frame comprises:
    for each target frame, extracting from the video to be edited a first video between the start frame and a frame immediately before the target frame, and a second video between a frame immediately after the target frame and the end frame;
    generating a freeze-frame video of the target frame; and
    splicing the first video, the freeze-frame video of the target frame and the second video in sequence to generate the clipped video of the target frame.
  7. The method according to claim 1, further comprising:
    after generating the clipped video of each target frame, determining, according to the image frames contained in the clipped video of each target frame, multiple clipped videos whose image frames overlap; and
    merging the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
  8. A video editing apparatus, comprising:
    an acquiring unit, configured to acquire a video to be edited and determine at least one target frame in the video to be edited;
    an extracting unit, configured to extract an initial video of each target frame from the video to be edited;
    a processing unit, configured to determine, according to images and audio data of image frames contained in each initial video, a start frame and an end frame corresponding to the target frame in each initial video; and
    a generating unit, configured to generate a clipped video of each target frame according to each target frame and its corresponding start frame and end frame.
  9. The apparatus according to claim 8, wherein, when determining at least one target frame in the video to be edited, the acquiring unit specifically performs:
    obtaining first text information of each image frame according to the image of the image frame in the video to be edited, and obtaining second text information of the image frame according to the audio data of the image frame; and
    determining the at least one target frame in the video to be edited according to the first text information and the second text information of each image frame.
  10. The apparatus according to claim 9, wherein, when determining the at least one target frame in the video to be edited according to the first text information and the second text information of each image frame, the acquiring unit specifically performs:
    inputting the first text information and the second text information of each image frame into a pre-trained first classification model to obtain a classification result output by the first classification model for each image frame; and
    taking image frames whose classification results meet a preset requirement as target frames.
  11. The apparatus according to claim 8, wherein, when extracting the initial video of each target frame from the video to be edited, the extracting unit specifically performs:
    for each target frame, taking a video that is extracted from the video to be edited, contains the target frame, and has a preset duration as the initial video of the target frame.
  12. The apparatus according to claim 8, wherein, when determining, according to the images and audio data of the image frames contained in each initial video, the start frame and the end frame corresponding to the target frame in each initial video, the processing unit specifically performs:
    for each initial video, obtaining multimodal features of each image frame in the initial video according to the images and audio data of the image frames contained in the initial video;
    concatenating the multimodal features of the image frames and inputting the concatenation result into a pre-trained second classification model; and
    determining, according to an output result of the second classification model, the start frame and the end frame corresponding to the target frame in the initial video.
  13. The apparatus according to claim 8, wherein, when generating the clipped video of each target frame according to each target frame and its corresponding start frame and end frame, the generating unit specifically performs:
    for each target frame, extracting from the video to be edited a first video between the start frame and a frame immediately before the target frame, and a second video between a frame immediately after the target frame and the end frame;
    generating a freeze-frame video of the target frame; and
    splicing the first video, the freeze-frame video of the target frame and the second video in sequence to generate the clipped video of the target frame.
  14. The apparatus according to claim 8, further comprising a merging unit, which specifically performs:
    after the generating unit generates the clipped video of each target frame, determining, according to the image frames contained in the clipped video of each target frame, multiple clipped videos whose image frames overlap; and
    merging the determined multiple clipped videos, retaining the freeze-frame video of the last target frame, to generate a merged clipped video.
  15. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.
  16. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-7.
  17. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
PCT/CN2022/080976 2021-07-13 2022-03-15 Video editing method and apparatus, and electronic device and readable storage medium WO2023284316A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110790261.8 2021-07-13
CN202110790261.8A CN113691864A (en) 2021-07-13 2021-07-13 Video clipping method, video clipping device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
WO2023284316A1 true WO2023284316A1 (en) 2023-01-19

Family

ID=78577191

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/080976 WO2023284316A1 (en) 2021-07-13 2022-03-15 Video editing method and apparatus, and electronic device and readable storage medium

Country Status (2)

Country Link
CN (1) CN113691864A (en)
WO (1) WO2023284316A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113691864A (en) * 2021-07-13 2021-11-23 北京百度网讯科技有限公司 Video clipping method, video clipping device, electronic equipment and readable storage medium
CN114339075A (en) * 2021-12-20 2022-04-12 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN115022732B (en) * 2022-05-25 2023-11-03 阿里巴巴(中国)有限公司 Video generation method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190354762A1 (en) * 2018-05-17 2019-11-21 Chandru Bolaki Method and device for time lapsed digital video recording and navigation through the same
CN111428660A (en) * 2020-03-27 2020-07-17 腾讯科技(深圳)有限公司 Video editing method and device, storage medium and electronic device
CN111726525A (en) * 2020-06-19 2020-09-29 维沃移动通信有限公司 Video recording method, video recording device, electronic equipment and storage medium
CN111988638A (en) * 2020-08-19 2020-11-24 北京字节跳动网络技术有限公司 Method and device for acquiring spliced video, electronic equipment and storage medium
CN112380929A (en) * 2020-10-30 2021-02-19 北京字节跳动网络技术有限公司 Highlight segment obtaining method and device, electronic equipment and storage medium
CN112929744A (en) * 2021-01-22 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN113691864A (en) * 2021-07-13 2021-11-23 北京百度网讯科技有限公司 Video clipping method, video clipping device, electronic equipment and readable storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105681894A (en) * 2016-01-04 2016-06-15 努比亚技术有限公司 Device and method for displaying video file
CN106162223B (en) * 2016-05-27 2020-06-05 北京奇虎科技有限公司 News video segmentation method and device
CN107172487A (en) * 2017-06-09 2017-09-15 成都索贝数码科技股份有限公司 A kind of method that Highlight is extracted by camera lens playback feature
CN108833969A (en) * 2018-06-28 2018-11-16 腾讯科技(深圳)有限公司 A kind of clipping method of live stream, device and equipment
CN109089128A (en) * 2018-07-10 2018-12-25 武汉斗鱼网络科技有限公司 A kind of method for processing video frequency, device, equipment and medium
CN109922373B (en) * 2019-03-14 2021-09-28 上海极链网络科技有限公司 Video processing method, device and storage medium
CN110505519B (en) * 2019-08-14 2021-12-03 咪咕文化科技有限公司 Video editing method, electronic equipment and storage medium
CN111343496A (en) * 2020-02-21 2020-06-26 北京字节跳动网络技术有限公司 Video processing method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190354762A1 (en) * 2018-05-17 2019-11-21 Chandru Bolaki Method and device for time lapsed digital video recording and navigation through the same
CN111428660A (en) * 2020-03-27 2020-07-17 腾讯科技(深圳)有限公司 Video editing method and device, storage medium and electronic device
CN111726525A (en) * 2020-06-19 2020-09-29 维沃移动通信有限公司 Video recording method, video recording device, electronic equipment and storage medium
CN111988638A (en) * 2020-08-19 2020-11-24 北京字节跳动网络技术有限公司 Method and device for acquiring spliced video, electronic equipment and storage medium
CN112380929A (en) * 2020-10-30 2021-02-19 北京字节跳动网络技术有限公司 Highlight segment obtaining method and device, electronic equipment and storage medium
CN112929744A (en) * 2021-01-22 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN113691864A (en) * 2021-07-13 2021-11-23 北京百度网讯科技有限公司 Video clipping method, video clipping device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113691864A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
WO2023284316A1 (en) Video editing method and apparatus, and electronic device and readable storage medium
US20220147822A1 (en) Training method and apparatus for target detection model, device and storage medium
US20210201886A1 (en) Method and device for dialogue with virtual object, client end, and storage medium
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN111145732B (en) Processing method and system after multi-task voice recognition
US11816891B2 (en) Video recognition method and apparatus, electronic device and storage medium
US11800042B2 (en) Video processing method, electronic device and storage medium thereof
US20220301108A1 (en) Image quality enhancing
JP7267379B2 (en) Image processing method, pre-trained model training method, device and electronic equipment
WO2023005253A1 (en) Method, apparatus and system for training text recognition model framework
CN113055751B (en) Data processing method, device, electronic equipment and storage medium
CN113641807A (en) Training method, device, equipment and storage medium of dialogue recommendation model
WO2023159819A1 (en) Visual processing and model training methods, device, storage medium and program product
US20220207427A1 (en) Method for training data processing model, electronic device and storage medium
CN116935287A (en) Video understanding method and device
JP2022116231A (en) Training method of organism detection model, device, electronic apparatus and storage medium
US10936823B2 (en) Method and system for displaying automated agent comprehension
CN116778040B (en) Face image generation method based on mouth shape, training method and device of model
WO2023109103A1 (en) Video editing method and apparatus, electronic device, and medium
US20220335316A1 (en) Data annotation method and apparatus, electronic device and readable storage medium
CN113873323B (en) Video playing method, device, electronic equipment and medium
KR20210081308A (en) Method, device, electronic equipment and storage medium for video processing
CN113327311A (en) Virtual character based display method, device, equipment and storage medium
JP7556063B2 (en) Video editing method, device, electronic device, and medium
US20220358929A1 (en) Voice activity detection method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22840961

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22840961

Country of ref document: EP

Kind code of ref document: A1