WO2021120685A1 - Video generation method and apparatus, and computer system - Google Patents

Video generation method and apparatus, and computer system Download PDF

Info

Publication number
WO2021120685A1
WO2021120685A1 (PCT/CN2020/111945)
Authority
WO
WIPO (PCT)
Prior art keywords
video
preset
target
initial
classification
Prior art date
Application number
PCT/CN2020/111945
Other languages
French (fr)
Chinese (zh)
Inventor
殷俊
赵筠
李勇
任宇
于思远
Original Assignee
苏宁云计算有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏宁云计算有限公司
Priority to CA3164771A1
Publication of WO2021120685A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/10: Geometric effects
    • G06T 15/20: Perspective computation
    • G06T 15/205: Image-based rendering

Definitions

  • The present invention relates to the field of computer vision technology, and in particular to a method, apparatus and computer system for generating video.
  • In the image-to-video conversion method, the product display images provided by the merchant are cut out and laid out onto a preset image background to form product images; template files such as video templates and background music are obtained from the platform's existing material library, and product videos are generated in batches from these templates.
  • Because the style and layout of such videos depend entirely on the pre-configured templates in the material library, the generated videos are similar in style and limited in layout, cannot visually present the actual state of the product to consumers, and have limited expressive power.
  • The main purpose of the present invention is to provide a video generation method that automatically generates a target video from an initial video.
  • To this end, the present invention provides a video generation method, the method including:
  • splicing, according to preset splicing parameters, the video segments corresponding to the target video classification to obtain the target video.
  • Segmenting the initial video into video segments according to the preset video segmentation method includes using a preset shot boundary detection method to determine the shot boundaries contained in the initial video, and dividing the initial video into video segments according to the determined boundaries.
  • The shot boundaries include abrupt shots and gradual shots of the initial video; dividing the initial video into video segments according to the determined boundaries includes removing the abrupt shots and the gradual shots from the initial video to obtain a video clip set composed of the clips remaining after removal.
  • The video is composed of continuous frames, and determining the abrupt and gradual shots includes computing the difference between each frame and its adjacent frames: a frame whose difference exceeds a first preset threshold is an abrupt frame (consecutive abrupt frames form an abrupt shot); a frame whose difference lies between the first and a second preset threshold is a potential gradual frame; and when the number of consecutive potential gradual frames exceeds a third preset threshold, those frames are gradual frames (consecutive gradual frames form a gradual shot).
  • Inputting the video clips into a preset model and determining each clip's confidence for all preset video classifications includes sampling the clip according to a preset sampling method to obtain at least two sampled frames, preprocessing the sampled frames, and inputting the preprocessed frames into the preset model to obtain the clip's confidence for all preset video classifications.
  • Inputting the preprocessed sampled frames into the preset model includes extracting the spatio-temporal features they contain and inputting those features into the preset model.
  • The preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
  • The method further includes receiving a target duration; the video segments corresponding to the target video classification are then determined according to the target duration, the target video classification, each segment's confidence for all preset video classifications, and the segment durations.
  • A video generation apparatus is also provided, the apparatus including:
  • a receiving module for receiving an initial video and a target video classification;
  • a segmentation module for segmenting the initial video into video segments according to a preset video segmentation method;
  • a processing module for inputting the video segments into a preset model and determining each segment's confidence for all preset video classifications;
  • a matching module for determining the video segments corresponding to the target video classification according to the target classification and those confidences;
  • a splicing module for splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
  • This application further provides a computer system, which includes one or more processors and a memory associated with the one or more processors; the memory stores program instructions that, when read and executed by the one or more processors, receive an initial video and a target video classification, segment the initial video, classify the segments with the preset model, select the segments matching the target classification, and splice them according to preset splicing parameters to obtain the target video.
  • The present invention discloses a video generation method: an initial video and a target video classification are received; the initial video is divided into video segments according to a preset segmentation method; the segments are input into a preset model to obtain each segment's confidence for all preset video classifications; the segments corresponding to the target classification are determined from the target classification and those confidences; and the selected segments are spliced according to preset splicing parameters to obtain the target video. This generates a target video that meets the requirements from the initial video and ensures the timeliness and accuracy of video generation.
  • The invention further proposes using a preset shot boundary detection method to determine the shot boundaries contained in the initial video and dividing the initial video into segments accordingly, the boundaries including abrupt and gradual shots that are removed from the initial video, with the remaining clips forming the video clip set. This ensures the accuracy of video segmentation.
  • This application also discloses sampling each video clip according to a preset sampling method to obtain at least two sampled frames, preprocessing them, and inputting them into the preset model to obtain the clip's confidence for all preset video classifications; the preset classification with the largest confidence is taken as the clip's classification, and that largest confidence as the clip's confidence. The segments corresponding to the target classification and their confidences are then determined from the classifications and confidences of all clips, which ensures the accuracy of the confidence computation.
  • Fig. 1 is a schematic diagram of the model network structure provided by an embodiment of this application.
  • Fig. 2 is a flowchart of shot segmentation provided by an embodiment of this application.
  • Fig. 3 is a flowchart of model training provided by an embodiment of this application.
  • Fig. 4 is a flowchart of the method provided by an embodiment of this application.
  • Fig. 5 is a structural diagram of the apparatus provided by an embodiment of this application.
  • Fig. 6 is a structural diagram of the computer system provided by an embodiment of this application.
  • The two methods commonly used in the prior art to generate product videos each have certain limitations.
  • Manual editing has high labor cost and low efficiency and cannot meet the actual demand for generating product videos at scale; the image-to-video conversion method is more efficient, but the available video layouts and styles are few and fixed, so its expressive power is limited.
  • To solve this, this application proposes segmenting the video uploaded by the user with a preset segmentation method to obtain video segments, classifying each segment with a preset classification model to obtain its confidence, and, according to the target video classification selected by the user, splicing the segments of that classification whose confidence meets a preset condition to obtain the target video. This generates a target video that meets the requirements from the user's uploaded video while ensuring the timeliness of video generation.
  • To classify the video clips obtained by segmentation, the classification model must be trained in advance; specifically, the MFnet three-dimensional convolutional neural network model can be used as the classification model.
  • MFnet is a lightweight deep learning model: compared with recent deep learning models such as I3D and SlowFast networks, it is more compact, requires fewer floating-point operations (FLOPs), and tests better on the test data set.
  • The training process includes importing the training data set, which can be generated as follows:
  • obtain a preset number of product videos and create a corresponding folder for each video;
  • divide the clips contained in each video into categories according to the content they present (including but not limited to product appearance, product usage scene, and product content introduction) and edit them manually by category;
  • densely sample the folder of each video and normalize the samples to N×C×H×W, where N is the number of sampled frames per sub-clip folder, C the RGB channels of each frame, H the preset frame height, and W the preset frame width; preferably, N is at least 8.
  • Figure 1 shows the network structure of the model. It contains a 3D CNN that extracts the three-dimensional convolutional features of each sample; these spatio-temporal features capture the motion information of objects in the video stream, such as the movement trend of the product and changes in the background.
  • 3D pooling is the model's pooling layer: it pools the output of the 3D CNN and feeds the result to the 3D MF-Unit layers, which perform different convolution operations such as 1×1×1, 3×3×3 and 1×3×3.
  • The global pooling layer retains the main features of its input while reducing unnecessary parameters, and the FC layer is a fully connected layer that outputs the confidence of each video segment for each category.
  • The model classifies samples obtained by dense sampling of a single shot: its classification accuracy reaches 95.92%, the single model is only 29.6 MB, and the forward inference time for a densely sampled single-shot video is 330 ms, so it is both accurate and fast.
  • Once the preset model is obtained, videos can be generated with it. As shown in Figure 2, the generation process includes:
  • Step 1: receive the initial video input by the user;
  • Step 2: perform shot boundary detection on the initial video, segment the video according to the detection results, remove redundant segments, and obtain video clips;
  • The shot boundary detection process includes: dividing each frame of the initial video into a preset number of sub-blocks using the same preset method, computing the sub-histogram of each sub-block, and computing from the sub-histograms the histogram difference between sub-blocks at the same position in adjacent frames, where the adjacent frames of a frame are its previous and next frames.
  • When the difference exceeds a first preset threshold T_H, the corresponding sub-blocks differ too much between adjacent frames; when the number of such sub-blocks in a frame exceeds a second preset threshold, the frame is judged an abrupt frame, and consecutive abrupt frames form an abrupt shot.
  • Step 3: sample the video clips, input the sampling results into the preset model, and obtain the category and confidence corresponding to each clip;
  • the clips are randomly and densely sampled. The random dense sampling process includes randomly initializing a sampling point on the clip and, taking that point as the starting point and the end of the clip as the end point, uniformly sampling N frames and preprocessing the sampled frames to meet the input size required by the preset model.
  • The preprocessed sampled frames are then input into the preset model to obtain the confidence of the clip containing those frames for all categories.
  • Step 4: according to the target category and target duration selected by the user, splice the video clips corresponding to the target category to generate the target video;
  • the clips are sorted by their confidence for the corresponding category (for example, appearance display), and clips meeting the requirements are screened. Specific screening rules can include:
  • when the duration of the clip with the highest confidence already satisfies the target duration, using that clip directly as the target video;
  • otherwise, selecting the next n clips T_j in descending order of confidence, where j ∈ [1, n], until the total duration satisfies the target duration constraint, T_2 - T_1 representing the target duration;
  • when the total duration of the n+1 shots selected by confidence score exceeds the maximum duration T_2, trimming the longest of them at head and tail, according to each shot's duration, until the total duration meets the target duration.
  • Step 5: splice the clips obtained in step 4 in the time order of the initial video to obtain the target video.
  • The generated target video can be stored in a video database for reuse when next needed, or used to continue training the model.
  • This application provides a method for generating a video. As shown in Fig. 4, the method includes: receiving an initial video and a target video classification, and segmenting the initial video into video segments according to a preset video segmentation method.
  • Segmentation can use a preset shot boundary detection method to determine the shot boundaries contained in the initial video, the boundaries including abrupt and gradual shots of the initial video, which are removed so that the remaining clips form the video clip set.
  • The video is composed of continuous frames; determining the abrupt and gradual shots proceeds by comparing each frame with its adjacent frames as described above, with runs of potential gradual frames longer than a preset threshold judged gradual frames, consecutive gradual frames forming a gradual shot.
  • Classifying the segments includes sampling each clip according to a preset sampling method to obtain at least two sampled frames (preferably at least eight), preprocessing them, and inputting the preprocessed frames into the preset model to obtain the clip's confidence for all preset video classifications; inputting the preprocessed frames includes extracting their spatio-temporal features and feeding those features to the model.
  • The method further includes receiving a target duration; the video segments corresponding to the target video classification are determined accordingly and spliced according to preset splicing parameters to obtain the target video.
  • This application also provides a video generation apparatus. As shown in Fig. 5, the apparatus includes:
  • a receiving module 510 for receiving the initial video and the target video classification;
  • a segmentation module 520 for segmenting the initial video into video segments according to a preset video segmentation method;
  • a processing module 530 for inputting the video segments into a preset model and determining each segment's confidence for all preset video classifications;
  • a matching module 540 for determining the video segments corresponding to the target video classification according to the target classification and those confidences;
  • a splicing module 550 for splicing the segments corresponding to the target classification according to preset splicing parameters to obtain the target video.
  • The segmentation module 520 may also determine, with a preset shot boundary detection method, the shot boundaries contained in the initial video and divide the initial video into segments accordingly; the boundaries include abrupt and gradual shots of the initial video, which the module removes to obtain the video clip set composed of the clips remaining after removal.
  • The video is composed of continuous frames. The segmentation module 520 may also compute the difference between each frame and its adjacent frames; when the difference exceeds a first preset threshold the frame is judged abrupt (consecutive abrupt frames forming an abrupt shot), when it lies between the first and second preset thresholds the frame is judged a potential gradual frame, and when the number of consecutive potential gradual frames exceeds a third preset threshold they are judged gradual frames (consecutive gradual frames forming a gradual shot).
  • The processing module 530 may also sample each clip according to a preset sampling method to obtain at least two sampled frames, preprocess them, and input the preprocessed frames into the preset model to obtain the clip's confidence for all preset video classifications; it may further extract the spatio-temporal features of the preprocessed frames and feed them to the preset model. The preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
  • The receiving module 510 may also receive a target duration, and the matching module 540 may then determine the segments corresponding to the target classification according to the target duration, the target classification, each segment's confidence for all preset classifications, and the segment durations.
  • A further embodiment of this application provides a computer system including one or more processors and a memory associated with the one or more processors; the memory stores program instructions that, when read and executed by the one or more processors, perform the following operations: receive an initial video and a target video classification; segment the initial video into video segments according to a preset video segmentation method; input the segments into a preset model to determine each segment's confidence for all preset video classifications; determine the segments corresponding to the target classification; and splice them according to preset splicing parameters to obtain the target video.
  • Fig. 6 exemplarily shows the architecture of the computer system, which may include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520, communicatively connected through a communication bus 1530.
  • The processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to realize the technical solutions provided in this application.
  • The memory 1520 may be implemented as ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a basic input/output system (BIOS) for controlling its low-level operation.
  • A web browser 1523, a data storage management system 1524, and an icon font processing system 1525 may also be stored; the icon font processing system 1525 may be the application program that implements the steps of the embodiments of this application, with the related program code stored in the memory 1520 and called and executed by the processor 1510.
  • The input/output interface 1513 connects input/output modules to realize information input and output. The input/output modules may be configured in the device as components (not shown in the figure) or externally connected to provide the corresponding functions; input devices may include a keyboard, mouse, touch screen, microphone, and various sensors, and output devices may include a display, speaker, vibrator, and indicator lights.
  • The network interface 1514 connects a communication module (not shown in the figure) to realize communication between this device and other devices; the communication module may communicate by wired means (such as USB or a network cable) or wireless means (such as a mobile network, WiFi, or Bluetooth).
  • The bus 1530 includes a path that transmits information between the components of the device (for example, the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
  • The computer system 1500 may also obtain information on specific receiving conditions from a virtual resource object receiving condition information database 1541 for condition judgment.
  • Although only the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, and the bus 1530 are shown, in specific implementations the device may also include other components necessary for normal operation; conversely, it may include only the components necessary to implement the solution of this application, and not necessarily all components shown in the figure.

Abstract

A video generation method and apparatus, and a computer system (1500). The method comprises: receiving an initial video and a target video classification (410); segmenting the initial video into video segments according to a preset video segmentation method (420); inputting the video segments into a preset model to determine the confidence of each of the video segments corresponding to all preset video classifications (430); determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each of the video segments corresponding to all preset video classifications (440); and stitching the video segments corresponding to the target video classification according to preset stitching parameters to obtain a target video (450). A target video that meets the requirements is automatically generated from the initial video, thus ensuring the timeliness and accuracy of video generation.

Description

Video generation method and apparatus, and computer system
Technical field
The present invention relates to the field of computer vision technology, and in particular to a method, apparatus and computer system for generating video.
Background art
With the accelerating pace of life, consumers want to obtain product information more intuitively. The traditional method of presenting a product with a fixed set of product images can no longer satisfy e-commerce platforms' need to display product characteristics and help consumers make purchasing decisions; short product display videos, which show product functions or actual usage effects, have become the mainstream of product promotion for major e-commerce companies. However, the vast number of product videos uploaded by merchants and other users are of uneven quality and varying length and cannot meet the platforms' delivery requirements.
In the prior art, product video generation methods fall into two categories: traditional manual editing and image-to-video conversion. In the traditional manual method, the uploaded original video is manually segmented into shots according to scene content, target material, and so on, and the segments that meet the delivery standard are manually screened and spliced into a creative short product video that meets the user's needs. This places high technical demands on the operator, and manual operation is slow and subjective, so it cannot be guaranteed to meet video delivery demand.
In the image-to-video conversion method, the product display images provided by the merchant are cut out and laid out onto a preset image background to form product images, and template files such as video templates and background music are obtained from the platform's existing material library; product videos are then generated in batches from these templates. Although this enables large-scale generation, the style and layout of the videos depend entirely on the pre-configured templates, so the generated videos are similar in style and limited in layout, cannot visually present the actual state of the product to consumers, and have limited expressive power.
Summary of the invention
To overcome the shortcomings of the prior art, the main purpose of the present invention is to provide a video generation method that automatically generates a target video from an initial video.
To achieve the above objective, in a first aspect the present invention provides a video generation method, the method including:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video segments into a preset model, and determining the confidence of each video segment for all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for all preset video classifications;
splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
In some embodiments, segmenting the initial video into video segments according to the preset video segmentation method includes:
using a preset shot boundary detection method to determine the shot boundaries contained in the initial video;
dividing the initial video into video segments according to the determined shot boundaries.
In some embodiments, the shot boundaries include abrupt shots and gradual shots of the initial video, and dividing the initial video into video segments according to the determined shot boundaries includes:
removing the abrupt shots and the gradual shots from the initial video to obtain a video clip set composed of the video clips remaining after removal.
In some embodiments, the video is composed of continuous frames, and the process of determining the abrupt shots and the gradual shots includes:
computing the degree of difference between each frame and its adjacent frames;
when the degree of difference exceeds a first preset threshold, judging the frame to be an abrupt frame, an abrupt shot being composed of consecutive abrupt frames;
when the degree of difference lies between the first preset threshold and a second preset threshold, judging the frame to be a potential gradual frame;
when the number of consecutive potential gradual frames exceeds a third preset threshold, judging the potential gradual frames to be gradual frames, a gradual shot being composed of consecutive gradual frames.
In some embodiments, inputting the video segments into the preset model and determining the confidence of each segment for all preset video classifications includes:
sampling the video segment according to a preset sampling method to obtain at least two sampled frames corresponding to the segment;
preprocessing the sampled frames, and inputting the preprocessed frames into the preset model to obtain the confidence of the video segment for all the preset video classifications.
In some embodiments, inputting the preprocessed sampled frames into the preset model includes:
extracting the spatio-temporal features contained in the preprocessed sampled frames, and inputting the spatio-temporal features into the preset model.
In some embodiments, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
In some embodiments, the method further includes receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and the confidences includes:
determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence of each video segment for all preset video classifications, and the durations of the video segments.
In a second aspect, a video generation apparatus is provided, the apparatus including:
a receiving module for receiving an initial video and a target video classification;
a segmentation module for segmenting the initial video into video segments according to a preset video segmentation method;
a processing module for inputting the video segments into a preset model and determining the confidence of each video segment for all preset video classifications;
a matching module for determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for all preset video classifications;
a splicing module for splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
In a third aspect, this application provides a computer system, the system including:
one or more processors;
and a memory associated with the one or more processors, the memory storing program instructions that, when read and executed by the one or more processors, perform the following operations:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video segments into a preset model, and determining the confidence of each video segment for all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for all preset video classifications;
splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
The beneficial effects achieved by the present invention are as follows.
The present invention discloses a video generation method: an initial video and a target video classification are received; the initial video is divided into video segments according to a preset video segmentation method; the segments are input into a preset model to obtain each segment's confidence for all preset video classifications; the segments corresponding to the target classification are determined from the target classification and those confidences; and the selected segments are spliced according to preset splicing parameters to obtain the target video. This generates a target video that meets the requirements from the initial video, and ensures the timeliness and accuracy of video generation.
The present invention further proposes using a preset shot boundary detection method to determine the shot boundaries contained in the initial video and dividing the initial video into video segments according to the determined boundaries; it further proposes that the shot boundaries include abrupt shots and gradual shots of the initial video, and that segmentation includes removing the abrupt and gradual shots from the initial video to obtain a video clip set composed of the clips remaining after removal. This ensures the accuracy of video segmentation.
This application also discloses sampling each video clip according to a preset sampling method to obtain at least two sampled frames, preprocessing the frames and inputting them into the preset model to obtain the clip's confidence for all preset video classifications; the preset classification with the largest confidence is taken as the clip's classification, and that largest confidence as the clip's confidence; the segments corresponding to the target classification and their confidences are then determined from the classifications and confidences of all clips. This ensures the accuracy of the confidence computation.
Not all products of the present invention need to achieve all of the above effects.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is a schematic diagram of the model network structure provided by an embodiment of this application;
Fig. 2 is a flowchart of shot segmentation provided by an embodiment of this application;
Fig. 3 is a flowchart of model training provided by an embodiment of this application;
Fig. 4 is a flowchart of the method provided by an embodiment of this application;
Fig. 5 is a structural diagram of the apparatus provided by an embodiment of this application;
Fig. 6 is a structural diagram of the computer system provided by an embodiment of this application.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
As described in the background art, the two methods commonly used in the prior art to generate product videos each have certain limitations. Manual editing has high labor cost and low efficiency and cannot meet the actual demand for generating product videos at scale; the image-to-video conversion method is more efficient, but the available video layouts and styles are few and fixed, so its expressive power is limited.
To solve these technical problems, this application proposes segmenting the video uploaded by the user with a preset segmentation method to obtain video segments, classifying each segment with a preset classification model to obtain its confidence, and, according to the target video classification selected by the user, splicing the segments of that classification whose confidence meets a preset condition to obtain the target video. This generates a target video that meets the requirements from the user's uploaded video while ensuring the timeliness of video generation.
Embodiment one
To classify the video clips obtained by segmentation, the classification model must be trained in advance; specifically, the MFnet three-dimensional convolutional neural network model can be used as the classification model. MFnet is a lightweight deep learning model: compared with recent deep learning models such as I3D and SlowFast networks, it is more compact, requires fewer floating-point operations (FLOPs), and tests better on the test data set.
The training process includes:
110. Import the training data set.
The training data set can be generated as follows:
111. Obtain a preset number of product videos, and create a corresponding folder for each video.
112. Divide the clips contained in each video into categories according to the content they present, the categories including but not limited to product appearance, product usage scene, and product content introduction, and edit them manually by category.
113. In the folder of each video, create a main folder for each category, labeled with that category; each main folder contains one or more sub-clip folders of the video for that category, and each sub-clip folder stores one or more image frames of the corresponding video clip.
114. Densely sample the folder of each video, and normalize the samples to N×C×H×W, where N is the number of sampled frames per sub-clip folder, C the RGB channels of each frame, H the preset frame height, and W the preset frame width; preferably, N is at least 8.
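The following Python sketch illustrates this normalization step, assuming OpenCV and NumPy; the 224×224 frame size, the JPEG file layout, and N=8 are illustrative assumptions rather than values fixed by the application:

    import glob
    import cv2
    import numpy as np

    def load_clip_sample(clip_dir, n_frames=8, height=224, width=224):
        """Uniformly pick n_frames images from a sub-clip folder and return
        them as an N x C x H x W float32 array scaled to [0, 1]."""
        paths = sorted(glob.glob(clip_dir + "/*.jpg"))
        idx = np.linspace(0, len(paths) - 1, n_frames).astype(int)
        frames = []
        for i in idx:
            img = cv2.imread(paths[i])                  # H x W x 3, BGR
            img = cv2.resize(img, (width, height))      # enforce preset H, W
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # C = RGB channels
            frames.append(img.astype(np.float32) / 255.0)
        sample = np.stack(frames)                       # N x H x W x C
        return sample.transpose(0, 3, 1, 2)             # N x C x H x W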
120. Use the training data set to train the MFnet three-dimensional convolutional neural network model to obtain the preset model.
Figure 1 shows the network structure of the model. It contains a 3D CNN that extracts the three-dimensional convolutional features of each sample; these features are spatio-temporal, capturing the motion information of objects in the video stream, such as the movement trend of the product and changes in the background.
3D pooling is the model's pooling layer: it pools the output of the 3D CNN and feeds the result to the 3D MF-Unit layers, which perform different convolution operations such as 1×1×1, 3×3×3, and 1×3×3.
The global pooling layer retains the main features of its input while reducing unnecessary parameters.
The FC layer is a fully connected layer that outputs the confidence of each video segment for each category.
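As a rough illustration only, the sketch below mirrors that layer layout in PyTorch (3D convolution, 3D pooling, stacked convolution units, global pooling, fully connected output). It is a simplified stand-in, not the actual MFnet definition: the channel widths and the plain convolution blocks standing in for the 3D MF-Units are assumptions.

    import torch
    import torch.nn as nn

    class Simple3DClassifier(nn.Module):
        """Simplified stand-in for the 3D-CNN classifier described above."""
        def __init__(self, num_classes):
            super().__init__()
            self.stem = nn.Sequential(              # "3DCNN": spatio-temporal features
                nn.Conv3d(3, 16, kernel_size=3, padding=1),
                nn.BatchNorm3d(16),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=2),        # "3D pooling"
            )
            self.units = nn.Sequential(             # crude stand-in for 3D MF-Units
                nn.Conv3d(16, 32, kernel_size=(1, 1, 1)),
                nn.ReLU(inplace=True),
                nn.Conv3d(32, 32, kernel_size=(3, 3, 3), padding=1),
                nn.ReLU(inplace=True),
                nn.Conv3d(32, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
                nn.ReLU(inplace=True),
            )
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool3d(1),            # "Global Pool"
                nn.Flatten(),
                nn.Linear(64, num_classes),         # FC layer: per-category scores
            )

        def forward(self, x):                       # x: batch x C x N x H x W
            logits = self.head(self.units(self.stem(x)))
            return torch.softmax(logits, dim=1)     # per-category confidences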
Using the model, a test set of 56 product short videos was tested; the results are shown in Table 1.
[Table 1: test results (the table image is not reproduced in the source text)]
The model can classify samples obtained by dense sampling of a single shot. On the 1119 test samples of the above video data set, the classification accuracy reached 95.92%; the single model is only 29.6 MB, and the forward inference time for a densely sampled single-shot video is 330 ms, so the model is both accurate and fast.
Once the preset model is obtained, videos can be generated with it. As shown in Figure 2, the generation process includes:
Step 1: receive the initial video input by the user.
Step 2: perform shot boundary detection on the initial video, segment the video according to the detection results, remove redundant segments, and obtain video clips.
As shown in Figure 3, the shot boundary detection process includes:
First, each frame of the initial video is divided into a preset number of sub-blocks using the same preset method, the sub-histogram of each sub-block is computed, and the histogram difference between sub-blocks at the same position in adjacent frames is computed from the sub-histograms; the adjacent frames of a frame are its previous and next frames. When the difference exceeds a first preset threshold T_H, the corresponding sub-blocks differ too much between adjacent frames; when the number of such sub-blocks in a frame exceeds a second preset threshold, the frame is judged an abrupt frame, and consecutive abrupt frames constitute an abrupt shot. A frame whose difference lies between the first preset threshold T_H and a third preset threshold T_L is regarded as a potential starting frame; when the differences of the frames that follow it also lie between T_L and T_H and this lasts longer than a fourth preset threshold, those consecutive frames are judged gradual frames and constitute a gradual shot. The shots that remain after the gradual and abrupt shots are removed are regarded as normal shots.
To guarantee the quality of the generated video, normal shots shorter than a fifth preset threshold are also removed; the required video clip set is then obtained.
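A condensed sketch of this detector follows, using OpenCV. The 4×4 block grid, 32-bin histograms, and the concrete threshold values are illustrative assumptions (the embodiment fixes only the thresholding structure), and the gradual-shot handling is simplified to a running counter:

    import cv2
    import numpy as np

    def block_hist_diffs(prev_gray, gray, grid=4, bins=32):
        """Per-block histogram differences between co-located sub-blocks
        of two consecutive frames, normalized by block size."""
        h, w = gray.shape
        bh, bw = h // grid, w // grid
        diffs = []
        for r in range(grid):
            for c in range(grid):
                a = prev_gray[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                b = gray[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                ha = cv2.calcHist([a], [0], None, [bins], [0, 256])
                hb = cv2.calcHist([b], [0], None, [bins], [0, 256])
                diffs.append(float(np.abs(ha - hb).sum()) / a.size)
        return np.array(diffs)

    def label_frames(video_path, t_h=0.5, t_l=0.2, block_ratio=0.5, min_run=5):
        """Label each frame 'abrupt', 'gradual', or 'normal'."""
        cap = cv2.VideoCapture(video_path)
        labels, prev, run = [], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is None:
                labels.append("normal")
            else:
                d = block_hist_diffs(prev, gray)
                if (d > t_h).mean() > block_ratio:      # many blocks changed a lot
                    labels.append("abrupt")
                    run = 0
                elif ((d > t_l) & (d <= t_h)).mean() > block_ratio:
                    run += 1                            # potential gradual frame
                    # a full implementation would relabel the whole run once it qualifies
                    labels.append("gradual" if run >= min_run else "normal")
                else:
                    labels.append("normal")
                    run = 0
            prev = gray
        cap.release()
        return labels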
Step 3: sample the video clips, input the sampling results into the preset model, and obtain the category and confidence corresponding to each clip.
First, the above video clips are randomly and densely sampled in the time order of the video.
The random dense sampling process includes:
randomly initializing a sampling point on the video clip and, taking the sampling point as the starting point and the end of the clip as the end point, uniformly sampling N frames and preprocessing the sampled frames to meet the input size required by the preset model.
The preprocessed sampled frames are then input into the preset model to obtain the confidence, for all categories, of the video clip containing those frames.
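A sketch of this sampling-plus-inference step, under the assumption that the clip is available as a list of decoded frames, that a preprocessing callable in the style of the earlier load/normalize helper is passed in, and that the model is a PyTorch module like the earlier sketch:

    import random
    import numpy as np
    import torch

    def random_dense_sample(frames, n=8):
        """Randomly initialize a start point, then uniformly sample n frames
        from the start point to the end of the clip."""
        start = random.randint(0, max(0, len(frames) - n))
        idx = np.linspace(start, len(frames) - 1, n).astype(int)
        return [frames[i] for i in idx]

    def clip_confidences(model, frames, preprocess):
        """Return the clip's confidence for every preset category."""
        sample = preprocess(random_dense_sample(frames))   # N x C x H x W
        batch = torch.from_numpy(sample).unsqueeze(0)      # 1 x N x C x H x W
        batch = batch.permute(0, 2, 1, 3, 4).contiguous()  # 1 x C x N x H x W
        model.eval()
        with torch.no_grad():
            return model(batch)[0]                         # vector of confidences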
Step 4: according to the target category and target duration selected by the user, splice the video clips corresponding to the target category to generate the target video.
For example, when the user wants an appearance display video of the current product, the clips are sorted by their confidence for the appearance display category, and clips meeting the requirements are screened.
Specific screening rules can include:
When the duration T_i of the clip with the highest confidence already satisfies the target duration, that clip is used directly as the target video.
When the duration T_i of the clip with the highest confidence does not satisfy the target duration, the next n clips T_j are selected in descending order of confidence, where j ∈ [1, n], until the following holds:
T_1 ≤ T_i + Σ_{j=1}^{n} T_j ≤ T_2
where T_2 - T_1 represents the target duration.
When the total duration of the n+1 shots selected by confidence score exceeds the maximum duration T_2, the longest shot among them is trimmed at head and tail, according to each shot's duration, until the total duration meets the target duration.
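Assuming the constraint above is read as T_1 ≤ T_i + Σ T_j ≤ T_2, the screening rule can be sketched as follows; clips are (duration, confidence) pairs for the target category, and trimming the longest clip is simplified to a single cut:

    def select_clips(clips, t_min, t_max):
        """clips: list of (duration_seconds, confidence) for the target category.
        Add clips in descending confidence until the total duration reaches
        t_min; if it overshoots t_max, trim the longest selected clip."""
        ranked = sorted(clips, key=lambda c: c[1], reverse=True)
        chosen, total = [], 0.0
        for duration, confidence in ranked:
            chosen.append([duration, confidence])
            total += duration
            if total >= t_min:
                break
        if total > t_max:                       # trim head/tail of the longest clip
            longest = max(chosen, key=lambda c: c[0])
            longest[0] -= total - t_max
        # if total < t_min here, the source material cannot fill the target duration
        return chosen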
Step 5: splice the video clips obtained in step 4 in the time order of the initial video to obtain the target video.
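A minimal splicing sketch with OpenCV, assuming the selected clips are given as (start_frame, end_frame) ranges of the initial video; sorting by start frame preserves the initial video's time order, and the codec choice is an assumption:

    import cv2

    def splice(video_path, ranges, out_path="target.mp4"):
        """Write the selected frame ranges to out_path in time order."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
        out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
        for start, end in sorted(ranges):       # keep the initial video's time order
            cap.set(cv2.CAP_PROP_POS_FRAMES, start)
            for _ in range(end - start):
                ok, frame = cap.read()
                if not ok:
                    break
                out.write(frame)
        cap.release()
        out.release()
        return out_path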
The generated target video can be stored in a video database and reused the next time it is needed, or used to continue training the model.
Based on the above solution provided by this application, a target video that meets the requirements can be generated from the video uploaded by the user, while the timeliness of video generation is ensured.
Embodiment two
Corresponding to the above embodiment, this application provides a method for generating a video. As shown in Fig. 4, the method includes:
410、接收初始视频及目标视频分类;410. Receive initial video and target video classification;
420、按照预设的视频切分方法,将所述初始视频切分为视频片段;420. According to a preset video segmentation method, segment the initial video into video segments.
优选的,所述方法包括:Preferably, the method includes:
421、使用预设的镜头边界检测方法,确定所述初始视频包含的镜头边界;421. Use a preset shot boundary detection method to determine the shot boundary included in the initial video.
按照确定的所述镜头边界,将所述初始视频切分为视频片段。According to the determined shot boundary, the initial video is divided into video segments.
优选的,所述镜头边界包含所述初始视频的突变镜头及渐变镜头,所述方法 包括:Preferably, the shot boundary includes a sudden change shot and a gradual shot of the initial video, and the method includes:
422、将所述突变镜头及所述渐变镜头从所述初始视频中剔除,获得视频片段集合,所述视频片段集合由剔除后剩余的所述视频片段组成。422. Remove the mutation shots and the gradual shots from the initial video to obtain a set of video clips, where the set of video clips is composed of the video clips remaining after the removal.
优选的,所述视频由连续的帧组成,所述突变镜头及所述渐变镜头的确定过程包括:Preferably, the video is composed of continuous frames, and the process of determining the mutation shot and the gradual shot includes:
423、计算所有所述帧与所述帧的相邻帧之间的差异程度;423. Calculate the degree of difference between all the frames and adjacent frames of the frame.
当所述差异程度超过第一预设阈值时,判断所述帧为突变帧,所述突变镜头由连续的所述突变帧组成;When the degree of difference exceeds a first preset threshold, determining that the frame is a sudden change frame, and the sudden change shot is composed of continuous sudden change frames;
当所述差异程度在第一预设阈值及第二预设阈值之间时,判断所述帧为潜在渐变帧;When the degree of difference is between a first preset threshold and a second preset threshold, determining that the frame is a potential gradual change frame;
当连续的所述潜在渐变帧的数量超过第三预设阈值时,判断所述潜在渐变帧为渐变帧,所述渐变镜头由连续的所述渐变帧组成。When the number of consecutive potential gradient frames exceeds a third preset threshold, it is determined that the potential gradient frames are gradient frames, and the gradient lens is composed of the continuous gradient frames.
430. Input the video segments into a preset model, and determine the confidence of each video segment for every preset video classification.

Preferably, the method includes:

431. Sample each video segment according to a preset sampling method to obtain at least two sampled frames corresponding to the video segment;

preprocess the sampled frames, and input the preprocessed sampled frames into the preset model to obtain the confidence of the video segment for every preset video classification.

Preferably, at least eight sampled frames are obtained.
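A minimal sampling and preprocessing sketch follows; the uniform frame spacing, the 224 × 224 resize, and the [0, 1] scaling are illustrative assumptions, while the default of eight frames matches the preferred minimum stated above.

```python
import cv2
import numpy as np

def sample_and_preprocess(frames, n_samples=8, size=(224, 224)):
    """Uniformly sample n_samples frames from a segment (a list of
    decoded frames) and apply a simple resize-and-scale preprocessing."""
    idx = np.linspace(0, len(frames) - 1, n_samples).astype(int)
    sampled = [cv2.resize(frames[i], size).astype("float32") / 255.0
               for i in idx]
    return np.stack(sampled)  # shape: (n_samples, H, W, C)
```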
Preferably, inputting the preprocessed sampled frames into the preset model includes:

432. Extract the spatiotemporal features contained in the preprocessed sampled frames, and input the spatiotemporal features into the preset model.

Preferably, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
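For illustration, a possible inference step with such a three-dimensional convolutional network is sketched below; `model` stands in for the pre-trained MFnet-style network, and its exact architecture, input layout, and output head are assumptions of the sketch.

```python
import torch

def segment_confidences(model, sampled):
    """Run one segment's preprocessed sampled frames, shaped (T, H, W, C),
    through a pre-trained 3D CNN and return a confidence per preset
    video classification."""
    x = torch.as_tensor(sampled).permute(3, 0, 1, 2).unsqueeze(0)  # (1, C, T, H, W)
    model.eval()
    with torch.no_grad():
        logits = model(x)                     # (1, num_classes)
    return torch.softmax(logits, dim=1).squeeze(0)
```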
440. Determine the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for every preset video classification.

Preferably, the method further includes receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for every preset video classification includes:

441. Determine the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence of each video segment for every preset video classification, and the durations of the video segments.

450. Splice the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
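One way the splicing step could be realized is sketched below; moviepy is only one possible backend, and this application does not prescribe a particular splicing library or parameter set.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice_segments(source_path, segments, out_path="target_video.mp4"):
    """Cut the selected segments, given as (start, end) times in seconds,
    out of the initial video and splice them in time order."""
    source = VideoFileClip(source_path)
    clips = [source.subclip(start, end) for start, end in sorted(segments)]
    concatenate_videoclips(clips).write_videofile(out_path)
    source.close()
```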
Embodiment 3
Corresponding to the foregoing method embodiment, this application provides a video generation apparatus. As shown in FIG. 5, the apparatus includes:

a receiving module 510, configured to receive an initial video and a target video classification;

a segmentation module 520, configured to segment the initial video into video segments according to a preset video segmentation method;

a processing module 530, configured to input the video segments into a preset model and determine the confidence of each video segment for every preset video classification;

a matching module 540, configured to determine, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification;

a splicing module 550, configured to splice the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
Preferably, the segmentation module 520 is further configured to use a preset shot boundary detection method to determine the shot boundaries contained in the initial video, and to segment the initial video into video segments according to the determined shot boundaries.

Preferably, the shot boundaries include sudden change shots and gradual change shots of the initial video, and the segmentation module 520 is further configured to remove the sudden change shots and the gradual change shots from the initial video to obtain a set of video segments consisting of the video segments remaining after removal.

Preferably, the video consists of consecutive frames, and the segmentation module 520 is further configured to calculate the degree of difference between every frame and its adjacent frame; when the degree of difference exceeds a first preset threshold, determine that the frame is a sudden change frame, a sudden change shot consisting of consecutive sudden change frames; when the degree of difference is between the first preset threshold and a second preset threshold, determine that the frame is a potential gradual change frame; and when the number of consecutive potential gradual change frames exceeds a third preset threshold, determine that the potential gradual change frames are gradual change frames, a gradual change shot consisting of consecutive gradual change frames.

Preferably, the processing module 530 is further configured to sample each video segment according to a preset sampling method to obtain at least two sampled frames corresponding to the video segment, preprocess the sampled frames, and input the preprocessed sampled frames into the preset model to obtain the confidence of the video segment for every preset video classification.

Preferably, the processing module 530 is further configured to extract the spatiotemporal features contained in the preprocessed sampled frames and input the spatiotemporal features into the preset model.

Preferably, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.

Preferably, the receiving module 510 is further configured to receive a target duration, and the matching module 540 is further configured to determine the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence of each video segment for every preset video classification, and the durations of the video segments.
Embodiment 4
Corresponding to the foregoing method, apparatus, and system, Embodiment 4 of this application provides a computer system, including: one or more processors; and a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the following operations:

receive an initial video and a target video classification;

segment the initial video into video segments according to a preset video segmentation method;

input the video segments into a preset model, and determine the confidence of each video segment for every preset video classification;

determine, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification;

splice the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
FIG. 6 exemplarily shows the architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through a communication bus 1530.

The processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to realize the technical solutions provided in this application.

The memory 1520 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a basic input/output system (BIOS) for controlling low-level operation of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and so on may also be stored. The icon font processing system 1525 may be the application program that specifically implements the operations of the foregoing steps in the embodiments of this application. In short, when the technical solutions provided in this application are implemented in software or firmware, the relevant program code is stored in the memory 1520 and is called and executed by the processor 1510. The input/output interface 1513 is used to connect input/output modules to realize information input and output. The input/output modules may be configured in the device as components (not shown in the figure), or may be externally connected to the device to provide corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, and various sensors; output devices may include a display, a speaker, a vibrator, and indicator lights.

The network interface 1514 is used to connect a communication module (not shown in the figure) to realize communication and interaction between this device and other devices. The communication module may communicate by wired means (such as USB or a network cable) or by wireless means (such as a mobile network, WiFi, or Bluetooth).

The bus 1530 includes a path that transfers information between the various components of the device (for example, the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).

In addition, the computer system 1500 may also obtain information on specific receiving conditions from a virtual resource object receiving condition information database 1541 for use in condition judgment, and so on.

It should be noted that although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, and so on, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also include only the components necessary to implement the solution of this application, without necessarily including all the components shown in the figure.
From the description of the foregoing embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to cause a computer device (which may be a personal computer, a cloud server, a network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.

The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system and system embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments. The system and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

  1. A video generation method, characterized in that the method comprises:
    receiving an initial video and a target video classification;
    segmenting the initial video into video segments according to a preset video segmentation method;
    inputting the video segments into a preset model, and determining the confidence of each video segment for every preset video classification;
    determining, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification; and
    splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
  2. The method according to claim 1, characterized in that segmenting the initial video into video segments according to the preset video segmentation method comprises:
    using a preset shot boundary detection method to determine the shot boundaries contained in the initial video; and
    segmenting the initial video into video segments according to the determined shot boundaries.
  3. The method according to claim 2, characterized in that the shot boundaries comprise sudden change shots and gradual change shots of the initial video, and segmenting the initial video into video segments according to the determined shot boundaries comprises:
    removing the sudden change shots and the gradual change shots from the initial video to obtain a set of video segments, the set of video segments consisting of the video segments remaining after removal.
  4. The method according to claim 3, characterized in that the video consists of consecutive frames, and the process of determining the sudden change shots and the gradual change shots comprises:
    calculating the degree of difference between every frame and its adjacent frame;
    when the degree of difference exceeds a first preset threshold, determining that the frame is a sudden change frame, a sudden change shot consisting of consecutive sudden change frames;
    when the degree of difference is between the first preset threshold and a second preset threshold, determining that the frame is a potential gradual change frame; and
    when the number of consecutive potential gradual change frames exceeds a third preset threshold, determining that the potential gradual change frames are gradual change frames, a gradual change shot consisting of consecutive gradual change frames.
  5. The method according to any one of claims 1 to 4, characterized in that inputting the video segments into the preset model and determining the confidence of each video segment for every preset video classification comprises:
    sampling each video segment according to a preset sampling method to obtain at least two sampled frames corresponding to the video segment; and
    preprocessing the sampled frames, and inputting the preprocessed sampled frames into the preset model to obtain the confidence of the video segment for every preset video classification.
  6. The method according to claim 5, characterized in that inputting the preprocessed sampled frames into the preset model comprises:
    extracting the spatiotemporal features contained in the preprocessed sampled frames, and inputting the spatiotemporal features into the preset model.
  7. The method according to any one of claims 1 to 4, characterized in that the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
  8. The method according to any one of claims 1 to 4, characterized in that the method further comprises receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for every preset video classification comprises:
    determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence of each video segment for every preset video classification, and the durations of the video segments.
  9. A video generation apparatus, characterized in that the apparatus comprises:
    a receiving module, configured to receive an initial video and a target video classification;
    a segmentation module, configured to segment the initial video into video segments according to a preset video segmentation method;
    a processing module, configured to input the video segments into a preset model and determine the confidence of each video segment for every preset video classification;
    a matching module, configured to determine, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification; and
    a splicing module, configured to splice the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
  10. A computer system, characterized in that the system comprises:
    one or more processors; and
    a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the following operations:
    receiving an initial video and a target video classification;
    segmenting the initial video into video segments according to a preset video segmentation method;
    inputting the video segments into a preset model, and determining the confidence of each video segment for every preset video classification;
    determining, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification; and
    splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
PCT/CN2020/111945 2019-12-20 2020-08-28 Video generation method and apparatus, and computer system WO2021120685A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3164771A CA3164771A1 (en) 2019-12-20 2020-08-28 Video generating method, device and computer system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911330586.7A CN111161392B (en) 2019-12-20 2019-12-20 Video generation method and device and computer system
CN201911330586.7 2019-12-20

Publications (1)

Publication Number Publication Date
WO2021120685A1

Family

ID=70557685

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111945 WO2021120685A1 (en) 2019-12-20 2020-08-28 Video generation method and apparatus, and computer system

Country Status (3)

Country Link
CN (1) CN111161392B (en)
CA (1) CA3164771A1 (en)
WO (1) WO2021120685A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827714A (en) * 2022-04-11 2022-07-29 咪咕文化科技有限公司 Video restoration method based on video fingerprints, terminal equipment and storage medium
CN115348478A (en) * 2022-07-25 2022-11-15 深圳市九洲电器有限公司 Device interaction display method and device, electronic device and readable storage medium
CN116567353A (en) * 2023-07-10 2023-08-08 湖南快乐阳光互动娱乐传媒有限公司 Video delivery method and device, storage medium and electronic equipment

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111182367A (en) * 2019-12-30 2020-05-19 苏宁云计算有限公司 Video generation method and device and computer system
CN111935528B (en) * 2020-06-22 2022-12-16 北京百度网讯科技有限公司 Video generation method and device
CN114286197A (en) * 2022-01-04 2022-04-05 土巴兔集团股份有限公司 Method and related device for rapidly generating short video based on 3D scene

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160277779A1 (en) * 2013-12-04 2016-09-22 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for processing video image
CN109922373A (en) * 2019-03-14 2019-06-21 上海极链网络科技有限公司 Method for processing video frequency, device and storage medium
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device
CN110392281A (en) * 2018-04-20 2019-10-29 腾讯科技(深圳)有限公司 Image synthesizing method, device, computer equipment and storage medium
CN111182367A (en) * 2019-12-30 2020-05-19 苏宁云计算有限公司 Video generation method and device and computer system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108924464B (en) * 2018-07-10 2021-06-08 腾讯科技(深圳)有限公司 Video file generation method and device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160277779A1 (en) * 2013-12-04 2016-09-22 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for processing video image
CN110392281A (en) * 2018-04-20 2019-10-29 腾讯科技(深圳)有限公司 Image synthesizing method, device, computer equipment and storage medium
CN109922373A (en) * 2019-03-14 2019-06-21 上海极链网络科技有限公司 Method for processing video frequency, device and storage medium
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device
CN111182367A (en) * 2019-12-30 2020-05-19 苏宁云计算有限公司 Video generation method and device and computer system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827714A (en) * 2022-04-11 2022-07-29 咪咕文化科技有限公司 Video restoration method based on video fingerprints, terminal equipment and storage medium
CN114827714B (en) * 2022-04-11 2023-11-21 咪咕文化科技有限公司 Video fingerprint-based video restoration method, terminal equipment and storage medium
CN115348478A (en) * 2022-07-25 2022-11-15 深圳市九洲电器有限公司 Device interaction display method and device, electronic device and readable storage medium
CN115348478B (en) * 2022-07-25 2023-09-19 深圳市九洲电器有限公司 Equipment interactive display method and device, electronic equipment and readable storage medium
CN116567353A (en) * 2023-07-10 2023-08-08 湖南快乐阳光互动娱乐传媒有限公司 Video delivery method and device, storage medium and electronic equipment
CN116567353B (en) * 2023-07-10 2023-09-12 湖南快乐阳光互动娱乐传媒有限公司 Video delivery method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111161392B (en) 2022-12-16
CN111161392A (en) 2020-05-15
CA3164771A1 (en) 2021-06-24

Similar Documents

Publication Publication Date Title
WO2021120685A1 (en) Video generation method and apparatus, and computer system
US10657652B2 (en) Image matting using deep learning
US11176415B2 (en) Assisted image annotation
US10810435B2 (en) Segmenting objects in video sequences
CN111182367A (en) Video generation method and device and computer system
EP4009231A1 (en) Video frame information labeling method, device and apparatus, and storage medium
CN107705808A (en) A kind of Emotion identification method based on facial characteristics and phonetic feature
CA3083486C (en) Method, medium, and system for live preview via machine learning models
WO2014174932A1 (en) Image processing device, program, and image processing method
CN109284729A (en) Method, apparatus and medium based on video acquisition human face recognition model training data
CN110832583A (en) System and method for generating a summary storyboard from a plurality of image frames
CN113033537A (en) Method, apparatus, device, medium and program product for training a model
CN107024989A (en) A kind of husky method for making picture based on Leap Motion gesture identifications
CN106446223B (en) Map data processing method and device
CN109409432B (en) A kind of image processing method, device and storage medium
US10755087B2 (en) Automated image capture based on emotion detection
CN113689436B (en) Image semantic segmentation method, device, equipment and storage medium
US10819876B2 (en) Video-based document scanning
CN111259192A (en) Audio recommendation method and device
US11948360B2 (en) Identifying representative frames in video content
CN110795925A (en) Image-text typesetting method based on artificial intelligence, image-text typesetting device and electronic equipment
JP2023543964A (en) Image processing method, image processing device, electronic device, storage medium and computer program
CN112839185B (en) Method, apparatus, device and medium for processing image
CN113223125A (en) Face driving method, device, equipment and medium for virtual image
WO2023197648A1 (en) Screenshot processing method and apparatus, electronic device, and computer readable medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20903874

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3164771

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20903874

Country of ref document: EP

Kind code of ref document: A1