WO2021120685A1 - Video generation method and apparatus, and computer system - Google Patents

Video generation method and apparatus, and computer system Download PDF

Info

Publication number
WO2021120685A1
WO2021120685A1 (PCT/CN2020/111945)
Authority
WO
WIPO (PCT)
Prior art keywords
video
preset
target
initial
classification
Prior art date
Application number
PCT/CN2020/111945
Other languages
French (fr)
Chinese (zh)
Inventor
殷俊
赵筠
李勇
任宇
于思远
Original Assignee
苏宁云计算有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏宁云计算有限公司
Priority to CA3164771A1
Publication of WO2021120685A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/10: Geometric effects
    • G06T 15/20: Perspective computation
    • G06T 15/205: Image-based rendering

Definitions

  • The present invention relates to the field of computer vision technology, and in particular to a method, apparatus and computer system for generating video.
  • In the image-to-video conversion method, the product display images provided by the merchant are cut out and laid out onto a preset image background to form product images; template files such as video templates and background music are obtained from the platform's existing material library, and product videos are generated in batches from these templates.
  • Because the style and layout of such videos depend entirely on the pre-configured templates in the material library, the generated videos are similar in style and limited in layout, cannot visually present the actual state of the product to consumers, and have limited expressive power.
  • The main purpose of the present invention is to provide a video generation method that automatically generates a target video from an initial video.
  • To this end, the present invention provides a video generation method, the method including:
  • splicing, according to preset splicing parameters, the video segments corresponding to the target video classification to obtain the target video.
  • Segmenting the initial video into video segments according to the preset video segmentation method includes using a preset shot boundary detection method to determine the shot boundaries contained in the initial video, and dividing the initial video into video segments according to the determined boundaries.
  • The shot boundaries include abrupt shots and gradual shots of the initial video; dividing the initial video into video segments according to the determined boundaries includes removing the abrupt shots and the gradual shots from the initial video to obtain a video clip set composed of the clips remaining after removal.
  • The video is composed of continuous frames, and determining the abrupt and gradual shots includes computing the difference between each frame and its adjacent frames: a frame whose difference exceeds a first preset threshold is an abrupt frame (consecutive abrupt frames form an abrupt shot); a frame whose difference lies between the first and a second preset threshold is a potential gradual frame; and when the number of consecutive potential gradual frames exceeds a third preset threshold, those frames are gradual frames (consecutive gradual frames form a gradual shot).
  • Inputting the video clips into a preset model and determining each clip's confidence for all preset video classifications includes sampling the clip according to a preset sampling method to obtain at least two sampled frames, preprocessing the sampled frames, and inputting the preprocessed frames into the preset model to obtain the clip's confidence for all preset video classifications.
  • Inputting the preprocessed sampled frames into the preset model includes extracting the spatio-temporal features they contain and inputting those features into the preset model.
  • The preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
  • The method further includes receiving a target duration; the video segments corresponding to the target video classification are then determined according to the target duration, the target video classification, each segment's confidence for all preset video classifications, and the segment durations.
  • A video generation apparatus is also provided, the apparatus including:
  • a receiving module for receiving an initial video and a target video classification;
  • a segmentation module for segmenting the initial video into video segments according to a preset video segmentation method;
  • a processing module for inputting the video segments into a preset model and determining each segment's confidence for all preset video classifications;
  • a matching module for determining the video segments corresponding to the target video classification according to the target classification and those confidences;
  • a splicing module for splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
  • This application further provides a computer system, which includes one or more processors and a memory associated with the one or more processors; the memory stores program instructions that, when read and executed by the one or more processors, receive an initial video and a target video classification, segment the initial video, classify the segments with the preset model, select the segments matching the target classification, and splice them according to preset splicing parameters to obtain the target video.
  • The present invention discloses a video generation method: an initial video and a target video classification are received; the initial video is divided into video segments according to a preset segmentation method; the segments are input into a preset model to obtain each segment's confidence for all preset video classifications; the segments corresponding to the target classification are determined from the target classification and those confidences; and the selected segments are spliced according to preset splicing parameters to obtain the target video. This generates a target video that meets the requirements from the initial video and ensures the timeliness and accuracy of video generation.
  • The invention further proposes using a preset shot boundary detection method to determine the shot boundaries contained in the initial video and dividing the initial video into segments accordingly, the boundaries including abrupt and gradual shots that are removed from the initial video, with the remaining clips forming the video clip set. This ensures the accuracy of video segmentation.
  • This application also discloses sampling each video clip according to a preset sampling method to obtain at least two sampled frames, preprocessing them, and inputting them into the preset model to obtain the clip's confidence for all preset video classifications; the preset classification with the largest confidence is taken as the clip's classification, and that largest confidence as the clip's confidence. The segments corresponding to the target classification and their confidences are then determined from the classifications and confidences of all clips, which ensures the accuracy of the confidence computation.
  • Fig. 1 is a schematic diagram of the model network structure provided by an embodiment of this application.
  • Fig. 2 is a flowchart of shot segmentation provided by an embodiment of this application.
  • Fig. 3 is a flowchart of model training provided by an embodiment of this application.
  • Fig. 4 is a flowchart of the method provided by an embodiment of this application.
  • Fig. 5 is a structural diagram of the apparatus provided by an embodiment of this application.
  • Fig. 6 is a structural diagram of the computer system provided by an embodiment of this application.
  • The two methods commonly used in the prior art to generate product videos each have certain limitations.
  • Manual editing has high labor cost and low efficiency and cannot meet the actual demand for generating product videos at scale; the image-to-video conversion method is more efficient, but the available video layouts and styles are few and fixed, so its expressive power is limited.
  • To solve this, this application proposes segmenting the video uploaded by the user with a preset segmentation method to obtain video segments, classifying each segment with a preset classification model to obtain its confidence, and, according to the target video classification selected by the user, splicing the segments of that classification whose confidence meets a preset condition to obtain the target video. This generates a target video that meets the requirements from the user's uploaded video while ensuring the timeliness of video generation.
  • To classify the video clips obtained by segmentation, the classification model must be trained in advance; specifically, the MFnet three-dimensional convolutional neural network model can be used as the classification model.
  • MFnet is a lightweight deep learning model: compared with recent deep learning models such as I3D and SlowFast networks, it is more compact, requires fewer floating-point operations (FLOPs), and tests better on the test data set.
  • The training process includes importing the training data set, which can be generated as follows:
  • obtain a preset number of product videos and create a corresponding folder for each video;
  • divide the clips contained in each video into categories according to the content they present (including but not limited to product appearance, product usage scene, and product content introduction) and edit them manually by category;
  • densely sample the folder of each video and normalize the samples to N×C×H×W, where N is the number of sampled frames per sub-clip folder, C the RGB channels of each frame, H the preset frame height, and W the preset frame width; preferably, N is at least 8.
  • Figure 1 shows the network structure of the model. It contains a 3D CNN that extracts the three-dimensional convolutional features of each sample; these spatio-temporal features capture the motion information of objects in the video stream, such as the movement trend of the product and changes in the background.
  • 3D pooling is the model's pooling layer: it pools the output of the 3D CNN and feeds the result to the 3D MF-Unit layers, which perform different convolution operations such as 1×1×1, 3×3×3 and 1×3×3.
  • The global pooling layer retains the main features of its input while reducing unnecessary parameters, and the FC layer is a fully connected layer that outputs the confidence of each video segment for each category.
  • The model classifies samples obtained by dense sampling of a single shot: its classification accuracy reaches 95.92%, the single model is only 29.6 MB, and the forward inference time for a densely sampled single-shot video is 330 ms, so it is both accurate and fast.
  • Once the preset model is obtained, videos can be generated with it. As shown in Figure 2, the generation process includes:
  • Step 1: receive the initial video input by the user;
  • Step 2: perform shot boundary detection on the initial video, segment the video according to the detection results, remove redundant segments, and obtain video clips;
  • The shot boundary detection process includes: dividing each frame of the initial video into a preset number of sub-blocks using the same preset method, computing the sub-histogram of each sub-block, and computing from the sub-histograms the histogram difference between sub-blocks at the same position in adjacent frames, where the adjacent frames of a frame are its previous and next frames.
  • When the difference exceeds a first preset threshold T_H, the corresponding sub-blocks differ too much between adjacent frames; when the number of such sub-blocks in a frame exceeds a second preset threshold, the frame is judged an abrupt frame, and consecutive abrupt frames form an abrupt shot.
  • Step 3: sample the video clips, input the sampling results into the preset model, and obtain the category and confidence corresponding to each clip;
  • the clips are randomly and densely sampled. The random dense sampling process includes randomly initializing a sampling point on the clip and, taking that point as the starting point and the end of the clip as the end point, uniformly sampling N frames and preprocessing the sampled frames to meet the input size required by the preset model.
  • The preprocessed sampled frames are then input into the preset model to obtain the confidence of the clip containing those frames for all categories.
  • Step 4: according to the target category and target duration selected by the user, splice the video clips corresponding to the target category to generate the target video;
  • the clips are sorted by their confidence for the corresponding category (for example, appearance display), and clips meeting the requirements are screened. Specific screening rules can include:
  • when the duration of the clip with the highest confidence already satisfies the target duration, using that clip directly as the target video;
  • otherwise, selecting the next n clips T_j in descending order of confidence, where j ∈ [1, n], until the total duration satisfies the target duration constraint, T_2 - T_1 representing the target duration;
  • when the total duration of the n+1 shots selected by confidence score exceeds the maximum duration T_2, trimming the longest of them at head and tail, according to each shot's duration, until the total duration meets the target duration.
  • Step 5: splice the clips obtained in step 4 in the time order of the initial video to obtain the target video.
  • The generated target video can be stored in a video database for reuse when next needed, or used to continue training the model.
  • This application provides a method for generating a video. As shown in Fig. 4, the method includes: receiving an initial video and a target video classification, and segmenting the initial video into video segments according to a preset video segmentation method.
  • Segmentation can use a preset shot boundary detection method to determine the shot boundaries contained in the initial video, the boundaries including abrupt and gradual shots of the initial video, which are removed so that the remaining clips form the video clip set.
  • The video is composed of continuous frames; determining the abrupt and gradual shots proceeds by comparing each frame with its adjacent frames as described above, with runs of potential gradual frames longer than a preset threshold judged gradual frames, consecutive gradual frames forming a gradual shot.
  • Classifying the segments includes sampling each clip according to a preset sampling method to obtain at least two sampled frames (preferably at least eight), preprocessing them, and inputting the preprocessed frames into the preset model to obtain the clip's confidence for all preset video classifications; inputting the preprocessed frames includes extracting their spatio-temporal features and feeding those features to the model.
  • The method further includes receiving a target duration; the video segments corresponding to the target video classification are determined accordingly and spliced according to preset splicing parameters to obtain the target video.
  • This application also provides a video generation apparatus. As shown in Fig. 5, the apparatus includes:
  • a receiving module 510 for receiving the initial video and the target video classification;
  • a segmentation module 520 for segmenting the initial video into video segments according to a preset video segmentation method;
  • a processing module 530 for inputting the video segments into a preset model and determining each segment's confidence for all preset video classifications;
  • a matching module 540 for determining the video segments corresponding to the target video classification according to the target classification and those confidences;
  • a splicing module 550 for splicing the segments corresponding to the target classification according to preset splicing parameters to obtain the target video.
  • The segmentation module 520 may also determine, with a preset shot boundary detection method, the shot boundaries contained in the initial video and divide the initial video into segments accordingly; the boundaries include abrupt and gradual shots of the initial video, which the module removes to obtain the video clip set composed of the clips remaining after removal.
  • The video is composed of continuous frames. The segmentation module 520 may also compute the difference between each frame and its adjacent frames; when the difference exceeds a first preset threshold the frame is judged abrupt (consecutive abrupt frames forming an abrupt shot), when it lies between the first and second preset thresholds the frame is judged a potential gradual frame, and when the number of consecutive potential gradual frames exceeds a third preset threshold they are judged gradual frames (consecutive gradual frames forming a gradual shot).
  • The processing module 530 may also sample each clip according to a preset sampling method to obtain at least two sampled frames, preprocess them, and input the preprocessed frames into the preset model to obtain the clip's confidence for all preset video classifications; it may further extract the spatio-temporal features of the preprocessed frames and feed them to the preset model. The preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
  • The receiving module 510 may also receive a target duration, and the matching module 540 may then determine the segments corresponding to the target classification according to the target duration, the target classification, each segment's confidence for all preset classifications, and the segment durations.
  • A further embodiment of this application provides a computer system including one or more processors and a memory associated with the one or more processors; the memory stores program instructions that, when read and executed by the one or more processors, perform the following operations: receive an initial video and a target video classification; segment the initial video into video segments according to a preset video segmentation method; input the segments into a preset model to determine each segment's confidence for all preset video classifications; determine the segments corresponding to the target classification; and splice them according to preset splicing parameters to obtain the target video.
  • Fig. 6 exemplarily shows the architecture of the computer system, which may include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520, communicatively connected through a communication bus 1530.
  • The processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to realize the technical solutions provided in this application.
  • The memory 1520 may be implemented as ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a basic input/output system (BIOS) for controlling its low-level operation.
  • A web browser 1523, a data storage management system 1524, and an icon font processing system 1525 may also be stored; the icon font processing system 1525 may be the application program that implements the steps of the embodiments of this application, with the related program code stored in the memory 1520 and called and executed by the processor 1510.
  • The input/output interface 1513 connects input/output modules to realize information input and output. The input/output modules may be configured in the device as components (not shown in the figure) or externally connected to provide the corresponding functions; input devices may include a keyboard, mouse, touch screen, microphone, and various sensors, and output devices may include a display, speaker, vibrator, and indicator lights.
  • The network interface 1514 connects a communication module (not shown in the figure) to realize communication between this device and other devices; the communication module may communicate by wired means (such as USB or a network cable) or wireless means (such as a mobile network, WiFi, or Bluetooth).
  • The bus 1530 includes a path that transmits information between the components of the device (for example, the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
  • The computer system 1500 may also obtain information on specific receiving conditions from a virtual resource object receiving condition information database 1541 for condition judgment.
  • Although only the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, and the bus 1530 are shown, in specific implementations the device may also include other components necessary for normal operation; conversely, it may include only the components necessary to implement the solution of this application, and not necessarily all components shown in the figure.

Abstract

A video generation method and apparatus, and a computer system (1500). The method comprises: receiving an initial video and a target video classification (410); segmenting the initial video into video segments according to a preset video segmentation method (420); inputting the video segments into a preset model to determine the confidence of each of the video segments corresponding to all preset video classifications (430); determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each of the video segments corresponding to all preset video classifications (440); and stitching the video segments corresponding to the target video classification according to preset stitching parameters to obtain a target video (450). A target video that meets the requirements is automatically generated from the initial video, thus ensuring the timeliness and accuracy of video generation.

Description

Video generation method and apparatus, and computer system
Technical field
The present invention relates to the field of computer vision technology, and in particular to a method, apparatus and computer system for generating video.
Background art
With the accelerating pace of life, consumers want to obtain product information more intuitively. The traditional method of presenting a product with a fixed set of product images can no longer satisfy e-commerce platforms' need to display product characteristics and help consumers make purchasing decisions; short product display videos, which show product functions or actual usage effects, have become the mainstream of product promotion for major e-commerce companies. However, the vast number of product videos uploaded by merchants and other users are of uneven quality and varying length and cannot meet the platforms' delivery requirements.
In the prior art, product video generation methods fall into two categories: traditional manual editing and image-to-video conversion. In the traditional manual method, the uploaded original video is manually segmented into shots according to scene content, target material, and so on, and the segments that meet the delivery standard are manually screened and spliced into a creative short product video that meets the user's needs. This places high technical demands on the operator, and manual operation is slow and subjective, so it cannot be guaranteed to meet video delivery demand.
In the image-to-video conversion method, the product display images provided by the merchant are cut out and laid out onto a preset image background to form product images, and template files such as video templates and background music are obtained from the platform's existing material library; product videos are then generated in batches from these templates. Although this enables large-scale generation, the style and layout of the videos depend entirely on the pre-configured templates, so the generated videos are similar in style and limited in layout, cannot visually present the actual state of the product to consumers, and have limited expressive power.
Summary of the invention
To overcome the shortcomings of the prior art, the main purpose of the present invention is to provide a video generation method that automatically generates a target video from an initial video.
To achieve the above objective, in a first aspect the present invention provides a video generation method, the method including:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video segments into a preset model, and determining the confidence of each video segment for all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for all preset video classifications;
splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
In some embodiments, segmenting the initial video into video segments according to the preset video segmentation method includes:
using a preset shot boundary detection method to determine the shot boundaries contained in the initial video;
dividing the initial video into video segments according to the determined shot boundaries.
In some embodiments, the shot boundaries include abrupt shots and gradual shots of the initial video, and dividing the initial video into video segments according to the determined shot boundaries includes:
removing the abrupt shots and the gradual shots from the initial video to obtain a video clip set composed of the video clips remaining after removal.
In some embodiments, the video is composed of continuous frames, and the process of determining the abrupt shots and the gradual shots includes:
computing the degree of difference between each frame and its adjacent frames;
when the degree of difference exceeds a first preset threshold, judging the frame to be an abrupt frame, an abrupt shot being composed of consecutive abrupt frames;
when the degree of difference lies between the first preset threshold and a second preset threshold, judging the frame to be a potential gradual frame;
when the number of consecutive potential gradual frames exceeds a third preset threshold, judging the potential gradual frames to be gradual frames, a gradual shot being composed of consecutive gradual frames.
In some embodiments, inputting the video segments into the preset model and determining the confidence of each segment for all preset video classifications includes:
sampling the video segment according to a preset sampling method to obtain at least two sampled frames corresponding to the segment;
preprocessing the sampled frames, and inputting the preprocessed frames into the preset model to obtain the confidence of the video segment for all the preset video classifications.
In some embodiments, inputting the preprocessed sampled frames into the preset model includes:
extracting the spatio-temporal features contained in the preprocessed sampled frames, and inputting the spatio-temporal features into the preset model.
In some embodiments, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
In some embodiments, the method further includes receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and the confidences includes:
determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence of each video segment for all preset video classifications, and the durations of the video segments.
In a second aspect, a video generation apparatus is provided, the apparatus including:
a receiving module for receiving an initial video and a target video classification;
a segmentation module for segmenting the initial video into video segments according to a preset video segmentation method;
a processing module for inputting the video segments into a preset model and determining the confidence of each video segment for all preset video classifications;
a matching module for determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for all preset video classifications;
a splicing module for splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
In a third aspect, this application provides a computer system, the system including:
one or more processors;
and a memory associated with the one or more processors, the memory storing program instructions that, when read and executed by the one or more processors, perform the following operations:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video segments into a preset model, and determining the confidence of each video segment for all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for all preset video classifications;
splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
The beneficial effects achieved by the present invention are as follows.
The present invention discloses a video generation method: an initial video and a target video classification are received; the initial video is divided into video segments according to a preset video segmentation method; the segments are input into a preset model to obtain each segment's confidence for all preset video classifications; the segments corresponding to the target classification are determined from the target classification and those confidences; and the selected segments are spliced according to preset splicing parameters to obtain the target video. This generates a target video that meets the requirements from the initial video, and ensures the timeliness and accuracy of video generation.
The present invention further proposes using a preset shot boundary detection method to determine the shot boundaries contained in the initial video and dividing the initial video into video segments according to the determined boundaries; it further proposes that the shot boundaries include abrupt shots and gradual shots of the initial video, and that segmentation includes removing the abrupt and gradual shots from the initial video to obtain a video clip set composed of the clips remaining after removal. This ensures the accuracy of video segmentation.
This application also discloses sampling each video clip according to a preset sampling method to obtain at least two sampled frames, preprocessing the frames and inputting them into the preset model to obtain the clip's confidence for all preset video classifications; the preset classification with the largest confidence is taken as the clip's classification, and that largest confidence as the clip's confidence; the segments corresponding to the target classification and their confidences are then determined from the classifications and confidences of all clips. This ensures the accuracy of the confidence computation.
Not all products of the present invention need to achieve all of the above effects.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is a schematic diagram of the model network structure provided by an embodiment of this application;
Fig. 2 is a flowchart of shot segmentation provided by an embodiment of this application;
Fig. 3 is a flowchart of model training provided by an embodiment of this application;
Fig. 4 is a flowchart of the method provided by an embodiment of this application;
Fig. 5 is a structural diagram of the apparatus provided by an embodiment of this application;
Fig. 6 is a structural diagram of the computer system provided by an embodiment of this application.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
As described in the background art, the two methods commonly used in the prior art to generate product videos each have certain limitations. Manual editing has high labor cost and low efficiency and cannot meet the actual demand for generating product videos at scale; the image-to-video conversion method is more efficient, but the available video layouts and styles are few and fixed, so its expressive power is limited.
To solve these technical problems, this application proposes segmenting the video uploaded by the user with a preset segmentation method to obtain video segments, classifying each segment with a preset classification model to obtain its confidence, and, according to the target video classification selected by the user, splicing the segments of that classification whose confidence meets a preset condition to obtain the target video. This generates a target video that meets the requirements from the user's uploaded video while ensuring the timeliness of video generation.
Embodiment one
To classify the video clips obtained by segmentation, the classification model must be trained in advance; specifically, the MFnet three-dimensional convolutional neural network model can be used as the classification model. MFnet is a lightweight deep learning model: compared with recent deep learning models such as I3D and SlowFast networks, it is more compact, requires fewer floating-point operations (FLOPs), and tests better on the test data set.
The training process includes:
110. Import the training data set.
The training data set can be generated as follows:
111. Obtain a preset number of product videos, and create a corresponding folder for each video.
112. Divide the clips contained in each video into categories according to the content they present, the categories including but not limited to product appearance, product usage scene, and product content introduction, and edit them manually by category.
113. In the folder of each video, create a main folder for each category, labeled with that category; each main folder contains one or more sub-clip folders of the video for that category, and each sub-clip folder stores one or more image frames of the corresponding video clip.
114. Densely sample the folder of each video, and normalize the samples to N×C×H×W, where N is the number of sampled frames per sub-clip folder, C the RGB channels of each frame, H the preset frame height, and W the preset frame width; preferably, N is at least 8.
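The following Python sketch illustrates this normalization step, assuming OpenCV and NumPy; the 224×224 frame size, the JPEG file layout, and N=8 are illustrative assumptions rather than values fixed by the application:

    import glob
    import cv2
    import numpy as np

    def load_clip_sample(clip_dir, n_frames=8, height=224, width=224):
        """Uniformly pick n_frames images from a sub-clip folder and return
        them as an N x C x H x W float32 array scaled to [0, 1]."""
        paths = sorted(glob.glob(clip_dir + "/*.jpg"))
        idx = np.linspace(0, len(paths) - 1, n_frames).astype(int)
        frames = []
        for i in idx:
            img = cv2.imread(paths[i])                  # H x W x 3, BGR
            img = cv2.resize(img, (width, height))      # enforce preset H, W
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # C = RGB channels
            frames.append(img.astype(np.float32) / 255.0)
        sample = np.stack(frames)                       # N x H x W x C
        return sample.transpose(0, 3, 1, 2)             # N x C x H x W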
120. Use the training data set to train the MFnet three-dimensional convolutional neural network model to obtain the preset model.
Figure 1 shows the network structure of the model. It contains a 3D CNN that extracts the three-dimensional convolutional features of each sample; these features are spatio-temporal, capturing the motion information of objects in the video stream, such as the movement trend of the product and changes in the background.
3D pooling is the model's pooling layer: it pools the output of the 3D CNN and feeds the result to the 3D MF-Unit layers, which perform different convolution operations such as 1×1×1, 3×3×3, and 1×3×3.
The global pooling layer retains the main features of its input while reducing unnecessary parameters.
The FC layer is a fully connected layer that outputs the confidence of each video segment for each category.
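As a rough illustration only, the sketch below mirrors that layer layout in PyTorch (3D convolution, 3D pooling, stacked convolution units, global pooling, fully connected output). It is a simplified stand-in, not the actual MFnet definition: the channel widths and the plain convolution blocks standing in for the 3D MF-Units are assumptions.

    import torch
    import torch.nn as nn

    class Simple3DClassifier(nn.Module):
        """Simplified stand-in for the 3D-CNN classifier described above."""
        def __init__(self, num_classes):
            super().__init__()
            self.stem = nn.Sequential(              # "3DCNN": spatio-temporal features
                nn.Conv3d(3, 16, kernel_size=3, padding=1),
                nn.BatchNorm3d(16),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=2),        # "3D pooling"
            )
            self.units = nn.Sequential(             # crude stand-in for 3D MF-Units
                nn.Conv3d(16, 32, kernel_size=(1, 1, 1)),
                nn.ReLU(inplace=True),
                nn.Conv3d(32, 32, kernel_size=(3, 3, 3), padding=1),
                nn.ReLU(inplace=True),
                nn.Conv3d(32, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
                nn.ReLU(inplace=True),
            )
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool3d(1),            # "Global Pool"
                nn.Flatten(),
                nn.Linear(64, num_classes),         # FC layer: per-category scores
            )

        def forward(self, x):                       # x: batch x C x N x H x W
            logits = self.head(self.units(self.stem(x)))
            return torch.softmax(logits, dim=1)     # per-category confidences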
Using the model, a test set of 56 product short videos was tested; the results are shown in Table 1.
[Table 1: test results (the table image is not reproduced in the source text)]
The model can classify samples obtained by dense sampling of a single shot. On the 1119 test samples of the above video data set, the classification accuracy reached 95.92%; the single model is only 29.6 MB, and the forward inference time for a densely sampled single-shot video is 330 ms, so the model is both accurate and fast.
Once the preset model is obtained, videos can be generated with it. As shown in Figure 2, the generation process includes:
Step 1: receive the initial video input by the user.
Step 2: perform shot boundary detection on the initial video, segment the video according to the detection results, remove redundant segments, and obtain video clips.
As shown in Figure 3, the shot boundary detection process includes:
First, each frame of the initial video is divided into a preset number of sub-blocks using the same preset method, the sub-histogram of each sub-block is computed, and the histogram difference between sub-blocks at the same position in adjacent frames is computed from the sub-histograms; the adjacent frames of a frame are its previous and next frames. When the difference exceeds a first preset threshold T_H, the corresponding sub-blocks differ too much between adjacent frames; when the number of such sub-blocks in a frame exceeds a second preset threshold, the frame is judged an abrupt frame, and consecutive abrupt frames constitute an abrupt shot. A frame whose difference lies between the first preset threshold T_H and a third preset threshold T_L is regarded as a potential starting frame; when the differences of the frames that follow it also lie between T_L and T_H and this lasts longer than a fourth preset threshold, those consecutive frames are judged gradual frames and constitute a gradual shot. The shots that remain after the gradual and abrupt shots are removed are regarded as normal shots.
To guarantee the quality of the generated video, normal shots shorter than a fifth preset threshold are also removed; the required video clip set is then obtained.
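A condensed sketch of this detector follows, using OpenCV. The 4×4 block grid, 32-bin histograms, and the concrete threshold values are illustrative assumptions (the embodiment fixes only the thresholding structure), and the gradual-shot handling is simplified to a running counter:

    import cv2
    import numpy as np

    def block_hist_diffs(prev_gray, gray, grid=4, bins=32):
        """Per-block histogram differences between co-located sub-blocks
        of two consecutive frames, normalized by block size."""
        h, w = gray.shape
        bh, bw = h // grid, w // grid
        diffs = []
        for r in range(grid):
            for c in range(grid):
                a = prev_gray[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                b = gray[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                ha = cv2.calcHist([a], [0], None, [bins], [0, 256])
                hb = cv2.calcHist([b], [0], None, [bins], [0, 256])
                diffs.append(float(np.abs(ha - hb).sum()) / a.size)
        return np.array(diffs)

    def label_frames(video_path, t_h=0.5, t_l=0.2, block_ratio=0.5, min_run=5):
        """Label each frame 'abrupt', 'gradual', or 'normal'."""
        cap = cv2.VideoCapture(video_path)
        labels, prev, run = [], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is None:
                labels.append("normal")
            else:
                d = block_hist_diffs(prev, gray)
                if (d > t_h).mean() > block_ratio:      # many blocks changed a lot
                    labels.append("abrupt")
                    run = 0
                elif ((d > t_l) & (d <= t_h)).mean() > block_ratio:
                    run += 1                            # potential gradual frame
                    # a full implementation would relabel the whole run once it qualifies
                    labels.append("gradual" if run >= min_run else "normal")
                else:
                    labels.append("normal")
                    run = 0
            prev = gray
        cap.release()
        return labels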
Step 3: sample the video clips, input the sampling results into the preset model, and obtain the category and confidence corresponding to each clip.
First, the above video clips are randomly and densely sampled in the time order of the video.
The random dense sampling process includes:
randomly initializing a sampling point on the video clip and, taking the sampling point as the starting point and the end of the clip as the end point, uniformly sampling N frames and preprocessing the sampled frames to meet the input size required by the preset model.
The preprocessed sampled frames are then input into the preset model to obtain the confidence, for all categories, of the video clip containing those frames.
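A sketch of this sampling-plus-inference step, under the assumption that the clip is available as a list of decoded frames, that a preprocessing callable in the style of the earlier load/normalize helper is passed in, and that the model is a PyTorch module like the earlier sketch:

    import random
    import numpy as np
    import torch

    def random_dense_sample(frames, n=8):
        """Randomly initialize a start point, then uniformly sample n frames
        from the start point to the end of the clip."""
        start = random.randint(0, max(0, len(frames) - n))
        idx = np.linspace(start, len(frames) - 1, n).astype(int)
        return [frames[i] for i in idx]

    def clip_confidences(model, frames, preprocess):
        """Return the clip's confidence for every preset category."""
        sample = preprocess(random_dense_sample(frames))   # N x C x H x W
        batch = torch.from_numpy(sample).unsqueeze(0)      # 1 x N x C x H x W
        batch = batch.permute(0, 2, 1, 3, 4).contiguous()  # 1 x C x N x H x W
        model.eval()
        with torch.no_grad():
            return model(batch)[0]                         # vector of confidences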
Step 4: according to the target category and target duration selected by the user, splice the video clips corresponding to the target category to generate the target video.
For example, when the user wants an appearance display video of the current product, the clips are sorted by their confidence for the appearance display category, and clips meeting the requirements are screened.
Specific screening rules can include:
When the duration T_i of the clip with the highest confidence already satisfies the target duration, that clip is used directly as the target video.
When the duration T_i of the clip with the highest confidence does not satisfy the target duration, the next n clips T_j are selected in descending order of confidence, where j ∈ [1, n], until the following holds:
T_1 ≤ T_i + Σ_{j=1}^{n} T_j ≤ T_2
where T_2 - T_1 represents the target duration.
When the total duration of the n+1 shots selected by confidence score exceeds the maximum duration T_2, the longest shot among them is trimmed at head and tail, according to each shot's duration, until the total duration meets the target duration.
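Assuming the constraint above is read as T_1 ≤ T_i + Σ T_j ≤ T_2, the screening rule can be sketched as follows; clips are (duration, confidence) pairs for the target category, and trimming the longest clip is simplified to a single cut:

    def select_clips(clips, t_min, t_max):
        """clips: list of (duration_seconds, confidence) for the target category.
        Add clips in descending confidence until the total duration reaches
        t_min; if it overshoots t_max, trim the longest selected clip."""
        ranked = sorted(clips, key=lambda c: c[1], reverse=True)
        chosen, total = [], 0.0
        for duration, confidence in ranked:
            chosen.append([duration, confidence])
            total += duration
            if total >= t_min:
                break
        if total > t_max:                       # trim head/tail of the longest clip
            longest = max(chosen, key=lambda c: c[0])
            longest[0] -= total - t_max
        # if total < t_min here, the source material cannot fill the target duration
        return chosen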
Step 5: splice the video clips obtained in step 4 in the time order of the initial video to obtain the target video.
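A minimal splicing sketch with OpenCV, assuming the selected clips are given as (start_frame, end_frame) ranges of the initial video; sorting by start frame preserves the initial video's time order, and the codec choice is an assumption:

    import cv2

    def splice(video_path, ranges, out_path="target.mp4"):
        """Write the selected frame ranges to out_path in time order."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
        out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
        for start, end in sorted(ranges):       # keep the initial video's time order
            cap.set(cv2.CAP_PROP_POS_FRAMES, start)
            for _ in range(end - start):
                ok, frame = cap.read()
                if not ok:
                    break
                out.write(frame)
        cap.release()
        out.release()
        return out_path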
The generated target video can be stored in a video database and reused the next time it is needed, or used to continue training the model.
Based on the above solution provided by this application, a target video that meets the requirements can be generated from the video uploaded by the user, while the timeliness of video generation is ensured.
Embodiment two
Corresponding to the above embodiment, this application provides a method for generating a video. As shown in Fig. 4, the method includes:
410、接收初始视频及目标视频分类;410. Receive initial video and target video classification;
420、按照预设的视频切分方法,将所述初始视频切分为视频片段;420. According to a preset video segmentation method, segment the initial video into video segments.
优选的,所述方法包括:Preferably, the method includes:
421、使用预设的镜头边界检测方法,确定所述初始视频包含的镜头边界;421. Use a preset shot boundary detection method to determine the shot boundary included in the initial video.
按照确定的所述镜头边界,将所述初始视频切分为视频片段。According to the determined shot boundary, the initial video is divided into video segments.
优选的,所述镜头边界包含所述初始视频的突变镜头及渐变镜头,所述方法 包括:Preferably, the shot boundary includes a sudden change shot and a gradual shot of the initial video, and the method includes:
422、将所述突变镜头及所述渐变镜头从所述初始视频中剔除,获得视频片段集合,所述视频片段集合由剔除后剩余的所述视频片段组成。422. Remove the mutation shots and the gradual shots from the initial video to obtain a set of video clips, where the set of video clips is composed of the video clips remaining after the removal.
优选的,所述视频由连续的帧组成,所述突变镜头及所述渐变镜头的确定过程包括:Preferably, the video is composed of continuous frames, and the process of determining the mutation shot and the gradual shot includes:
423、计算所有所述帧与所述帧的相邻帧之间的差异程度;423. Calculate the degree of difference between all the frames and adjacent frames of the frame.
当所述差异程度超过第一预设阈值时,判断所述帧为突变帧,所述突变镜头由连续的所述突变帧组成;When the degree of difference exceeds a first preset threshold, determining that the frame is a sudden change frame, and the sudden change shot is composed of continuous sudden change frames;
当所述差异程度在第一预设阈值及第二预设阈值之间时,判断所述帧为潜在渐变帧;When the degree of difference is between a first preset threshold and a second preset threshold, determining that the frame is a potential gradual change frame;
当连续的所述潜在渐变帧的数量超过第三预设阈值时,判断所述潜在渐变帧为渐变帧,所述渐变镜头由连续的所述渐变帧组成。When the number of consecutive potential gradient frames exceeds a third preset threshold, it is determined that the potential gradient frames are gradient frames, and the gradient lens is composed of the continuous gradient frames.
430. Input the video segments into a preset model, and determine the confidence of each video segment for every preset video classification.

Preferably, the method includes:

431. Sample each video segment according to a preset sampling method to obtain at least two sampled frames corresponding to the video segment;

preprocess the sampled frames, and input the preprocessed sampled frames into the preset model to obtain the confidence of the video segment for every preset video classification.

Preferably, at least eight sampled frames are obtained.
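A minimal sampling and preprocessing sketch follows; the uniform frame spacing, the 224 × 224 resize, and the [0, 1] scaling are illustrative assumptions, while the default of eight frames matches the preferred minimum stated above.

```python
import cv2
import numpy as np

def sample_and_preprocess(frames, n_samples=8, size=(224, 224)):
    """Uniformly sample n_samples frames from a segment (a list of
    decoded frames) and apply a simple resize-and-scale preprocessing."""
    idx = np.linspace(0, len(frames) - 1, n_samples).astype(int)
    sampled = [cv2.resize(frames[i], size).astype("float32") / 255.0
               for i in idx]
    return np.stack(sampled)  # shape: (n_samples, H, W, C)
```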
Preferably, inputting the preprocessed sampled frames into the preset model includes:

432. Extract the spatiotemporal features contained in the preprocessed sampled frames, and input the spatiotemporal features into the preset model.

Preferably, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
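For illustration, a possible inference step with such a three-dimensional convolutional network is sketched below; `model` stands in for the pre-trained MFnet-style network, and its exact architecture, input layout, and output head are assumptions of the sketch.

```python
import torch

def segment_confidences(model, sampled):
    """Run one segment's preprocessed sampled frames, shaped (T, H, W, C),
    through a pre-trained 3D CNN and return a confidence per preset
    video classification."""
    x = torch.as_tensor(sampled).permute(3, 0, 1, 2).unsqueeze(0)  # (1, C, T, H, W)
    model.eval()
    with torch.no_grad():
        logits = model(x)                     # (1, num_classes)
    return torch.softmax(logits, dim=1).squeeze(0)
```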
440. Determine the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for every preset video classification.

Preferably, the method further includes receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for every preset video classification includes:

441. Determine the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence of each video segment for every preset video classification, and the durations of the video segments.

450. Splice the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
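One way the splicing step could be realized is sketched below; moviepy is only one possible backend, and this application does not prescribe a particular splicing library or parameter set.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice_segments(source_path, segments, out_path="target_video.mp4"):
    """Cut the selected segments, given as (start, end) times in seconds,
    out of the initial video and splice them in time order."""
    source = VideoFileClip(source_path)
    clips = [source.subclip(start, end) for start, end in sorted(segments)]
    concatenate_videoclips(clips).write_videofile(out_path)
    source.close()
```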
Embodiment 3
Corresponding to the foregoing method embodiment, this application provides a video generation apparatus. As shown in FIG. 5, the apparatus includes:

a receiving module 510, configured to receive an initial video and a target video classification;

a segmentation module 520, configured to segment the initial video into video segments according to a preset video segmentation method;

a processing module 530, configured to input the video segments into a preset model and determine the confidence of each video segment for every preset video classification;

a matching module 540, configured to determine, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification;

a splicing module 550, configured to splice the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
Preferably, the segmentation module 520 is further configured to use a preset shot boundary detection method to determine the shot boundaries contained in the initial video, and to segment the initial video into video segments according to the determined shot boundaries.

Preferably, the shot boundaries include sudden change shots and gradual change shots of the initial video, and the segmentation module 520 is further configured to remove the sudden change shots and the gradual change shots from the initial video to obtain a set of video segments consisting of the video segments remaining after removal.

Preferably, the video consists of consecutive frames, and the segmentation module 520 is further configured to calculate the degree of difference between every frame and its adjacent frame; when the degree of difference exceeds a first preset threshold, determine that the frame is a sudden change frame, a sudden change shot consisting of consecutive sudden change frames; when the degree of difference is between the first preset threshold and a second preset threshold, determine that the frame is a potential gradual change frame; and when the number of consecutive potential gradual change frames exceeds a third preset threshold, determine that the potential gradual change frames are gradual change frames, a gradual change shot consisting of consecutive gradual change frames.

Preferably, the processing module 530 is further configured to sample each video segment according to a preset sampling method to obtain at least two sampled frames corresponding to the video segment, preprocess the sampled frames, and input the preprocessed sampled frames into the preset model to obtain the confidence of the video segment for every preset video classification.

Preferably, the processing module 530 is further configured to extract the spatiotemporal features contained in the preprocessed sampled frames and input the spatiotemporal features into the preset model.

Preferably, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.

Preferably, the receiving module 510 is further configured to receive a target duration, and the matching module 540 is further configured to determine the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence of each video segment for every preset video classification, and the durations of the video segments.
Embodiment 4
Corresponding to the foregoing method, apparatus, and system, Embodiment 4 of this application provides a computer system, including: one or more processors; and a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the following operations:

receive an initial video and a target video classification;

segment the initial video into video segments according to a preset video segmentation method;

input the video segments into a preset model, and determine the confidence of each video segment for every preset video classification;

determine, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification;

splice the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
FIG. 6 exemplarily shows the architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through a communication bus 1530.

The processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to realize the technical solutions provided in this application.

The memory 1520 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a basic input/output system (BIOS) for controlling low-level operation of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and so on may also be stored. The icon font processing system 1525 may be the application program that specifically implements the operations of the foregoing steps in the embodiments of this application. In short, when the technical solutions provided in this application are implemented in software or firmware, the relevant program code is stored in the memory 1520 and is called and executed by the processor 1510. The input/output interface 1513 is used to connect input/output modules to realize information input and output. The input/output modules may be configured in the device as components (not shown in the figure), or may be externally connected to the device to provide corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, and various sensors; output devices may include a display, a speaker, a vibrator, and indicator lights.

The network interface 1514 is used to connect a communication module (not shown in the figure) to realize communication and interaction between this device and other devices. The communication module may communicate by wired means (such as USB or a network cable) or by wireless means (such as a mobile network, WiFi, or Bluetooth).

The bus 1530 includes a path that transfers information between the various components of the device (for example, the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).

In addition, the computer system 1500 may also obtain information on specific receiving conditions from a virtual resource object receiving condition information database 1541 for use in condition judgment, and so on.

It should be noted that although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, and so on, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also include only the components necessary to implement the solution of this application, without necessarily including all the components shown in the figure.
From the description of the foregoing embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to cause a computer device (which may be a personal computer, a cloud server, a network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.

The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system and system embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments. The system and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

  1. A video generation method, characterized in that the method comprises:
    receiving an initial video and a target video classification;
    segmenting the initial video into video segments according to a preset video segmentation method;
    inputting the video segments into a preset model, and determining the confidence of each video segment for every preset video classification;
    determining, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification; and
    splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
  2. The method according to claim 1, characterized in that segmenting the initial video into video segments according to the preset video segmentation method comprises:
    using a preset shot boundary detection method to determine the shot boundaries contained in the initial video; and
    segmenting the initial video into video segments according to the determined shot boundaries.
  3. The method according to claim 2, characterized in that the shot boundaries comprise sudden change shots and gradual change shots of the initial video, and segmenting the initial video into video segments according to the determined shot boundaries comprises:
    removing the sudden change shots and the gradual change shots from the initial video to obtain a set of video segments, the set of video segments consisting of the video segments remaining after removal.
  4. The method according to claim 3, characterized in that the video consists of consecutive frames, and the process of determining the sudden change shots and the gradual change shots comprises:
    calculating the degree of difference between every frame and its adjacent frame;
    when the degree of difference exceeds a first preset threshold, determining that the frame is a sudden change frame, a sudden change shot consisting of consecutive sudden change frames;
    when the degree of difference is between the first preset threshold and a second preset threshold, determining that the frame is a potential gradual change frame; and
    when the number of consecutive potential gradual change frames exceeds a third preset threshold, determining that the potential gradual change frames are gradual change frames, a gradual change shot consisting of consecutive gradual change frames.
  5. The method according to any one of claims 1 to 4, characterized in that inputting the video segments into the preset model and determining the confidence of each video segment for every preset video classification comprises:
    sampling each video segment according to a preset sampling method to obtain at least two sampled frames corresponding to the video segment; and
    preprocessing the sampled frames, and inputting the preprocessed sampled frames into the preset model to obtain the confidence of the video segment for every preset video classification.
  6. The method according to claim 5, characterized in that inputting the preprocessed sampled frames into the preset model comprises:
    extracting the spatiotemporal features contained in the preprocessed sampled frames, and inputting the spatiotemporal features into the preset model.
  7. The method according to any one of claims 1 to 4, characterized in that the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
  8. The method according to any one of claims 1 to 4, characterized in that the method further comprises receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and the confidence of each video segment for every preset video classification comprises:
    determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence of each video segment for every preset video classification, and the durations of the video segments.
  9. A video generation apparatus, characterized in that the apparatus comprises:
    a receiving module, configured to receive an initial video and a target video classification;
    a segmentation module, configured to segment the initial video into video segments according to a preset video segmentation method;
    a processing module, configured to input the video segments into a preset model and determine the confidence of each video segment for every preset video classification;
    a matching module, configured to determine, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification; and
    a splicing module, configured to splice the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
  10. A computer system, characterized in that the system comprises:
    one or more processors; and
    a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the following operations:
    receiving an initial video and a target video classification;
    segmenting the initial video into video segments according to a preset video segmentation method;
    inputting the video segments into a preset model, and determining the confidence of each video segment for every preset video classification;
    determining, according to the target video classification and the confidence of each video segment for every preset video classification, the video segments corresponding to the target video classification; and
    splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain a target video.
PCT/CN2020/111945 2019-12-20 2020-08-28 Video generation method and apparatus, and computer system WO2021120685A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3164771A CA3164771A1 (en) 2019-12-20 2020-08-28 Video generating method, device and computer system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911330586.7A CN111161392B (en) 2019-12-20 2019-12-20 Video generation method and device and computer system
CN201911330586.7 2019-12-20

Publications (1)

Publication Number Publication Date
WO2021120685A1

Family

ID=70557685

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111945 WO2021120685A1 (en) 2019-12-20 2020-08-28 Video generation method and apparatus, and computer system

Country Status (3)

Country Link
CN (1) CN111161392B (en)
CA (1) CA3164771A1 (en)
WO (1) WO2021120685A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827714A (en) * 2022-04-11 2022-07-29 咪咕文化科技有限公司 Video restoration method based on video fingerprints, terminal equipment and storage medium
CN115348478A (en) * 2022-07-25 2022-11-15 深圳市九洲电器有限公司 Device interaction display method and device, electronic device and readable storage medium
CN116567353A (en) * 2023-07-10 2023-08-08 湖南快乐阳光互动娱乐传媒有限公司 Video delivery method and device, storage medium and electronic equipment

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111182367A (en) * 2019-12-30 2020-05-19 苏宁云计算有限公司 Video generation method and device and computer system
CN111935528B (en) * 2020-06-22 2022-12-16 北京百度网讯科技有限公司 Video generation method and device
CN114286197A (en) * 2022-01-04 2022-04-05 土巴兔集团股份有限公司 Method and related device for rapidly generating short video based on 3D scene

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160277779A1 (en) * 2013-12-04 2016-09-22 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for processing video image
CN109922373A (en) * 2019-03-14 2019-06-21 上海极链网络科技有限公司 Method for processing video frequency, device and storage medium
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device
CN110392281A (en) * 2018-04-20 2019-10-29 腾讯科技(深圳)有限公司 Image synthesizing method, device, computer equipment and storage medium
CN111182367A (en) * 2019-12-30 2020-05-19 苏宁云计算有限公司 Video generation method and device and computer system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108924464B (en) * 2018-07-10 2021-06-08 腾讯科技(深圳)有限公司 Video file generation method and device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160277779A1 (en) * 2013-12-04 2016-09-22 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for processing video image
CN110392281A (en) * 2018-04-20 2019-10-29 腾讯科技(深圳)有限公司 Image synthesizing method, device, computer equipment and storage medium
CN109922373A (en) * 2019-03-14 2019-06-21 上海极链网络科技有限公司 Method for processing video frequency, device and storage medium
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device
CN111182367A (en) * 2019-12-30 2020-05-19 苏宁云计算有限公司 Video generation method and device and computer system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827714A (en) * 2022-04-11 2022-07-29 咪咕文化科技有限公司 Video restoration method based on video fingerprints, terminal equipment and storage medium
CN114827714B (en) * 2022-04-11 2023-11-21 咪咕文化科技有限公司 Video fingerprint-based video restoration method, terminal equipment and storage medium
CN115348478A (en) * 2022-07-25 2022-11-15 深圳市九洲电器有限公司 Device interaction display method and device, electronic device and readable storage medium
CN115348478B (en) * 2022-07-25 2023-09-19 深圳市九洲电器有限公司 Equipment interactive display method and device, electronic equipment and readable storage medium
CN116567353A (en) * 2023-07-10 2023-08-08 湖南快乐阳光互动娱乐传媒有限公司 Video delivery method and device, storage medium and electronic equipment
CN116567353B (en) * 2023-07-10 2023-09-12 湖南快乐阳光互动娱乐传媒有限公司 Video delivery method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111161392B (en) 2022-12-16
CN111161392A (en) 2020-05-15
CA3164771A1 (en) 2021-06-24

Similar Documents

Publication Publication Date Title
WO2021120685A1 (en) Video generation method and apparatus, and computer system
US10657652B2 (en) Image matting using deep learning
US11176415B2 (en) Assisted image annotation
US10810435B2 (en) Segmenting objects in video sequences
CN111182367A (en) Video generation method and device and computer system
EP4009231A1 (en) Video frame information labeling method, device and apparatus, and storage medium
CN107705808A (en) A kind of Emotion identification method based on facial characteristics and phonetic feature
CA3083486C (en) Method, medium, and system for live preview via machine learning models
WO2014174932A1 (en) Image processing device, program, and image processing method
CN109284729A (en) Method, apparatus and medium based on video acquisition human face recognition model training data
CN110832583A (en) System and method for generating a summary storyboard from a plurality of image frames
CN113033537A (en) Method, apparatus, device, medium and program product for training a model
CN107024989A (en) A kind of husky method for making picture based on Leap Motion gesture identifications
CN106446223B (en) Map data processing method and device
CN109409432B (en) A kind of image processing method, device and storage medium
US10755087B2 (en) Automated image capture based on emotion detection
CN113689436B (en) Image semantic segmentation method, device, equipment and storage medium
US10819876B2 (en) Video-based document scanning
CN111259192A (en) Audio recommendation method and device
US11948360B2 (en) Identifying representative frames in video content
CN110795925A (en) Image-text typesetting method based on artificial intelligence, image-text typesetting device and electronic equipment
JP2023543964A (en) Image processing method, image processing device, electronic device, storage medium and computer program
CN112839185B (en) Method, apparatus, device and medium for processing image
CN113223125A (en) Face driving method, device, equipment and medium for virtual image
WO2023197648A1 (en) Screenshot processing method and apparatus, electronic device, and computer readable medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20903874

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3164771

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20903874

Country of ref document: EP

Kind code of ref document: A1