CA3166347A1 - Video generation method and apparatus, and computer system - Google Patents
- Publication number
- CA3166347A1
- Authority
- CA
- Canada
- Prior art keywords
- video
- preset
- target
- frames
- shot
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/812—Monomedia components thereof involving advertisement data
Abstract
Disclosed in the present application are a video generation method and apparatus, and a computer system. The method comprises: acquiring an original image; rendering the original image according to a pre-determined rendering method to obtain a key frame; rendering the key frame according to a pre-determined rendering method to obtain an intermediate frame corresponding to the key frame; and generating a video corresponding to the key frame, the video consisting of the key frame and the intermediate frame corresponding to the key frame. The video generating process is thereby made low in cost and high in efficiency while taking into account both scalability and content individualization.
Description
VIDEO GENERATION METHOD AND APPARATUS, AND COMPUTER SYSTEM
BACKGROUND OF THE INVENTION
Technical Field
[0001] The present invention relates to the field of computer vision technology, and more particularly to a video generating method and a corresponding device and computer system.
Description of Related Art
[0002] With the quickening tempo of modern life, consumers hope to obtain relevant product information more visually and directly. The traditional approach of presenting commodities through a fixed number of pictures can no longer satisfy the requirement of e-commerce platforms to present commodity characteristics that help consumers select products and make purchase decisions. Instead, short commodity presentation videos showing product functions or actual use effects have become the mainstream of commodity promotion for large online retailers. However, the large quantities of commodity videos uploaded by users such as merchants vary in quality, are not fixed in length, and cannot meet the marketing requirements of the platforms.
[0003] In the state of the art, the generation of commodity videos falls into two large categories: the traditional manual method and graphic-to-video conversion. The traditional manual method is to manually segment the shots of the uploaded source video according to scene contents, target source materials, and so on, and thereafter to manually screen and join the video segments that satisfy the marketing standards into creative commodity marketing short videos that satisfy user requirements. This method places high technical demands on the operator, suffers from low timeliness and high subjectivity during the manual operation, and cannot satisfy the marketing requirements for videos.
[0004] The method of graphic-to-video conversion requires cutouts of the commodity presentation pictures provided by merchants; the cutouts are thereafter placed on preset image backgrounds to form finished commodity pictures, template files such as video templates and background music are obtained from the video source material libraries existing on the platforms, and commodity videos are generated in batches from these template files. Although this achieves generation of commodity videos in large batches, the styles and formats of the commodity videos are completely dependent upon the template files preconfigured in the source material libraries, so the generated videos are close in style and lacking in variety, fall short of visually and directly presenting the actual status of commodities to consumers, and are rather limited in expressive capability.
SUMMARY OF THE INVENTION
[0005] In order to deal with deficiencies in the state of the art, a main objective of the present invention is to provide a video generating method to realize automatic generation of target videos according to initial videos.
[0006] In order to achieve the above objective, according to the first aspect, the present invention provides a video generating method that comprises:
[0007] receiving an initial video and a target video classification;
[0008] segmenting the initial video into video segments according to a preset video segmenting method;
[0009] inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
[0010] determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications; and
[0011] joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
[0012] In some embodiments, the step of segmenting the initial video into video segments according to a preset video segmenting method includes:
[0013] employing a preset shot boundary detection method to determine a shot boundary contained in the initial video; and
[0014] segmenting the initial video into video segments according to the determined shot boundary.
[0015] In some embodiments, the shot boundary contains an abrupt shot and a gradual shot of the initial video, and the step of segmenting the initial video into video segments according to the determined shot boundary includes:
[0016] eliminating the abrupt shot and the gradual shot from the initial video, and obtaining a video segment collection consisting of the video segments remaining after the elimination.
[0017] In some embodiments, the video consists of consecutive frames, and the step of determining the abrupt shot and the gradual shot includes:
[0018] calculating degrees of deviation of all the frames from adjacent frames;
[0019] judging, when a given degree of deviation exceeds a first preset threshold, the given frame as an abrupt frame, wherein the abrupt shot consists of consecutive abrupt frames;
[0020] judging, when the given degree of deviation is between the first preset threshold and a second preset threshold, the given frame as a latent gradual frame; and
[0021] judging, when the number of consecutive latent gradual frames exceeds a third preset threshold, the latent gradual frames as gradual frames, wherein the gradual shot consists of consecutive gradual frames.
[0022] In some embodiments, the step of inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications includes:
[0023] sampling the video segments according to a preset sampling method, and obtaining at least two sample frames to which the video segments correspond; and
[0024] preprocessing the sample frames, inputting the preprocessed sample frames in the preset model, and obtaining confidences of the video segments corresponding to all the preset video classifications.
[0025] In some embodiments, the step of inputting the preprocessed sample frames in the preset model includes:
[0026] extracting spatiotemporal features contained in the preprocessed sample frames, and inputting the spatiotemporal features in the preset model.
[0027] In some embodiments, the preset model is a previously trained MFnet 3D convolutional neural network model.
[0028] In some embodiments, the method further comprises receiving a target duration, and the step of determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications includes:
[0029] determining the video segment to which the target video classification corresponds according to the target duration, the target video classification, the confidence of each video segment corresponding to all preset video classifications, and a duration of the video segment.
[0030] According to the second aspect, there is provided a video generating device that comprises:
[0031] a receiving module, for receiving an initial video and a target video classification;
[0032] a segmenting module, for segmenting the initial video into video segments according to a preset video segmenting method;
[0033] a processing module, for inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
[0034] a matching module, for determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications; and
[0035] a joining module, for joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
[0036] According to the third aspect, the present application provides a computer system that comprises:
[0037] one or more processor(s); and
[0038] a memory, associated with the one or more processor(s), for storing a program instruction that performs the following operations when it is read and executed by the one or more processor(s):
[0039] receiving an initial video and a target video classification;
[0040] segmenting the initial video into video segments according to a preset video segmenting method;
[0041] inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
[0042] determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications; and
[0043] joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
[0044] The present invention achieves the following advantageous effects.
[0045] The present invention discloses a video generating method: by receiving an initial video and a target video classification, segmenting the initial video into video segments according to a preset video segmenting method, inputting the video segments in a preset model, determining confidence of each video segment corresponding to all preset video classifications, determining the video segment to which the target video classification corresponds according to the target video classification and those confidences, joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video, automatic generation of target videos that conform to requirements is realized according to initial videos, and the timeliness and precision of video generation are ensured.
[0046] The present invention further proposes employing a preset shot boundary detection method to determine a shot boundary contained in the initial video, and segmenting the initial video into video segments according to the determined shot boundary. It still further proposes that the shot boundary contains an abrupt shot and a gradual shot of the initial video, and that segmenting the initial video according to the determined shot boundary includes eliminating the abrupt shot and the gradual shot from the initial video and obtaining a video segment collection consisting of the video segments remaining after the elimination, whereby the precision of video segmentation is guaranteed.
[0047] The present application discloses sampling the video segments according to a preset sampling method, obtaining at least two sample frames to which the video segments correspond, preprocessing the sample frames, inputting the preprocessed sample frames in the preset model, and obtaining confidences of the video segments corresponding to all the preset video classifications. The preset video classification to which the confidence having the maximum value corresponds is determined as the preset video classification of the given video segment, that maximum confidence is determined as the confidence of the given video segment, and the video segment to which the target video classification corresponds, together with its confidence, is determined according to the preset video classifications and confidences of all the video segments, whereby the precision in calculating the confidences is guaranteed.
[0048] Not all products of the present invention are necessarily required to possess all of the aforementioned effects.
BRIEF DESCRIPTION OF THE DRAWINGS
[0049] In order to more clearly describe the technical solutions in the embodiments of the present invention, the drawings required for illustrating the embodiments are briefly introduced below. Apparently, the drawings described below are merely directed to some embodiments of the present invention, and persons ordinarily skilled in the art can obtain other drawings based on these drawings without creative effort.
[0050] Fig. 1 is a view schematically illustrating the structure of the model network provided by an embodiment of the present application;
[0051] Fig. 2 is a flowchart illustrating shot segmentation provided by an embodiment of the present application;
[0052] Fig. 3 is a flowchart illustrating model training provided by an embodiment of the present application;
[0053] Fig. 4 is a flowchart illustrating the method provided by an embodiment of the present application;
[0054] Fig. 5 is a view illustrating the structure of the device provided by an embodiment of the present application; and
[0055] Fig. 6 is a view illustrating the structure of the computer system provided by an embodiment of the present application.
DETAILED DESCRIPTION OF THE INVENTION
[0056] In order to make the objectives, technical solutions, and advantages of the present invention more lucid and clear, the technical solutions in the embodiments of the present invention will be clearly and comprehensively described below with reference to the accompanying drawings. Apparently, the embodiments as described are merely some, rather than all, of the embodiments of the present invention. All other embodiments obtainable by persons ordinarily skilled in the art based on the embodiments of the present invention without creative effort shall be covered by the protection scope of the present invention.
[0057] As noted in the Description of Related Art, the two methods of generating commodity videos frequently used in the state of the art are each restricted to a certain degree. The manual editing method is high in manpower cost and low in efficiency, and cannot satisfy the practical requirement of generating commodity videos in large batches; although the video generating method based on graphic conversion achieves higher efficiency, the available video formats and styles are few and fixed, and its capability of expression is rather limited.
[0058] In order to solve the aforementioned technical problems, the present application proposes obtaining video segments by segmenting the video uploaded by a user with a preset segmenting method, employing a preset classification model to classify each video segment and obtain the confidence to which each video segment corresponds, and joining the video segments whose confidences satisfy a preset condition in the classification selected by the user, so as to obtain the target video. Generation of a target video that conforms to requirements according to the video uploaded by the user is thus realized, and the timeliness of video generation is guaranteed at the same time.
[0059] Embodiment 1
[0060] In order to classify the video segments obtained by segmentation, the classification model must be trained in advance; specifically, an MFnet 3D convolutional neural network model can be used as the classification model. The MFnet 3D convolutional neural network model is a lightweight deep learning model; relative to such recent deep learning models as I3D and SlowFastNet, it is more refined and simplified, requires fewer floating-point operations (FLOPs), and exhibits a better testing effect on the testing dataset.
[0061] The training process includes:
[0062] 110 - importing a training dataset;
[0063] the training dataset can be generated by the following method:
[0064] 111 - obtaining a preset number of commodity videos, and creating a corresponding video folder for each video;
[0065] 112 - classifying segments contained in each video into different categories according to the different contents presented, wherein the categories include, but are not limited to, commodity subject appearance, commodity usage scene, and commodity content introduction, and performing manual editing according to the classified categories;
[0066] 113 - creating, under the folder to which each video corresponds, a home folder for each category, wherein the home folder marks the corresponding category and contains one or more sub video segment folder(s) of the video corresponding to that category, and the image frame(s) of the corresponding video segment are stored under the sub video segment folder(s);
[0067] 114 - densely sampling the folder to which each video corresponds, and normalizing the samples to a size of N×C×H×W, where N indicates the number of sample frames of each sub video segment folder, C indicates the RGB channels of each frame, H indicates the preset height of each frame, and W indicates the preset width of each frame; preferably, N is at least 8.
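A minimal sketch of this N×C×H×W normalization follows, assuming frames have already been resized to the preset H×W; frame loading is stubbed with random data, and the concrete H and W values are illustrative assumptions (the text only fixes N at 8 or more):

```python
# Sketch of normalizing one video segment to an N x C x H x W sample.
import numpy as np

N, C, H, W = 8, 3, 224, 224  # assumed preset sizes; the text only fixes N >= 8

def normalize_segment(frames: list[np.ndarray]) -> np.ndarray:
    """Uniformly sample N frames from a segment and stack to N x C x H x W."""
    idx = np.linspace(0, len(frames) - 1, N).astype(int)   # uniform sampling
    sampled = []
    for i in idx:
        f = frames[i].astype(np.float32) / 255.0           # scale to [0, 1]
        # frames are assumed already resized to H x W x C; move channels first
        sampled.append(f.transpose(2, 0, 1))
    return np.stack(sampled)                               # N x C x H x W

segment = [np.random.randint(0, 256, (H, W, C), dtype=np.uint8) for _ in range(40)]
print(normalize_segment(segment).shape)  # (8, 3, 224, 224)
```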
[0068] 120 - employing the training dataset to train the MFnet 3D convolutional neural network model, and obtaining the preset model.
[0069] Fig. 1 is a view schematically illustrating the network structure of the model, which includes a 3DCNN for extracting the 3D convolution feature contained in each sample. The 3D convolution feature contains a spatiotemporal feature, including such movement information of objects inside the video stream as the movement tendency and background variation of commodities.
[0070] 3Dpooling is a pooling layer of the model used for pooling the output from the 3DCNN; the pooling result is input to a 3D MF-Unit layer, where such different convolution operations as 1×1×1, 3×3×3, and 1×3×3 are carried out;
[0071] Global Pool is a global pooling layer used for retaining the key features of the input result while reducing unnecessary parameters;
[0072] the FC layer is a fully connected layer used for outputting the confidence of each video segment corresponding to each category.
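The following rough PyTorch sketch mirrors the layer sequence just described: a 3D convolutional stem, 3D pooling, a block of mixed 1×1×1 / 3×3×3 / 1×3×3 convolutions standing in for the 3D MF-Unit, global pooling, and a fully connected head. The channel widths, kernel arrangement, and three-category head are illustrative assumptions, not the patented MFnet architecture:

```python
import torch
import torch.nn as nn

class ToyMFNet(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.stem = nn.Conv3d(3, 16, kernel_size=3, padding=1)   # "3DCNN"
        self.pool = nn.MaxPool3d(kernel_size=2)                  # "3Dpooling"
        # stand-in for the 3D MF-Unit's mixed 1x1x1 / 3x3x3 / 1x3x3 convolutions
        self.mixed = nn.Sequential(
            nn.Conv3d(16, 32, kernel_size=(1, 1, 1)),
            nn.Conv3d(32, 32, kernel_size=(3, 3, 3), padding=1),
            nn.Conv3d(32, 32, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        )
        self.global_pool = nn.AdaptiveAvgPool3d(1)               # "Global Pool"
        self.fc = nn.Linear(32, num_classes)                     # "FC layer"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch x C x N x H x W (channels-first video tensor)
        x = self.pool(torch.relu(self.stem(x)))
        x = self.mixed(x)
        x = self.global_pool(x).flatten(1)
        return self.fc(x)  # per-category scores; softmax gives confidences

logits = ToyMFNet()(torch.randn(1, 3, 8, 112, 112))
print(torch.softmax(logits, dim=1))  # confidence per category
```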
[0073] The model is employed to test a short video testing set with 56 commodities, and the testing result is as shown in Table 1:
Model    Loss     Accuracy/%    Model size/MB    Inference time/ms
MFnet    0.022    95.92         29.6             330

Table 1
[0074] The model can classify samples obtained by dense sampling of a single shot. In a test over altogether 1119 testing samples of the above video dataset, the classification accuracy reaches 95.92%, the single model occupies only 29.6 MB, and the forward inference time for a video densely sampled from a single shot is 330 ms, so the accuracy is high and the speed is fast.
[0075] After the preset model has been obtained, generation of the video can be realized according to the model. As shown in Fig. 2, the generating process includes:
[0076] Step A - receiving an initial video input by a user;
[0077] Step B - performing shot boundary detection on the initial video, segmenting the video according to the detection result, eliminating redundant segments, and obtaining video segments.
[0078] As shown in Fig. 3, the shot boundary detecting process includes the following.
[0079] Each frame of the initial video is first equally divided into a preset number of subblocks by the same preset method, and a sub histogram of each subblock is calculated. A difference in histograms is then calculated between subblocks at the same positions of adjacent frames, where the adjacent frames of each frame include the previous frame and the next frame. When the difference exceeds a first preset threshold TH, this indicates that the corresponding subblocks of the adjacent frames differ unduly much; when the number of such much-different subblocks of a certain frame is higher than a second preset threshold, the frame is considered an abrupt frame, and consecutive abrupt frames constitute an abrupt shot. Any frame whose difference lies between the first preset threshold TH and a third preset threshold TL is determined as a latent start frame; when the differences of its sequentially following frames likewise lie between TL and TH, and the duration of continuation exceeds a fourth preset threshold, these consecutive frames are determined as gradual frames, which constitute a gradual shot. A shot from which the gradual and abrupt shots have been eliminated is considered a normal shot.
[0080] In order to guarantee the effect of the generated video, any unduly short shot whose length is less than a fifth preset threshold is further eliminated from the normal shots, and the required video segment collection is finally obtained.
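A simplified sketch of the block-histogram comparison described above follows; the number of subblocks, the number of histogram bins, and the thresholds are illustrative assumptions, and only the abrupt-frame test is shown:

```python
# Sketch of the per-block histogram difference test for abrupt frames.
import numpy as np

BLOCKS, BINS = 4, 16          # 4x4 sub-blocks, 16-bin gray histograms (assumed)
TH, BLOCK_LIMIT = 0.5, 8      # per-block diff threshold, abrupt-block count (assumed)

def block_histograms(frame: np.ndarray) -> np.ndarray:
    h, w = frame.shape
    bh, bw = h // BLOCKS, w // BLOCKS
    hists = []
    for i in range(BLOCKS):
        for j in range(BLOCKS):
            block = frame[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            hist, _ = np.histogram(block, bins=BINS, range=(0, 256))
            hists.append(hist / hist.sum())                # normalized
    return np.array(hists)                                 # (BLOCKS^2) x BINS

def is_abrupt(prev: np.ndarray, cur: np.ndarray) -> bool:
    d = np.abs(block_histograms(prev) - block_histograms(cur)).sum(axis=1)
    return (d > TH).sum() > BLOCK_LIMIT   # too many blocks changed too much

a = np.random.randint(0, 256, (128, 128))
b = np.zeros_like(a)                     # cut to a black frame
print(is_abrupt(a, b))                   # True
```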
[0081] Step C - sampling the video segments, inputting the sampling result in the preset model, and obtaining the category and confidence to which each video segment corresponds.
[0082] Firstly, the video segments are randomly and densely sampled according to the temporal sequence of the video.
[0083] The random and dense sampling process includes the following.
[0084] Sample points are randomly initialized on the video segments, seven sample points are taken, emphasis is placed at the end of the video segments, N frames are uniformly sampled, and the sample frames are preprocessed so that they satisfy the input size requirement of the preset model.
[0085] The preprocessed sample frames are thereafter input in the preset model, and confidences of the video segments including the sample frames corresponding to all the categories are obtained.
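The random dense sampling step might be sketched as follows; the text does not specify the end-weighting scheme, so the linear weighting used here is an assumption:

```python
# Sketch of random dense sampling biased toward the end of a segment.
import numpy as np

def dense_sample_indices(num_frames: int, n: int = 8, rng=None) -> np.ndarray:
    if rng is None:
        rng = np.random.default_rng()
    weights = np.linspace(1.0, 2.0, num_frames)   # emphasize later frames
    weights /= weights.sum()
    idx = rng.choice(num_frames, size=n, replace=False, p=weights)
    return np.sort(idx)                           # keep temporal order

print(dense_sample_indices(120))
# The selected frames would then be resized and normalized as in the training
# pipeline and passed through the preset model; softmax over the FC output
# yields the per-category confidences of the segment.
```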
[0086] Step D - joining the video segment to which a target category corresponds according to the target category and a target duration selected by the user, and generating a target video.
[0087] For instance, when the user wants an appearance presentation video of the current commodity, the video segments are sorted according to their confidences for the corresponding appearance presentation category, and the video segments that conform to the requirement are screened out.
[0088] The specific screening rule can include the following.
[0089] When the duration Ts of the video segment with the highest confidence already satisfies the requirement of the target duration, the video segment with the highest confidence is directly taken as the target video;
[0090] when the duration Ts of the video segment with the highest confidence does not satisfy the requirement of the target duration, the following n video segments with durations Tj are sequentially selected according to the sorting of the confidence values, where j ∈ [1, n], until the following formula is satisfied:
[0091] T1 ≤ Ts + Σ_{j=1}^{n} Tj ≤ T2, where T1 and T2 respectively indicate the lower and upper bounds of the target duration;
[0092] when the total duration of the n+1 shots selected according to the above confidence scores exceeds the maximum duration T2, both ends of the longest shot are cut according to the duration of each shot, until the total duration satisfies the requirement of the target duration.
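This screening rule can be sketched as a greedy selection over confidence-sorted segments; in the minimal sketch below, t1 and t2 stand for the bounds T1 and T2, durations are in seconds, and the trimming policy is simplified to a single cut of the longest pick:

```python
# Greedy duration-constrained segment selection, following the rule above.
def select_segments(segments, t1: float, t2: float):
    """segments: list of (duration_seconds, confidence) tuples."""
    ordered = sorted(segments, key=lambda s: s[1], reverse=True)
    picked, total = [], 0.0
    for dur, conf in ordered:
        picked.append([dur, conf])
        total += dur
        if total >= t1:                  # accumulated duration reached T1
            break
    if total > t2:                       # overshoot of T2: cut the longest shot
        longest = max(picked, key=lambda s: s[0])
        longest[0] -= total - t2         # in practice both ends are trimmed
    return picked

print(select_segments([(4.0, 0.97), (6.0, 0.91), (3.0, 0.85)], 8.0, 9.0))
# -> [[4.0, 0.97], [5.0, 0.91]]: total duration 9.0, inside [T1, T2]
```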
[0093] Step E - sequentially joining the video segments obtained in Step D according to the temporal sequence of the initial video, and obtaining the target video.
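One possible way to perform the joining of Step E is via ffmpeg's concat demuxer, as sketched below; the patent does not specify a joining tool, so the file paths and the availability of ffmpeg are assumptions, and the preset joining parameter is reduced here to plain stream copying:

```python
# Sketch: concatenate segment files in their original temporal order.
import subprocess
import tempfile

def join_segments(segment_paths: list[str], output_path: str) -> None:
    # segment_paths must already be sorted by position in the source video
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for p in segment_paths:
            f.write(f"file '{p}'\n")
        list_file = f.name
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file,
         "-c", "copy", output_path],
        check=True,
    )

# join_segments(["seg_003.mp4", "seg_007.mp4"], "target_video.mp4")
```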
[0094] The generated target video can be stored in a video database to be reused when it is required next time, or used for continued training of the model.
[0095] Based on the foregoing solution provided by the present application, generation of a target video that conforms to the requirement according to the video uploaded by the user is realized, and the timeliness of video generation is ensured at the same time.
[0096] Embodiment 2
[0097] Corresponding to the foregoing embodiment, the present application provides a video generating method, as shown in Fig. 4, the method comprises:
[0098] 410 - receiving an initial video and a target video classification;
[0099] 420 - segmenting the initial video into video segments according to a preset video segmenting method;
[0100] preferably, the method comprises:
[0101] 421 - employing a preset shot boundary detection method to determine a shot boundary contained in the initial video; and
[0102] segmenting the initial video into video segments according to the determined shot boundary;
[0103] preferably, the shot boundary contains an abrupt shot and a gradual shot of the initial video, and the method comprises:
[0104] 422 - eliminating the abrupt shot and the gradual shot from the initial video, and obtaining a video segment collection consisting of the video segments remaining after the elimination;
[0105] preferably, the video consists of consecutive frames, and the step of determining the abrupt shot and the gradual shot includes:
[0106] 423 - calculating degrees of deviation of all the frames from adjacent frames;
[0107] judging, when a given degree of deviation exceeds a first preset threshold, the given frame as an abrupt frame, wherein the abrupt shot consists of consecutive abrupt frames;
[0108] judging, when the given degree of deviation is between the first preset threshold and a second preset threshold, the given frame as a latent gradual frame; and
[0109] judging, when the number of consecutive latent gradual frames exceeds a third preset threshold, the latent gradual frames as gradual frames, wherein the gradual shot consists of consecutive gradual frames;
[0110] 430 - inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
[0111] preferably, the method comprises:
[0112] 431 - sampling the video segments according to a preset sampling method, and obtaining at least two sample frames to which the video segments correspond; and
[0113] preprocessing the sample frames, inputting the preprocessed sample frames in the preset model, and obtaining confidences of the video segments corresponding to all the preset video classifications;
[0114] preferably, the obtained sample frames are at least eight frames;
[0115] preferably, the step of inputting the preprocessed sample frames in the preset model includes:
[0116] 432 - extracting spatiotemporal features contained in the preprocessed sample frames, and inputting the spatiotemporal features in the preset model;
[0117] preferably, the preset model is a previously trained MFnet 3D convolutional neural network model;
[0118] 440 - determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications;
[0119] preferably, the method further comprises receiving a target duration, and the step of determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications includes:
[0120] 441 - determining the video segment to which the target video classification corresponds according to the target duration, the target video classification, the confidence of each video segment corresponding to all preset video classifications, and a duration of the video segment; and
[0121] 450 - joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
[0122] Embodiment 3
[0123] Corresponding to the foregoing method embodiment, the present application provides a video generating device, as shown in Fig. 5, the device comprises:
[0124] a receiving module 510, for receiving an initial video and a target video classification;
[0125] a segmenting module 520, for segmenting the initial video into video segments according to a preset video segmenting method;
[0126] a processing module 530, for inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
[0127] a matching module 540, for determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications; and
[0128] a joining module 550, for joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
[0129] Preferably, the segmenting module 520 is further usable for employing a preset shot boundary detection method to determine a shot boundary contained in the initial video; and
[0130] segmenting the initial video into video segments according to the determined shot boundary.
[0131] Preferably, the shot boundary contains an abrupt shot and a gradual shot of the initial video, and the segmenting module 520 is further usable for eliminating the abrupt shot and the gradual shot from the initial video, and obtaining a video segment collection consisting of the video segments remaining after the elimination.
[0132] Preferably, the video consists of consecutive frames, and the segmenting module 520 is further usable for calculating degrees of deviation of all the frames from adjacent frames; judging, when a given degree of deviation exceeds a first preset threshold, the given frame as an abrupt frame, wherein the abrupt shot consists of consecutive abrupt frames; judging, when the given degree of deviation is between the first preset threshold and a second preset threshold, the given frame as a latent gradual frame; and judging, when the number of consecutive latent gradual frames exceeds a third preset threshold, the latent gradual frames as gradual frames, wherein the gradual shot consists of consecutive gradual frames.
[0133] Preferably, the processing module 530 is further usable for sampling the video segments according to a preset sampling method, and obtaining at least two sample frames to which the video segments correspond; and preprocessing the sample frames, inputting the preprocessed sample frames in the preset model, and obtaining confidences of the video segments corresponding to all the preset video classifications.
[0134] Preferably, the processing module 530 is further usable for extracting spatiotemporal features contained in the preprocessed sample frames, and inputting the spatiotemporal features in the preset model.
[0135] Preferably, the preset model is a previously trained MFnet 3D convolutional neural network model.
[0136] Preferably, the receiving module 510 is further usable for receiving a target duration, and the matching module 540 is further usable for determining the video segment to which the target video classification corresponds according to the target duration, the target video classification, the confidence of each video segment corresponding to all preset video classifications, and a duration of the video segment.
[0137] Embodiment 4
[0138] Corresponding to the foregoing method and device, embodiment 4 of the present application provides a computer system that comprises: one or more processor(s); and a memory, associated with the one or more processor(s), for storing a program instruction that performs the following operations when it is read and executed by the one or more processor(s):
[0139] receiving an initial video and a target video classification;
[0140] segmenting the initial video into video segments according to a preset video segmenting method;
[0141] inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
[0142] determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications; and
[0143] joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
[0144] Fig. 6 exemplarily illustrates the framework of the computer system that can specifically include a processor 1510, a video display adapter 1511, a magnetic disk driver 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 can be communicably connected with one another via a communication bus 1530.
[0145] The processor 1510 can be embodied as a general CPU (Central Processing Unit), a microprocessor, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuit(s) for executing relevant program(s) to realize the technical solutions provided by the present application.
[0146] The memory 1520 can be embodied in such a form as a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, or a dynamic storage device. The memory 1520 can store an operating system 1521 for controlling the running of the computer system 1500, and a basic input/output system (BIOS) for controlling lower-level operations of the computer system 1500. In addition, the memory 1520 can also store a web browser 1523, a data storage administration system 1524, and an icon font processing system 1525, etc. The icon font processing system 1525 can be an application program that specifically realizes the aforementioned step operations in the embodiments of the present application. To sum up, when the technical solutions provided by the present application are realized via software or firmware, the relevant program codes are stored in the memory 1520 and invoked and executed by the processor 1510.
[0147] The input/output interface 1513 is employed to connect with an input/output module to realize input and output of information. The input/output module can be equipped in the device as a component part (not shown in the drawings), and can also be externally connected with the device to provide corresponding functions. The input means can include a keyboard, a mouse, a touch screen, a microphone, and various sensors etc., and the output means can include a display screen, a loudspeaker, a vibrator, an indicator light etc.
[0148] The network interface 1514 is employed to connect to a communication module (not shown in the drawings) to realize intercommunication between the current device and other devices. The communication module can realize communication in a wired mode (via USB, network cable, for example) or in a wireless mode (via mobile network, WIFI, Bluetooth, etc.).
[0149] The bus 1530 includes a passageway transmitting information between various component parts of the device (such as the processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
[0150] Additionally, the computer system 1500 may further obtain information of specific collection conditions from a virtual resource object collection condition information database 1541 for judgment on conditions, and so on.
[0151] As should be noted, although merely the processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, the memory 1520, and the bus 1530 are illustrated for the aforementioned device, the device may further include other component parts prerequisite for realizing normal running during specific implementation. In addition, as can be understood by persons skilled in the art, the aforementioned device may as well only include component parts necessary for realizing the solutions of the present application, without including the entire component parts as illustrated.
[0152] As can be known from the description of the aforementioned embodiments, it is clear to persons skilled in the art that the present application can be realized through software plus a general hardware platform. Based on such understanding, the technical solutions of the present application, or the contributions made thereby over the state of the art, can essentially be embodied in the form of a software product; such a computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes plural instructions enabling a computer equipment (such as a personal computer, a cloud server, or a network device) to execute the methods described in the various embodiments, or in some sections of the embodiments, of the present application.
[0153] The various embodiments are progressively described in the Description; identical or similar sections among the various embodiments can be inferred from one another, and each embodiment stresses what is different from the other embodiments. Particularly, with respect to the system embodiment, since it is essentially similar to the method embodiment, its description is relatively simple, and the relevant sections can be inferred from the corresponding sections of the method embodiment. The system embodiment as described above is merely exemplary in nature; units described therein as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is to say, they can be located at a single site or distributed over a plurality of network units. Partial modules or all modules can be selected according to practical requirements to realize the objectives of the embodied solutions. This is understandable and implementable by persons ordinarily skilled in the art without creative effort.
[0154] What is described above is merely directed to preferred embodiments of the present invention, and is not meant to restrict the present invention. Any modification, equivalent substitution, or improvement made within the spirit and scope of the present invention shall be covered by the protection scope of the present invention.
Claims (10)
1. A video generating method, characterized in that the method comprises:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmenting method;
inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications; and joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
2. The method according to Claim 1, characterized in that the step of segmenting the initial video into video segments according to a preset video segmenting method includes:
employing a preset shot boundary detection method to determine a shot boundary contained in the initial video; and segmenting the initial video into video segments according to the determined shot boundary.
3. The method according to Claim 2, characterized in that the shot boundary contains an abrupt shot and a gradual shot of the initial video, and that the step of segmenting the initial video into video segments according to the determined shot boundary includes:
eliminating the abrupt shot and the gradual shot from the initial video, and obtaining a video segment collection consisting of the video segments remaining after the elimination.
4. The method according to Claim 3, characterized in that the video consists of consecutive frames, and that the step of determining the abrupt shot and the gradual shot includes:
calculating degrees of deviation of all the frames from adjacent frames;
judging, when a given degree of deviation exceeds a first preset threshold, the given frame as an abrupt frame, wherein the abrupt shot consists of consecutive abrupt frames;
judging, when the given degree of deviation is between the first preset threshold and a second preset threshold, the given frame as a latent gradual frame; and judging, when the number of consecutive latent gradual frames exceeds a third preset threshold, the latent gradual frames as gradual frames, wherein the gradual shot consists of consecutive gradual frames.
5. The method according to any one of Claims 1 to 4, characterized in that the step of inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications includes:
sampling the video segments according to a preset sampling method, and obtaining at least two sample frames to which the video segments correspond; and preprocessing the sample frames, inputting the preprocessed sample frames in the preset model, and obtaining confidences of the video segments corresponding to all the preset video classifications.
6. The method according to Claim 5, characterized in that the step of inputting the preprocessed sample frames in the preset model includes:
extracting spatiotemporal features contained in the preprocessed sample frames, and inputting the spatiotemporal features in the preset model.
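What counts as a "spatiotemporal feature" depends on the preset model; for a 3D convolutional network it can be as little as the sampled clip laid out as a (batch, channel, time, height, width) tensor. A minimal sketch under that assumption:

```python
import torch

def to_clip_tensor(frames_thwc):
    """(T, H, W, C) float array -> (1, C, T, H, W) tensor, the layout most
    3D-convolutional video models expect."""
    clip = torch.from_numpy(frames_thwc)           # (T, H, W, C)
    return clip.permute(3, 0, 1, 2).unsqueeze(0)   # (1, C, T, H, W)
```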
7. The method according to any one of Claims 1 to 4, characterized in that the preset model is a previously trained MFnet 3D convolutional neural network model.
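Pretrained MFnet 3D weights are not bundled with mainstream libraries, so as a stand-in this sketch scores a clip with torchvision's Kinetics-pretrained R3D-18 video classifier; reading the softmax outputs as the claimed confidences is likewise an assumption.

```python
import torch
from torchvision.models.video import r3d_18

# Stand-in for the patent's MFnet 3D model (see note above).
model = r3d_18(pretrained=True).eval()

def segment_confidences(clip_tensor):
    """Return per-class confidences for one (1, C, T, H, W) clip."""
    with torch.no_grad():
        logits = model(clip_tensor)            # (1, num_classes)
    return torch.softmax(logits, dim=1).squeeze(0)
```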
8. The method according to any one of Claims 1 to 4, characterized in that the method further comprises receiving a target duration, and that the step of determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications includes:
determining the video segment to which the target video classification corresponds according to the target duration, the target video classification, the confidence of each video segment corresponding to all preset video classifications, and a duration of the video segment.
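Claim 8 only requires that the target duration, target classification, per-segment confidences, and segment durations all enter the decision; a greedy policy is one simple reading, sketched below on an assumed segment-dict shape.

```python
def select_segments(segments, target_class, target_duration):
    """Greedily keep the segments most confident for target_class until
    their combined duration reaches target_duration. Each segment is an
    assumed dict with 'duration' (seconds) and 'confidences' (class -> score)."""
    ranked = sorted(segments, key=lambda s: s["confidences"][target_class],
                    reverse=True)
    chosen, total = [], 0.0
    for seg in ranked:
        if total >= target_duration:
            break
        chosen.append(seg)
        total += seg["duration"]
    return chosen
```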
9. A video generating device, characterized in that the device comprises:
a receiving module, for receiving an initial video and a target video classification;
a segmenting module, for segmenting the initial video into video segments according to a preset video segmenting method;
a processing module, for inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
a matching module, for determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications; and a joining module, for joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
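The "preset joining parameter" of the joining module is likewise unspecified; one plausible reading is a crossfade duration between consecutive segments, sketched here with moviepy (the `start`/`end` keys in seconds are assumptions).

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def join_segments(video_path, segments, crossfade=0.5):
    """Cut each chosen segment from the source video and concatenate them,
    overlapping consecutive clips by `crossfade` seconds."""
    source = VideoFileClip(video_path)
    clips = [source.subclip(s["start"], s["end"]) for s in segments]
    clips = [clips[0]] + [c.crossfadein(crossfade) for c in clips[1:]]
    return concatenate_videoclips(clips, padding=-crossfade, method="compose")
```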
10. A computer system, characterized in that the system comprises:
one or more processor(s); and a memory, associated with the one or more processor(s), for storing a program instruction that performs the following operations when it is read and executed by the one or more processor(s):
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmenting method;
inputting the video segments in a preset model, and determining confidence of each video segment corresponding to all preset video classifications;
determining the video segment to which the target video classification corresponds according to the target video classification and the confidence of each video segment corresponding to all preset video classifications; and joining the video segment to which the target video classification corresponds according to a preset joining parameter, and obtaining a target video.
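For orientation only, the earlier sketches can be strung together into the Claim 10 pipeline. Everything here (helper names, dict keys, a fixed frame rate, and skipping the abrupt/gradual-shot elimination) is illustrative glue, not the patented method itself.

```python
def generate_video(video_path, target_class, target_duration, fps=25.0):
    # 1. Segment: pair consecutive boundaries into frame ranges (the tail
    #    after the last boundary is dropped here for brevity).
    b = [0] + detect_shot_boundaries(video_path)
    segments = [{"start_frame": s, "end_frame": e} for s, e in zip(b, b[1:])]
    # 2. Classify each segment with the stand-in model.
    for seg in segments:
        frames = sample_frames(video_path, seg["start_frame"], seg["end_frame"])
        conf = segment_confidences(to_clip_tensor(preprocess(frames)))
        seg["confidences"] = {i: c.item() for i, c in enumerate(conf)}
        seg["duration"] = (seg["end_frame"] - seg["start_frame"]) / fps
        seg["start"] = seg["start_frame"] / fps
        seg["end"] = seg["end_frame"] / fps
    # 3. Match segments to the target classification and duration.
    chosen = select_segments(segments, target_class, target_duration)
    # 4. Join the chosen segments into the target video.
    return join_segments(video_path, chosen)
```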
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911396267.6A CN111182367A (en) | 2019-12-30 | 2019-12-30 | Video generation method and device and computer system |
CN201911396267.6 | 2019-12-30 | ||
PCT/CN2020/111952 WO2021135320A1 (en) | 2019-12-30 | 2020-08-28 | Video generation method and apparatus, and computer system |
Publications (1)
Publication Number | Publication Date |
---|---|
CA3166347A1 (en) | 2021-07-08 |
Family
ID=70657587
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA3166347A Pending CA3166347A1 (en) | 2019-12-30 | 2020-08-28 | Video generation method and apparatus, and computer system |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN111182367A (en) |
CA (1) | CA3166347A1 (en) |
WO (1) | WO2021135320A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115442660A (en) * | 2022-08-31 | 2022-12-06 | 杭州影象官科技有限公司 | Method and device for extracting self-supervision confrontation video abstract |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110545462A (en) * | 2018-05-29 | 2019-12-06 | 优酷网络技术(北京)有限公司 | video processing method and device |
CN111161392B (en) * | 2019-12-20 | 2022-12-16 | 苏宁云计算有限公司 | Video generation method and device and computer system |
CN112132931B (en) * | 2020-09-29 | 2023-12-19 | 新华智云科技有限公司 | Processing method, device and system for templated video synthesis |
CN112632326B (en) * | 2020-12-24 | 2022-02-18 | 北京风平科技有限公司 | Video production method and device based on video script semantic recognition |
CN113676671B (en) * | 2021-09-27 | 2023-06-23 | 北京达佳互联信息技术有限公司 | Video editing method, device, electronic equipment and storage medium |
CN115460446A (en) * | 2022-08-19 | 2022-12-09 | 上海爱奇艺新媒体科技有限公司 | Alignment method and device for multiple paths of video signals and electronic equipment |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2216781C2 (en) * | 2001-06-29 | 2003-11-20 | Самсунг Электроникс Ко., Лтд | Image-based method for presenting and visualizing three-dimensional object and method for presenting and visualizing animated object |
CN100530189C (en) * | 2007-02-13 | 2009-08-19 | 华为技术有限公司 | Method and apparatus for adaptively generating abstract of football video |
CN101252646A (en) * | 2008-01-24 | 2008-08-27 | 王志远 | Method for realizing video frequency propaganda film modularization making |
US9189884B2 (en) * | 2012-11-13 | 2015-11-17 | Google Inc. | Using video to encode assets for swivel/360-degree spinners |
US20160255139A1 (en) * | 2016-03-12 | 2016-09-01 | Yogesh Chunilal Rathod | Structured updated status, requests, user data & programming based presenting & accessing of connections or connectable users or entities and/or link(s) |
RU2586566C1 (en) * | 2015-03-25 | 2016-06-10 | Общество с ограниченной ответственностью "Лаборатория 24" | Method of displaying object |
US10147226B1 (en) * | 2016-03-08 | 2018-12-04 | Pixelworks, Inc. | 2D motion vectors from 3D model data |
CN107767432A (en) * | 2017-09-26 | 2018-03-06 | 盐城师范学院 | A kind of real estate promotional system using three dimensional virtual technique |
US10776440B2 (en) * | 2018-03-15 | 2020-09-15 | Microsoft Technology Licensing, Llc | Query interpolation in computer text input |
CN109121021A (en) * | 2018-09-28 | 2019-01-01 | 北京周同科技有限公司 | A kind of generation method of Video Roundup, device, electronic equipment and storage medium |
CN109657100B (en) * | 2019-01-25 | 2021-10-29 | 深圳市商汤科技有限公司 | Video collection generation method and device, electronic equipment and storage medium |
CN110312117B (en) * | 2019-06-12 | 2021-06-18 | 北京达佳互联信息技术有限公司 | Data refreshing method and device |
CN110232357A (en) * | 2019-06-17 | 2019-09-13 | 深圳航天科技创新研究院 | A kind of video lens dividing method and system |
CN110602526B (en) * | 2019-09-11 | 2021-09-21 | 腾讯科技(深圳)有限公司 | Video processing method, video processing device, computer equipment and storage medium |
CN111161392B (en) * | 2019-12-20 | 2022-12-16 | 苏宁云计算有限公司 | Video generation method and device and computer system |
2019
- 2019-12-30 CN CN201911396267.6A patent/CN111182367A/en active Pending
2020
- 2020-08-28 WO PCT/CN2020/111952 patent/WO2021135320A1/en active Application Filing
- 2020-08-28 CA CA3166347A patent/CA3166347A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN111182367A (en) | 2020-05-19 |
WO2021135320A1 (en) | 2021-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA3166347A1 (en) | Video generation method and apparatus, and computer system | |
US10936915B2 (en) | Machine learning artificial intelligence system for identifying vehicles | |
US20220019853A1 (en) | Systems, methods, and storage media for training a machine learning model | |
US20210182611A1 (en) | Training data acquisition method and device, server and storage medium | |
WO2021120685A1 (en) | Video generation method and apparatus, and computer system | |
CN111212303B (en) | Video recommendation method, server and computer-readable storage medium | |
CN104246656B (en) | It is recommended that video editing automatic detection | |
CN110956202B (en) | Image training method, system, medium and intelligent device based on distributed learning | |
CN109284729A (en) | Method, apparatus and medium based on video acquisition human face recognition model training data | |
CN108875932A (en) | Image-recognizing method, device and system and storage medium | |
US20180204336A1 (en) | Multi-style texture synthesis | |
US11182447B2 (en) | Customized display of emotionally filtered social media content | |
Bakaev et al. | Auto-extraction and integration of metrics for web user interfaces | |
CN108062377A (en) | The foundation of label picture collection, definite method, apparatus, equipment and the medium of label | |
CN113379059B (en) | Model training method for quantum data classification and quantum data classification method | |
KR102196167B1 (en) | Method for evaluating social intelligence and apparatus using the same | |
CN108197203A (en) | A kind of shop front head figure selection method, device, server and storage medium | |
CN108334626B (en) | News column generation method and device and computer equipment | |
KR102508765B1 (en) | User-customized meta content providing system based on artificial neural network and method therefor | |
CN112819767A (en) | Image processing method, apparatus, device, storage medium, and program product | |
CN111159279B (en) | Model visualization method, device and storage medium | |
CN113821296B (en) | Visual interface generation method, electronic equipment and storage medium | |
Fan et al. | Measuring and evaluating the visual complexity of Chinese ink paintings | |
CN112241490A (en) | Classification processing and data searching method and device, electronic equipment and intelligent sound box | |
CN111797258B (en) | Image pushing method, system, equipment and storage medium based on aesthetic evaluation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
EEER | Examination request |
Effective date: 20220629 |