CN115801980A - Video generation method and device - Google Patents

Video generation method and device

Info

Publication number
CN115801980A
Authority
CN
China
Prior art keywords
video
result
processing
model
segments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211389074.XA
Other languages
Chinese (zh)
Inventor
卢杨
牟俊舟
郭士伟
吕晶晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202211389074.XA priority Critical patent/CN115801980A/en
Publication of CN115801980A publication Critical patent/CN115801980A/en
Priority to PCT/CN2023/128301 priority patent/WO2024099171A1/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiments of the disclosure disclose a video generation method and device. One embodiment of the method comprises: acquiring at least two video segments obtained by segmenting a video to be clipped; processing the at least two video segments with a pre-trained video processing model to obtain a processing result, where the processing result represents the probability that each video segment belongs to a video clipping result, and the training samples of the video processing model are obtained by: acquiring a video clipping result set corresponding to an original video, determining an effect index value for each video clipping result in the set, and generating training samples of the video processing model according to the effect index values; and selecting video segments from the at least two video segments according to the processing result to generate a video clipping result. This embodiment realizes a video clipping approach driven by effect-index feedback.

Description

Video generation method and device
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a video generation method and device.
Background
Video editing refers to processing such as nonlinear editing of videos by using various applications or tools, for example, processing such as cutting and merging of videos, and adding materials such as pictures, background music, special effects, scenes and the like into videos to generate new videos with different expressive power.
With the development of the multimedia industry, video has become a primary mode of expression in many fields, such as short-video platforms, product promotion, popular-science content, and travel vlogging. In some scenarios, a user may wish to cut a given video into a new video of shorter duration. For example, presenting a summary video at certain page locations helps a user quickly decide whether the content is of interest and then browse the complete video when it is. For another example, an e-commerce platform may display a short video on a product page that highlights product features so that the user can quickly learn about the product. As another example, for some events or film and television works, it may be desirable to play back highlight videos.
In these scenarios, how to select content from a given video to form a new video is one of the issues to be considered. Existing video editing approaches evaluate the importance of the different video segments of a video, for example from angles such as clarity, color richness, content richness, and diversity, and then select the most important video segments as the new video.
Disclosure of Invention
The embodiment of the disclosure provides a video generation method and device.
In a first aspect, an embodiment of the present disclosure provides a video generation method, where the method includes: acquiring at least two video segments obtained by segmenting a video to be edited; processing at least two video segments by using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to the video clipping result, and the training sample of the video processing model is obtained by the following steps: acquiring a video clip result set corresponding to an original video, respectively determining an effect index value of each video clip result of the video clip result set, and generating a training sample of a video processing model according to the effect index value; and selecting the video segments from the at least two video segments according to the processing result to generate a video clipping result.
In a second aspect, an embodiment of the present disclosure provides a video generating apparatus, including: the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is configured to acquire at least two video segments obtained by segmenting a video to be clipped; a processing unit configured to process at least two video segments by using a pre-trained video processing model to obtain a processing result, wherein the processing result represents a probability that each video segment belongs to a video clipping result, and a training sample of the video processing model is obtained by the following steps: acquiring a video clip result set corresponding to an original video, respectively determining an effect index value of each video clip result of the video clip result set, and generating a training sample of a video processing model according to the effect index value; and the generating unit is configured to select a video segment from the at least two video segments according to the processing result to generate a video clipping result.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which computer program, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program that, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
According to the video generation method and device provided by the embodiments of the disclosure, the effect index values of the video clipping results corresponding to an original video guide the generation of training samples, a video processing model is pre-trained on those samples, the model then processes the video segments obtained by segmenting a video to be clipped to obtain the probability that each segment belongs to a video clipping result, and video segments are selected accordingly to generate the video clipping result of the video to be clipped. Because the training samples are generated from effect indices, the video clipping result produced from the processing result of the trained video processing model can, to a certain extent, be expected to meet the desired effect, which improves the quality of the video clipping result.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a video generation method according to the present disclosure;
FIG. 3 is a flow diagram of one embodiment of generating training samples for a video processing model;
FIG. 4 is a schematic diagram of a network architecture of a video processing model;
FIG. 5 is a schematic block diagram of one embodiment of a video generation apparatus according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and the features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary architecture 100 to which embodiments of the video generation method or video generation apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. Various client applications may be installed on the terminal devices 101, 102, 103. For example, browser-like applications, search-like applications, shopping-like applications, social platforms, video processing-like applications, instant messaging tools, and so forth.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a server that provides back-end support for client applications installed by the terminal devices 101, 102, 103. The server can segment videos to be clipped sent by the terminal devices 101, 102, and 103, process at least two video segments obtained by the segmentation by using a pre-trained video processing model to obtain a processing result, and select a video segment from the at least two video segments according to the processing result to generate a video clipping result of the video to be clipped.
It should be noted that the video to be clipped may also be directly stored locally in the server 105, and the server 105 may directly extract and process the video to be clipped stored locally, in which case, the terminal apparatuses 101, 102, and 103 and the network 104 may not be present.
It should be noted that the video generation method provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, the video generation apparatus is generally disposed in the server 105.
It should be noted that the terminal devices 101, 102, and 103 may also have video processing applications installed therein, and the terminal devices 101, 102, and 103 may also process videos to be clipped based on the video processing applications, in this case, the video generation method may also be executed by the terminal devices 101, 102, and 103, and accordingly, the video generation apparatus may also be provided in the terminal devices 101, 102, and 103. At this point, the exemplary system architecture 100 may not have the server 105 and the network 104.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a video generation method according to the present disclosure is shown. The video generation method comprises the following steps:
Step 201, acquiring at least two video segments obtained by segmenting a video to be clipped.
In this embodiment, the video to be clipped may be any of various types of videos, determined according to the actual application scenario. For example, the video to be clipped may be an introduction video of a certain product. As another example, the video to be clipped may be a video of a certain event. The video to be clipped is generally a video that one wishes to clip into a video of shorter duration than the video to be clipped itself, and that shorter video serves as the video clipping result.
Specifically, the video to be clipped can be segmented in various ways according to the actual application requirements to obtain at least two video segments, that is, a plurality of video segments. For example, the video to be clipped may be segmented at equal intervals to obtain a plurality of video segments. As another example, the video to be clipped may be divided into a plurality of video segments according to the video content (e.g., the continuity and relevance of the content). The segmentation itself can be implemented with various existing video editing applications or tools; the equal-interval case is illustrated in the sketch below. The durations of the resulting video segments may be the same or different. Generally, the content of each video segment belongs to the video to be clipped.
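As a concrete illustration of the equal-interval case, the following Python sketch computes segment boundaries for a video of known duration. The 5-second default segment length, the (start, end) tuple representation, and the handling of the final shorter segment are assumptions made for illustration; they are not prescribed by the disclosure.

```python
def split_equal_intervals(total_duration_s: float, segment_length_s: float = 5.0):
    """Return (start, end) boundaries, in seconds, of equal-interval segments.

    Hypothetical helper: the disclosure only requires that the video to be
    clipped is divided into at least two segments; the default length and
    the shorter final segment are illustrative assumptions.
    """
    boundaries = []
    start = 0.0
    while start < total_duration_s:
        end = min(start + segment_length_s, total_duration_s)
        boundaries.append((start, end))
        start = end
    return boundaries

# Example: a 23-second video split into 5-second segments.
print(split_equal_intervals(23.0))
# [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0), (15.0, 20.0), (20.0, 23.0)]
```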
The executing entity of the video generation method (such as the server 105 shown in fig. 1) may obtain the at least two video clips from various data sources, such as a local, connected database, or a third-party data platform. It should be noted that an execution subject for splitting the video to be clipped to obtain the at least two video segments may be the same as or different from an execution subject of the video generation method.
Step 202, processing the at least two video segments by using a pre-trained video processing model to obtain a processing result.
In this embodiment, the processing result may represent the probability that each of the at least two video segments belongs to the video clipping result, that is, the probability that the desired video clipping result includes the content of that video segment. In general, the greater the corresponding probability, the more likely the video clipping result is to include the content of that segment.
The input of the video processing model may be at least two video segments and the output may be a processing result representing the probability that each video segment belongs to the video clip result, respectively. The video processing model can be various types of neural network models, and the specific network structure can be flexibly set by technicians. The video processing model can be obtained by training in advance by using training samples based on methods such as back propagation, gradient descent and the like.
The training sample of the video processing model can be obtained through the following steps:
Step one, a video clipping result set corresponding to an original video is obtained.
In this step, the original video can be an arbitrary video. The video clipping result set corresponding to the original video may be obtained in various ways. For example, the original video can be edited with various existing video editing approaches according to the application requirements (such as the required duration of the video clipping results) to obtain a plurality of video clipping results. As another example, the original video may be segmented at equal intervals, and each resulting segment may be used as a video clipping result.
The execution subject for acquiring the video clip result set may be the same as or different from the execution subject for the video generation method. The execution subject who obtains the video clip result set may obtain the video clip result set corresponding to the original video locally or from other various data sources.
Step two, respectively determining the effect index value of each video clipping result in the video clipping result set.
In this step, the effect index may refer to an effect or a target desired to be achieved. The effect index of the video clipping result may refer to an effect or an optimization goal that the video clipping result is expected to achieve. The effect index can be flexibly set according to actual application requirements. For example, the effectiveness index may be click rate, broadcast completion rate, conversion rate, and the like. The effect index value is a specific numerical value of the effect index.
The effect index value of each video clipping result can be determined by various methods according to the actual application scenario. For example, the effect index value of each video clipping result may be predicted with a preset prediction method. As another example, each video clipping result may be used online (e.g., delivered online), and the effect index value of each video clipping result may then be obtained statistically, as in the sketch below.
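For the statistical case, a minimal Python sketch of how such values might be computed from aggregated delivery logs follows. The field names and the three ratios are assumptions chosen for illustration and are not part of the disclosure.

```python
def effect_index_values(stats: dict) -> dict:
    """Compute illustrative effect index values from aggregate counts.

    `stats` is assumed to hold impression, click, complete-play and conversion
    counts for one video clipping result; these field names are hypothetical.
    """
    impressions = max(stats["impressions"], 1)  # guard against division by zero
    return {
        "click_rate": stats["clicks"] / impressions,
        "completion_rate": stats["complete_plays"] / impressions,
        "conversion_rate": stats["conversions"] / impressions,
    }

print(effect_index_values(
    {"impressions": 1000, "clicks": 120, "complete_plays": 450, "conversions": 18}
))
# {'click_rate': 0.12, 'completion_rate': 0.45, 'conversion_rate': 0.018}
```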
Step three, generating training samples of the video processing model according to the effect index values.
In this step, after the effect index value of each video clip result is obtained, various ways can be flexibly adopted to generate the training sample of the video processing model according to the specific input and output forms of the video processing model.
For example, suppose the video processing model takes a plurality of video segments as input and outputs a ranking of those segments, where the ranking arranges the segments by the probability that each belongs to the video clipping result, from largest to smallest or from smallest to largest. In that case, after the effect index value of each video clipping result corresponding to the original video is obtained, the video clipping results may be ranked in descending order of effect index value to obtain a ranking result, and each video clipping result together with its ranking result may then be used as a training sample, as in the sketch below. Here the probability that a segment belongs to the video clipping result is positively correlated with its effect index value; that is, the larger the effect index value, the larger the probability that the segment belongs to the video clipping result. In the same way, a plurality of original videos can be obtained and processed by the above steps to obtain a plurality of training samples.
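Following that ranking example, the sketch below orders the video clipping results of one original video by effect index value and packages them as a training sample. The list-based data layout and the identifiers are illustrative assumptions.

```python
def build_ranking_sample(clip_results, effect_values):
    """Rank the clipping results of one original video by effect index value.

    `clip_results` is a list of clipping-result identifiers and `effect_values`
    holds the corresponding effect index values; returning a (results, ranking)
    pair is one possible training-sample layout, not the only one.
    """
    order = sorted(range(len(clip_results)),
                   key=lambda i: effect_values[i], reverse=True)
    ranking = [clip_results[i] for i in order]   # largest effect value first
    return clip_results, ranking

results, ranking = build_ranking_sample(["clip_a", "clip_b", "clip_c"],
                                        [0.02, 0.08, 0.05])
print(ranking)  # ['clip_b', 'clip_c', 'clip_a']
```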
Step 203, selecting video segments from the at least two video segments according to the processing result to generate a video clipping result.
In this embodiment, after obtaining the processing result output by the video processing model, a video segment may be selected from the at least two video segments, and a video clipping result may be generated according to the selected video segment. Specifically, the video clips can be selected in various selection modes according to the actual application scene, and the video clip results can be generated according to the video clips in various generation modes.
For example, when the video processing model takes a plurality of video segments as input and outputs a ranking of those segments, the video segment with the highest probability may be selected from the at least two video segments and used directly as the video clipping result. If the ranking is arranged by probability from largest to smallest, the first-ranked video segment is selected as the video clipping result; correspondingly, if the ranking is arranged from smallest to largest, the last-ranked video segment is selected as the video clipping result.
In this way, training samples are constructed from the feedback of online effect index values, which reflect the effect desired of the video clipping result; a video processing model is obtained from those samples; the video to be clipped is processed with the model; and a video clipping result is generated from the processing result. The video generation method provided by the disclosure thus uses effect-index feedback directly when building the video processing model and when generating the video clipping result with that model. As a result, the video clipping result better matches the desired effect, and since online effect indices reflect user interest to some degree, the generated video clipping result better matches user preferences, improving the user experience.
Referring now to FIG. 3, a flow diagram of one embodiment of generating training samples for a video processing model is shown. The method specifically comprises the following steps:
step 301, a video clip result set corresponding to an original video is obtained.
Step 302, determining the effectiveness index value of each video clip result in the video clip result set.
Step 303, selecting a video clipping result with a corresponding effect index value meeting a preset condition from the video clipping result set.
In this embodiment, the preset condition can be flexibly set by a technician according to the actual application requirement. For example, the preset condition may be that the effectiveness index value is greater than a preset effectiveness index value threshold. For another example, the preset condition may be that the effectiveness index value is maximum.
As an example, taking as the preset condition that the effect index value is maximal, the video clipping result with the largest effect index value may be selected from the video clipping result set corresponding to the original video.
Step 304, determining the time period of the selected video clipping result in the original video as a target time period.
In this step, the time period of the video-clipping result in the original video is the time period formed by the time points of the video-clipping result appearing in the original video. For example, when the video clipping result is a continuous video segment, the time period from the start time point to the end time point of the video clipping result in the original video may be regarded as the target time period.
Step 305, the original video is cut into at least two original video segments, and the label of each original video segment is determined.
In this step, the original video may be segmented in various ways to obtain at least two original video segments, i.e., a plurality of original video segments. For example, the segmentation may be performed at equal intervals. Typically, the duration of an original video segment is not greater than the duration of the video clipping results in the video clipping result set of step 301 above.
The label of each original video segment indicates whether the time period of that segment in the original video belongs to the target time period. For example, labels may be represented with Boolean values: as an example, "1" indicates that the segment's time period in the original video belongs to the target time period, and "0" indicates that it does not.
Step 306, determining at least two original video segments and a label corresponding to each original video segment as a training sample of the video processing model.
In this step, for one original video, the at least two original video segments corresponding to that video and the label of each segment may be used together as one training sample of the video processing model, as illustrated in the sketch below. Multiple training samples may thus be obtained from multiple original videos, and the video processing model can then be trained on those samples with a machine learning method.
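A minimal sketch of steps 303-306 follows, assuming that segments and the target period are represented as (start, end) tuples in seconds and that a segment is labelled 1 when its time period lies entirely within the target period; both assumptions are made only for illustration.

```python
def label_segments(segments, target_period):
    """Label each original video segment with a Boolean-style 0/1 value.

    A segment gets label 1 when its time period lies within the target period
    (the period of the best-performing clipping result); the containment test
    used here is an illustrative assumption.
    """
    t_start, t_end = target_period
    return [1 if start >= t_start and end <= t_end else 0
            for start, end in segments]

segments = [(0, 5), (5, 10), (10, 15), (15, 20)]
target_period = (5, 15)            # period of the selected clipping result
labels = label_segments(segments, target_period)
print(labels)                      # [0, 1, 1, 0]
# (segments, labels) together form one training sample of the video processing model.
```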
As an example, the video processing model may be trained as follows. First, an initial model is obtained; the initial model may include an initial video processing model and an initial discrimination model. The initial video processing model may be any of various neural network models (such as a deep learning model), whose input is a plurality of video segments and whose output is the probability that each input segment belongs to the video clipping result. The initial discrimination model may be any of various discrimination models (such as a binary classifier); its input is the per-segment probabilities output by the initial video processing model, and its output is a binary result indicating whether each segment belongs to the video clipping result, one class indicating that it does and the other that it does not, where this binary result corresponds to the segment's label. Then, based on a preset loss function (e.g., one designed around KL divergence), the initial model may be trained on the training samples using back propagation and gradient descent. The initial video processing model contained in the trained initial model may then be taken as the trained video processing model. A simplified training sketch follows.
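The sketch below shows only the back-propagation and gradient-descent mechanics with a KL-divergence-style loss; for brevity it collapses the two-part initial model described above into a single segment scorer, and the feature dimension, network width, optimizer settings, and stand-in features are all illustrative assumptions rather than the disclosed design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentScorer(nn.Module):
    """Toy stand-in for the initial model: one score per video segment."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, segment_feats):               # (num_segments, feat_dim)
        return self.mlp(segment_feats).squeeze(-1)  # (num_segments,)

model = SegmentScorer()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

segment_feats = torch.randn(4, 128)       # stand-in features for 4 segments
labels = torch.tensor([0., 1., 1., 0.])   # labels from the previous step

for _ in range(100):
    log_pred = F.log_softmax(model(segment_feats), dim=0)  # predicted distribution
    target = (labels + 1e-6) / (labels + 1e-6).sum()       # smoothed target distribution
    loss = F.kl_div(log_pred, target, reduction="sum")     # KL-divergence-style loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```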
Generally, many factors influence a video's effect index. Labelling each video segment with a Boolean value derived from the effect index values, forming training samples from those labels, and training the video processing model on them helps ensure that the model learns the characteristics that determine whether a segment belongs to a video clipping result, which in turn helps generate video clipping results with better effect.
In some optional implementations of this embodiment, the video processing model may include a first feature extraction model, a second feature extraction model, and a generative model. Wherein the first feature extraction model can be used to extract features of the video segment. The second feature extraction model may determine a time-series relationship feature between the video segments according to the features of the video segments respectively extracted by the first feature extraction model, and the generation model may generate the processing result according to the time-series relationship feature between the video segments extracted by the second feature extraction model.
In this case, after the at least two video segments obtained by segmenting the video to be clipped are acquired, the feature vectors of the video segments may be extracted with the first feature extraction model; the feature vectors of the at least two video segments may be input to the second feature extraction model to obtain, for each segment, a feature vector representing the time-series relation features between the segments; and the per-segment feature vectors output by the second feature extraction model may be input to the generation model to obtain the processing result.
The network structures of the first feature extraction model, the second feature extraction model and the generation model can be flexibly set by technicians according to actual application requirements.
As an example, fig. 4 shows a schematic diagram of a network structure of the video processing model. The first feature extraction model can be constructed based on a C3D network (Convolutional 3D Network), or on a C2D network (Convolutional 2D Network) combined with an LSTM (Long Short-Term Memory), and can extract the features of an input video segment. The second feature extraction model may be constructed based on a Transformer model. The generation model may be constructed based on an MLP (Multilayer Perceptron).
Models such as C3D can model the video sequence and have good feature-expression capability; a Transformer or similar model performs temporal modelling and learns, for each video segment, the features of its context segments; and an MLP or similar model maps those features to the probability that each segment belongs to the video clipping result, which serves as the processing result. This achieves good modelling and processing of various videos (such as long videos) and improves the accuracy of the processing result. A minimal sketch of such a structure is given below.
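The following PyTorch sketch mirrors the structure of fig. 4: a per-segment feature extractor (here a small 2D CNN plus LSTM standing in for the C2D/C3D branch), a Transformer encoder that models the time-series relations between segments, and an MLP that maps each segment to a probability. All layer sizes, frame counts, and image resolutions are assumptions; the disclosure does not fix a specific configuration.

```python
import torch
import torch.nn as nn

class VideoProcessingModel(nn.Module):
    """Illustrative first/second feature extraction models plus generation model."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(                  # frame-level features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)    # aggregate frames per segment
        self.encoder = nn.TransformerEncoder(                # relations between segments
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2)
        self.mlp = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, segments):                   # (num_segments, frames, 3, H, W)
        s, f, c, h, w = segments.shape
        frame_feats = self.cnn(segments.reshape(s * f, c, h, w)).reshape(s, f, -1)
        _, (seg_feats, _) = self.lstm(frame_feats)            # first feature vectors
        seg_feats = self.encoder(seg_feats[-1].unsqueeze(0))  # second feature vectors
        return self.mlp(seg_feats).squeeze(0).squeeze(-1)     # probability per segment

model = VideoProcessingModel()
probs = model(torch.randn(4, 8, 3, 64, 64))        # 4 segments of 8 frames each
print(probs.shape)                                  # torch.Size([4])
```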
In some optional implementations of this embodiment, video segments may be selected in descending order of their corresponding probabilities and combined to obtain the video clipping result.
The number of selected video segments can be set flexibly according to the actual application requirements. For example, the duration of the desired video clipping result and the duration of each video segment may be taken into account so that the total duration of the selected segments does not exceed the duration of the desired video clipping result. After the segments are selected, they can be combined in various ways to obtain the video clipping result: for example, in random order, or in the order of the time periods the selected segments occupy in the video to be clipped. A simple selection sketch is shown below.
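The sketch below selects segments greedily by descending probability under a total-duration budget and then restores their original temporal order. The greedy rule, the duration budget, and the (start, end) representation are illustrative assumptions.

```python
def select_top_segments(segments, probs, max_duration_s):
    """Pick segments by descending probability under a total-duration budget,
    then combine them in their original temporal order.

    `segments` are (start, end) tuples in seconds; the greedy selection rule
    and the duration budget are illustrative assumptions.
    """
    order = sorted(range(len(segments)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        length = segments[i][1] - segments[i][0]
        if total + length <= max_duration_s:
            chosen.append(i)
            total += length
    chosen.sort()                 # restore the order in the video to be clipped
    return [segments[i] for i in chosen]

print(select_top_segments([(0, 5), (5, 10), (10, 15), (15, 20)],
                          [0.2, 0.9, 0.7, 0.4], max_duration_s=10))
# [(5, 10), (10, 15)]
```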
Therefore, the video clips forming the video clipping result can be ensured to correspond to higher probability, and the method can be applied to application scenes such as video abstraction and highlight video generation.
In some optional implementations of this embodiment, a sliding window with a preset duration may be used, and a group of video segments located in the sliding window may be selected from the original video and combined to obtain a video clipping result.
The duration (or size) of the sliding window can be set flexibly. The sum of the probabilities of the segments in the selected group can be required to satisfy a preset condition, which may be set in advance by a technician: for example, that the sum is greater than a preset threshold, or that the sum is maximal, as in the sketch below.
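A minimal sketch of the maximal-sum case follows. Measuring the window as a fixed number of consecutive segments, rather than as a duration in seconds, is a simplification made for illustration.

```python
def best_sliding_window(segments, probs, window_len: int):
    """Pick the group of consecutive segments, `window_len` segments long,
    whose probabilities sum to the maximum.

    Counting the window in segments rather than seconds is an illustrative
    simplification of the preset-duration sliding window.
    """
    best_start, best_sum = 0, float("-inf")
    for start in range(len(segments) - window_len + 1):
        s = sum(probs[start:start + window_len])
        if s > best_sum:
            best_start, best_sum = start, s
    return segments[best_start:best_start + window_len]

print(best_sliding_window([(0, 5), (5, 10), (10, 15), (15, 20)],
                          [0.2, 0.9, 0.7, 0.4], window_len=2))
# [(5, 10), (10, 15)]
```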
This not only ensures that the segments forming the video clipping result have high probabilities, but also ensures the continuity of their content, which helps improve the fluency of the video clipping result and suits application scenarios such as generating introduction videos (e.g., product introductions and promotional videos).
In addition, according to actual application requirements, when the selected video segments are combined, contents such as special effects, various other materials, background music and the like can be added to enrich the contents and presentation modes of the generated video clip results.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a video generating apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the video generating apparatus 500 provided by the present embodiment includes an acquiring unit 501, a processing unit 502, and a generating unit 503. The acquiring unit 501 is configured to acquire at least two video segments obtained by splitting a video to be clipped; the processing unit 502 is configured to process at least two video segments by using a pre-trained video processing model to obtain a processing result, wherein the processing result represents a probability that each video segment belongs to the video clipping result, and a training sample of the video processing model is obtained by: acquiring a video clip result set corresponding to an original video, respectively determining an effect index value of each video clip result of the video clip result set, and generating a training sample of a video processing model according to the effect index value; the generating unit 503 is configured to select a video segment from the at least two video segments to generate a video clip result according to the processing result.
In the present embodiment, in the video generation apparatus 500: for specific processing of the obtaining unit 501, the processing unit 502, and the generating unit 503 and technical effects brought by the specific processing, reference may be made to relevant descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, and details are not repeated here.
In some optional implementations of this embodiment, the foregoing steps further include: selecting a video clipping result of which the corresponding effect index value meets the preset condition from the video clipping result set; determining a time period of the selected video clip result in the original video as a target time period; the method comprises the steps of cutting an original video into at least two original video segments, and determining the label of each original video segment, wherein the label of each original video segment indicates whether the time period of the original video segment in the original video belongs to a target time period or not; and determining at least two original video clips and labels corresponding to each original video clip as training samples of the video processing model.
In some optional implementations of this embodiment, the video processing model includes a first feature extraction model, a second feature extraction model, and a generation model; and the processing unit 502 is further configured to: respectively extracting the characteristics of each video clip in at least two video clips by utilizing a first characteristic extraction model to obtain first characteristic vectors respectively corresponding to the video clips; extracting time sequence relation features among the video clips by using a second feature extraction model to obtain second feature vectors corresponding to the video clips respectively; and generating a processing result according to the second feature vectors respectively corresponding to the video segments by using the generating model.
In some optional implementations of this embodiment, the generating unit 503 is further configured to: and selecting video segments to combine according to the sequence of the corresponding probabilities from large to small to obtain a video clipping result.
In some optional implementations of this embodiment, the generating unit 503 is further configured to: and selecting a group of video clips positioned in the sliding window from the original video by using the sliding window with preset duration to combine to obtain a video clipping result, wherein the sum of the probabilities corresponding to all the video clips in the selected group of video clips meets a preset condition.
According to the device provided by the embodiments of the disclosure, the acquiring unit acquires at least two video segments obtained by segmenting the video to be clipped; the processing unit processes the at least two video segments with a pre-trained video processing model to obtain a processing result, where the processing result represents the probability that each video segment belongs to a video clipping result, and the training samples of the video processing model are obtained by: acquiring a video clipping result set corresponding to an original video, determining an effect index value for each video clipping result in the set, and generating training samples of the video processing model according to the effect index values; and the generating unit selects video segments from the at least two video segments according to the processing result to generate the video clipping result. Because the training samples are generated from effect indices, the video clipping result generated from the processing result of the trained video processing model can, to a certain extent, be expected to meet the desired effect, which improves the quality of the video clipping result.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Referring now to FIG. 6, a schematic diagram of an electronic device (e.g., the server of FIG. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, or the like; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or installed from the storage means 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring at least two video segments obtained by splitting a video to be clipped; processing at least two video segments by using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to the video clipping result, and the training sample of the video processing model is obtained by the following steps: acquiring a video clip result set corresponding to an original video, respectively determining an effect index value of each video clip result of the video clip result set, and generating a training sample of a video processing model according to the effect index value; and selecting the video segments from the at least two video segments according to the processing result to generate a video clipping result.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a processing unit, and a generation unit. The names of these units do not form a limitation to the unit itself in some cases, for example, the acquiring unit may also be described as "a unit that acquires at least two video segments obtained by splitting a video to be clipped".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other technical solutions formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept described above, for example, technical solutions formed by interchanging the above features with (but not limited to) technical features of similar functions disclosed in the embodiments of the present disclosure.

Claims (13)

1. A video generation method, comprising:
acquiring at least two video segments obtained by segmenting a video to be edited;
processing the at least two video segments by using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to the video clip result, and a training sample of the video processing model is obtained by the following steps: acquiring a video clip result set corresponding to an original video, respectively determining an effect index value of each video clip result of the video clip result set, and generating a training sample of the video processing model according to the effect index value;
and selecting a video segment from the at least two video segments to generate a video clip result according to the processing result.
2. The method of claim 1, wherein said generating a training sample of the video processing model according to the effect index value comprises:
selecting a video clipping result of which the corresponding effect index value meets a preset condition from the video clipping result set;
determining a time period of the selected video clip result in the original video as a target time period;
the original video is cut into at least two original video segments, and the label of each original video segment is determined, wherein the label of each original video segment represents whether the time period of the original video segment in the original video belongs to the target time period or not;
and determining the at least two original video segments and the corresponding label of each original video segment as a training sample of the video processing model.
3. The method of claim 2, wherein the video processing model comprises a first feature extraction model, a second feature extraction model, and a generation model; and
the processing the at least two video segments by using the pre-trained video processing model to obtain a processing result includes:
respectively extracting the characteristics of each video clip in the at least two video clips by using the first characteristic extraction model to obtain first characteristic vectors respectively corresponding to the video clips;
extracting time sequence relation features among the video clips by using the second feature extraction model to obtain second feature vectors corresponding to the video clips respectively;
and generating a processing result according to the second feature vectors respectively corresponding to the video segments by using the generating model.
4. The method according to one of claims 1 to 3, wherein said selecting a video segment from the at least two video segments according to the processing result to generate a video clipping result comprises:
and selecting video segments to combine according to the sequence of the corresponding probabilities from large to small to obtain a video clipping result.
5. The method according to one of claims 1 to 3, wherein said selecting a video segment from the at least two video segments according to the processing result to generate a video clipping result comprises:
and selecting a group of video clips positioned in the sliding window from the original video by using the sliding window with preset duration to combine to obtain a video clipping result, wherein the sum of the probabilities corresponding to all the video clips in the selected group of video clips meets a preset condition.
6. A video generation apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire at least two video segments obtained by segmenting a video to be edited;
a processing unit configured to process the at least two video segments by using a pre-trained video processing model to obtain a processing result, wherein the processing result represents a probability that each video segment belongs to a video clipping result, and a training sample of the video processing model is obtained by: acquiring a video clip result set corresponding to an original video, respectively determining an effect index value of each video clip result of the video clip result set, and generating a training sample of the video processing model according to the effect index value;
and the generating unit is configured to select a video segment from the at least two video segments according to the processing result to generate a video clipping result.
7. The apparatus of claim 6, wherein the steps further comprise:
selecting a video clipping result of which the corresponding effect index value meets a preset condition from the video clipping result set;
determining a time period of the selected video clip result in the original video as a target time period;
the original video is cut into at least two original video segments, and the label of each original video segment is determined, wherein the label of each original video segment represents whether the time period of the original video segment in the original video belongs to the target time period or not;
and determining the at least two original video clips and the corresponding label of each original video clip as a training sample of the video processing model.
8. The apparatus of claim 7, wherein the video processing model comprises a first feature extraction model, a second feature extraction model, and a generation model; and
the processing unit is further configured to: respectively extracting the characteristics of each video clip in the at least two video clips by using the first characteristic extraction model to obtain first characteristic vectors respectively corresponding to the video clips;
extracting time sequence relation features among the video clips by using the second feature extraction model to obtain second feature vectors corresponding to the video clips respectively;
and generating a processing result according to the second feature vectors respectively corresponding to the video segments by using the generating model.
9. The apparatus according to one of claims 6-8, wherein the generating unit is further configured to: and selecting video segments to combine according to the sequence of the corresponding probabilities from large to small to obtain a video clipping result.
10. The apparatus according to one of claims 6-8, wherein the generating unit is further configured to: and selecting a group of video clips positioned in the sliding window from the original video by using the sliding window with preset duration to combine to obtain a video clipping result, wherein the sum of the probabilities corresponding to all the video clips in the selected group of video clips meets a preset condition.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-5.
CN202211389074.XA 2022-11-08 2022-11-08 Video generation method and device Pending CN115801980A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211389074.XA CN115801980A (en) 2022-11-08 2022-11-08 Video generation method and device
PCT/CN2023/128301 WO2024099171A1 (en) 2022-11-08 2023-10-31 Video generation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211389074.XA CN115801980A (en) 2022-11-08 2022-11-08 Video generation method and device

Publications (1)

Publication Number Publication Date
CN115801980A true CN115801980A (en) 2023-03-14

Family

ID=85436006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211389074.XA Pending CN115801980A (en) 2022-11-08 2022-11-08 Video generation method and device

Country Status (2)

Country Link
CN (1) CN115801980A (en)
WO (1) WO2024099171A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116132752A (en) * 2023-04-13 2023-05-16 北京百度网讯科技有限公司 Video comparison group construction, model training and video scoring methods, devices and equipment
WO2024099171A1 (en) * 2022-11-08 2024-05-16 北京沃东天骏信息技术有限公司 Video generation method and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109982109A (en) * 2019-04-03 2019-07-05 睿魔智能科技(深圳)有限公司 The generation method and device of short-sighted frequency, server and storage medium
CN110401873A (en) * 2019-06-17 2019-11-01 北京奇艺世纪科技有限公司 Video clipping method, device, electronic equipment and computer-readable medium
KR20210020618A (en) * 2019-08-16 2021-02-24 서울여자대학교 산학협력단 Abnominal organ automatic segmentation based on deep learning in a medical image
CN112770061A (en) * 2020-12-16 2021-05-07 影石创新科技股份有限公司 Video editing method, system, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107566907B (en) * 2017-09-20 2019-08-30 Oppo广东移动通信有限公司 Video clipping method, device, storage medium and terminal
CN110505519B (en) * 2019-08-14 2021-12-03 咪咕文化科技有限公司 Video editing method, electronic equipment and storage medium
CN112532897B (en) * 2020-11-25 2022-07-01 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and computer readable storage medium
CN115801980A (en) * 2022-11-08 2023-03-14 北京沃东天骏信息技术有限公司 Video generation method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109982109A (en) * 2019-04-03 2019-07-05 睿魔智能科技(深圳)有限公司 The generation method and device of short-sighted frequency, server and storage medium
CN110401873A (en) * 2019-06-17 2019-11-01 北京奇艺世纪科技有限公司 Video clipping method, device, electronic equipment and computer-readable medium
KR20210020618A (en) * 2019-08-16 2021-02-24 서울여자대학교 산학협력단 Abnominal organ automatic segmentation based on deep learning in a medical image
CN112770061A (en) * 2020-12-16 2021-05-07 影石创新科技股份有限公司 Video editing method, system, electronic device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜萌 (DU, Meng): "Automatic Video Editing Based on Deep Learning" (基于深度学习的视频自动剪辑), China Excellent Master's Theses, 15 January 2021 (2021-01-15) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024099171A1 (en) * 2022-11-08 2024-05-16 北京沃东天骏信息技术有限公司 Video generation method and apparatus
CN116132752A (en) * 2023-04-13 2023-05-16 北京百度网讯科技有限公司 Video comparison group construction, model training and video scoring methods, devices and equipment
CN116132752B (en) * 2023-04-13 2023-12-08 北京百度网讯科技有限公司 Video comparison group construction, model training and video scoring methods, devices and equipment

Also Published As

Publication number Publication date
WO2024099171A1 (en) 2024-05-16

Similar Documents

Publication Publication Date Title
CN109460513B (en) Method and apparatus for generating click rate prediction model
CN108416310B (en) Method and apparatus for generating information
CN108960316B (en) Method and apparatus for generating a model
CN107943877B (en) Method and device for generating multimedia content to be played
US11758088B2 (en) Method and apparatus for aligning paragraph and video
CN111104482A (en) Data processing method and device
CN111259663B (en) Information processing method and device
CN109829164B (en) Method and device for generating text
CN115801980A (en) Video generation method and device
CN109862100B (en) Method and device for pushing information
CN108121814B (en) Search result ranking model generation method and device
CN110059172B (en) Method and device for recommending answers based on natural language understanding
CN111340220A (en) Method and apparatus for training a predictive model
CN109190123B (en) Method and apparatus for outputting information
CN111915086A (en) Abnormal user prediction method and equipment
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
CN108038172B (en) Search method and device based on artificial intelligence
CN111897950A (en) Method and apparatus for generating information
US20230367972A1 (en) Method and apparatus for processing model data, electronic device, and computer readable medium
CN112182281B (en) Audio recommendation method, device and storage medium
CN109829117B (en) Method and device for pushing information
CN111125502B (en) Method and device for generating information
CN108664610B (en) Method and apparatus for processing data
CN111026849A (en) Data processing method and device
CN109857838B (en) Method and apparatus for generating information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination