WO2024012289A1 - Video generation method and apparatus, electronic device and medium - Google Patents

Video generation method and apparatus, electronic device and medium

Info

Publication number
WO2024012289A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature information
image
video
image feature
classification
Application number
PCT/CN2023/105161
Other languages
French (fr)
Chinese (zh)
Inventor
李宇
Original Assignee
维沃移动通信有限公司
Application filed by 维沃移动通信有限公司 (Vivo Mobile Communication Co., Ltd.)
Publication of WO2024012289A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

A video generation method and apparatus, an electronic device and a medium, relating to the technical field of artificial intelligence. The video generation method comprises: obtaining a first image set, inputting the first image set into a multi-classification model for classification, and outputting M classification results corresponding to the first image set; determining a target video template from at least one video template corresponding to the M classification results; and generating a target video on the basis of the first image set and the target video template, wherein M is an integer greater than 1.

Description

Video generation method, apparatus, electronic device and medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202210834501.4, filed in China on July 14, 2022, the entire content of which is incorporated herein by reference.
Technical field
This application belongs to the field of artificial intelligence technology, and specifically relates to a video generation method, apparatus, electronic device and medium.
Background
With the widespread availability of network bandwidth, Internet video has been growing rapidly, and the most important factor in whether an Internet video is popular with users is the quality of the video itself.
In the related art, the video classification network used to produce videos typically splits a video into multiple single-frame images, classifies each frame separately, tallies the classification result of each frame, and then generates the video the user needs based on the final statistics.
However, because the video classification network used in the above scheme must classify each frame in turn, the network's latency is high and classification takes a long time, which in turn makes video generation inefficient.
Summary of the invention
The purpose of the embodiments of this application is to provide a video generation method, apparatus, electronic device and medium that can solve the problem of low video generation efficiency.
In a first aspect, embodiments of this application provide a video generation method. The method includes: acquiring a first image set, inputting the first image set into a multi-classification model for classification, and outputting M classification results corresponding to the first image set; determining a target video template from at least one video template corresponding to the M classification results; and generating a target video based on the first image set and the target video template, where M is an integer greater than 1.
In a second aspect, embodiments of this application provide a video generation apparatus. The apparatus includes an acquisition unit, a classification unit, a determination unit and a generation unit. The acquisition unit is used to acquire a first image set; the classification unit is used to input the first image set acquired by the acquisition unit into a multi-classification model for classification and to output M classification results corresponding to the first image set; the determination unit is used to determine a target video template from at least one video template corresponding to the M classification results obtained by the classification unit; and the generation unit is used to generate a target video based on the first image set acquired by the acquisition unit and the target video template determined by the determination unit, where M is an integer greater than 1.
In a third aspect, embodiments of this application provide an electronic device. The electronic device includes a processor and a memory. The memory stores programs or instructions that can be run on the processor, and when executed by the processor, the programs or instructions implement the steps of the method described in the first aspect.
In a fourth aspect, embodiments of this application provide a readable storage medium. Programs or instructions are stored on the readable storage medium, and when executed by a processor, the programs or instructions implement the steps of the method described in the first aspect.
In a fifth aspect, embodiments of this application provide a chip. The chip includes a processor and a communication interface. The communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the method described in the first aspect.
In a sixth aspect, embodiments of this application provide a computer program product. The program product is stored in a storage medium and is executed by at least one processor to implement the method described in the first aspect.
In the embodiments of this application, when making a video, the electronic device can first acquire a first image set and input it into a multi-classification model for classification, outputting M classification results corresponding to the first image set; then determine a target video template from at least one video template corresponding to the M classification results; and finally generate a target video based on the first image set and the target video template, where M is an integer greater than 1. Because this application classifies the entire first image set as a whole, the multi-classification model needs only a single forward pass to obtain the M classification results for the whole set. This improves the classification capability of the multi-classification model and thus the overall efficiency of video generation.
Description of drawings
Figure 1 is the first schematic flowchart of a video generation method provided by an embodiment of this application;
Figure 2 is the first processing flowchart of a multi-classification model provided by an embodiment of this application;
Figure 3 is the second processing flowchart of a multi-classification model provided by an embodiment of this application;
Figure 4 is a schematic diagram of a token downsampling module provided by an embodiment of this application;
Figure 5 is the second schematic flowchart of a video generation method provided by an embodiment of this application;
Figure 6 is the third schematic flowchart of a video generation method provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of a video generation apparatus provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of this application;
Figure 9 is a hardware schematic diagram of an electronic device provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art fall within the scope of protection of this application.
The terms "first", "second" and the like in the description and claims of this application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate, so that the embodiments of this application can be practiced in orders other than those illustrated or described here. Objects distinguished by "first", "second" and the like are usually of one type, and their number is not limited; for example, there may be one first object or multiple first objects. In addition, "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The video generation method, apparatus, electronic device and medium provided by the embodiments of this application are described in detail below through specific embodiments and application scenarios, with reference to the accompanying drawings.
In the related art, when a user needs to make a video, the mobile video classification network used classifies every frame of the video to obtain per-frame classification results and then derives an overall video classification result from them. This scheme requires multiple forward passes, so it takes a long time to run once deployed on an embedded platform; in addition, when the scheme runs on a mobile device, frame-by-frame processing puts heavy load on the device, which ultimately lowers the quality of the generated video.
In the video generation method, apparatus, electronic device and medium provided by the embodiments of this application, a brand-new video classification model is provided. When a user needs to make a video, the images or video frames input by the user can be fed into the video classification model as a whole, in a single pass, so the model needs only one forward pass. This reduces the model's latency and improves its classification efficiency, improving the classification capability of the model while reducing the computational cost. The classification results produced by the model can then be combined with a recommendation algorithm and video templates to generate a video for the user with one click, thereby improving video generation efficiency.
The execution subject of the video generation method provided in this embodiment may be a video generation apparatus, which may be an electronic device or a control or processing module in the electronic device. The technical solutions provided by the embodiments of this application are described below taking an electronic device as an example.
An embodiment of this application provides a video generation method. As shown in Figure 1, the video generation method may include the following steps 201 to 204:
Step 201: The electronic device acquires a first image set.
In this embodiment of the application, the first image set includes N frames of images, where N is an integer greater than 1.
In a possible embodiment, the N frames of images in the first image set may be N still images.
For example, when the electronic device obtains at least one image input by the user or pre-stored on the electronic device, it can pad a predetermined number of black frames before and after each of those images to form the first image set.
For example, "the electronic device acquires a first image set" in step 201 may include step 201a:
Step 201a: The electronic device obtains N images input by the user to obtain the first image set.
For example, the N images may include images pre-stored on the electronic device and/or images input by the user.
In another possible embodiment, the N frames of images in the first image set may be N video frames of a first video.
It should be noted that the video classification models in the related art are weak at modeling temporal order. For videos with a strong temporal structure, frame-by-frame processing cannot take the temporal order between frames into account, which reduces classification accuracy and fails to meet user needs.
To address this, after acquiring the N video frames of the first video, the electronic device can sort the N video frames according to the time order of each frame, thereby generating the first image set.
In this way, when the user needs to make a video, the N video frames are sorted in time order and then fed as a whole, in a single pass, into the multi-classification model provided by this application, which improves the classification accuracy of the multi-classification model and thus the quality of the final generated video.
For example, "the electronic device acquires a first image set" in step 201 may include step 201b:
Step 201b: The electronic device extracts N video frames from the first video to obtain the first image set.
For example, the N video frames may be key frames of the first video. A key frame is a video frame that carries key information in the first video, for example the frame that captures the key action in an object's movement or change, or another video frame that plays a decisive role.
For example, when extracting N video frames from the first video, the electronic device can sample them uniformly over the duration of the first video, ensuring that the extracted N video frames reflect the variety of video features in the first video.
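As a concrete illustration, the following is a minimal sketch of uniform frame sampling under the assumption that OpenCV is available; the function name and parameters are illustrative, not taken from the patent:

```python
import cv2
import numpy as np

def sample_frames_uniformly(video_path: str, n_frames: int = 16):
    """Evenly sample n_frames frames across the full duration of a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over the whole video, preserving time order.
    indices = np.linspace(0, total - 1, n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```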
Step 202: The electronic device inputs the first image set into a multi-classification model for classification, and outputs M classification results corresponding to the first image set.
In this embodiment of the application, the multi-classification model may be a multiclass video-classification model (MVM), that is, a classification model that can jointly analyze multiple frames of images.
In this embodiment of the application, the M classification results may include the classification categories corresponding to the first image set and the names of those categories. For example, the categories may be action, scene, object, emotion, and so on.
Step 203: The electronic device determines a target video template from at least one video template corresponding to the M classification results.
Here, M is an integer greater than 1.
In this embodiment of the application, the at least one video template is one or more video templates in the electronic device's video template library. The library pre-stores multiple video templates, each corresponding to at least one template category. A video template is a pre-edited, reusable video in a fixed format; it generally may include the video layout, color scheme, background, soundtrack, fonts, and so on.
In this embodiment of the application, one classification result can correspond to one or more video templates, and different classification results can correspond to the same video template or to different video templates.
Step 204: The electronic device generates a target video based on the first image set and the target video template.
In this embodiment of the application, after determining the target video template, the electronic device can fuse the N frames of images in the first image set with the target video template to generate the target video.
Optionally, in this embodiment of the application, when the N frames of images in the first image set are N video frames of the first video, "the electronic device generates a target video based on the first image set and the target video template" in step 204 may include step 204b:
Step 204b: The electronic device fuses the first video with the target video template to generate the target video.
In this embodiment of the application, when fusing the first video with the target video template, the electronic device can align the start of the first video's timeline with the start of the template's timeline and then perform the fusion to generate the target video.
It should be noted that when the template's timeline is shorter than the first video's, the template can be reused after its timeline reaches its end point, repeating until the whole first video has been fused.
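A minimal sketch of this timeline alignment, assuming only that the template is repeated whole; the durations are made-up example values:

```python
import math

video_duration = 73.0      # seconds of the first video (illustrative)
template_duration = 20.0   # seconds of the target video template (illustrative)

# Align both timelines at their start; if the template is shorter, reuse it
# end-to-end until the whole video is covered.
repeats = math.ceil(video_duration / template_duration)  # 4 passes here
```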
In the video generation method provided by the embodiments of this application, when making a video, the electronic device can first acquire a first image set and input it into a multi-classification model for classification, outputting M classification results corresponding to the first image set; then determine a target video template from at least one video template corresponding to the M classification results; and finally generate a target video based on the first image set and the target video template, where M is an integer greater than 1. Because the first image set is classified as a whole, the multi-classification model needs only a single forward pass to obtain the M classification results for the entire set, which improves the model's classification capability and thus the overall efficiency of video generation.
Optionally, in this embodiment of the application, "the electronic device inputs the first image set into a multi-classification model for classification, and outputs M classification results corresponding to the first image set" in step 202 may include the following steps A1 to A4:
Step A1: After inputting the first image set into the multi-classification model, the electronic device converts the N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model.
Here, X is an integer greater than 1.
In this embodiment of the application, the first image feature information of an image block may include a first image feature vector of the image block, for example the token corresponding to the image block.
In this embodiment of the application, the electronic device can input the first image set into an image feature information conversion module (e.g., a tokenization module) and output the tokens of the X image blocks corresponding to the N frames of images.
Further optionally, in this embodiment of the application, "converting the N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model" in step A1 may include the following steps A11 and A12:
Step A11: Based on the image feature information conversion module in the multi-classification model, split the N frames of images in the first image set to obtain X image blocks.
In this embodiment of the application, any frame of image in the first image set may correspond to multiple image blocks.
Step A12: Extract feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.
In this embodiment of the application, the electronic device can first cut each frame of the first image set into individual image blocks and then extract each block's image features separately through a convolutional neural network (CNN), thereby obtaining the first image feature information of each image block.
Step A2: Determine first key image feature information from the first image feature information of the X image blocks.
In this embodiment of the application, the first key image feature information may be first image feature information whose pixel features meet a predetermined condition, or first image feature information whose spatial features meet a predetermined condition.
Further optionally, in this embodiment of the application, "determining first key image feature information from the first image feature information of the X image blocks" in step A2 may include the following steps A21 and A22:
Step A21: Based on the image feature information selection module in the multi-classification model, the electronic device selects second key image feature information from the first image feature information of the X image blocks, and transforms the arrangement of the first image feature information of the X image blocks to obtain second image feature information.
In this embodiment of the application, transforming the arrangement means adjusting the positions in which the first image feature information of the X image blocks is arranged.
It should be noted that transforming the arrangement does not change the actual content of the first image feature information of the X image blocks.
Step A22: Fuse the second key image feature information with the second image feature information to obtain the first key image feature information.
In this embodiment of the application, the image feature information selection module may be a token selection module (e.g., a TokenSelect module).
In this embodiment of the application, the electronic device can use the token selection module to select the few most important pieces of key image feature information from the image feature information of the X image blocks, reducing the amount of image feature information and thus the computation of the multi-classification model.
In this embodiment of the application, the electronic device can select, from the image feature information of the X image blocks, the image feature information of the image blocks that contain key information.
Step A3: Extract the high-level semantic information corresponding to at least one piece of key image feature information.
In this embodiment of the application, high-level semantic information refers to abstract feature information in an image, for example the expressions or ages of the people in the image.
Further optionally, in this embodiment of the application, "extracting the high-level semantic information corresponding to at least one piece of key image feature information" in step A3 may include the following steps A31 to A34:
Step A31: Based on the basic feature module in the multi-classification model, the electronic device normalizes the first key image feature information to obtain third key image feature information.
Step A32: Extract the basic image feature information from the third key image feature information.
Step A33: Fuse the first key image feature information with the basic image feature information to obtain target key image feature information.
Step A34: Extract the high-level semantic information corresponding to the target key image feature information.
In this embodiment of the application, the basic feature module is used to perform feature extraction on the first key image feature information determined by the token selection module, to obtain the high-level semantic information corresponding to that first key image feature information.
Step A4: Obtain the M classification results corresponding to the first image set based on the high-level semantic information corresponding to the first key image feature information.
In this embodiment of the application, the electronic device can feed the obtained high-level semantic information into the fully connected layer of the multi-classification model to obtain the M classification results corresponding to the first image set.
In this embodiment of the application, the fully connected layer is used to convert the input high-level semantic information into multiple classification results for output.
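A hypothetical sketch of such a head, assuming the token-level semantics are mean-pooled into one vector per video before the fully connected layer; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

embedding, num_classes = 512, 400          # assumed sizes
head = nn.Linear(embedding, num_classes)   # the fully connected layer

tokens = torch.randn(1, 128, embedding)    # [bs, selected tokens, embedding]
video_feature = tokens.mean(dim=1)         # aggregate tokens into one vector
scores = head(video_feature)               # [bs, num_classes] class scores
```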
Example 1:
As an example, take a first image set consisting of 16 video frames to illustrate the classification process of the multi-classification model.
For example, taking the MVM model as the multi-classification model, its classification process is as follows. First, 16 frames (this parameter is adjustable) are sampled uniformly in time from the input video and arranged, in their order in the original video, into a multi-dimensional matrix (e.g., [bs*16, 3, 224, 224]), denoted input (assuming a single input video, the batch size bs is 1). The input is then converted into tokens through CNN convolution (tokenization means dividing the picture or video into image blocks; each block's information is extracted separately by the CNN into a 1*1*embedding feature vector, where embedding is the dimension of the feature vector after tokenization). Next, the tokens pass through the TokenSelect module, which selects the most important tokens and thereby reduces the model's computation. The basic feature module then extracts the tokens' high-level semantic information, and finally a fully connected layer produces the classification results for the video's multiple labels.
Specifically, the classification process of the MVM model includes the following steps S1 and S2:
Step S1 (processing of the tokenization module): First, the 16-frame image set is arranged into multi-dimensional matrix 1, e.g., [bs*16, 3, 224, 224], denoted input. Then, through several CNN convolution operations, matrix 1 is transformed into multi-dimensional matrix 2, e.g., [bs*16, embedding, 224/16, 224/16], where embedding is the dimension parameter of the token feature vector and can be 512, 768, 1024, etc. The 224/16 arises because a convolution with a 16*16 kernel (meaning each frame is divided into 16*16-pixel image blocks from which tokens are extracted) and a stride of 16 reduces the input height and width (224*224) to [224/16, 224/16] = [14, 14]. For example, an input of size [3, 224, 224], after a CNN convolution with stride 16 and a 16*16 kernel, becomes [512, 224/16, 224/16], taking embedding = 512 as an example.
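A minimal sketch of this tokenization step, matching the shapes in the text; collapsing the "several" convolutions into a single 16*16, stride-16 convolution is a simplification:

```python
import torch
import torch.nn as nn

bs, n_frames, embedding = 1, 16, 512
# Each 16*16 pixel block becomes one token of dimension `embedding`.
tokenize = nn.Conv2d(3, embedding, kernel_size=16, stride=16)

frames = torch.randn(bs * n_frames, 3, 224, 224)  # multi-dimensional matrix 1
tokens = tokenize(frames)                         # [16, 512, 14, 14], matrix 2
```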
Step S2 (processing of the token selection module): After the CNN convolution operations produce multi-dimensional matrix 2 (e.g., [bs*16, 512, 224/16, 224/16]), there are bs*16*14*14 tokens. As shown in Figure 2, matrix 2 is transformed along two paths:
On one path, matrix 2 first passes through a 2D convolution conv1 (3*3 kernel, 512 output channels; the channel count is adjustable) and an activation function (ReLU), and then through another 2D convolution conv2 (3*3 kernel, 128 output channels, where 128 is the number of tokens the token selection module ultimately needs to select, i.e., the at least one piece of key image feature information above). This selects multi-dimensional matrix 3, e.g., [bs*128, 14*14], from matrix 2. Finally, an activation function (sigmoid) adjusts the confidence of the tokens to be selected, and an unsqueeze operation extends the output dimension, expanding matrix 3 into multi-dimensional matrix 4, e.g., [bs*128, 14*14, 1].
On the other path, a reshape operation first turns matrix 2 [bs*512, 14, 14] into multi-dimensional matrix 5, e.g., [bs*1, 512, 14*14], and a transpose operation then converts matrix 5 into multi-dimensional matrix 6, e.g., [bs*1, 14*14, 512], changing the shape of the matrix.
Finally, the results of the two paths are multiplied elementwise, giving an output of [bs*128, 14*14, 512]. Averaging this output over its second-to-last dimension (the 14*14 dimension) yields the final output of the token selection module, [bs*128, 512], where 128 is the number of tokens to select and 512 is the token feature vector dimension.
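A runnable sketch of this two-path flow; for simplicity it treats one 14*14 token grid per forward call, so the handling of the frame dimension is an assumption:

```python
import torch
import torch.nn as nn

class TokenSelect(nn.Module):
    def __init__(self, embedding: int = 512, n_select: int = 128):
        super().__init__()
        self.conv1 = nn.Conv2d(embedding, embedding, 3, padding=1)
        self.conv2 = nn.Conv2d(embedding, n_select, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, 512, 14, 14]
        b, c, h, w = x.shape
        # Path 1: conv1 + ReLU, then conv2, then sigmoid -> a per-position
        # confidence map for each of the n_select tokens, expanded by unsqueeze.
        att = torch.sigmoid(self.conv2(torch.relu(self.conv1(x))))
        att = att.reshape(b, -1, h * w).unsqueeze(-1)        # [B, 128, 196, 1]
        # Path 2: reshape + transpose lays the tokens out as [B, 1, 196, 512].
        feat = x.reshape(b, c, h * w).transpose(1, 2).unsqueeze(1)
        # Elementwise product, then average over the 14*14 positions.
        return (att * feat).mean(dim=2)                      # [B, 128, 512]

selected = TokenSelect()(torch.randn(1, 512, 14, 14))        # -> [1, 128, 512]
```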
In this way, when the electronic device classifies the entire first image set as a whole through the multi-classification model, each frame of image is converted into the image feature information of multiple image blocks; some important key image feature information is selected from that information; high-level semantic information is extracted from the important key image feature information; and finally the M classification results of the first image set are obtained from the high-level semantic information. This reduces the computation of the multi-classification model and further improves classification efficiency.
Example 2:
Regarding the basic feature module of the multi-classification model in step A31 above:
For example, as shown in Figure 3, the main components of the basic feature module are: a token normalization layer, a token pooling layer, a random token-drop layer, a token residual-connection layer, and a token downsampling module.
Regarding the token normalization layer in the basic feature module:
For example, the token normalization layer is used to limit the range of the tokens, e.g., to (0, 1).
For example, the token normalization layer uses a normalization module (torch.nn.LayerNorm) to apply layer normalization to the input tokens. The main purpose of this layer normalization is to normalize each token, computed as follows:
$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$
Here, the expectation E[x] is the mean of the input x, the variance Var[x] is the variance of the input x, ε = 1e-6 prevents the denominator from being 0, and the other parameters (γ and β) are learnable offsets.
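For reference, a one-line sketch of this normalization in PyTorch; only the 1e-6 epsilon is taken from the formula above (PyTorch's default is 1e-5):

```python
import torch

ln = torch.nn.LayerNorm(512, eps=1e-6)  # normalizes each token over its 512 dims
tokens = torch.randn(128, 512)          # 128 selected tokens, embedding 512
normed = ln(tokens)
```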
Regarding the token pooling layer in the basic feature module:
For example, the token pooling layer is used to learn the relationships between different tokens.
For example, continuing Example 1, the token pooling layer mainly pools the 128 tokens with a 3*1 pooling kernel. For an input of [128, 512], the pooling kernel moves 3 pixels per row and 1 pixel per column to generate the new pooling result. This pooling layer mainly fuses information between different tokens.
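A sketch of this pooling; average pooling and the shape-preserving padding are assumptions, since the text only specifies the 3*1 kernel:

```python
import torch
import torch.nn as nn

# Each position is averaged with its neighbours along the token axis,
# fusing information across adjacent tokens.
pool = nn.AvgPool2d(kernel_size=(3, 1), stride=1, padding=(1, 0))
tokens = torch.randn(1, 1, 128, 512)   # [bs, channel, tokens, embedding]
pooled = pool(tokens)                  # shape preserved: [1, 1, 128, 512]
```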
Regarding the random token-drop layer in the basic feature module:
For example, the random token-drop layer is used to improve the recognition capability of the multi-classification model.
For example, continuing Example 1, the random token-drop layer selects a drop ratio t (0 <= t < 1) so that among the 128 input tokens, t*100% of them are randomly set to 0, discarding their original values. This lets the subsequent video classification handle a wider range of tokens.
Regarding the token residual-connection layer in the basic feature module:
For example, the token residual-connection layer is used to increase the processing depth of the multi-classification model.
For example, continuing the examples above, the token residual-connection layer mainly adds the 128 input tokens to the (1-t)*100% of tokens output by the random token-drop layer, thereby preserving the original information.
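A combined sketch of the random token drop and the residual link just described, assuming an independent per-token draw (the text does not fix the sampling scheme):

```python
import torch

def random_token_drop(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Randomly zero roughly t*100% of the tokens."""
    keep = (torch.rand(tokens.shape[:-1], device=tokens.device) >= t).float()
    return tokens * keep.unsqueeze(-1)

tokens = torch.randn(1, 128, 512)
dropped = random_token_drop(tokens, t=0.1)
# Residual connection: add the original tokens back, so the randomly
# discarded information is not lost entirely.
out = tokens + dropped
```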
Regarding the token downsampling module in the basic feature module:
For example, the downsampling module is used to further reduce the number of output tokens and adjust the output dimension.
For example, as shown in Figure 4, the downsampling module includes a linear transformation (fully connected, FC) layer, an activation function layer (e.g., a ReLU activation) and a dropout layer. Specifically, the FC layer first changes the dimension of the output tokens, the activation function layer is then applied, and the dropout layer finally turns a random part of the tokens in the output into 0.
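A sketch of this module; the target dimension and dropout rate are assumptions for illustration:

```python
import torch
import torch.nn as nn

downsample = nn.Sequential(
    nn.Linear(512, 256),   # FC layer: change the token dimension
    nn.ReLU(),             # activation function layer
    nn.Dropout(p=0.1),     # randomly zero part of the output
)
out = downsample(torch.randn(1, 128, 512))   # -> [1, 128, 256]
```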
In this way, the electronic device feeds the key image feature information selected by the token selection module into the basic feature module to obtain the corresponding high-level semantic information, allowing the multi-classification model of this application to produce more accurate classification results.
Optionally, in this embodiment of the application, the M classification results include a classification score for each category, and "the electronic device determines a target video template from at least one video template corresponding to the M classification results" in step 203 may include the following steps 203a and 203b:
Step 203a: The electronic device determines a target classification result from the M classification results corresponding to the first image set.
In this embodiment of the application, the target classification result is the classification result with the highest classification score among the M classification results.
In this embodiment of the application, after obtaining the M classification results, the electronic device can sort them according to the classification score of each category included in the results and determine the highest-scoring classification result as the target classification result.
Example 3: Taking a first image set that includes N video frames of video A as an example, the electronic device can rank by score the M classification results the multi-classification model outputs for video A and take the top three, denoted A: [Aclass1, Ascore1; Aclass2, Ascore2; Aclass3, Ascore3]. Sorting by the categories' score values gives a sorted category sequence AS: [Aclass1, Aclass2, Aclass3], and the highest-scoring classification result Aclass1 is selected as the target classification result of video A.
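A minimal sketch of this ranking step, with made-up scores:

```python
# Sort (class, score) pairs by score and keep the best as the target result.
results = [("Aclass1", 0.91), ("Aclass2", 0.66), ("Aclass3", 0.43)]
ranked = sorted(results, key=lambda r: r[1], reverse=True)
target_class = ranked[0][0]   # "Aclass1"
```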
Example 4: As shown in Figure 5, taking a first image set that includes N video frames of video A and of video B as an example, the electronic device can rank by score the M classification results the multi-classification model outputs for each of video A and video B, taking the top three for each, denoted A: [Aclass1, Ascore1; Aclass2, Ascore2; Aclass3, Ascore3] and B: [Bclass1, Bscore1; Bclass2, Bscore2; Bclass3, Bscore3]. A and B are then combined into one matching chain AB: [Aclass1, Ascore1, Aclass2, Ascore2, Aclass3, Ascore3; Bclass1, Bscore1, Bclass2, Bscore2, Bclass3, Bscore3]. Sorting AB by the categories' score values gives a sorted category sequence ABS: [Aclass1, Aclass2, Aclass3, Bclass1, Bclass2, Bclass3], and the highest-scoring classification results Aclass1 and Bclass1 are selected as the target classification results of video A and video B, respectively.
Step 203b: Determine the target video template from the video templates that match the target classification result.
In this embodiment of the application, the electronic device can first select, from the video template library, at least one video template that matches the target classification result, and then determine from that at least one template the target video template that best fits the target classification result.
In this way, the electronic device sorts the multiple classification results obtained by the multi-classification model according to each result's classification score, takes the highest-scoring result as the final target classification result, and then determines the final target video template from the multiple video templates that match that result. This makes the determined target video template a better match for the first video and improves the video quality of the final generated video.
Optionally, in this embodiment of the application, the classification results also include a classification type name. Before step 203b, the video generation method provided by the embodiments of this application may further include the following steps 203b1 and 203b2:
Step 203b1: Calculate the similarity value between the classification type name in the target classification result and the name of each video template in the video template library.
In this embodiment of the application, the electronic device can convert both the text of the target classification result's classification type name and the text of each video template's name in the video template library into vector values, and then compute a score for each video template from those vector values, thereby obtaining the similarity between the target classification result and each video template.
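A sketch of this matching step; cosine similarity and the embed_text helper are assumptions, since the text does not specify how names are turned into vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_template(target_name, template_names, embed_text):
    """Return the template whose name embedding is most similar to the
    target classification type name. embed_text is a hypothetical
    text-embedding function supplied by the caller."""
    target_vec = embed_text(target_name)
    scores = [cosine_similarity(target_vec, embed_text(name))
              for name in template_names]
    return template_names[int(np.argmax(scores))]
```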
步骤203b2:将相似度值满足第一条件的视频模板,确定为与目标分类结果匹配的视频模板。Step 203b2: Determine the video template whose similarity value satisfies the first condition as the video template that matches the target classification result.
在本申请实施例中,上述第一条件可以为:与目标分类结果中的分类类型名称间的相似度值最高的视频模板。In this embodiment of the present application, the first condition may be: the video template with the highest similarity value to the classification type name in the target classification result.
在本申请实施例中,电子设备还可以将按照相似度值的高低对视频模板进行排序,将排名靠前的视频模板推送给用户,以供用户可以手动选择所要融合的视频模板,提 高了视频生成的灵活性。In this embodiment of the present application, the electronic device can also sort the video templates according to the similarity value, and push the top-ranked video templates to the user so that the user can manually select the video template to be merged, providing Improved flexibility of video generation.
示例5,结合上述示例3,在得到上述类别序列AS之后,电子设备可以使用相同的方法,对视频模板库中的视频模板进行类别序列生成操作,视频模板的类别按照模板中的视频出现的类别的先后顺序排列,记为DataSetSi,i∈[0,DataSet],通过遍历DataSet中的每个元素的类别,计算与AS的相似度,获得与视频A相似度最高的视频模板,然后将视频A与该视频模板进行融合,生成目标视频。Example 5, combined with the above Example 3, after obtaining the above category sequence AS, the electronic device can use the same method to perform a category sequence generation operation on the video template in the video template library. The category of the video template is according to the category of the video in the template. The sequential arrangement is recorded as DataSetSi, i∈[0, DataSet]. By traversing the category of each element in the DataSet, calculating the similarity with AS, the video template with the highest similarity to video A is obtained, and then video A is Fusion with the video template to generate the target video.
示例6,结合上述示例4,如图5所示,在得到上述类别序列ABS之后,电子设备可以使用相同的方法,对视频模板库中的视频模板进行类别序列生成操作,视频模板的类别按照模板中的视频出现的类别的先后顺序排列,记为DataSetSi,i∈[0,DataSet],通过遍历DataSet中的每个元素的类别,计算与ABS的相似度,分别获得与视频A、视频B相似度最高的两个视频模板,然后将视频A与自己相似度最高的视频模板融合、将视频B与自己相似度最高的模板融合,分别得到目标视频A*和目标视频B*,最后再将目标视频A*和目标视频B*进行简单拼接,生成目标视频。Example 6, combined with the above Example 4, as shown in Figure 5, after obtaining the above category sequence ABS, the electronic device can use the same method to perform a category sequence generation operation on the video templates in the video template library. The category of the video template is according to the template The order of the categories in which the videos appear is recorded as DataSetSi, i∈[0, DataSet]. By traversing the category of each element in the DataSet, the similarity with ABS is calculated, and the similarity with video A and video B is obtained respectively. The two video templates with the highest similarity, then fuse video A with the video template with the highest similarity, and fuse video B with the template with the highest similarity to get the target video A* and target video B* respectively, and finally merge the target video Video A* and target video B* are simply spliced to generate the target video.
In this way, by performing the category sequence generation operation on the video templates in the video template library, the electronic device computes the similarity value between the classification type name of the determined target classification result and the name of each video template in the video template library, so as to obtain the video template with the highest similarity to the target classification result. The electronic device can therefore determine the video template more accurately.

The video generation method provided by the embodiments of this application is illustrated below by way of example:

Illustratively, taking the case where the first image set contains N video frames of video A, as shown in Figure 6, the video generation method provided by this application may include the following steps P1 to P5:

Step P1: when the user inputs video A, a frame extraction operation is performed on video A. Specifically, N video frames may be extracted uniformly according to the duration of video A.

Step P2: the extracted N video frames are sorted in the temporal order of video A to form the first image set, which is used as the input to the MVM model.
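A hedged sketch of steps P1 and P2, assuming OpenCV (opencv-python) is available; sampling by evenly spaced frame indices is one reasonable reading of "extracted uniformly according to the duration".

    import cv2  # assumption: opencv-python is installed

    def extract_first_image_set(video_path, n):
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for k in range(n):
            idx = int(k * total / n)              # evenly spaced frame indices (step P1)
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)              # appended in temporal order (step P2)
        cap.release()
        return frames                             # the first image set fed to the MVM model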
Step P3: after the first image set is input into the MVM model, M classification results of video A are obtained through inference. Specifically, the N video frames are converted into tokens through CNN convolution; the Token selection module then selects the important tokens from those converted from the N video frames; the basic feature blocks extract high-level semantic information from these important tokens; finally, the extracted high-level semantic information is passed through a fully connected layer to obtain the M classification results of video A.
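The internals of the MVM model are not disclosed in detail; the PyTorch sketch below is a stand-in that mirrors the stages just described: CNN convolution to tokens, selection of important tokens, a basic feature block for high-level semantics, and a fully connected layer. Token importance scored by L2 norm and a single transformer layer as the basic feature block are assumptions.

    import torch
    import torch.nn as nn

    class MVMSketch(nn.Module):
        def __init__(self, num_classes_m, dim=128, keep=64):
            super().__init__()
            self.to_tokens = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # frame -> patch tokens
            self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.head = nn.Linear(dim, num_classes_m)                      # fully connected layer
            self.keep = keep

        def forward(self, frames):
            # frames: (N, 3, H, W) -- the whole first image set in one forward pass
            tok = self.to_tokens(frames).flatten(2).transpose(1, 2)        # (N, P, dim)
            tok = tok.reshape(1, -1, tok.shape[-1])                        # treat the set as one sequence
            score = tok.norm(dim=-1)                                       # token importance (assumed)
            idx = score.topk(min(self.keep, tok.shape[1]), dim=1).indices
            tok = tok.gather(1, idx.unsqueeze(-1).expand(-1, -1, tok.shape[-1]))
            sem = self.block(tok).mean(dim=1)                              # high-level semantics
            return self.head(sem)                                          # M classification logits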
Step P4: a video template is matched from the video template library using the M classification results of video A. Specifically, the M classification results may be sorted by their corresponding scores to obtain the highest-scored classification result. The text information of the classification type name of that result is then converted into a vector value, the text information of the name of each video template in the video template library is also converted into a vector value, the similarity between each video template and the highest-scored classification result is computed, and the video template with the highest similarity is taken as the best-matching template.
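A short sketch of step P4, building on the assumed score_templates helper shown earlier; the (class_name, score) pair format for the M results is likewise an assumption.

    def match_template(results, template_names):
        # results: (class_name, score) pairs -- the M classification results of video A.
        best_label = max(results, key=lambda r: r[1])[0]       # highest-scored classification
        scored = score_templates(best_label, template_names)   # assumed helper from the earlier sketch
        return max(scored, key=lambda s: s[1])[0]              # best-matching template name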
Step P5: the video A input by the user is fused with the video template matched in step P4 to generate the final target video.

In this way, the video frames extracted from the video are first sorted in chronological order, which takes the temporal order between frames into account and improves the classification accuracy of the MVM model; the sorted frames are then input into the MVM model as a whole for processing to obtain multiple classification results of the video, which improves the classification speed of the MVM model; next, the video template most similar to the highest-scored of those classification results is matched from the video template library; finally, the video is fused with that template to obtain the final video. This not only improves the classification capability of the MVM model but also ensures the quality of the finally generated video.

For the video generation method provided by the embodiments of this application, the execution subject may be a video generation apparatus. In the embodiments of this application, a video generation apparatus executing the video generation method is taken as an example to describe the video generation apparatus provided by the embodiments of this application.

An embodiment of this application provides a video generation apparatus. As shown in Figure 7, the video generation apparatus 400 includes: an acquisition unit 401, a classification unit 402, a determination unit 403 and a generation unit 404, wherein: the acquisition unit 401 is configured to acquire a first image set; the classification unit 402 is configured to input the first image set acquired by the acquisition unit 401 into a multi-classification model for classification and output M classification results corresponding to the first image set; the determination unit 403 is configured to determine a target video template from at least one video template corresponding to the M classification results obtained by the classification unit 402; and the generation unit 404 is configured to generate a target video based on the first image set acquired by the acquisition unit 401 and the target video template determined by the determination unit 403; where M is an integer greater than 1.

Optionally, in this embodiment of the application, the classification unit 402 is specifically configured to: after the first image set acquired by the acquisition unit 401 is input into the multi-classification model, convert the N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model; determine first key image feature information from the first image feature information of the X image blocks; extract high-level semantic information corresponding to the first key image feature information; and obtain, based on the high-level semantic information, the M classification results corresponding to the first image set; where N and X are integers greater than 1.

Optionally, in this embodiment of the application, the classification unit 402 is specifically configured to: split the N frames of images in the first image set into X image blocks based on the image feature information conversion module in the multi-classification model; and extract feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.
Optionally, in this embodiment of the application, the classification unit 402 is specifically configured to: select second key image feature information from the first image feature information of the X image blocks based on the image feature information selection module in the multi-classification model, and transform the arrangement of the first image feature information of the X image blocks to obtain second image feature information; and fuse the second key image feature information with the second image feature information to obtain the first key image feature information.
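As a hedged illustration of the image feature information selection module just described, the PyTorch sketch below selects second key features by an assumed L2-norm importance score, produces second image features by a toy re-arrangement of the first image features, and fuses the two with a linear layer; all three choices are assumptions, as the embodiment does not specify them.

    import torch
    import torch.nn as nn

    class FeatureSelectSketch(nn.Module):
        def __init__(self, dim=128, keep=64):
            super().__init__()
            self.keep = keep
            self.fuse = nn.Linear(dim * 2, dim)  # fusion of key and re-arranged features

        def forward(self, feats):
            # feats: (1, X, dim) -- first image feature information of the X image blocks
            k = min(self.keep, feats.shape[1])
            idx = feats.norm(dim=-1).topk(k, dim=1).indices                # pick important blocks
            key = feats.gather(1, idx.unsqueeze(-1).expand(-1, -1, feats.shape[-1]))  # second key features
            rearranged = feats.flip(1)[:, :k]                              # second image features (toy re-arrangement)
            return self.fuse(torch.cat([key, rearranged], dim=-1))         # first key image feature information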
Optionally, in this embodiment of the application, the classification unit 402 is specifically configured to: perform a normalization operation on the first key image feature information based on the basic feature module in the multi-classification model to obtain third key image feature information; extract basic image feature information from the third key image feature information; fuse the first key image feature information with the basic image feature information to obtain target key image feature information; and extract the high-level semantic information corresponding to the target key image feature information.
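Similarly, a minimal sketch of the basic feature module described here, assuming PyTorch: layer normalization yields the third key features, an assumed MLP extracts the basic features, and a residual addition performs the fusion into the target key features.

    import torch.nn as nn

    class BasicFeatureSketch(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            self.norm = nn.LayerNorm(dim)                  # normalization operation
            self.extract = nn.Sequential(                  # assumed base-feature extractor
                nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

        def forward(self, first_key):
            third_key = self.norm(first_key)               # third key image feature information
            base = self.extract(third_key)                 # basic image feature information
            return first_key + base                        # fusion -> target key image feature information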
Optionally, in this embodiment of the application, the acquisition unit 401 is specifically configured to extract N video frames from a first video to obtain the first image set; and the generation unit 404 is specifically configured to fuse the first video with the target video template to generate the target video.

In the video generation apparatus provided by the embodiments of this application, when producing a video, the apparatus may first acquire a first image set; then input the first image set into a multi-classification model for classification, so as to output M classification results corresponding to the first image set; then determine a target video template from at least one video template corresponding to the M classification results; and finally generate a target video based on the first image set and the target video template, where M is an integer greater than 1. Since this application classifies the entire first image set as a whole, the multi-classification model needs only a single forward pass to obtain the M classification results for the whole first image set; this improves the classification capability of the multi-classification model and thereby the overall efficiency of video generation.

The video generation apparatus in the embodiments of this application may be an electronic device, or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. For example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a handheld computer, a vehicle-mounted electronic device, a Mobile Internet Device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), and may also be a server, a Network Attached Storage (NAS), a personal computer (PC), a television (TV), a teller machine or a self-service machine, which is not specifically limited in the embodiments of this application.

The video generation apparatus in the embodiments of this application may be an apparatus with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of this application.

The video generation apparatus provided by the embodiments of this application can implement the processes implemented by the method embodiments of Figures 1 to 6; to avoid repetition, details are not described here again.

Optionally, as shown in Figure 8, an embodiment of this application further provides an electronic device 600, including a processor 601 and a memory 602, the memory 602 storing a program or instructions executable on the processor 601; when executed by the processor 601, the program or instructions implement the steps of the above video generation method embodiments and can achieve the same technical effect, which is not repeated here to avoid duplication.

It should be noted that the electronic devices in the embodiments of this application include the mobile electronic devices and non-mobile electronic devices described above.
Figure 9 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of this application.

The electronic device 100 includes, but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and other components.

Those skilled in the art will understand that the electronic device 100 may further include a power supply (such as a battery) supplying power to the components; the power supply may be logically connected to the processor 110 through a power management system, so that functions such as charge management, discharge management and power consumption management are implemented through the power management system. The structure of the electronic device shown in Figure 9 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or use a different arrangement of components, which is not repeated here.
The processor 110 is configured to: acquire a first image set; input the acquired first image set into a multi-classification model for classification, and output M classification results corresponding to the first image set; determine a target video template from at least one video template corresponding to the M classification results; and generate a target video based on the first image set and the target video template; where M is an integer greater than 1.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: after inputting the acquired first image set into the multi-classification model, convert the N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model; determine first key image feature information from the first image feature information of the X image blocks; extract high-level semantic information corresponding to the first key image feature information; and obtain, based on the high-level semantic information, the M classification results corresponding to the first image set; where N and X are integers greater than 1.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: split the N frames of images in the first image set into X image blocks based on the image feature information conversion module in the multi-classification model; and extract feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: select second key image feature information from the first image feature information of the X image blocks based on the image feature information selection module in the multi-classification model, and transform the arrangement of the first image feature information of the X image blocks to obtain second image feature information; and fuse the second key image feature information with the second image feature information to obtain the first key image feature information.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: perform a normalization operation on the first key image feature information based on the basic feature module in the multi-classification model to obtain third key image feature information; extract basic image feature information from the third key image feature information; fuse the first key image feature information with the basic image feature information to obtain target key image feature information; and extract the high-level semantic information corresponding to the target key image feature information.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: extract N video frames from a first video to obtain the first image set; and fuse the first video with the target video template to generate the target video.

In the electronic device provided by the embodiments of this application, when producing a video, the electronic device may first acquire a first image set, then input the first image set into a multi-classification model for classification so as to output M classification results corresponding to the first image set, then determine a target video template from at least one video template corresponding to the M classification results, and finally generate a target video based on the first image set and the target video template, where M is an integer greater than 1. Since this application classifies the entire first image set as a whole, the multi-classification model needs only a single forward pass to obtain the M classification results for the whole first image set; this improves the classification capability of the multi-classification model and thereby the overall efficiency of video generation.
It should be understood that, in this embodiment of the application, the input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042; the graphics processor 1041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes a touch panel 1071 and at least one of other input devices 1072. The touch panel 1071, also called a touch screen, may include two parts: a touch detection device and a touch controller. The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse and a joystick, which are not described here again.

The memory 109 may be used to store software programs and various data. The memory 109 may mainly include a first storage area storing programs or instructions and a second storage area storing data, where the first storage area may store an operating system and the applications or instructions required for at least one function (such as a sound playback function or an image playback function). In addition, the memory 109 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synch-link dynamic random access memory (SLDRAM) or a direct rambus random access memory (DRRAM). The memory 109 in the embodiments of this application includes, but is not limited to, these and any other suitable types of memory.

The processor 110 may include one or more processing units. Optionally, the processor 110 integrates an application processor and a modem processor, where the application processor mainly handles operations related to the operating system, the user interface, applications and the like, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It can be understood that the modem processor may alternatively not be integrated into the processor 110.
An embodiment of this application further provides a readable storage medium storing a program or instructions; when executed by a processor, the program or instructions implement the processes of the above video generation method embodiments and can achieve the same technical effect, which is not repeated here to avoid duplication.

The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

An embodiment of this application further provides a chip, including a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to run a program or instructions to implement the processes of the above video generation method embodiments and achieve the same technical effect, which is not repeated here to avoid duplication.

It should be understood that the chip mentioned in the embodiments of this application may also be called a system-on-chip, a system chip, a chip system or a system-on-a-chip.

An embodiment of this application provides a computer program product stored in a storage medium; the program product is executed by at least one processor to implement the processes of the above video generation method embodiments and can achieve the same technical effect, which is not repeated here to avoid duplication.
It should be noted that, as used herein, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed or inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the statement "comprises a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that includes that element. In addition, it should be pointed out that the scope of the methods and apparatuses in the embodiments of this application is not limited to performing functions in the order shown or discussed; it may also include performing functions in a substantially simultaneous manner or in the reverse order depending on the functions involved. For example, the described methods may be performed in an order different from the one described, and steps may be added, omitted or combined. Moreover, features described with reference to certain examples may be combined in other examples.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of this application.

The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the specific implementations described above; the specific implementations described above are merely illustrative rather than restrictive. Inspired by this application, a person of ordinary skill in the art may devise many other forms without departing from the purpose of this application and the scope protected by the claims, all of which fall within the protection of this application.

Claims (17)

1. A video generation method, the method comprising:
    acquiring a first image set;
    inputting the first image set into a multi-classification model for classification, and outputting M classification results corresponding to the first image set;
    determining a target video template from at least one video template corresponding to the M classification results; and
    generating a target video based on the first image set and the target video template;
    wherein M is an integer greater than 1.

2. The method according to claim 1, wherein the inputting the first image set into a multi-classification model for classification and outputting M classification results corresponding to the first image set comprises:
    after inputting the first image set into the multi-classification model, converting N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model;
    determining first key image feature information from the first image feature information of the X image blocks;
    extracting high-level semantic information corresponding to the first key image feature information; and
    obtaining, based on the high-level semantic information, the M classification results corresponding to the first image set;
    wherein N and X are integers greater than 1.

3. The method according to claim 2, wherein the converting N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model comprises:
    splitting the N frames of images in the first image set into X image blocks based on an image feature information conversion module in the multi-classification model; and
    extracting feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.

4. The method according to claim 2, wherein the determining first key image feature information from the first image feature information of the X image blocks comprises:
    selecting second key image feature information from the first image feature information of the X image blocks based on an image feature information selection module in the multi-classification model, and transforming the arrangement of the first image feature information of the X image blocks to obtain second image feature information; and
    fusing the second key image feature information with the second image feature information to obtain the first key image feature information.

5. The method according to claim 2, wherein the extracting high-level semantic information corresponding to the first key image feature information comprises:
    performing a normalization operation on the first key image feature information based on a basic feature module in the multi-classification model to obtain third key image feature information;
    extracting basic image feature information from the third key image feature information;
    fusing the first key image feature information with the basic image feature information to obtain target key image feature information; and
    extracting high-level semantic information corresponding to the target key image feature information.

6. The method according to claim 1, wherein the acquiring a first image set comprises:
    extracting N video frames from a first video to obtain the first image set;
    and the generating a target video based on the first image set and the target video template comprises:
    fusing the first video with the target video template to generate the target video.
7. A video generation apparatus, the apparatus comprising: an acquisition unit, a classification unit, a determination unit and a generation unit, wherein:
    the acquisition unit is configured to acquire a first image set;
    the classification unit is configured to input the first image set acquired by the acquisition unit into a multi-classification model for classification, and output M classification results corresponding to the first image set;
    the determination unit is configured to determine a target video template from at least one video template corresponding to the M classification results obtained by the classification unit; and
    the generation unit is configured to generate a target video based on the first image set acquired by the acquisition unit and the target video template determined by the determination unit;
    wherein M is an integer greater than 1.

8. The apparatus according to claim 7, wherein the classification unit is specifically configured to:
    after the first image set acquired by the acquisition unit is input into the multi-classification model, convert N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model;
    determine first key image feature information from the first image feature information of the X image blocks;
    extract high-level semantic information corresponding to the first key image feature information; and
    obtain, based on the high-level semantic information, the M classification results corresponding to the first image set;
    wherein N and X are integers greater than 1.

9. The apparatus according to claim 8, wherein the classification unit is specifically configured to:
    split the N frames of images in the first image set into X image blocks based on an image feature information conversion module in the multi-classification model; and
    extract feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.

10. The apparatus according to claim 8, wherein the classification unit is specifically configured to:
    select second key image feature information from the first image feature information of the X image blocks based on an image feature information selection module in the multi-classification model, and transform the arrangement of the first image feature information of the X image blocks to obtain second image feature information; and
    fuse the second key image feature information with the second image feature information to obtain the first key image feature information.

11. The apparatus according to claim 8, wherein the classification unit is specifically configured to:
    perform a normalization operation on the first key image feature information based on a basic feature module in the multi-classification model to obtain third key image feature information;
    extract basic image feature information from the third key image feature information;
    fuse the first key image feature information with the basic image feature information to obtain target key image feature information; and
    extract high-level semantic information corresponding to the target key image feature information.

12. The apparatus according to claim 7, wherein
    the acquisition unit is specifically configured to extract N video frames from a first video to obtain the first image set; and
    the generation unit is specifically configured to fuse the first video with the target video template to generate a target video.

13. An electronic device, comprising a processor and a memory, the memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the video generation method according to any one of claims 1 to 6.

14. A readable storage medium, storing a program or instructions which, when executed by a processor, implement the steps of the video generation method according to any one of claims 1 to 6.

15. A computer program product, executed by at least one processor to implement the video generation method according to any one of claims 1 to 6.

16. An electronic device, configured to perform the video generation method according to any one of claims 1 to 6.

17. A chip, comprising a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to run a program or instructions to implement the video generation method according to any one of claims 1 to 6.
PCT/CN2023/105161 2022-07-14 2023-06-30 Video generation method and apparatus, electronic device and medium WO2024012289A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210834501.4 2022-07-14
CN202210834501.4A CN115222838A (en) 2022-07-14 2022-07-14 Video generation method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
WO2024012289A1 true WO2024012289A1 (en) 2024-01-18

Family

ID=83611607

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105161 WO2024012289A1 (en) 2022-07-14 2023-06-30 Video generation method and apparatus, electronic device and medium

Country Status (2)

Country Link
CN (1) CN115222838A (en)
WO (1) WO2024012289A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115222838A (en) * 2022-07-14 2022-10-21 维沃移动通信有限公司 Video generation method, device, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710902A (en) * 2018-05-08 2018-10-26 江苏云立物联科技有限公司 A kind of sorting technique towards high-resolution remote sensing image based on artificial intelligence
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN111757149A (en) * 2020-07-17 2020-10-09 商汤集团有限公司 Video editing method, device, equipment and storage medium
US20210081671A1 (en) * 2019-09-12 2021-03-18 Beijing Xiaomi Mobile Software Co., Ltd. Video processing method and device, and storage medium
CN113094552A (en) * 2021-03-19 2021-07-09 北京达佳互联信息技术有限公司 Video template searching method and device, server and readable storage medium
CN115222838A (en) * 2022-07-14 2022-10-21 维沃移动通信有限公司 Video generation method, device, electronic equipment and medium


Also Published As

Publication number Publication date
CN115222838A (en) 2022-10-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23838781

Country of ref document: EP

Kind code of ref document: A1