CN117095085A - Video generation method and device, medium and computer equipment - Google Patents

Video generation method and device, medium and computer equipment

Info

Publication number: CN117095085A
Application number: CN202311078346.9A
Authority: CN (China)
Language: Chinese (zh)
Prior art keywords: video, text, sample, target, matching
Inventor: name withheld at the inventor's request
Current Assignee: Moore Threads Technology Co Ltd
Original Assignee: Moore Threads Technology Co Ltd
Application filed by: Moore Threads Technology Co Ltd
Legal status: Pending

Classifications

    • G06T 13/00 Animation
    • G06F 16/732 Query formulation (information retrieval of video data)
    • G06F 16/783 Retrieval characterised by using metadata automatically derived from the content (information retrieval of video data)
    • G06F 40/30 Semantic analysis (handling natural language data)
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0475 Generative networks
    • G06N 3/08 Learning methods
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N 21/854 Content authoring
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 13/60 3D [Three Dimensional] animation of natural phenomena, e.g. rain, snow, water or plants

Abstract

A video generation method and apparatus, medium and computer device, the method comprising: acquiring a target text; retrieving matching videos semantically matched with the target text from a video database; inputting the target text and the matching video into a target video generation model; and obtaining the target video output by the target video generation model.

Description

Video generation method and device, medium and computer equipment
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video generation method and apparatus, a medium, and a computer device.
Background
Artificial intelligence generated content (AI-Generated Content, AIGC) refers to content generated by generative artificial intelligence techniques. Text-to-video generation is a specific application scenario of AIGC, in which text is input into a video generation model and a video is generated by the model. Current text-to-video approaches require training a video generation model on a large number of text-video pairs, and because video files are relatively large, training the video generation model has high computing-power requirements. Moreover, high-quality text-video pairs are very scarce, so it is difficult to train a high-quality video generation model, which results in poor quality of the videos generated by the video generation model.
Disclosure of Invention
In a first aspect, an embodiment of the present disclosure provides a video generating method, including: acquiring a target text; retrieving matching videos semantically matched with the target text from a video database; inputting the target text and the matching video into a target video generation model; and obtaining the target video output by the target video generation model.
In some embodiments, said retrieving from a video database matching videos that semantically match said target text comprises: acquiring semantic features corresponding to the target text; acquiring video features corresponding to a plurality of candidate videos in the video database respectively, wherein the video features are used for representing semantic information of the corresponding candidate videos; and determining matching videos which are matched with the target text semantically from the plurality of candidate videos based on the semantic features corresponding to the target text and the video features corresponding to the plurality of candidate videos respectively.
In some embodiments, the determining, from the plurality of candidate videos, a matching video that semantically matches the target text based on semantic features corresponding to the target text and video features corresponding to the plurality of candidate videos, respectively, includes: determining the similarity between the semantic features corresponding to the target text and the video features corresponding to each candidate video in the plurality of candidate videos; and determining candidate videos corresponding to at least one video feature with the similarity of semantic features corresponding to the target text from high to low as the matching videos.
In some embodiments, the obtaining video features corresponding to the plurality of candidate videos in the video database includes: determining at least one key frame from the candidate video; extracting features of each key frame in the at least one key frame to obtain features of each key frame; and acquiring the video features corresponding to the candidate video based on the features of each key frame.
In some embodiments, said retrieving from a video database matching videos that semantically match said target text comprises: obtaining keyword labels respectively corresponding to a plurality of candidate videos in the video database, wherein the keyword labels are used for representing semantic categories to which the corresponding candidate videos belong; determining the matching degree between the target text and the keyword label corresponding to each candidate video in the plurality of candidate videos; and determining candidate videos corresponding to at least one keyword tag with the matching degree of the target text from high to low as the matching videos.
In some embodiments, the target video generation model is trained based on: pre-training an image generation model based on a first sample text and a sample image semantically matched with the first sample text to obtain a pre-training model; fine tuning the pre-training model added with the attention module based on a second sample text and a sample matching video semantically matched with the second sample text to obtain the target video generation model; the attention module is used for extracting time sequence characteristics in the sample matching video.
In some embodiments, the target text comprises a plurality of text blocks, the matching video semantically matching the target text comprises matching video semantically matching each text block of the plurality of text blocks, and the target video comprises a target video corresponding to each text block; the method further comprises the steps of: and fusing the target video corresponding to each text block to obtain a fused video.
In some embodiments, the target video corresponding to the i+1th text block of the plurality of text blocks is generated by the target video generation model based on the i+1th text block, a matching video semantically matching the i+1th text block, and a video feature corresponding to the target video corresponding to the i th text block of the plurality of text blocks, i being a positive integer.
In some embodiments, the target video generation model is trained based on: pre-training an initial video generation model based on at least one piece of sample data to obtain the target video generation model; wherein each piece of sample data in the at least one piece of sample data includes: sample text blocks; sample matching videos semantically matched with the sample text blocks; sample video features corresponding to sample matching videos semantically matched with a previous sample text block in sample text to which the sample text block belongs.
In some embodiments, the target video generation model is trained based on: inputting a plurality of pieces of sample data into an initial video generation model; each sample data in the plurality of sample data comprises one sample text block in a plurality of sample text blocks included in a sample text and a sample matching video semantically matched with the sample text block; acquiring a sample output video corresponding to each sample text block output by the initial video generation model; training the initial video generation model based on the video characteristics corresponding to the sample output video corresponding to each sample text block and preset conditions to obtain the target video generation model; the preset conditions include: a first similarity between a video feature corresponding to a first sample output video and a video feature corresponding to a second sample output video is less than a second similarity between a video feature corresponding to the first sample output video and a video feature corresponding to a reference video, and the first similarity is less than a third similarity between a video feature corresponding to the second sample output video and a video feature corresponding to the reference video; the first sample output video and the second sample output video are respectively sample output videos corresponding to adjacent sample text blocks, and the reference video is a video obtained from a video database.
In some embodiments, each sample data of the plurality of sample data further includes a sample video feature corresponding to a sample matching video that semantically matches a previous sample text block of the sample text to which the sample text block belongs.
In some embodiments, said retrieving from a video database matching videos that semantically match said target text comprises: extracting key information from the target text, wherein the key information is related to visual characteristics of the target text; and searching matching videos which are matched with the target text semantically from a video database based on the key information.
In some embodiments, the method further comprises: and smoothing the adjacent target videos in the fusion video.
In a second aspect, an embodiment of the present disclosure provides a video generating apparatus, the apparatus including: the text acquisition module is used for acquiring a target text; the retrieval module is used for retrieving matching videos which are semantically matched with the target text from a video database; the input module is used for inputting the target text and the matched video into a target video generation model; and the video acquisition module is used for acquiring the target video output by the target video generation model.
In a third aspect, the disclosed embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the embodiments of the disclosure.
In a fourth aspect, embodiments of the present disclosure provide a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the embodiments of the present disclosure when the program is executed.
In the embodiment of the disclosure, after the target text is acquired, the matched video which is semantically matched with the target text can be retrieved from the video database, and then the target text and the matched video are used as the input of the target video generation model, so that the semantic information and the time sequence information in the matched video can be acquired in addition to the semantic information in the target text in the process of generating the target video by the target video generation model, the target video output by the target video generation model is more coherent, the jitter of the target video is reduced, and the quality of the target video output by the target video generation model is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure.
Fig. 2 is a flowchart of a video generation method of an embodiment of the present disclosure.
Fig. 3A and 3B are schematic diagrams of video matching processes, respectively, of embodiments of the present disclosure.
Fig. 4 is a schematic diagram of a training process of a target video generation model of an embodiment of the present disclosure.
Fig. 5 is a block diagram of a video generating apparatus of an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a computer device of an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In order to better understand the technical solutions in the embodiments of the present disclosure and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
Text-to-video generation refers to generating a video from input text by means of a video generation model. Referring to Fig. 1, in some embodiments, the text includes the content "a boy playing football"; this text is input into the video generation model, which may generate a piece of video related to the text semantics, and the content presented in the video is also "a boy playing football".
In the related art, a video generation model needs to be trained on a large number of text-video pairs, each of which includes a piece of text and a video related to the text semantics. However, because video files are relatively large, training the video generation model is computationally demanding. Moreover, high-quality text-video pairs are very scarce, so it is difficult to train a high-quality video generation model, and the quality of the videos generated by the video generation model is therefore poor: for example, they exhibit considerable jitter and discontinuity.
Based on this, an embodiment of the present disclosure provides a video generating method, referring to fig. 2, including:
step S1: acquiring a target text;
step S2: retrieving matching videos semantically matched with the target text from a video database;
step S3: inputting the target text and the matching video into a target video generation model;
step S4: and obtaining the target video output by the target video generation model.
According to the embodiment of the disclosure, video retrieval and target video generation are combined, after the target text is acquired, the matched video which is matched with the target text semantically can be retrieved from the video database, and then the target text and the matched video are used as the input of the target video generation model together, so that the semantic information and time sequence information in the matched video can be acquired in addition to the semantic information in the target text in the process of generating the target video by the target video generation model, the target video output by the target video generation model is more coherent, the jitter of the target video is reduced, and the quality of the target video output by the target video generation model is improved. In addition, the embodiment of the disclosure acquires the matching video semantically matched with the target text from the video database in a searching mode, and a user only needs to input the target text into the target video generation model without inputting the matching video, so that the universality of target video generation is improved. Moreover, the user does not need to pay attention to what video needs to be input into the target video generation model as a matching video, and the whole process is friendly to the user. Specific implementation details of embodiments of the present disclosure are illustrated below.
In step S1, the target text may be edited and input by the user, may be obtained by recognizing the user's voice information, or may be obtained in other ways, for example from a web page. The content of the target text may be news, a novel, narration (e.g., recounting a complete story), dialog content (e.g., dialog subtitles in a movie), etc., which is not limited by the present disclosure. The target text may include at least one sentence. In some embodiments, the target text may be partitioned into a plurality of text blocks, where each text block may include a piece of semantically complete content in the target text. Optionally, the target text may be split according to punctuation. For example, the target text may be split according to punctuation marks that indicate the end of a sentence (referred to simply as end symbols), such as full stops, question marks, and exclamation marks; the text between two end symbols may be regarded as a text block. Alternatively, the target text may be partitioned into text blocks according to paragraphs; for example, each paragraph in the target text may be treated as a text block. The target text may also be partitioned into text blocks based on semantic understanding. Text blocks may also be divided in ways other than those listed above, which are not enumerated here.
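As a rough illustration of the end-symbol splitting described above, a minimal Python sketch (the regular expression, the helper name, and the choice to cover both Western and Chinese end symbols are illustrative assumptions, not part of the disclosure):

```python
import re

def split_into_text_blocks(target_text: str) -> list[str]:
    """Split the target text into text blocks at sentence-ending punctuation.

    Zero-width splitting after an end symbol requires Python 3.7+;
    paragraph-based splitting could instead split on blank lines.
    """
    # Keep each end symbol attached to the block it terminates.
    parts = re.split(r"(?<=[.!?。！？])\s*", target_text)
    return [p.strip() for p in parts if p.strip()]

blocks = split_into_text_blocks("A boy plays football. He scores a goal! The crowd cheers.")
# blocks -> ['A boy plays football.', 'He scores a goal!', 'The crowd cheers.']
```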
In step S2, the video database may be an offline database or an online database, which is pre-established, and a plurality of videos may be included in the video database. Part or all of the videos in the video database can be used as candidate videos, and matching videos which are matched with the target text semantically can be retrieved from the candidate videos. The number of the matching videos can be greater than or equal to 1, and under the condition that the number of the matching videos is greater than 1, each matching video can be used as the input of the target video generation model, so that the diversity of the generated target videos can be improved. The manner in which matching videos are retrieved is illustrated below.
Referring to fig. 3A, in one approach, matching videos may be determined by feature matching. Specifically, semantic features corresponding to the target text and video features corresponding to the plurality of candidate videos respectively can be obtained, and matching videos which are matched with the target text semantically are determined from the plurality of candidate videos based on the semantic features corresponding to the target text and the video features corresponding to the plurality of candidate videos respectively.
For the target text, the target text may be encoded (for example, the CLIP text encoder may be used to encode the target text), and the feature vector g of the target text is obtained through encoding, and is used as a semantic feature corresponding to the target text.
For each candidate video, a feature vector may be extracted from the candidate video. Assuming that the number of candidate videos is N, the feature vectors extracted from the candidate videos can be denoted as f_1, f_2, …, f_N, where f_i (1 ≤ i ≤ N) is the feature vector extracted from the i-th candidate video; the feature vector f_i is the video feature corresponding to the i-th candidate video and can be used to represent the semantic information of the i-th candidate video.
In some embodiments, at least one key frame may be determined from the candidate video. For example, one key frame may be selected every M video frames, or the I frames in the video may be determined as key frames, or certain frames in the video may be extracted as key frames, or all video frames in the video may be determined as key frames. After the key frames are determined, feature extraction may be performed on each key frame to obtain the features of that key frame. For example, each key frame may be input into a pre-trained image feature extraction model (ViT, CLIP, etc.) to obtain the features of the key frame. Then, the video features corresponding to the candidate video are obtained based on the features of each key frame. For example, the features of the key frames may be pooled (e.g., average pooling or max pooling) to obtain the video features corresponding to the candidate video.
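A sketch of this key-frame route, assuming a CLIP image encoder as the pre-trained image feature extraction model and average pooling (the stride value, function names, and the openai/CLIP package are illustrative choices, not prescribed by the disclosure):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_feature_from_keyframes(frames: list[Image.Image], stride: int = 8) -> torch.Tensor:
    """Sample one key frame every `stride` frames, encode each key frame,
    then average-pool the per-frame features into a single video feature."""
    keyframes = frames[::stride] or frames[:1]
    batch = torch.stack([preprocess(f) for f in keyframes]).to(device)
    with torch.no_grad():
        frame_feats = model.encode_image(batch).float()              # (K, D)
        frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    return frame_feats.mean(dim=0)                                   # (D,) pooled video feature f_i
```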
Referring to Fig. 3B, in other embodiments, the candidate video may also be directly input into a video feature extraction model (e.g., TimeSformer, I3D, SlowFast, etc.) to obtain the features of the candidate video. By extracting the video features of the candidate videos with such a video feature extraction model, the time sequence features in the candidate videos can also be extracted.
After the feature vector g and the feature vectors f_1, f_2, …, f_N are obtained, the feature vector g may be used to perform similarity matching against the feature vectors f_1, f_2, …, f_N of the N candidate videos. That is, the similarity between the feature vector g and each of the feature vectors f_1, f_2, …, f_N is calculated, at least one video feature with high similarity to the feature vector g is selected and denoted f_x1, f_x2, …, f_xk, and the candidate videos corresponding to the video features f_x1, f_x2, …, f_xk are determined as the matching videos.
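Continuing the same assumed CLIP setup, a sketch of computing the text feature g and ranking the pre-extracted candidate features f_1, …, f_N by cosine similarity (the top-k cut-off is an illustrative choice):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def retrieve_matching_videos(target_text: str,
                             candidate_feats: torch.Tensor,   # (N, D), rows are f_1..f_N
                             top_k: int = 3) -> list[int]:
    """Return indices of the top_k candidate videos most similar to the target text."""
    tokens = clip.tokenize([target_text]).to(device)
    with torch.no_grad():
        g = model.encode_text(tokens).float()                 # text feature g, shape (1, D)
    g = g / g.norm(dim=-1, keepdim=True)
    feats = candidate_feats.float().to(device)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sims = (feats @ g.T).squeeze(-1)                          # cosine similarity with each f_i
    return sims.topk(min(top_k, feats.shape[0])).indices.tolist()
```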
The feature vectors f_1, f_2, …, f_N corresponding to the candidate videos in the above embodiment may be extracted in advance and stored in association with each candidate video in the video database. In this way, when the matching video is determined, the pre-stored feature vector can be directly obtained from the video database without extracting the features of the candidate video, so that the efficiency of determining the matching video is improved. Alternatively, only the candidate video may be stored in the video database without storing the feature vector corresponding to the candidate video. Under the condition that the matching video is required to be determined from the candidate videos, the candidate videos can be obtained from the video database, and feature extraction is carried out on the candidate videos, so that the feature vectors corresponding to the candidate videos are obtained.
In another approach, matching videos may be determined by keyword matching. Specifically, keyword tags respectively corresponding to a plurality of candidate videos in the video database can be obtained; the matching degree between the target text and the keyword tag corresponding to each candidate video in the plurality of candidate videos can be determined; and candidate videos corresponding to at least one keyword tag, ranked from high to low by matching degree with the target text, can be determined as the matching videos.
The keyword labels are used for representing semantic categories to which the corresponding candidate videos belong. The semantic categories may include, but are not limited to, at least one of:
the categories of actions performed by the target subject in the candidate video include categories of limb actions (e.g., running, kicking, climbing, squatting) and/or categories of expressive actions (e.g., smiling, crying, opening mouth, blinking, etc.);
attribute categories of the target object in the candidate video, including but not limited to gender, age, etc. of the target object;
scene categories presented in the candidate video, e.g., forest, grassland, city, indoor, outdoor, etc.;
the climate category presented in the candidate video, e.g., windy, snowy, sunny, rainy, etc.
Semantic categories may also be partitioned according to other criteria, which are not listed here. Each semantic category has its corresponding semantic feature, e.g., where the semantic category includes a category of actions performed by a target object in a candidate video, the semantic feature corresponding to the candidate video is used to characterize the actions performed by the target object in the candidate video; in the case that the semantic category includes an attribute category of a target object in the candidate video, the semantic feature corresponding to the candidate video is used to characterize the attribute of the target object in the candidate video.
Keyword tags corresponding to candidate videos may be pre-generated and stored in association with the candidate videos in a video database. Therefore, when the matched video is determined, the pre-stored keyword label can be directly obtained from the video database, and the generation process of the keyword label is not required to be executed, so that the efficiency of determining the matched video is improved. Alternatively, only the candidate video may be stored in the video database without storing the keyword tag corresponding to the candidate video. In the case that a matching video needs to be determined from the candidate videos, the candidate videos can be obtained from a video database, and keyword tags corresponding to the candidate videos are generated. The keyword tags may be generated by a neural network used to generate the keyword tags, or generated in other ways.
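A minimal sketch of the keyword-tag route; here the matching degree is simply the overlap between words in the target text and each video's stored tags (the tag-store layout and the scoring rule are assumptions for illustration, not the method prescribed by the disclosure):

```python
from dataclasses import dataclass

@dataclass
class CandidateVideo:
    video_id: str
    keyword_tags: set[str]   # e.g. {"boy", "dancing", "indoor"}

def match_by_keywords(target_text: str,
                      candidates: list[CandidateVideo],
                      top_k: int = 3) -> list[str]:
    """Rank candidates by how many of their keyword tags appear in the target text."""
    words = set(target_text.lower().split())
    scored = [(len(words & {t.lower() for t in c.keyword_tags}), c.video_id)
              for c in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [vid for score, vid in scored[:top_k] if score > 0]

db = [CandidateVideo("v1", {"boy", "dancing"}), CandidateVideo("v2", {"rain", "forest"})]
print(match_by_keywords("a boy dancing in the street", db))  # -> ['v1']
```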
By the method, the matched video which is matched with the target text semantically can be obtained. For example, if the target text is "boy dancing," then the matching video searched for is also the boy dancing video.
In some embodiments, if the target text is long, or the target text includes a large amount of descriptive information that is unrelated to visual features, such as descriptions of a person's psychological activities or personality traits, key information may be extracted from the target text, and matching videos that semantically match the target text may be retrieved from the video database based on the key information. The key information relates to visual features in the target text, including but not limited to the appearance of the target object described by the target text, the action of the target object, and/or the object at which the action of the target object is directed. Typically, the key information includes at least part of the subject, predicate, and object of the target text. For example, suppose the target text is: "After an intense mental struggle, the timid girl finally made up her mind and handed the competition registration form to the teacher." In this target text, descriptive information such as "timid" and "made up her mind" is difficult to display visually, yet it occupies a considerable portion of the text, which can interfere with both feature matching and keyword matching of the target text. To address this, the following key information may be extracted from the target text: "The girl hands the competition registration form to the teacher." Here, "the girl" is the target object described by the target text, "hands" is the action, and "the competition registration form" and "the teacher" are the objects at which the action of the target object is directed.
After the key information is extracted, feature matching can be performed based on the semantic features corresponding to the key information and the video features corresponding to the candidate videos in the video database, or keyword matching can be performed based on the key information and the keyword tags corresponding to the candidate videos in the video database, so as to determine the matching videos corresponding to the target text.
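A rough illustration of extracting subject/predicate/object key information with an off-the-shelf dependency parser; spaCy and the chosen dependency labels are assumptions made here for illustration and are not named in the disclosure:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English pipeline is installed

def extract_key_information(target_text: str) -> str:
    """Keep tokens acting as subject, main verb or object; drop the rest."""
    doc = nlp(target_text)
    kept = [tok.text for tok in doc
            if tok.dep_ in ("nsubj", "nsubjpass", "ROOT", "dobj", "iobj", "pobj")]
    return " ".join(kept)

text = ("After an intense mental struggle, the timid girl finally made up her mind "
        "and handed the competition registration form to the teacher.")
print(extract_key_information(text))
# keeps roughly the subject/verb/object tokens; the exact output is parser-dependent
```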
In the case that the target text includes a plurality of text blocks, each text block may be determined as a piece of target text, and a matching video may be determined for each text block in any of the above manners, so as to obtain one or more matching videos corresponding to each text block.
In step S3 and step S4, the target text and the matching video may be input together into the target video generation model, and the target video semantically matched with the target text may be output through the target video generation model. In the case where the target text includes a plurality of text blocks, each text block and the matching video semantically matched with that text block may be input into the target video generation model, so that the target video semantically matched with each text block is output through the target video generation model. For example, the i-th text block and a matching video semantically matched with the i-th text block may be input into the target video generation model, so that a target video semantically matched with the i-th text block is output through the target video generation model. Assuming that the target text includes X text blocks (X is a positive integer), X segments of target video may finally be generated.
Further, the (i+1)-th text block, the matching video semantically matched with the (i+1)-th text block, and the video feature corresponding to the target video of the i-th text block can be input into the target video generation model, so that the target video semantically matched with the (i+1)-th text block is output through the target video generation model. In this way, the target video generation model can refer to the video features of the target video corresponding to the previous text block when generating the target video corresponding to each text block, so that the feature consistency between the target videos corresponding to different text blocks is improved.
After the target videos corresponding to the text blocks are obtained, they can be fused to obtain a fused video. For example, the target videos corresponding to the text blocks may be spliced to obtain the fused video. In this way, the embodiments of the present disclosure can generate a fused video with a longer duration (hereinafter referred to as a long video), where a long video may refer to a video with a duration greater than 10 s. In order to further improve the coherence of the target videos in the fused video, smoothing processing can be performed on the target videos in the fused video. For example, frame interpolation methods such as Depth-Aware Video Frame Interpolation (DAIN), which combines optical flow estimation and depth map estimation, and Real-Time Intermediate Flow Estimation (RIFE) may be used for smoothing, so that the head-tail transitions between adjacent target videos in the fused video are more natural.
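Putting the per-block flow together, a high-level sketch of generating and fusing the block-level target videos; the `retrieve`, `generate`, feature-extractor and frame-interpolation interfaces are hypothetical placeholders, and only the overall flow follows the description above:

```python
def generate_long_video(text_blocks, video_db, gen_model, feature_extractor, frame_interp):
    """Generate one short clip per text block, conditioning each block on the previous
    clip's video features, then splice the clips into a single fused (long) video."""
    clips, prev_feat = [], None
    for block in text_blocks:
        matching = video_db.retrieve(block)              # semantic retrieval (hypothetical API)
        clip_i = gen_model.generate(text=block,
                                    matching_video=matching,
                                    prev_video_feature=prev_feat)
        prev_feat = feature_extractor(clip_i)            # reused when generating the next block
        clips.append(clip_i)

    fused = list(clips[0])
    for prev_clip, next_clip in zip(clips, clips[1:]):
        # smooth the head-tail transition, e.g. with a DAIN- or RIFE-style interpolator
        fused.extend(frame_interp(prev_clip, next_clip))
        fused.extend(next_clip)
    return fused
```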
In the related art, if a long video (for example, a video having a duration longer than 10 s) needs to be generated, a text video pair including the long video and text needs to be employed to train a target video generation model. This further increases the difficulty of acquiring the sample data and the computational effort requirements. According to the embodiment of the disclosure, the target texts are segmented, the target videos with shorter duration are respectively generated for each text block, and then the generated target videos are fused to obtain the long videos with longer duration. Therefore, a text video pair comprising a long video and a text is not required to train a target video generation model, and the acquisition difficulty of sample data and the calculation force requirement during training are reduced.
The target video generation model used in the above embodiment may be trained in advance. The following illustrates a specific manner of training the target video generation model.
In some embodiments, the image generation model may be pre-trained based on the first sample text and the sample image that semantically matches the first sample text, resulting in a pre-trained model. And then, fine tuning the pre-training model added with the attention module based on the second sample text and the sample matching video semantically matched with the second sample text to obtain a target video generation model. The attention module is used for extracting time sequence characteristics in the sample matching video.
In this embodiment, the image generation model is trained on text-image pairs (i.e., the first sample text and the sample image). Since text-image pairs are available in large quantities and are generally of high quality, a large number of text-image pairs can be acquired to train a high-quality image generation model. Then, an attention module is added to the pre-training model obtained through this training, and text-video pairs (namely, the second sample text and the sample matching video) are used to fine-tune the pre-training model with the added attention module. On the one hand, the amount of text-video data required by the fine-tuning process is smaller, which reduces the model training cost; on the other hand, the fine-tuned model can extract time sequence features, so that the videos generated by the trained target video generation model are of higher quality.
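A schematic of the "add an attention module, then fine-tune" idea, written as a generic temporal self-attention layer in PyTorch that could be interleaved with the pre-trained image backbone's spatial layers. The tensor layout, the residual connection, and the suggestion to freeze the spatial weights during fine-tuning are assumptions, not taken from the disclosure:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, so the fine-tuned model can learn
    time sequence features while the pre-trained spatial layers are reused."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim); attend across frames for each spatial token
        b, f, t, d = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        h = self.norm(h)
        out, _ = self.attn(h, h, h)
        out = out.reshape(b, t, f, d).permute(0, 2, 1, 3)
        return x + out  # residual connection around the new temporal layer

# Fine-tuning idea: freeze the pre-trained spatial (image) layers and train only the
# newly inserted temporal attention layers on second sample text / sample matching video pairs.
```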
It is to be understood that the above training patterns are merely exemplary illustrations. The embodiment of the disclosure can also directly pretrain the initial video generation model by adopting the second sample text and the sample matching video semantically matched with the second sample text, so as to obtain the target video generation model. The sample matching video may be obtained by adopting the feature matching or keyword matching in the foregoing embodiments, which is not described herein.
In some embodiments, the second sample text may include a plurality of sample text blocks. On the basis, each sample text block and the sample matching video semantically matched with the sample text block can be used as one piece of sample data to train a target video generation model. Further, each piece of sample data for training the target video generation model may include, in addition to the sample text block and the sample matching video semantically matched with the sample text block, a sample video feature corresponding to the sample matching video semantically matched with the previous sample text block in the sample text where the sample text block is located.
For example, assuming that the sample text includes a sample text block a, a sample text block B, and a sample text block C in sequence, three pieces of sample data may be generated, which are respectively:
sample data 1, including a sample text block a and a sample matching video semantically matched with the sample text block a;
sample data 2, including sample text block B, sample matching video semantically matched with sample text block B, and sample video features corresponding to sample matching video semantically matched with sample text block a;
sample data 3, including sample text block C, sample matching video semantically matching sample text block C, and sample video features corresponding to sample matching video semantically matching sample text block B.
The initial video generation model may be pre-trained based on the sample data to obtain a target video generation model. The initial video generation model may be a pre-training model obtained by training based on the text image pair, or may be the untrained initial video generation model.
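A minimal sketch of how the sample records described above might be assembled into training tuples (the dictionary layout and the `extract_video_feature` helper are illustrative assumptions):

```python
def build_sample_data(sample_text_blocks, matching_videos, extract_video_feature):
    """sample_text_blocks[i] pairs with matching_videos[i]; each record after the first
    also carries the video feature of the previous block's sample matching video."""
    records, prev_feature = [], None
    for block, video in zip(sample_text_blocks, matching_videos):
        records.append({
            "sample_text_block": block,
            "sample_matching_video": video,
            "prev_sample_video_feature": prev_feature,   # None for sample data 1
        })
        prev_feature = extract_video_feature(video)
    return records

# For sample text blocks A, B, C this yields the three records described above:
# record 1 has no previous feature, record 2 carries A's video feature, record 3 carries B's.
```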
It should be noted that, in the embodiments where the sample text and the target text are divided into a plurality of text blocks, the initial video generation model has three inputs (namely, the sample text block, the sample matching video semantically matched with the sample text block, and the sample video feature corresponding to the sample matching video semantically matched with the previous sample text block in the sample text to which the sample text block belongs, referred to simply as the previous sample video feature), and the trained target video generation model also has three inputs. The previous sample video feature may be fed as an initial input into the backbone network of the initial video generation model and the target video generation model, where the backbone network may be, for example, a Stable Diffusion UNet. Alternatively, the previous sample video feature may be fed into the initial video generation model and the target video generation model as a conditioning input, i.e., as prior information, via a cross-attention mechanism. In contrast, in embodiments where the sample text and the target text are not divided into text blocks, i.e., the sample text is used as a single text block to train the initial video generation model, the initial video generation model has only two inputs (namely, the sample text block and the sample matching video semantically matched with the sample text block), and the trained target video generation model also has two inputs. Thus, the target video generation model employed differs between these situations. To address this, in the case where the sample text and the target text are not divided into text blocks, an initial video generation model and a target video generation model with three inputs, one of which is null, may still be employed. In this way, the versatility of the model can be improved.
In embodiments where the sample text includes a plurality of sample text blocks, a sample output video of the initial video generation model output based on each sample data may be obtained, where each sample data includes one sample text block and a sample matching video semantically matching the sample text block, or each sample data includes one sample text block, a sample matching video semantically matching the sample text block, and a sample video feature corresponding to a sample matching video semantically matching a previous sample text block in the sample text in which the sample text block is located. The initial video generation model can be trained based on the video characteristics corresponding to the sample output video corresponding to each sample text block and preset conditions, and a target video generation model is obtained.
The initial video generation model may be trained based on the similarity between the video features corresponding to the sample output videos of adjacent text blocks, and the preset condition may be determined based on this similarity. Ideally, the similarity between the video features corresponding to the sample output videos of adjacent text blocks should be greater than the similarity between the video features corresponding to the sample output video of either of the adjacent text blocks and the video features corresponding to other videos.
Based on this, the following preset conditions can be determined: the first similarity between the video features corresponding to the first sample output video and the video features corresponding to the second sample output video is smaller than the second similarity between the video features corresponding to the first sample output video and the video features corresponding to the reference video, and the first similarity is smaller than the third similarity between the video features corresponding to the second sample output video and the video features corresponding to the reference video.
The reference video may be any video in the video database. The video features in this embodiment may be obtained in the manner described in any of the foregoing embodiments, which is not repeated here. A loss function used for model training may be generated based on the preset condition, for example, a triplet loss function or a continuity loss function; alternatively, the loss function generated based on the preset condition may be weighted together with other loss functions, and the weighted loss function may be used for model training. The target video generation model is finally obtained by adjusting the model parameters of the initial video generation model so as to minimize the loss function used for model training. The model training process of some embodiments is shown in Fig. 4. In one training iteration, the (i+1)-th sample text block, the sample matching video semantically matched with the (i+1)-th sample text block, and the sample video features of the sample matching video semantically matched with the i-th sample text block may be input into the initial video generation model, and the target video corresponding to the (i+1)-th sample text block output by the initial video generation model may be obtained. Then, a loss function is calculated based on the target video corresponding to the (i+1)-th sample text block, the target video corresponding to the i-th sample text block, and a reference video in the video database, and the initial video generation model is trained based on the loss function to obtain the target video generation model.
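One way to turn the preset condition into a trainable objective is a triplet-style loss on the output-video features, sketched below in PyTorch. The margin, the cosine-distance formulation and the weighting against the main generation loss are assumptions; the sketch only follows the ideal case described above, in which adjacent sample output videos are more similar to each other than to a reference video:

```python
import torch
import torch.nn.functional as F

def adjacent_consistency_loss(feat_prev: torch.Tensor,  # features of the i-th sample output video
                              feat_next: torch.Tensor,  # features of the (i+1)-th sample output video
                              feat_ref: torch.Tensor,   # features of a reference video from the database
                              margin: float = 0.2) -> torch.Tensor:
    """Triplet-style loss: pull adjacent output videos together in feature space and
    push both away from the reference video by at least `margin`."""
    d_adj = 1 - F.cosine_similarity(feat_prev, feat_next, dim=-1)      # distance between adjacent outputs
    d_ref_prev = 1 - F.cosine_similarity(feat_prev, feat_ref, dim=-1)
    d_ref_next = 1 - F.cosine_similarity(feat_next, feat_ref, dim=-1)
    return (F.relu(d_adj - d_ref_prev + margin) + F.relu(d_adj - d_ref_next + margin)).mean()

# This term could be weighted and added to the main generation loss, e.g.:
# total_loss = generation_loss + consistency_weight * adjacent_consistency_loss(f_i, f_i1, f_ref)
```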
The target video generation model trained in this way generates target videos for adjacent text blocks with more similar features, i.e., the target videos have better feature consistency. Thus, by fusing the target videos, a long video (i.e., a fused video) with better feature consistency can be obtained.
As shown in fig. 5, an embodiment of the present disclosure further provides a video generating apparatus, including:
a text acquisition module 11 for acquiring a target text;
a retrieving module 12, configured to retrieve, from a video database, a matching video that semantically matches the target text;
an input module 13 for inputting the target text and the matching video into a target video generation model;
and the video acquisition module 14 is used for acquiring the target video output by the target video generation model.
In some embodiments, the retrieval module 12 is specifically configured to: acquiring semantic features corresponding to the target text; acquiring video features corresponding to a plurality of candidate videos in the video database respectively, wherein the video features are used for representing semantic information of the corresponding candidate videos; and determining matching videos which are matched with the target text semantically from the plurality of candidate videos based on the semantic features corresponding to the target text and the video features corresponding to the plurality of candidate videos respectively.
In some embodiments, the retrieval module 12 is specifically configured to: determining the similarity between the semantic features corresponding to the target text and the video features corresponding to each candidate video in the plurality of candidate videos; and determining candidate videos corresponding to at least one video feature with the similarity of semantic features corresponding to the target text from high to low as the matching videos.
In some embodiments, the retrieval module 12 is specifically configured to: determining at least one key frame from the candidate video; extracting features of each key frame in the at least one key frame to obtain features of each key frame; and acquiring the video features corresponding to the candidate video based on the features of each key frame.
In some embodiments, the retrieval module 12 is specifically configured to: obtaining keyword labels respectively corresponding to a plurality of candidate videos in the video database, wherein the keyword labels are used for representing semantic categories to which the corresponding candidate videos belong; determining the matching degree between the target text and the keyword label corresponding to each candidate video in the plurality of candidate videos; and determining candidate videos corresponding to at least one keyword tag with the matching degree of the target text from high to low as the matching videos.
In some embodiments, the target video generation model is trained based on the following modules: the first pre-training module is used for pre-training the image generation model based on a first sample text and a sample image which is matched with the first sample text semantically to obtain a pre-training model; the fine tuning module is used for fine tuning the pre-training model added with the attention module based on a second sample text and a sample matching video semantically matched with the second sample text to obtain the target video generation model; the attention module is used for extracting time sequence characteristics in the sample matching video.
In some embodiments, the target text comprises a plurality of text blocks, the matching video semantically matching the target text comprises matching video semantically matching each text block of the plurality of text blocks, and the target video comprises a target video corresponding to each text block; the apparatus further comprises: and the fusion module is used for fusing the target video corresponding to each text block to obtain a fused video.
In some embodiments, the target video corresponding to the i+1th text block of the plurality of text blocks is generated by the target video generation model based on the i+1th text block, a matching video semantically matching the i+1th text block, and a video feature corresponding to the target video corresponding to the i th text block of the plurality of text blocks, i being a positive integer.
In some embodiments, the target video generation model is trained based on the following modules: pre-training an initial video generation model based on at least one piece of sample data to obtain the target video generation model; wherein each piece of sample data in the at least one piece of sample data includes: sample text blocks; sample matching videos semantically matched with the sample text blocks; sample video features corresponding to sample matching videos semantically matched with a previous sample text block in sample text to which the sample text block belongs.
In some embodiments, the target video generation model is trained based on: the input module is used for inputting a plurality of pieces of sample data into the initial video generation model; each sample data in the plurality of sample data comprises one sample text block in a plurality of sample text blocks included in a sample text and a sample matching video semantically matched with the sample text block; the sample video acquisition module is used for acquiring a sample output video corresponding to each sample text block output by the initial video generation model; the training module is used for training the initial video generation model based on the video characteristics corresponding to the sample output video corresponding to each sample text block and preset conditions to obtain the target video generation model; the preset conditions include: a first similarity between a video feature corresponding to a first sample output video and a video feature corresponding to a second sample output video is less than a second similarity between a video feature corresponding to the first sample output video and a video feature corresponding to a reference video, and the first similarity is less than a third similarity between a video feature corresponding to the second sample output video and a video feature corresponding to the reference video; the first sample output video and the second sample output video are respectively sample output videos corresponding to adjacent sample text blocks, and the reference video is a video obtained from a video database.
In some embodiments, each sample data of the plurality of sample data further includes a sample video feature corresponding to a sample matching video that semantically matches a previous sample text block of the sample text to which the sample text block belongs.
In some embodiments, the retrieval module 12 is specifically configured to: extracting key information from the target text, wherein the key information is related to visual characteristics of the target text; and searching matching videos which are matched with the target text semantically from a video database based on the key information.
In some embodiments, the apparatus further comprises: and the smoothing processing module is used for carrying out smoothing processing on adjacent target videos in the fusion video.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer device comprising at least a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of the preceding embodiments when executing the program.
FIG. 6 illustrates a more specific computing device hardware architecture diagram provided by embodiments of the present disclosure, which may include: a processor 21, a memory 22, an input/output interface 23, a communication interface 24 and a bus 25. Wherein the processor 21, the memory 22, the input/output interface 23 and the communication interface 24 are in communication connection with each other inside the device via a bus 25.
The processor 21 may be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., for executing relevant programs to implement the technical solutions provided by the embodiments of the present disclosure. The processor 21 may also include a graphics card, which may be, for example, an NVIDIA Titan X graphics card or a 1080 Ti graphics card.
The Memory 22 may be implemented in the form of Read Only Memory (ROM), random access Memory (Random Access Memory, RAM), static storage devices, dynamic storage devices, and the like. The memory 22 may store an operating system and other application programs, and when the technical solutions provided by the embodiments of the present disclosure are implemented by software or firmware, relevant program codes are stored in the memory 22 and invoked by the processor 21 for execution.
The input/output interface 23 is used for connecting with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 24 is used to connect a communication module (not shown) to enable communication interaction between the device and other devices. The communication module may implement communication in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth).
Bus 25 includes a path for transferring information between components of the device, such as processor 21, memory 22, input/output interface 23, and communication interface 24.
It should be noted that, although only the processor 21, the memory 22, the input/output interface 23, the communication interface 24, and the bus 25 are shown for the above device, in specific implementation the device may further include other components necessary for normal operation. Furthermore, those skilled in the art will appreciate that the above device may include only the components necessary for implementing the embodiments of the present disclosure, and need not include all of the components shown in the figure.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the previous embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that the embodiments of the present disclosure may be implemented by means of software plus a necessary general-purpose hardware platform. Based on such an understanding, the technical solutions of the embodiments of the present disclosure, in essence or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present disclosure or in some parts of the embodiments.
The systems, apparatuses, modules, or units set forth in the above embodiments may specifically be implemented by a computer apparatus or entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in this disclosure are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus embodiments are described relatively briefly because they are substantially similar to the method embodiments, and for relevant points reference may be made to the description of the method embodiments. The apparatus embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separate, and when the embodiments of the present disclosure are implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The foregoing is merely a specific implementation of the embodiments of the present disclosure. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the embodiments of the present disclosure, and such improvements and modifications shall also fall within the protection scope of the embodiments of the present disclosure.

Claims (16)

1. A method of video generation, the method comprising:
acquiring a target text;
retrieving matching videos semantically matched with the target text from a video database;
inputting the target text and the matching video into a target video generation model;
and obtaining the target video output by the target video generation model.
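Purely for illustration, claim 1 can be read as a four-step pipeline. In the sketch below, `video_db.retrieve` and `model.generate` are hypothetical placeholders standing in for the retrieval step and the target video generation model; they are not interfaces defined by this disclosure.

# Illustrative sketch only: the four steps of claim 1 expressed as a function.
def generate_video(target_text: str, video_db, model):
    # step 1: acquire the target text (passed in directly here)
    # step 2: retrieve a matching video semantically matched with the text
    matching_video = video_db.retrieve(target_text)
    # step 3: input the target text and the matching video into the model
    # step 4: obtain the target video output by the model
    return model.generate(text=target_text, reference=matching_video)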
2. The method of claim 1, wherein retrieving matching videos from a video database that semantically match the target text comprises:
acquiring semantic features corresponding to the target text;
acquiring video features corresponding to a plurality of candidate videos in the video database respectively, wherein the video features are used for representing semantic information of the corresponding candidate videos;
and determining matching videos which are matched with the target text semantically from the plurality of candidate videos based on the semantic features corresponding to the target text and the video features corresponding to the plurality of candidate videos respectively.
3. The method of claim 2, wherein the determining, from the plurality of candidate videos, a matching video that semantically matches the target text based on semantic features corresponding to the target text and video features corresponding to the plurality of candidate videos, respectively, comprises:
determining a similarity between the semantic features corresponding to the target text and the video feature corresponding to each candidate video in the plurality of candidate videos;
and determining, as the matching videos, candidate videos corresponding to at least one video feature ranked from high to low in terms of similarity with the semantic features corresponding to the target text.
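As a non-limiting illustration of claims 2 and 3, the matching step can be sketched as ranking candidate videos by cosine similarity between the text's semantic feature and each candidate's video feature, then keeping the top-k candidates. The NumPy code below assumes the features are already extracted as fixed-length vectors; the choice of cosine similarity is an assumption of the example.

import numpy as np

def top_k_matching_videos(text_feature: np.ndarray,
                          video_features: np.ndarray,
                          k: int = 1) -> np.ndarray:
    """Rank candidate videos by cosine similarity between the text's semantic
    feature (shape (D,)) and each candidate's video feature (shape (N, D));
    return the indices of the top-k candidates in descending similarity."""
    text_norm = text_feature / np.linalg.norm(text_feature)
    video_norms = video_features / np.linalg.norm(video_features, axis=1, keepdims=True)
    similarities = video_norms @ text_norm
    return np.argsort(similarities)[::-1][:k]

Choosing k greater than one simply keeps several matching videos when more than one is needed.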
4. The method according to claim 2, wherein the obtaining video features respectively corresponding to the plurality of candidate videos in the video database includes:
determining at least one key frame from the candidate video;
performing feature extraction on each key frame of the at least one key frame to obtain a feature of each key frame;
and acquiring the video feature corresponding to the candidate video based on the features of the key frames.
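One possible realization of claim 4, sketched under the assumptions that key frames are taken by uniform temporal sampling and that `frame_encoder` is a hypothetical image encoder returning a 1-D feature per frame; the per-frame features are mean-pooled into a single video feature. Neither the sampling rule nor the pooling choice is mandated by the claim.

import numpy as np

def video_feature_from_key_frames(frames: np.ndarray,
                                  frame_encoder,
                                  stride: int = 16) -> np.ndarray:
    """Pick key frames by uniform temporal sampling (one simple choice),
    encode each key frame with `frame_encoder`, and mean-pool the per-frame
    features into a single video feature."""
    key_frames = frames[::stride]
    frame_feats = np.stack([frame_encoder(f) for f in key_frames])
    return frame_feats.mean(axis=0)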
5. The method of claim 1, wherein retrieving matching videos from a video database that semantically match the target text comprises:
obtaining keyword labels respectively corresponding to a plurality of candidate videos in the video database, wherein the keyword labels are used for representing semantic categories to which the corresponding candidate videos belong;
determining the matching degree between the target text and the keyword label corresponding to each candidate video in the plurality of candidate videos;
and determining, as the matching videos, candidate videos corresponding to at least one keyword label ranked from high to low in terms of the degree of matching with the target text.
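As an illustration of claim 5, the matching degree can be approximated by counting how many of a candidate's keyword labels appear in the target text; `tagged_videos` below is a hypothetical mapping from candidate video ids to their keyword labels, and the overlap count is only one possible matching measure.

def rank_by_keyword_labels(target_text: str,
                           tagged_videos: dict[str, set[str]],
                           k: int = 1) -> list[str]:
    """Score each candidate video by how many of its keyword labels appear in
    the target text, and return candidate ids from high to low matching degree."""
    words = set(target_text.lower().split())
    scores = {vid: len(labels & words) for vid, labels in tagged_videos.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]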
6. The method of claim 1, wherein the target video generation model is trained based on:
pre-training an image generation model based on a first sample text and a sample image semantically matched with the first sample text to obtain a pre-training model;
fine-tuning, based on a second sample text and a sample matching video semantically matched with the second sample text, the pre-training model to which an attention module has been added, to obtain the target video generation model; the attention module is used for extracting temporal features from the sample matching video.
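For illustration of claim 6, the added attention module could take the form of self-attention applied across the time axis of per-frame features. The PyTorch sketch below is one plausible form of such a module and is not asserted to be the exact architecture of this disclosure; the frozen-backbone comment describes one common fine-tuning choice, likewise an assumption.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative attention module operating across the time axis of
    per-frame features shaped (batch, frames, dim)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)   # self-attention over the frame axis
        return self.norm(x + out)     # residual connection

# During fine-tuning, one option is to freeze the pre-trained image backbone
# and train only the newly added temporal attention parameters, e.g.:
# for p in pretrained_backbone.parameters():
#     p.requires_grad = False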
7. The method of claim 1, wherein the target text comprises a plurality of text blocks, the matching video semantically matched with the target text comprises a matching video semantically matched with each text block of the plurality of text blocks, and the target video comprises a target video corresponding to each text block; the method further comprises:
and fusing the target video corresponding to each text block to obtain a fused video.
8. The method of claim 7, wherein the target video corresponding to an (i+1)-th text block of the plurality of text blocks is generated by the target video generation model based on the (i+1)-th text block, a matching video semantically matched with the (i+1)-th text block, and a video feature corresponding to the target video corresponding to an i-th text block of the plurality of text blocks, i being a positive integer.
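Claims 7 and 8 can be illustrated as an autoregressive loop over text blocks, where generation for each block is additionally conditioned on the video feature of the previous block's output and the per-block videos are then fused. `model.generate`, `model.encode_video`, and `fuse` below are hypothetical placeholders used only to show the data flow.

def generate_fused_video(text_blocks, video_db, model, fuse):
    """Generate one target video per text block; from the second block on,
    also condition on the video feature of the previous block's output,
    then fuse the per-block videos into one video."""
    outputs, prev_feature = [], None
    for block in text_blocks:
        matching = video_db.retrieve(block)
        clip = model.generate(text=block, reference=matching,
                              prev_video_feature=prev_feature)
        prev_feature = model.encode_video(clip)
        outputs.append(clip)
    return fuse(outputs)   # e.g. concatenation followed by boundary smoothing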
9. The method of claim 7, wherein the target video generation model is trained based on:
pre-training an initial video generation model based on at least one piece of sample data to obtain the target video generation model; wherein each piece of sample data in the at least one piece of sample data includes:
a sample text block;
a sample matching video semantically matched with the sample text block;
and a sample video feature corresponding to a sample matching video semantically matched with a previous sample text block in the sample text to which the sample text block belongs.
10. The method of claim 7, wherein the target video generation model is trained based on:
inputting a plurality of pieces of sample data into an initial video generation model; each sample data in the plurality of sample data comprises one sample text block in a plurality of sample text blocks included in a sample text and a sample matching video semantically matched with the sample text block;
acquiring a sample output video corresponding to each sample text block output by the initial video generation model;
training the initial video generation model based on the video characteristics corresponding to the sample output video corresponding to each sample text block and preset conditions to obtain the target video generation model; the preset conditions include:
a first similarity between a video feature corresponding to a first sample output video and a video feature corresponding to a second sample output video is less than a second similarity between a video feature corresponding to the first sample output video and a video feature corresponding to a reference video, and the first similarity is less than a third similarity between a video feature corresponding to the second sample output video and a video feature corresponding to the reference video; the first sample output video and the second sample output video are respectively sample output videos corresponding to adjacent sample text blocks, and the reference video is a video obtained from a video database.
11. The method of claim 10, wherein each piece of sample data in the plurality of pieces of sample data further comprises a sample video feature corresponding to a sample matching video semantically matched with a previous sample text block in the sample text to which the sample text block belongs.
12. The method of claim 7, wherein the method further comprises:
and smoothing the adjacent target videos in the fusion video.
13. The method of claim 1, wherein retrieving matching videos from a video database that semantically match the target text comprises:
extracting key information from the target text, wherein the key information is related to visual characteristics of the target text;
and searching matching videos which are matched with the target text semantically from a video database based on the key information.
14. A video generating apparatus, the apparatus comprising:
the text acquisition module is used for acquiring a target text;
the retrieval module is used for retrieving matching videos which are semantically matched with the target text from a video database;
the input module is used for inputting the target text and the matched video into a target video generation model;
and the video acquisition module is used for acquiring the target video output by the target video generation model.
15. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method of any one of claims 1 to 13.
16. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 13 when executing the program.
CN202311078346.9A 2023-08-24 2023-08-24 Video generation method and device, medium and computer equipment Pending CN117095085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311078346.9A CN117095085A (en) 2023-08-24 2023-08-24 Video generation method and device, medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311078346.9A CN117095085A (en) 2023-08-24 2023-08-24 Video generation method and device, medium and computer equipment

Publications (1)

Publication Number Publication Date
CN117095085A true CN117095085A (en) 2023-11-21

Family

ID=88772997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311078346.9A Pending CN117095085A (en) 2023-08-24 2023-08-24 Video generation method and device, medium and computer equipment

Country Status (1)

Country Link
CN (1) CN117095085A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051420A (en) * 2021-04-15 2021-06-29 山东大学 Robot vision man-machine interaction method and system based on text generation video
CN113590881A (en) * 2021-08-09 2021-11-02 北京达佳互联信息技术有限公司 Video clip retrieval method, and training method and device of video clip retrieval model
CN115967833A (en) * 2021-10-09 2023-04-14 北京字节跳动网络技术有限公司 Video generation method, device, equipment and storage medium
CN116320659A (en) * 2023-03-08 2023-06-23 网易(杭州)网络有限公司 Video generation method and device
CN116233491A (en) * 2023-05-04 2023-06-06 阿里巴巴达摩院(杭州)科技有限公司 Video generation method and server

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yu Haitao; Yang Xiaoshan; Xu Changsheng: "Adversarial Video Generation Method Based on Multimodal Input", Journal of Computer Research and Development (计算机研究与发展), no. 07, 7 July 2020 (2020-07-07) *

Similar Documents

Publication Publication Date Title
CN112668671B (en) Method and device for acquiring pre-training model
US11373390B2 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN112533051B (en) Barrage information display method, barrage information display device, computer equipment and storage medium
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN112188306B (en) Label generation method, device, equipment and storage medium
KR102576344B1 (en) Method and apparatus for processing video, electronic device, medium and computer program
CN109582825B (en) Method and apparatus for generating information
CN111241340A (en) Video tag determination method, device, terminal and storage medium
CN111723784A (en) Risk video identification method and device and electronic equipment
CN111310025B (en) Model training method, data processing device and related equipment
CN111046904B (en) Image description method, image description device and computer storage medium
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
CN110347869B (en) Video generation method and device, electronic equipment and storage medium
CN111444695A (en) Text generation method, device and equipment based on artificial intelligence and storage medium
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN111611409B (en) Case analysis method integrated with scene knowledge and related equipment
CN113822127A (en) Video processing method, video processing device, video processing equipment and storage medium
CN117095085A (en) Video generation method and device, medium and computer equipment
CN115358777A (en) Advertisement putting processing method and device of virtual world
CN112333554B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN115269919A (en) Method and device for determining quality of short video, electronic equipment and storage medium
CN114398517A (en) Video data acquisition method and device
KR20210053864A (en) Method and System for Auto Multiple Image Captioning
CN113704544A (en) Video classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination