CN115186133A - Video generation method and device, electronic equipment and medium


Info

Publication number
CN115186133A
CN115186133A CN202210863915.XA
Authority
CN
China
Prior art keywords
video
characteristic information
text
videos
determining
Prior art date
Legal status
Pending
Application number
CN202210863915.XA
Other languages
Chinese (zh)
Inventor
胡敏
Current Assignee
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202210863915.XA
Publication of CN115186133A
Legal status: Pending

Classifications

    • G06F 16/7844 Retrieval of video data characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F 16/75 Clustering; Classification of video data
    • G06F 16/7847 Retrieval of video data characterised by using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F 40/295 Natural language analysis; Recognition of textual entities; Named entity recognition
    • G06V 20/41 Scenes; Scene-specific elements in video content; Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Scenes; Scene-specific elements in video content; Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application discloses a video generation method, a video generation device, electronic equipment and a medium, and belongs to the technical field of artificial intelligence. The video generation method comprises the following steps: extracting a behavior descriptor and a visual descriptor in a first text; determining a target video segment matched with the behavior descriptor from N first videos, and determining a target video frame matched with the visual descriptor from the N first videos; generating a target video based on the target video clip and the target video frame; the N first videos are N videos similar to the first text in a video library; n is an integer greater than 1.

Description

Video generation method, device, electronic equipment and medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a video generation method, a video generation device, electronic equipment and a medium.
Background
With the rapid development of deep neural networks, video generation algorithms have become increasingly diverse, making it possible to directly generate a video corresponding to the meaning of a text description.
In the related art, to generate a video from a text, the text is usually input into a network model, the text modality of the network model extracts the text features corresponding to the text, and the video modality of the network model then generates the video directly from those text features.
However, in the above scheme, the video is generated by the network model directly from the extracted text features. Therefore, when the information between the text modality and the video modality of the network model is poorly aligned, the features may not be extracted accurately, so that the generated video does not match the text description.
Disclosure of Invention
An embodiment of the present application provides a video generation method, an apparatus, an electronic device, and a medium, which can solve the problem that a generated video does not match its text description.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a video generation method, where the method includes: extracting a behavior descriptor and a visual descriptor in a first text; determining a target video segment matched with the behavior descriptor from N first videos, and determining a target video frame matched with the visual descriptor from the N first videos; generating a target video based on the target video clip and the target video frame; the N first videos are N videos similar to the first text in a video library; n is an integer greater than 1.
In a second aspect, an embodiment of the present application provides a video generating apparatus, including: the device comprises an extraction module, a determination module and a generation module; the extraction module is used for extracting the behavior descriptors and the visual descriptors in the first text; a determining module, configured to determine a target video segment matching the behavior descriptor from N first videos, and determine a target video frame matching the visual descriptor from the N first videos; the generating module is used for generating a target video based on the target video clip and the target video frame determined by the determining module; the N first videos are N videos similar to the first text in a video library; n is an integer greater than 1.
In a third aspect, embodiments of the present application provide an electronic device, which includes a processor and a memory, where the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor, implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, stored on a storage medium, for execution by at least one processor to implement the method according to the first aspect.
In this embodiment of the application, when generating a video based on a text, the electronic device may extract a behavior descriptor and a visual descriptor in a first text, then determine a target video segment matching the behavior descriptor and a target video frame matching the visual descriptor from N first videos similar to the first text in a video library, and finally generate a target video based on the target video segment and the target video frame; wherein N is an integer greater than 1. In this way, the behavior descriptor describing the behavior of the subject in the text is used to search the first videos for the video segment matching that behavior, and the visual descriptor describing the visual presentation in the text is used to search the first videos for the video frame matching that presentation, so that after the matched video segment and video frame are fused, a target video that better fits the text and is more realistic can be obtained, which ensures the video quality of the finally generated video.
Drawings
Fig. 1 is a schematic flowchart of a video generation method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a multi-modal feature extraction model provided in an embodiment of the present application;
FIG. 3 is a flow chart of a multi-modal feature extraction model according to an embodiment of the present disclosure;
FIG. 4 is a second flowchart of a multi-modal feature extraction model according to an embodiment of the present disclosure;
FIG. 5 is a third flowchart of a multi-modal feature extraction model according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 8 is a hardware schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the application can be implemented in orders other than those illustrated or described herein. Objects distinguished by "first", "second", and the like are generally of one class, and the number of objects is not limited; for example, a first object may be one or more than one. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The video generation method, the video generation device, the electronic device, and the video generation medium provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
In the related art, when generating a video based on a text, the electronic device roughly uses one of the following two schemes:
one scheme is that a multi-mode learning generation method is adopted, a text modal model which is trained completely independently is used for extracting text features from a text input by a user, and then the text features are input into a generation type confrontation network to generate a video. The feature dimensions of the text modal model and the video modal model applied by the method may be different, so that the finally extracted features have poor information and cannot be well matched, or when the effective information of one modal model is less, the model cannot extract useful features, so that the generated video is not matched with the text description.
In another scheme, a complete video is generally divided into multiple frames of images; when each frame of image is processed, a fixed configuration is selected from a text parameter pool and loaded onto a specific part of the image, a single frame of image can be configured by combining multiple parts, and the processed multiple frames of images are then spliced to generate a customized video. However, this generation method may result in a generated video that lacks continuity and realism, and spatial information is lost.
In the video generation method, the apparatus, the electronic device, and the medium provided in the embodiments of the present application, when generating a video based on a text, the electronic device may extract a behavior descriptor and a visual descriptor in a first text, determine a target video segment matching the behavior descriptor and a target video frame matching the visual descriptor from N first videos similar to the first text in a video library, and generate a target video based on the target video segment and the target video frame; wherein N is an integer greater than 1. In this way, the behavior descriptor describing the behavior of the subject in the text is used to search the first videos for the video segment matching that behavior, and the visual descriptor describing the visual presentation in the text is used to search the first videos for the video frame matching that presentation, so that after the matched video segment and video frame are fused, a target video that better fits the text and is more realistic can be obtained, which ensures the video quality of the finally generated video.
The execution subject of the video generation method provided in this embodiment may be a video generation apparatus, and the video generation apparatus may be an electronic device, or may also be a control module or a processing module in the electronic device. The technical solutions provided in the embodiments of the present application are described by taking electronic devices as examples.
An embodiment of the present application provides a video generation method, as shown in fig. 1, the video generation method may include the following steps 201 to 203:
step 201: the electronic device extracts the behavioral descriptors and the visual descriptors in the first text.
In the embodiment of the present application, the behavior descriptor may be a word used to describe the behavior of the subject in the first text, such as running, jumping, flying, swimming, and the like.
In the embodiment of the present application, the visual descriptor may be a word or phrase used to describe the visual presentation in the first text, for example, a person wearing red clothes, a white puppy, a blue sky and sea, or the like.
In the embodiment of the application, after the electronic device obtains the first text, the electronic device may perform word segmentation on the text, perform part-of-speech decomposition on the word segmentation, and extract the visual descriptor and the behavior descriptor respectively.
In an embodiment of the present application, the first text may include at least one behavior descriptor.
In an embodiment of the present application, the first text may include at least one visual descriptor.
Step 202: the electronic device determines a target video segment matching the behavior descriptor from the N first videos, and determines a target video frame matching the visual descriptor from the N first videos.
Wherein N is an integer greater than 1.
In this embodiment, the N first videos are N videos similar to the first text in the video library.
In an embodiment of the present application, the video library is pre-stored with a plurality of videos, each of which is composed of at least one video segment and/or at least two video frames.
In this embodiment of the application, when there are multiple video segments matching the behavior descriptor, the target video segment may be a video segment with the highest similarity to the behavior descriptor.
In this embodiment of the application, in a case that the first text contains a plurality of behavior descriptors, the electronic device may search the first videos for a video segment corresponding to each behavior descriptor. Illustratively, the target video segment may include a video segment corresponding to each of the plurality of behavior descriptors, or may include video segments corresponding to only some of the behavior descriptors.
In this embodiment of the application, in a case that the first text contains a plurality of visual descriptors, the electronic device may search the first videos for a video frame corresponding to each visual descriptor. Illustratively, the target video frame may include a video frame corresponding to each of the plurality of visual descriptors, or may include video frames corresponding to only some of the visual descriptors.
Step 203: the electronic device generates a target video based on the target video segment and the target video frame.
In the embodiment of the application, after the target video segment and the target video frame are determined, the electronic device can fuse the target video segment and the target video frame, so as to generate the target video.
Optionally, in this embodiment of the present application, the target video segment includes a plurality of video segments. The step 203 "the electronic device generates the target video based on the target video segment and the target video frame" may include the steps 203a and 203b:
step 203a: the electronic equipment sequences the plurality of video clips according to the word order of the behavior descriptors in the first text.
In this embodiment, the electronic device may sort the determined video segments according to the original word order of the behavior descriptor in the first text, so that the determined video segments have continuity and reality in time sequence.
Step 203b: and the electronic equipment fuses the target video frame and the sequenced video clips according to the word order of the visual descriptor in the first text to generate a target video.
In this embodiment of the application, the electronic device may input the target video frame into a general detection model to obtain subject information and background information, then fuse the subject information with the target video segment by using a 3D rendering technique according to the word order of the visual descriptor in the first text, so that the subject in the target video frame moves, and finally splice the result with the background information to generate the target video.
For example, the general detection model may be an object detection model that performs multi-label classification of the objects detected in images (for example, YOLOv3), trained to convergence on the ImageNet dataset, and it is used to extract the subject information from the target video frame.
Illustratively, the electronic device finds, in the video library, the several first videos most similar to the text; it then searches the first videos for the target video segments matching the behavior descriptors in the first text, extracts the motion information corresponding to the selected target video segments, and splices the motion information of the selected video segments in the order of the behavior descriptors in the first text. Meanwhile, the electronic device finds, in the selected videos, the frame most similar to each visual descriptor, obtains the subject and the background by using the general detection model, and finally merges the subject and the motion information with the background by using a 3D rendering technique and splices the results to generate the video.
Therefore, the electronic device fuses the target video segment and the target video frame according to the word order in the first text, which ensures the continuity between the target video segment and the target video frame, and ensures the accuracy of the spatial information of the video while the finally generated video matches the text content of the first text. An illustrative sketch of this ordering-and-fusion step is shown below.
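For illustration only, the ordering and fusion of steps 203a and 203b may be sketched as follows; the data structures and the plain concatenation used here are assumptions standing in for the detection and 3D-rendering based fusion described above.

```python
# Illustrative sketch: order matched video segments by the word order of their
# behavior descriptors in the first text, then interleave the target frames by
# the word order of their visual descriptors. The dataclasses and the naive
# "fusion" (plain concatenation) are assumptions, not the disclosed pipeline.
from dataclasses import dataclass
from typing import List

@dataclass
class MatchedSegment:
    descriptor: str        # behavior descriptor this segment matches
    text_position: int     # position of the descriptor in the first text
    frames: List[str]      # placeholder for the segment's frames

@dataclass
class MatchedFrame:
    descriptor: str        # visual descriptor this frame matches
    text_position: int
    frame: str

def generate_target_video(segments: List[MatchedSegment],
                          frames: List[MatchedFrame]) -> List[str]:
    # Step 203a: sort segments by the behavior descriptors' word order.
    ordered_segments = sorted(segments, key=lambda s: s.text_position)
    # Step 203b: fuse target frames into the ordered segments by word order.
    timeline: List[str] = []
    frame_iter = iter(sorted(frames, key=lambda f: f.text_position))
    next_frame = next(frame_iter, None)
    for seg in ordered_segments:
        # Place every visual frame whose descriptor precedes this segment's
        # behavior descriptor in the text before the segment's motion frames.
        while next_frame is not None and next_frame.text_position < seg.text_position:
            timeline.append(next_frame.frame)
            next_frame = next(frame_iter, None)
        timeline.extend(seg.frames)
    while next_frame is not None:
        timeline.append(next_frame.frame)
        next_frame = next(frame_iter, None)
    return timeline
```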
In the video generation method provided by the embodiment of the application, when a video is generated based on a text, the electronic device may extract a behavior descriptor and a visual descriptor in a first text, determine a target video segment matching the behavior descriptor and a target video frame matching the visual descriptor from N first videos similar to the first text in a video library, and generate a target video based on the target video segment and the target video frame; wherein N is an integer greater than 1. In this way, the behavior descriptor describing the behavior of the subject in the text is used to search the first videos for the video segment matching that behavior, and the visual descriptor describing the visual presentation in the text is used to search the first videos for the video frame matching that presentation, so that after the matched video segment and video frame are fused, a target video that better fits the text and is more realistic can be obtained, which ensures the video quality of the finally generated video.
Optionally, in this embodiment of the application, the step 201 of "the electronic device extracts the behavior descriptor and the visual descriptor in the first text" may include steps 201a and 201b:
step 201a: after the electronic equipment inputs the first text into the named entity recognition model, word segmentation is carried out on the first text to obtain a plurality of word segments.
In an embodiment of the present application, the named entity recognition model may be: named Entity Recognition (NER). Further, the NER refers to a model that can recognize and extract entity words in the text.
Step 201b: the electronic equipment identifies the part of speech of each participle in the multiple participles based on the named entity identification model, and determines the behavior descriptor and the visual descriptor in the first text.
In this embodiment, the electronic device may use the NER model to segment the first text into a plurality of participles, and then extract the visual descriptors and the behavior descriptors in the plurality of participles respectively.
Therefore, the electronic equipment inputs the first text into the entity recognition model, performs word segmentation on the first text, and determines the behavior descriptor and the visual descriptor in the multiple word segments, so that the electronic equipment can more accurately acquire the feature information contained in the first text.
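For illustration only, the descriptor extraction in steps 201a and 201b may be sketched as follows; the use of spaCy and the part-of-speech heuristics (verbs as behavior descriptors, noun chunks as visual descriptors) are assumptions standing in for the named entity recognition model of the embodiment.

```python
# Illustrative sketch: extract behavior descriptors (verbs) and visual
# descriptors (adjective/noun chunks) from a text. spaCy and its
# en_core_web_sm model are assumed to be installed; the embodiment itself
# uses a named entity recognition (NER) model rather than these heuristics.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_descriptors(first_text: str):
    doc = nlp(first_text)
    # Behavior descriptors: words describing the subject's behavior (e.g. "running").
    behavior_descriptors = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
    # Visual descriptors: noun chunks describing what is visually presented
    # (e.g. "a white puppy", "a person wearing red clothes").
    visual_descriptors = [chunk.text for chunk in doc.noun_chunks]
    return behavior_descriptors, visual_descriptors

if __name__ == "__main__":
    behaviors, visuals = extract_descriptors(
        "A person wearing red clothes is running on the beach under a blue sky."
    )
    print(behaviors)  # e.g. ['wear', 'run']
    print(visuals)    # e.g. ['A person', 'red clothes', 'the beach', 'a blue sky']
```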
Optionally, in this embodiment of the present application, the video library includes a plurality of videos and the video feature information of each of the plurality of videos.
The following describes a process for extracting video feature information of each video in this embodiment:
optionally, in this embodiment, before the step 202, the video generation method provided in this embodiment may further include the following steps 301 to 302:
step 301: the electronic equipment inputs a plurality of videos in a video library into the multi-mode feature extraction model for feature extraction, and outputs video feature information of each video in the plurality of videos.
In this embodiment of the present application, the video library may further include a video auxiliary information list, where the auxiliary information list includes the video name of each video and the type of the subject of each video.
Step 302: the electronic equipment stores the video characteristic information of each video into a video library.
In the embodiment of the application, the electronic device may extract the video feature information of all videos in the video library by using the multi-modal feature extraction model, load all the video feature information into a retrieval engine, and record the video name and the type of the video subject in the auxiliary information list.
Therefore, the video feature information of each video in the video library is extracted by directly utilizing the multi-mode feature extraction model, so that the subsequent electronic equipment can directly match the text features of the text to be used with the video feature information in the video library, and the matching efficiency is improved.
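For illustration only, the offline indexing of steps 301 and 302 may be sketched as follows; the feature extractor is treated as a stub returning a fixed-length vector per video, and the file names and on-disk layout are assumptions.

```python
# Illustrative sketch: extract a feature vector for every video in the library
# with a multi-modal feature extraction model and store the vectors, together
# with an auxiliary information list (video name, subject type), for later
# retrieval. extract_video_features() is a stand-in for the model's video
# branch; names and file layout are assumptions.
import json
import numpy as np

def extract_video_features(video_path: str) -> np.ndarray:
    # Placeholder for the multi-modal model's video branch
    # (e.g. 16 uniformly sampled frames -> Vision-transformer -> GSM).
    return np.random.rand(512).astype(np.float32)

def build_video_index(video_entries, feature_file="video_features.npy",
                      aux_file="video_aux_list.json"):
    """video_entries: list of dicts with 'path', 'name' and 'subject_type'."""
    features, aux_list = [], []
    for entry in video_entries:
        features.append(extract_video_features(entry["path"]))
        aux_list.append({"name": entry["name"],
                         "subject_type": entry["subject_type"]})
    np.save(feature_file, np.stack(features))          # video feature information
    with open(aux_file, "w", encoding="utf-8") as f:   # auxiliary information list
        json.dump(aux_list, f, ensure_ascii=False, indent=2)
```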
Optionally, in this embodiment of the application, after extracting the visual descriptor of the first text, the electronic device may extract a main body in the visual descriptor, then query the category mapping table to determine a main body type of the main body in the visual descriptor in the first text, and finally screen out, from the video library, a video corresponding to a video type matching the main body type based on the main body type, thereby forming a new video feature search library.
Illustratively, the electronic device can extract subjects in the visual descriptors using a named entity recognition model (e.g., a NER pre-training model).
For example, the category mapping table may be shown in table 1 below. It should be noted that only a part of the subject types and a part of the detailed categories corresponding to the subject types are shown in table 1. In practical applications, the category mapping table may include more main types and corresponding detailed categories, which are not described herein again.
TABLE 1
Category | Detailed classes
Human | Man, woman, old man, child
Quadruped animal | Dog, cat, bear, rabbit, ...
Bird | Chicken, duck, goose, parrot, ...
Plant | Flower, tree, ...
Aircraft | Passenger plane, cargo plane, aircraft, drone, ...
Vehicle | Car, truck, bus, ...
Commodity | Toy, daily necessities, ...
Therefore, the range of the electronic equipment for searching the video is narrowed, and the searching efficiency of the electronic equipment is improved.
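For illustration only, the subject-type filtering described above may be sketched as follows, assuming the category mapping of Table 1 is available as a dictionary; the table content shown in the code is abbreviated.

```python
# Illustrative sketch: map the subject extracted from the visual descriptor to
# a subject type via a category mapping table (cf. Table 1), then keep only the
# library videos whose subject type matches, forming a smaller search library.
# The mapping below is an abbreviated, assumed version of Table 1.
from typing import Optional

CATEGORY_MAPPING = {
    "human": ["man", "woman", "old man", "child"],
    "quadruped": ["dog", "cat", "bear", "rabbit"],
    "bird": ["chicken", "duck", "goose", "parrot"],
    "vehicle": ["car", "truck", "bus"],
}

def subject_type_of(subject: str) -> Optional[str]:
    subject = subject.lower()
    for category, members in CATEGORY_MAPPING.items():
        if any(member in subject for member in members):
            return category
    return None

def filter_video_library(aux_list, visual_subject: str):
    """aux_list: entries with 'name' and 'subject_type' (see the index sketch)."""
    wanted = subject_type_of(visual_subject)
    if wanted is None:
        return aux_list  # no filtering if the subject type is unknown
    return [entry for entry in aux_list if entry["subject_type"] == wanted]

# Example: filter_video_library(aux_list, "a white dog") keeps only quadruped videos.
```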
How to implement the technical solution provided by the present application based on the video library will be described below:
optionally, in this embodiment of the present application, before the step 202, the video generation method provided in this embodiment of the present application may further include the following steps A1 to A3:
step A1: the electronic equipment inputs the first text into the multi-modal feature extraction model and then converts the first text into at least one text feature information.
In an embodiment of the present application, the multi-modal feature extraction model may be a multi-level attention alignment model (Multi-Attention-Alignment Model, MAAM). Further, the MAAM is an end-to-end multimodal video-text retrieval model.
In an embodiment of the present application, the at least one text feature information may include a text feature vector. Such as a token for the text.
Step A2: from the at least one text feature information, key text feature information is determined.
In this embodiment, the key text feature information may be a text feature vector, among the at least one piece of text feature information, that meets a preset condition.
Step A3: and calling a video library, clustering the key text characteristic information and the video characteristic information of each video to obtain first video characteristic information, and taking the video corresponding to the first video characteristic information as the first video.
In this embodiment, the similarity between the first video feature information and the key text feature information satisfies a first condition.
In an embodiment of the present application, the first video feature information may include a video feature vector. Such as token for the video.
It should be noted that tokens are the elements of a sequence feature vector obtained by converting features into a fixed dimension; each atomic feature in the sequence is one token.
In the embodiment of the application, after the electronic device extracts the first text feature vector and acquires the video feature vector from the video library, the vector value is calculated to obtain the corresponding score of each video, so that the similarity between the first text and each video is obtained.
In an embodiment of the present application, the first condition may be: being the video feature vector with the highest similarity value to the key text feature vector.
Therefore, the electronic equipment calculates the similarity between the extracted key text characteristic information and the video characteristic information in the video library, and determines the video with the highest similarity value as the first video, so that the accuracy of determining the first video by the electronic equipment is improved.
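For illustration only, the retrieval of steps A1 to A3 may be reduced to a nearest-neighbour search between the key text feature vector and the stored video feature vectors; the sketch below simplifies the clustering of step A3 to a plain cosine-similarity ranking, and the dimensions and N are assumptions.

```python
# Illustrative sketch: rank library videos by cosine similarity between the key
# text feature vector and each stored video feature vector, and keep the top N
# as the "first videos".
import numpy as np

def top_n_similar_videos(key_text_feature: np.ndarray,
                         video_features: np.ndarray,
                         n: int = 5) -> np.ndarray:
    """key_text_feature: (d,); video_features: (num_videos, d). Returns indices."""
    t = key_text_feature / np.linalg.norm(key_text_feature)
    v = video_features / np.linalg.norm(video_features, axis=1, keepdims=True)
    scores = v @ t                      # cosine similarity per video
    return np.argsort(-scores)[:n]      # indices of the N most similar videos

# Usage: idx = top_n_similar_videos(text_vec, np.load("video_features.npy"), n=5)
```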
For example, the following takes the MAAM as the multi-modal feature extraction model and illustrates the extraction process of the multi-modal feature extraction model.
Illustratively, as shown in fig. 2, the MAAM model is composed of a visual module and a text module. The visual Module comprises a Vision-transformer model (a visual model in the image-Text pre-training model) and a Guide-Study-Module (GSM), and the Text Module comprises a Text-transformer model (a Text model in the image-Text pre-training model) and a GSM. The GSM module consists of an Attention (Attention) module and a Cluster Alignment (Cluster Alignment) module, wherein the Cluster Alignment module is a shared module.
It should be noted that the above-mentioned Attention module is used to guide the model to focus only on the tokens with distinguishable features.
Exemplarily, the GSM is intended to obtain feature information with respective discrimination in each modality, and uniformly map the extracted significant feature vectors to a new feature space, so that each modality feature vector is in the same semantic dimension in the new space, and a semantic information difference between modalities is eliminated.
Illustratively, as shown in fig. 3, the Attention module in the GSM is used to extract key information, and it reduces the number of output tokens of the penultimate layer of the Vision-transformer and the Text-transformer from n to k, where k is the number of self-attention heads in the transformer structure. L is the number of layers of the transformer structure, and $a_l$ is the attention weight of the l-th layer of the transformer model, where l ranges from the first layer to the penultimate layer of the transformer structure (the last layer does not participate in the attention weight calculation). The structure of $a_l$ is shown in (formula 1):

$a_l = \left[ a_l^{1}, a_l^{2}, \ldots, a_l^{k} \right]$ (formula 1)

wherein $a_l^{i}$ is the attention structure of the i-th attention head at the l-th layer; each attention head contains N tokens, and its structure is shown in (formula 2):

$a_l^{i} = \left[ a_l^{i,1}, a_l^{i,2}, \ldots, a_l^{i,N} \right]$ (formula 2)

In the Attention module, the attention weights of all layers in (1, L-1) are matrix-multiplied, as shown in (formula 3):

$a_{\mathrm{final}} = \prod_{l=1}^{L-1} a_l$ (formula 3)

$a_{\mathrm{final}}$ is then matrix-multiplied with the penultimate layer $a_{L-1}$ of the transformer structure to obtain the attention vector $a_{\mathrm{select}}$, as shown in (formula 4):

$a_{\mathrm{select}} = a_{\mathrm{final}} \cdot a_{L-1}$ (formula 4)

For each of the k attention heads of the $a_{L-1}$ layer, the maximum component is taken, as shown in (formula 5):

$a_{\max}^{i} = \max_{1 \le j \le N} a_{\mathrm{select}}^{i,j}, \quad i = 1, \ldots, k$ (formula 5)

For convenience of presentation, the token selected by the max component of the k-th attention head is recorded as $t_{\max}^{k}$.

The k selected attention tokens and the classification token are spliced into a new sequence, which replaces the original token sequence of the (L-1)-th layer and is input into the L-th layer of the transformer. The new sequence has the form shown in (formula 6):

$\left[ t_{\mathrm{cls}},\ t_{\max}^{1},\ t_{\max}^{2},\ \ldots,\ t_{\max}^{k} \right]$ (formula 6)
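For illustration only, the token selection of formulas 1 to 6 may be sketched in PyTorch as follows; the tensor shapes and the use of the cls-token row of the attention matrix as the per-token score are assumptions.

```python
# Illustrative sketch (PyTorch) of the token selection in the GSM Attention
# module: multiply the attention weights of layers 1..L-1, combine them with
# the penultimate layer, take for every head the token with the maximum
# attention score, and splice the k selected tokens with the classification
# token. Shapes and the scoring by the cls-token row are assumptions.
import torch

def select_attention_tokens(attn_weights, tokens_penultimate):
    """
    attn_weights: list of L-1 tensors of shape (k, N, N), one per layer
                  (attention of every head, cls token at index 0).
    tokens_penultimate: (N, d) token sequence output by layer L-1.
    Returns a new sequence of shape (k + 1, d): [cls, t_max^1, ..., t_max^k].
    """
    a_final = attn_weights[0]
    for a_l in attn_weights[1:]:
        a_final = torch.matmul(a_final, a_l)              # formula 3
    a_select = torch.matmul(a_final, attn_weights[-1])    # formula 4
    # Score of each token for each head: attention paid to it by the cls token.
    scores = a_select[:, 0, 1:]                           # (k, N-1), skip cls itself
    max_idx = scores.argmax(dim=-1) + 1                   # formula 5, shift past cls
    selected = tokens_penultimate[max_idx]                # (k, d)
    cls_token = tokens_penultimate[:1]                    # (1, d)
    return torch.cat([cls_token, selected], dim=0)        # formula 6

# Example shapes: k = 12 heads, N = 197 tokens, d = 768, L = 12 layers.
```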
Illustratively, as shown in fig. 4, the cluster alignment module is configured to cluster the k tokens output by the text Attention module and the k tokens output by the visual Attention module to obtain p shared centers $\{c_1, c_2, \ldots, c_p\}$. Each token is re-represented by the p shared centers, and the tokens of the text model and the tokens of the visual model are expressed using the same set of basis vectors, so that the semantic difference between the two modalities is further weakened.
The dot product between each token of the text model and the visual model and each shared center is computed, and the product is converted into a confidence through a softmax function. The confidence represents how much the current shared center contributes to representing the given token in the new feature space formed by the shared centers, as shown in (formula 7):

$w_{ij} = \dfrac{\exp\left(t_i \cdot c_j\right)}{\sum_{j'=1}^{p} \exp\left(t_i \cdot c_{j'}\right)}$ (formula 7)

wherein $t_i$ denotes the i-th token, $c_j$ denotes the j-th shared center, and $w_{ij}$ indicates how much the j-th shared center contributes to the representation of the i-th token. The final representation of each token in terms of the shared-center features is shown in (formula 8), where $\tau$ denotes the total number of feature tokens of the text modality and the video modality; since the text modality and the visual modality each output k attention tokens, $\tau = 2k$:

$\hat{t}_i = \sum_{j=1}^{p} w_{ij}\, c_j, \quad i = 1, \ldots, \tau$ (formula 8)
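For illustration only, the cluster alignment of formulas 7 and 8 may be sketched as follows; treating the p shared centers as learnable parameters is an assumption.

```python
# Illustrative sketch (PyTorch) of the Cluster Alignment module: compute the
# dot product of every token with every shared center, turn it into a
# confidence with softmax (formula 7), and re-represent each token as the
# weighted sum of the shared centers (formula 8).
import torch
import torch.nn as nn

class ClusterAlignment(nn.Module):
    def __init__(self, dim: int, num_centers: int):
        super().__init__()
        # p shared centers {c_1, ..., c_p}, shared by text and visual tokens.
        self.centers = nn.Parameter(torch.randn(num_centers, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (tau, dim) attention tokens from both modalities (tau = 2k)."""
        logits = tokens @ self.centers.t()          # (tau, p) dot products
        w = torch.softmax(logits, dim=-1)           # formula 7: confidences w_ij
        return w @ self.centers                     # formula 8: re-represented tokens

# Usage: ClusterAlignment(dim=768, num_centers=16)(torch.randn(2 * 12, 768))
```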
Exemplarily, as shown in fig. 3, the process of extracting the feature vector by the MAAM model includes the following steps S1 to S4:
step S1: obtaining m tokens by passing the Text through a Text-transformer model, wherein the classification token (cls-token) is marked as cls g . And then the m tokens pass through an Attention module in the GSM, and the token number is reduced to k +1 tokens.
Step S2: uniformly sampling 16 frames (Frame) of a video, obtaining n tokens from each Frame through a Vision-transformer model, splicing the corresponding cls-tokens and newly-added cls-tokens of each Frame into a new token sequence feature vector, wherein the newly-added cls-tokens are recorded as cls a . And inputting the new token sequence feature vector into an Attention module in the GSM to extract key information, and reducing the number of tokens to k + 1.
And step S3: and respectively outputting k attribute tokens for removing the cls-tokens in the text modality and the video modality, and obtaining the p-dimensional space representation through a cluster alignment module.
Step S4: the overall training loss is shown in (formula 9); through multi-loss joint optimization, the model can learn the most representative attention feature vectors of the video modality and the text modality.

$L = L_g + L_a + L_c$ (formula 9)

$L_g$ is the global alignment loss calculated by the Global Alignment module, as shown in (formula 10), where B is the number of all video-text pairs in the training set, $v_i$ is the i-th video feature vector, $t_i$ is the i-th text feature vector, and $s(v_i, t_i)$ denotes the cosine similarity between the video feature vector and the text feature vector, that is, the dot product of the normalized $v_i$ and the normalized $t_i$, as shown in (formula 11):

$s(v_i, t_i) = \dfrac{v_i}{\lVert v_i \rVert} \cdot \dfrac{t_i}{\lVert t_i \rVert}$ (formula 11)

$L_a$ is the attention alignment loss, as shown in (formula 12), where $\mathrm{sim}(z_i, z_j)$ denotes the cosine similarity between the attention feature vectors corresponding to samples i and j.

$L_c$ is the cluster loss, as shown in (formula 13), where $\hat{v}_i$ is the i-th clustered video feature vector and $\hat{t}_i$ is the i-th clustered text feature vector.
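For illustration only, the multi-loss joint optimization of formula 9 may be sketched as follows; since the exact forms of formulas 10, 12 and 13 are not reproduced here, each term is written as a symmetric contrastive loss over cosine similarities, which is an assumption rather than the disclosed formulas.

```python
# Illustrative sketch (PyTorch) of L = L_g + L_a + L_c. Each term is
# implemented as a symmetric contrastive loss over cosine similarities of
# paired features; this specific form and the temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.05):
    """x, y: (B, d) paired feature vectors (row i of x matches row i of y)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    sim = x @ y.t() / temperature                    # (B, B) cosine similarities
    labels = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

def maam_loss(video_global, text_global,        # global feature vectors (L_g)
              video_attn, text_attn,            # pooled attention tokens (L_a)
              video_cluster, text_cluster):     # clustered representations (L_c)
    l_g = contrastive_loss(video_global, text_global)
    l_a = contrastive_loss(video_attn, text_attn)
    l_c = contrastive_loss(video_cluster, text_cluster)
    return l_g + l_a + l_c                       # formula 9
```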
Optionally, in this embodiment of the application, the step 202 "the electronic device determines, from the N first videos, the target video segment matching the behavior descriptor" may include steps B1 to B3:
step B1: after the behavior descriptor and the N first videos are input into the multi-mode feature extraction model by the electronic equipment, the behavior descriptor is converted into at least one behavior feature information, and the N first videos are converted into at least one video feature information.
In an embodiment of the present application, the at least one behavior feature information may include a behavior feature vector. Such as tokens for the actions.
And step B2: from the at least one behavior feature information, key behavior feature information is determined, and from the at least one video feature information, first key video feature information is determined.
In the embodiment of the present application, the key behavior feature information refers to behavior feature information that plays a decisive role in the at least one behavior feature information.
And step B3: and according to the key behavior characteristic information, determining second video characteristic information from the first key video characteristic information, and taking a video clip corresponding to the second video characteristic information as a target video clip.
In an embodiment of the present application, a similarity between the second video feature information and the key behavior feature information satisfies a second condition.
In the embodiment of the application, after the electronic device extracts the first key video feature vectors and the key behavior feature vectors corresponding to the behavior descriptors, the vector values are calculated to obtain the scores corresponding to the video segments in each first video, so that the similarity between the behavior descriptors and each video segment is obtained.
In an embodiment of the present application, the second condition may be: being the first key video feature vector with the highest similarity value to the key behavior feature vector.
In an embodiment of the present application, the video segment is a video segment in the first video. One first video corresponds to at least one video clip.
Example 1: taking 5 videos as the first videos as an example, the electronic device sends each behavior descriptor and the 5 videos into the MAAM to extract behavior feature vectors, calculates the clustering loss L_c between each behavior descriptor and the feature vectors of the 5 selected videos, and takes out the video frames corresponding to the smallest L_c. If the selected video frames are not adjacent, all the video frames in the interval from the first selected video frame to the last selected video frame are arranged in time order to form a video segment. As shown in fig. 5, the attention tokens output by the text modality and the video modality after the Attention module (31 in the figure) are matched, and the original video frames at the corresponding positions (32 in the figure) are taken to form the video action segment most relevant to the current behavior descriptor. Then, a key-point detection model is used to extract the position information of the key points of the subject from all video frames in the video segment, the position differences of the corresponding key points of the subject in adjacent video frames are calculated, and the n-1 groups of key-point position differences in the video segment are recorded as the motion information.
Therefore, the electronic equipment calculates the similarity between the extracted key behavior characteristic information and the first key video characteristic information in the first video, and determines the video clip with the highest similarity value as the target video clip, so that the accuracy of determining the video clip by the electronic equipment is further improved.
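For illustration only, the recording of motion information described in Example 1 may be sketched as follows, assuming a key-point detection model that returns per-frame subject key-point coordinates; the detector itself is stubbed out.

```python
# Illustrative sketch: given the subject key points detected in every frame of
# a selected video segment, record the position difference of the key points
# between adjacent frames as the segment's motion information (n frames give
# n-1 groups of differences).
import numpy as np
from typing import List

def detect_keypoints(frame) -> np.ndarray:
    # Placeholder for a key-point detection model; returns (num_keypoints, 2).
    return np.zeros((17, 2), dtype=np.float32)

def motion_information(frames: List) -> List[np.ndarray]:
    keypoints = [detect_keypoints(f) for f in frames]
    # Position difference of corresponding key points in adjacent frames.
    return [keypoints[i + 1] - keypoints[i] for i in range(len(keypoints) - 1)]
```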
Optionally, in this embodiment of the application, the step 202 "the electronic device determines, from the N first videos, the target video frame matching the visual descriptor" may include steps C1 to C3:
step C1: after the visual descriptor and the N first videos are input into the multi-mode feature extraction model by the electronic equipment, the visual descriptor is converted into at least one piece of visual feature information, and the N first videos are converted into at least one piece of video feature information.
In an embodiment of the present application, the at least one piece of visual feature information may include a visual feature vector. Such as a visually corresponding token.
And C2: key visual characteristic information is determined from the at least one visual characteristic information, and second key video characteristic information is determined from the at least one video characteristic information.
In the embodiment of the present application, the key visual characteristic information refers to visual characteristic information that plays a decisive role in the at least one piece of visual characteristic information.
And C3: and determining third video characteristic information from the second key video characteristic information according to the key visual characteristic information, and taking a video frame corresponding to the third video characteristic information as a target video frame.
In an embodiment of the present application, a similarity between the third video feature information and the key visual feature information satisfies a third condition.
In the embodiment of the application, after extracting the second key video feature vector and extracting the key visual feature vector corresponding to the visual descriptor, the electronic device obtains a score corresponding to a video frame in each first video by calculating the vector value, so as to obtain the similarity between the visual descriptor and each video frame.
In an embodiment of the present application, the third condition may be: being the second key video feature vector with the highest similarity value to the key visual feature vector.
In an embodiment of the present application, the video frame is a video frame in the first video. One first video corresponds to at least two video frames.
Example 2, with reference to example 1, after the electronic device sends each visual descriptor and the 5 videos into the MAAM to extract visual feature vectors, cosine similarity is calculated between each visual descriptor and all the Attention tokens (5 × k in total) in the 5 selected videos, and a video frame corresponding to the Attention token with the largest similarity is taken as a target video frame, that is, the target video frame is a visual image most conforming to the visual descriptor.
Therefore, the electronic equipment calculates the similarity between the extracted key visual feature information and the second key video feature information in the first video, and determines the video frame with the highest similarity value as the target video frame, so that the accuracy of the electronic equipment in determining the video frame is further improved.
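For illustration only, the frame selection of Example 2 may be sketched as follows; the token-to-frame mapping and the array shapes are assumptions.

```python
# Illustrative sketch: select the target video frame by computing cosine
# similarity between a visual descriptor's feature vector and all attention
# tokens of the N selected videos (N x k tokens in total), and taking the
# frame whose token scores highest.
import numpy as np

def select_target_frame(visual_vec: np.ndarray,
                        attention_tokens: np.ndarray,
                        token_frame_index: np.ndarray) -> int:
    """
    visual_vec: (d,) feature of one visual descriptor.
    attention_tokens: (N * k, d) attention tokens of the N first videos.
    token_frame_index: (N * k,) frame index that produced each token.
    Returns the index of the target video frame.
    """
    v = visual_vec / np.linalg.norm(visual_vec)
    t = attention_tokens / np.linalg.norm(attention_tokens, axis=1, keepdims=True)
    best_token = int(np.argmax(t @ v))
    return int(token_frame_index[best_token])
```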
In the related art, the multi-modal feature extraction model training scheme is generally divided into synchronous training and asynchronous training. Wherein:
asynchronous training is single-mode independent training, after the characteristics of a model which is completely trained by each mode are extracted, similarity matching is directly carried out, the visual mode model and the text mode model which are obtained by the training mode have different characteristic dimensions, the characteristics learned by each mode model have poor information, and a good matching method is not provided. Alternatively, when there is less information available in one of the modalities, the modeler may not have the available information.
Synchronous training trains multiple modalities together: the information attended to by each modality model is influenced by the other modality model, and each modality model tends to learn features that are more related to the features of the other modality models. However, such collaborative training is still insufficient: both the video modality and the text modality contain a large amount of redundant information, each modality model has difficulty learning the key information, and the model cannot capture the key content that the text and the video intend to express.
In the embodiment of the present application, the multi-modal feature extraction model (such as MAAM) is designed as follows:
firstly, a text modal model in the MAAM uses an existing pre-trained text-transformer model structure as a basic network structure, a video modal model in the MAAM is used for sampling 16 frames of video frames at equal intervals, and then, an existing pre-trained vision-transformer model structure is used as the basic network structure for each frame of video frame. And in order to remove the interference of redundant information, GSM is introduced behind two basic network structures to guide each modal model to pay more attention to the related information between the modals. It should be noted that the GSM module consists of an Attention module and a Cluster Alignment module. Specifically, after outputting the original penultimate second layer of each mode in the GSM module, accessing an Attention module, selecting k tokens with the highest discrimination from respective modes, splicing the k tokens with the classification tokens into k +1 token sequences, replacing the penultimate second layer for outputting, and then obtaining k +1 Attention feature tokens through the last layer of transformer.
Secondly, in order to better align the characteristics of the text modal model and the video modal model, k attention characteristic tokens except the classification tokens can be sent to a Cluster Alignment module, and the characteristics of the video modal model and the text modal model are represented by using the same basis vector, so that the information difference between the video modal model and the text modal model in a new characteristic space can be eliminated. Meanwhile, in order to improve the consistency of the two modal characteristics, attention alignment loss (Attention alignment loss) and Cluster alignment loss (Cluster alignment loss) are introduced, so that the consistency of the video modal model and the text modal model from the local part to the global part is maximized.
Therefore, compared with the multi-mode feature extraction model training scheme provided in the related art, the training scheme provided by the application enables the trained multi-mode feature extraction model to more accurately capture the key feature information contained in the text or the video when extracting the feature information by additionally arranging the GSM module in the multi-mode feature extraction model.
According to the video generation method provided by the embodiment of the application, the execution main body can be a video generation device. The video generation apparatus provided in the embodiment of the present application will be described with reference to an example in which a video generation apparatus executes a video generation method.
An embodiment of the present application provides a video generating apparatus, as shown in fig. 6, the video generating apparatus 400 includes: an extraction module 401, a determination module 402 and a generation module 403, wherein: the extracting module 401 is configured to extract a behavior descriptor and a visual descriptor in the first text; the determining module 402 is configured to determine a target video segment matching the behavior descriptor from N first videos, and determine a target video frame matching the visual descriptor from the N first videos; the generating module 403 is configured to generate a target video based on the target video segment and the target video frame determined by the determining module 402; the N first videos are N videos similar to the first text in a video library; n is an integer greater than 1.
Optionally, in this embodiment of the application, the extracting module 401 is specifically configured to, after the first text is input into the named entity recognition model, perform word segmentation on the first text to obtain a plurality of words; and performing part-of-speech recognition on each word segmentation in the plurality of word segmentations based on the named entity recognition model to determine the behavior descriptors and the visual descriptors in the first text.
Optionally, in this embodiment of the present application, the video library includes a plurality of videos and video feature information of each of the plurality of videos; the determining module 402 is further configured to, before determining a target video segment matching the behavior descriptor from the N first videos and determining a target video frame matching the visual descriptor from the N first videos, input the first text into a multi-modal feature extraction model, and then convert the first text into at least one text feature information; determining key text characteristic information from the at least one text characteristic information; calling the video library, clustering the key text characteristic information and the video characteristic information of each video to obtain first video characteristic information, and taking the video corresponding to the first video characteristic information as the first video; and the similarity between the first video characteristic information and the key text characteristic information meets a first condition.
Optionally, in this embodiment of the application, the determining module 402 is specifically configured to convert the behavior descriptor and the N first videos into at least one behavior feature information after the behavior descriptor and the N first videos are input into a multi-modal feature extraction model, and convert the N first videos into at least one video feature information; determining key behavior characteristic information from the at least one behavior characteristic information, and determining first key video characteristic information from the at least one video characteristic information; according to the key behavior feature information, determining second video feature information from the first key video feature information, and taking a video clip corresponding to the second video feature information as the target video clip; and the similarity between the second video characteristic information and the key behavior characteristic information meets a second condition.
Optionally, in an embodiment of the present application, the determining module 402 is specifically configured to convert the visual descriptor and the N first videos into at least one piece of visual feature information after the visual descriptor and the N first videos are input into a multi-modal feature extraction model, and convert the N first videos into at least one piece of video feature information; determining key visual characteristic information from the at least one piece of visual characteristic information, and determining second key video characteristic information from the at least one piece of video characteristic information; according to the key visual characteristic information, determining third video characteristic information from the second key video characteristic information, and taking a video frame corresponding to the third video characteristic information as the target video frame; and the similarity between the third video characteristic information and the key visual characteristic information meets a third condition.
Optionally, in this embodiment of the application, the extracting module 401 is further configured to, before the determining module 402 determines a target video segment matching the behavior descriptor from the N first videos and a target video frame matching the visual descriptor from the N first videos, input the plurality of videos in the video library into the multi-modal feature extraction model for feature extraction, output the video feature information of each of the plurality of videos, and store the video feature information into the video library.
Optionally, in this embodiment of the present application, the target video segment includes a plurality of video segments; the generating module 403 is specifically configured to sort the plurality of video segments according to a word order of the behavior descriptor in the first text; and according to the language sequence of the visual descriptors in the first text, fusing the target video frame and the sequenced video fragments to generate a target video.
In the video generating apparatus provided in the embodiment of the present application, when a video is generated based on a text, the video generating apparatus may first extract a behavior descriptor and a visual descriptor in a first text, then determine a target video segment matching the behavior descriptor and a target video frame matching the visual descriptor from N first videos similar to the first text in a video library, and finally generate a target video based on the target video segment and the target video frame; wherein N is an integer greater than 1. In this way, the behavior descriptor describing the behavior of the subject in the text is used to search the first videos for the video segment matching that behavior, and the visual descriptor describing the visual presentation in the text is used to search the first videos for the video frame matching that presentation, so that after the matched video segment and video frame are fused, a target video that better fits the text and is more realistic can be obtained, which ensures the video quality of the finally generated video.
The video generating apparatus in the embodiment of the present application may be an electronic device, and may also be a component in the electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. The electronic device may be, for example, a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a Mobile Internet Device (MID), an Augmented Reality (AR)/Virtual Reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and may also be a server, a Network Attached Storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, and the like, which is not specifically limited in the embodiments of the present application.
The video generation apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system (Android), an iOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.
The video generation apparatus provided in the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 to fig. 5, and is not described here again to avoid repetition.
Optionally, as shown in fig. 7, an electronic device 600 is further provided in an embodiment of the present application, and includes a processor 601 and a memory 602, where the memory 602 stores a program or an instruction that can be executed on the processor 601, and when the program or the instruction is executed by the processor 601, the steps of the embodiment of the video generation method are implemented, and the same technical effects can be achieved, and are not described again to avoid repetition.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 100 includes, but is not limited to: radio frequency unit 101, network module 102, audio output unit 103, input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, and processor 110.
Those skilled in the art will appreciate that the electronic device 100 may further include a power supply (e.g., a battery) for supplying power to the various components, and the power supply may be logically connected to the processor 110 via a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented via the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation on the electronic device, and the electronic device may include more or fewer components than those shown, combine some components, or use a different arrangement of components, which is not described in detail here.
The processor 110 is configured to extract a behavior descriptor and a visual descriptor in the first text; determining a target video clip matched with the behavior descriptor from N first videos, and determining a target video frame matched with the visual descriptor from the N first videos; generating a target video based on the target video clip and the target video frame; the N first videos are N videos similar to the first text in the video library; n is an integer greater than 1.
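Purely as a reading aid, the three steps performed by the processor 110 can be combined into one pipeline; the callables are injected so the sketch stays independent of any particular model, and every name here is hypothetical:

```python
from typing import Callable, Sequence

def generate_video(first_text: str,
                   extract_descriptors: Callable[[str], tuple[list[str], list[str]]],
                   find_clips: Callable[[list[str]], list[str]],
                   find_frames: Callable[[list[str]], list[str]],
                   fuse: Callable[[Sequence[str], Sequence[str]], str]) -> str:
    """Assumed end-to-end flow: extract behavior and visual descriptors,
    match clips and frames within the N first videos, then fuse them."""
    behavior_descriptors, visual_descriptors = extract_descriptors(first_text)
    target_clips = find_clips(behavior_descriptors)
    target_frames = find_frames(visual_descriptors)
    return fuse(target_clips, target_frames)

# Dummy callables illustrate only the control flow, not real models.
result = generate_video(
    "a surfer rides a wave at sunset",
    extract_descriptors=lambda t: (["rides a wave"], ["sunset"]),
    find_clips=lambda ds: [f"clip:{d}" for d in ds],
    find_frames=lambda ds: [f"frame:{d}" for d in ds],
    fuse=lambda clips, frames: " + ".join([*clips, *frames]),
)
print(result)
```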
Optionally, in this embodiment of the present application, the processor 110 is specifically configured to: after the first text is input into a named entity recognition model, perform word segmentation on the first text to obtain a plurality of word segments; and perform part-of-speech recognition on each of the plurality of word segments based on the named entity recognition model, to determine the behavior descriptor and the visual descriptor in the first text.
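One possible, heavily simplified reading of this step is sketched below; the hand-written part-of-speech table and the rule that verbs become behavior descriptors while scene words and adjectives become visual descriptors are illustrative assumptions, not the actual named entity recognition model:

```python
# A toy stand-in for the named entity recognition model: a hand-written
# part-of-speech table replaces real word segmentation and tagging.
POS = {
    "surfer": "noun", "rides": "verb", "wave": "noun",
    "golden": "adj", "sunset": "scene", "sea": "scene",
}

def extract_descriptors(first_text: str) -> tuple[list[str], list[str]]:
    """Return (behavior descriptors, visual descriptors) from the first text."""
    words = first_text.lower().replace(",", " ").split()
    behavior = [w for w in words if POS.get(w) == "verb"]
    visual = [w for w in words if POS.get(w) in ("adj", "scene")]
    return behavior, visual

print(extract_descriptors("A surfer rides a wave at golden sunset over the sea"))
```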
Optionally, in this embodiment of the present application, the video library includes a plurality of videos and video feature information of each of the plurality of videos; the processor 110 is further configured to: before determining a target video segment matching the behavior descriptor from the N first videos and determining a target video frame matching the visual descriptor from the N first videos, input the first text into a multi-modal feature extraction model and convert the first text into at least one piece of text feature information; determine key text feature information from the at least one piece of text feature information; and call the video library, cluster the key text feature information with the video feature information of each video to obtain first video feature information, and take the video corresponding to the first video feature information as the first video, where the similarity between the first video feature information and the key text feature information meets a first condition.
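The retrieval of the N first videos could be sketched as a top-N similarity ranking between the key text feature and the stored per-video features; the clustering described above is simplified here to a cosine-similarity ranking, which is an assumption of this sketch:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_first_videos(key_text_feature: np.ndarray,
                        video_features: dict[str, np.ndarray],
                        n: int = 5) -> list[str]:
    """Return the n library videos whose stored feature information is most
    similar to the key text feature (an assumed reading of the 'first condition')."""
    ranked = sorted(video_features,
                    key=lambda name: cosine(key_text_feature, video_features[name]),
                    reverse=True)
    return ranked[:n]

rng = np.random.default_rng(1)
library = {f"video_{i}.mp4": rng.normal(size=512) for i in range(20)}
print(select_first_videos(rng.normal(size=512), library, n=3))
```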
Optionally, in an embodiment of the present application, the processor 110 is specifically configured to: after the behavior descriptor and the N first videos are input into a multi-modal feature extraction model, convert the behavior descriptor into at least one piece of behavior feature information and convert the N first videos into at least one piece of video feature information; determine key behavior feature information from the at least one piece of behavior feature information, and determine first key video feature information from the at least one piece of video feature information; and determine, according to the key behavior feature information, second video feature information from the first key video feature information, and take a video clip corresponding to the second video feature information as the target video clip, where the similarity between the second video feature information and the key behavior feature information meets a second condition.
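Clip matching can be sketched in the same spirit, here as an argmax over candidate segment features; the segmentation into candidate segments and the argmax selection standing in for the "second condition" are assumptions of this sketch:

```python
import numpy as np

def best_clip_index(key_behavior_feature: np.ndarray,
                    segment_features: list[np.ndarray]) -> int:
    """Return the index of the candidate segment whose feature is most similar
    to the key behavior feature (an assumed reading of the 'second condition')."""
    sims = [float(key_behavior_feature @ s /
                  (np.linalg.norm(key_behavior_feature) * np.linalg.norm(s) + 1e-8))
            for s in segment_features]
    return int(np.argmax(sims))

rng = np.random.default_rng(2)
segments = [rng.normal(size=512) for _ in range(8)]
print(best_clip_index(rng.normal(size=512), segments))
```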
Optionally, in an embodiment of the present application, the processor 110 is specifically configured to: after the visual descriptor and the N first videos are input into a multi-modal feature extraction model, convert the visual descriptor into at least one piece of visual feature information and convert the N first videos into at least one piece of video feature information; determine key visual feature information from the at least one piece of visual feature information, and determine second key video feature information from the at least one piece of video feature information; and determine, according to the key visual feature information, third video feature information from the second key video feature information, and take a video frame corresponding to the third video feature information as the target video frame, where the similarity between the third video feature information and the key visual feature information meets a third condition.
Optionally, in this embodiment of the present application, the processor 110 is further configured to: before determining a target video segment matching the behavior descriptor from the N first videos and determining a target video frame matching the visual descriptor from the N first videos, input the plurality of videos in the video library into a multi-modal feature extraction model for feature extraction, and output video feature information of each of the plurality of videos; and store the video feature information into the video library.
Optionally, in this embodiment of the present application, the target video segment includes a plurality of video segments; the processor 110 is specifically configured to sort the plurality of video segments according to the word order of the behavior descriptor in the first text, and fuse, according to the word order of the visual descriptor in the first text, the target video frame with the sorted video segments to generate the target video.
In the electronic device provided in the embodiment of the present application, when a video is generated based on a text, the electronic device may extract a behavior descriptor and a visual descriptor in a first text, determine, from N first videos that are similar to the first text in a video library, a target video segment matching the behavior descriptor and a target video frame matching the visual descriptor, and generate a target video based on the target video segment and the target video frame, where N is an integer greater than 1. In this way, the behavior descriptor describing the behavior of a subject in the text is used to search the first videos for video clips matching that behavior, and the visual descriptor describing the visual presentation picture in the text is used to search the first videos for video frames matching that picture, so that, after the matched video clips and video frames are fused, a target video that fits the text more closely and looks more realistic can be obtained, which ensures the quality of the finally generated video.
It should be understood that, in this embodiment of the present application, the input unit 104 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042, and the graphics processing unit 1041 processes image data of a still picture or a video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes at least one of a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen and may include two parts: a touch detection device and a touch controller. The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in detail here.
The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, where the first storage area may store an operating system, an application program or instructions required for at least one function (such as a sound playing function or an image playing function), and the like. Further, the memory 109 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synchlink DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). The memory 109 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
Processor 110 may include one or more processing units; optionally, the processor 110 integrates an application processor, which mainly handles operations related to the operating system, user interface, application programs, etc., and a modem processor, which mainly handles wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The embodiments of the present application further provide a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above-mentioned video generation method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a computer read only memory ROM, a random access memory RAM, a magnetic or optical disk, and the like.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the above video generation method embodiment, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
Embodiments of the present application provide a computer program product, where the program product is stored in a storage medium, and the program product is executed by at least one processor to implement the processes of the foregoing video generation method embodiments, and can achieve the same technical effects, and in order to avoid repetition, details are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another like element in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (16)

1. A method of video generation, the method comprising:
extracting a behavior descriptor and a visual descriptor in a first text;
determining a target video segment from the N first videos that matches the behavior descriptor, and determining a target video frame from the N first videos that matches the visual descriptor;
generating a target video based on the target video segment and the target video frame;
wherein the N first videos are N videos in a video library that are similar to the first text; and
N is an integer greater than 1.
2. The method of claim 1, wherein extracting the behavior descriptor and the visual descriptor in the first text comprises:
after the first text is input into a named entity recognition model, performing word segmentation on the first text to obtain a plurality of word segments;
and performing part-of-speech recognition on each of the plurality of word segments based on the named entity recognition model, and determining the behavior descriptor and the visual descriptor in the first text.
3. The method of claim 1, wherein the video library comprises a plurality of videos and video feature information for each of the plurality of videos;
before determining a target video segment from the N first videos that matches the behavior descriptor and determining a target video frame from the N first videos that matches the visual descriptor, the method further comprises:
after the first text is input into a multi-modal feature extraction model, converting the first text into at least one text feature information;
determining key text characteristic information from the at least one text characteristic information;
calling the video library, clustering the key text characteristic information and the video characteristic information of each video to obtain first video characteristic information, and taking the video corresponding to the first video characteristic information as the first video;
and the similarity between the first video characteristic information and the key text characteristic information meets a first condition.
4. The method of claim 1, wherein determining the target video segment matching the behavior descriptor from the N first videos comprises:
after the behavior descriptor and the N first videos are input into a multi-modal feature extraction model, converting the behavior descriptor into at least one behavior feature information, and converting the N first videos into at least one video feature information;
determining key behavior characteristic information from the at least one behavior characteristic information, and determining first key video characteristic information from the at least one video characteristic information;
according to the key behavior characteristic information, determining second video characteristic information from the first key video characteristic information, and taking a video clip corresponding to the second video characteristic information as the target video clip;
and the similarity between the second video characteristic information and the key behavior characteristic information meets a second condition.
5. The method according to claim 1 or 4, wherein the determining a target video frame matching the visual descriptor from the N first videos comprises:
inputting the visual descriptor and the N first videos into a multi-modal feature extraction model, converting the visual descriptor into at least one piece of visual feature information, and converting the N first videos into at least one piece of video feature information;
determining key visual characteristic information from the at least one piece of visual characteristic information, and determining second key video characteristic information from the at least one piece of video characteristic information;
determining third video characteristic information from the second key video characteristic information according to the key visual characteristic information, and taking a video frame corresponding to the third video characteristic information as the target video frame;
and the similarity between the third video characteristic information and the key visual characteristic information meets a third condition.
6. The method of any of claims 3 to 5, wherein prior to determining a target video segment from the N first videos that matches the behavior descriptor and determining a target video frame from the N first videos that matches the visual descriptor, the method further comprises:
inputting the videos in the video library into a multi-modal feature extraction model for feature extraction, and outputting video feature information of each video in the videos;
and storing the video characteristic information into the video library.
7. The method of claim 1, wherein the target video segment comprises a plurality of video segments;
generating a target video based on the target video segment and the target video frame, including:
sequencing the plurality of video segments according to the word order of the behavior descriptor in the first text;
and according to the word order of the visual descriptor in the first text, fusing the target video frame and the sequenced video segments to generate a target video.
8. A video generation apparatus, characterized in that the apparatus comprises: the device comprises an extraction module, a determination module and a generation module, wherein:
the extraction module is used for extracting the behavior descriptors and the visual descriptors in the first text;
the determining module is used for determining a target video segment matched with the behavior descriptor from N first videos and determining a target video frame matched with the visual descriptor from the N first videos;
the generating module is used for generating a target video based on the target video clip and the target video frame determined by the determining module;
the N first videos are N videos in a video library that are similar to the first text; and
N is an integer greater than 1.
9. The apparatus of claim 8,
the extraction module is specifically configured to:
after the first text is input into a named entity recognition model, performing word segmentation on the first text to obtain a plurality of word segments;
and performing part-of-speech recognition on each of the plurality of word segments based on the named entity recognition model, and determining the behavior descriptor and the visual descriptor in the first text.
10. The apparatus of claim 8, wherein the video library comprises a plurality of videos and video feature information of each of the plurality of videos;
the determining module is further configured to:
before determining the target video segment matched with the behavior descriptor from the N first videos and determining the target video frame matched with the visual descriptor from the N first videos, converting the first text into at least one text feature information after the first text is input into a multi-modal feature extraction model;
determining key text characteristic information from the at least one text characteristic information;
calling the video library, clustering the key text characteristic information and the video characteristic information of each video to obtain first video characteristic information, and taking the video corresponding to the first video characteristic information as the first video;
and the similarity between the first video characteristic information and the key text characteristic information meets a first condition.
11. The apparatus of claim 8,
the determining module is specifically configured to:
inputting the behavior descriptor and the N first videos into a multi-modal feature extraction model, converting the behavior descriptor into at least one behavior feature information, and converting the N first videos into at least one video feature information;
determining key behavior characteristic information from the at least one behavior characteristic information, and determining first key video characteristic information from the at least one video characteristic information;
according to the key behavior characteristic information, determining second video characteristic information from the first key video characteristic information, and taking a video clip corresponding to the second video characteristic information as the target video clip;
and the similarity between the second video characteristic information and the key behavior characteristic information meets a second condition.
12. The apparatus of claim 8 or 11,
the determining module is specifically configured to:
inputting the visual descriptor and the N first videos into a multi-modal feature extraction model, converting the visual descriptor into at least one piece of visual feature information, and converting the N first videos into at least one piece of video feature information;
determining key visual characteristic information from the at least one piece of visual characteristic information, and determining second key video characteristic information from the at least one piece of video characteristic information;
determining third video characteristic information from the second key video characteristic information according to the key visual characteristic information, and taking a video frame corresponding to the third video characteristic information as the target video frame;
and the similarity between the third video characteristic information and the key visual characteristic information meets a third condition.
13. The apparatus according to any one of claims 10 to 12,
the extraction module is further configured to:
before the determining module determines a target video segment matched with the behavior descriptor from the N first videos and determines a target video frame matched with the visual descriptor from the N first videos, inputting the plurality of videos in the video library into a multi-modal feature extraction model for feature extraction, and outputting video feature information of each video in the plurality of videos;
and storing the video characteristic information into the video library.
14. The apparatus of claim 8, wherein the target video segment comprises a plurality of video segments;
the generation module is specifically configured to:
sequencing the plurality of video segments according to the word order of the behavior descriptor in the first text;
and according to the word order of the visual descriptor in the first text, fusing the target video frame and the sequenced video segments to generate a target video.
15. An electronic device comprising a processor and a memory, the memory storing a program or instructions executable on the processor, the program or instructions when executed by the processor implementing the steps of the video generation method of any of claims 1 to 7.
16. A readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the video generation method of any one of claims 1 to 7.
CN202210863915.XA 2022-07-21 2022-07-21 Video generation method and device, electronic equipment and medium Pending CN115186133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210863915.XA CN115186133A (en) 2022-07-21 2022-07-21 Video generation method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210863915.XA CN115186133A (en) 2022-07-21 2022-07-21 Video generation method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115186133A true CN115186133A (en) 2022-10-14

Family

ID=83518658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210863915.XA Pending CN115186133A (en) 2022-07-21 2022-07-21 Video generation method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115186133A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664726A (en) * 2023-07-26 2023-08-29 腾讯科技(深圳)有限公司 Video acquisition method and device, storage medium and electronic equipment
CN116664726B (en) * 2023-07-26 2024-02-09 腾讯科技(深圳)有限公司 Video acquisition method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Deruyttere et al. Talk2car: Taking control of your self-driving car
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
CN114390217A (en) Video synthesis method and device, computer equipment and storage medium
CN112883149A (en) Natural language processing method and device
CN114332679A (en) Video processing method, device, equipment, storage medium and computer program product
CN110705490A (en) Visual emotion recognition method
CN115544303A (en) Method, apparatus, device and medium for determining label of video
Li et al. Intention understanding in human–robot interaction based on visual-NLP semantics
Bucher et al. Semantic bottleneck for computer vision tasks
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN114661951A (en) Video processing method and device, computer equipment and storage medium
Li et al. Image Captioning with multi-level similarity-guided semantic matching
CN115186133A (en) Video generation method and device, electronic equipment and medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Wu et al. Sentimental visual captioning using multimodal transformer
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN114627282A (en) Target detection model establishing method, target detection model application method, target detection model establishing device, target detection model application device and target detection model establishing medium
CN113435206A (en) Image-text retrieval method and device and electronic equipment
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN113761124A (en) Training method of text coding model, information retrieval method and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination