CN114598926B - Video generation method and device, electronic equipment and storage medium - Google Patents

Video generation method and device, electronic equipment and storage medium

Info

Publication number
CN114598926B
CN114598926B (application number CN202210068441.XA)
Authority
CN
China
Prior art keywords
video
text
video frame
identification sequence
inferred
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210068441.XA
Other languages
Chinese (zh)
Other versions
CN114598926A (en)
Inventor
王卫宁
朱欣鑫
刘静
孙铭真
刘佳伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Zidong Taichu (Beijing) Technology Co.,Ltd.
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202210068441.XA priority Critical patent/CN114598926B/en
Publication of CN114598926A publication Critical patent/CN114598926A/en
Application granted granted Critical
Publication of CN114598926B publication Critical patent/CN114598926B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440236Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440218Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440254Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering signal-to-noise parameters, e.g. requantization
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440263Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the spatial resolution, e.g. for displaying on a connected PDA
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a video generation method and device, electronic equipment and a storage medium. The video generation method comprises the following steps: preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred; and inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred. The trained neural network video generation model is obtained by training according to text identification sequences of real text samples to be inferred and identification sequences of the corresponding real video samples, where the identification sequence of a real video sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold, the resolution of the first video frame is lower than that of the second video frame, and the first video frame is the video frame at the moment preceding the second video frame. The method thus generates high-quality, high-resolution video with good generalization that matches the semantics of the text to be inferred.

Description

Video generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a video generation method and apparatus, an electronic device, and a storage medium.
Background
Text-to-video generation is a research topic that spans multiple fields and involves information of multiple modalities. The task aims to generate, from an input text, a high-resolution video matching the text semantics, and it is of great significance to artificial intelligence research.
At present, most existing text-to-video generation methods and frameworks adopt hybrid frameworks based on Generative Adversarial Networks (GANs). However, GAN-based models have a poor understanding of text semantics, resulting in poor generalization performance and low resolution of the generated videos.
Therefore, how to better realize text-to-video generation has become an urgent problem to be solved in the industry.
Disclosure of Invention
The invention provides a video generation method and device, an electronic device and a storage medium, which are used to better realize text-to-video generation.
The embodiment of the invention provides a video generation method, which comprises the following steps:
preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred;
inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred;
the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold, the resolution of the first video frame is lower than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame.
According to the video generation method provided by the embodiment of the invention, the step of inputting the text identification sequence of the text to be inferred into the trained neural network video generation model to generate the video corresponding to the text to be inferred comprises the following steps:
step 201, inputting the text identification sequence of the text to be inferred into a trained Transformer neural network Decoder model, and generating an identification sequence of a second video frame at the current moment corresponding to the text identification sequence of the text to be inferred;
step 202, inputting the identification sequence of the second video frame at the current moment into a decoder of a vector quantization self-encoder to obtain the second video frame at the current moment, and storing the second video frame at the current moment into a preset output queue;
step 203, down-sampling the second video frame at the current moment to obtain a down-sampled video frame, and inputting the down-sampled video frame into an encoder of a vector quantization self-encoder to obtain an identification sequence of the down-sampled video frame;
step 204, inputting the identification sequence of the downsampled video frame and the text identification sequence of the text to be inferred into the trained Transformer neural network Decoder model to generate an identification sequence of a second video frame at the next moment, and taking the identification sequence of the second video frame at the next moment as the identification sequence of the second video frame at the current moment;
repeating the step 202 to the step 204 until the number of second video frames stored in the preset output queue reaches a preset number of frames, and generating a video corresponding to the text to be inferred according to each second video frame in the preset output queue;
the trained neural network video generation model comprises the trained Transformer neural network Decoder model and an encoder and a decoder of the vector quantization self-encoder.
According to the video generation method provided by the embodiment of the present invention, before the text identification sequence of the text to be inferred is input into the trained neural network video generation model, the method further includes:
preprocessing the text real sample to be inferred to obtain a text identification sequence of the text real sample to be inferred, and preprocessing the video real sample to obtain an identification sequence of the first video frame and an identification sequence of the second video frame;
determining a target identification sequence based on the text identification sequence, the identification sequence of the first video frame and the identification sequence of the second video frame;
and acquiring a plurality of groups of target identification sequences, and training a neural network video generation model by using the plurality of groups of target identification sequences.
According to a video generating method provided by an embodiment of the present invention, the preprocessing the real video sample to obtain the identifier sequence of the first video frame and the identifier sequence of the second video frame includes:
randomly extracting adjacent video frames in the real video sample according to a preset frame rate to obtain a video frame at the previous moment of the second video frame and the second video frame;
down-sampling a video frame at the previous moment of the second video frame to obtain the first video frame;
and inputting the first video frame and the second video frame into a vector quantization self-encoder to obtain an identification sequence of the first video frame and an identification sequence of the second video frame.
According to a video generating method provided by an embodiment of the present invention, the determining a target identification sequence based on the text identification sequence, the identification sequence of the first video frame, and the identification sequence of the second video frame includes:
splicing the text identification sequence, the identification sequence of the first video frame and the identification sequence of the second video frame to obtain a spliced identification sequence;
and determining the target identification sequence based on the spliced identification sequence and the preset sequence length.
According to a video generation method provided by an embodiment of the present invention, the training of the neural network video generation model by using the plurality of sets of target identification sequences includes:
and training the neural network video generation model in an autoregressive mode according to any one group of the target identification sequences, and obtaining the trained neural network video generation model when preset training conditions are met.
According to the video generation method provided by the embodiment of the invention, the preprocessing is performed on the text real sample to be inferred to obtain the text identification sequence of the text real sample to be inferred, and the method comprises the following steps:
and coding the real text sample to be inferred based on a byte pair coding method to obtain a text identification sequence of the real text sample to be inferred.
An embodiment of the present invention further provides a video generating apparatus, including:
a preprocessing module, which is used for preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred;
the video generation module is used for inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred;
the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold, the resolution of the first video frame is lower than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame.
An embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the video generation methods described above when executing the program.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any one of the video generation methods described above.
An embodiment of the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the steps of the video generation method described in any one of the above are implemented.
According to the video generation method and device, the electronic equipment and the storage medium provided by the embodiments of the invention, the text token sequence of the text to be inferred is input into the trained neural network video generation model. By combining the first video frame and the second video frame, that is, adjacent high-resolution and low-resolution video frames, the temporal and spatial redundancy of video frames is fully exploited and redundant information between frames is reduced, so that continuous video frames can be generated iteratively. At the same time, the flexibility of the generated video sequence length is improved, and high-quality, high-resolution video with good generalization that matches the semantics of the text to be inferred is generated.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a video generation method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a model framework in a video generation method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a preprocessing flow of a video generation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a model training flow of a video generation method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present invention;
fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A video generation method, apparatus, electronic device, and storage medium according to the present invention are described below with reference to fig. 1 to 6.
It should be noted that the text-to-video generation described in the present invention refers to general open-domain text-to-video generation: by means of a general generation model pre-trained on large-scale datasets, a video whose semantics correspond to an input text of unrestricted content type can be synthesized in a cross-modal generation task.
Traditional GAN-based video generation models have an insufficient ability to capture text semantics; they can only generate videos for a limited range of texts on certain specific topics and cannot adapt well to the open domain.
In the field of natural language processing, Transformer-based generative pre-trained language models (such as GPT) have achieved great success. Subsequently, image and video generation models based on the Transformer architecture have also emerged in large numbers.
The inventors have found that, through the encoding of a Vector Quantized Variational Auto-Encoder (VQVAE) and the sequence modeling capability of a Transformer, such models can obtain better generalization and semantic relevance. However, when this line of work is applied to video generation, the length of the generated sequence grows multiplicatively with the number of generated frames, which greatly increases the computational load, and the maximum length is limited by the graphics memory of the computing platform. Moreover, when generating high-resolution video, the long high-resolution token sequences of already generated video frames are used only to prompt the generation of subsequent frames; they contain a large amount of redundant information, and such long sequences are difficult to train and learn.
In order to better realize text-to-video generation, embodiments of the present invention provide a flexible iterative video generation method and apparatus, an electronic device, and a storage medium.
Fig. 1 is a schematic flowchart of a video generation method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes: step 101 and step 102.
Step 101, preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred;
in this step, the text to be inferred described in the present invention refers to text information for video generation, and the text information may be used to semantically describe a video to be generated by using one or more natural languages.
The text identification (token) sequence described in the invention refers to a discretization element identification sequence obtained after encoding the text to be reasoned.
In the embodiment of the invention, the text to be inferred is first preprocessed to obtain its text token sequence. Specifically, a word list can be constructed first; each text sentence in the text to be inferred is split into words of the constructed word list, and the position labels of these words in the word list are taken as identifiers representing the words. In this way, the original text string information of the text to be inferred is converted into a token sequence, i.e., the text token sequence of the text to be inferred is obtained.
Step 102, inputting a text identification sequence of a text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred;
the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold, the resolution of the first video frame is lower than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame.
In this step, the trained neural network video generation model described in the present invention is obtained by training according to the text identification sequences of real text samples to be inferred and the identification sequences of the corresponding real video samples. It is used to recognize the token sequence of the text to be inferred input by the user, to learn text semantic understanding, the cross-modal relation between text and video, and video modeling capability, and to generate a high-resolution video of a specified duration that matches the semantics of the text to be inferred.
It should be noted that, in the embodiment of the present invention, the cross-modality relation refers to an association relationship between multi-modality information, where the multi-modality information includes text information and video image information.
The text real sample to be inferred described by the embodiment of the invention is matched with the video real sample corresponding to the text real sample to be inferred, namely the text real sample to be inferred comprises semantic description information of the video real sample corresponding to the text real sample to be inferred.
In this embodiment, open-source datasets such as WebVid and MSR-VTT are adopted, and datasets from specific domains can also be used as required. The real text samples to be inferred and the real video samples corresponding to them can be obtained from these open-source datasets, that is, a real sample dataset can be obtained.
The target resolution threshold described in the embodiments of the present invention refers to a resolution threshold defining a high resolution video.
It is to be understood that the identification sequence of the first video frame refers to a video token sequence generated based on the first video frame, and the identification sequence of the second video frame refers to a video token sequence generated based on the second video frame.
In this embodiment, the identifier sequence of the real sample of the video includes an identifier sequence of a first video frame and an identifier sequence of a second video frame, the resolution of the second video frame is higher than the target resolution threshold, the resolution of the first video frame is lower than the resolution of the second video frame, and the first video frame is a video frame at a previous moment of the second video frame;
in the embodiment of the present invention, the resolution of the second video frame is higher than the target resolution threshold, where the target resolution threshold is a threshold defining a high resolution video, that is, the second video frame may be a high resolution video frame; and selecting the first video frame and the second video frame as adjacent video frames, namely the video frames which are continuous in time, wherein the resolution of the first video frame is lower than the target resolution threshold.
In this embodiment, the resolution of the first video frame may also be lower than another target resolution threshold, which is a threshold defining a low resolution video, i.e. the first video frame may be a low resolution video frame.
It should be noted that the first video frame described in the embodiments of the present invention may represent a video frame of a low resolution type, and the second video frame may represent a video frame of a high resolution type.
In some embodiments, the neural network video generation model described in the embodiments of the present invention may be constructed based on a Transformer neural network Decoder and a VQVAE.
In the embodiment of the present invention, in step 102, inputting the text identification sequence of the text to be inferred into the trained neural network video generation model, and generating the video corresponding to the text to be inferred, specifically including:
step 201, inputting a text identification sequence of a text to be inferred into a trained Transformer neural network Decoder model, and generating an identification sequence of a second video frame at the current moment corresponding to the text identification sequence of the text to be inferred;
step 202, inputting the identification sequence of the second video frame at the current moment into a decoder of a vector quantization self-encoder to obtain the second video frame at the current moment, and storing the second video frame at the current moment into a preset output queue;
step 203, down-sampling the second video frame at the current moment to obtain a down-sampled video frame, and inputting the down-sampled video frame into an encoder of a vector quantization self-encoder to obtain an identification sequence of the down-sampled video frame;
step 204, inputting the identification sequence of the down-sampled video frame and the text identification sequence of the text to be inferred into the trained Transformer neural network Decoder model to generate an identification sequence of a second video frame at the next moment, and taking the identification sequence of the second video frame at the next moment as the identification sequence of the second video frame at the current moment;
repeating the step 202 to the step 204 until the number of the second video frames stored in the preset output queue reaches the preset number of frames, and generating a video corresponding to the text to be inferred according to each second video frame in the preset output queue;
the trained neural network video generation model comprises a trained Transformer neural network Decoder model and an encoder and a decoder of a vector quantization self-encoder.
Specifically, the downsampled video frame described in the embodiment of the present invention refers to a low-resolution video frame obtained by downsampling a second video frame at the current time.
The preset frame number described in the embodiment of the present invention refers to a preset threshold for generating the number of second video frames, and a specific value thereof may be set according to an actual calculation requirement.
It can be understood that the preset frame number is consistent with the number of iterations of iteratively generating the second video frame by the model, that is, each time the iterative computation is completed, the second video frame at a time is generated.
The preset output queue described in the embodiment of the present invention refers to a preset queue, and may be used to store second video frames generated in each iteration at different times in continuous time, and when the number of the stored second video frames reaches a preset number of frames, the second video frames may be integrated to output a complete video corresponding to a text to be inferred.
In the embodiment of the invention, in the implementation of steps 202 to 204, the text identification sequence of the text to be inferred is input into the trained Transformer neural network Decoder model, which sequentially generates, frame by frame and token by token, a low-resolution video token sequence (the video token sequence of the first video frame) and a high-resolution video token sequence (the video token sequence of the second video frame). The video token sequence of the second video frame is reconstructed by the VQVAE Decoder into the high-resolution video frame, which completes one iteration;
furthermore, the high-resolution video frame generated at each iteration, namely the second video frame, is stored in the preset output queue. The second video frame is also downsampled to obtain a downsampled video frame, which serves as the low-resolution video frame at that moment, namely the first video frame, and the video token sequence of this first video frame is used as the input of the next iteration to predict and generate the second video frame at the next moment.
Therefore, the second video frame at the current time described in this embodiment may be the second video frame at any moment of the video generated from the text identification sequence of the text to be inferred; correspondingly, the identification sequence of the second video frame at the current moment is the identification sequence of the second video frame at that moment.
Further, under the condition that the frame number of the second video frame stored in the preset output queue is judged not to reach the preset frame number, the steps 202 to 204 are repeated until the frame number of the second video frame stored in the preset output queue reaches the preset frame number, so that a complete video can be generated according to each frame of the second video frame in the preset output queue, and the video corresponding to the text to be inferred is obtained.
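For clarity, the iterative procedure of steps 202 to 204 can be summarized in the following minimal Python sketch. It assumes callables for the trained Transformer Decoder, the VQVAE encoder and decoder, and the downsampling operation; all function and parameter names are illustrative assumptions and are not the patent's own interfaces.

from typing import Callable, List

def generate_video(
    text_tokens: List[int],
    predict_frame_tokens: Callable[[List[int], List[int]], List[int]],  # trained Transformer Decoder
    vqvae_decode: Callable[[List[int]], object],   # token sequence -> high-resolution frame
    vqvae_encode: Callable[[object], List[int]],   # frame -> token sequence
    downsample: Callable[[object], object],        # high-resolution frame -> low-resolution frame
    preset_num_frames: int,
) -> List[object]:
    output_queue: List[object] = []   # preset output queue of second (high-resolution) video frames
    low_res_tokens: List[int] = []    # empty before the first iteration

    while len(output_queue) < preset_num_frames:
        # Steps 201/204: predict the identification sequence of the second video frame at the
        # current moment from the text tokens (and, after the first pass, from the
        # identification sequence of the downsampled frame).
        high_res_tokens = predict_frame_tokens(text_tokens, low_res_tokens)
        # Step 202: reconstruct the second video frame and store it in the output queue.
        frame = vqvae_decode(high_res_tokens)
        output_queue.append(frame)
        # Step 203: downsample the frame and re-encode it to obtain the low-resolution
        # identification sequence that conditions the next iteration.
        low_res_tokens = vqvae_encode(downsample(frame))

    return output_queue  # the frames are finally assembled into the output video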
Fig. 2 is a schematic diagram of a model framework in the video generation method provided in the embodiment of the present invention. As shown in fig. 2, the text token sequence [[TXT], t] of the text to be inferred, where t = [t[1], t[2], …, t[N]], is input to the Transformer neural network Decoder model of the trained neural network video generation model, and the next token is generated step by step, each generated token being fed back as input for the next iterative computation;
meanwhile, in the first iteration a token sequence of a group of low-resolution video frames, namely the video token sequence [[VS], vs] of the first video frame with vs = [vs[1], vs[2], …, vs[M]], and a token sequence of a group of high-resolution video frames, namely the video token sequence [[VL], vl] of the second video frame with vl = [vl[1], vl[2], …, vl[L]], are generated. L, M and N can be determined according to the parameters of the actual computation and denote the lengths of the high-resolution video frame identification sequence, the low-resolution video frame identification sequence and the text identification sequence, respectively. [TXT], [VS] and [VL] denote the indicator symbols for text, low-resolution video frames and high-resolution video frames, respectively.
In this embodiment, the video token sequence of the first video frame is used to represent the visual information and to ensure the continuity of the generated video. The video token sequence of the second video frame is decoded by the VQVAE Decoder in the neural network video generation model into a high-resolution video frame, which is stored in the output queue. If the number of generated high-resolution frames has not reached the preset frame number, the high-resolution frame is downsampled to a low-resolution frame and token-encoded with the VQVAE Encoder, so that at the start of the second iteration the downsampled low-resolution token sequence is used as input; that is, the video token sequence of the first video frame required by the second iteration is obtained at this point;
this video token sequence of the first video frame is then input into the Transformer neural network Decoder for iterative computation. Each subsequent iteration only needs to predict the video token sequence of the second video frame at the next moment, so smooth high-resolution video frames can be generated continuously in this iterative manner. Once the number of high-resolution frames in the preset output queue reaches the preset frame number, the generated frames are integrated and a complete video is output, i.e., the video corresponding to the text to be inferred is obtained.
In the embodiment of the invention, the neural network video generation model is constructed by adopting the Transformer neural network decoder and the VQVAE framework, so that the trained neural network video generation model has better generalization capability and semantic understanding capability, and the iterative computation is carried out by adopting adjacent high-resolution and low-resolution video frames, thereby being beneficial to improving the precision of the trained model and obtaining the high-resolution video with good generalization and unlimited length.
The inventors have found through analysis of resolution that low-resolution video frames are consistent with high-resolution video frames in terms of semantics and scene composition, so low-resolution frames can serve as a consistency guide for high-resolution frames, reducing redundant information between frames and making it easier for the model to learn video generation. If adjacent high-resolution frames were used instead, the token sequence length would increase greatly, raising the training burden, and the model would tend to make the two frames overly similar in order to reduce the loss, falling into the problem of mode collapse.
In the embodiment of the invention, the low-resolution video frame and the high-resolution video frame are adopted, and the characteristics of the high-resolution video frame and the low-resolution video frame are combined, so that the redundancy of the video in time and space is fully utilized, and the video frames are generated in an iterative manner, thereby ensuring the continuity of the video, improving the flexibility of the sequence length and realizing the generation of the high-resolution long-sequence video.
According to the method provided by the embodiment of the invention, the text token sequence of the text to be inferred is input into the trained neural network video generation model. By combining the first video frame and the second video frame, that is, adjacent high-resolution and low-resolution video frames, the temporal and spatial redundancy of video frames is fully exploited and inter-frame redundant information is reduced, so that continuous video frames can be generated iteratively, the flexibility of the generated video sequence length is improved, and high-quality, high-resolution video with good generalization that matches the semantics of the text to be inferred is generated.
In some embodiments, before inputting the text identification sequence of the text to be inferred into the trained neural network video generation model, the method further comprises:
preprocessing a text real sample to be inferred to obtain a text identification sequence of the text real sample to be inferred, and preprocessing a video real sample to obtain an identification sequence of a first video frame and an identification sequence of a second video frame;
determining a target identification sequence based on the text identification sequence, the identification sequence of the first video frame and the identification sequence of the second video frame;
and acquiring a plurality of groups of target identification sequences, and training the neural network video generation model by using the plurality of groups of target identification sequences.
Specifically, the target identification sequence described in the embodiment of the present invention refers to a token sequence obtained by splicing a text identification sequence of a text real sample to be inferred, an identification sequence of a first video frame, and an identification sequence of a second video frame, and the identification sequence is used as training data input by a model.
In the embodiment of the invention, the real sample of the text to be inferred and the real sample of the video corresponding to the real sample of the text to be inferred need to be preprocessed, so as to obtain the text token sequence of the real sample of the text to be inferred, the video token sequence of the first video frame and the video token sequence of the second video frame in the real sample of the video.
Similarly, for the real text sample to be inferred, a word list is constructed, the text sentences in the real text sample are split into words of the constructed word list, and the position labels of these words in the word list are taken as identifier representations, so that the original text string information is converted into a token sequence, and the text identification sequence of the real text sample to be inferred is obtained.
For a real video sample corresponding to a real text sample to be inferred, firstly, frame extraction and compression processing can be performed on the real video sample according to a preset frame rate to obtain two groups of adjacent video frames with different resolutions, namely a first video frame and a second video frame, then, the first video frame and the second video frame are subjected to discretization coding to form a token form, and an identification sequence of the first video frame and an identification sequence of the second video frame are obtained.
In order to reduce the burden of subsequent model training calculation, video frames with two resolutions are adopted in the preprocessing stage of a real video sample, and a discretization token sequence of a low-resolution video frame, namely an identification sequence of a first video frame, and a discretization token sequence of a high-resolution video frame, namely an identification sequence of a second video frame, are extracted and used for training a subsequent neural network video generation model.
Further, after determining a text identification sequence of the real text sample to be inferred, an identification sequence of the first video frame and an identification sequence of the second video frame, sequentially splicing the text identification sequence, the identification sequence of the first video frame and the identification sequence of the second video frame to determine a target identification sequence.
In this embodiment, through the above operations, a plurality of groups of target identification sequences may be obtained according to a plurality of groups of text real samples to be inferred and corresponding video real samples, and the neural network video generation model is iteratively trained to obtain a trained neural network video generation model.
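As a concrete illustration of this training process, the following sketch shows one autoregressive training step on a batch of spliced target identification sequences. It assumes a PyTorch-style model that maps token ids of shape (batch, length) to next-token logits of shape (batch, length, vocabulary); the names and the use of cross-entropy with padding masking are illustrative assumptions, not a description of the exact training configuration of the invention.

import torch
import torch.nn.functional as F

def training_step(model, target_ids: torch.Tensor, pad_id: int, optimizer) -> float:
    # Next-token prediction over the spliced sequence [[TXT], t, [VS], vs, [VL], vl, [PAD]*n]:
    # position i is trained to predict token i + 1.
    inputs, labels = target_ids[:, :-1], target_ids[:, 1:]
    logits = model(inputs)                            # (batch, length - 1, vocabulary)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,                          # supplemental [PAD] symbols carry no loss
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()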
According to the method provided by the embodiment of the invention, the text token sequence, the video token sequence of the first video frame and the video token sequence of the second video frame are obtained by carrying out discretization coding on the text real sample to be inferred and the corresponding video real sample, iterative training is carried out on the neural network video generation model based on splicing of the multi-mode token sequences, the neural network video generation model is facilitated to better learn text semantic information and the multi-mode incidence relation of the text video, and therefore the video generation modeling capability is improved.
In some embodiments, the preprocessing the text real sample to be inferred to obtain the text identification sequence of the text real sample to be inferred includes:
and coding the real text sample to be inferred based on a byte pair coding method to obtain a text identification sequence of the real text sample to be inferred.
In particular, the Byte Pair Encoding (BPE) method is a common encoding method originating from data compression. In the embodiment of the invention, a word list is constructed based on the BPE method, the words, phrases and the like of the text sentences in the real text sample to be inferred are discretized and encoded, and their element identifiers are determined, thereby obtaining the text token sequence of the real text sample to be inferred.
Fig. 3 is a schematic diagram of a preprocessing flow of the video generation method provided by the embodiment of the present invention, and as shown in fig. 3, text sentences in a real sample of a text to be inferred are input to a preprocessing module for preprocessing, and the real sample of the text to be inferred is converted into a token sequence represented by an integer through quantization and token transformation.
In the embodiment of the invention, the preprocessing module can perform BPE word segmentation and coding on text sentences in the real sample of the text to be inferred based on a BPE method, and then map the text sentences into a form of element identification token.
For example, assuming the real text sample to be inferred is 'one person is running', it is split into the word pieces 'one', 'person', 'is', 'running', which are then mapped to the corresponding labels (i1, i2, i3, i4) of these pieces in the constructed word list. The token sequence t = [i1, i2, i3, i4] then represents the text sentence in the real text sample to be inferred, so that the original text string information is converted into an integer token sequence t ∈ R^(1×N).
According to the method provided by the embodiment of the invention, the BPE method is adopted to preprocess the real text sample to be inferred, so that the text identification sequence of the real text sample to be inferred can be efficiently obtained, and effective training sample data is provided for subsequent model training.
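The mapping from a text sentence to an integer token sequence can be illustrated with the following toy sketch; it only performs a whitespace split against a tiny hand-made word list, whereas the embodiment uses BPE segmentation over a full vocabulary.

word_list = {"one": 0, "person": 1, "is": 2, "running": 3}   # illustrative word list

def text_to_tokens(sentence: str, vocab: dict) -> list:
    # Split the sentence into pieces and replace each piece by its position label in the word list.
    return [vocab[piece] for piece in sentence.split()]

t = text_to_tokens("one person is running", word_list)
print(t)   # [0, 1, 2, 3] -- an integer token sequence t of length N = 4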
In some embodiments, preprocessing the real samples of video to obtain an identification sequence of the first video frame and an identification sequence of the second video frame includes:
randomly extracting adjacent video frames in the real video sample according to a preset frame rate to obtain a video frame at the previous moment of a second video frame and the second video frame;
down-sampling a video frame at the previous moment of the second video frame to obtain a first video frame;
and inputting the first video frame and the second video frame into a vector quantization self-encoder to obtain an identification sequence of the first video frame and an identification sequence of the second video frame.
Specifically, the preset frame rate described in the embodiment of the present invention refers to a fixed frame rate preset for extracting the number of real sample frames of the video, which may be set to 1 frame per second, or 2 frames per second, and so on, and is not limited in the embodiment of the present invention.
The video frame at the previous moment of the second video frame and the second video frame described in the invention are two groups of video frames which are continuous in time and obtained after frame extraction at a preset frame rate, and both the two groups of video frames are high-resolution video frames.
The adjacent video frames described in the embodiments of the present invention refer to temporally continuous video frame images.
Further, a video frame at a previous moment of the second video frame is downsampled, so that a low-resolution video frame corresponding to the video frame at the previous moment, that is, the first video frame, can be obtained.
It can be understood that when the preset frame rate is too small, the inter-frame redundancy information is large, the motion of the synthesized video is less, and when the preset frame rate is large, the synthesized video is not coherent and smooth enough, so in the embodiment of the present invention, different preset frame rates can be adopted according to the difference of the real sample data sets to ensure the suitability of the motion intensity between the extracted video frames. In a specific embodiment, to ensure that the generated video has proper motion intensity, the preset frame rate may be selected to be 2 frames per second.
In this embodiment, through a random frame extraction manner, every two adjacent video frames in each real video sample can be traversed as much as possible, so that continuity between the first video frame and the second video frame can be ensured, and it is ensured that continuity and time sequence relation between the video frames can be learned through subsequent model training.
In this embodiment, video frames may be discretized encoded into token form using a pre-trained vector quantization auto-encoder (VQVAE).
The VQVAE includes an encoder and a decoder, wherein the encoder is configured to compress a video, perform downsampling and feature extraction on the video, and map extracted features into a visual codebook, thereby implementing video token serialization; the decoder maps the video token sequence back to the corresponding features and upsamples to the original video frame.
Since images and videos contain a large number of pixels carrying continuous-valued information, they are not easily processed directly by a Transformer model. As shown in fig. 3, a real video sample is input to the preprocessing module, frames are extracted at the preset frame rate, and adjacent video frames in the real video sample are randomly extracted to obtain the video frame at the previous moment of the second video frame and the second video frame itself; the video frame at the previous moment is then downsampled by a given factor to obtain the first video frame. The first video frame is then encoded with the Encoder of the VQVAE, whose function is similar to that of BPE: by mapping into a visual codebook, it discretizes the first video frame into video tokens, yielding the token sequence of the first video frame. Similarly, the VQVAE also discretizes the second video frame to obtain the token sequence of the second video frame.
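The frame-pair extraction and downsampling described above can be sketched as follows, assuming the real video sample has already been decoded into a tensor of shape (T, C, H, W) at the preset frame rate and that a pretrained VQVAE encoder is available; the tensor layout, the bilinear downsampling and all names are illustrative assumptions.

import random
import torch
import torch.nn.functional as F

def sample_adjacent_pair(video: torch.Tensor, low_res: int = 128):
    # Randomly pick a temporally adjacent pair (t, t + 1) of high-resolution frames.
    t = random.randint(0, video.shape[0] - 2)
    prev_frame, second_frame = video[t], video[t + 1]
    # Downsample the frame at the previous moment to obtain the low-resolution first video frame.
    first_frame = F.interpolate(
        prev_frame.unsqueeze(0), size=(low_res, low_res),
        mode="bilinear", align_corners=False,
    ).squeeze(0)
    return first_frame, second_frame

# first_tokens  = vqvae_encoder(first_frame)    # identification sequence of the first video frame
# second_tokens = vqvae_encoder(second_frame)   # identification sequence of the second video frame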
More specifically, in the embodiment of the present invention, 2D VQVAE for image processing and 3D VQVAE for video processing may be employed, which are mainly different in whether information integration and down-sampling are performed on the time dimension.
Among them, 2D VQVAE considers only spatial information and treats each video frame as a separate image. For a video V ∈ R^(T×C×H×W), T is the temporal length of the video, which is related to the preset frame rate used for frame extraction: for a video of duration τ seconds sampled at frame rate fps, T = fps·τ. C is the number of video channels (3 for color video), and H and W are the height and width of the video frames, respectively. The invention uses videos of two resolutions, which differ in H and W;
in addition, since the 2D VQVAE does not take time into account, the down-sampling magnification in the time dimension is 1; depending on the number of down-sampling convolutional layers, different spatial down-sampling magnifications (1, f, f) are possible, and the higher the down-sampling magnification, the stronger the compression.
Whereas 3D VQVAE considers inter-frame interaction and compression in the time dimension, so its downsampling magnification can be set to (s, f, f). Accordingly, after 2D VQVAE encoding, a real video sample is converted into a video token sequence of shape T × (H/f) × (W/f); after 3D VQVAE processing, the resulting token sequence has shape (T/s) × (H/f) × (W/f).
In a specific embodiment, the present invention uses 3D VQVAE as a codec for real sample preprocessing of video, where a low resolution video frame, i.e., a first video frame, has a size of 128 × 128, a high resolution video frame, i.e., a second video frame, has a size of 256 × 256, and the down-sampling magnification of the 3D VQVAE is (4,8,8), the length of the video token sequence of each set of second video frames that can be finally obtained is L =1024, and the length of the video token sequence of each set of first video frames is M =256, so as to represent consecutive 4 frames of video.
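The sequence lengths quoted in this embodiment follow directly from the down-sampling magnification; the small helper below reproduces the arithmetic under the assumption that each group covers 4 consecutive frames.

def token_length(num_frames: int, height: int, width: int, s: int, f: int) -> int:
    # Temporal axis is compressed by s and each spatial axis by f; every remaining position is one token.
    return (num_frames // s) * (height // f) * (width // f)

L = token_length(4, 256, 256, s=4, f=8)   # high-resolution (second) video frames
M = token_length(4, 128, 128, s=4, f=8)   # low-resolution (first) video frames
print(L, M)   # 1024 256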
According to the method provided by the embodiment of the invention, the adjacent video frames of the real video samples are extracted randomly according to the preset frame rate, so that every two adjacent video frames in each real video sample can be traversed as much as possible, and the consistency and time sequence relation between the video frames can be learned through subsequent model training; meanwhile, the vector quantization-based self-encoder can more efficiently perform discretization encoding processing on the first video frame and the second video frame to obtain the identification sequence of the first video frame and the identification sequence of the second video frame, and is favorable for improving the training efficiency of subsequent model training.
In some embodiments, determining the target identification sequence based on the text identification sequence, the identification sequence of the first video frame, and the identification sequence of the second video frame comprises:
splicing the text identification sequence, the identification sequence of the first video frame and the identification sequence of the second video frame to obtain a spliced identification sequence;
and determining a target identification sequence based on the spliced identification sequence and the preset sequence length.
Specifically, the preset sequence length described in the embodiment of the present invention refers to a preset data-sequence length, used to ensure that the token sequences input to the model all have the same length, which facilitates parallel training on the training data.
In the embodiment of the invention, after the text token sequence t, the video token sequence vs of the first video frame and the video token sequence vl of the second video frame are obtained, they are spliced end to end in that order and identification indicator symbols are inserted: a text indicator is inserted at the head of the text token sequence, a low-resolution video frame indicator at the head of the video token sequence of the first video frame, and a high-resolution video frame indicator at the head of the video token sequence of the second video frame, yielding the spliced identification sequence. Thus, the spliced identification sequence can be expressed as [[TXT], t, [VS], vs, [VL], vl].
And further, performing data expansion on the spliced identification sequence according to the preset sequence length to obtain a target identification sequence.
In this embodiment, a run of padding symbols [PAD]×n (i.e., n supplementary [PAD] symbols, where [PAD] is the supplementary-token indicator) is appended after the spliced identification sequence, so that the target identification sequence used for training the neural network video generation model is obtained; it can be represented as [[TXT], t, [VS], vs, [VL], vl, [PAD]×n].
According to the method provided by the embodiment of the invention, the text identification sequence, the identification sequence of the first video frame and the identification sequence of the second video frame are spliced and padded, so that all data in the real sample data set are input to the model with the same length, which facilitates parallel training on the training data and improves the training efficiency of subsequent model training.
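As an illustration of the splicing and padding described above, a minimal sketch follows; the integer ids chosen for the [TXT], [VS], [VL] and [PAD] indicators and the preset length of 1400 are purely illustrative assumptions, not values fixed by the embodiment:

def build_target_sequence(text_ids, vs_ids, vl_ids, max_len, txt=0, vs_tok=1, vl_tok=2, pad=3):
    """Concatenate [TXT]+text tokens, [VS]+low-res frame tokens, [VL]+high-res
    frame tokens, then right-pad with [PAD] up to the preset sequence length."""
    seq = [txt] + list(text_ids) + [vs_tok] + list(vs_ids) + [vl_tok] + list(vl_ids)
    if len(seq) > max_len:
        raise ValueError("spliced sequence exceeds the preset sequence length")
    return seq + [pad] * (max_len - len(seq))

# e.g. 10 text tokens, M = 256 low-res tokens, L = 1024 high-res tokens
example = build_target_sequence(range(10), range(256), range(1024), max_len=1400)
assert len(example) == 1400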
In some embodiments, the training of the neural network video generation model using the plurality of sets of target identification sequences includes:
and training the neural network video generation model in an autoregressive mode according to any one group of target identification sequences, and obtaining the trained neural network video generation model when preset training conditions are met.
Specifically, the autoregressive mode described in the embodiment of the present invention means that a group of target identification sequences, i.e., a text token sequence, a group of video token sequences of a first video frame and a group of video token sequences of a second video frame spliced in order with indicator symbols inserted, is input into the Transformer neural network Decoder of the neural network video generation model, and the next token is predicted from the existing tokens; the training task is the reconstruction of the data's own information.
The preset training condition described in the embodiment of the present invention refers to a preset threshold condition for stopping training of the model, and for example, the preset training condition may be set as a threshold of the number of iterations, so as to ensure that training of the model can be stopped when the number of iterations in the training process reaches the preset number of iterations, and obtain a trained neural network video generation model.
It should be noted that a group described in the present invention refers to the smallest temporal unit after a video frame is discretized and encoded: with 2D VQVAE encoding, a group of video frames is a single frame; with 3D VQVAE encoding, a group contains as many frames as the temporal down-sampling rate, e.g., if the down-sampling rate is 4, a group consists of 4 video frames.
In an embodiment of the present invention, the neural network video generation model may be constructed based on a Transformer neural network Decoder and a VQVAE.
Fig. 4 is a schematic diagram of the model training flow of the video generation method provided in an embodiment of the present invention. As shown in fig. 4, when input to the Transformer neural network Decoder model, the target identification sequence [[TXT], t[1], …, t[N], [VS], vs[1], …, vs[M], [VL], vl[1], …, vl[L]] first passes through an Embedding layer (Text Embedding) for text and video coding, the result is then added to a position coding (Position Embedding) representing position information, and the sum is input to the Transformer neural network Decoder model;
further, in the embodiment of the present invention, in the self-attention phase of the Transformer, the commonly used triangular mask method is adopted to ensure that only already-generated tokens can be attended to when predicting the current token. The training task is autoregressive: the next token is predicted iteratively from the existing token information. In this iterative way the model continuously generates the text token sequence [t[1], …, t[N]] reconstructing the text real sample, the video token sequence [vs[1], …, vs[M]] of the first video frame of the video real sample and the video token sequence [vl[1], …, vl[L]] of the second video frame; when the preset number of iterations is reached, i.e., the preset training condition is met, the trained neural network video generation model is obtained.
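As an illustration of the triangular mask and the next-token objective, the following is a minimal PyTorch-style sketch; the `decoder`, `token_emb` and `pos_emb` modules are generic placeholders rather than the exact network of the embodiment:

import torch
import torch.nn.functional as F

def autoregressive_step(decoder, token_emb, pos_emb, target_ids):
    """One training step: embed tokens, add position encodings, apply a
    triangular mask so position i attends only to positions <= i, and
    predict token i+1 from the first i tokens."""
    seq_len = target_ids.size(1)
    positions = torch.arange(seq_len, device=target_ids.device)
    x = token_emb(target_ids) + pos_emb(positions)
    # strictly-upper-triangular -inf mask blocks attention to future tokens
    causal_mask = torch.triu(
        torch.full((seq_len, seq_len), float("-inf"), device=target_ids.device),
        diagonal=1)
    logits = decoder(x, mask=causal_mask)            # (batch, seq_len, vocab)
    # logits at position i predict the token at position i + 1
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           target_ids[:, 1:].reshape(-1))
    return loss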
in the embodiment of the present invention, the autoregressive task is divided into a text autoregressive task and a video autoregressive task.
The text autoregressive task is used to make the model learn language modeling capability: when the model can reconstruct the original sample text from the preceding tokens, it can be considered to have learned to model and analyze the language structure, and can therefore effectively understand and express the overall semantics. The corresponding objective function L_text may be a negative log-likelihood, defined as follows:

L_text = −E_{(t,v)∼D} [ Σ_i log p_θ( t_i | t_{<i} ) ]

where t_i denotes the text word currently to be predicted, i.e., the text token to be reconstructed; t_{<i} denotes the text words already predicted (real sample values are used during training to facilitate parallel computation); v denotes a video token, and since the text tokens precede the video tokens, the autoregression over text sentences cannot use information of the video modality; D denotes the real sample data set; θ denotes the model parameters; p_θ( t_i | t_{<i} ) denotes the probability of the value of the next, i.e., i-th, token given the first i−1 tokens; and E denotes the average expectation, i.e., the value of the objective function is averaged over the entire data set.
The video autoregressive task is used to make the model learn the capability of video modeling and generation, while fusing the text statement information to ensure that the generated video conforms to the text semantics. Specifically, the triangular mask method is still employed; since the text information is input before the video as prompt information, all text tokens and the already-generated video tokens can be attended to when generating a video token. The corresponding objective function L_video may be a negative log-likelihood, defined as follows:

L_video = −E_{(t,v)∼D} [ Σ_i log p_θ( v_i | t, v_{<i} ) ]

where v_i denotes the video token currently to be predicted, i.e., the video token to be reconstructed; t denotes the text tokens; v_{<i} denotes the video tokens already predicted using the preceding text tokens; and p_θ( v_i | t, v_{<i} ) denotes the probability of the value of the next, i.e., i-th, video token given all the preceding text tokens and the first i−1 video tokens.
Further, the total loss is the weighted sum of the text autoregressive loss and the video autoregressive loss:

L = β · L_text + L_video

where β is an adjustable hyper-parameter; β is typically set to a value greater than 1 so that the model preferentially learns the comprehension of text semantics. Experiments show that this makes it easier for the model to learn to reconstruct the visual tokens and improves the semantic relevance of the generated video.
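A minimal sketch of this weighted combination is given below, assuming the per-position losses are separated into text and video positions with a mask; the value β = 2.0 and the mask convention are illustrative assumptions, not values specified by the embodiment:

import torch
import torch.nn.functional as F

def combined_loss(logits, targets, is_text_pos, beta=2.0):
    """Weighted autoregressive loss: beta * L_text + L_video.
    `is_text_pos` marks target positions belonging to the text part;
    beta > 1 weights the text loss more heavily."""
    # per-position negative log-likelihood of the next token
    nll = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                          targets[:, 1:].reshape(-1), reduction="none")
    nll = nll.reshape(targets.size(0), -1)
    text_mask = is_text_pos[:, 1:].float()
    l_text = (nll * text_mask).sum() / text_mask.sum().clamp(min=1)
    l_video = (nll * (1 - text_mask)).sum() / (1 - text_mask).sum().clamp(min=1)
    return beta * l_text + l_video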
According to the method provided by the embodiment of the invention, the target identification sequence is input into the neural network video generation model for autoregressive training; by combining the high-resolution and low-resolution video frames in the target identification sequence, the temporal and spatial redundancy of the video frames is fully exploited, the inter-frame redundant information is effectively reduced, and the model training efficiency and model accuracy are improved.
It should be noted that the method proposed by the embodiment of the present invention is similar to a sparse attention mechanism, that is, only the current frame and the previous frame are focused on, but the embodiment of the present invention uses the high-resolution and low-resolution video frames to further reduce the sequence length and improve the model learning efficiency, which cannot be achieved by the sparse attention mechanism.
To facilitate demonstrating the advantages of the flexible iterative text-to-video generation proposed by the present invention, two methods common in the art are used for comparison in the embodiments of the present invention. The GODIVA model proposes a 3D sparse attention mechanism and realizes general text-to-video generation at low resolution with a fixed number of frames; the baseline model uses 2 high-resolution video frames to iteratively generate a video with a specified number of frames. In the embodiment of the invention it is found that GODIVA is limited by the sequence length and cannot generate high-resolution video, and its video definition is lower than that of the video generated in the embodiment of the invention. The baseline model's sequence length is too long, so the time needed to generate a long sequence increases greatly due to the quadratic complexity of the Transformer; in addition, using a high-resolution video frame as the coherence guide frame during iteration carries a large amount of redundant information, which increases the learning difficulty, and the model tends to keep two adjacent frames as similar as possible in order to reduce the loss, so the generated video lacks motion.
According to the method provided by the embodiment of the invention, by adopting high-resolution and low-resolution video frames, inter-frame redundant information is reduced and the model is forced to learn to generate coherent, semantically matched video frames using only the spatial and semantic information of the previous frame; a high-resolution video of better quality can thus be generated, and the motion between video frames is modeled better than with the baseline model.
The following describes the video generation apparatus provided by the present invention, and the video generation apparatus described below and the video generation method described above may be referred to in correspondence with each other.
Fig. 5 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present invention, as shown in fig. 5, including:
the preprocessing module 501 is configured to preprocess a text to be inferred to obtain a text identification sequence of the text to be inferred;
the video generation module 502 is configured to input the text identification sequence of the text to be inferred into the trained neural network video generation model, and generate a video corresponding to the text to be inferred;
the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold, the resolution of the first video frame is smaller than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame.
The video generating apparatus described in this embodiment may be used to implement the embodiments of the video generating method, and the principle and the technical effect are similar, which are not described herein again.
According to the video generation device provided by the embodiment of the invention, the text token sequence of the text to be inferred is input into the trained neural network video generation model, and the redundancy of the video frames in time and space is fully utilized by combining the first video frame and the second video frame, namely the adjacent high-resolution and low-resolution video frames, so that the redundancy information between frames is reduced, the continuous video frames can be generated in an iterative manner, meanwhile, the flexibility of the length of the generated video sequence is improved, and the generation of the high-quality video with good generalization and high resolution, which is matched with the semantics of the text to be inferred, is realized.
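To make the iterative generation concrete, the following is a minimal sketch of the inference loop; `transformer_decode`, `vqvae_decode`, `vqvae_encode` and `downsample` are hypothetical helpers standing in for the trained Transformer Decoder and the encoder/decoder of the vector quantization self-encoder:

def generate_video(text_ids, num_frames, transformer_decode, vqvae_decode,
                   vqvae_encode, downsample):
    """Iteratively generate high-resolution frame groups from a text token sequence."""
    output = []
    prev_low_res_ids = None            # no low-res guide frame at the first step
    while len(output) < num_frames:
        # predict the identification sequence of the next high-res frame group
        high_res_ids = transformer_decode(text_ids, prev_low_res_ids)
        frames = vqvae_decode(high_res_ids)          # high-res frames at the current moment
        output.extend(frames)
        # down-sample and re-encode to obtain the low-res guide for the next step
        prev_low_res_ids = vqvae_encode(downsample(frames))
    return output[:num_frames]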
Fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 6, the electronic device may include: a processor (processor) 610, a communication Interface (Communications Interface) 620, a memory (memory) 630 and a communication bus 640, wherein the processor 610, the communication Interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the video generation method provided by the above methods, the method comprising: preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred; inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred; the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold, the resolution of the first video frame is smaller than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame.
In addition, the logic instructions in the memory 630 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the video generation method provided by the above methods, the method comprising: preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred; inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred; the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold, the resolution of the first video frame is smaller than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a video generation method provided by the above methods, the method comprising: preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred; inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred; the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold, the resolution of the first video frame is smaller than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A method of video generation, comprising:
preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred;
inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred;
the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, wherein the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold value, the resolution of the first video frame is smaller than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame;
inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred, wherein the method comprises the following steps:
step 201, inputting the text identification sequence of the text to be inferred into a trained transform neural network Decoder model, and generating an identification sequence of a second video frame at the current moment corresponding to the text identification sequence of the text to be inferred;
step 202, inputting the identification sequence of the second video frame at the current moment into a decoder of a vector quantization self-encoder to obtain the second video frame at the current moment, and storing the second video frame at the current moment into a preset output queue;
step 203, down-sampling the second video frame at the current moment to obtain a first video frame at the current moment, and inputting the first video frame into an encoder of a vector quantization self-encoder to obtain an identification sequence of the first video frame;
step 204, inputting the identification sequence of the first video frame and the text identification sequence of the text to be inferred into a trained transform neural network Decoder model to generate an identification sequence of a second video frame at the next moment, and taking the identification sequence of the second video frame at the next moment as the identification sequence of the second video frame at the current moment;
repeating the step 202 to the step 204 until the number of second video frames stored in the preset output queue reaches a preset number of frames, and generating a video corresponding to the text to be inferred according to each second video frame in the preset output queue;
the trained neural network video generation model comprises the trained transform neural network Decoder model and an encoder and a Decoder of the vector quantization self-encoder.
2. The video generation method according to claim 1, wherein before inputting the text identification sequence of the text to be inferred into the trained neural network video generation model, the method further comprises:
preprocessing the text real sample to be inferred to obtain a text identification sequence of the text real sample to be inferred, and preprocessing the video real sample to obtain an identification sequence of the first video frame and an identification sequence of the second video frame;
determining a target identification sequence based on the text identification sequence, the identification sequence of the first video frame and the identification sequence of the second video frame;
and acquiring a plurality of groups of target identification sequences, and training a neural network video generation model by using the plurality of groups of target identification sequences.
3. The video generation method according to claim 2, wherein the preprocessing the real samples of the video to obtain the identification sequence of the first video frame and the identification sequence of the second video frame comprises:
randomly extracting adjacent video frames in the real video sample according to a preset frame rate to obtain a video frame at the previous moment of the second video frame and the second video frame;
down-sampling a video frame at the previous moment of the second video frame to obtain the first video frame;
and inputting the first video frame and the second video frame into a vector quantization self-encoder to obtain an identification sequence of the first video frame and an identification sequence of the second video frame.
4. The video generation method of claim 2, wherein determining a target identification sequence based on the text identification sequence, the identification sequence of the first video frame, and the identification sequence of the second video frame comprises:
splicing the text identification sequence, the identification sequence of the first video frame and the identification sequence of the second video frame to obtain a spliced identification sequence;
and determining the target identification sequence based on the spliced identification sequence and the preset sequence length.
5. The video generation method of claim 2, wherein the training of the neural network video generation model using the plurality of sets of target identification sequences comprises:
and training the neural network video generation model in an autoregressive mode according to any one group of the target identification sequences, and obtaining the trained neural network video generation model when preset training conditions are met.
6. The video generation method according to claim 2, wherein the preprocessing the text real sample to be inferred to obtain a text identification sequence of the text real sample to be inferred comprises:
and coding the real text sample to be inferred based on a byte pair coding method to obtain a text identification sequence of the real text sample to be inferred.
7. A video generation apparatus, comprising:
the system comprises a preprocessing module, a text identification module and a text recognition module, wherein the preprocessing module is used for preprocessing a text to be inferred to obtain a text identification sequence of the text to be inferred;
the video generation module is used for inputting the text identification sequence of the text to be inferred into a trained neural network video generation model to generate a video corresponding to the text to be inferred;
the trained neural network video generation model is obtained by training according to a text identification sequence of a text real sample to be inferred and an identification sequence of a video real sample corresponding to the text real sample to be inferred, wherein the identification sequence of the video real sample comprises an identification sequence of a first video frame and an identification sequence of a second video frame, the resolution of the second video frame is higher than a target resolution threshold value, the resolution of the first video frame is smaller than the resolution of the second video frame, and the first video frame is a video frame at the previous moment of the second video frame;
the video generation module is further specifically configured to:
step 201, inputting the text identification sequence of the text to be inferred into a trained transform neural network Decoder model, and generating an identification sequence of a second video frame at the current moment corresponding to the text identification sequence of the text to be inferred;
step 202, inputting the identification sequence of the second video frame at the current moment into a decoder of a vector quantization self-encoder to obtain the second video frame at the current moment, and storing the second video frame at the current moment into a preset output queue;
step 203, down-sampling the second video frame at the current moment to obtain a first video frame at the current moment, and inputting the first video frame into an encoder of a vector quantization self-encoder to obtain an identification sequence of the first video frame;
step 204, inputting the identification sequence of the first video frame and the text identification sequence of the text to be inferred into a trained transform neural network Decoder model to generate an identification sequence of a second video frame at the next moment, and taking the identification sequence of the second video frame at the next moment as the identification sequence of the second video frame at the current moment;
repeating the step 202 to the step 204 until the number of second video frames stored in the preset output queue reaches a preset number of frames, and generating a video corresponding to the text to be inferred according to each second video frame in the preset output queue;
the trained neural network video generation model comprises the trained transform neural network Decoder model and an encoder and a Decoder of the vector quantization self-encoder.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the video generation method according to any of claims 1 to 6 are implemented when the program is executed by the processor.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the video generation method according to any one of claims 1 to 6.
CN202210068441.XA 2022-01-20 2022-01-20 Video generation method and device, electronic equipment and storage medium Active CN114598926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210068441.XA CN114598926B (en) 2022-01-20 2022-01-20 Video generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210068441.XA CN114598926B (en) 2022-01-20 2022-01-20 Video generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114598926A CN114598926A (en) 2022-06-07
CN114598926B true CN114598926B (en) 2023-01-03

Family

ID=81804985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210068441.XA Active CN114598926B (en) 2022-01-20 2022-01-20 Video generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114598926B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757254B (en) * 2023-08-16 2023-11-14 阿里巴巴(中国)有限公司 Task processing method, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 variational self-encoder and video generation method combining generation countermeasure network
CN113051420A (en) * 2021-04-15 2021-06-29 山东大学 Robot vision man-machine interaction method and system based on text generation video
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting characters into video
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11587548B2 (en) * 2020-06-12 2023-02-21 Baidu Usa Llc Text-driven video synthesis with phonetic dictionary
CN114144790A (en) * 2020-06-12 2022-03-04 百度时代网络技术(北京)有限公司 Personalized speech-to-video with three-dimensional skeletal regularization and representative body gestures

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 variational self-encoder and video generation method combining generation countermeasure network
CN113051420A (en) * 2021-04-15 2021-06-29 山东大学 Robot vision man-machine interaction method and system based on text generation video
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting characters into video
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114598926A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN111916067A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN113158665A (en) Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN112348111B (en) Multi-modal feature fusion method and device in video, electronic equipment and medium
CN114598926B (en) Video generation method and device, electronic equipment and storage medium
Kim et al. L-verse: Bidirectional generation between image and text
Zhu et al. Multiscale temporal network for continuous sign language recognition
CN111653270A (en) Voice processing method and device, computer readable storage medium and electronic equipment
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
CN117291232A (en) Image generation method and device based on diffusion model
Cornia et al. Towards cycle-consistent models for text and image retrieval
CN114638905B (en) Image generation method, device, equipment and storage medium
CN116975347A (en) Image generation model training method and related device
CN116363563A (en) Video generation method and device based on images and texts
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN115408494A (en) Text matching method integrating multi-head attention alignment
CN115937641A (en) Method, device and equipment for intermodal joint coding based on Transformer
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
CN114626529A (en) Natural language reasoning fine-tuning method, system, device and storage medium
CN114329005A (en) Information processing method, information processing device, computer equipment and storage medium
CN117609553B (en) Video retrieval method and system based on local feature enhancement and modal interaction
CN116229332B (en) Training method, device, equipment and storage medium for video pre-training model
Yokota et al. Augmenting Image Question Answering Dataset by Exploiting Image Captions
Zhou et al. Lightweight Self-Attention Network for Semantic Segmentation
Xie et al. G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240419

Address after: Room 524, Automation Building, No. 95 Zhongguancun East Road, Haidian District, Beijing, 100190

Patentee after: BEIJING ZHONGZI SCIENCE AND TECHNOLOGY BUSINESS INCUBATOR CO.,LTD.

Country or region after: China

Address before: 100190 No. 95 East Zhongguancun Road, Beijing, Haidian District

Patentee before: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES

Country or region before: China

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240423

Address after: 200-19, 2nd Floor, Building B, Wanghai Building, No.10 West Third Ring Middle Road, Haidian District, Beijing, 100190

Patentee after: Zhongke Zidong Taichu (Beijing) Technology Co.,Ltd.

Country or region after: China

Address before: Room 524, Automation Building, No. 95 Zhongguancun East Road, Haidian District, Beijing, 100190

Patentee before: BEIJING ZHONGZI SCIENCE AND TECHNOLOGY BUSINESS INCUBATOR CO.,LTD.

Country or region before: China