CN117499711A - Training method, device, equipment and storage medium of video generation model

Info

Publication number
CN117499711A
Authority
CN
China
Prior art keywords
video
generation model
video generation
text
convolution
Prior art date
Legal status
Pending
Application number
CN202311486824.XA
Other languages
Chinese (zh)
Inventor
项进喜
张军
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311486824.XA
Publication of CN117499711A
Legal status: Pending

Classifications

    • G06N20/00 Machine learning
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/462 Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method, device and equipment for a video generation model, and a storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring at least one training sample, wherein each training sample comprises a sample video and a description text corresponding to the sample video; adding noise to the sample video to obtain a hidden space representation corresponding to the sample video; denoising the hidden space representation according to the description text by at least one denoising unit and then decoding to obtain a first prediction video, wherein at least two convolution mechanisms for the hidden space representation are adopted in the denoising unit; and adjusting parameters of the video generation model according to the difference between the sample video and the first prediction video to obtain a trained video generation model. The method improves the quality of the video that the video generation model generates from the description text.

Description

Training method, device, equipment and storage medium of video generation model
Technical Field
The present disclosure relates to the field of artificial intelligence (Artificial Intelligence, abbreviated as AI) technologies, and in particular, to a training method, apparatus, device, and storage medium for a video generation model.
Background
With advances in computing power, data scale, and architectural design, visual generation models are becoming an important research direction. The video generation model is one branch of visual generation models: given descriptive text as input, it can output high-fidelity video related to that text.
In the related art, a video generation model is generally built on a diffusion model: combined with the text features of the description text, the hidden space features (noise features) are denoised multiple times through the reverse process of the diffusion model (i.e., a denoising network) to obtain a predicted video. The basic building block of the denoising network, the UNet (denoising unit), includes a series of convolution layers (also referred to as convolution modules) and attention layers (also referred to as attention modules). The convolution layer generally performs a convolution in the time dimension (one-dimensional convolution) on the input feature, followed by a convolution in the spatial dimension (two-dimensional convolution).
However, directly performing a one-dimensional convolution followed by a two-dimensional convolution during denoising is a limited convolution mechanism: the video generated through the denoising process is prone to mismatch with the descriptive text, so the quality of the generated video is poor.
Disclosure of Invention
The embodiments of the application provide a training method, device and equipment for a video generation model, and a storage medium, which can improve the quality of generated video. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a training method of a video generation model, where the video generation model includes a denoising network, and the denoising network includes at least one denoising unit, and the method includes:
acquiring at least one training sample, wherein each training sample comprises a sample video and a description text corresponding to the sample video;
obtaining a hidden space representation corresponding to the sample video by adding noise to the sample video;
denoising the hidden space representation according to the description text by the at least one denoising unit, and then decoding to obtain a first prediction video; at least two convolution mechanisms aiming at the hidden space characterization are adopted in the denoising unit;
and adjusting parameters of the video generation model according to the difference between the sample video and the first prediction video to obtain a trained video generation model.
According to an aspect of an embodiment of the present application, there is provided a video generating method based on a video generating model, the video generating model including a denoising network, the denoising network including at least one denoising unit, the method including:
Acquiring descriptive text for generating video;
acquiring a hidden space representation, wherein the hidden space representation is used for representing noise distribution before video generation;
denoising the hidden space representation according to the description text by using the at least one denoising unit, and then decoding to obtain a prediction video corresponding to the description text; at least two convolution mechanisms aiming at the hidden space characterization are adopted in the denoising unit.
According to an aspect of an embodiment of the present application, there is provided a training apparatus for a video generation model, the video generation model including a denoising network, the denoising network including at least one denoising unit, the apparatus including:
the sample acquisition module is used for acquiring at least one training sample, and each training sample comprises a sample video and a description text corresponding to the sample video;
the video noise adding module is used for obtaining hidden space representation corresponding to the sample video by adding noise to the sample video;
the video generation module is used for denoising the hidden space representation according to the description text through the at least one denoising unit and then decoding to obtain a first prediction video; at least two convolution mechanisms aiming at the hidden space characterization are adopted in the denoising unit;
And the parameter adjustment module is used for adjusting the parameters of the video generation model according to the difference between the sample video and the first prediction video to obtain a trained video generation model.
According to an aspect of an embodiment of the present application, there is provided a video generating apparatus based on a video generating model, the video generating model including a denoising network including at least one denoising unit therein, the apparatus including:
the acquisition module is used for acquiring descriptive text for generating video;
the representation acquisition module is used for acquiring a hidden space representation, wherein the hidden space representation is used for representing noise distribution before video generation;
the video generation module is used for denoising the hidden space representation according to the description text through the at least one denoising unit and then decoding to obtain a prediction video corresponding to the description text; at least two convolution mechanisms aiming at the hidden space characterization are adopted in the denoising unit.
According to an aspect of the embodiments of the present application, there is provided a computer device, including a processor and a memory, in which a computer program is stored, the computer program being loaded and executed by the processor to implement the training method of the video generation model or the video generation method based on the video generation model.
According to an aspect of the embodiments of the present application, there is provided a computer readable storage medium having stored therein a computer program loaded and executed by a processor to implement the training method of the video generation model described above, or the video generation method based on the video generation model described above.
According to an aspect of embodiments of the present application, there is provided a computer program product comprising a computer program loaded and executed by a processor to implement the above-described training method of a video generation model, or the above-described video generation method based on a video generation model.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
in the video generation process, denoising the hidden space representation according to the description text by at least one denoising unit, and then decoding to obtain a predicted video; at least two convolution mechanisms aiming at hidden space characterization are adopted in the denoising unit. By improving the convolution mechanism of the denoising unit in the denoising network, the matching degree between the generated video and the description text can be improved, and the effect of generating the video is improved.
Drawings
FIG. 1 is a schematic diagram of an implementation environment for an embodiment provided herein;
FIG. 2 is a schematic diagram of a training method for a video generation model provided in one embodiment of the present application;
FIG. 3 is a flow chart of a training method for a video generation model provided in one embodiment of the present application;
FIG. 4 is a flow chart of a training method for a video generation model provided in another embodiment of the present application;
FIG. 5 is a schematic diagram of a convolutional layer provided by one embodiment of the present application;
FIG. 6 is a schematic diagram of a process flow of motion excitation provided by one embodiment of the present application;
FIG. 7 is a schematic illustration of an attention layer provided by one embodiment of the present application;
FIG. 8 is a flow chart of a training method for a video generation model provided in another embodiment of the present application;
FIG. 9 is a schematic diagram of a training method for a decoding network provided in one embodiment of the present application;
FIG. 10 is a schematic diagram of an adapter and control network provided by one embodiment of the present application;
FIG. 11 is a schematic diagram of a training process for a video generation model provided in one embodiment of the present application;
FIG. 12 is a flow chart of a video generation method based on a video generation model provided in one embodiment of the present application;
FIG. 13 is a schematic diagram of a usage flow of a video generation model provided in one embodiment of the present application;
FIG. 14 is a block diagram of a training apparatus for a video generation model provided in one embodiment of the present application;
FIG. 15 is a block diagram of a video generation device based on a video generation model provided in another embodiment of the present application;
fig. 16 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before the technical scheme of the application is described, some background knowledge related to the application is described. The following related technologies may be optionally combined with the technical solutions of the embodiments of the present application, which all belong to the protection scope of the embodiments of the present application. Embodiments of the present application include at least some of the following.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of perception, reasoning and decision-making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include natural language processing, machine learning/deep learning, and other directions.
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize and measure objects and perform other machine vision tasks, and further processes the images into a form more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important innovation to the development of computer vision: pre-training models in the vision field, such as Swin-Transformer, ViT (Vision Transformers), V-MOE (Vision Mixture of Experts), and MAE (Masked Auto Encoder), can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (three-dimensional) techniques, virtual reality, augmented reality, simultaneous localization and mapping, and the like, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
With the research and advancement of artificial intelligence technology, it is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, digital twins, virtual humans, robots, artificial intelligence generated content (Artificial Intelligence Generated Content, abbreviated as AIGC), conversational interaction, smart medicine, smart customer service, game AI, Virtual Reality (VR), and Augmented Reality (AR). It is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
The scheme provided by the embodiments of the application relates to artificial intelligence technologies such as computer vision and deep learning. In the embodiments of the application, a video generation model is trained using description texts and sample videos, and the trained video generation model can generate videos from description texts. The following examples are provided to illustrate this.
Before describing the technical scheme of the application, some nouns related to the application are explained. The following related explanations may be arbitrarily combined with the technical solutions of the embodiments of the present application as alternatives, which all belong to the protection scope of the embodiments of the present application. Embodiments of the present application include at least some of the following.
The Pre-Training Model (PTM), also called a foundation model or large model, is a deep neural network (Deep Neural Network, DNN) with a large number of parameters that is trained on massive unlabeled data. The PTM uses the function-approximation capability of the large-parameter DNN to extract common features from the data, and is adapted to downstream tasks through techniques such as fine-tuning, parameter-efficient fine-tuning, and prompt-tuning. Therefore, the pre-training model can achieve good results in few-shot or zero-shot scenarios. PTMs can be classified according to the data modality they process into language models, visual models (Swin-Transformer, ViT, V-MOE), speech models, multimodal models, and so on, where a multimodal model refers to a model that establishes feature representations of two or more data modalities. The pre-training model is an important tool for artificial intelligence to output generated content, and can also serve as a general interface connecting multiple specific task models. The text encoding network, image encoding network, noise adding network, encoding network, decoding network, etc. in the embodiments of the application may be considered pre-training models.
Referring to fig. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The scenario implementation environment may include a model training apparatus 10 and a model using apparatus 20.
The model training device 10 may be an electronic device such as a mobile phone, a desktop computer, a tablet computer, a notebook computer, a vehicle-mounted terminal, a server, an intelligent robot, an intelligent television, a multimedia playing device, or some other electronic device with a relatively high computing power, which is not limited in this application. Model training apparatus 10 is used to train video generation model 30.
In the present embodiment, the video generation model 30 is a machine learning model. Alternatively, the model training apparatus 10 may train the video generation model 30 in a machine learning manner so that it has better performance. Optionally, the training process of the video generation model 30 is as follows (only briefly described herein, and specific training process is described in the following embodiments, which are not described here): and constructing at least one training sample according to the acquired description text and the sample video corresponding to the description text. In some embodiments, N denoising units are included in video generation model 30, each denoising unit including a convolution layer and an attention layer, wherein the convolution layer includes at least two convolution mechanisms. In some embodiments, both the convolution layer and the attention layer are machine learning models. In some embodiments, the latent space representation obtained based on the sample video is denoised according to the descriptive text by at least one denoising unit, and the prediction video is obtained after decoding. In some embodiments, parameters of the video generation model are adjusted according to differences between the predicted video and the sample video to obtain a trained video generation model.
In some embodiments, the model-using device 20 may be an electronic device such as a mobile phone, a desktop computer, a tablet computer, a notebook computer, a vehicle-mounted terminal, a server, a smart robot, a smart television, a multimedia playing device, or some other electronic device with a relatively high computing power, which is not limited in this application. Illustratively, trained video generation model 30 may be used to generate predictive video consistent with descriptive text from descriptive text.
The model training apparatus 10 and the model using apparatus 20 may be two independent apparatuses or the same apparatus.
In the method provided by the embodiment of the application, the execution subject of each step may be a computer device, and the computer device refers to an electronic device with data computing, processing and storage capabilities. When the electronic device is a server, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. The computer device may be the model training device 10 of fig. 1 or the model using device 20.
Referring to fig. 2, a schematic diagram of a training method of a video generation model according to an embodiment of the present application is shown.
As shown in fig. 2, the training process of the video generation model includes at least one of a first training 210, a second training 220, and a third training 230.
Illustratively, the specific training procedure of the first training 210 is as follows: and inputting the descriptive text into a video generation model to obtain a first predicted video. And adjusting parameters of a denoising network of the video generation model according to the difference between the first predicted video and the sample video to obtain the video generation model after the first training. Illustratively, the other networks in the video generation model than the denoising network are pre-trained networks, and do not participate in parameter adjustment.
Illustratively, the specific training procedure of the second training 220 is as follows: firstly, an adapter and a control network are added in a video generation model based on the video generation model after the first training. Illustratively, the training samples need to be reconstructed at this point. The training samples in the second training process are also called supplementary training samples, and the supplementary training samples comprise descriptive text, sample videos and visual information corresponding to the sample videos. Illustratively, the descriptive text is input into the text encoding network, while the visual information is input into the adapter, passed through the control network, and finally input into the denoising network to participate in the denoising process. In some embodiments, the output of the video generation model during the second training process is a third predictive video. And adjusting parameters of an adapter and a control network of the video generation model according to the difference between the third predicted video and the sample video to obtain the video generation model after the second training. Illustratively, only parameters of the adapter and the control network are changed during the second training.
Illustratively, the specific training procedure for the third training 230 is as follows: first, based on the video generation model after the second training, a space-time compensation module (also referred to as a time-sequence compensation module) is added in the decoding network of the video generation model. Illustratively, the training samples do not need to be reconstructed at this time; the supplementary training samples from the second training process are still used. Illustratively, the descriptive text is input into the text encoding network, while the visual information is input into the adapter, passed through the control network, and finally input into the denoising network to participate in the denoising process. In some embodiments, the output of the video generation model during the third training process is the second predicted video. Parameters of the decoding network of the video generation model are adjusted according to the difference between the second predicted video and the sample video and the difference between features generated in the encoding and decoding processes (described in detail in the following embodiments and not repeated here), so as to obtain the video generation model after the third training. Illustratively, only parameters of the decoding network (including the space-time compensation module) are changed during the third training process.
Of course, if the input is descriptive text only, only the first training process and the third training process need to be combined, and the second training is not performed. First, based on the video generation model after the first training, a space-time compensation module is added in the decoding network of the video generation model. Illustratively, the training samples do not need to be reconstructed at this time; the training samples from the first training process are still used. Illustratively, the descriptive text is input into the text encoding network, and the result is finally decoded via the denoising network to obtain a fourth predicted video. Parameters of the decoding network of the video generation model are adjusted according to the difference between the fourth predicted video and the sample video and the difference between features generated in the encoding and decoding processes, so as to obtain the video generation model after the third training. Illustratively, only parameters of the decoding network (including the space-time compensation module) are changed during the third training process.
The specific flow of the application process is illustratively as follows: and inputting the description text into the video generation model after the first training to obtain a predicted video. Or, inputting the descriptive text and the visual information into a video generation model after the second training to obtain a predicted video. Or, inputting the descriptive text and the visual information into a video generation model after the third training to obtain a predicted video.
Referring to fig. 3, a flowchart of a training method of a video generation model according to an embodiment of the present application is shown. The subject of execution of the steps of the method may be the model training apparatus described above. In the following method embodiments, for convenience of description, only the execution subject of each step is described as "computer device". The method may comprise at least one of the following steps (310-340).
At step 310, at least one training sample is obtained, where each training sample includes a sample video and descriptive text corresponding to the sample video.
Description text: for describing the content of the sample video. In the embodiment of the present application, the description text corresponding to the sample video may be a real text input by a user, or may be a text extracted from the sample video through a model. Of course, the number of words describing the text, the display type, the display style, and the like are not limited in the embodiment of the present application. The descriptive text may characterize the overall scene characteristics of the sample video, may also characterize characteristics for the primary objects in the sample video, and is not limited in this regard.
Sample video: a collective term for a plurality of consecutive video frames. In the embodiment of the present application, the number of frames of a picture, the size of a picture, and the like in a sample video are not limited. In some embodiments, the sample video is a real video, but may also be a composite video.
And 320, obtaining the hidden space representation corresponding to the sample video by adding noise to the sample video.
In some embodiments, a random noise video is generated based on random numbers. Illustratively, the random noise video has the same size as the sample video; for each frame, the pixel value at each position in the noisy sample video is the sum of the pixel values at the corresponding positions in the sample video and the random noise video.
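By way of illustration, a minimal sketch of this pixel-wise noising is given below; the function name and the (T, C, H, W) tensor layout are assumptions made for the sketch rather than details taken from the application.

```python
import torch

def add_random_noise(sample_video: torch.Tensor) -> torch.Tensor:
    """sample_video: float tensor of shape (T, C, H, W), one entry per frame.

    A random noise video of the same size is generated, and each pixel of the
    noisy sample video is the sum of the pixel values at the corresponding
    positions in the sample video and the random noise video."""
    noise_video = torch.randn_like(sample_video)  # random noise video, same size as the sample video
    return sample_video + noise_video             # per-pixel sum
```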
In some embodiments, the hidden space representation corresponding to the noisy sample video, that is, the hidden space representation corresponding to the sample video, is generated through the forward process of a diffusion model. In some embodiments, the forward process of the diffusion model, also referred to as the diffusion process, is used to successively add noise to the input data until the input data approaches pure noise. The diffusion process as a whole may be, for example, a parameterized Markov chain. It should be noted that the diffusion model in the embodiments of the application is a pre-trained diffusion model and already has a certain capability of generating video from a noise video. The model structure and model parameters of the diffusion model may adopt an open-source structure and parameters, which the application does not limit, and the pre-training process of the diffusion model is not described in detail here.
In some embodiments, the noisy sample video is encoded by an encoding network to obtain an initial feature vector of the noisy sample video (i.e., the encoded video features mentioned below); the initial feature vector is then noised T times through the forward process of the diffusion model to generate the hidden space representation corresponding to the noisy sample video, where T is a positive integer. In some embodiments, the initial feature vector of the noisy sample video is used as the input data of the forward process of the diffusion model; noise is added to it successively through the diffusion process, so that it gradually loses its features, and after T rounds of noise addition it becomes a hidden space representation without any video features. That is, the hidden space representation refers to a representation of a pure noise video corresponding to the noisy video, containing no video features. The hidden space representation may take the form of a vector or a matrix.
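A minimal sketch of this T-step noising is given below, assuming a standard DDPM-style closed-form forward process with a pre-computed cumulative noise schedule; the function and variable names, and the use of the closed-form jump to step t, are illustrative assumptions rather than details from the application.

```python
import torch

def forward_diffusion(z0: torch.Tensor, t: torch.Tensor,
                      alphas_cumprod: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Noise the encoded video feature z0 up to diffusion step t in closed form.

    z0:             (B, C, T, H, W) initial feature vector from the encoding network
    t:              (B,) diffusion step indices in [0, T)
    alphas_cumprod: (T,) cumulative products of the noise schedule
    """
    noise = torch.randn_like(z0)                    # Gaussian noise added at this step
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)  # broadcast over the feature dimensions
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
    return z_t, noise   # at step T the result approaches the pure-noise hidden space representation
```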
In some embodiments, the forward process of the diffusion model performs T rounds of noise addition on the initial feature vector to generate the hidden space representation, and the backward process of the diffusion model performs T rounds of denoising on the hidden space representation according to the text representation corresponding to the descriptive text, so as to obtain a denoised hidden space representation. The backward process of the diffusion model is used to successively remove noise from the input data according to constraint conditions, thereby generating a video. The backward process of the diffusion model as a whole may also be, for example, a parameterized Markov chain. In some embodiments, the hidden space representation and the text representation corresponding to the descriptive text are used as input data of the backward process of the diffusion model, and the backward process performs successive denoising on the hidden space representation under the constraint of the text representation, so that the generated predicted video meets the constraint imposed by the text representation.
Step 330, denoising the hidden space representation according to the description text by at least one denoising unit, and then decoding to obtain a first prediction video; at least two convolution mechanisms aiming at hidden space characterization are adopted in the denoising unit.
The denoising network in the embodiments of the application may also be understood as the backward process of the diffusion model. In some embodiments, each of the T denoising passes over the hidden space representation corresponds to one denoising subnet.
In some embodiments, the denoising network in the diffusion model includes T denoising subnets, each including a downsampling network and an upsampling network, and the T denoising subnets are connected in series. The backward process of the diffusion model denoises the hidden space representation once according to the text representation, that is, the hidden space representation is denoised by one denoising subnet, and the denoised hidden space representation is obtained after T rounds of denoising. The forward process of the diffusion model noises the initial feature vector T times to generate the hidden space representation Z_T corresponding to the sample video. The hidden space representation Z_T and the text representation are used as the input data of the downsampling network of the denoising subnet; the input data of the upsampling network is obtained from the output data of the downsampling network; and the upsampling network obtains the output feature Z_{T-1}' after one round of denoising according to the text representation and its input data. The denoised hidden space representation Z' is then obtained through the remaining T-1 denoising subnets, and Z' is decoded by a decoding network to generate the output image Y.
Specifically, in the i-th round of denoising, the text representation and the i-th input representation are input into the downsampling network of the i-th denoising subnet to obtain the output data of that downsampling network. The i-th input representation is the hidden space representation after i-1 rounds of denoising, and the 1st input representation is the hidden space representation itself. The text representation and the i-th input representation are input into the downsampling network of the i-th denoising subnet, and the i-th input representation is denoised based on the text representation to obtain the output data of the downsampling network of the i-th denoising subnet. In some embodiments, the downsampling network of the i-th denoising subnet includes N cascaded network elements, where N is an integer greater than 1. In some embodiments, a network element refers to a denoising unit including the convolution layer, and the i-th denoising subnet includes N cascaded denoising units, M cascaded residual modules, and one spatial transformer. In some embodiments, the downsampling network includes 3 cascaded denoising units, 3 cascaded residual modules, and one spatial transformer, and the upsampling network includes 3 cascaded residual modules and 3 cascaded denoising units. In the i-th round of denoising, the text representation and the i-th input representation are used as the input data of the downsampling network of the i-th denoising subnet, and the output data of the spatial transformer of the downsampling network of the i-th denoising subnet is obtained.
In some embodiments, a plurality of denoising units are included in each denoising subnet. In some embodiments, at least one convolution layer is included in one denoising unit, and the convolution layer adopts at least two convolution mechanisms for hidden space characterization.
Illustratively, a first of the two convolution mechanisms includes: the information of the spatial dimension in the input feature is subjected to convolution processing (two-dimensional convolution), and then the information of the temporal dimension in the convolved feature is subjected to convolution processing (one-dimensional convolution). Or, the convolution processing (one-dimensional convolution) is performed on the information of the time dimension in the input feature, and then the convolution processing is performed on the information of the space dimension in the convolved feature. Or, the information of the space dimension and the information of the time dimension in the input features are respectively convolved, and then added to obtain the output features. The information of the time dimension mentioned in the embodiment of the present application refers to the information of the feature in the dimension of "frame number", and the information of the space dimension mentioned in the embodiment of the present application refers to the information of the feature in the two dimensions of "length and width of picture or image".
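As an illustration of the first convolution mechanism, the following sketch applies a two-dimensional spatial convolution per frame followed by a one-dimensional temporal convolution per spatial position; the class name, kernel sizes, and channel handling are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    """One variant of the first convolution mechanism: a 2D convolution over the
    spatial dimensions followed by a 1D convolution over the time (frame) dimension."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))   # spatial 2D conv per frame
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)                                                 # temporal 1D conv per position
        y = y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)                  # back to (B, C, T, H, W)
        return y
```

The one-dimensional-first and parallel-then-add variants differ only in the order (or combination) in which the two convolutions are applied.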
Illustratively, the second of the two convolution mechanisms includes: the method comprises the steps of firstly compressing information of space dimension in the input characteristics, and then carrying out convolution processing on the information of the space dimension. Or compressing the time dimension information in the input characteristics, and then carrying out convolution processing on the space dimension information. Or, calculating the difference between the feature corresponding to the t-th frame image in the input features and the feature corresponding to the t-1 th frame image in the input features after convolution, wherein t is a positive integer. The following embodiments are specifically described for three types of the second convolution mechanism, which are not described herein.
And step 340, adjusting parameters of the video generation model according to the difference between the sample video and the first prediction video to obtain a trained video generation model.
In some embodiments, the sample video and the first prediction video are uniform in size, i.e., have the same number of frames, the same length and width of the image. Accordingly, the loss function value is determined based on a difference in pixel value of each pixel point in the image of the corresponding frame in the sample video and the first prediction video.
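A minimal sketch of such a loss is shown below; using the mean squared error over the per-pixel differences is an assumption, since the application only states that the loss value is determined from the pixel-value differences of corresponding frames.

```python
import torch
import torch.nn.functional as F

def video_reconstruction_loss(sample_video: torch.Tensor,
                              predicted_video: torch.Tensor) -> torch.Tensor:
    """Both videos have the same shape (B, T, C, H, W): the same number of frames
    and the same image height and width. The loss is computed from the per-pixel
    differences between corresponding frames (mean squared error assumed)."""
    assert sample_video.shape == predicted_video.shape
    return F.mse_loss(predicted_video, sample_video)

# Illustrative usage during training (optimizer choice is an assumption):
# loss = video_reconstruction_loss(sample_video, first_predicted_video)
# loss.backward()
# optimizer.step()
```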
The embodiment of the present application does not limit the manner of parameter adjustment. Illustratively, parameters of the video generation model are adjusted by minimizing a loss function value determined from a difference between the sample video and the first predicted video as a goal to obtain a trained video generation model. Illustratively, parameters of the video generation model are adjusted according to the loss function value in a reverse gradient propagation mode, so that a trained video generation model is obtained. Illustratively, parameters of the video generation model are adjusted according to the loss function value in a forward gradient propagation mode, so that a trained video generation model is obtained.
In the technical scheme provided by the embodiment of the application, in the video generation process, after the hidden space characterization is denoised according to the description text by at least one denoising unit, the prediction video is obtained by decoding; at least two convolution mechanisms aiming at hidden space characterization are adopted in the denoising unit. By improving the convolution mechanism of the denoising unit in the denoising network, the matching degree between the generated video and the description text can be improved, and the effect of generating the video is improved.
Referring to fig. 4, a flowchart of a training method of a video generation model according to another embodiment of the present application is shown. The subject of execution of the steps of the method may be the model training apparatus described above. In the following method embodiments, for convenience of description, only the execution subject of each step is described as "computer device". The method may comprise at least one of the following steps (410-460).
At step 410, at least one training sample is obtained, and each training sample includes a sample video and a description text corresponding to the sample video.
And step 420, obtaining the hidden space representation corresponding to the sample video by adding noise to the sample video.
And 430, encoding the description text through a text encoding network to obtain a text representation corresponding to the description text.
The text encoding network in the embodiment of the present application may be a CLIP model (one of multi-modal models), or may be another pre-training model, which is not limited in this application. In some embodiments, descriptive text is input to a text encoding network, and a text representation corresponding to the descriptive text is generated. The dimensions, sizes, etc. of the text representations are not limited by the embodiments of the present application. Illustratively, the text token is a text vector, a text matrix, or the like.
In some embodiments, when adjusting the parameters of the video generation model, the application keeps the parameters of the text encoding network unchanged and only adjusts the parameters of the other modules in the video generation model. In this way, the accuracy of text representation extraction is preserved as much as possible while the training cost of the model is reduced.
Step 440, denoising the hidden space representation according to the text representation by at least one denoising unit to obtain a first denoised hidden space representation.
In some embodiments, the denoising unit includes a convolution layer, where the convolution layer includes a first convolution sub-layer and at least one second convolution sub-layer, where the first convolution sub-layer and the second convolution sub-layer correspond to different convolution mechanisms.
In some embodiments, the first convolution sub-layer is configured to perform a convolution in the spatial dimension (two-dimensional convolution) and then a convolution in the temporal dimension (one-dimensional convolution) on the features input into the convolution layer. Alternatively, it is configured to perform a convolution (one-dimensional convolution) on the temporal-dimension information of the input features and then a convolution on the spatial-dimension information of the convolved features. Alternatively, it is configured to convolve the spatial-dimension information and the temporal-dimension information of the input features separately and then add the results to obtain the output features.
In some embodiments, the second convolution sub-layer is configured to compress the features input into the convolution layer in the temporal dimension and then convolve the compressed features over the spatial dimensions. In some embodiments, the second convolution sub-layer is configured to compress the features input into the convolution layer in the spatial dimensions and then convolve the compressed features over the temporal dimension. In some embodiments, the processing of the second convolution sub-layer may be expressed as z_SPE = x̃ ⊙ M_S, where x̃ denotes the input features, F denotes the features after temporal or spatial compression, M_S = sigmoid(Conv_3D(F)) denotes the excitation mask, and z_SPE denotes the output features of the second convolution sub-layer. In this case, the second convolution sub-layer may also be referred to as temporal excitation or spatial excitation.
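A sketch of the temporal-excitation variant is given below; the use of average pooling for the temporal compression, the kernel size, and the purely multiplicative gating (no residual term) are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

class TemporalExcitation(nn.Module):
    """Second convolution sub-layer, temporal-excitation variant (sketch):
    compress the input along the time dimension, convolve the compressed feature,
    turn it into an excitation mask M_S with a sigmoid, and modulate the input."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        f = x.mean(dim=2, keepdim=True)      # F: time dimension compressed (average pooling assumed)
        mask = torch.sigmoid(self.conv(f))   # M_S = sigmoid(Conv3D(F))
        return x * mask                      # z_SPE = x ⊙ M_S, broadcast over the T frames
```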
In some embodiments, the second convolution sub-layer is configured to compress the features input into the convolution layer along the channel axis and then apply a three-dimensional convolution over time and space to the compressed features. The processing of the second convolution sub-layer may be expressed as z_C = x̃ ⊙ sigmoid(Conv_3D(C_3)), where x̃ denotes the input features, C_3 denotes the features after compression along the channel axis, and z_C denotes the output features of the second convolution sub-layer. In this case, the second convolution sub-layer may also be referred to as channel excitation.
In some embodiments, the second convolution sub-layer is configured to obtain difference information between two adjacent frames in the temporal dimension of the features input into the convolution layer. In some embodiments, the processing of the second convolution sub-layer may be expressed as m_t = z̃_t − K_2D(z̃_{t−1}), where z̃_t denotes the feature information of the t-th frame, z̃_{t−1} denotes the feature information of the (t−1)-th frame, and K_2D denotes a two-dimensional convolution applied to z̃_{t−1}. In this case, the function of the second convolution sub-layer is also referred to as motion excitation.
In some embodiments, as shown in FIG. 5, a schematic diagram of a convolutional layer is shown. The convolution layers include a first convolution sub-layer 510 and three second convolution sub-layers, namely a second convolution sub-layer 520, a second convolution sub-layer 530, and a second convolution sub-layer 540.
In some embodiments, the processing flow of motion excitation is shown as 600 in FIG. 6, where z̃_t and z̃_{t−1} are the frame features described above. The dimensions of the input feature are (B, C, T, H, W), where B denotes the batch size, C the channel information, T the number of frames, H the height of the image, and W the width (length) of the image.
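A sketch of the motion-excitation computation on a (B, C, T, H, W) feature is given below; following the reconstruction above, the two-dimensional convolution K_2D is applied to the (t−1)-th frame feature, and the motion feature of the first frame is zero-filled — both choices are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Motion-excitation sketch: the motion feature of frame t is the difference
    between the feature of frame t and a 2D-convolved feature of frame t-1,
    m_t = z_t - K_2D(z_{t-1}); the first frame has no predecessor and is zero-filled."""

    def __init__(self, channels: int):
        super().__init__()
        self.k2d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        prev = x[:, :, :-1]                                            # frames 0 .. T-2
        conv_prev = self.k2d(prev.permute(0, 2, 1, 3, 4).reshape(-1, c, h, w))
        conv_prev = conv_prev.reshape(b, t - 1, c, h, w).permute(0, 2, 1, 3, 4)
        motion = x[:, :, 1:] - conv_prev                               # m_t for t = 1 .. T-1
        zero = torch.zeros_like(x[:, :, :1])                           # pad frame 0
        return torch.cat([zero, motion], dim=2)                        # (B, C, T, H, W)
```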
In some embodiments, the convolution layers include the first convolution sub-layer and at least one second convolution sub-layer described above. In some embodiments, the first convolution sub-layer and three different second convolution sub-layers are included in the convolution layer, i.e., four convolution sub-layers are included in the convolution layer. The input features of the four convolution sublayers are of the same dimension, and the output features are of the same dimension. The sum of the output characteristics of the four convolution sublayers is taken as the output characteristic of the convolution layer.
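The combination pattern can be sketched as follows. The four branches here are stand-ins (plain 3D convolutions) chosen only to keep the example self-contained; in the layer described above they would be the first convolution sub-layer and the three excitation-style second convolution sub-layers, all sharing input and output dimensions.

```python
import torch
import torch.nn as nn

class CombinedConvLayer(nn.Module):
    """Convolution layer sketch: four parallel branches receive the same input,
    produce outputs of the same shape, and the layer output is the sum of the
    four branch outputs. Branch internals are placeholders for this sketch."""

    def __init__(self, channels: int, num_branches: int = 4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=3, padding=1)  # placeholder branch
            for _ in range(num_branches)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); every branch preserves this shape
        return sum(branch(x) for branch in self.branches)
```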
In some embodiments, before processing the features input into the convolution layer, the second convolution sub-layer performs a time-dimension shift (t.shift) on those features and then applies its subsequent processing to the shifted features. In some embodiments, the time-shift process moves part of the feature information of the (t−1)-th frame into the feature information of the t-th frame, so that the feature information of the t-th frame includes part of the feature information of the (t−1)-th frame, thereby improving the processing effect.
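A sketch of the time-dimension shift is given below; the fraction of channels shifted and the zero fill for the first frame are assumptions made for this illustration.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_ratio: float = 0.125) -> torch.Tensor:
    """Time-dimension shift (t.shift) sketch for x of shape (B, C, T, H, W):
    a fraction of the channels of frame t-1 is moved into frame t, so each frame
    carries part of the previous frame's feature information."""
    c = x.shape[1]
    n = max(1, int(c * shift_ratio))   # number of channels taken from the previous frame
    out = x.clone()
    out[:, :n, 1:] = x[:, :n, :-1]     # frame t receives channels from frame t-1
    out[:, :n, :1] = 0                 # the first frame has no predecessor
    return out
```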
In some embodiments, the denoising unit further includes an attention layer, where the attention layer includes a temporal attention sub-layer and a spatial attention sub-layer; the time attention sub-layer is used for carrying out time attention operation on the characteristics input into the attention layer, and obtaining an operation result of the time attention sub-layer through linear transformation of a plurality of first feedforward neural networks after grouping; the spatial attention sub-layer is used for performing spatial attention operation on operation results of the time attention sub-layer, grouping the operation results and then performing linear transformation on the operation results of the time attention sub-layer through a plurality of second feedforward neural networks to obtain operation results of the spatial attention sub-layer, wherein the operation results of the spatial attention sub-layer are operation results of the attention layer.
In some embodiments, the denoising unit further includes an attention layer, where the attention layer includes a temporal attention sub-layer and a spatial attention sub-layer; the spatial attention sub-layer is used for performing spatial attention operation on the features input into the attention layer, grouping the features, and then performing linear transformation on the features through a plurality of second feedforward neural networks to obtain an operation result of the spatial attention sub-layer, wherein the operation result of the spatial attention sub-layer is the operation result of the attention layer. And the time attention sub-layer is used for carrying out time attention operation on the operation result of the space attention sub-layer, and obtaining the operation result of the attention layer through linear transformation of a plurality of first feedforward neural networks after grouping.
In the related art, when performing the attention operation, the temporal attention sub-layer generally performs a temporal attention operation on the features input into the attention layer and obtains its result through the linear transformation of a single first feed-forward neural network; the spatial attention sub-layer performs a spatial attention operation on the result of the temporal attention sub-layer and obtains its result through the linear transformation of a single second feed-forward neural network, and the result of the spatial attention sub-layer is the result of the attention layer. In the present application, the workload of a single feed-forward neural network is distributed across multiple feed-forward neural networks, which improves the processing efficiency of the model while reducing its processing cost.
In some embodiments, the temporal attention operation attends to the time-dimension information in the input features to obtain an output result, and the spatial attention operation attends to the space-dimension information in the input features to obtain an output result. The specific temporal and spatial attention calculations are not described in detail here.
In some embodiments, the operation flow of the attention layer is as follows: the input features {x_1, x_2, ..., x_L} are processed by the temporal or spatial attention operation, the resulting features are grouped into {X_i}, and each group X_i is linearly transformed by its own feedforward neural network E_{S,i}, with Y_i = E_{S,i}(X_i) denoting the output result of the i-th feedforward neural network.
In some embodiments, the first feedforward neural network and the second feedforward neural network are different feedforward neural networks. In some embodiments, a feedforward neural network (Feedforward Neural Network, FFN) is a type of artificial neural network in which each neuron, starting from the input layer, receives the input of the previous layer and feeds the next layer, up to the output layer. The whole network has no feedback and can be represented by a directed acyclic graph. The feedforward neural network was the earliest artificial neural network to be proposed and is also the simplest type of artificial neural network. According to the number of layers, feedforward neural networks can be divided into single-layer and multi-layer feedforward neural networks. Common feedforward neural networks include perceptrons, BP (Back Propagation) networks, RBF (Radial Basis Function) networks, etc. The embodiments of the application do not limit the specific network structure of the first feedforward neural network and the second feedforward neural network.
In some embodiments, after each training iteration, a weight average is performed over all the first feedforward neural networks and, separately, over all the second feedforward neural networks. That is, once all the first feedforward neural networks and all the second feedforward neural networks have been trained, their weights are averaged within each group to obtain a weight-averaged first feedforward neural network and a weight-averaged second feedforward neural network. When the model is used, the weight-averaged feedforward neural network is used directly, which is equivalent to model ensembling: the training results of a plurality of feedforward neural networks are integrated into one feedforward neural network.
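A minimal sketch of this weight averaging, assuming that all feedforward neural networks in a group share an identical structure; the group size of four and the network architecture below are placeholders, not details of this application.

```python
# Illustrative only: merge several identically structured feed-forward networks
# into one network by averaging their parameters (a simple form of model ensembling).
import copy
import torch
import torch.nn as nn


def average_ffns(ffns):
    merged = copy.deepcopy(ffns[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(f.named_parameters())[name] for f in ffns])
            param.copy_(stacked.mean(dim=0))
    return merged


group = [nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)) for _ in range(4)]
shared_ffn = average_ffns(group)   # used at inference in place of the whole group
```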
In some embodiments, the structure of the attention layer is as shown at 700 of FIG. 7. The features input into the spatial attention sub-layer are normalized by a Norm layer, the spatial attention operation is performed, and the operation result is normalized and then grouped into group 1, group 2, and so on. Group 1 is linearly transformed by the first of the second feedforward neural networks, group 2 by the second of the second feedforward neural networks, and so on, and the sum of the outputs of the second feedforward neural networks is taken as the operation result of the spatial attention sub-layer. The output of the spatial attention sub-layer is input into the temporal attention sub-layer: the input features are normalized by a Norm layer, the temporal attention operation is performed, and the operation result is normalized and then grouped into group 1, group 2, and so on. Group 1 is linearly transformed by the first of the first feedforward neural networks, group 2 by the second of the first feedforward neural networks, and so on, and the sum of the outputs of the first feedforward neural networks is taken as the operation result of the temporal attention sub-layer. The output result of the temporal attention sub-layer is taken as the output result of the attention layer.
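A simplified PyTorch sketch of one such sub-layer and of the spatial-then-temporal order shown in FIG. 7. The reshaping between spatial and temporal token layouts, the way the grouped outputs are recombined, and the head and group counts are simplifying assumptions made for illustration, not details of this application.

```python
# Simplified sketch: normalize, attend, normalize, split the tokens into groups,
# pass each group through its own feed-forward network, and recombine the groups.
import torch
import torch.nn as nn


class GroupedFFNAttention(nn.Module):
    def __init__(self, dim, heads=4, groups=2):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_out = nn.LayerNorm(dim)
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(groups)
        ])

    def forward(self, x):                                 # x: (batch, tokens, dim)
        h = self.norm_in(x)
        h, _ = self.attn(h, h, h)
        h = self.norm_out(h)
        chunks = h.chunk(len(self.ffns), dim=1)           # group the tokens
        outputs = [ffn(c) for ffn, c in zip(self.ffns, chunks)]
        return torch.cat(outputs, dim=1)                  # combined result of the sub-layer


class AttentionLayer(nn.Module):
    """Spatial attention sub-layer followed by temporal attention sub-layer, as in FIG. 7."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = GroupedFFNAttention(dim)
        self.temporal = GroupedFFNAttention(dim)

    def forward(self, x):
        return self.temporal(self.spatial(x))
```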
In some embodiments, the output of the convolution layer is used as the input of the attention layer, or the output of the attention layer is used as the input of the convolution layer; the order of the two modules is not limited, and the input information and output information of the convolution layer and the attention layer have the same dimensions.
Step 450, obtaining the first prediction video according to the first denoised hidden space representation through the decoding network.
In some embodiments, the denoised implicit spatial representation is decoded by a decoding network to generate a predictive video. The decoding network may also be called a decoder, and the denoised hidden space representation is decoded by the decoder to obtain a prediction video corresponding to the denoised hidden space representation.
Step 460, according to the difference between the sample video and the first predicted video, adjusting the parameters of the video generation model to obtain a trained video generation model.
In the technical scheme provided by the embodiments of the application, a plurality of convolution processing mechanisms are introduced into the convolution layer: in addition to the two-dimensional convolution and one-dimensional convolution of the related art, at least one of space-time excitation, spatial excitation, channel excitation and motion excitation is included. Because the two-dimensional and one-dimensional convolution of the related art serves as the basis and the other convolution types are applied on top of it, the overall output of the convolution layer loses neither time-dimension nor space-dimension information, and the different convolution mechanisms complement one another across dimensions. This improves the convolution effect, strengthens the ability to combine the time dimension and the space dimension, and improves the effect of the generated video.
Referring to fig. 8, a flowchart of a training method of a video generation model according to another embodiment of the present application is shown. The subject of execution of the steps of the method may be the model training apparatus described above. In the following method embodiments, for convenience of description, only the execution subject of each step is described as "computer device". The method may include at least one of the following steps (810-880).
At step 810, at least one training sample is obtained, and each training sample includes a sample video and a description text corresponding to the sample video.
Step 820, obtaining the hidden space representation corresponding to the sample video by adding noise to the sample video.
Step 830, encoding the description text through a text encoding network to obtain a text representation corresponding to the description text.
In step 840, the hidden space representation is denoised according to the text representation by at least one denoising unit, so as to obtain a first denoised hidden space representation.
Step 850, obtaining the first prediction video according to the first denoised hidden space representation through the decoding network.
Step 860, according to the difference between the sample video and the first predicted video, the parameters of the video generation model are adjusted to obtain an adjusted video generation model.
In some embodiments, after step 860 the training of the denoising network is complete. At this point the video generation model provided in the embodiments of the present application may already be considered trained; alternatively, the processing flow may continue as follows.
In step 870, a space-time compensation module is added to the decoding network of the adjusted video generation model to obtain an updated video generation model, where the space-time compensation module is used to supplement the timing information lost in the encoding process, and the parameters of the space-time compensation module are initialization parameters.
In some embodiments, the space-time compensation module (Timing Compensation Module, TCM) is a spatio-temporal discriminator constructed from three-dimensional convolution and includes a plurality of convolution layers; the specific structure of the module is not limited in this application.
In some embodiments, the updated video generation model includes an encoding network. The encoding network is configured to encode the sample video to obtain encoded video features, and the encoded video features are noised to obtain the hidden space representation. The encoding network includes at least one encoder, the decoding network includes at least one decoder, and each decoder corresponds to one space-time compensation module.
In step 880, the updated video generation model is trained using the training samples to obtain a trained video generation model.
In some embodiments, step 880 includes at least one of the following steps S1-S8 (not shown).
Step S1, encoding the output features of the (i-1)-th encoder by the i-th encoder to obtain the output features of the i-th encoder, where the input feature of the 1st encoder is the sample video, the output feature of the last encoder is the encoded video feature, and i is a positive integer.
Step S2, adding noise to the encoded video features and denoising the result to obtain a second denoised hidden space representation.
Step S3, performing temporal discrimination on the output features of the (i-1)-th decoder through the space-time compensation module corresponding to the (i-1)-th decoder to obtain the timing-compensated output features corresponding to the (i-1)-th decoder.
Step S4, determining the output features of the i-th decoder by the i-th decoder according to the timing-compensated output features corresponding to the (i-1)-th decoder and the output features of the (i-1)-th decoder, where the input feature of the space-time compensation module corresponding to the 1st decoder is the second denoised hidden space representation.
Step S5, determining a second prediction video according to the output characteristics of the last decoder.
In some embodiments, the output characteristics of the last decoder are passed through a reconstruction network to obtain the second predicted video, wherein the reconstruction network is a refinement network. Illustratively, the reconstruction network is added at the same time as the space-time compensation module is added to the decoding network. The parameters of the space-time compensation module and the reconstruction network are initialization parameters, and the parameters of other modules in the decoding network are parameters after the first training.
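The following sketch illustrates steps S3 to S5 above as a decoding loop in which each stage has its own space-time compensation module. Modeling the module as a residual block of three-dimensional convolutions and combining the compensated feature with the previous decoder output by addition are assumptions made for this sketch, not details of this application.

```python
# Hypothetical sketch of a decoding pass with one space-time compensation module (TCM)
# per decoder; z is the second denoised hidden-space representation, shape (B, C, T, H, W).
import torch
import torch.nn as nn


class SpaceTimeCompensation(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)                 # timing-compensated output features


def decode_with_tcm(z, decoders, tcms, reconstruct):
    feat = z                                     # input of the 1st decoder's TCM
    for decoder, tcm in zip(decoders, tcms):
        compensated = tcm(feat)                  # temporal discrimination / compensation
        feat = decoder(feat + compensated)       # next stage sees both signals
    return reconstruct(feat)                     # refinement network -> second predicted video
```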
Step S6, determining a first loss function value according to the difference between the sample video and the second prediction video.
In some embodiments, the first loss function value measures the difference between the sample video, denoted X, and the second predicted video.
Step S7, determining a second loss function value according to the output features respectively corresponding to the at least one encoder and the timing-compensated output features respectively corresponding to the at least one decoder.
In some embodiments, the second loss function value is determined, following the coding order, from the difference between the output features of the first encoder (e.g., encoder 1 in fig. 9, which can be understood as the features obtained by encoding the sample video once) and the timing-compensated output features of the second-to-last decoder (e.g., decoder 2 in fig. 9, which can be understood as the features before the last decoding that generates the predicted video), the difference between the output features of the second encoder (e.g., encoder 2 in fig. 9) and those of the third-to-last decoder (e.g., decoder 3 in fig. 9), and so on.
In some embodiments, the second loss function value is the sum, over the corresponding encoder-decoder pairs, of the differences between the output features of the i-th encoder and the timing-compensated output features of the corresponding decoder, where M represents the number of encoders (or decoders). Here the i-th encoder and its corresponding decoder are paired in reverse order: the first encoder corresponds to the second-to-last decoder, the second encoder corresponds to the third-to-last decoder, and so on.
Step S8, training the updated video generation model according to the first loss function value and the second loss function value to obtain a trained video generation model.
In some embodiments, the first loss function value and the second loss function value are weighted and summed to obtain a composite loss function value. And training the video generating model according to the comprehensive loss function value to obtain a trained video generating model.
In some embodiments, the comprehensive loss function value is the first loss function value plus λ times the second loss function value, where λ is the weight of the second loss function value; λ may be set in advance and kept fixed, or may change continuously as the network parameters change.
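As a sketch under stated assumptions, the two loss terms and their weighted sum could be computed as follows. The use of an L1 distance for both terms is an assumption; the reverse-order pairing of encoder features with timing-compensated decoder features follows the correspondence described above.

```python
# Illustrative loss computation for the second training stage.
import torch
import torch.nn.functional as F


def second_stage_loss(sample_video, predicted_video,
                      encoder_feats, compensated_decoder_feats, lam=0.1):
    # First loss: difference between the sample video and the second predicted video.
    loss_rec = F.l1_loss(predicted_video, sample_video)

    # Second loss: encoder 1 vs. the second-to-last decoder, encoder 2 vs. the
    # third-to-last decoder, and so on.
    pairs = zip(encoder_feats, reversed(compensated_decoder_feats[:-1]))
    loss_feat = sum(F.l1_loss(dec, enc) for enc, dec in pairs)

    # Comprehensive loss: first loss plus lambda times the second loss.
    return loss_rec + lam * loss_feat
```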
In some embodiments, while the updated video generation model is trained according to the first loss function value and the second loss function value, the parameters of the other networks in the video generation model are kept unchanged and only the parameters in the decoding network are changed. When the parameters of the decoding network are changed, the parameters of the decoders, the space-time compensation modules and the reconstruction network are changed simultaneously; when no reconstruction network has been added to the decoding network, the parameters of the decoders and the space-time compensation modules are changed simultaneously.
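A short sketch of this selective training, assuming the decoders, compensation modules and reconstruction network are exposed as attributes of the model; the attribute names and the choice of optimizer are illustrative assumptions.

```python
# Illustrative only: freeze everything, then re-enable gradients for the decoding network.
import torch


def train_decoding_network_only(model, lr=1e-4):
    for p in model.parameters():
        p.requires_grad_(False)
    for module in (model.decoders, model.tcms, getattr(model, "reconstruct", None)):
        if module is not None:
            for p in module.parameters():
                p.requires_grad_(True)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```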
In some embodiments, as shown at 900 of fig. 9, a schematic diagram of a training method for a decoding network is shown. A first loss (function value) is determined based on the difference between the sample video and the second predicted video. And determining a second loss (function value) according to the output characteristics corresponding to the at least one encoder and the output characteristics corresponding to the at least one decoder after time sequence compensation, and training the decoding network according to the two losses.
In the embodiments of the application, the space-time compensation module introduced in the decoding process enhances the intermediate latent representation to compensate for the timing information lost during decoding, which reduces picture jitter between different frames of the generated video and thereby improves the accuracy of the reconstructed video.
Further, the reconstruction network is introduced to process the output characteristics of the decoder, so that the generation effect of the video can be improved.
An exemplary model training process when there are a plurality of inputs is described below.
In some embodiments, the above method further comprises at least one of the following steps P1-P3 (not shown) before adding the space-time compensation module in the decoding network.
Step P1, adding a control network and an adapter to the adjusted video generation model to obtain a control video generation model, where the control network includes at least one denoising unit.
In some embodiments, after the first training, the parameters of the denoising network in the video generation model are fixed, and the denoising units of the downsampling network (i.e., the encoding part) in the denoising sub-network are used as basic components to construct the control network. Illustratively, the control network further includes an initialized convolution layer in addition to the denoising units. The control video generation model in the embodiments of the present application may also be understood as a video generation model to which a control network and an adapter have been added, i.e., the video generation model shown at 220 of fig. 2.
Step P2, adjusting the control video generation model by using at least one supplementary training sample to obtain an adjusted control video generation model, where the supplementary training sample supplements the sample video in the training sample with visual information of the sample video.
In some embodiments, the visual information is vision-related information corresponding to the sample video, and its size is consistent with that of the sample video. The embodiments of the application do not limit the type of visual information; illustratively, it includes, but is not limited to, HED (Holistically-Nested Edge Detection) information, depth information, and Canny edge map information.
Step P3, updating the adjusted video generation model by using the adjusted control video generation model.
In some embodiments, the adjusted control video generation model is used as an adjusted video generation model to perform a subsequent training process on the decoding network. In some embodiments, the updated adjusted video generation model may be understood as the video generation model in 230 of FIG. 2 described above.
In some embodiments, performing feature adaptation on visual information in the supplemental training sample by using an adapter to obtain an adaptation feature corresponding to the visual information; obtaining visual characteristics corresponding to the visual information through a control network according to the adaptive characteristics corresponding to the visual information; coding the description text in the supplementary training sample through a text coding network to obtain a text representation corresponding to the description text; denoising the hidden space representation according to the text representation and the visual characteristics through a denoising network to obtain a third denoised hidden space representation; obtaining a third prediction video according to the hidden space representation after the third denoising through the decoding network; and adjusting the control video generation model according to the difference between the sample video in the supplementary training sample and the third prediction video to obtain an adjusted control video generation model.
In some embodiments, different kinds of visual information correspond to different adapters. When multiple types of visual information are input, each piece of visual information is passed through its own adapter to obtain the corresponding adaptation feature, and the adaptation features of all pieces of visual information are added to obtain the output of the adapter.
In some embodiments, the processing flow of the adapter can be expressed as a sum over the kinds of visual information: for the k-th kind of visual information, an indicator function (which equals 1 when that kind of visual information is provided) is multiplied by the result of the adapter convolution applied to the k-th visual information, and the products are summed to give the adapter output.
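A hypothetical sketch of this adapter: each kind of visual information has its own convolution, absent kinds are skipped (the indicator function), and the per-kind adaptation features are summed. Channel counts and kind names are placeholders, and video frames are treated as independent images for simplicity.

```python
# Hypothetical condition adapter; frames are treated as independent images for simplicity.
import torch
import torch.nn as nn


class ConditionAdapter(nn.Module):
    def __init__(self, kinds, in_channels=3, out_channels=320):
        super().__init__()
        self.convs = nn.ModuleDict({
            kind: nn.Conv2d(in_channels, out_channels, 3, padding=1) for kind in kinds
        })

    def forward(self, visual_inputs):
        """visual_inputs: dict mapping a kind name ('hed', 'depth', 'canny', ...) to a tensor."""
        out = None
        for kind, conv in self.convs.items():
            x = visual_inputs.get(kind)
            if x is None:                      # indicator function: skip absent kinds
                continue
            out = conv(x) if out is None else out + conv(x)
        return out


adapter = ConditionAdapter(["hed", "depth", "canny"])
x = adapter({"hed": torch.randn(1, 3, 64, 64), "depth": torch.randn(1, 3, 64, 64)})
```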
In some embodiments, the hidden space representation is denoised by the denoising network according to the text representation and the visual features, and for each denoising process, the sum of the output of the downsampling network in the ith denoising subnet, the text representation and the visual features is taken as the input of the upsampling network of the denoising subnet.
For the process of generating a video, reference is made to the above embodiments; it is not described in detail here.
In some embodiments, parameters of the adapter and the control network in the control video generation model are adjusted according to differences between the sample video and the third prediction video in the supplemental training samples, resulting in an adjusted control video generation model.
In the embodiments of the application, the adapter and the control network are introduced so that video generation can be guided by different visual conditions, and the visual conditions also participate in every denoising pass, which enhances the model's understanding of the visual conditions and improves the video generation effect of the model.
Further input information is described below by way of example.
In some embodiments, the control video generation model further includes an image encoder and a linear adapter, and the supplementary training sample further includes style information corresponding to the sample video. Style information in the embodiments of the present application may be understood as a picture representing a particular style. For example, using an image with a distinctive visual style as the style information can guide the model to generate a video in that style.
In some embodiments, the image encoder may also be referred to as an image encoding network. The image encoding network in the embodiments of the present application may be a CLIP model (a multi-modal model) or another pre-trained model, which is not limited in this application. In some embodiments, the style information is input to the image encoding network and an adaptation feature corresponding to the style information is generated via the linear adapter. The dimensions, sizes, etc. of the adaptation features are not limited by the embodiments of the present application; illustratively, the adaptation feature is a text vector, a text matrix, or the like.
In some embodiments, the text encoding network encodes the description text in the supplementary training sample to obtain a text representation corresponding to the description text; the style information is image-encoded through the image encoder to obtain image coding features; the image coding features are linearly adapted through the linear adapter to obtain adaptation features corresponding to the style information; a comprehensive text representation is obtained according to the adaptation features corresponding to the style information and the text representation corresponding to the description text; and the text representation is updated with the comprehensive text representation.
In some embodiments, the adaptation features corresponding to the style information and the text representation corresponding to the description text are spliced to obtain the comprehensive text representation, which can replace the previous text representation in the subsequent denoising process. In some embodiments, the parameters of the linear adapter are adjusted at the same time as those of the control network and the adapter.
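A sketch of this splicing, assuming a CLIP-like image encoder that returns a single global embedding; the number of style tokens and the feature dimensions below are assumptions made for illustration.

```python
# Illustrative sketch of building the comprehensive text representation: the style image
# is encoded, mapped by a linear adapter, and concatenated with the text tokens.
import torch
import torch.nn as nn


class StyleConditioner(nn.Module):
    def __init__(self, image_encoder, image_dim=768, text_dim=768, style_tokens=4):
        super().__init__()
        self.image_encoder = image_encoder
        self.linear_adapter = nn.Linear(image_dim, style_tokens * text_dim)
        self.style_tokens = style_tokens
        self.text_dim = text_dim

    def forward(self, style_image, text_tokens):
        """text_tokens: (batch, n_tokens, text_dim) from the text encoding network."""
        image_feat = self.image_encoder(style_image)          # (batch, image_dim)
        style = self.linear_adapter(image_feat)
        style = style.view(-1, self.style_tokens, self.text_dim)
        # Splice the style adaptation features with the text representation.
        return torch.cat([text_tokens, style], dim=1)         # comprehensive text tokens
```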
In some embodiments, as shown at 1000 in fig. 10, a schematic diagram of the adapter and the control network is shown. The first visual information passes through the first adapter to obtain the adaptation feature corresponding to the first visual information, the second visual information passes through the second adapter to obtain the adaptation feature corresponding to the second visual information, and the third visual information passes through the third adapter to obtain the adaptation feature corresponding to the third visual information. In some embodiments, the adapters include a plurality of types, and the corresponding adapter is selected for processing according to the type of the input visual information. In some embodiments, the adaptation features of the three types of visual information are added together to obtain the output X of the adapter. In some embodiments, the style information is passed through the image encoding network and then through the linear adapter to obtain the adaptation feature of the style information, and the description text is passed through the text encoding network to obtain the text representation. The text representation and the adaptation feature of the style information are spliced to obtain the comprehensive text representation. The comprehensive text representation is input into the denoising network and the attention layer in the control network, the hidden space representation is denoised according to the output features of the adapter, and the prediction video is finally obtained.
In some embodiments, as shown at 1100 of FIG. 11, a schematic diagram of the training process of the video generation model is shown. The visual information is input to the denoising network through the condition adapter (i.e., the adapter) and the control network to participate in the denoising process. The description text passes through the text encoding network to obtain the text representation, which is input to the denoising network to participate in the denoising process. The sample video passes through the encoding network to obtain the encoded video feature Z0, the diffusion forward process (i.e., the noise adding network) is performed to obtain the hidden space representation Zt, the hidden space representation Zt is input into the denoising network to obtain the denoised hidden space representation, and the decoding network is applied to obtain the predicted video. According to the difference between the sample video and the predicted video, the parameters of the denoising network, the control network and the adapter of the model are adjusted.
In the embodiments of the application, introducing style information makes it possible to guide the style of the generated video, so that the way of generating video is more flexible.
The flow of the model use side is exemplarily described below.
Referring to fig. 12, a flowchart of a video generating method based on a video generating model according to an embodiment of the present application is shown. The subject of execution of the steps of the method may be the model-using device described above. In the following method embodiments, for convenience of description, only the execution subject of each step is described as "computer device". The method may include at least one of the following steps (1210-1230).
In step 1210, descriptive text for generating a video is obtained.
Step 1220, a hidden space token is obtained, the hidden space token being used to characterize the noise distribution prior to video generation.
In some embodiments, the hidden space representation is fixed during the model training process for use in a subsequent model generation process. In the model generation process, the step of generating the hidden space representation is not performed any more.
In other embodiments, a random noise video is generated based on random numbers, and the hidden space representation corresponding to the random noise video is generated through the forward process of a diffusion model. In some embodiments, the random noise video is encoded through the encoding network to obtain an initial feature vector of the random noise video, and the initial feature vector is noised T times through the forward process of the diffusion model to generate the hidden space representation corresponding to the random noise video, where T is a positive integer. The hidden space representation may take the form of a vector or a matrix. Of course, the hidden space representation may also be generated in advance.
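A minimal sketch of such a forward (noising) process applied to an encoded latent; the linear beta schedule, the number of steps, and the latent shape are assumptions made for illustration.

```python
# Minimal sketch of the forward (noising) process of a diffusion model applied to a latent.
import torch


def forward_diffusion(z0, T=1000, beta_start=1e-4, beta_end=2e-2):
    betas = torch.linspace(beta_start, beta_end, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    t = T - 1                                      # fully noised latent z_T
    noise = torch.randn_like(z0)
    zt = alphas_cumprod[t].sqrt() * z0 + (1 - alphas_cumprod[t]).sqrt() * noise
    return zt                                      # hidden-space representation z_t


z0 = torch.randn(1, 4, 16, 32, 32)                 # encoded (latent) video features
zt = forward_diffusion(z0)
```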
Step 1230, denoising the hidden space representation according to the description text by at least one denoising unit, and then decoding to obtain a prediction video corresponding to the description text; at least two convolution mechanisms aiming at hidden space characterization are adopted in the denoising unit.
In some embodiments, the description text is encoded through a text encoding network, and a text representation corresponding to the description text is obtained. Denoising the hidden space representation according to the text representation by at least one denoising unit to obtain a denoised hidden space representation; and obtaining the predicted video according to the denoised hidden space representation through the decoding network.
In some embodiments, the denoising unit includes a convolution layer, where the convolution layer includes a first convolution sub-layer and at least one second convolution sub-layer, where the first convolution sub-layer and the second convolution sub-layer correspond to different convolution mechanisms; the first convolution sublayer is used for carrying out convolution processing of a time dimension on the characteristics input into the convolution layer and then carrying out convolution processing of a space dimension; the second convolution sub-layer is used for compressing the characteristics input into the convolution layer in the time dimension and then carrying out convolution processing on the information of the compressed characteristics in the space dimension; or, the second convolution sub-layer is used for compressing the characteristics input into the convolution layer in the space dimension and then carrying out convolution processing on the information of the compressed characteristics in the time dimension; or the second convolution sub-layer is used for acquiring difference information between two adjacent frames in the time dimension of the characteristics input into the convolution layer.
In some embodiments, the video generation model further comprises a control network and an adapter; the steps further comprise: visual information for generating a video is acquired. Performing feature adaptation on the visual information by using an adapter to obtain an adaptation feature corresponding to the visual information; obtaining visual characteristics corresponding to the visual information through a control network according to the adaptive characteristics corresponding to the visual information; and denoising the hidden space representation according to the text representation and the visual characteristics through at least one denoising unit to obtain the denoised hidden space representation.
In some embodiments, the video generation model further comprises an image encoder and a linear adapter; the steps further comprise: coding the description text through a text coding network to obtain a text representation corresponding to the description text; image coding is carried out on style information through an image coder, so that image coding characteristics are obtained; performing linear adaptation on the image encoder through a linear adapter to obtain an adaptation characteristic corresponding to style information; obtaining comprehensive text characterization according to the adaptation features corresponding to the style information and the text characterization corresponding to the descriptive text; the text token is updated with the integrated text token.
In some embodiments, a schematic diagram of the usage flow of the video generation model is shown as 1300 in fig. 13. Visual information is input to the denoising network through the condition adapter (namely the adapter) and the control network to participate in the denoising process. The descriptive text passes through a text coding network to obtain text representation, and the text representation is input to a denoising network to participate in a denoising process. The hidden space representation Zt is also input into a denoising network to obtain a denoised hidden space representation Z0, and a predicted video is obtained through a decoding network.
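The use-side flow described above can be summarized as the following schematic sketch. Every module interface here is an assumption standing in for the text encoding network, adapter, control network, denoising network and decoding network described above, and the sampler is reduced to a bare loop.

```python
# Schematic inference flow; all module interfaces are illustrative assumptions.
import torch


@torch.no_grad()
def generate_video(modules, description_text, visual_inputs=None, steps=50):
    text_rep = modules["text_encoder"](description_text)

    visual_feat = None
    if visual_inputs is not None:
        adapted = modules["adapter"](visual_inputs)
        visual_feat = modules["control_net"](adapted, text_rep)

    zt = torch.randn(1, 4, 16, 32, 32)          # hidden-space representation
    for step in reversed(range(steps)):         # iterative denoising
        zt = modules["denoise_net"](zt, step, text_rep, visual_feat)

    return modules["decoder"](zt)               # predicted video
```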
For technical details of the model use side that are not mentioned in the embodiments of the present application, reference is made to the model training side; they are not repeated here.
In the technical scheme provided by the embodiment of the application, in the video generation process, after the hidden space characterization is denoised according to the description text by at least one denoising unit, the prediction video is obtained by decoding; at least two convolution mechanisms aiming at hidden space characterization are adopted in the denoising unit. By improving the convolution mechanism of the denoising unit in the denoising network, the matching degree between the generated video and the description text can be improved, and the effect of generating the video is improved.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 14, a block diagram of a training apparatus for a video generating model according to an embodiment of the present application is shown, where the video generating model includes a denoising network, and the denoising network includes at least one denoising unit. As shown in fig. 14, the apparatus 1400 may include: a sample acquisition module 1410, a video noise adding module 1420, a video generation module 1430, and a parameter adjustment module 1440.
The sample obtaining module 1410 is configured to obtain at least one training sample, where each training sample includes a sample video and a description text corresponding to the sample video.
The video noise adding module 1420 is configured to add noise to the sample video to obtain a hidden space representation corresponding to the sample video.
The video generating module 1430 is configured to denoise the hidden space representation according to the description text by the at least one denoising unit, and then decode the denoised hidden space representation to obtain a first prediction video; at least two convolution mechanisms aiming at the hidden space characterization are adopted in the denoising unit.
And the parameter adjustment module 1440 is configured to adjust parameters of the video generation model according to the difference between the sample video and the first prediction video, so as to obtain a trained video generation model.
In some embodiments, the video generation model further comprises a text encoding network and a decoding network.
In some embodiments, the video noise adding module 1420 is further configured to encode the description text through the text encoding network, so as to obtain a text representation corresponding to the description text.
In some embodiments, the video generating module 1430 is configured to denoise the hidden space representation according to the text representation by the at least one denoising unit, to obtain a first denoised hidden space representation; and obtaining the first prediction video according to the first denoised hidden space representation through the decoding network.
In some embodiments, the denoising unit includes a convolution layer, where the convolution layer includes a first convolution sub-layer and at least one second convolution sub-layer, and the first convolution sub-layer and the second convolution sub-layer correspond to different convolution mechanisms; the first convolution sublayer is used for carrying out convolution processing of a time dimension on the characteristics input into the convolution layer and then carrying out convolution processing of a space dimension; the second convolution sublayer is used for compressing the characteristics input into the convolution layer in the time dimension and then carrying out convolution processing on the information of the compressed characteristics in the space dimension; or, the second convolution sublayer is configured to compress the feature input into the convolution layer in the spatial dimension, and then convolve the compressed feature with information in the time dimension; or the second convolution sub-layer is used for acquiring difference information between two adjacent frames in the time dimension of the characteristics input into the convolution layer.
In some embodiments, the second convolution sub-layer performs a time-dimensional shift process on the features input into the convolution layer before processing the features input into the convolution layer, and performs a subsequent process on the shifted features.
In some embodiments, the denoising unit further includes an attention layer, where the attention layer includes a temporal attention sub-layer and a spatial attention sub-layer.
In some embodiments, the time attention sub-layer is configured to perform a time attention operation on a feature input into the attention layer, and obtain an operation result of the time attention sub-layer through linear transformation of a plurality of first feedforward neural networks after grouping; the spatial attention sub-layer is used for performing spatial attention operation on the operation result of the temporal attention sub-layer, grouping the operation result of the spatial attention sub-layer, and then performing linear transformation on a plurality of second feedforward neural networks to obtain the operation result of the spatial attention sub-layer, wherein the operation result of the spatial attention sub-layer is the operation result of the attention layer.
In some embodiments, the parameter adjustment module 1440 is configured to adjust parameters of the video generation model according to the difference between the sample video and the first prediction video, to obtain an adjusted video generation model; adding a space-time compensation module in a decoding network of the adjusted video generation model to obtain an updated video generation model, wherein the space-time compensation module is used for supplementing time sequence information lost in the encoding process, and the parameters of the space-time compensation module are initialization parameters; and training the updated video generation model by using the training sample to obtain the trained video generation model.
In some embodiments, the updated video generation model includes an encoding network, the encoding network is configured to encode the sample video to obtain encoded video features, the encoded video features are noisy to obtain the hidden space representation, the encoding network includes at least one encoder, and the decoding network includes at least one decoder, where each decoder corresponds to one space-time compensation module.
In some embodiments, the parameter adjustment module 1440 is configured to encode the output features of the (i-1)-th encoder by the i-th encoder to obtain the output features of the i-th encoder, where the input feature of the 1st encoder is the sample video, the output feature of the last encoder is the encoded video feature, and i is a positive integer; add noise to the encoded video features and denoise the result to obtain a second denoised hidden space representation; perform temporal discrimination on the output features of the (i-1)-th decoder through the space-time compensation module corresponding to the (i-1)-th decoder to obtain the timing-compensated output features corresponding to the (i-1)-th decoder; determine the output features of the i-th decoder by the i-th decoder according to the timing-compensated output features corresponding to the (i-1)-th decoder and the output features of the (i-1)-th decoder, where the input feature of the space-time compensation module corresponding to the 1st decoder is the second denoised hidden space representation; determine a second predicted video based on the output features of the last decoder; determine a first loss function value according to the difference between the sample video and the second predicted video; determine a second loss function value according to the output features respectively corresponding to the at least one encoder and the timing-compensated output features respectively corresponding to the at least one decoder; and train the updated video generation model according to the first loss function value and the second loss function value to obtain the trained video generation model.
In some embodiments, the parameter adjustment module 1440 is configured to add a control network and an adapter to the adjusted video generation model to obtain a control video generation model, where the control network includes at least one denoising unit; adjusting the control video generation model by using at least one supplementary training sample to obtain an adjusted control video generation model, wherein the supplementary training sample is used for supplementing the sample video in the training sample with visual information of the sample video; and updating the adjusted video generation model by using the adjusted control video generation model.
In some embodiments, the parameter adjustment module 1440 is configured to perform feature adaptation on the visual information in the supplemental training sample by using the adapter, so as to obtain an adaptation feature corresponding to the visual information; obtaining visual characteristics corresponding to the visual information through the control network according to the adaptive characteristics corresponding to the visual information; coding the description text in the supplemental training sample through the text coding network to obtain a text representation corresponding to the description text; denoising the hidden space representation according to the text representation and the visual features through the denoising network to obtain a third denoised hidden space representation; obtaining a third prediction video according to the third denoised hidden space representation through the decoding network; and adjusting the control video generation model according to the difference between the sample video in the supplementary training sample and the third prediction video to obtain the adjusted control video generation model.
In some embodiments, a parameter adjustment module 1440 is configured to adjust parameters of the adapter and the control network in the control video generation model according to a difference between the sample video in the supplemental training sample and the third prediction video, so as to obtain the adjusted control video generation model.
In some embodiments, the control video generation model further includes an image encoder and a linear adapter, and the supplemental training samples further include style information corresponding to the sample video.
In some embodiments, the video noise adding module 1420 is configured to encode the description text in the supplementary training sample through the text encoding network to obtain a text representation corresponding to the description text; perform image encoding on the style information through the image encoder to obtain image coding features; perform linear adaptation on the image coding features through the linear adapter to obtain adaptation features corresponding to the style information; obtain a comprehensive text representation according to the adaptation features corresponding to the style information and the text representation corresponding to the description text; and update the text representation with the comprehensive text representation.
Referring to fig. 15, a block diagram of a video generating apparatus based on a video generating model according to an embodiment of the present application is shown, where the video generating model includes a denoising network, and the denoising network includes at least one denoising unit. As shown in fig. 15, the apparatus 1500 may include: the acquisition module 1510, the characterization acquisition module 1520, and the video generation module 1530.
An obtaining module 1510 is configured to obtain descriptive text for generating a video.
A token acquisition module 1520 for acquiring a hidden space token that characterizes a noise distribution prior to video generation.
The video generating module 1530 is configured to denoise the hidden space representation according to the description text by using the at least one denoising unit, and then decode the denoised hidden space representation to obtain a predicted video corresponding to the description text; at least two convolution mechanisms aiming at the hidden space characterization are adopted in the denoising unit.
In some embodiments, the video generation model further comprises a text encoding network and a decoding network.
In some embodiments, the video generating module 1530 is further configured to encode the description text through the text encoding network, so as to obtain a text representation corresponding to the description text.
In some embodiments, the video generating module 1530 is configured to denoise, by the at least one denoising unit, the hidden space representation according to the text representation, to obtain a denoised hidden space representation; and obtaining the prediction video through the decoding network according to the denoised hidden space representation.
In some embodiments, the denoising unit includes a convolution layer, where the convolution layer includes a first convolution sub-layer and at least one second convolution sub-layer, and the first convolution sub-layer and the second convolution sub-layer correspond to different convolution mechanisms.
In some embodiments, the first convolution sublayer is configured to perform convolution processing in a time dimension on a feature input into the convolution layer and then perform convolution processing in a space dimension; the second convolution sublayer is used for compressing the characteristics input into the convolution layer in the time dimension and then carrying out convolution processing on the information of the compressed characteristics in the space dimension; or, the second convolution sublayer is configured to compress the feature input into the convolution layer in the spatial dimension, and then convolve the compressed feature with information in the time dimension; or the second convolution sub-layer is used for acquiring difference information between two adjacent frames in the time dimension of the characteristics input into the convolution layer.
In some embodiments, the video generation model further comprises a control network and an adapter.
In some embodiments, the acquisition module 1510 is further configured to acquire visual information for generating a video.
In some embodiments, the video generating module 1530 is further configured to perform feature adaptation on the visual information by using the adapter, so as to obtain an adaptation feature corresponding to the visual information; and obtaining the visual characteristics corresponding to the visual information through the control network according to the adaptive characteristics corresponding to the visual information.
In some embodiments, the video generating module 1530 is further configured to denoise, by the at least one denoising unit, the hidden space representation according to the text representation and the visual feature, to obtain the denoised hidden space representation.
In some embodiments, the video generation model further comprises an image encoder and a linear adapter.
In some embodiments, the obtaining module 1510 is further configured to obtain style information for generating the video.
In some embodiments, the video generating module 1530 is further configured to encode the description text through the text encoding network to obtain a text representation corresponding to the description text; perform image encoding on the style information through the image encoder to obtain image coding features; perform linear adaptation on the image coding features through the linear adapter to obtain adaptation features corresponding to the style information; obtain a comprehensive text representation according to the adaptation features corresponding to the style information and the text representation corresponding to the description text; and update the text representation with the comprehensive text representation.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the above functional modules is only used as an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus provided in the foregoing embodiments and the corresponding method embodiments belong to the same concept; for the specific implementation process of the apparatus, reference is made to the method embodiments, which is not repeated here.
Referring to FIG. 16, a block diagram of a computer device 1600 provided in one embodiment of the present application is shown. The computer device 1600 may be any electronic device that has data computing, processing, and storage capabilities. The computer device 1600 may be used to implement the training method of the video generation model provided in the above embodiments, or the video generation method based on the video generation model.
In general, the computer device 1600 includes: a processor 1601, and a memory 1602.
Processor 1601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1601 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering the content to be displayed on the display screen. In some embodiments, the processor 1601 may also include an AI processor for processing computing operations related to machine learning.
Memory 1602 may include one or more computer-readable storage media, which may be non-transitory. Memory 1602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1602 is used to store a computer program configured to be executed by one or more processors to implement the training method of the video generation model described above, or a video generation method based on the video generation model.
Those skilled in the art will appreciate that the architecture shown in fig. 16 is not limiting as to the computer device 1600, and may include more or fewer components than shown, or may combine certain components, or employ a different arrangement of components.
In an exemplary embodiment, a computer readable storage medium is also provided, in which a computer program is stored which, when being executed by a processor, implements the above-described training method of the video generation model, or a video generation method based on the video generation model. Alternatively, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random Access Memory ), SSD (Solid State Drives, solid state disk), or optical disk, etc. The random access memory may include, among other things, reRAM (Resistance Random Access Memory, resistive random access memory) and DRAM (Dynamic Random Access Memory ).
In an exemplary embodiment, a computer program product is also provided, the computer program product comprising a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program to cause the computer device to execute the training method of the video generation model or the video generation method based on the video generation model.
It should be noted that, when the embodiments of the present application are applied, the collection and processing of related data (including descriptive text, sample videos, visual information, style information, etc.) should strictly obtain the informed consent or separate consent of the personal information subject in accordance with the requirements of relevant national laws and regulations, and subsequent data use and processing should be carried out within the scope of those laws and regulations and the authorization of the personal information subject.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely exemplify one possible execution order of the steps; in some other embodiments, the steps may be executed out of the numbered order, for example two differently numbered steps may be executed simultaneously, or in an order opposite to that shown, which is not limited by the embodiments of the present application.
The foregoing description of the exemplary embodiments of the present application is not intended to limit the invention to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (20)

1. A method of training a video generation model, the video generation model comprising a denoising network including at least one denoising unit, the method comprising:
acquiring at least one training sample, wherein each training sample comprises a sample video and a description text corresponding to the sample video;
the hidden space representation corresponding to the sample video is obtained by adding noise to the sample video;
denoising the hidden space representation according to the description text by the at least one denoising unit, and then decoding to obtain a first prediction video; at least two convolution mechanisms aiming at the hidden space characterization are adopted in the denoising unit;
and adjusting parameters of the video generation model according to the difference between the sample video and the first prediction video to obtain a trained video generation model.
2. The method of claim 1, wherein the video generation model further comprises a text encoding network and a decoding network; the method further comprises the steps of:
Encoding the description text through the text encoding network to obtain a text representation corresponding to the description text;
the denoising the hidden space representation by the at least one denoising unit according to the description text, and then decoding to obtain a first prediction video, which comprises the following steps:
denoising the hidden space representation according to the text representation by the at least one denoising unit to obtain a first denoised hidden space representation;
and obtaining the first prediction video according to the first denoised hidden space representation through the decoding network.
3. The method according to claim 2, wherein the denoising unit comprises a convolution layer, wherein the convolution layer comprises a first convolution sub-layer and at least one second convolution sub-layer, and wherein the first convolution sub-layer and the second convolution sub-layer correspond to different convolution mechanisms;
the first convolution sublayer is used for carrying out convolution processing of a time dimension on the characteristics input into the convolution layer and then carrying out convolution processing of a space dimension;
the second convolution sublayer is used for compressing the characteristics input into the convolution layer in the time dimension and then carrying out convolution processing on the information of the compressed characteristics in the space dimension; or alternatively, the first and second heat exchangers may be,
The second convolution sublayer is used for compressing the characteristics input into the convolution layer in the space dimension and then carrying out convolution processing on the information of the compressed characteristics in the time dimension; or alternatively, the first and second heat exchangers may be,
and the second convolution sub-layer is used for acquiring difference information between two adjacent frames in the time dimension of the characteristics input into the convolution layer.
4. The method according to claim 3, wherein the second convolution sub-layer performs a shift process in the time dimension on the features input into the convolution layer before processing them, and performs the subsequent processing on the shifted features.
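By way of illustration only and without limitation, the convolution mechanisms of claims 3 and 4 might be realized as in the following PyTorch-style sketch. The kernel sizes, the channel fold ratio of the temporal shift and the use of mean pooling as the time-dimension compression are assumptions of this sketch.

```python
import torch
import torch.nn as nn


def temporal_shift(x, fold_div=8):
    """Shift a fraction of the channels forward/backward along time (cf. claim 4); x: (B, C, T, H, W)."""
    b, c, t, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :fold, 1:] = x[:, :fold, :-1]                    # channels shifted forward in time
    out[:, fold:2 * fold, :-1] = x[:, fold:2 * fold, 1:]    # channels shifted backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]                     # remaining channels unchanged
    return out


class FirstConvSubLayer(nn.Module):
    """Temporal convolution followed by spatial convolution (first convolution sub-layer of claim 3)."""
    def __init__(self, channels):
        super().__init__()
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))

    def forward(self, x):                                   # x: (B, C, T, H, W)
        return self.spatial(self.temporal(x))


class SecondConvSubLayer(nn.Module):
    """One alternative second sub-layer: compress the time dimension, then convolve spatially."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                   # x: (B, C, T, H, W)
        x = temporal_shift(x)                               # shift in the time dimension before processing
        compressed = x.mean(dim=2)                          # compress the time dimension -> (B, C, H, W)
        out = self.spatial(compressed)                      # spatial convolution of the compressed features
        return out.unsqueeze(2).expand_as(x)                # broadcast back over time so branches can be fused
```

The other second sub-layer variants recited in claim 3 would replace the pooling step with a frame difference such as x[:, :, 1:] - x[:, :, :-1], or with spatial pooling followed by a temporal convolution.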
5. The method according to claim 2, wherein the denoising unit further comprises an attention layer, and the attention layer comprises a temporal attention sub-layer and a spatial attention sub-layer;
the temporal attention sub-layer is used for performing a temporal attention operation on the features input into the attention layer and, after grouping, obtaining an operation result of the temporal attention sub-layer through linear transformation by a plurality of first feedforward neural networks;
the spatial attention sub-layer is used for performing a spatial attention operation on the operation result of the temporal attention sub-layer, grouping the result of the spatial attention operation, and then obtaining an operation result of the spatial attention sub-layer through linear transformation by a plurality of second feedforward neural networks, the operation result of the spatial attention sub-layer being the operation result of the attention layer.
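As a further non-limiting illustration, the attention layer of claim 5 might be sketched as follows; the grouping factor, the residual connections and the use of nn.MultiheadAttention are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class GroupedFFN(nn.Module):
    """Split features into channel groups and apply a separate small feedforward network to each group."""
    def __init__(self, dim, num_groups=4):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        g = dim // num_groups
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(g, g * 2), nn.GELU(), nn.Linear(g * 2, g))
            for _ in range(num_groups)
        ])

    def forward(self, x):                        # x: (..., dim)
        chunks = x.chunk(self.num_groups, dim=-1)
        return torch.cat([ffn(c) for ffn, c in zip(self.ffns, chunks)], dim=-1)


class SpatioTemporalAttention(nn.Module):
    """Temporal attention + grouped FFNs, then spatial attention + grouped FFNs (cf. claim 5)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_ffn = GroupedFFN(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_ffn = GroupedFFN(dim)

    def forward(self, x):                        # x: (B, T, N, C), N = number of spatial positions
        b, t, n, c = x.shape
        # Temporal attention: attend across frames independently for each spatial position.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt = self.temporal_attn(xt, xt, xt)[0] + xt
        xt = self.temporal_ffn(xt)
        xt = xt.reshape(b, n, t, c).permute(0, 2, 1, 3)
        # Spatial attention on the temporal result: attend across positions within each frame.
        xs = xt.reshape(b * t, n, c)
        xs = self.spatial_attn(xs, xs, xs)[0] + xs
        xs = self.spatial_ffn(xs)
        return xs.reshape(b, t, n, c)            # operation result of the attention layer
```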
6. The method according to claim 2, wherein adjusting parameters of the video generation model according to the difference between the sample video and the first prediction video to obtain a trained video generation model comprises:
according to the difference between the sample video and the first prediction video, adjusting parameters of the video generation model to obtain an adjusted video generation model;
adding a space-time compensation module in a decoding network of the adjusted video generation model to obtain an updated video generation model, wherein the space-time compensation module is used for supplementing time sequence information lost in the encoding process, and the parameters of the space-time compensation module are initialization parameters;
and training the updated video generation model by using the training sample to obtain the trained video generation model.
7. The method of claim 6, wherein the updated video generation model comprises an encoding network, the encoding network is configured to encode the sample video to obtain encoded video features, noise is added to the encoded video features to obtain the hidden space representation, the encoding network comprises at least one encoder, the decoding network comprises at least one decoder, and each decoder corresponds to a space-time compensation module;
the training the updated video generation model by using the training sample to obtain the trained video generation model comprises:
encoding, by an i-th encoder, output features of an (i-1)-th encoder to obtain output features of the i-th encoder, wherein the input features of the 1st encoder are the sample video, the output features of the last encoder are the encoded video features, and i is a positive integer;
adding noise to, and then denoising, the encoded video features to obtain a second denoised hidden space representation;
performing, by the space-time compensation module corresponding to an i-th decoder, temporal discrimination on the output features of the (i-1)-th decoder to obtain timing-compensated output features corresponding to the (i-1)-th decoder;
determining, by the i-th decoder, the output features corresponding to the i-th decoder according to the timing-compensated output features corresponding to the (i-1)-th decoder and the output features of the (i-1)-th decoder, wherein the input features of the space-time compensation module corresponding to the 1st decoder are the second denoised hidden space representation;
determining a second prediction video based on the output features of the last decoder;
determining a first loss function value according to the difference between the sample video and the second prediction video;
determining a second loss function value according to the output features respectively corresponding to the at least one encoder and the timing-compensated output features respectively corresponding to the at least one decoder;
and training the updated video generation model according to the first loss function value and the second loss function value to obtain the trained video generation model.
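A non-limiting sketch of the second training stage described in claims 6 and 7 is given below. It assumes that mirrored encoder and decoder stages produce features of matching shape, that the compensated features and the previous stage's output are combined by simple addition, and that both loss terms are mean-squared errors; denoise_fn stands in for the noising and denoising of the encoded video features.

```python
import torch.nn.functional as F


def compensated_training_step(video, encoders, decoders, comp_modules, denoise_fn, optimizer):
    """Sketch of claims 6-7: comp_modules[i] is the space-time compensation module of the i-th decoder."""
    # Encoder chain: the input of the first encoder is the sample video.
    enc_feats = []
    feat = video
    for enc in encoders:
        feat = enc(feat)
        enc_feats.append(feat)

    # Add noise to the encoded video features and denoise them to obtain the second denoised latent.
    denoised = denoise_fn(enc_feats[-1])

    # Decoder chain: the compensation module of each decoder processes the previous stage's output.
    comp_feats = []
    prev = denoised                               # input of the 1st decoder's compensation module
    for dec, comp in zip(decoders, comp_modules):
        compensated = comp(prev)                  # temporal discrimination / timing compensation
        comp_feats.append(compensated)
        prev = dec(compensated + prev)            # combine compensated and uncompensated features
    pred_video = prev                             # output of the last decoder

    # First loss: reconstruction difference; second loss: encoder vs. compensated decoder features.
    loss1 = F.mse_loss(pred_video, video)
    loss2 = sum(F.mse_loss(c, e) for c, e in zip(comp_feats, reversed(enc_feats)))
    loss = loss1 + loss2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```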
8. The method of claim 6, wherein before the adding of the space-time compensation module to the decoding network of the adjusted video generation model to obtain the updated video generation model, the method further comprises:
adding a control network and an adapter into the adjusted video generation model to obtain a control video generation model, wherein the control network comprises at least one denoising unit;
adjusting the control video generation model by using at least one supplemental training sample to obtain an adjusted control video generation model, wherein the supplemental training sample is used for supplementing the sample video in the training sample with visual information of the sample video;
and updating the adjusted video generation model by using the adjusted control video generation model.
9. The method of claim 8, wherein the adjusting the control video generation model by using at least one supplemental training sample to obtain the adjusted control video generation model comprises:
performing feature adaptation on the visual information in the supplemental training sample by using the adapter to obtain adaptation features corresponding to the visual information;
obtaining visual features corresponding to the visual information through the control network according to the adaptation features corresponding to the visual information;
encoding the description text in the supplemental training sample through the text encoding network to obtain a text representation corresponding to the description text;
denoising the hidden space representation according to the text representation and the visual features through the denoising network to obtain a third denoised hidden space representation;
obtaining a third prediction video according to the third denoised hidden space representation through the decoding network;
and adjusting the control video generation model according to the difference between the sample video in the supplemental training sample and the third prediction video to obtain the adjusted control video generation model.
10. The method according to claim 9, wherein adjusting the control video generation model according to the difference between the sample video in the supplemental training samples and the third prediction video to obtain the adjusted control video generation model comprises:
and adjusting parameters of the adapter and the control network in the control video generation model according to the difference between the sample video in the supplemental training sample and the third prediction video to obtain the adjusted control video generation model.
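For claims 9 and 10, a minimal sketch of the control-network fine-tuning step might look as follows; the optimizer is assumed to be constructed over only the adapter and control-network parameters (claim 10), and all module interfaces are placeholders.

```python
import torch.nn.functional as F


def control_finetune_step(video, text, visual_info, noisy_latent,
                          adapter, control_net, text_encoder, denoiser, decoder, optimizer):
    """Sketch of claims 9-10: only the adapter and the control network receive parameter updates."""
    # Feature-adapt the supplemental visual information and feed it through the control network.
    adapted = adapter(visual_info)
    visual_feats = control_net(adapted)

    # Encode the description text, then denoise the latent conditioned on text and visual features.
    text_repr = text_encoder(text)
    denoised = denoiser(noisy_latent, text_repr, visual_feats)
    pred_video = decoder(denoised)

    # Adjust only adapter / control-network parameters (the optimizer holds only those parameters).
    loss = F.mse_loss(pred_video, video)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```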
11. The method of claim 9, wherein the control video generation model further comprises an image encoder and a linear adapter, and wherein the supplemental training samples further comprise style information corresponding to the sample video;
wherein the encoding, by the text encoding network, the description text in the supplemental training sample to obtain the text representation corresponding to the description text comprises:
encoding the description text in the supplemental training sample through the text encoding network to obtain the text representation corresponding to the description text;
performing image encoding on the style information through the image encoder to obtain image encoding features;
performing linear adaptation on the image encoding features through the linear adapter to obtain adaptation features corresponding to the style information;
obtaining a comprehensive text representation according to the adaptation features corresponding to the style information and the text representation corresponding to the description text;
and updating the text representation by using the comprehensive text representation.
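For claim 11, the style-conditioning path might be sketched as follows; combining the adapted style feature with the text representation by concatenation is an assumption of this sketch, and image_encoder stands for any frozen image backbone.

```python
import torch
import torch.nn as nn


class StyleConditioner(nn.Module):
    """Folds style information into the text representation (cf. claim 11)."""
    def __init__(self, image_encoder, image_dim, text_dim):
        super().__init__()
        self.image_encoder = image_encoder       # frozen image backbone, e.g. a CLIP-like encoder
        self.linear_adapter = nn.Linear(image_dim, text_dim)

    def forward(self, style_image, text_repr):   # text_repr: (B, L, text_dim)
        # Image-encode the style information, then linearly adapt the encoded features.
        image_feats = self.image_encoder(style_image)        # (B, image_dim)
        style_adapt = self.linear_adapter(image_feats)       # (B, text_dim)
        # Combine the style adaptation features with the text representation to obtain the
        # comprehensive text representation, which replaces the original text representation.
        return torch.cat([text_repr, style_adapt.unsqueeze(1)], dim=1)
```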
12. A video generation method based on a video generation model, wherein the video generation model comprises a denoising network, the denoising network comprises at least one denoising unit, and the method comprises:
acquiring a description text for generating a video;
acquiring a hidden space representation, wherein the hidden space representation is used for representing noise distribution before video generation;
denoising, by the at least one denoising unit, the hidden space representation according to the description text, and then decoding to obtain a prediction video corresponding to the description text, wherein at least two convolution mechanisms for the hidden space representation are adopted in the denoising unit.
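By way of illustration only, the generation method of claim 12 might be sketched as follows; the fixed number of denoising steps and the simplified sampling loop (the denoiser is assumed to return a less noisy latent directly) are assumptions of this sketch.

```python
import torch


@torch.no_grad()
def generate_video(text, text_encoder, denoiser, decoder, latent_shape, steps=50):
    """Sketch of claim 12: start from noise, denoise conditioned on the description text, decode."""
    text_repr = text_encoder(text)
    # Hidden space representation characterising the noise distribution before generation.
    latent = torch.randn(latent_shape)
    # Iterative denoising by the at least one denoising unit, conditioned on the description text.
    for t in reversed(range(steps)):
        timestep = torch.full((latent_shape[0],), t, dtype=torch.long)
        latent = denoiser(latent, timestep, text_repr)
    # Decode the denoised representation into the prediction video.
    return decoder(latent)
```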
13. The method of claim 12, wherein the video generation model further comprises a text encoding network and a decoding network; the method further comprises the steps of:
encoding the description text through the text encoding network to obtain a text representation corresponding to the description text;
wherein the denoising, by the at least one denoising unit, the hidden space representation according to the description text and then decoding to obtain the prediction video corresponding to the description text comprises:
denoising the hidden space representation according to the text representation by the at least one denoising unit to obtain a denoised hidden space representation;
and obtaining the prediction video through the decoding network according to the denoised hidden space representation.
14. The method of claim 13, wherein the denoising unit comprises a convolution layer, wherein the convolution layer comprises a first convolution sub-layer and at least one second convolution sub-layer, and wherein the first convolution sub-layer and the second convolution sub-layer correspond to different convolution mechanisms;
the first convolution sub-layer is used for performing convolution processing in the time dimension on the features input into the convolution layer and then performing convolution processing in the space dimension;
the second convolution sub-layer is used for compressing, in the time dimension, the features input into the convolution layer and then performing convolution processing in the space dimension on the compressed features; or,
the second convolution sub-layer is used for compressing, in the space dimension, the features input into the convolution layer and then performing convolution processing in the time dimension on the compressed features; or,
the second convolution sub-layer is used for acquiring difference information, in the time dimension, between two adjacent frames of the features input into the convolution layer.
15. The method of claim 13, wherein the video generation model further comprises a control network and an adapter; the method further comprises the steps of:
acquiring visual information for generating a video;
performing feature adaptation on the visual information by using the adapter to obtain adaptation features corresponding to the visual information;
obtaining visual features corresponding to the visual information through the control network according to the adaptation features corresponding to the visual information;
wherein the denoising, by the at least one denoising unit, the hidden space representation according to the text representation to obtain the denoised hidden space representation comprises:
and denoising the hidden space representation according to the text representation and the visual features by the at least one denoising unit to obtain the denoised hidden space representation.
16. The method of claim 13, wherein the video generation model further comprises an image encoder and a linear adapter; the method further comprises the steps of:
acquiring style information for generating a video;
wherein the encoding, by the text encoding network, the description text to obtain the text representation corresponding to the description text comprises:
encoding the description text through the text encoding network to obtain the text representation corresponding to the description text;
performing image encoding on the style information through the image encoder to obtain image encoding features;
performing linear adaptation on the image encoding features through the linear adapter to obtain adaptation features corresponding to the style information;
obtaining a comprehensive text representation according to the adaptation features corresponding to the style information and the text representation corresponding to the description text;
and updating the text representation by using the comprehensive text representation.
17. A training apparatus for a video generation model, the video generation model comprising a denoising network including at least one denoising unit therein, the apparatus comprising:
the sample acquisition module is used for acquiring at least one training sample, and each training sample comprises a sample video and a description text corresponding to the sample video;
the video noise adding module is used for obtaining a hidden space representation corresponding to the sample video by adding noise to the sample video;
the video generation module is used for denoising the hidden space representation according to the description text through the at least one denoising unit and then decoding to obtain a first prediction video, wherein at least two convolution mechanisms for the hidden space representation are adopted in the denoising unit;
and the parameter adjustment module is used for adjusting the parameters of the video generation model according to the difference between the sample video and the first prediction video to obtain a trained video generation model.
18. A video generation apparatus based on a video generation model, the video generation model comprising a denoising network including at least one denoising unit therein, the apparatus comprising:
the acquisition module is used for acquiring a description text for generating a video;
the representation acquisition module is used for acquiring a hidden space representation, wherein the hidden space representation is used for representing noise distribution before video generation;
the video generation module is used for denoising the hidden space representation according to the description text through the at least one denoising unit and then decoding to obtain a prediction video corresponding to the description text, wherein at least two convolution mechanisms for the hidden space representation are adopted in the denoising unit.
19. A computer device comprising a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of training a video generation model according to any one of claims 1 to 11 or to implement the method of video generation based on a video generation model according to any one of claims 12 to 16.
20. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, which is loaded and executed by a processor to implement the training method of the video generation model according to any one of claims 1 to 11 or to implement the video generation method based on the video generation model according to any one of claims 12 to 16.
CN202311486824.XA 2023-11-08 2023-11-08 Training method, device, equipment and storage medium of video generation model Pending CN117499711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311486824.XA CN117499711A (en) 2023-11-08 2023-11-08 Training method, device, equipment and storage medium of video generation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311486824.XA CN117499711A (en) 2023-11-08 2023-11-08 Training method, device, equipment and storage medium of video generation model

Publications (1)

Publication Number Publication Date
CN117499711A true CN117499711A (en) 2024-02-02

Family

ID=89675985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311486824.XA Pending CN117499711A (en) 2023-11-08 2023-11-08 Training method, device, equipment and storage medium of video generation model

Country Status (1)

Country Link
CN (1) CN117499711A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117880444A (en) * 2024-03-12 2024-04-12 之江实验室 Human body rehabilitation exercise video data generation method guided by long-short time features
CN117880444B (en) * 2024-03-12 2024-05-24 之江实验室 Human body rehabilitation exercise video data generation method guided by long-short time features


Legal Events

Date Code Title Description
PB01 Publication