CN113934890A - Method and system for automatically generating scene video by characters - Google Patents
- Publication number
- CN113934890A (application CN202111538104.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- vector
- generated
- dynamic
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/124—Quantisation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Library & Information Science (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention relates to the field of video production, in particular to a method and a system for automatically generating scene videos from text. The system comprises: a composition logic generation module, which generates a composition template image according to the input text description; an image content generation module, which takes the composition template image generated by the composition logic generation module as input and outputs a rendered live-action image; and an image dynamization module, which converts the live-action image output by the image content generation module into consecutive multi-frame images to generate a dynamic video. Based on a natural-language pre-training model and computer vision techniques, the method automatically generates a short video from a given language input without third-party manual intervention, greatly improving the efficiency of short-video production; at the same time, the generated short videos are realistic and diverse, ensuring both the quality of the generated video and the novelty of the video material.
Description
Technical Field
The invention relates to the field of video production, in particular to a method and a system for automatically generating scene videos from text.
Background
With the development of the internet, short videos have emerged and rapidly occupied people's daily lives as a new way of recording content and presenting media.
In the field of video production, the traditional production process is cumbersome: specific material must be searched for or shot before a video can be created. Meanwhile, to avoid repetitive video material and ensure the novelty of the videos, designers need to continuously create and design.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides the following technical solutions:
A method for automatically generating scene videos from text comprises the following steps:
Step one: compressing and quantizing a composition template image with a vector-quantized variational autoencoder (VQ-VAE) to generate an image coding vector and an image token;
Step two: encoding the input language description with a pre-trained neural network language model to obtain word vectors and tokens;
Step three: flattening the image coding vectors from step one, splicing them with the word vectors from step two, and feeding the spliced sequence into a GPT model for autoregressive training, thereby establishing and modeling a direct relation between language and composition template images, so that inputting a sentence of language description generates the corresponding composition template image;
Step four: generating a live-action picture from the composition template image generated in step three, based on a style-transfer GAN network;
Step five: generating a subsequent series of image frames from the live-action picture generated in step four, based on an image dynamization GAN network, to produce a video.
Further, step one specifically includes: the composition template image is fed into a trained vector-quantized variational autoencoder (VQ-VAE) and converted into a sequence in a discrete latent space, yielding a discretely encoded image coding vector of dimension d and an image encoding token.
Further, the vector-quantized variational autoencoder VQ-VAE is mainly divided into three modules: an Encoder, a codebook, and a Decoder. The Encoder module encodes the input composition template image, the Decoder decodes it, and the two modules share the codebook module. Specifically, the Encoder encodes the image into a feature map whose vectors are quantized according to their Euclidean distance to the codebook vectors; that is, the nearest vector is found in the codebook by nearest-neighbor lookup, and each encoder output vector is converted into the discrete code e closest to it. The output is an image coding vector of dimension d and the corresponding token, which are sent into the Decoder module and decoded to generate the composition template image.
Further, the training mode of the vector-quantized variational autoencoder VQ-VAE is as follows: semantic segmentation images are adopted as the training data set, and the VQ-VAE training loss is reduced by an Adam stochastic-gradient backpropagation algorithm to obtain the optimal model parameters.
Further, step two specifically includes: the input language description is encoded by a pre-trained neural network language model to generate k word vectors and k word tokens.
Further, step three specifically includes the following steps:
(3.1) flattening the image coding vectors to generate g image coding vectors, where g is a fixed value, and adding position embedding to the g image coding vectors;
(3.2) splicing the k word vectors and the g image coding vectors, and likewise splicing the corresponding text and image tokens, to generate f embedded representation vectors;
(3.4) sending the f vectors into a GPT model for autoregressive training to establish the relation between the text word vectors and the image coding vectors, where the training of the GPT model is specifically: the training set consists of the f encoding vectors and their corresponding image and word tokens; the f encoding vectors are fed into the GPT model, which predicts the next token from the previously input vectors, and the softmax classification loss is reduced by a stochastic-gradient-descent backpropagation algorithm;
During prediction, the word vectors of the language input are fed into the GPT model, which predicts the composition-image compression-coding tokens step by step; the generated tokens are looked up in the codebook to find the corresponding vectors, and the resulting compression-coding vector map is sent into the Decoder module to generate the composition template image.
Further, step four specifically includes: the composition template image generated in step three and random noise are fed into the style-transfer GAN network, which comprises a generator and a discriminator. The generator uses the noise injected into the network to control the random attributes of the live-action picture, while the discriminator distinguishes whether a live-action picture is real or produced by the generator. During network training, the generator and the discriminator are trained jointly; during prediction, the composition template image generated in step three is input to the generator, which outputs the generated live-action picture.
Further, step five specifically includes the following steps:
(5.1) training an image dynamization GAN network, which is mainly divided into two modules: a generator and a discriminator. The generator controls the video characteristics through latent vectors and, combined with the noise injected into the network, controls the random attributes of the video; the discriminator distinguishes whether a video is real or produced by the generator. The latent vectors are divided into two groups: latent vectors encoding the colors and image layout, and latent vectors encoding the overall brightness of the video. The noise injected into the network is also divided into two groups: noise encoding the details and shapes of static entities in the video, and noise encoding the details and shapes of dynamic entities in the video;
(5.2) training an Encoder model, which takes a live-action picture as input and maps it to the latent space of the GAN network; the live-action picture output in step four is fed into the trained Encoder model to obtain the latent vector code corresponding to the live-action picture;
(5.3) marking the static and dynamic areas of the live-action picture using its corresponding composition template map, optimizing the latent vector code output by the Encoder model according to the reconstruction error, and generating the corresponding input noise according to the marked static/dynamic area information, namely the noise input generated to control the dynamic areas and the noise input generated to control the static areas, respectively;
(5.4) applying m dynamic transformations to the noise input generated to control the dynamic areas, feeding this, together with the previously generated latent vector code and the noise generated to control the static areas, into the generator of the pre-trained image dynamization GAN network model, and outputting m frames of images; the original live-action image and the generated m frames of images form a dynamic scene video of m + 1 frames.
Further, the dynamic transformation in step (5.4) is specifically: the noise input that controls the dynamic areas is transformed to simulate the motion of those areas.
A system for automatically generating scene videos from text, comprising:
a composition logic generation module, which generates a composition template image according to the input text description;
an image content generation module, which takes the composition template image generated by the composition logic generation module as input and outputs a rendered live-action image;
and an image dynamization module, which converts the live-action image output by the image content generation module into consecutive multi-frame images to generate a dynamic video.
The invention has the following advantages:
the method is based on the natural language pre-training model and the computer vision technology, the short video is automatically generated through the given language input without manual intervention of a third party, the short video generation efficiency is greatly improved, and meanwhile, the generated short video has authenticity and diversity, and the quality of the generated video and the novelty of video materials are ensured.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is an overall architecture diagram of the composition logic generation module;
FIG. 4 is an overall architecture diagram of an image content generation module;
FIG. 5 is a diagram illustrating the overall architecture of the image dynamization module;
FIG. 6 is a graph showing the results of an example of the present invention;
fig. 7 is a hardware configuration diagram of an arbitrary device having data processing capability in which the apparatus for automatically text-generating a scene video according to the present invention is located.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
As shown in fig. 1, a system for automatically generating scene videos from text according to the present invention includes:
a composition logic generation module, which generates a composition template image according to the input text description;
an image content generation module, which takes the composition template image generated by the composition logic generation module as input and outputs a rendered live-action image;
and an image dynamization module, which converts the live-action image output by the image content generation module into consecutive multi-frame images to generate a dynamic video.
As shown in fig. 2, a method for automatically generating scene videos from text according to the present invention includes:
Step 101: the invention adopts a vector-quantized variational autoencoder (VQ-VAE) to compress and quantize the composition template. The VQ-VAE uses a conventional algorithm, specifically as follows:
the VQ-VAE mainly comprises three parts, an Encoder module, a codebook module and a Decoder module.
An input image x is passed through the Encoder, whose output consists of h × w vector codes of dimension d. The codebook contains k d-dimensional vectors, denoted by C. For each encoder output vector, the vector with the shortest Euclidean distance in C is found and substituted, producing e; e is sent into the Decoder, which outputs the reconstructed image. Here d = 256, h = 32, w = 32, k = 1024.
During training, the input of the network is an image of size C × H × W, where C = 1, H = 256, W = 256; the output of the network is a composition template image of the same size.
The Encoder and Decoder structures are each built from conventional convolution, BatchNorm and residual-network blocks. The Encoder contains n resnet_blocks and downsamples m times spatially; the Decoder applies convolution and upsampling operations so that the input x and the reconstructed output have the same size.
The loss function of VQ-VAE is as follows:
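Assuming the standard VQ-VAE formulation, with sg[·] denoting the stop-gradient operator and β the commitment weight, the loss combines a reconstruction term, a codebook term and a commitment term:

```latex
L = \|x - \hat{x}\|_2^2 + \|\operatorname{sg}[z_e(x)] - e\|_2^2 + \beta\,\|z_e(x) - \operatorname{sg}[e]\|_2^2
```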
training: with the Adam optimizer, the initial learning rate was 0.001.
Prediction: an arbitrary composition template image is input to the VQ-VAE trained as above; the Encoder module outputs the feature map, the nearest vector in the codebook is found and substituted, and the result is an image coding vector of dimension d and the corresponding image token.
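A minimal PyTorch sketch of the nearest-neighbor codebook lookup described above; the class and argument names are illustrative, and only the sizes k = 1024, d = 256, h = w = 32 are taken from the text.

```python
# Sketch of the VQ-VAE quantization step: encoder features are replaced by their
# nearest codebook vectors, yielding quantized vectors and image tokens.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        # codebook C: k vectors of dimension d (k = 1024, d = 256 as in the text)
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e):
        # z_e: encoder output of shape (B, d, h, w), e.g. (B, 256, 32, 32)
        B, d, h, w = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, d)           # (B*h*w, d)
        # squared Euclidean distance to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))            # (B*h*w, k)
        tokens = dist.argmin(dim=1)                              # image tokens
        e = self.codebook(tokens).view(B, h, w, d).permute(0, 3, 1, 2)
        # straight-through estimator so gradients still reach the encoder
        e = z_e + (e - z_e).detach()
        return e, tokens.view(B, h * w)
```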
Step 102: the text description is input to the pre-trained neural network language model BERT, which outputs text tokens.
As shown in FIG. 3, the input language description is fed into BERT, generating k word vectors and k tokens. If fewer than k word vectors are output, a pad operation is performed so that the number of word vectors is k. Here k = 256; the output has size t × d, where t = k = 256 and d = 256.
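A minimal sketch of this encoding step using the HuggingFace transformers library; the checkpoint name is an illustrative assumption, and the 768-dimensional BERT hidden states would additionally need a projection to d = 256 as specified above.

```python
# Sketch: encode a text description into k = 256 word vectors with a pretrained BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")

text = "A beach at sunset with waves rolling onto the shore"
inputs = tokenizer(text, padding="max_length", max_length=256,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    out = bert(**inputs)

word_vectors = out.last_hidden_state   # (1, 256, 768) word vectors (padding included)
word_tokens = inputs["input_ids"]      # (1, 256) word token ids
```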
Step 103: the text and the image are jointly modeled based on the GPT model.
The image coding vectors are flattened and then fed, together with the text coding vectors, into the GPT model for autoregressive training.
In order to establish a mapping relation between the text information and the image information, the invention feeds the output of the language pre-training model BERT and the codes obtained after VQ-VAE quantization into a GPT model for relational modeling. The basic structure of the GPT model is described as follows:
The GPT model consists of m stacked self-attention blocks, where each self-attention block consists of Multi-Head Attention, Feed Forward and Add & Norm structural units; m = 8 in the invention.
The input of the GPT is obtained by adding the Position Embedding and the Token Embedding; it has size f × d, where f = 1280 and d = 256.
The objective function of the GPT training model is as follows:
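Assuming the standard GPT autoregressive language-modeling objective over a token sequence U = {u_1, …, u_f} with left-context window k and model parameters Θ:

```latex
L(\mathcal{U}) = \sum_i \log P\!\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```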
The calculation process of the GPT model is as follows:
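Under the same assumption, the layer-wise computation is typically written as:

```latex
h_0 = U W_e + W_p, \qquad
h_l = \operatorname{transformer\_block}(h_{l-1}),\ \ l = 1, \ldots, n, \qquad
P(u) = \operatorname{softmax}\!\left(h_n W_e^{\top}\right)
```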
where U denotes the preceding k word vectors (when computing the prediction probability of each word, only the vocabulary information within this left context window is considered), n denotes the number of layers of the self-attention block, W_e denotes the word-vector (token embedding) matrix, and W_p denotes the position embedding.
Training: during training, an Adam optimizer is adopted, and the initial learning rate is 0.0003.
Prediction: after the text and the image have been jointly modeled, a text description is input to the pre-trained neural network language model BERT, the output of BERT is input to the GPT, and the tokens of the image are generated step by step. The corresponding vectors are then looked up from the image tokens in the codebook of the VQ-VAE trained in step 101, giving the code map e, and e is input into the Decoder module to output the reconstructed composition template image.
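A minimal sketch of this prediction loop; gpt, bert_encode, codebook and decoder stand in for the trained components described above, and their names and interfaces are illustrative assumptions rather than the patent's actual implementation.

```python
# Sketch: generate composition-template image tokens autoregressively, then decode.
import torch

@torch.no_grad()
def text_to_template(text, num_image_tokens=1024):
    word_vecs = bert_encode(text)                    # (1, k, d) word vectors from step 102
    seq = word_vecs
    image_tokens = []
    for _ in range(num_image_tokens):                # 32 * 32 = 1024 image tokens
        logits = gpt(seq)                            # predict the next token from the prefix
        next_token = logits[:, -1].argmax(dim=-1)    # greedy choice (sampling also possible)
        image_tokens.append(next_token)
        next_vec = codebook(next_token).unsqueeze(1) # look up the codebook vector
        seq = torch.cat([seq, next_vec], dim=1)      # append and continue
    tokens = torch.stack(image_tokens, dim=1)        # (1, 1024)
    grid = codebook(tokens).view(1, 32, 32, -1).permute(0, 3, 1, 2)
    return decoder(grid)                             # reconstructed composition template
```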
Step 104: the generation of the live-action picture from the composition template adopts the classical pixel2pixel model among style-transfer GAN networks. Pixel2pixel is a cGAN and can complete the conversion from a composition template image to a live-action image. It is mainly divided into two parts, a generator and a discriminator. The overall structure of the Pixel2Pixel generator adopts a U-Net: the symmetric skip connections of the U-Net directly copy low-level information onto the high-level feature maps, splicing the i-th layer onto the (n − i)-th layer, where n is the total number of network layers. The loss combines an L1 term with a PatchGAN discriminator: the L1 loss assists by learning low-frequency information, while the PatchGAN judges at a time only whether one N × N image block is real, and the results for all blocks are averaged as the judgment for the whole image. In the invention, n = 9 and N = 70.
During training, the generator and the discriminator are trained alternately.
The objective function for Pixel2Pixel optimization is as follows:
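Assuming the standard pix2pix formulation of this objective, a conditional-GAN term combined with a weighted L1 term:

```latex
G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)
```

wherein

```latex
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\!\left[\log D(x, y)\right]
 + \mathbb{E}_{x,z}\!\left[\log\left(1 - D(x, G(x, z))\right)\right], \qquad
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\!\left[\lVert y - G(x, z)\rVert_{1}\right]
```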
where z is noise and the network learns the mapping from the template image x to the live-action image y; G denotes the generator and D the discriminator. During training, G and D are trained alternately.
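A minimal sketch of a PatchGAN-style discriminator consistent with the description above; the channel widths and the 4×4/stride-2 convolutions are illustrative assumptions, with only the idea of averaging per-patch real/fake decisions taken from the text.

```python
# Sketch: PatchGAN discriminator — scores each N x N patch as real/fake and
# averages the patch decisions into one score per image.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=4):  # e.g. 1-channel template + 3-channel image (assumed)
        super().__init__()
        def block(c_in, c_out, norm=True):
            layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.net = nn.Sequential(
            *block(in_channels, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            nn.Conv2d(256, 1, kernel_size=4, stride=1, padding=1),  # patch score map
        )

    def forward(self, template, image):
        x = torch.cat([template, image], dim=1)   # condition on the composition template
        patch_scores = self.net(x)                # (B, 1, H', W'): one score per patch
        return patch_scores.mean(dim=[2, 3])      # average patch decisions per image
```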
Training: the generator and the discriminator both use an Adam optimizer, with an initial learning rate of 0.0001.
During prediction, the composition template image reconstructed in step 103 is input to the generator to generate the live-action picture.
As shown in fig. 6, consecutive multi-frame images are generated from the input text description by the method of the invention, and a dynamic video is produced.
This embodiment employs a deep-landscape network.
The deep-landscape generator controls the image characteristics through latent vectors and, combined with the noise injected into the network, controls the random attributes of the image. The task of the discriminator is to distinguish whether a picture is real or produced by the generator.
The trained deep-landscape network generates a series of images, and the latent vector code corresponding to each image is stored at the same time. When training the Encoder (which can be a ResNet network), a picture generated by deep-landscape is input and the corresponding latent code vector is output.
The static and dynamic areas of the live-action picture generated in step 104 are marked, for example: the blue sky and the sea belong to the static areas, and the remaining parts belong to the dynamic areas. The latent vector code and the input noise are then fine-tuned together with the composition template map of the picture: the latent vector code output by the Encoder model is optimized according to the reconstruction error, and the corresponding input noise is generated according to the marked static/dynamic area information, namely the noise input generated to control the dynamic areas and the noise input generated to control the static areas.
Training deep-landscape: the generator and the discriminator both adopt an Adam optimizer during training, and the initial learning rate is 0.0001.
Prediction: m dynamic transformations are applied to the noise input generated to control the dynamic areas, which is then fed, together with the previously generated latent vector code and the noise generated to control the static areas, into the generator of the trained deep-landscape network to generate m new images. The original live-action picture and the m subsequently generated live-action pictures form the final generated video; in the invention m = 200.
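A minimal sketch of this prediction loop; the generator G, the inferred latent code and the two noise groups are assumed to come from the trained image-dynamization network, and torch.roll is used here only as one simple illustrative way to simulate motion of the dynamic-region noise.

```python
# Sketch: produce m frames by repeatedly transforming the dynamic-region noise
# while keeping the latent code and static-region noise fixed.
import torch

@torch.no_grad()
def animate(G, latent_code, noise_static, noise_dynamic, first_frame, m=200, shift=1):
    frames = [first_frame]                        # the original live-action image
    nd = noise_dynamic
    for _ in range(m):
        # translate the dynamic-region noise to simulate motion (e.g. drifting clouds)
        nd = torch.roll(nd, shifts=shift, dims=-1)
        frame = G(latent_code, noise_static, nd)  # static parts fixed, dynamic parts move
        frames.append(frame)
    return torch.stack(frames)                    # m + 1 frames -> dynamic scene video
```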
The dynamic background information in the scene graph is transformed to generate the dynamic scene video. Deep-landscape is mainly divided into two parts, a generator and a discriminator. The generator network comprises two parts: the first is the Mapping network, which generates an intermediate latent variable w from the latent variable z; the second is the Synthesis network, in which the affine transformation obtained from w controls the style of the generated images, while random noise injected into the network enriches the details of the generated images.
The discriminator is a classification network consisting of n convolution modules that perform convolution and downsampling operations; n = 9 in the invention.
The style of the generated image is controlled by the adaptive instance normalization AdaIN, and the specific formula is as follows:
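Assuming the standard StyleGAN form of adaptive instance normalization:

```latex
\operatorname{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}
```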
where μ(x_i) and σ(x_i) respectively denote the mean and standard deviation of the input feature x_i; y_{s,i} and y_{b,i} are the scaling and bias values generated by the affine transformation of w, which apply the style to the i-th spatial feature map.
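The same operation as a short PyTorch sketch (tensor shapes are illustrative assumptions):

```python
# Sketch: adaptive instance normalization — normalize each feature map with its own
# mean/std, then rescale and shift with style-derived parameters y_s, y_b.
import torch

def adain(x, y_s, y_b, eps=1e-5):
    # x: (B, C, H, W) feature maps; y_s, y_b: (B, C, 1, 1) style scale and bias from w
    mu = x.mean(dim=[2, 3], keepdim=True)
    sigma = x.std(dim=[2, 3], keepdim=True)
    return y_s * (x - mu) / (sigma + eps) + y_b
```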
The latent vectors encode the overall layout of the colors and scenes in the scene graph and the illumination-brightness changes in the generated video; the input noise used in generating the video is divided into noise encoding the details and shapes of static objects and noise encoding the details and shapes of dynamic objects.
In summary, the invention involves basic network models such as VQ-VAE, GPT, pixel2pixel and StyleGAN. VQ-VAE can effectively exploit the latent space and can model important features that usually span multiple dimensions of the data space. The discretization property of VQ-VAE is used to compress and quantize the composition template image into serialized tokens; a GPT autoregressive model is adopted for the generative task, and the pre-trained BERT language model together with the serialized tokens of the composition template image completes the mapping from language to image. During prediction, a series of tokens corresponding to the composition template image is generated from the language input, and the pixel2pixel network completes the image-to-image translation. On this basis, the invention provides a new way of acquiring material for video creation, reduces the cost of video creation and improves the efficiency of video creation.
Corresponding to the foregoing embodiments of the method for automatically generating scene videos from text, the invention also provides embodiments of an apparatus for automatically generating scene videos from text.
Referring to fig. 7, an apparatus for automatically generating scene videos from text according to an embodiment of the present invention includes one or more processors configured to implement the method for automatically generating scene videos from text in the foregoing embodiments.
The embodiment of the apparatus for automatically generating scene video by text of the invention can be applied to any equipment with data processing capability, and the equipment with data processing capability can be equipment or apparatus such as a computer. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. From a hardware aspect, as shown in fig. 7, the present invention is a hardware structure diagram of any device with data processing capability where the apparatus for automatically generating scene videos by text is located, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 7, in an embodiment, any device with data processing capability where the apparatus is located may also include other hardware according to the actual function of the any device with data processing capability, which is not described again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides a computer readable storage medium, which stores a program, and when the program is executed by a processor, the method for automatically generating the scene video by the characters in the embodiment is realized.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the invention as described in the foregoing examples, or that certain features may be substituted in the same way. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.
Claims (10)
1. A method for automatically generating scene videos from text, characterized by comprising the following steps:
Step one: compressing and quantizing a composition template image with a vector-quantized variational autoencoder (VQ-VAE) to generate an image coding vector and an image token;
Step two: encoding the input language description with a pre-trained neural network language model to obtain word vectors and tokens;
Step three: flattening the image coding vectors from step one, splicing them with the word vectors from step two, and feeding the spliced sequence into a GPT model for autoregressive training, thereby establishing and modeling a direct relation between language and composition template images, so that inputting a sentence of language description generates the corresponding composition template image;
Step four: generating a live-action picture from the composition template image generated in step three, based on a style-transfer GAN network;
Step five: generating a subsequent series of image frames from the live-action picture generated in step four, based on an image dynamization GAN network, to produce a video.
2. The method for automatically generating scene videos from text according to claim 1, characterized in that step one specifically comprises: the composition template image is fed into a trained vector-quantized variational autoencoder (VQ-VAE) and converted into a sequence in a discrete latent space, yielding a discretely encoded image coding vector of dimension d and an image encoding token.
3. The method according to claim 2, characterized in that the vector-quantized variational autoencoder VQ-VAE is mainly divided into three modules, an Encoder, a codebook and a Decoder, wherein the Encoder module encodes the input composition template image, the Decoder decodes it, and the two modules share the codebook module; specifically, the Encoder encodes the image into a feature map whose vectors are quantized according to their Euclidean distance to the codebook vectors, that is, the nearest vector is found in the codebook by nearest-neighbor lookup and each encoder output vector is converted into the discrete code e closest to it, so that the output is an image coding vector of dimension d and the corresponding token, which are sent into the Decoder module and decoded to generate the composition template image.
4. The method according to claim 3, characterized in that the vector-quantized variational autoencoder VQ-VAE is trained as follows: semantic segmentation images are adopted as the training data set, and the VQ-VAE training loss is reduced by an Adam stochastic-gradient backpropagation algorithm to obtain the optimal model parameters.
6. The method for automatically generating scene videos from text according to claim 5, characterized in that step three specifically comprises the following steps:
(3.1) flattening the image coding vectors to generate g image coding vectors, where g is a fixed value, and adding position embedding to the g image coding vectors;
(3.2) splicing the k word vectors and the g image coding vectors, and likewise splicing the corresponding text and image tokens, to generate f embedded representation vectors;
(3.4) sending the f vectors into a GPT model for autoregressive training to establish the relation between the text word vectors and the image coding vectors, wherein the training of the GPT model is specifically: the training set comprises the f coding vectors and the corresponding image and word tokens; the f coding vectors are fed into the GPT model, which predicts the next token from the previously input vectors, and the softmax classification loss function is reduced by a stochastic-gradient-descent backpropagation algorithm;
during prediction, the word vectors of the language input are fed into the GPT model, which predicts the composition-image compression-coding tokens step by step; the generated tokens are looked up in the codebook to find the corresponding vectors, and the resulting compression-coding vector map is sent into the Decoder module to generate the composition template image.
7. The method for automatically generating scene videos from text according to claim 6, characterized in that step four specifically comprises: the composition template image generated in step three and random noise are fed into a style-transfer GAN network comprising a generator and a discriminator, wherein the generator uses the noise injected into the network to control the random attributes of the live-action picture and the discriminator distinguishes whether a live-action picture is real or produced by the generator; the generator and the discriminator are trained jointly during network training, and during prediction the composition template image generated in step three is input to the generator, which outputs the generated live-action picture.
8. The method for automatically generating scene videos from text according to claim 7, characterized in that step five specifically comprises the following steps:
(5.1) training an image dynamization GAN network, which is mainly divided into two modules: a generator and a discriminator, wherein the generator controls the video characteristics through latent vectors and, combined with the noise injected into the network, controls the random attributes of the video, and the discriminator distinguishes whether a video is real or produced by the generator; the latent vectors are divided into two groups: latent vectors encoding the colors and image layout, and latent vectors encoding the overall brightness of the video; the noise injected into the network is also divided into two groups: noise encoding the details and shapes of static entities in the video, and noise encoding the details and shapes of dynamic entities in the video;
(5.2) training an Encoder model, which takes a live-action picture as input and maps it to the latent space of the GAN network; the live-action picture output in step four is fed into the trained Encoder model to obtain the latent vector code corresponding to the live-action picture;
(5.3) marking the static and dynamic areas of the live-action picture using its corresponding composition template map, optimizing the latent vector code output by the Encoder model according to the reconstruction error, and generating the corresponding input noise according to the marked static/dynamic area information, namely the noise input generated to control the dynamic areas and the noise input generated to control the static areas, respectively;
(5.4) applying m dynamic transformations to the noise input generated to control the dynamic areas, feeding this, together with the previously generated latent vector code and the noise generated to control the static areas, into the generator of the pre-trained image dynamization GAN network model, and outputting m frames of images; the original live-action image and the generated m frames of images form a dynamic scene video of m + 1 frames.
10. A system for automatically generating scene videos from text as claimed in any one of claims 1 to 9, comprising:
a composition logic generation module, which generates a composition template image according to the input text description;
an image content generation module, which takes the composition template image generated by the composition logic generation module as input and outputs a rendered live-action image;
and an image dynamization module, which converts the live-action image output by the image content generation module into consecutive multi-frame images to generate a dynamic video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111538104.4A CN113934890B (en) | 2021-12-16 | 2021-12-16 | Method and system for automatically generating scene video by characters |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111538104.4A CN113934890B (en) | 2021-12-16 | 2021-12-16 | Method and system for automatically generating scene video by characters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113934890A true CN113934890A (en) | 2022-01-14 |
CN113934890B CN113934890B (en) | 2022-04-15 |
Family
ID=79289156
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111538104.4A Active CN113934890B (en) | 2021-12-16 | 2021-12-16 | Method and system for automatically generating scene video by characters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113934890B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114610935A (en) * | 2022-05-12 | 2022-06-10 | 之江实验室 | Method and system for synthesizing semantic image of text control image style |
CN115249062A (en) * | 2022-09-22 | 2022-10-28 | 武汉大学 | Network model, method and device for generating video by text |
CN115511969A (en) * | 2022-11-22 | 2022-12-23 | 阿里巴巴(中国)有限公司 | Image processing and data rendering method, apparatus and medium |
CN115880158A (en) * | 2023-01-30 | 2023-03-31 | 西安邮电大学 | Blind image super-resolution reconstruction method and system based on variational self-coding |
WO2023154192A1 (en) * | 2022-02-14 | 2023-08-17 | Snap Inc. | Video synthesis via multimodal conditioning |
CN117496025A (en) * | 2023-10-19 | 2024-02-02 | 四川大学 | Multi-mode scene generation method based on relation and style perception |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110572696A (en) * | 2019-08-12 | 2019-12-13 | 浙江大学 | variational self-encoder and video generation method combining generation countermeasure network |
US20200019863A1 (en) * | 2018-07-12 | 2020-01-16 | International Business Machines Corporation | Generative Adversarial Network Based Modeling of Text for Natural Language Processing |
CN111858954A (en) * | 2020-06-29 | 2020-10-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Task-oriented text-generated image network model |
- 2021-12-16: CN application CN202111538104.4A granted as patent CN113934890B (status: Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200019863A1 (en) * | 2018-07-12 | 2020-01-16 | International Business Machines Corporation | Generative Adversarial Network Based Modeling of Text for Natural Language Processing |
CN110572696A (en) * | 2019-08-12 | 2019-12-13 | 浙江大学 | variational self-encoder and video generation method combining generation countermeasure network |
CN111858954A (en) * | 2020-06-29 | 2020-10-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Task-oriented text-generated image network model |
Non-Patent Citations (2)
Title |
---|
WILSON YAN et al.: "VideoGPT: Video Generation using VQ-VAE and Transformers", arXiv *
ZHUANG Xingwang et al.: "Text-to-Image Generation Model with Multi-dimensional Attention and Semantic Regeneration", Computer Technology and Development *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023154192A1 (en) * | 2022-02-14 | 2023-08-17 | Snap Inc. | Video synthesis via multimodal conditioning |
CN114610935A (en) * | 2022-05-12 | 2022-06-10 | 之江实验室 | Method and system for synthesizing semantic image of text control image style |
CN115249062A (en) * | 2022-09-22 | 2022-10-28 | 武汉大学 | Network model, method and device for generating video by text |
CN115249062B (en) * | 2022-09-22 | 2023-02-03 | 武汉大学 | Network model, method and device for generating video by text |
CN115511969A (en) * | 2022-11-22 | 2022-12-23 | 阿里巴巴(中国)有限公司 | Image processing and data rendering method, apparatus and medium |
CN115880158A (en) * | 2023-01-30 | 2023-03-31 | 西安邮电大学 | Blind image super-resolution reconstruction method and system based on variational self-coding |
CN115880158B (en) * | 2023-01-30 | 2023-10-27 | 西安邮电大学 | Blind image super-resolution reconstruction method and system based on variation self-coding |
CN117496025A (en) * | 2023-10-19 | 2024-02-02 | 四川大学 | Multi-mode scene generation method based on relation and style perception |
CN117496025B (en) * | 2023-10-19 | 2024-06-04 | 四川大学 | Multi-mode scene generation method based on relation and style perception |
Also Published As
Publication number | Publication date |
---|---|
CN113934890B (en) | 2022-04-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113934890B (en) | Method and system for automatically generating scene video by characters | |
CN107979764B (en) | Video subtitle generating method based on semantic segmentation and multi-layer attention framework | |
Wu et al. | Nüwa: Visual synthesis pre-training for neural visual world creation | |
WO2024051445A1 (en) | Image generation method and related device | |
CN113901894A (en) | Video generation method, device, server and storage medium | |
CN114610935B (en) | Method and system for synthesizing semantic image of text control image style | |
CN109996073B (en) | Image compression method, system, readable storage medium and computer equipment | |
CN114390218B (en) | Video generation method, device, computer equipment and storage medium | |
CN113961736A (en) | Method and device for generating image by text, computer equipment and storage medium | |
CN116205820A (en) | Image enhancement method, target identification method, device and medium | |
CN113781324A (en) | Old photo repairing method | |
CN115880762A (en) | Scalable human face image coding method and system for human-computer mixed vision | |
WO2023068953A1 (en) | Attention-based method for deep point cloud compression | |
US20230319223A1 (en) | Method and system for deep learning based face swapping with multiple encoders | |
Lee et al. | A Brief Survey of text driven image generation and maniulation | |
CN117499711A (en) | Training method, device, equipment and storage medium of video generation model | |
CN117115713A (en) | Dynamic image generation method, device, equipment and medium thereof | |
CN114283181B (en) | Dynamic texture migration method and system based on sample | |
CN113780209B (en) | Attention mechanism-based human face attribute editing method | |
US20230316587A1 (en) | Method and system for latent-space facial feature editing in deep learning based face swapping | |
CN113781376B (en) | High-definition face attribute editing method based on divide-and-congress | |
CN113411615B (en) | Virtual reality-oriented latitude self-adaptive panoramic image coding method | |
Teng et al. | Blind face restoration via multi-prior collaboration and adaptive feature fusion | |
Wang et al. | Facial Landmarks and Generative Priors Guided Blind Face Restoration | |
CN118551074B (en) | Cross-modal music generation method and device for video soundtrack |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||