CN113934890A - Method and system for automatically generating scene video by characters - Google Patents

Method and system for automatically generating scene video by characters

Info

Publication number
CN113934890A
CN113934890A (application number CN202111538104.4A)
Authority
CN
China
Prior art keywords
image
vector
generated
dynamic
video
Prior art date
Legal status
Granted
Application number
CN202111538104.4A
Other languages
Chinese (zh)
Other versions
CN113934890B (en)
Inventor
马诗洁
王俊彦
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202111538104.4A priority Critical patent/CN113934890B/en
Publication of CN113934890A publication Critical patent/CN113934890A/en
Application granted granted Critical
Publication of CN113934890B publication Critical patent/CN113934890B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7844 - Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 - Adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 - Quantisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to the field of video production, in particular to a method and a system for automatically generating scene videos from text. The system comprises: a composition logic generation module for generating a composition template image according to the input text description; an image content generation module that takes the composition template image generated by the composition logic generation module as input and outputs a rendered live-action image; and an image dynamic module for converting the live-action image output by the image content generation module into continuous multi-frame images to generate a dynamic video. Based on a pre-trained natural language model and computer vision technology, the method automatically generates short videos from a given language input without third-party manual intervention, which greatly improves the efficiency of short-video production; at the same time, the generated short videos are realistic and diverse, ensuring both the quality of the generated video and the novelty of the video material.

Description

Method and system for automatically generating scene video by characters
Technical Field
The invention relates to the field of video production, in particular to a method and a system for automatically generating scene videos by characters.
Background
With the development of the internet, short video has emerged and rapidly become part of people's lives as a new way of recording content and presenting media.
In the field of video production, the traditional production process is cumbersome: specific material must be searched for or shot for each video. At the same time, to avoid repetitive video material and keep videos novel, designers must continuously create and design new content.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides the following technical solutions:
a method for automatically generating scene videos by characters comprises the following steps:
Step one: compressing and quantizing a composition template image through a vector quantization variational autoencoder (VQ-VAE) to generate image coding vectors and image tokens;
Step two: coding the input language description through a pre-trained neural network language model to obtain word vectors and word tokens;
Step three: flattening the image coding vectors from step one, concatenating them with the word vectors from step two, and feeding the result into a GPT (Generative Pre-trained Transformer) model for autoregressive training, thereby establishing and modeling a direct relation between language and composition template images, so that inputting a sentence of language description generates a corresponding composition template image;
Step four: generating a live-action picture from the composition template image generated in step three based on a style migration GAN network;
Step five: generating a subsequent series of image frames from the live-action picture generated in step four based on an image dynamic GAN network, and generating a video.
Further, step one specifically includes: the composition template image x is sent into a trained vector quantization variational autoencoder (VQ-VAE) and converted into a sequence in a discrete latent space, yielding h × w image coding vectors e of dimension d and the corresponding image tokens.
Further, the vector quantization variational autoencoder VQ-VAE is mainly divided into three modules: an Encoder, a codebook and a Decoder. The Encoder module encodes the input composition template image and the Decoder decodes it; the two modules share the codebook module. Specifically, the Encoder encodes the image x into z_e(x); each vector in z_e(x) is quantized according to its Euclidean distance to the codebook vectors, i.e. the nearest vector is found in the codebook by nearest-neighbour lookup, so that z_e(x) is converted into the closest discrete code e. The output is thus h × w image coding vectors e of dimension d and the corresponding tokens, and e is sent into the Decoder module, which decodes it to generate the composition template image x̂.
Further, the vector quantization variational autoencoder VQ-VAE is trained as follows: semantic segmentation images are adopted as the training data set, and the VQ-VAE training loss is reduced by back-propagation with the Adam stochastic gradient algorithm to obtain the optimal parameters of the model.
Further, step two specifically includes: the input language description is coded by a pre-trained neural network language model to generate k word vectors t and k word tokens.
further, the third step specifically includes the following steps:
(3.1) mixing
Figure 988534DEST_PATH_IMAGE002
Flattening the image code vectors to generate g image code vectors, wherein
Figure 412431DEST_PATH_IMAGE009
G is a fixed value, and position embedding is added to g image coding vectors;
(3.2) splicing the k word vectors and the g image coding vectors to obtain the text
Figure 278756DEST_PATH_IMAGE007
And image
Figure 335573DEST_PATH_IMAGE003
Also performing a splicing operation to generate
Figure 696279DEST_PATH_IMAGE010
An embedded representation vector
Figure 459835DEST_PATH_IMAGE011
(3.4) dividing the f vectors
Figure 762641DEST_PATH_IMAGE011
And sending the text word vector to a GPT model for autoregressive training, and establishing a relation between the text word vector and the image coding vector, wherein the training of the GPT model specifically comprises the following steps: the training set is f code vectors
Figure 244438DEST_PATH_IMAGE011
And the corresponding image and word token thereof, sending the f encoding vectors into a GPT model, predicting the next token to appear by the GPT model according to the previously input vector, and reducing the softmax classification loss function through a random gradient descent back propagation algorithm;
during prediction, word vectors input by languages are input into a GPT model, composition image compression coding tokens are predicted step by step, the generated tokens are sent into codebook to find corresponding vectors, and the formed compression coding vector diagram is sent into a Decoder module to generate a composition template image
Figure 907369DEST_PATH_IMAGE012
Further, step four specifically includes: the composition template image x' generated in step three and random noise are sent into the style migration GAN network. The style migration GAN network comprises a generator and a discriminator: the generator uses the noise injected into the network to control the stochastic attributes of the live-action picture, while the discriminator distinguishes whether a live-action picture is real or generated by the generator. During network training the generator and the discriminator are trained jointly; during prediction, the composition template image x' generated in step three is input into the generator, which outputs the generated live-action picture y.
Further, step five specifically includes the following steps:
(5.1) training an image dynamic GAN network, which is mainly divided into two modules: the generator controls the video characteristics through latent vectors and, combined with the noise injected into the network, controls the stochastic attributes of the video; the discriminator distinguishes whether a video is real or generated by the generator. The latent vectors are divided into two groups: latent vectors encoding colour and image layout, and latent vectors encoding the overall brightness of the video. The noise injected into the network is likewise divided into two groups: noise encoding the details and shapes of static entities in the video, and noise encoding the details and shapes of dynamic entities in the video;
(5.2) training an Encoder model, which takes a live-action picture as input and maps it to the latent space of the GAN network; the live-action picture y output in step four is sent into the trained Encoder model to obtain the latent code corresponding to the live-action picture;
(5.3) marking the static and dynamic regions in the image with the composition template map x' corresponding to the live-action picture y, optimizing the latent code output by the Encoder model according to the reconstruction error, and generating the corresponding input noise from the marked static/dynamic region information, namely the noise input n_d generated to control the dynamic region and the noise input n_s generated to control the static region;
(5.4) applying m dynamic transformations to the noise input n_d that controls dynamic-region generation and, together with the previously generated latent code and the static-region noise n_s, inputting them into the generator of the pre-trained image dynamic GAN model to output m image frames; the original live-action picture and the generated m frames form a dynamic scene video of m + 1 frames.
Further, the dynamic transformation in step (5.4) specifically means transforming the noise input n_d that controls dynamic-region generation so as to simulate the motion of the dynamic regions.
A system for automatically generating a scene video from text, comprising:
the composition logic generation module is used for generating a composition template image of a composition according to the input text description;
the image content generation module inputs the composition template image generated by the composition logic generation module and outputs the composition template image as a rendered live-action image;
and the image dynamic module is used for converting the live-action image output by the image content generation module into a continuous multi-frame image to generate a dynamic video.
The invention has the advantages that:
the method is based on the natural language pre-training model and the computer vision technology, the short video is automatically generated through the given language input without manual intervention of a third party, the short video generation efficiency is greatly improved, and meanwhile, the generated short video has authenticity and diversity, and the quality of the generated video and the novelty of video materials are ensured.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is an overall architecture diagram of the composition logic generation module;
FIG. 4 is an overall architecture diagram of an image content generation module;
FIG. 5 is an overall architecture diagram of the image dynamic module;
FIG. 6 shows the result of an example of the present invention;
FIG. 7 is a hardware structure diagram of a device with data processing capability on which the apparatus for automatically generating scene videos from text according to the present invention resides.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly understood, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, a system for automatically generating a scene video with text according to the present invention includes:
the composition logic generation module is used for generating a composition template picture of a composition according to the input text description;
the image content generation module inputs the composition template image generated by the composition logic generation module and outputs the composition template image as a rendered live-action image;
and the image dynamic module is used for converting the live-action image output by the image content generation module into a continuous multi-frame image to generate a dynamic video.
As shown in fig. 2, a method for automatically generating a scene video with text according to the present invention includes:
step 101, realizing token serialization of the composition template image through a vector quantization variation automatic encoder.
The invention adopts the vector quantization variational autoencoder VQ-VAE to compress and quantize the composition template; the VQ-VAE uses a conventional algorithm, specifically as follows:
the VQ-VAE mainly comprises three parts, an Encoder module, a codebook module and a Decoder module.
The input image x passes through the Encoder, whose output z_e(x) consists of h × w coding vectors of dimension d. The codebook contains K d-dimensional vectors, denoted C = {c_1, …, c_K}. For each vector of z_e(x), the vector in C with the shortest Euclidean distance is found and substituted, yielding the output e; e is fed into the Decoder, which outputs the image x̂. Here d = 256, h = 32, w = 32, K = 1024.
During training, the input of the network is x, of size C × H × W, where C = 1, H = 256, W = 256. The output of the network is the composition template image x̂, also of size C × H × W.
The Encoder and Decoder are built from conventional convolution layers, BatchNorm layers and residual networks. The Encoder contains n resnet_blocks and down-samples m times spatially; the Decoder applies convolution and up-sampling operations so that the input x and the output x̂ have the same size.
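For illustration, a minimal PyTorch sketch of the nearest-neighbour quantization described above is given below; tensor names and shapes are assumptions chosen to match d = 256, h = w = 32, K = 1024 stated in this embodiment, not code taken from the patent:

```python
import torch

def vector_quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Quantize encoder output z_e of shape (B, d, h, w) against a
    codebook of shape (K, d) by nearest-neighbour lookup."""
    B, d, h, w = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, d)          # (B*h*w, d)
    # squared Euclidean distance to every codebook vector
    dist = (flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ codebook.t()
            + codebook.pow(2).sum(1))                       # (B*h*w, K)
    tokens = dist.argmin(dim=1)                             # image tokens
    e = codebook[tokens].view(B, h, w, d).permute(0, 3, 1, 2)
    # straight-through estimator so gradients flow back to the Encoder
    e = z_e + (e - z_e).detach()
    return e, tokens.view(B, h, w)
```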
The loss function of the VQ-VAE is as follows:

L = log p(x | z_q(x)) + || sg[z_e(x)] - e ||^2 + β || z_e(x) - sg[e] ||^2

where sg denotes the stop-gradient operator (stopping back-propagation) and β is a hyper-parameter. In the present invention, n = 4, m = 4, β = 0.25.
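As an illustrative sketch of the three loss terms, assuming a mean-squared-error reconstruction term in place of the log-likelihood and the β = 0.25 given above (x_hat, z_e and e are the reconstruction, Encoder output and quantized code from the sketch above):

```python
import torch.nn.functional as F

def vqvae_loss(x, x_hat, z_e, e, beta=0.25):
    """Reconstruction + codebook + commitment terms of the VQ-VAE loss."""
    recon = F.mse_loss(x_hat, x)                 # stands in for -log p(x | z_q(x))
    codebook_loss = F.mse_loss(e, z_e.detach())  # || sg[z_e(x)] - e ||^2
    commitment = F.mse_loss(z_e, e.detach())     # || z_e(x) - sg[e] ||^2
    return recon + codebook_loss + beta * commitment
```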
training: with the Adam optimizer, the initial learning rate was 0.001.
Prediction: for a VQ-VAE trained as above, any composition template image x is input; the Encoder module outputs z_e(x), for each vector of z_e(x) the nearest codebook vector is found and substituted, and the result is h × w image coding vectors e of dimension d together with the corresponding image tokens.
Step 102, the text description is input into the pre-trained neural network language model Bert, which outputs text tokens.
As shown in FIG. 3, the input language description is fed into Bert, generating k word vectors t and k word tokens. If fewer than k word vectors are output, a pad operation is applied so that the number of word vectors is exactly k, where k = 256. The word-vector matrix t has size k × d, where k = 256 and d = 256.
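One possible realisation of this text-encoding step, using the HuggingFace transformers library, is sketched below; the checkpoint name is an assumption, and bert-base models output 768-dimensional vectors, so an additional projection would be needed to reach the d = 256 used in this embodiment:

```python
import torch
from transformers import BertTokenizer, BertModel

K = 256  # number of word vectors / word tokens, as stated above

def encode_text(description: str, ckpt: str = "bert-base-chinese"):
    """Encode a language description into k word vectors and k word tokens."""
    tokenizer = BertTokenizer.from_pretrained(ckpt)
    bert = BertModel.from_pretrained(ckpt)
    enc = tokenizer(description, padding="max_length", max_length=K,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    word_vectors = out.last_hidden_state   # (1, K, 768) word vectors
    word_tokens = enc["input_ids"]         # (1, K) word tokens (padded to K)
    return word_vectors, word_tokens
```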
Step 103, text and image are jointly modeled with the GPT model.
The image coding vectors are flattened and then fed, together with the text coding vectors, into the GPT model for autoregressive training.
In order to establish the mapping between text information and image information, the invention feeds the output of the pre-trained language model Bert, together with the codes quantized by the VQ-VAE, into a GPT model for relational modeling. The basic structure of the GPT model is as follows:
the GPT model consists of a self-orientation block stacked by m layers, wherein the self-orientation block consists of structural units of Multi-Head orientation, Feed Forward and Add & Norm, and m =8 in the invention.
The input of the GPT is obtained by adding the Position Embedding and the Token Embedding, giving the embedded representation u of size f × d, where f = 1280 and d = 256.
The objective function for training the GPT model is as follows:

L(U) = Σ_i log P(u_i | u_{i-c}, …, u_{i-1}; Θ)

where u_{i-c}, …, u_{i-1} are the word vectors in the left context window of size c.
The calculation process of the GPT model is as follows:

h_0 = U W_e + W_p
h_l = transformer_block(h_{l-1}), l = 1, …, n
P(u) = softmax(h_n W_e^T)

where U denotes the input context vectors; when the prediction probability of each token is calculated, only the information within the left window is considered. n denotes the number of self-attention block layers, W_e denotes the word-vector (token embedding) matrix, and W_p denotes the position embedding.
Training: during training, an Adam optimizer is adopted, and the initial learning rate is 0.0003.
Prediction: after text and image have been jointly modeled, a passage of text description is input into the pre-trained neural network language model Bert, the output of Bert is fed into the GPT, and the tokens of the image are generated step by step. The corresponding vectors are looked up from the image tokens in the codebook of the VQ-VAE trained in step 101, and the resulting code e is fed into the Decoder module to output the reconstructed composition template image x'.
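A sketch of this step-by-step prediction loop, assuming the model sketched above and h × w = 1024 image tokens; the multinomial sampling strategy is an assumption, since the patent only states that tokens are predicted one by one:

```python
import torch

@torch.no_grad()
def generate_image_tokens(gpt, text_tokens, n_image_tokens=1024):
    """Autoregressively predict the h*w = 1024 image tokens from the text tokens."""
    seq = text_tokens                                  # (1, k) conditioning prefix
    for _ in range(n_image_tokens):
        logits = gpt(seq)[:, -1]                       # distribution over the next token
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        seq = torch.cat([seq, next_tok], dim=1)
    # codebook lookup + the VQ-VAE Decoder then reconstruct the template image x'
    return seq[:, text_tokens.shape[1]:]               # image tokens only
```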
Step 104, as shown in fig. 4, inputting the composition template image into the style migration GAN network to synthesize the live-action image.
The generation of the live-action picture from the composition template adopts the classical pixel2pixel style migration GAN; pixel2pixel is a cGAN and can complete the conversion from composition template image to live-action image. Pixel2pixel is largely divided into two blocks, a generator and a discriminator. The pixel2pixel generator adopts a U-Net structure: the symmetric skip connections of the U-Net directly copy low-level information onto high-level feature maps, concatenating the i-th layer with the (n-i)-th layer, where n is the total number of network layers. The discriminator combines an L1 loss with a patchGAN discriminator: the L1 loss learns low-frequency information as an aid, and the patchGAN judges at a time whether a single N × N image block is real, averaging the results of all blocks as the judgment for the whole image. In the invention, n = 9 and N = 70.
During training, the generator and the discriminator are trained alternately.
The objective function optimized by pixel2pixel is as follows:

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G)

where

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 - D(x, G(x, z)))],
L_L1(G) = E_{x,y,z}[ || y - G(x, z) ||_1 ],

z is noise, and the network learns the mapping from the template image x to the live-action image y. G denotes the generator and D the discriminator. During training, G and D are trained alternately.
Training: both the generator and the discriminator use the Adam optimizer, with an initial learning rate of 0.0001.
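A minimal sketch of one alternating pixel2pixel training step follows; the conditional discriminator signature D(x, y) and the L1 weight lam are illustrative assumptions, not values or interfaces taken from the patent:

```python
import torch
import torch.nn.functional as F

def pixel2pixel_step(G, D, x, y, g_opt, d_opt, lam=100.0):
    """One alternating step: x = composition template image, y = real live-action image."""
    # --- discriminator step: real pairs vs. generated pairs ---
    d_opt.zero_grad()
    fake = G(x).detach()
    d_real = D(x, y)                  # patchGAN output: one logit per N x N patch
    d_fake = D(x, fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_loss.backward(); d_opt.step()
    # --- generator step: fool D + L1 reconstruction term ---
    g_opt.zero_grad()
    fake = G(x)
    g_adv = F.binary_cross_entropy_with_logits(D(x, fake), torch.ones_like(d_fake))
    g_loss = g_adv + lam * F.l1_loss(fake, y)
    g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```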
During prediction, the composition template image x' reconstructed in step 103 is input into the generator to generate the live-action picture y.
Step 105, as shown in fig. 5, the live-action image generates a series of video frames through the image dynamic GAN network.
As shown in fig. 6, according to the input text description, continuous multiframe images are generated by the method of the invention, and dynamic videos are generated.
This embodiment employs a deep-landscape network.
The deep-landscape generator controls the image characteristics through latent vectors and, combined with the noise injected into the network, controls the stochastic attributes of the image. The task of the discriminator is to distinguish whether a picture is real or generated by the generator.
A series of images is generated with the trained deep-landscape network, and the latent code corresponding to each image is stored at the same time. When training the Encoder, the Encoder can be a ResNet network: a picture generated by deep-landscape is input, and the latent code vector of matching size is output.
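A sketch of such a ResNet-based Encoder is given below; the latent dimension and the choice of resnet50 are assumptions, since the patent only states that the Encoder can be a ResNet that maps a generated picture to its latent code:

```python
import torch.nn as nn
from torchvision.models import resnet50

class LatentEncoder(nn.Module):
    """Maps a generated picture to its latent code (latent_dim is an assumption)."""
    def __init__(self, latent_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, latent_dim)
        self.net = backbone

    def forward(self, img):          # img: (B, 3, H, W) generated picture
        return self.net(img)         # predicted latent code

# Training pairs come from the generator itself: (generated image, stored latent code),
# so the Encoder can be fit with a simple regression loss such as MSE.
```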
The live-action picture y generated in step 104 has its static and dynamic areas marked, for example: the blue sky and the sea belong to the static region, and the remaining part belongs to the dynamic region. The latent code and the input noise are then fine-tuned jointly with the picture's composition template map: the latent code output by the Encoder model is optimized according to the reconstruction error, and the corresponding input noise is generated from the marked static/dynamic region information, namely the noise input n_d generated to control the dynamic region and the noise input n_s generated to control the static region.
Training deep-landscape: the generator and the discriminator both adopt an Adam optimizer during training, and the initial learning rate is 0.0001.
Prediction: m dynamic transformations are applied to the noise input n_d that controls dynamic-region generation; together with the previously generated latent code and the static-region noise n_s, they are input into the generator of the trained deep-landscape network to generate m new images. The live-action picture y and the m subsequently generated live-action pictures form the final generated video; m = 200 in the invention.
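An illustrative sketch of this frame-generation loop follows; the generator call signature and the specific noise transformation (a spatial roll plus a small perturbation) are assumptions, since the patent only states that the dynamic-region noise n_d is transformed to simulate motion:

```python
import torch

@torch.no_grad()
def animate(generator, latent_code, static_noise, dynamic_noise, m=200, shift=0.05):
    """Generate m frames by transforming only the dynamic-region noise n_d,
    keeping the latent code and the static-region noise n_s fixed."""
    frames = []
    n_d = dynamic_noise
    for _ in range(m):
        # one simple "dynamic transformation": drift the dynamic noise field
        n_d = torch.roll(n_d, shifts=1, dims=-1) + shift * torch.randn_like(n_d)
        frame = generator(latent_code, noise=[static_noise, n_d])
        frames.append(frame)
    return torch.stack(frames, dim=1)   # (B, m, C, H, W) video tensor
```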
The dynamic background information in the scene image is transformed to generate a dynamic scene video. deep-landscape is mainly divided into two parts, a generator and a discriminator. The generator network comprises two parts: the first part is the Mapping network, which maps a latent variable z to an intermediate latent variable w; the second part is the Synthesis network, in which affine transformations of w control the style of the generated image, while random noise injected into the network enriches the details of the generated image.
The discriminator is a classification network consisting of n convolution modules, the convolution modules carry out convolution and downsampling operations, and n =9 in the invention.
The style of the generated image is controlled by adaptive instance normalization (AdaIN); the specific formula is:

AdaIN(x_i, y) = y_{s,i} (x_i - μ(x_i)) / σ(x_i) + y_{b,i}

where μ(x_i) and σ(x_i) respectively denote the mean and standard deviation of the input feature x_i, and y_{s,i} and y_{b,i} are the scaling and bias values generated by the affine transformation of w; through y_{s,i} and y_{b,i} the style is applied to the i-th spatial feature map.
The latent vectors are divided into a latent vector encoding the colours and overall layout of the scene image and a latent vector encoding the illumination-brightness changes in the generated video; the input noise of the generated video is divided into noise encoding the details and shapes of static objects and noise encoding the details and shapes of dynamic objects.
In summary, the invention builds on basic network models such as VQ-VAE, GPT, pixel2pixel and StyleGAN. VQ-VAE effectively exploits the latent space and can model important features that usually span multiple dimensions of the data space; its discretization property is used to compress and quantize the composition template image into serialized tokens. A GPT autoregressive model is adopted for the generative task: with the pre-trained Bert language model and the serialized tokens of the composition template image, the mapping from language to image is completed, and during prediction a series of tokens corresponding to the composition template image is generated from the language input; the pixel2pixel network then completes the image-to-image translation. On this basis, the invention provides a new way of acquiring material for video creation, reduces the cost of video creation and improves its efficiency.
Corresponding to the embodiment of the method for automatically generating the scene video by the characters, the invention also provides an embodiment of a device for automatically generating the scene video by the characters.
Referring to fig. 7, an apparatus for automatically generating a scene video with text according to an embodiment of the present invention includes one or more processors, and is configured to implement the method for automatically generating a scene video with text in the foregoing embodiment.
The embodiment of the apparatus for automatically generating scene video by text of the invention can be applied to any equipment with data processing capability, and the equipment with data processing capability can be equipment or apparatus such as a computer. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. From a hardware aspect, as shown in fig. 7, the present invention is a hardware structure diagram of any device with data processing capability where the apparatus for automatically generating scene videos by text is located, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 7, in an embodiment, any device with data processing capability where the apparatus is located may also include other hardware according to the actual function of the any device with data processing capability, which is not described again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides a computer readable storage medium, which stores a program, and when the program is executed by a processor, the method for automatically generating the scene video by the characters in the embodiment is realized.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the invention as described in the foregoing examples, or that certain features may be substituted in the same way. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.

Claims (10)

1. A method for automatically generating scene video by characters is characterized by comprising the following steps:
Step one: compressing and quantizing a composition template image through a vector quantization variational autoencoder (VQ-VAE) to generate image coding vectors and image tokens;
Step two: coding the input language description through a pre-trained neural network language model to obtain word vectors and word tokens;
Step three: flattening the image coding vectors from step one, concatenating them with the word vectors from step two, and feeding the result into a GPT (Generative Pre-trained Transformer) model for autoregressive training, thereby establishing and modeling a direct relation between language and composition template images, so that inputting a sentence of language description generates a corresponding composition template image;
Step four: generating a live-action picture from the composition template image generated in step three based on a style migration GAN network;
Step five: generating a subsequent series of image frames from the live-action picture generated in step four based on an image dynamic GAN network, and generating a video.
2. The method for automatically generating scene videos by characters as claimed in claim 1, wherein step one specifically comprises: the composition template image x is sent into a trained vector quantization variational autoencoder (VQ-VAE) and converted into a sequence in a discrete latent space, yielding h × w image coding vectors e of dimension d and the corresponding image tokens.
3. The method of claim 2, wherein the vector quantization variational autoencoder VQ-VAE is mainly divided into three modules, an Encoder, a codebook and a Decoder; the Encoder module encodes the input composition template image, the Decoder decodes it, and the two modules share the codebook module. Specifically, the Encoder encodes the image x into z_e(x); each vector in z_e(x) is quantized according to its Euclidean distance to the codebook vectors, i.e. the nearest vector is found in the codebook by nearest-neighbour lookup, so that z_e(x) is converted into the closest discrete code e; the output is thus h × w image coding vectors e of dimension d and the corresponding tokens, and e is sent into the Decoder module, which decodes it to generate the composition template image x̂.
4. The method of claim 3, wherein the vector quantization variational autoencoder VQ-VAE is trained as follows: semantic segmentation images are adopted as the training data set, and the VQ-VAE training loss is reduced by back-propagation with the Adam stochastic gradient algorithm to obtain the optimal parameters of the model.
5. The method for automatically generating scene videos by characters according to claim 3, wherein step two specifically comprises: coding the input language description through a pre-trained neural network language model to generate k word vectors t and k word tokens.
6. The method for automatically generating scene videos by characters according to claim 5, wherein step three specifically comprises the following steps:
(3.1) flattening the h × w image coding vectors to generate g image coding vectors, where g = h × w is a fixed value, and adding position embeddings to the g image coding vectors;
(3.2) concatenating the k word vectors and the g image coding vectors, i.e. splicing the text vectors t and the image coding vectors e, to generate f = k + g embedded representation vectors u;
(3.4) sending the f vectors u into a GPT model for autoregressive training to establish the relation between the text word vectors and the image coding vectors, wherein the GPT model is trained as follows: the training set comprises the f coding vectors and their corresponding image and word tokens; the f coding vectors are fed into the GPT model, the GPT model predicts the next token from the previously input vectors, and the softmax classification loss is reduced by a stochastic-gradient-descent back-propagation algorithm;
during prediction, the word vectors of the language input are fed into the GPT model, the compression-coding tokens of the composition image are predicted step by step, the generated tokens are looked up in the codebook to obtain the corresponding vectors, and the resulting compression-coding vector map is sent into the Decoder module to generate a composition template image x'.
7. The method for automatically generating scene videos by characters according to claim 6, wherein step four specifically comprises: sending the composition template image x' generated in step three and random noise into the style migration GAN network, wherein the style migration GAN network comprises a generator and a discriminator, the generator uses the noise injected into the network to control the stochastic attributes of the live-action picture, and the discriminator distinguishes whether a live-action picture is real or generated by the generator; during network training the generator and the discriminator are trained jointly, and during prediction the composition template image x' generated in step three is input into the generator, which outputs the generated live-action picture y.
8. The method for automatically generating scene videos by characters as claimed in claim 7, wherein step five specifically comprises the following steps:
(5.1) training an image dynamic GAN network, the network being mainly divided into two modules: the generator controls the video characteristics through latent vectors and, combined with the noise injected into the network, controls the stochastic attributes of the video; the discriminator distinguishes whether a video is real or generated by the generator; the latent vectors are divided into two groups: latent vectors encoding colour and image layout, and latent vectors encoding the overall brightness of the video; the noise injected into the network is also divided into two groups: noise encoding the details and shapes of static entities in the video, and noise encoding the details and shapes of dynamic entities in the video;
(5.2) training an Encoder model, which takes a live-action picture as input and maps it to the latent space of the GAN network; the live-action picture y output in step four is sent into the trained Encoder model to obtain the latent code corresponding to the live-action picture;
(5.3) marking the static and dynamic regions in the image with the composition template map x' corresponding to the live-action picture y, optimizing the latent code output by the Encoder model according to the reconstruction error, and generating the corresponding input noise from the marked static/dynamic region information, namely the noise input n_d generated to control the dynamic region and the noise input n_s generated to control the static region;
(5.4) applying m dynamic transformations to the noise input n_d that controls dynamic-region generation and, together with the previously generated latent code and the static-region noise n_s, inputting them into the generator of the pre-trained image dynamic GAN model to output m image frames; the original live-action picture and the generated m frames form a dynamic scene video of m + 1 frames.
9. The method of claim 8, wherein the dynamic transformation in step (5.4) specifically comprises: transforming the noise input n_d that controls dynamic-region generation so as to simulate the motion of the dynamic regions.
10. A system for automatically generating a scene video from text as claimed in any one of claims 1 to 9, comprising:
the composition logic generation module is used for generating a composition template image of a composition according to the input text description;
the image content generation module inputs the composition template image generated by the composition logic generation module and outputs the composition template image as a rendered live-action image;
and the image dynamic module is used for converting the live-action image output by the image content generation module into a continuous multi-frame image to generate a dynamic video.
CN202111538104.4A 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters Active CN113934890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111538104.4A CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111538104.4A CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Publications (2)

Publication Number Publication Date
CN113934890A true CN113934890A (en) 2022-01-14
CN113934890B CN113934890B (en) 2022-04-15

Family

ID=79289156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111538104.4A Active CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Country Status (1)

Country Link
CN (1) CN113934890B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610935A (en) * 2022-05-12 2022-06-10 之江实验室 Method and system for synthesizing semantic image of text control image style
CN115249062A (en) * 2022-09-22 2022-10-28 武汉大学 Network model, method and device for generating video by text
CN115511969A (en) * 2022-11-22 2022-12-23 阿里巴巴(中国)有限公司 Image processing and data rendering method, apparatus and medium
CN115880158A (en) * 2023-01-30 2023-03-31 西安邮电大学 Blind image super-resolution reconstruction method and system based on variational self-coding
WO2023154192A1 (en) * 2022-02-14 2023-08-17 Snap Inc. Video synthesis via multimodal conditioning
CN117496025A (en) * 2023-10-19 2024-02-02 四川大学 Multi-mode scene generation method based on relation and style perception

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 variational self-encoder and video generation method combining generation countermeasure network
US20200019863A1 (en) * 2018-07-12 2020-01-16 International Business Machines Corporation Generative Adversarial Network Based Modeling of Text for Natural Language Processing
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200019863A1 (en) * 2018-07-12 2020-01-16 International Business Machines Corporation Generative Adversarial Network Based Modeling of Text for Natural Language Processing
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 variational self-encoder and video generation method combining generation countermeasure network
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WILSON YAN et al.: "VideoGPT: Video Generation using VQ-VAE and Transformers", arXiv *
ZHUANG Xingwang et al.: "Text-to-Image Generation Model with Multi-dimensional Attention and Semantic Regeneration", Computer Technology and Development *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023154192A1 (en) * 2022-02-14 2023-08-17 Snap Inc. Video synthesis via multimodal conditioning
CN114610935A (en) * 2022-05-12 2022-06-10 之江实验室 Method and system for synthesizing semantic image of text control image style
CN115249062A (en) * 2022-09-22 2022-10-28 武汉大学 Network model, method and device for generating video by text
CN115249062B (en) * 2022-09-22 2023-02-03 武汉大学 Network model, method and device for generating video by text
CN115511969A (en) * 2022-11-22 2022-12-23 阿里巴巴(中国)有限公司 Image processing and data rendering method, apparatus and medium
CN115880158A (en) * 2023-01-30 2023-03-31 西安邮电大学 Blind image super-resolution reconstruction method and system based on variational self-coding
CN115880158B (en) * 2023-01-30 2023-10-27 西安邮电大学 Blind image super-resolution reconstruction method and system based on variation self-coding
CN117496025A (en) * 2023-10-19 2024-02-02 四川大学 Multi-mode scene generation method based on relation and style perception
CN117496025B (en) * 2023-10-19 2024-06-04 四川大学 Multi-mode scene generation method based on relation and style perception

Also Published As

Publication number Publication date
CN113934890B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN113934890B (en) Method and system for automatically generating scene video by characters
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
Wu et al. Nüwa: Visual synthesis pre-training for neural visual world creation
WO2024051445A1 (en) Image generation method and related device
CN113901894A (en) Video generation method, device, server and storage medium
CN114610935B (en) Method and system for synthesizing semantic image of text control image style
CN109996073B (en) Image compression method, system, readable storage medium and computer equipment
CN114390218B (en) Video generation method, device, computer equipment and storage medium
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN116205820A (en) Image enhancement method, target identification method, device and medium
CN113781324A (en) Old photo repairing method
CN115880762A (en) Scalable human face image coding method and system for human-computer mixed vision
WO2023068953A1 (en) Attention-based method for deep point cloud compression
US20230319223A1 (en) Method and system for deep learning based face swapping with multiple encoders
Lee et al. A Brief Survey of text driven image generation and maniulation
CN117499711A (en) Training method, device, equipment and storage medium of video generation model
CN117115713A (en) Dynamic image generation method, device, equipment and medium thereof
CN114283181B (en) Dynamic texture migration method and system based on sample
CN113780209B (en) Attention mechanism-based human face attribute editing method
US20230316587A1 (en) Method and system for latent-space facial feature editing in deep learning based face swapping
CN113781376B (en) High-definition face attribute editing method based on divide-and-congress
CN113411615B (en) Virtual reality-oriented latitude self-adaptive panoramic image coding method
Teng et al. Blind face restoration via multi-prior collaboration and adaptive feature fusion
Wang et al. Facial Landmarks and Generative Priors Guided Blind Face Restoration
CN118551074B (en) Cross-modal music generation method and device for video soundtrack

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant