CN113934890A - Method and system for automatically generating scene video from text - Google Patents

Method and system for automatically generating scene video from text

Info

Publication number
CN113934890A
Authority
CN
China
Prior art keywords
image
vector
input
video
generated
Prior art date
Legal status
Granted
Application number
CN202111538104.4A
Other languages
Chinese (zh)
Other versions
CN113934890B (en)
Inventor
马诗洁
王俊彦
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202111538104.4A priority Critical patent/CN113934890B/en
Publication of CN113934890A publication Critical patent/CN113934890A/en
Application granted granted Critical
Publication of CN113934890B publication Critical patent/CN113934890B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7844 - Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 - Quantisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract


The invention relates to the field of video production, and in particular to a method and system for automatically generating scene videos from text. The system includes: a composition logic generation module for generating a composition template image according to an input text description; an image content generation module, whose input is the composition template image generated by the composition logic generation module and whose output is the rendered live-action image; and an image animation module, which transforms the live-action image output by the image content generation module into consecutive multi-frame images to generate a dynamic video. Based on a natural-language pre-trained model and computer-vision technology, the invention automatically generates a short video from a given language input without third-party manual intervention, which greatly improves the efficiency of short-video production; at the same time, the generated short videos are realistic and diverse, which guarantees the quality of the generated video and the novelty of the video material.


Description

Method and system for automatically generating scene video from text
Technical Field
The invention relates to the field of video production, and in particular to a method and a system for automatically generating scene videos from text.
Background
With the development of the internet, short videos have emerged alongside it. As a new way of recording content and presenting media, short videos have rapidly become part of people's lives.
In the field of video production, the traditional production process is complicated: specific materials need to be searched for or shot for each video. At the same time, to avoid repetitive video material and keep videos novel, designers must continuously create and design new content.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides the following technical solution:
a method for automatically generating scene videos from text comprises the following steps:
step one: compressing and quantizing a composition template image through a vector-quantized variational autoencoder (VQ-VAE) to generate image encoding vectors and image tokens;
step two: encoding the input language description through a pre-trained neural network language model to obtain word vectors and tokens;
step three: flattening the image encoding vectors from step one, concatenating them with the word vectors from step two, and inputting the result into a GPT (Generative Pre-trained Transformer) model for autoregressive training, thereby establishing and modeling the direct relation between language and composition template images; after the relation is modeled, a sentence of language description is input and the corresponding composition template image is generated;
step four: generating a live-action image from the composition template image generated in step three based on the style-transfer GAN network;
step five: generating a subsequent series of image frames from the live-action image generated in step four based on the image animation GAN network, thereby generating a video.
Further, step one specifically includes: the composition template image is sent into a trained vector-quantized variational autoencoder VQ-VAE and converted into a sequence in a discrete latent space; after discrete encoding, h × w image encoding vectors of dimension d and the corresponding image encoding tokens are obtained.
Further, the vector-quantized variational autoencoder VQ-VAE is mainly divided into three modules, an Encoder, a codebook and a Decoder. The Encoder module encodes the input composition template image and the Decoder decodes it; the two share the codebook module. Specifically, the Encoder encodes the image into a grid of continuous vectors, each of which is quantized according to its Euclidean distance to the codebook vectors, i.e. the nearest vector is found in the codebook by nearest-neighbor table lookup and the encoder output is converted into the nearest discrete code e. The output is therefore h × w image encoding vectors of dimension d and the corresponding tokens; e is then sent into the Decoder module, which decodes it to generate the reconstructed composition template image.
Further, the vector-quantized variational autoencoder VQ-VAE is trained as follows: semantic segmentation images are used as the training data set, and the VQ-VAE training loss is reduced by the Adam stochastic-gradient back-propagation algorithm to obtain the optimal model parameters.
Further, step two specifically includes: the input language description is encoded by a pre-trained neural network language model to generate k word vectors and k word tokens.
Further, step three specifically includes the following steps:
(3.1) flattening the h × w image encoding vectors to generate g image encoding vectors, where g = h × w is a fixed value, and adding position embeddings to the g image encoding vectors;
(3.2) concatenating the k word vectors and the g image encoding vectors, i.e. splicing the text word vectors and the image encoding vectors together, to generate f embedded representation vectors;
(3.4) sending the f vectors into the GPT model for autoregressive training to establish the relation between the text word vectors and the image encoding vectors, where the GPT model is trained as follows: the training set consists of the f encoding vectors and their corresponding image and word tokens; the f encoding vectors are fed into the GPT model, which predicts the next token from the previously input vectors, and the softmax classification loss is reduced by the stochastic-gradient-descent back-propagation algorithm.
At prediction time, the word vectors of the language input are fed into the GPT model, the compression-encoding tokens of the composition image are predicted step by step, the generated tokens are looked up in the codebook to find the corresponding vectors, and the resulting compressed-encoding vector map is sent into the Decoder module to generate the composition template image.
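To make the concatenation in step (3.2) concrete, the following is a minimal PyTorch-style sketch of how the k word vectors and the g flattened image encoding vectors might be assembled into the f = k + g inputs of the GPT model. The dimensions follow those given later in the embodiment (k = 256, g = 32 × 32, d = 256); the variable names are illustrative assumptions, not the patent's implementation.

```python
import torch

# Illustrative sketch of the step-three input construction (names are assumptions).
k, g, d = 256, 1024, 256                 # word vectors, flattened image vectors, embedding dim
word_vecs = torch.randn(1, k, d)         # from the pre-trained language model (step two)
image_vecs = torch.randn(1, 32, 32, d)   # h x w grid of VQ-VAE encoding vectors (step one)

image_seq = image_vecs.flatten(1, 2)               # (1, g, d): flatten the spatial grid
tokens = torch.cat([word_vecs, image_seq], dim=1)  # (1, f, d) with f = k + g

pos_emb = torch.nn.Embedding(k + g, d)             # position embedding added to every vector
positions = torch.arange(k + g).unsqueeze(0)
gpt_input = tokens + pos_emb(positions)            # embeddings fed to the GPT model
print(gpt_input.shape)                             # torch.Size([1, 1280, 256]), i.e. f = 1280
```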
Further, step four specifically includes: the composition template image generated in step three and random noise are sent into the style-transfer GAN network, which comprises a generator and a discriminator; the generator uses the noise injected into the network to control the random attributes of the live-action image, and the discriminator distinguishes whether a live-action image is real or generated by the generator. During network training the generator and the discriminator are trained jointly; at prediction time the composition template image generated in step three is input to the generator, which outputs the generated live-action image.
Further, step five specifically includes the following steps:
(5.1) training the image animation GAN network. The network is mainly divided into two modules, a generator and a discriminator: the generator controls video features through latent vectors and, combined with the noise injected into the network, controls the random attributes of the video; the discriminator distinguishes whether a video is real or generated by the generator. The latent vectors are divided into two groups: latent vectors encoding color and image layout, and latent vectors encoding the overall brightness of the video. The noise injected into the network is also divided into two groups: noise encoding the details and shapes of static entities in the video, and noise encoding the details and shapes of dynamic entities in the video;
(5.2) training an Encoder model that takes a live-action image as input and maps it into the latent space of the GAN network; the live-action image output in step four is fed into the trained Encoder model to obtain the latent code corresponding to that image;
(5.3) taking the composition template image corresponding to the live-action image, marking the static and dynamic regions in the image, optimizing the latent code output by the Encoder model according to the reconstruction error, and generating the corresponding input noise from the marked static/dynamic region information, namely the noise input that controls dynamic-region generation and the noise input that controls static-region generation;
(5.4) applying m dynamic transformations to the noise input that controls dynamic-region generation and, together with the previously generated latent code and the noise that controls static-region generation, inputting it into the generator of the pre-trained image animation GAN network model, which outputs m image frames; the original live-action image and the generated m frames form a dynamic scene video of m + 1 frames.
Further, the dynamic transformation in step (5.4) is specifically: for the noise that controls the dynamic region, the motion is simulated by a linear or non-linear transformation of that input noise.
A system for automatically generating scene video from text, comprising:
a composition logic generation module, for generating a composition template image according to the input text description;
an image content generation module, whose input is the composition template image generated by the composition logic generation module and whose output is the rendered live-action image;
and an image animation module, which converts the live-action image output by the image content generation module into consecutive multi-frame images to generate a dynamic video.
The invention has the advantages that:
the method is based on the natural language pre-training model and the computer vision technology, the short video is automatically generated through the given language input without manual intervention of a third party, the short video generation efficiency is greatly improved, and meanwhile, the generated short video has authenticity and diversity, and the quality of the generated video and the novelty of video materials are ensured.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is an overall architecture diagram of the composition logic generation module;
FIG. 4 is an overall architecture diagram of an image content generation module;
FIG. 5 is a diagram illustrating the overall architecture of the image animation module;
FIG. 6 is a graph showing the results of an example of the present invention;
FIG. 7 is a hardware configuration diagram of a device with data processing capability on which the apparatus for automatically generating scene video from text according to the present invention is deployed.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly understood, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, a system for automatically generating scene video from text according to the present invention includes:
a composition logic generation module, for generating a composition template image according to the input text description;
an image content generation module, whose input is the composition template image generated by the composition logic generation module and whose output is the rendered live-action image;
and an image animation module, which converts the live-action image output by the image content generation module into consecutive multi-frame images to generate a dynamic video. A sketch of how these three modules might be chained is given below.
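The following hypothetical sketch chains the three modules; the class and method names are illustrative assumptions introduced only to show the data flow and do not appear in the patent.

```python
# Hypothetical orchestration of the three modules described above.
def text_to_scene_video(text, composer, renderer, animator, num_frames=200):
    """composer: composition logic module, renderer: image content module,
    animator: image animation module (all assumed interfaces)."""
    template = composer.generate_template(text)       # text -> composition template image
    live_action = renderer.render(template)           # template -> rendered live-action image
    frames = animator.animate(live_action, template,  # live-action -> m subsequent frames
                              num_frames=num_frames)
    return [live_action] + frames                     # m + 1 frames form the scene video
```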
As shown in fig. 2, a method for automatically generating a scene video with text according to the present invention includes:
step 101, realizing token serialization of the composition template image through a vector quantization variation automatic encoder.
The invention adopts a vector quantization variational automatic encoder VQ-VAE to compress and quantize a composition template, and the VQ-VAE adopts a conventional algorithm, and specifically comprises the following steps:
the VQ-VAE mainly comprises three parts, an Encoder module, a codebook module and a Decoder module.
The input image is passed through the Encoder, whose output consists of h × w encoding vectors of dimension d. The codebook contains k d-dimensional vectors, denoted by C. For each encoder output vector, the vector with the shortest Euclidean distance in C is found and substituted for it, giving the quantized output e; e is sent into the Decoder, which outputs the reconstructed image. Here d = 256, h = 32, w = 32, k = 1024.
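The nearest-neighbour codebook lookup described above can be sketched in a few lines of PyTorch; the dimensions follow the text (d = 256, h = w = 32, k = 1024), and the variable names are assumptions.

```python
import torch

# Minimal sketch of the VQ-VAE nearest-neighbour quantization (names are assumptions).
d, h, w, k = 256, 32, 32, 1024
codebook = torch.randn(k, d)        # the k d-dimensional codebook vectors C
z = torch.randn(h * w, d)           # encoder output, one vector per spatial position

dists = torch.cdist(z, codebook)    # Euclidean distances to every codebook entry
indices = dists.argmin(dim=1)       # image tokens: index of the nearest codebook vector
e = codebook[indices]               # quantized encoding vectors sent to the Decoder
```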
During training, the input of the network is the composition template image of size C × H × W, where C = 1, H = 256, W = 256; the output of the network is the reconstructed composition template image of the same size.
The Encoder and the Decoder are built from conventional convolution layers, BatchNorm layers and residual networks. The Encoder contains n resnet_blocks and down-samples the input m times spatially; the Decoder performs the corresponding convolution and up-sampling operations so that the input and the reconstructed output have the same size.
The loss function of the VQ-VAE is the standard VQ-VAE objective:

L = \| x - \hat{x} \|_2^2 + \| \mathrm{sg}[z_e(x)] - e \|_2^2 + \beta \| z_e(x) - \mathrm{sg}[e] \|_2^2

where x is the input image, \hat{x} the reconstruction, z_e(x) the Encoder output, e the selected codebook vector, sg denotes stopping the back-propagation (stop-gradient), and \beta is a hyper-parameter. In the present invention, n = 4, m = 4, \beta = 0.25.
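A minimal sketch of this loss, assuming the stop-gradient sg[·] is realised with .detach() and β = 0.25 as stated above; the function signature is illustrative.

```python
import torch.nn.functional as F

# Sketch of the VQ-VAE loss above (all arguments are tensors of matching shape).
def vq_vae_loss(x, x_hat, z_e, e, beta=0.25):
    recon = F.mse_loss(x_hat, x)                  # reconstruction term ||x - x_hat||^2
    codebook_loss = F.mse_loss(e, z_e.detach())   # ||sg[z_e(x)] - e||^2
    commitment = F.mse_loss(z_e, e.detach())      # ||z_e(x) - sg[e]||^2
    return recon + codebook_loss + beta * commitment
```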
training: with the Adam optimizer, the initial learning rate was 0.001.
Prediction: for the VQ-VAE trained as above, any composition template image can be input; the Encoder module produces the continuous encoding, each vector of which is replaced by the nearest vector in the codebook, yielding h × w image encoding vectors of dimension d and the corresponding image tokens.
Step 102: the input text description is fed into the pre-trained neural network language model Bert, which outputs the text tokens.
As shown in fig. 3, the input language description is sent into Bert, generating k word vectors and k tokens. If fewer than k word vectors are output, a pad operation is applied so that the number of word vectors equals k. Here k = 256; the word-vector matrix has size t × d, where t = k = 256 and d = 256.
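A sketch of step 102 using the Hugging Face Transformers library; the checkpoint name "bert-base-chinese" and the linear projection to d = 256 (Bert itself outputs 768-dimensional vectors) are assumptions not specified in the patent.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")
proj = torch.nn.Linear(768, 256)       # assumed projection down to the patent's d = 256

text = "海边的日落，天空中有飞鸟"        # illustrative language description
inputs = tokenizer(text, padding="max_length", max_length=256,
                   truncation=True, return_tensors="pt")         # pad to k = 256 tokens
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state                    # (1, 256, 768)
word_vectors = proj(hidden)                                       # (1, k, d) word vectors
word_tokens = inputs["input_ids"]                                 # the k word tokens
```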
Step 103: jointly modeling text and image based on the GPT model.
The image encoding vectors are flattened and then sent, together with the text encoding vectors, into the GPT model for autoregressive training.
In order to establish a mapping relation between text information and image information, the invention sends the output of the language pre-trained model Bert and the codes quantized by the VQ-VAE into a GPT model for relational modeling. The basic structure of the GPT model is described as follows:
the GPT model consists of a self-orientation block stacked by m layers, wherein the self-orientation block consists of structural units of Multi-Head orientation, Feed Forward and Add & Norm, and m =8 in the invention.
The input of GPT is obtained by adding the Position Embedding and the Token Embedding, giving an embedding matrix of size f × d, where f = 1280 and d = 256.
The objective function of GPT training is the standard autoregressive language-model likelihood:

L(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

where u_{i-k}, ..., u_{i-1} are the word vectors of the left context window of size k and \Theta denotes the model parameters.
The calculation process of the GPT model follows the standard GPT formulation:

h_0 = U W_e + W_p
h_l = \mathrm{transformer\_block}(h_{l-1}), \quad l = 1, \ldots, n
P(u) = \mathrm{softmax}(h_n W_e^T)

where U denotes the first k word vectors; when the prediction probability of each token is calculated, only the vocabulary information within the left window is considered. Here n denotes the number of self-attention layers, W_e is the word-vector (token embedding) matrix, and W_p is the position embedding.
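The formulation above can be illustrated with a minimal autoregressive sketch; the layer count (8), f = 1280 and d = 256 come from the text, while the vocabulary size, the layer implementation and the untied output head are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

f, d, n_layers = 1280, 256, 8             # sequence length, embedding dim, layers (from the text)
vocab = 1024 + 20000                      # k = 1024 image tokens + an assumed text vocabulary

tok_emb = nn.Embedding(vocab, d)          # W_e
pos_emb = nn.Embedding(f, d)              # W_p
layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
head = nn.Linear(d, vocab)                # separate head used here instead of the tied W_e^T

tokens = torch.randint(0, vocab, (1, f))  # concatenated word + image tokens
h0 = tok_emb(tokens) + pos_emb(torch.arange(f))        # h_0 = U W_e + W_p
causal = torch.triu(torch.full((f, f), float("-inf")), diagonal=1)
hn = blocks(h0, mask=causal)                           # attend only to the left window
logits = head(hn)                                      # softmax classification head

# predict token t+1 from tokens <= t; this is the softmax loss reduced by back-propagation
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
```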
Training: during training, an Adam optimizer is adopted, and the initial learning rate is 0.0003.
Prediction: after the text and the image have been jointly modeled, a piece of text description is input to the pre-trained neural network language model Bert, the output of Bert is fed into the GPT, and the tokens of the image are generated step by step. The corresponding vectors are looked up from the image tokens in the codebook of the VQ-VAE trained in step 101, and the resulting vectors e are input into the Decoder module to output the reconstructed composition template image.
Step 104: as shown in fig. 4, the composition template image is input into the style-transfer GAN network to synthesize the live-action image.
Generation of the live-action image from the composition template adopts pixel2pixel, the most classical style-transfer GAN network; pixel2pixel is a cGAN and can therefore complete the conversion from composition template image to live-action image. Pixel2pixel is largely divided into two parts, a generator and a discriminator. The Pixel2Pixel generator adopts a U-Net structure: the symmetric skip connections of U-Net copy low-level information directly to the high-level feature maps, splicing the i-th layer to the (n-i)-th layer, where n is the total number of network layers. The discriminator adopts an L1 loss and a patchGAN discriminator: the L1 loss learns low-frequency information as an aid, and the patchGAN judges at each step whether a single N × N image patch is real, averaging the results of all patches as the judgment for the whole image. In the invention, n = 9 and N = 70.
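The patch-level judgment can be illustrated with a small sketch: a fully convolutional discriminator produces one real/fake score per receptive-field patch and the scores are averaged for the whole image. The layer configuration and channel counts below are illustrative and do not reproduce the patent's exact 70 × 70 receptive field.

```python
import torch
import torch.nn as nn

# Sketch of the patchGAN idea: per-patch scores, averaged into one image-level decision.
patch_d = nn.Sequential(
    nn.Conv2d(1 + 3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),   # template + image pair
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),                          # per-patch score map
)
pair = torch.cat([torch.randn(1, 1, 256, 256),        # composition template (C = 1, as in the text)
                  torch.randn(1, 3, 256, 256)], dim=1)  # candidate live-action image (assumed 3 channels)
scores = patch_d(pair)        # grid of patch-level real/fake scores
decision = scores.mean()      # average over patches = decision for the whole image
```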
During training, the generator and the discriminator are trained alternately.
The objective optimized by Pixel2Pixel is the standard pix2pix objective:

G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G)

where

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\| y - G(x, z) \|_1]

z is noise, and the network learns the mapping from the template image x to the live-action image y. G denotes the generator and D the discriminator; during training, G and D are trained alternately.
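A sketch of these two objectives, assuming a generator G(x, z) and a patch discriminator D(x, ·) that returns logits; λ = 100 is the value from the original pix2pix paper and is not stated in the patent.

```python
import torch
import torch.nn.functional as F

# Sketch of the pix2pix losses above; G and D are assumed callables with the shown signatures.
def pix2pix_g_loss(D, G, x, y, z, lam=100.0):
    fake = G(x, z)
    pred = D(x, fake)
    adv = F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))  # fool D
    return adv + lam * F.l1_loss(fake, y)                                  # + L1 term

def pix2pix_d_loss(D, G, x, y, z):
    real_pred = D(x, y)
    real = F.binary_cross_entropy_with_logits(real_pred, torch.ones_like(real_pred))
    fake_img = G(x, z).detach()
    fake_pred = D(x, fake_img)
    fake = F.binary_cross_entropy_with_logits(fake_pred, torch.zeros_like(fake_pred))
    return 0.5 * (real + fake)
```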
During training: the generator and the discriminator adopt an Adam optimizer during training, and the initial learning rate is 0.0001.
At prediction time, the composition template image reconstructed in step 103 is input to the generator, which generates the live-action image.
Step 105: as shown in fig. 5, a series of video frames is generated from the live-action image through the image animation GAN network.
As shown in fig. 6, continuous multi-frame images are generated from the input text description by the method of the invention, producing a dynamic video.
This embodiment employs a deep-landscape network.
The deep-landscape generator controls image characteristics through latent vectors and, combined with the noise injected into the network, also controls the random attributes of the image. The task of the discriminator is to distinguish whether a picture is real or generated by the generator.
The trained deep-landscape network generates a series of images while storing the latent code corresponding to each image. The Encoder can be a ResNet network; during its training it takes a picture generated by deep-landscape as input and outputs a latent code vector of the corresponding size.
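A sketch of how such an inversion Encoder could be trained, assuming a ResNet-18 regressing a 512-dimensional latent code with an MSE loss; the latent size, the generator signature and the training details are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Sketch of training an Encoder that maps generated pictures back to their latent codes.
encoder = models.resnet18(num_classes=512)     # outputs an assumed 512-dim latent code
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def train_step(generator, latent_code, noise):
    image = generator(latent_code, noise).detach()  # picture generated by deep-landscape
    pred_code = encoder(image)
    loss = F.mse_loss(pred_code, latent_code)       # regress the stored latent code
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```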
Static and dynamic regions are marked in the live-action image generated in step 104, for example: the blue sky and the sea belong to the static region, while the remaining part belongs to the dynamic region. The latent code and the input noise are then fine-tuned jointly with the composition template image of the picture: the latent code output by the Encoder model is optimized according to the reconstruction error, and the corresponding input noise is generated from the marked static/dynamic region information, namely the noise input that controls dynamic-region generation and the noise input that controls static-region generation.
Training deep-landscape: the generator and the discriminator both adopt an Adam optimizer during training, and the initial learning rate is 0.0001.
And (3) prediction: noise input to control dynamic region generation
Figure 469708DEST_PATH_IMAGE014
Performing m dynamic transformations, and combining the latent vector code generated previously and the noise generated by the control static region
Figure 644338DEST_PATH_IMAGE015
And inputting the images into a generator of the trained deep-landscapes to generate m new images.
Figure 245083DEST_PATH_IMAGE013
And m new live-action pictures generated subsequently are the final generated video, and m =200 in the invention.
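A sketch of this prediction loop; the generator signature and the roll-based noise transformation (one possible choice of the "linear transformation" of the dynamic-region noise) are illustrative assumptions.

```python
import torch

# Sketch of generating m frames by transforming only the dynamic-region noise.
def generate_frames(generator, latent_code, static_noise, dynamic_noise, m=200):
    frames = []
    noise = dynamic_noise
    for t in range(m):
        noise = torch.roll(noise, shifts=1, dims=-1)          # simple linear motion model
        frame = generator(latent_code, static_noise, noise)   # one new live-action frame
        frames.append(frame)
    return frames   # together with the original image: m + 1 frames of video
```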
The dynamic background information in the scene image is transformed to generate a dynamic scene video. Deep-landscape is mainly divided into two parts, a generator and a discriminator. The generator network consists of two parts: the first is the Mapping network, which generates an intermediate latent variable w from the latent variable z; the second is the Synthesis network, in which affine transformations of w control the style of the generated image, while the random noise input into the network enriches the details of the generated image.
The discriminator is a classification network consisting of n convolution modules that perform convolution and down-sampling; in the invention n = 9.
The style of the generated image is controlled by adaptive instance normalization (AdaIN); the specific formula is:

\mathrm{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}

where \mu(x_i) and \sigma(x_i) denote the mean and standard deviation of the input feature map x_i, and y_{s,i}, y_{b,i} are the scaling and bias values generated by the affine transformation of w; they apply the style to the i-th spatial feature map.
The latent vectors encode the colors and the overall scene layout of the scene image as well as the illumination and brightness changes in the generated video; the input noise for the generated video is divided into noise encoding the details and shapes of static objects and noise encoding the details and shapes of dynamic objects.
In summary, the invention builds on basic network models such as VQ-VAE, GPT, pixel2pixel and StyleGAN. VQ-VAE makes effective use of the latent space and can model important features that usually span multiple dimensions of the data space; its discretization property is used to compress and quantize the composition template image into serialized tokens. A GPT autoregressive model is adopted for the generative task and, together with the pre-trained Bert language model and the serialized tokens of the composition template image, completes the mapping from language to image; at prediction time a series of tokens corresponding to the composition template image is generated from the language input, and the pixel2pixel network completes the image-to-image translation. On this basis, the invention provides a new way of obtaining material for video creation, reduces the cost of video creation and improves its efficiency.
Corresponding to the embodiments of the method for automatically generating scene video from text, the invention also provides embodiments of an apparatus for automatically generating scene video from text.
Referring to fig. 7, an apparatus for automatically generating a scene video with text according to an embodiment of the present invention includes one or more processors, and is configured to implement the method for automatically generating a scene video with text in the foregoing embodiment.
The embodiment of the apparatus for automatically generating scene video by text of the invention can be applied to any equipment with data processing capability, and the equipment with data processing capability can be equipment or apparatus such as a computer. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. From a hardware aspect, as shown in fig. 7, the present invention is a hardware structure diagram of any device with data processing capability where the apparatus for automatically generating scene videos by text is located, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 7, in an embodiment, any device with data processing capability where the apparatus is located may also include other hardware according to the actual function of the any device with data processing capability, which is not described again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides a computer readable storage medium, which stores a program, and when the program is executed by a processor, the method for automatically generating the scene video by the characters in the embodiment is realized.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the foregoing has described the implementation of the present invention in detail, it will be apparent to those skilled in the art that the invention described in the foregoing examples may be modified, or some of its features may be replaced by equivalents. All changes, equivalents and modifications that come within the spirit and scope of the invention are intended to be protected.

Claims (10)

1. A method for automatically generating scene video from text, comprising the following steps:
Step 1: compressing and quantizing a composition template image with a vector-quantized variational autoencoder VQ-VAE to generate image encoding vectors and image tokens;
Step 2: encoding the input language description with a pre-trained neural network language model to obtain word vectors and tokens;
Step 3: flattening the image encoding vectors from step 1, concatenating them with the word vectors from step 2, and feeding the result into a GPT model for autoregressive training, thereby establishing and modeling the direct relation between language and composition template images; after the relation is modeled, a sentence of language description is input and the corresponding composition template image is generated;
Step 4: generating a live-action image from the composition template image generated in step 3 with a style-transfer GAN network;
Step 5: generating a subsequent series of image frames from the live-action image generated in step 4 with an image animation GAN network, thereby producing a video.

2. The method for automatically generating scene video from text according to claim 1, wherein step 1 specifically comprises: sending the composition template image into a trained vector-quantized variational autoencoder VQ-VAE and converting it into a sequence in a discrete latent space; after discrete encoding, h × w image encoding vectors of dimension d and the corresponding image encoding tokens are obtained.

3. The method for automatically generating scene video from text according to claim 2, wherein the vector-quantized variational autoencoder VQ-VAE is mainly divided into three modules, an Encoder, a codebook and a Decoder; the Encoder module encodes the input composition template image and the Decoder module decodes it, the two sharing the codebook module; specifically, the Encoder encodes the image into a grid of continuous vectors, each of which is quantized according to its Euclidean distance to the codebook vectors, i.e. the nearest vector is found in the codebook by nearest-neighbor table lookup and the encoder output is converted into the nearest discrete code e; the output is thus h × w image encoding vectors of dimension d and the corresponding tokens, and e is sent into the Decoder module, which decodes it to generate the reconstructed composition template image.

4. The method for automatically generating scene video from text according to claim 3, wherein the vector-quantized variational autoencoder VQ-VAE is trained as follows: semantic segmentation images are used as the training data set, and the VQ-VAE training loss is reduced by the Adam stochastic-gradient back-propagation algorithm to obtain the optimal model parameters.

5. The method for automatically generating scene video from text according to claim 3, wherein step 2 specifically comprises: encoding the input language description with a pre-trained neural network language model to generate k word vectors and k word tokens.

6. The method for automatically generating scene video from text according to claim 5, wherein step 3 specifically comprises the following steps:
(3.1) flattening the h × w image encoding vectors to generate g image encoding vectors, where g = h × w is a fixed value, and adding position embeddings to the g image encoding vectors;
(3.2) concatenating the k word vectors and the g image encoding vectors, i.e. splicing the text word vectors and the image encoding vectors together, to generate f embedded representation vectors;
(3.4) sending the f vectors into the GPT model for autoregressive training to establish the relation between the text word vectors and the image encoding vectors, where the GPT model is trained as follows: the training set consists of the f encoding vectors and their corresponding image and word tokens; the f encoding vectors are fed into the GPT model, which predicts the next token from the previously input vectors, and the softmax classification loss is reduced by the stochastic-gradient-descent back-propagation algorithm;
at prediction time, the word vectors of the language input are fed into the GPT model, the compression-encoding tokens of the composition image are predicted step by step, the generated tokens are looked up in the codebook to find the corresponding vectors, and the resulting compressed-encoding vector map is sent into the Decoder module to generate the composition template image.

7. The method for automatically generating scene video from text according to claim 6, wherein step 4 specifically comprises: sending the composition template image generated in step 3 together with random noise into the style-transfer GAN network, which comprises a generator and a discriminator; the generator uses the noise injected into the network to control the random attributes of the live-action image, and the discriminator distinguishes whether a live-action image is real or generated by the generator; during network training the generator and the discriminator are trained jointly, and at prediction time the composition template image generated in step 3 is input to the generator, which outputs the generated live-action image.

8. The method for automatically generating scene video from text according to claim 7, wherein step 5 specifically comprises the following steps:
(5.1) training the image animation GAN network, which is mainly divided into two modules, a generator and a discriminator: the generator controls video features through latent vectors and, combined with the noise injected into the network, controls the random attributes of the video; the discriminator distinguishes whether a video is real or generated by the generator; the latent vectors are divided into two groups, latent vectors encoding color and image layout and latent vectors encoding the overall brightness of the video; the noise injected into the network is also divided into two groups, noise encoding the details and shapes of static entities in the video and noise encoding the details and shapes of dynamic entities in the video;
(5.2) training an Encoder model that takes a live-action image as input and maps it into the latent space of the GAN network; the live-action image output in step 4 is fed into the trained Encoder model to obtain the latent code corresponding to that live-action image;
(5.3) taking the composition template image corresponding to the live-action image, marking the static and dynamic regions in the image, optimizing the latent code output by the Encoder model according to the reconstruction error, and generating the corresponding input noise from the marked static/dynamic region information, namely the noise input that controls dynamic-region generation and the noise input that controls static-region generation;
(5.4) applying m dynamic transformations to the noise input that controls dynamic-region generation and, together with the previously generated latent code and the noise that controls static-region generation, inputting it into the generator of the pre-trained image animation GAN network model, which outputs m image frames; the original live-action image and the generated m frames form a dynamic scene video of m + 1 frames.

9. The method for automatically generating scene video from text according to claim 8, wherein the dynamic transformation in step (5.4) is specifically: for the noise controlling the dynamic region, the input noise is transformed by a linear or non-linear transformation of the noise generated to control the dynamic region, so as to simulate its motion.

10. A system using the method for automatically generating scene video from text according to any one of claims 1-9, comprising:
a composition logic generation module, for generating a composition template image according to an input text description;
an image content generation module, whose input is the composition template image generated by the composition logic generation module and whose output is the rendered live-action image;
and an image animation module, which transforms the live-action image output by the image content generation module into consecutive multi-frame images to generate a dynamic video.
CN202111538104.4A 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters Active CN113934890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111538104.4A CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111538104.4A CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Publications (2)

Publication Number Publication Date
CN113934890A true CN113934890A (en) 2022-01-14
CN113934890B CN113934890B (en) 2022-04-15

Family

ID=79289156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111538104.4A Active CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Country Status (1)

Country Link
CN (1) CN113934890B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610935A (en) * 2022-05-12 2022-06-10 之江实验室 Method and system for synthesizing semantic image of text control image style
CN115249062A (en) * 2022-09-22 2022-10-28 武汉大学 Network model, method and device for generating video by text
CN115511969A (en) * 2022-11-22 2022-12-23 阿里巴巴(中国)有限公司 Image processing and data rendering method, apparatus and medium
CN115880158A (en) * 2023-01-30 2023-03-31 西安邮电大学 A blind image super-resolution reconstruction method and system based on variational self-encoding
WO2023154192A1 (en) * 2022-02-14 2023-08-17 Snap Inc. Video synthesis via multimodal conditioning
CN117496025A (en) * 2023-10-19 2024-02-02 四川大学 A multimodal scene generation method based on relationship and style awareness

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 A Video Generation Method Combining Variational Autoencoders and Generative Adversarial Networks
US20200019863A1 (en) * 2018-07-12 2020-01-16 International Business Machines Corporation Generative Adversarial Network Based Modeling of Text for Natural Language Processing
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200019863A1 (en) * 2018-07-12 2020-01-16 International Business Machines Corporation Generative Adversarial Network Based Modeling of Text for Natural Language Processing
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 A Video Generation Method Combining Variational Autoencoders and Generative Adversarial Networks
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WILSON YAN et al.: "VideoGPT: Video Generation using VQ-VAE and Transformers", arXiv *
庄兴旺 et al.: "多维度注意力和语义再生的文本生成图像模型" (Text-to-image generation model with multi-dimensional attention and semantic regeneration), 《计算机技术与发展》 (Computer Technology and Development) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023154192A1 (en) * 2022-02-14 2023-08-17 Snap Inc. Video synthesis via multimodal conditioning
CN114610935A (en) * 2022-05-12 2022-06-10 之江实验室 Method and system for synthesizing semantic image of text control image style
CN115249062A (en) * 2022-09-22 2022-10-28 武汉大学 Network model, method and device for generating video by text
CN115249062B (en) * 2022-09-22 2023-02-03 武汉大学 Network model, method and device for generating video by text
CN115511969A (en) * 2022-11-22 2022-12-23 阿里巴巴(中国)有限公司 Image processing and data rendering method, apparatus and medium
CN115880158A (en) * 2023-01-30 2023-03-31 西安邮电大学 A blind image super-resolution reconstruction method and system based on variational self-encoding
CN115880158B (en) * 2023-01-30 2023-10-27 西安邮电大学 A blind image super-resolution reconstruction method and system based on variational autoencoding
CN117496025A (en) * 2023-10-19 2024-02-02 四川大学 A multimodal scene generation method based on relationship and style awareness
CN117496025B (en) * 2023-10-19 2024-06-04 四川大学 Multi-mode scene generation method based on relation and style perception

Also Published As

Publication number Publication date
CN113934890B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN113934890B (en) Method and system for automatically generating scene video by characters
CN109379550B (en) Convolutional neural network-based video frame rate up-conversion method and system
CN109218727B (en) Video processing method and device
CN113901894A (en) A video generation method, device, server and storage medium
CN109996073B (en) An image compression method, system, readable storage medium and computer device
Li et al. Region-of-interest and channel attention-based joint optimization of image compression and computer vision
CN116600119B (en) Video encoding method, video decoding method, video encoding device, video decoding device, computer equipment and storage medium
CN112132158A (en) A visual image information embedding method based on self-encoding network
CN116091978A (en) Video description method based on advanced semantic information feature coding
EP4388451A1 (en) Attention-based method for deep point cloud compression
CN116205820A (en) Image enhancement method, target identification method, device and medium
WO2024124261A2 (en) Learned image compression by ai generated content
Zhang et al. Exploring resolution fields for scalable image compression with uncertainty guidance
CN116612416A (en) Method, device and equipment for dividing video target and readable storage medium
CN118334160A (en) Fine-granularity fashion text-guided clothing image generation method
CN115049541B (en) Reversible gray scale method, system and device based on neural network and image steganography
CN113780209B (en) Attention mechanism-based human face attribute editing method
CN114283181B (en) Dynamic texture migration method and system based on sample
CN113781376B (en) High-definition face attribute editing method based on divide-and-congress
KR20240064698A (en) Feature map encoding and decoding method and device
CN115147317A (en) A method and system for enhancing color quality of point cloud based on convolutional neural network
CN114222124B (en) Encoding and decoding method and device
WO2024060161A1 (en) Encoding method, decoding method, encoder, decoder and storage medium
Vaswani et al. Image transformer
Chiang et al. ANFPCGC++: Point Cloud Geometry Coding Using Augmented Normalizing Flows and Transformer-based Entropy Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant