CN117219050A - Text-to-video generation system based on a deep generative adversarial network - Google Patents

Text-to-video generation system based on a deep generative adversarial network

Info

Publication number
CN117219050A
Authority
CN
China
Prior art keywords
coefficient
expression
video
face
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311154604.7A
Other languages
Chinese (zh)
Inventor
李雪健
陈永强
王育欣
高泽夫
马宏斌
焦义文
马宏
吴涛
刘杨
李超
腾飞
卢志伟
陈雨迪
宋雨珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Original Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peoples Liberation Army Strategic Support Force Aerospace Engineering University filed Critical Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority to CN202311154604.7A priority Critical patent/CN117219050A/en
Publication of CN117219050A publication Critical patent/CN117219050A/en
Pending legal-status Critical Current


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a text-to-video generation system based on a deep generative adversarial network, which can generate clear speech for a target person, solve the problem of audio-video asynchrony, and improve the image quality of the synthesized video. The system comprises a speech generation module and a video generation module. The speech generation module takes a reference speech signal and the text for the generated object as inputs and comprises three independently trained neural networks: a speaker encoder, a sequence synthesizer, and an autoregressive WaveNet vocoder; it finally outputs the speech features. The video generation module takes a picture of the generated object and the speech features as inputs, and uses a 3D face recognition unit to determine an initial reference expression coefficient and an initial reference head pose coefficient from the picture of the generated object. The expression unit generates speech-associated expression coefficients. The head pose unit obtains head pose coefficients. The 3D face rendering unit maps the facial key points using the speech-associated expression coefficients and head pose coefficients to generate the video.

Description

Text-to-video generation system based on a deep generative adversarial network
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a text-to-video generation system based on a deep generative adversarial network.
Background
With the surging popularity of the digital human concept and the continuous development of generation technology, making the person in a photo move in time with input audio is no longer a difficult problem. Inspired by this technology, if it is applied in the public-opinion domain, a speech video of a specific person can be generated from arbitrary text by extracting that person's voiceprint features and deep visual features, producing footage realistic enough to deceive an adversary and sway public sentiment; this has strong military significance both on the front line and behind enemy lines.
At present, few techniques generate video from speech with a deep generative adversarial network (GAN); the following describes a recently published solution that is closely related to the present invention.
An existing method is a text-to-video generation method based on a multi-condition generative adversarial network (2022.10, Peri, Journal of Computer-Aided Design & Computer Graphics). The text-to-video method comprises three modules: a text processing module, a pose modeling and conversion module, and a video frame generation and optimization module. The text processing module combines traditional generation methods (retrieval and supervised learning) with a generative model to establish an action retrieval database, improving the controllability of the generation process; the pose modeling and conversion module extracts pose information and performs three-dimensional modeling; the video frame generation and optimization module uses a multi-condition generative adversarial network to synthesize and optimize video frames. The text processing module relies on an action retrieval database that stores action sequences corresponding to semantic information, i.e., a database of actions that satisfy semantic requirements. Besides ensuring the completeness of the action retrieval database, a retrieval scheme combining a bus topology and a tree topology is adopted to effectively improve action retrieval. First, according to the tree topology, branch searches are carried out over the character block, time block, state block and action block in the retrieval library, and the action reference module with the highest matching degree is selected in each branch. Second, the bus-topology retrieval combines the best matches from each tree-topology branch and screens out the action block with the highest overall matching value.
The pose modeling and conversion module extracts deeper image features from the source image and expresses the motion features of all objects with a single trained model. Meanwhile, 3D portrait modeling is performed with a parametric statistical human body model, so that human motion in the generated video conforms as closely as possible to the structural characteristics of human movement.
The pose modeling and conversion module has two parts. The first part is 3D pose modeling, which builds an end-to-end 2D-to-3D modeling pipeline with a human body generative model. Predicting 3D pose and shape parameters from 2D image information balances generation accuracy and efficiency, yielding a richer and more realistic 3D action model. The reference image and the source image are encoded by a residual network to obtain 2D convolutional features, which are passed to an iterative 3D regression model to produce the 3D portrait modeling information (pose and shape) and the projection relationship between the camera and the 2D joints. The camera field of view provides a parameter measuring the distance between the modeled 3D portrait and the camera, avoiding 3D portrait models that are far too large or too small. Finally, using differentiable 3D human body modeling and a GAN structure, the parameters of the generated model are fed into a discriminator to judge whether the generated 3D model corresponds to normal human behavior. SMPL is a parameterized human body model that represents human shape and pose in a data-driven manner. It can also simulate the bulging and hollowing of human muscles during movement, avoid surface distortion of the muscles, and accurately model their stretching and contraction. In this way, a realistic animated human body can be created: different body shapes deform naturally with pose and exhibit soft-tissue motion similar to that of a real person.
The second part is the pose conversion module. A differentiable neural rendering module maps the two obtained 3D models (the reference-image 3D model (3Dref) and the source-image 3D model (3Dsrc)), a transformation matrix is computed from the models' projected vertices, and the transformation matrix T applies the specified action to the source 3D model.
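For illustration only, the pose conversion step described above reduces to applying a transformation matrix to the vertices of the source 3D model. The sketch below shows that operation in isolation; the array names, the 4×4 homogeneous-coordinate convention, and the toy rotation are assumptions made for this example rather than details taken from the cited method.

```python
import numpy as np

def apply_pose_transform(src_vertices: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Apply a 4x4 homogeneous transform T to an (N, 3) array of mesh vertices.

    Illustrates the idea of re-posing a source 3D model with a transformation
    matrix estimated from the reference and source projections.
    """
    assert src_vertices.shape[1] == 3 and T.shape == (4, 4)
    homo = np.hstack([src_vertices, np.ones((src_vertices.shape[0], 1))])  # to homogeneous coords
    transformed = homo @ T.T
    return transformed[:, :3] / transformed[:, 3:4]                        # back to 3D

# Toy usage: rotate a tiny vertex set 90 degrees about the z-axis.
T = np.array([[0.0, -1.0, 0.0, 0.0],
              [1.0,  0.0, 0.0, 0.0],
              [0.0,  0.0, 1.0, 0.0],
              [0.0,  0.0, 0.0, 1.0]])
verts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(apply_pose_transform(verts, T))
```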
The video frame generation and optimization module adopts a ResUnet structure, i.e., a combination of a residual neural network (ResNet) and a CNN, and the discriminative model uses the discriminator architecture from Pix2Pix. The video frame optimization module removes the batch normalization (BN) layers from the original network and obtains the optimal solution in image space through the mutual adversarial training of the generative model and the discriminative model, thereby obtaining high-resolution video frames.
The existing methods have the following defects when processing a source image: the speech clarity of the generated target person is low, the audio and the picture are not synchronized, and the video image quality is poor.
Disclosure of Invention
In view of the above, the invention provides a text-to-video generation system based on a deep generative adversarial network, which can generate clear speech for the target person, solve the problem of audio-video asynchrony, and improve the image quality of the synthesized video.
In order to achieve the above purpose, the technical scheme of the invention is as follows: the system comprises a speech generation module and a video generation module.
The speech generation module takes a reference speech signal and the text for the generated object as inputs, and comprises three independently trained neural networks, namely:
a speaker encoder for calculating a fixed-dimension embedding vector from the reference speech signal of the generated object;
a sequence synthesizer for predicting a mel-spectrogram from a grapheme or phoneme input sequence, conditioned on the embedding vector of the generated object;
an autoregressive WaveNet vocoder for converting the mel-spectrogram into a time-domain waveform, finally generating the speech features and feeding them to the video generation module.
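For orientation, the three networks compose into a simple pipeline: reference speech → speaker embedding, (text, embedding) → mel-spectrogram, mel-spectrogram → waveform. The sketch below only fixes those interfaces with placeholder modules; every class name, dimension, and the random-tensor internals are illustrative assumptions rather than the trained networks of the disclosed system.

```python
import numpy as np

class SpeakerEncoder:
    """Maps a reference waveform of any length to a fixed-dimension embedding."""
    def embed(self, reference_wav: np.ndarray) -> np.ndarray:
        e = np.random.randn(256)           # placeholder for the trained encoder network
        return e / np.linalg.norm(e)       # the embedding is normalized

class SequenceSynthesizer:
    """Predicts a mel-spectrogram from phonemes, conditioned on the speaker embedding."""
    def synthesize(self, phonemes: list, speaker_embedding: np.ndarray) -> np.ndarray:
        n_frames = 20 * len(phonemes)      # placeholder duration model
        return np.random.randn(n_frames, 80)

class WaveNetVocoder:
    """Converts a mel-spectrogram into a time-domain waveform."""
    def vocode(self, mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
        return np.random.randn(mel.shape[0] * hop_length)

def text_to_speech(text: str, reference_wav: np.ndarray) -> np.ndarray:
    phonemes = list(text)                  # stand-in for a real grapheme-to-phoneme step
    embedding = SpeakerEncoder().embed(reference_wav)
    mel = SequenceSynthesizer().synthesize(phonemes, embedding)
    return WaveNetVocoder().vocode(mel)    # speech features / waveform passed to the video side

waveform = text_to_speech("hello", np.random.randn(16000))
print(waveform.shape)
```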
The video generation module takes a picture of the generated object and the speech features as inputs, and comprises a 3D face recognition unit, an expression unit, a head pose unit and a 3D face rendering unit.
The 3D face recognition unit performs 3D face recognition on the image of the generated object and determines an initial reference expression coefficient and an initial reference head pose coefficient.
The expression unit calculates the motion coefficients of the face of the generated object and generates speech-associated expression coefficients.
The head pose unit calculates the motion coefficients of the whole head to obtain head pose coefficients.
The 3D face rendering unit maps the facial key points using the speech-associated expression coefficients and the head pose coefficients to generate the final video.
Further, the speaker encoder is a trained neural network that computes a Log-Mel spectrogram sequence from a reference speech signal of arbitrary length and maps it into an embedding vector of fixed dimension. During training, each speaker-encoder sample comprises a speech/video example segmented into 1.6 s clips and a speaker identity label. In the training network, the Log-Mel spectrogram passes through several transmission channels to a long short-term memory recurrent neural network (LSTM) composed of multiple units, and the output is finally normalized.
Further, the sequence synthesizer comprises an encoder, a synthesizer and a decoder. The sequence synthesizer is trained on target audio of transcribed text; at the input end, the text is first mapped into a series of phonemes, the minimum speech units. The phonemes are then combined with the reference speech through the encoding vector output by the pre-trained speaker encoder, and the synthesized speech encoding is finally fed to a decoder for decoding, producing a synthesized mel-spectrogram matching the reference speech.
Further, the expression unit includes an audio encoder, a mapping network, a Wav2Lip model, and a 3DMM coefficient estimator.
The audio encoder is a residual neural network ResNet, the input of the audio encoder is audio, and the output is an audio encoding result.
The mapping network is a linear layer for decoding the expression coefficients, and it has three inputs: the first is the audio encoding result output by the audio encoder; the second is the reference expression coefficient β_0 from the reference image; the third is the blink control signal z_blink ∈ [0, 1] and the corresponding eye landmark loss. The output of the mapping network is the expression coefficients of the t frames.
The input of the Wav2Lip model is audio, the audio passes through the Wav2Lip network to obtain a preliminary Lip expression coefficient, and the output of the Wav2Lip model is the preliminary Lip expression coefficient; the preliminary lip expression coefficients are input to a 3DMM coefficient estimator.
The 3DMM coefficient estimator is a monocular three-dimensional face reconstruction model used for learning real expression coefficients.
Further, the head pose unit includes a VAE encoder and a VAE decoder based on a VAE model.
The VAE encoder and VAE decoder in the VAE model are both two-layer MLPs.
First, the head pose ρ_0 of the first frame, the identity style code Z_style, the audio a{1,...,t}, and the residual head pose Δρ{1,...,t} = ρ{1,...,t} − ρ_0 are input into the VAE encoder to obtain a mean and a variance; these are mapped to a Gaussian distribution, a latent vector is sampled from that distribution, and the sampled latent vector is passed through the VAE decoder to generate new data with a distribution similar to the original data.
Finally, the residual Δρ'{1,...,t} after one iteration is obtained; the generated motion pose coefficients are compensated and corrected by computing Δρ'{1,...,t} and returning it to the VAE encoder, repeating until Δρ'{1,...,t} falls below a threshold of 0.1, at which point the iteration stops and the compensated, corrected real head motion pose coefficients ρ are finally obtained.
Further, the 3D face rendering unit includes an appearance encoder, a typical key point extraction unit, a 3D face recognition unit, a mapping network, and a video generator.
Given an original image, preliminary face coefficients are generated by the appearance encoder and the typical key point extraction unit of the 3D face, and the initial reference expression coefficient and initial head pose coefficient of the image are determined by the face recognition unit; these coefficients and the speech signal are input into the expression unit and the head pose generation unit to generate the expression coefficients and head pose coefficients of the final video.
The initial reference expression coefficient and initial head pose coefficient, and the generated expression coefficients and head pose coefficients of the final video, are each input into a pre-trained mapping network; the 3D facial key point space output by the mapping network, together with the output of the appearance encoder and the typical key points of the 3D face, is input into the video generator to generate the final video.
The mapping network is a convolutional neural network whose inputs are the expression coefficients and head pose coefficients and whose output is the facial key points; it is trained on real data.
The appearance encoder provides coefficients related to the facial appearance of the generated object's still image.
The typical key points extracted by the typical key point extraction module of the 3D face comprise coefficients for key parts such as the lips and eyes; these coefficients are weighted and summed with the reference facial key point coefficients obtained by the 3D face recognition unit and the actual, speech-matched facial key point coefficients to obtain the coefficients of each frame, and the final video is formed once the coefficients of all frames have been computed.
The beneficial effects are that:
1: the invention provides a text generation video system based on a depth generation countermeasure network, which is a set of text generation video system. The independent speaker encoder based on the neural network generates a voice system, learns the speaking habit of a speaker, and further generates high-quality speaking voice. The invention adopts the independent speaker encoder to train the reference voice and learn the speaking habit of the speaker, so that high-quality voice can be generated. The speaker habit of the speaker is embedded through the independently trained speaker encoder network, the problem of low voice definition of the generated target character is solved, the problem of natural head movement and vivid expression is realized through designing the expression unit and the head gesture unit to calculate the 3D movement coefficient, the problem of asynchronous audio and video is solved, and the image quality of the synthesized video is improved.
2: according to the invention, the 3D motion coefficients are introduced to express the key points of the face, so that the expression unit and the head posture unit are built to calculate the 3D motion coefficients, and the expression of the face and the posture of the head can be more vivid and natural. And constructing a 3D face rendering module based on the expression unit and the head gesture unit, and connecting the 3D motion coefficient with the face key point by adopting a mapping network, so that the final generation of the video can be realized. The system can generate clear voice of the target person, solves the problem of asynchronous audio and video by using the expression unit consisting of the wavlip and the 3DMM, and improves the image quality of the synthesized video.
Drawings
FIG. 1 is a block diagram of the text-to-video generation system based on a deep generative adversarial network provided by the invention;
FIG. 2 is a block diagram of a sequence synthesizer;
FIG. 3 is a block diagram of an expression unit;
FIG. 4 is a block diagram of a head pose unit;
fig. 5 is a block diagram of a 3D face rendering unit.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention provides a text-to-video generation system based on a deep generative adversarial network, as shown in FIG. 1.
The system consists of two parts: a speech generation module and a video generation module.
The speech generation module takes a reference speech signal and the text for the generated object as inputs, and comprises three independently trained neural networks, namely:
a speaker encoder for calculating a fixed-dimension embedding vector from the reference speech signal of the generated object;
a sequence synthesizer for predicting a mel-spectrogram from a grapheme or phoneme input sequence, conditioned on the embedding vector of the generated object;
an autoregressive WaveNet vocoder for converting the mel-spectrogram into a time-domain waveform, finally generating the speech features and feeding them to the video generation module.
The video generation module takes a picture of the generated object and the speech features as inputs, and comprises a 3D face recognition unit, an expression unit, a head pose unit and a 3D face rendering unit.
The 3D face recognition unit performs 3D face recognition on the image of the generated object and determines an initial reference expression coefficient and an initial reference head pose coefficient.
The expression unit calculates the motion coefficients of the face of the generated object and generates speech-associated expression coefficients.
The head pose unit calculates the motion coefficients of the whole head to obtain head pose coefficients.
The 3D face rendering unit maps the facial key points using the speech-associated expression coefficients and the head pose coefficients to generate the final video.
The overall system has two inputs: first, the input text, i.e., the words to be spoken; second, the input image, i.e., head image data of the generated object. It has two outputs: first, the generated speech, produced as an intermediate output; second, the final generated video.
The workflow of the system is as follows:
first, text and images are input.
Second, the pre-trained speaker encoder determines the reference speech of the generated object according to the given identity label and outputs an encoding vector to the sequence synthesizer; at the same time, the 3D face recognition unit analyses the input image to obtain the initial reference expression coefficient and head pose coefficient, which are fed into the expression unit, the head pose unit and the 3D face rendering unit, respectively.
Third, the sequence synthesizer combines the input text with the encoding vector output by the speaker encoder and outputs a Log-Mel spectrogram, which is then passed to the vocoder.
Fourth, the vocoder converts the synthesized Log-Mel spectrogram output by the sequence synthesizer network into a time-domain waveform, finally generating the speech, which is then input into the expression unit and the head pose unit.
Fifth, the expression unit and the head pose unit process the speech together with the initial reference expression coefficient and head pose coefficient, generate the speech-associated expression coefficients and head pose coefficients, and input the generated coefficients into the 3D face rendering unit.
Sixth, the 3D face rendering unit generates the final video from the initial reference expression coefficient, the head pose coefficient, and the speech-associated expression coefficients.
The specific embodiment of each part is as follows:
Speaker encoder
The speaker encoder is a relatively independent module. Its function is to capture the speech characteristics of the object from the reference speech, and it is used to condition the synthesis network (the sequence synthesizer) on a reference speech signal from the desired target speaker (the desired speech effect). The speaker encoder is a neural network that computes Log-Mel (base-10 logarithm) spectrogram sequences from reference speech of arbitrary length and maps them into embedding vectors of fixed dimension. Training optimizes a speaker voice loss so that the generated speech is highly similar to the original speech of the same person and differs greatly from the speech of different persons. The voice loss computed by the speaker encoder directly adjusts the sequence synthesizer network to optimize the finally generated speech.
The training samples for this module consist of speech/video examples segmented into 1.6 s clips together with speaker identity labels; in the training network, the Log-Mel spectrogram passes through several transmission channels to an LSTM (long short-term memory recurrent neural network) composed of multiple units, and the output is finally normalized. No separate optimization learning network is attached to this module (it may be configured either way); because it is an embedded module of the sequence synthesizer, no iterative optimization feedback is set.
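For illustration, a minimal PyTorch sketch of such a speaker encoder is given below, assuming 40-band Log-Mel input frames, a small LSTM stack, a linear projection, and L2 normalization of the output; the layer sizes and embedding dimension are guesses made for this example, not the values used in the system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Log-Mel frames (batch, time, n_mels) -> fixed-dimension, unit-norm embedding."""
    def __init__(self, n_mels: int = 40, hidden: int = 256, emb_dim: int = 256, layers: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(log_mel)   # h_n: (layers, batch, hidden)
        emb = self.proj(h_n[-1])           # use the final state of the last LSTM layer
        return F.normalize(emb, dim=-1)    # normalize the output, as described above

# 1.6 s of audio at a 10 ms hop is roughly 160 frames.
frames = torch.randn(4, 160, 40)
print(SpeakerEncoder()(frames).shape)      # torch.Size([4, 256])
```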
The sequence synthesizer is shown in FIG. 2. It consists of an encoder, a synthesizer and a decoder, and is trained on target audio of transcribed text. At the input, the text is first mapped into a series of phonemes (the minimum speech units), which speeds up convergence and improves the pronunciation of rare words and proper nouns. The phonemes are then combined with the reference speech through the encoding vector output by the pre-trained speaker encoder, and the synthesized speech encoding is finally fed to a decoder for decoding, producing a synthesized mel-spectrogram of the same high quality as the reference speech.
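One common way to realize the conditioning described above is to broadcast the speaker embedding along the phoneme axis and concatenate it with the text-encoder output before decoding mel frames. The sketch below assumes that scheme, with GRU stand-ins for the encoder and decoder and made-up dimensions; it is not the synthesizer architecture of FIG. 2 itself.

```python
import torch
import torch.nn as nn

class ConditionedSynthesizer(nn.Module):
    """Encodes a phoneme sequence, injects the speaker embedding, and decodes mel frames."""
    def __init__(self, n_phonemes: int = 80, enc_dim: int = 256, emb_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, enc_dim)
        self.encoder = nn.GRU(enc_dim, enc_dim, batch_first=True)
        self.decoder = nn.GRU(enc_dim + emb_dim, enc_dim, batch_first=True)
        self.to_mel = nn.Linear(enc_dim, n_mels)

    def forward(self, phoneme_ids: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        enc, _ = self.encoder(self.phoneme_emb(phoneme_ids))         # (B, T, enc_dim)
        cond = speaker_emb.unsqueeze(1).expand(-1, enc.size(1), -1)  # broadcast embedding over time
        dec, _ = self.decoder(torch.cat([enc, cond], dim=-1))
        return self.to_mel(dec)                                      # (B, T, n_mels)

mel = ConditionedSynthesizer()(torch.randint(0, 80, (2, 30)), torch.randn(2, 256))
print(mel.shape)   # torch.Size([2, 30, 80])
```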
Vocoder
The vocoder is also a relatively independent module. It converts the synthesized mel-spectrogram output by the sequence synthesizer network into a time-domain waveform, using a sample-by-sample autoregressive WaveNet as the vocoder. The network consists of about 30 dilated convolutional layers, and its output depends on the outputs of the speaker encoder and the sequence synthesizer.
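The core building block of such a vocoder is a stack of dilated causal convolutions with gated activations and residual connections. The sketch below shows only that pattern on a tiny scale (no mel conditioning, six blocks, made-up channel counts); it is not the roughly 30-layer network referred to above.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """One gated, dilated causal convolution with a residual connection."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.pad = dilation                 # left-pad by (kernel_size - 1) * dilation to stay causal
        self.filt = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = nn.functional.pad(x, (self.pad, 0))
        y = torch.tanh(self.filt(y)) * torch.sigmoid(self.gate(y))   # gated activation
        return x + self.res(y)                                       # residual connection

class TinyWaveNet(nn.Module):
    """A few dilated blocks with exponentially growing dilation and a 1x1 output head."""
    def __init__(self, channels: int = 32, n_blocks: int = 6):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        self.blocks = nn.ModuleList(DilatedResidualBlock(channels, 2 ** i) for i in range(n_blocks))
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        x = self.inp(wav)
        for block in self.blocks:
            x = block(x)
        return self.out(x)

print(TinyWaveNet()(torch.randn(1, 1, 4000)).shape)   # torch.Size([1, 1, 4000])
```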
Expression unit
As shown in FIG. 3, in the expression unit the audio a{1,...,t} (the speech features) generates the expression coefficients β{1,...,t} of t frames through a training network, where the audio feature of each frame is a 0.2 s mel-spectrogram. The training network comprises an audio encoder based on ResNet (a residual neural network whose core idea is to learn feature changes through residual connections so that the network is easier to optimize) and a mapping network, a linear layer (one of the classical network models in deep learning) that decodes the expression coefficients. The mapping network has three inputs: the first is the output of the audio after the audio encoder; the second is the reference expression coefficient β_0 from the reference image, whose role is to reduce identity uncertainty; the third is the blink control signal z_blink ∈ [0, 1] together with the corresponding eye landmark loss, which prevents the final result from looking unrealistic when only lip coefficients are used in training and yields a controllable blinking effect.
The training network may be formulated as:
β{1,...,t} = M(A(a{1,...,t}), z_blink, β_0)    (1)
The output is β{1,...,t}, i.e., the expression coefficients of the t frames; M is the training network and A is the audio encoder.
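Equation (1) can be read as a mapping from the concatenation of the per-frame audio features A(a{1,...,t}), the blink control scalar z_blink, and the reference expression coefficient β_0 to per-frame expression coefficients. A minimal sketch under that reading follows; the dimensions (64-dimensional expression coefficients, 512-dimensional audio features) and the single-linear-layer decoder are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ExpressionMappingNet(nn.Module):
    """Sketch of beta{1..t} = M(A(a{1..t}), z_blink, beta_0) as one linear decoding layer."""
    def __init__(self, audio_dim: int = 512, exp_dim: int = 64):
        super().__init__()
        self.decode = nn.Linear(audio_dim + 1 + exp_dim, exp_dim)

    def forward(self, audio_feat: torch.Tensor, z_blink: torch.Tensor,
                beta_0: torch.Tensor) -> torch.Tensor:
        # audio_feat: (T, audio_dim) per-frame audio encodings A(a{1..t})
        # z_blink:    (T, 1) blink control values in [0, 1]
        # beta_0:     (exp_dim,) reference expression coefficient shared by all frames
        ref = beta_0.unsqueeze(0).expand(audio_feat.size(0), -1)
        return self.decode(torch.cat([audio_feat, z_blink, ref], dim=-1))

t = 25
beta = ExpressionMappingNet()(torch.randn(t, 512), torch.rand(t, 1), torch.randn(64))
print(beta.shape)   # torch.Size([25, 64])
```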
The second path goes through a Wav2Lip pre-training network (a GAN-based lip-motion transfer algorithm; the Wav2Lip model synchronizes the lip shape with the input speech) and deep three-dimensional reconstruction (after synchronization, three-dimensional lip reconstruction is performed to generate video). Only the lip motion coefficients are used as the coefficient target: the audio is passed through the Wav2Lip network to obtain preliminary lip expression coefficients, which make the generated lip expression coefficients more accurate, and the first frame I_0 of the lip image output by 3D face recognition is introduced as the target expression coefficient. Because it contains only lip-related motion, the influence of pose changes and of facial expressions other than lip motion is reduced, and training against this target makes the lip motion more stable and smooth.
The preliminary lip expression coefficients are then input to a 3DMM (3D Morphable Model, a deformable/parameterized model) coefficient estimator M1 for training; M1 is a monocular three-dimensional face reconstruction model used to learn the real expression coefficients. The output of M1 is a more realistic set of expression coefficients, divided into two parts: one part is the lip-related coefficients M1(Wav2Lip(I_0, a{1,...,t})), and the other part is the remaining coefficients. Comparing the lip-related coefficients M1(Wav2Lip(I_0, a{1,...,t})) with the output β{1,...,t} of the first training network yields the difference loss L_distill. Feeding the remaining coefficients, together with the output β{1,...,t}, through the M2 network (a differentiable three-dimensional face rendering network without learned parameters) yields the eye-blink landmark loss L_lks, which measures the degree of eye blinking and the accuracy of the overall expression. Feeding the lip-related coefficients M1(Wav2Lip(I_0, a{1,...,t})) and the other coefficients through the M2 network yields the lip coefficient loss L_read, which maintains perceptual lip quality.
The lip coefficient loss L_read is combined with M1(Wav2Lip(I_0, a{1,...,t})) to obtain the real lip expression coefficients; the output of the expression unit is β'{1,...,t} + L_read{1,...,t}.
Head pose unit
The head pose unit, shown in FIG. 4, includes encoder and decoder sampling modules based on a VAE (a variational autoencoder: a generative model, a variant of the autoencoder, that generates new samples by learning the latent distribution of the data), in order to learn realistic, identity-styled head motion and obtain the head motion coefficients ρ.
The VAE encoder and decoder are two-layer MLPs (multi-layer perceptrons: a common artificial neural network model composed of multiple fully connected layers of neurons, often used for classification and regression), and the input contains t consecutive frames of head poses. In the VAE decoder, the network learns to generate the residuals of the t frame poses from the sampled distribution. Note that this module does not generate the poses directly; it learns conditioned on the head pose ρ_0 of the first frame, which enables it to generate longer, stable, continuous head motion under the condition of the first frame.
First, the head pose ρ_0 of the first frame, the identity style code Z_style, the audio a{1,...,t}, and the residual head pose Δρ{1,...,t} = ρ{1,...,t} − ρ_0 are input into the VAE encoder to obtain a mean and a variance, which define a Gaussian distribution (i.e., the mean and variance of the Gaussian equal the encoder's outputs); a latent vector is sampled from this Gaussian, and the sampled latent vector is passed through the decoder to generate new data with a distribution similar to the original data.
Finally, the residual Δρ'{1,...,t} after one iteration is obtained and used to compensate and correct the generated motion pose coefficients, controlling the realism and stability of the generation; an L_KL divergence computed from the mean and variance measures the distribution of the resulting head motion. The process is iterated several times until Δρ'{1,...,t} falls below a threshold of 0.1, at which point the iteration stops and the compensated, corrected real head motion pose coefficients are finally obtained.
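A compact sketch of this encode-sample-decode loop, including the reparameterization step, the KL term, and the 0.1 stopping threshold on the residual, is given below; the feature dimensions, the way the conditions are concatenated, and the use of untrained weights on random tensors are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PoseVAE(nn.Module):
    """Two-layer MLP encoder/decoder over residual head poses, conditioned on the
    first-frame pose, an identity style code, and per-frame audio features."""
    def __init__(self, pose_dim: int = 6, cond_dim: int = 70, latent: int = 32, hidden: int = 128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, pose_dim))

    def forward(self, delta_rho: torch.Tensor, cond: torch.Tensor):
        mu, logvar = self.enc(torch.cat([delta_rho, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterized sample
        return self.dec(torch.cat([z, cond], dim=-1)), mu, logvar

t, vae = 40, PoseVAE()
cond = torch.randn(t, 70)           # stand-in for the (rho_0, Z_style, audio) condition per frame
delta = torch.randn(t, 6)           # residual poses rho{1..t} - rho_0
for _ in range(10):                 # feed the residual back until it is small enough
    delta, mu, logvar = vae(delta, cond)
    if delta.abs().max().item() < 0.1:   # the 0.1 threshold described above
        break
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())        # KL term from mean/variance
print(delta.shape, kl.item())
```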
3D face rendering unit
After the more realistic motion coefficients have been generated by the preceding units, the 3D face rendering unit, shown in FIG. 5, renders the final video.
This module draws inspiration from the literature (Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing. In CVPR, 2021), which implicitly learns 3D information from a single image. In that method, real video is used as the motion-driving signal, whereas the module designed here is driven by 3D motion coefficients, and a mapping network is proposed to learn the relationship between the 3D motion coefficients and the 3D facial key points. The mapping network is built from several one-dimensional convolutional layers and is smoothed using temporal coefficients from a time window.
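For illustration, a minimal sketch of such a coefficient-to-keypoint mapping built from a few one-dimensional convolutions over a short temporal window is shown below; the coefficient dimension (70 here, e.g. expression plus pose), the number of key points, and the window size are assumptions, not the trained mapping network.

```python
import torch
import torch.nn as nn

class CoeffToKeypoints(nn.Module):
    """Maps per-frame 3D motion coefficients to 3D facial key points, smoothing each
    prediction over a short temporal window with 1-D convolutions."""
    def __init__(self, coeff_dim: int = 70, n_keypoints: int = 68, window: int = 5):
        super().__init__()
        pad = window // 2                  # keep the output aligned with the input frames
        self.net = nn.Sequential(
            nn.Conv1d(coeff_dim, 128, window, padding=pad), nn.ReLU(),
            nn.Conv1d(128, 128, window, padding=pad), nn.ReLU(),
            nn.Conv1d(128, n_keypoints * 3, 1))
        self.n_keypoints = n_keypoints

    def forward(self, coeffs: torch.Tensor) -> torch.Tensor:
        # coeffs: (batch, time, coeff_dim) -> key points: (batch, time, n_keypoints, 3)
        out = self.net(coeffs.transpose(1, 2)).transpose(1, 2)
        return out.reshape(coeffs.size(0), coeffs.size(1), self.n_keypoints, 3)

kp = CoeffToKeypoints()(torch.randn(1, 25, 70))
print(kp.shape)   # torch.Size([1, 25, 68, 3])
```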
Given an original image, preliminary face coefficients are generated through the appearance encoder and the typical key points (the typical key points of the 3D face), and the initial reference expression coefficient and head pose coefficient of the image are determined through face recognition. These coefficients and the speech signal are input into the expression module and the head pose generation module to produce the expression coefficients and head pose coefficients of the final video. The initial coefficients and the generated coefficients are each input into a pre-trained mapping network, and the 3D facial key point space output by the mapping network, together with the output of the appearance encoder and the typical key points, is input to the video generator to produce the final video. The mapping network (a convolutional neural network) takes the expression coefficients and head pose coefficients as input, outputs the facial key points, and is trained on real data.
the appearance encoder comprises coefficients related to the appearance of the face of the static image of the object, typical key points comprise coefficients of key parts such as lips and eyes, the two coefficients are weighted and summed together with a reference face key point coefficient obtained by 3D face recognition and an actual face key point coefficient matched with voice to obtain a coefficient of each frame, and finally, the final video is formed after multi-frame coefficients are calculated.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A text-to-video generation system based on a deep generative adversarial network, characterized by comprising a speech generation module and a video generation module;
the speech generation module takes a reference speech signal and the text for the generated object as inputs, and comprises three independently trained neural networks, namely:
a speaker encoder for calculating a fixed-dimension embedding vector from the reference speech signal of the generated object;
a sequence synthesizer for predicting a mel-spectrogram from a grapheme or phoneme input sequence, conditioned on the embedding vector of the generated object;
an autoregressive WaveNet vocoder for converting the mel-spectrogram into a time-domain waveform, finally generating the speech features and feeding them to the video generation module;
the video generation module takes a picture of the generated object and the speech features as inputs, and comprises a 3D face recognition unit, an expression unit, a head pose unit and a 3D face rendering unit;
the 3D face recognition unit performs 3D face recognition on the image of the generated object and determines an initial reference expression coefficient and an initial reference head pose coefficient;
the expression unit calculates the motion coefficients of the face of the generated object and generates speech-associated expression coefficients;
the head pose unit calculates the motion coefficients of the whole head to obtain head pose coefficients;
the 3D face rendering unit maps the facial key points using the speech-associated expression coefficients and the head pose coefficients to generate the final video.
2. The text-to-video generation system based on a deep generative adversarial network of claim 1, wherein the speaker encoder network is trained as a neural network that computes Log-Mel spectrogram sequences from reference speech signals of arbitrary length and maps them into embedding vectors of fixed dimension;
during training of the speaker encoder, each training sample comprises a speech/video example segmented into 1.6 s clips and a speaker identity label; in the training network, the Log-Mel spectrogram passes through several transmission channels to a long short-term memory recurrent neural network (LSTM) composed of multiple units, and the output is finally normalized.
3. The text-to-video generation system based on a deep generative adversarial network of claim 1, wherein the sequence synthesizer includes an encoder, a synthesizer and a decoder;
the sequence synthesizer is trained on target audio of transcribed text; at the input end, the text is first mapped into a series of phonemes, the minimum speech units; the phonemes are then combined with the reference speech through the encoding vector output by the pre-trained speaker encoder, and the synthesized speech encoding is finally fed to a decoder for decoding, producing a synthesized mel-spectrogram matching the reference speech.
4. The text-to-video generation system based on a deep generative adversarial network of claim 1, wherein the expression unit includes an audio encoder, a mapping network, a Wav2Lip model, and a 3DMM coefficient estimator;
the audio encoder is a residual neural network ResNet, the input of the audio encoder is audio, and the output is an audio encoding result;
the mapping network is a linear layer for decoding the expression coefficients, and the inputs of the mapping network include three: the first is the audio coding result output after the audio passes through the audio coder, and the second is the reference expression coefficient beta from the reference image 0 Third is blink control signal z blink ∈[0,1]And corresponding eye mark loss; the output of the mapping network is the expression coefficient of t frames;
the input of the Wav2Lip model is audio, the audio passes through a Wav2Lip network to obtain a preliminary Lip expression coefficient, and the output of the Wav2Lip model is the preliminary Lip expression coefficient; the preliminary lip expression coefficient is input to a 3DMM coefficient estimator;
the 3DMM coefficient estimator is a monocular three-dimensional face reconstruction model and is used for learning real expression coefficients.
5. The text-to-video generation system based on a deep generative adversarial network of claim 1, wherein the head pose unit includes a VAE encoder and a VAE decoder based on a VAE model;
the VAE encoder and the VAE decoder in the VAE model are both two-layer MLPs;
first, the head pose ρ of the first frame is set 0 Identity style identification Z style Audio a {1,..the, t }, residual head pose Δρ {1,..t } = ρ {1,..the, t } - ρ 0 Inputting the data into a VAE encoder to encode to obtain a mean value and a variance, mapping the mean value and the variance into a Gaussian distribution, sampling the Gaussian distribution to obtain potential vectors, and then passing the sampled potential vectors through the VAE decoder to generate new data similar to the original data distribution;
finally, obtaining residual error Deltaρ ' {1, & gt, t } after one iteration, further, compensating and correcting the generated motion attitude coefficient by calculating residual error Deltaρ ' {1, & gt, returning to the VAE encoder, and repeating for a plurality of times until Deltaρ ' {1, & gt, t } meets a threshold value smaller than 0.1, stopping iteration, and finally obtaining the real head motion attitude coefficient ρ after compensation and correction.
6. The text-to-video generation system based on a deep generative adversarial network of claim 1, wherein the 3D face rendering unit includes an appearance encoder, a typical key point extraction unit, a 3D face recognition unit, a mapping network, and a video generator;
given an original image, generating a preliminary face coefficient through an appearance encoder and a typical key point extraction unit of a 3D face, determining an initial reference expression coefficient and an initial head posture coefficient of the image through a face recognition unit, and inputting the coefficients and a voice signal into an expression unit and a head posture generation unit to generate an expression coefficient and a head posture coefficient of a final video;
the initial reference expression coefficient and the initial head posture coefficient and the generated expression coefficient and head posture coefficient of the final video are respectively input into a pre-trained mapping network, and the 3D face key point space output by the mapping network and the output of the appearance encoder and typical key points of the 3D face are used as input into a video generator to generate the final video;
the mapping network is a convolutional neural network, the input of the mapping network is an expression coefficient, the head posture coefficient and the output of the mapping network are facial key points, and the mapping network is trained by using real data;
the appearance encoder includes coefficients relating to the appearance of the face of the object;
the typical key points extracted by the typical key point extraction module of the 3D face comprise coefficients of key parts such as lips and eyes, the coefficients are weighted and summed together with the reference face key point coefficient obtained by the 3D face recognition unit and the actual face key point coefficient matched with voice to obtain the coefficient of each frame, and finally, the final video is formed after the multi-frame coefficients are calculated.
CN202311154604.7A 2023-09-08 2023-09-08 Text-to-video generation system based on a deep generative adversarial network Pending CN117219050A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311154604.7A CN117219050A (en) 2023-09-08 2023-09-08 Text-to-video generation system based on a deep generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311154604.7A CN117219050A (en) 2023-09-08 2023-09-08 Text-to-video generation system based on a deep generative adversarial network

Publications (1)

Publication Number Publication Date
CN117219050A true CN117219050A (en) 2023-12-12

Family

ID=89045513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311154604.7A Pending CN117219050A (en) Text-to-video generation system based on a deep generative adversarial network

Country Status (1)

Country Link
CN (1) CN117219050A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117877517A (en) * 2024-03-08 2024-04-12 深圳波洛斯科技有限公司 Method, device, equipment and medium for generating environmental sound based on an adversarial neural network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
US20190259175A1 (en) * 2018-02-21 2019-08-22 International Business Machines Corporation Detecting object pose using autoencoders
CN110728971A (en) * 2019-09-25 2020-01-24 云知声智能科技股份有限公司 Audio and video synthesis method
CN112689871A (en) * 2018-05-17 2021-04-20 谷歌有限责任公司 Synthesizing speech from text using neural networks with the speech of a target speaker
CN116206607A (en) * 2023-02-08 2023-06-02 北京航空航天大学 Method and device for generating realistic virtual person based on voice driving
CN116386594A (en) * 2023-04-07 2023-07-04 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190259175A1 (en) * 2018-02-21 2019-08-22 International Business Machines Corporation Detecting object pose using autoencoders
CN112689871A (en) * 2018-05-17 2021-04-20 谷歌有限责任公司 Synthesizing speech from text using neural networks with the speech of a target speaker
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110728971A (en) * 2019-09-25 2020-01-24 云知声智能科技股份有限公司 Audio and video synthesis method
CN116206607A (en) * 2023-02-08 2023-06-02 北京航空航天大学 Method and device for generating realistic virtual person based on voice driving
CN116386594A (en) * 2023-04-07 2023-07-04 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wenxuan Zhang et al., "SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation," arXiv:2211.12194v2, 13 March 2023 (2023-03-13), pages 1-14 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117877517A (en) * 2024-03-08 2024-04-12 深圳波洛斯科技有限公司 Method, device, equipment and medium for generating environmental sound based on an adversarial neural network
CN117877517B (en) * 2024-03-08 2024-05-24 深圳波洛斯科技有限公司 Method, device, equipment and medium for generating environmental sound based on an adversarial neural network

Similar Documents

Publication Publication Date Title
Hong et al. Real-time speech-driven face animation with expressions using neural networks
Fan et al. A deep bidirectional LSTM approach for video-realistic talking head
CN111897933B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN110880315A (en) Personalized voice and video generation system based on phoneme posterior probability
US20100082345A1 (en) Speech and text driven hmm-based body animation synthesis
CN113378806B (en) Audio-driven face animation generation method and system integrating emotion coding
Sargin et al. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation
CN112581569B (en) Adaptive emotion expression speaker facial animation generation method and electronic device
CN112151030A (en) Multi-mode-based complex scene voice recognition method and device
CN117219050A (en) Text generation video system based on depth generation countermeasure network
CN113838174A (en) Audio-driven face animation generation method, device, equipment and medium
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN115937369A (en) Expression animation generation method and system, electronic equipment and storage medium
Kumar et al. Robust one shot audio to video generation
Shankar et al. Multi-speaker emotion conversion via latent variable regularization and a chained encoder-decoder-predictor network
Lavagetto Time-delay neural networks for estimating lip movements from speech analysis: A useful tool in audio-video synchronization
CN117115316A (en) Voice-driven three-dimensional face animation method based on multi-level voice features
Chen et al. Speaker-independent emotional voice conversion via disentangled representations
CN115311731B (en) Expression generation method and device for sign language digital person
JP2974655B1 (en) Animation system
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
Liu et al. Real-time speech-driven animation of expressive talking faces
Balayn et al. Data-driven development of virtual sign language communication agents
Li et al. A novel speech-driven lip-sync model with CNN and LSTM
Huang et al. Visual speech emotion conversion using deep learning for 3D talking head

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination