CN114677312A - Face video synthesis method based on deep learning - Google Patents

Face video synthesis method based on deep learning Download PDF

Info

Publication number
CN114677312A
CN114677312A
Authority
CN
China
Prior art keywords
face
picture
video
image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210241632.1A
Other languages
Chinese (zh)
Inventor
刘奕
周建伟
舒佳根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Vocational University
Original Assignee
Suzhou Vocational University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Vocational University filed Critical Suzhou Vocational University
Priority to CN202210241632.1A priority Critical patent/CN114677312A/en
Publication of CN114677312A publication Critical patent/CN114677312A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/40 Image enhancement or restoration using histogram techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a face video synthesis method based on deep learning, which comprises the following steps: S1, acquiring a source picture and a driving video for face synthesis and storing them in an original material library; S2, performing data processing on the source picture and the driving video; S3, building a StyleGAN2 model and optimizing the model; S4, training the StyleGAN2 model; S5, inputting the face picture sample of the first person and the frame pictures into the StyleGAN2 model for face fusion processing; and S6, combining the frames of the target face picture data set into a video. The method of the invention has high voice recognition accuracy. It can effectively locate the external contour shape of the target face, effectively extract and locate the internal contour information of the target face, and therefore locate the target face more accurately.

Description

Face video synthesis method based on deep learning
Technical Field
The invention relates to the field of artificial intelligence voice recognition, in particular to a face video synthesis method based on deep learning.
Background
Because of its application and technical value, face synthesis is one of the hot spots in the field of machine vision and remains a highly active research area. With the breakthrough progress of deep learning in recent years, this kind of technology has developed rapidly and has been widely applied in fields such as privacy protection, film and television animation, entertainment and business.
Realistic face synthesis technology can accomplish work such as re-shooting video clips or dubbing at lower cost. Some videos have also used such techniques to synthesize aged or younger versions of actors, saving shooting costs. When high-risk special-effect shots need to appear in a video, synthesizing a high-quality face image with this technology ensures the personal safety of the performer and avoids the extra cost of expensive special effects.
At present, most face video synthesis approaches rely on manual image processing, obtaining synthetic images and videos by replacing and retouching the faces in them. This requires a large amount of manpower and material resources, is very slow, and cannot keep pace with today's short-video industry. An artificial intelligence processing method based on deep learning can therefore greatly reduce the difficulty of processing pictures and videos and improve the efficiency of producing face-synthesized video.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a face video synthesis method based on deep learning; by adopting an improved StyleGAN2 model, the quality of the generated images is markedly better. By adopting the ASM algorithm, the method can effectively locate the external contour shape of the target face, effectively extract and locate its internal contour information, and therefore locate the target face more accurately.
To achieve this purpose, the invention adopts the following technical scheme:
a face video synthesis method based on deep learning comprises the following steps:
S1, obtaining a source picture and a driving video for face synthesis and storing them in a raw material library; the face picture of a first person is used as the source picture, and a video of a second person performing facial actions is used as the driving video; the facial actions of the second person in the driving video drive the first person in the source picture to move, so that the face of the first person is synthesized onto the body of the second person;
s2, performing data processing on the source picture and the driving video;
s3, building a StyleGAN2 model, and optimizing the model;
s4, training a StyleGAN2 model;
s5, inputting the face picture sample and the frame picture of the first person into a StyleGAN2 model for face fusion processing;
and S6, combining the frames of the target face picture data set into a video.
Further, the specific method of step S1 is:
S11, acquiring a picture containing the whole face of a first person from a picture library, wherein the face image on the picture is the target face to be synthesized into the video; the 7 defined facial gestures are: left eye open, right eye open, mouth open; left eye open, right eye closed, mouth open; left eye open, right eye closed, mouth closed; left eye closed, right eye open, mouth open; left eye closed, right eye open, mouth closed; left eye closed, right eye closed, mouth open; and left eye closed, right eye closed, mouth closed;
S12, acquiring a driving video for synthesis: acquiring a speech video of the second person from a video library and trimming it to a driving video with a duration of 1 minute, the video containing the actual facial appearance to be used;
and S13, storing the acquired source picture and the acquired driving video in an original material library.
Further, the specific method of step S2 is:
S21, processing the source picture: adjusting the picture size and enhancing the picture. A picture processing tool adjusts the source picture to the 256x256 input size of the deep learning model; with the midpoint between the brows of the first person's face in the source picture as the picture center, the picture is scaled at its original aspect ratio, and scaling stops when either the height or the width of the source picture is reduced to 256; a 256x256 square frame then selects the scaled source picture, the part of the picture beyond the square is cropped away, and a cropped face picture sample of the first person is finally obtained;
performing picture enhancement on the cropped picture sample, specifically adopting a color image enhancement technique: converting the color image into a gray image, performing histogram equalization on the gray image, restoring the enhanced gray image back into a color image, and storing the color image in the sample picture folder;
S22, processing the video and extracting pictures frame by frame: scaling and cropping the driving video to 256x256, i.e. scaling the video to a suitable size, placing the face at the center of a 256x256 dashed selection frame for cropping, and outputting the cropped driving video sample;
extracting the video frame by frame with opencv2: obtaining the video sample file, obtaining the total number of frames of the video, reading each frame of the video, capturing every frame, converting each frame into an output picture, setting the path for storing the frame pictures, and storing the frame pictures in the sample picture folder.
Further, in step S3 the StyleGAN2 model is built and optimized; specifically, in the StyleGAN2 model, AdaIN is reconstructed as Weight Demodulation, and a progressive training method is adopted, i.e. an image of small resolution is trained first, then after that training is completed the model transitions step by step to a higher resolution, trains stably at the current resolution, and then transitions step by step to the next higher resolution; R1 regularization is used when the data distribution begins to deviate and is not used at other times, and the path length is regularized.
Further, the specific method in step S4 is:
S41, randomly generating a batch of latent codes (latents) and obtaining a batch of fake images fake_images_out through the generator G;
S42, taking a batch of real images reals from the training data set;
S43, feeding the real images and the fake images respectively to the discriminator D to compute scores, computing the global cross entropy of the scores, namely the negative of the sum of the logarithm of the probability that a fake image is judged fake and the logarithm of the probability that a real image is judged real, and summing the global cross entropy with a regularization term as the loss function of the discriminator D;
S44, the loss function of the generator G considers only its own cross entropy, namely the negative of the logarithm of the probability that a fake image is judged real, which is then summed with the perceptual path length regularization term as the loss function of the generator G;
S45, the optimization process minimizes the loss functions of D and G by gradient descent.
Further, the specific method of step S5 is:
s51, acquiring a face image sample of a first person from a file for storing a sample image, and then acquiring a first frame image sample from the image folder acquired in the step S22, wherein the first frame image sample comprises a face image of a second person;
S52, aligning the face picture sample of the first person with the face in the first frame picture sample by adopting the ASM algorithm, wherein ASM is the Active Shape Model, i.e. a target object is abstracted through a shape model, and the ASM algorithm comprises a training process and a searching process:
In the training phase: constructing position constraints of each characteristic point, and constructing local characteristics of each characteristic point;
in the stage of searching: iterating to corresponding feature points, firstly calculating the positions of eyes or mouths, making scale and rotation changes, and aligning the face; then, searching near each aligned point, and matching each local key point to obtain a preliminary shape; then correcting the matching result by using the average human face; iterating until convergence;
and S53, inputting the face image sample of the first person with the aligned face and the frame image into a StyleGAN2 model, fusing the face of the first person to the face position of a second person in the frame image through face feature fusion, outputting a fused target face image, and storing the target face image.
and S54, performing batch fusion processing on the remaining pictures in the picture folder obtained in step S22 according to the above steps to obtain the target face picture data set.
Further, the specific method of step S6 is:
S61, all target face pictures are obtained from the data set; each picture is one frame, the frames are obtained in order starting from the first frame, and each frame picture is passed as a parameter into opencv;
S62, setting a video name, wherein the frame rate is 30, and the video size is 256x 256;
s63, calling write () until the last frame of picture is traversed, and outputting a newly generated target face video to be output in a folder;
and S64, obtaining a final face synthetic video.
Has the beneficial effects that:
1. The improved StyleGAN2 model is adopted, so that the generated image quality is markedly better (better FID scores, fewer artifacts); a new method replacing progressive growing is provided, so that details such as teeth and eyes are more complete; interpolation is smoother (extra regularization); and training is faster.
2. The invention adopts the ASM algorithm, which can not only effectively locate the external contour shape of the target face but also effectively extract and locate its internal contour information. Its biggest advantage is that variable model instances can be generated: by changing the shape parameters, statistical model instances that vary within a certain range are obtained. Because the shape parameters of the model vary within a certain range, the target face can be located more accurately. This plays a key role in face alignment, calibrating the eyes, nose tip and mouth corners to the same canonical positions (five points in total) as a preprocessing step for face synthesis.
3. The invention enhances the picture samples through histogram equalization, improving the clarity and brightness of the face images.
Drawings
FIG. 1 is a StyleGAN network structure;
FIG. 2 is a new network structure of StyleGAN 2;
fig. 3 shows key feature points of the face in the ASM algorithm.
Detailed Description
The method for synthesizing the face video based on the deep learning comprises the following steps:
step S1, the source picture and the driving video for face synthesis are acquired and stored in the source material library.
The first face picture is used as a source picture, the video of the face action of the second person is used as a driving video, the face action of the second person in the driving video is used for driving the first person in the source picture to move, and the face of the first person is synthesized on the body of the second person.
S11, acquiring the source picture containing the target face for synthesis
A picture containing the complete face of the first person is acquired from the picture library; the face image on this picture is the target face to be synthesized into the video. The face in the picture must be clearly visible, since several facial gestures are to be taken from it. Under normal conditions, a person's left eye and right eye each have two states, open and closed, and the mouth likewise has two states, open and closed. By the principle of permutation and combination, the number of facial gestures that can be defined is 2 × 2 × 2 = 8. The 8 states are: left eye open, right eye open, mouth open; left eye open, right eye open, mouth closed; left eye open, right eye closed, mouth open; left eye open, right eye closed, mouth closed; left eye closed, right eye open, mouth open; left eye closed, right eye open, mouth closed; left eye closed, right eye closed, mouth open; and left eye closed, right eye closed, mouth closed. However, under normal circumstances the face is usually in the state where both eyes are open and the mouth is closed, and to respect the reliability of the gesture-design principle this state is removed. Therefore 7 facial gestures in total can be defined and recognized.
S12, acquiring driving video for synthesis
A speech video of the second person is acquired from the video library and trimmed to a driving video with a duration of 1 minute; the video contains the actual facial appearance to be used.
S13, storing the acquired source picture and the acquired driving video
And storing the first person source picture and the second person driving video acquired from the picture library and the video library in an original material library.
Step S2, data processing is performed on the source picture and the drive video.
S21, processing the source picture, adjusting the size of the picture, and performing image enhancement
Using a picture processing tool, the source picture is adjusted to the 256x256 input size of the deep learning model. Taking the midpoint between the brows of the first person in the source picture as the picture center, the picture is scaled at its original aspect ratio, and scaling stops when either the height or the width of the source picture is reduced to 256. A 256x256 square frame is then used to select the scaled source picture, the part of the picture beyond the square is cropped away, and a cropped face picture sample of the first person is finally obtained.
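As an illustration of this cropping step, a minimal Python/OpenCV sketch could look as follows; the function name crop_source and the assumption that the brow midpoint is already known are hypothetical and not part of the original disclosure.

```python
import cv2
import numpy as np

def crop_source(image, brow_mid):
    """Scale the shorter side to 256 and cut a 256x256 window centred on the brow midpoint."""
    h, w = image.shape[:2]
    scale = 256.0 / min(h, w)  # stop scaling once the shorter side reaches 256
    image = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    cx, cy = int(brow_mid[0] * scale), int(brow_mid[1] * scale)
    # Clamp the 256x256 window so it stays inside the scaled image.
    x0 = int(np.clip(cx - 128, 0, image.shape[1] - 256))
    y0 = int(np.clip(cy - 128, 0, image.shape[0] - 256))
    return image[y0:y0 + 256, x0:x0 + 256]
```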
And carrying out picture enhancement on the picture sample subjected to the cutting processing. The method comprises the steps of adopting a color image enhancement technology, converting a color image into a gray image, then carrying out histogram equalization processing on the gray image, finally restoring the gray image with the enhanced effect into the color image after the processing, and storing the color image in a sample picture folder.
Histogram equalization is applied to the I component of the HSI color space, i.e. to the luminance component, so the image is enhanced in brightness while saturation and hue remain unchanged. The principle relies on the property of the HSI color space that the I component represents image brightness, so the hue of the image is not changed by the processing.
The picture sample is converted from RGB to HSI; the HSI space is separated, and the I component forms a single gray image; histogram equalization is performed on this gray image; the equalized data replace the original I component; HSI is converted back to RGB, and the image restored from the equalized components is output as the picture sample. Histogram equalization thus enhances the picture sample and improves the clarity and brightness of the face image.
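For reference, an approximate Python sketch of this intensity-equalization idea is given below; the helper name enhance_intensity is hypothetical, the I component is approximated as the mean of the RGB channels, and a per-pixel gain replaces an explicit HSI round trip, which leaves hue and saturation essentially unchanged.

```python
import cv2
import numpy as np

def enhance_intensity(bgr):
    """Equalize the intensity (I) component while leaving hue and saturation essentially unchanged."""
    img = bgr.astype(np.float32)
    intensity = img.mean(axis=2)  # I component approximated as mean(R, G, B)
    equalized = cv2.equalizeHist(intensity.astype(np.uint8)).astype(np.float32)
    ratio = equalized / np.maximum(intensity, 1e-6)  # per-pixel gain on the luminance
    return np.clip(img * ratio[..., None], 0, 255).astype(np.uint8)
```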
And S22, processing the video, and extracting pictures frame by frame.
The driving video is scaled and cropped to 256x256: the video is scaled to a suitable size, the face is placed at the center of a 256x256 dashed selection frame for cropping, and the cropped driving video sample is output.
The video is extracted frame by frame with opencv2: the video sample file is obtained and the total number of frames is read. The frame rate of the video is typically 30 frames per second and the duration of the video sample file is 1 minute, so the total number of frames is 30 x 60 = 1800. Each frame of the video is read, every frame is captured once and converted into an output picture, a path for storing the frame pictures is set, and the frame pictures are stored in the sample picture folder, yielding 1800 frame pictures that contain the face image of the second person.
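A minimal sketch of this frame-extraction step with OpenCV follows; the function name extract_frames and the file naming scheme are illustrative assumptions.

```python
import cv2
import os

def extract_frames(video_path, out_dir):
    """Read the driving video frame by frame and save every frame as a picture."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # e.g. 30 fps x 60 s = 1800 frames
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{index:04d}.png"), frame)
        index += 1
    cap.release()
    return total, index
```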
And step S3, building a StyleGAN2 model and optimizing the model.
StyleGAN first uses a Mapping Network to encode the input vector into an intermediate variable w for feature disentanglement; this intermediate variable is then passed to the synthesis network to obtain 18 control variables, so that different elements of the control variables can control different visual features. Visual features can be roughly divided into three categories: (1) coarse: resolutions up to 8², affecting pose, general hairstyle, face shape, etc.; (2) medium: resolutions 16² to 32², affecting finer facial features, hairstyle, opening or closing of the eyes, etc.; (3) fine: resolutions 64² to 1024², affecting color (eyes, hair and skin) and micro features. The results are stable across many different data sets, except that the generated images sometimes exhibit water-droplet-like artifacts.
Aiming at this water-droplet artifact defect, StyleGAN2 focuses on repairing the artifact problem and further improves the quality of the generated images, as shown in FIG. 2. In FIG. 2: (a) is the original StyleGAN; (b) splits the AdaIN module of StyleGAN into instance normalization and style modulation; (c) removes the operations on the mean (in both normalization and modulation) and moves the position where the noise B is added; (d) turns the modification of the feature maps into a modification of the convolution weights, replacing normalization with demodulation.
FIG. 2(a) shows the original StyleGAN synthesis network, where A denotes an affine transformation learned from W that produces a style vector, and B denotes a noise broadcasting operation. The AdaIN operation is divided into two parts, normalization and modulation, as shown in FIG. 2(b), which expands the blocks of FIG. 2(a) in full detail. The original StyleGAN adds bias and noise inside the style block, so that their relative impact is inversely proportional to the current style magnitude. If the bias and noise additions are moved outside the style block, more predictable results are obtained, and it is no longer necessary to compute the mean, only the standard deviation; this design is shown in FIG. 2(c).
Major improvements of StyleGAN 2:
The quality of the generated images is markedly better (better FID scores, fewer artifacts); a new method replacing progressive growing is proposed, making details such as teeth and eyes more complete; style mixing is improved; interpolation is smoother (extra regularization); and training is faster.
Detailed improvement:
1. Weight Demodulation
in StyleGAN2, AdaIN is reconstructed as Weight Demodulation, as shown by the network structure in FIG. 2
The role of the AdaIN layer is similar to that of a BN layer: it scales and shifts the output of an intermediate network layer to improve learning and avoid vanishing gradients. Whereas BN learns the mean and variance of the current batch of data, Instance Norm uses a single picture. AdaIN uses learnable scale and shift parameters to modulate different locations of the feature map.
The process flow of Weight demodulation is as follows:
(1) As shown in FIG. 2(c), the Mod std operation associated with the 3x3 convolution scales the weights of the convolutional layer, where i denotes the i-th input feature map:

w'_ijk = s_i × w_ijk

(2) The weights of the convolutional layer are then demodulated:

σ_j = sqrt( Σ_{i,k} (w'_ijk)² + ε )

and the new convolutional layer weights are obtained as:

w''_ijk = w'_ijk / sqrt( Σ_{i,k} (w'_ijk)² + ε )

A small ε is added to avoid a zero denominator and ensure numerical stability. Although this approach is not mathematically equivalent to Instance Norm, weight demodulation, like other normalization methods, makes the output feature maps have unit standard deviation.
Furthermore, folding the scaling parameters into the convolution weights allows the computation path to be parallelized better.
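As an illustration only (not code from the patent), a minimal NumPy sketch of the modulation/demodulation formulas above; the names modulate_demodulate, weight and style are hypothetical.

```python
import numpy as np

def modulate_demodulate(weight, style, eps=1e-8):
    """Scale (Mod) and re-normalize (Demod) convolution weights as in the formulas above."""
    # weight: (out_channels j, in_channels i, kh, kw); style: per-input-channel scales s_i.
    w = weight * style[None, :, None, None]              # w'_ijk = s_i * w_ijk
    sigma = np.sqrt((w ** 2).sum(axis=(1, 2, 3)) + eps)  # sqrt(sum_{i,k} w'_ijk^2 + eps)
    return w / sigma[:, None, None, None]                # w''_ijk = w'_ijk / sigma_j
```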
2. Progressive growing
Progressive growing means: an image of small resolution is trained first, and after that training is completed the model gradually transitions to a higher-resolution image; it then trains stably at the current resolution before gradually transitioning to the next higher resolution.
StyleGAN2 instead looks for other designs that make the network deeper and train more stably. StyleGAN2 uses a ResNet-like residual connection structure and designs a new framework that exploits the multi-scale information of generated images: low-resolution features are mapped to the final generated image through ResNet-like skip connections.
3. Lazy regularization
StyleGAN1 used R1 regularization on the FFHQ dataset. Experiments show that the regularization terms can be computed much less frequently than the main loss function with negligible effect on the results; in fact, even if regularization is applied only once every 16 mini-batches, the model still performs well. StyleGAN2 therefore adopts a lazy regularization strategy, i.e. R1 regularization is applied when the data distribution begins to deviate and is not applied at other times.
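A hedged PyTorch-style sketch of such a lazily regularized discriminator loss; the function name discriminator_loss, the r1_gamma value and the every-16-steps schedule are assumptions consistent with the text above, not values claimed by the patent.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, reals, fakes, step, r1_gamma=10.0, r1_interval=16):
    """Logistic discriminator loss with R1 regularization applied only every r1_interval steps."""
    # softplus(x) = -log(sigmoid(-x)), so this matches the cross-entropy form used in S43.
    loss = F.softplus(D(fakes.detach())).mean() + F.softplus(-D(reals)).mean()
    if step % r1_interval == 0:  # lazy regularization
        reals_req = reals.detach().requires_grad_(True)
        grads = torch.autograd.grad(D(reals_req).sum(), reals_req, create_graph=True)[0]
        r1 = grads.pow(2).sum(dim=[1, 2, 3]).mean()
        # Scale by the interval so the effective strength matches per-step regularization.
        loss = loss + 0.5 * r1_gamma * r1 * r1_interval
    return loss
```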
4. Path length regularization
The purpose of path length regularization is to make interpolation in the latent space smoother and more linear.
These gradients should have nearly equal lengths, i.e. a small displacement in w produces a change of the same magnitude in the image, regardless of the value of w or the direction in image space; this indicates that the mapping from the latent space to the image space is well behaved. Path length regularization not only improves the quality of the generated pictures but also makes the generator smoother, so that generated pictures are more easily inverted back to their latent codes.
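As a sketch of how such a penalty is commonly computed (illustrative only; path_length_penalty, the running mean pl_mean and the decay value are assumptions, not text from the patent):

```python
import math
import torch

def path_length_penalty(fake_images, w, pl_mean, decay=0.01):
    """Penalize deviations of the per-sample path length from a running mean."""
    # fake_images: generator output G(w); w: latents with requires_grad=True, shape (batch, num_ws, dim).
    noise = torch.randn_like(fake_images) / math.sqrt(fake_images.shape[2] * fake_images.shape[3])
    grads = torch.autograd.grad((fake_images * noise).sum(), w, create_graph=True)[0]
    lengths = grads.pow(2).sum(dim=2).mean(dim=1).sqrt()  # per-sample path length
    pl_mean = pl_mean + decay * (lengths.mean().detach() - pl_mean)  # running scalar average
    penalty = (lengths - pl_mean).pow(2).mean()
    return penalty, pl_mean
```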
Step S4, training StyleGAN2 model
StyleGAN2 is essentially a contest between a fake-image generator and a real-image discriminator that ultimately leaves the discriminator unable to distinguish real from fake (for a data set consisting of both fake and real images, the probability that the discriminator assigns the correct label approaches 50%). The process consists of continually adjusting the weights and biases of the two neural networks so that, for the fake images produced by the generator, the probability that the discriminator judges them fake is minimal; that is, the mean of the feature statistics produced by the generator network (measured by metrics such as FID, PPL and LPIPS) approaches the mean of the real image samples and their variance is minimal. At the same time, the probability that the discriminator assigns correct labels to the mixed set of real and fake images is maximal (real images judged real, fake images judged fake).
The process is roughly as follows:
S41, randomly generating a batch (minibatch_size) of latent codes (latents) and obtaining a batch of fake images fake_images_out through the generator G;
S42, taking a batch (minibatch) of real images reals from the training data set;
S43, feeding the real and fake images respectively to the discriminator D to compute scores (i.e. the probabilities of being judged real, real_scores_out and fake_scores_out), computing the global cross entropy of the scores, -log(1 - sigmoid(fake_scores_out)) - log(sigmoid(real_scores_out)), i.e. the negative of the sum of the logarithm of the probability that a fake image is judged fake and the logarithm of the probability that a real image is judged real, and summing it with a regularization term as the loss function of the discriminator D;
S44, meanwhile, the loss function of the generator G considers only its own cross entropy, -log(sigmoid(fake_scores_out)), i.e. the negative of the logarithm of the probability that a fake image is judged real, which is then summed with the PPL (perceptual path length) regularization term as the loss function of the generator G;
S45, the optimization process minimizes the loss functions of D and G by gradient descent.
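A matching sketch of the generator loss in S44; the names generator_loss, pl_weight and pl_interval are illustrative assumptions, and the pl_penalty argument could come from the path-length sketch given earlier.

```python
import torch.nn.functional as F

def generator_loss(D, fake_images, pl_penalty=None, pl_weight=2.0, pl_interval=4):
    """Non-saturating generator loss, -log(sigmoid(D(fake))), plus an optional lazy PPL term."""
    loss = F.softplus(-D(fake_images)).mean()  # equals -log(probability the fake is judged real)
    if pl_penalty is not None:                 # add the path length term on regularization steps
        loss = loss + pl_weight * pl_interval * pl_penalty
    return loss
```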
In step S5, the sample of the face picture of the first person and the frame picture are input to the StyleGAN2 model to perform face fusion processing.
S51, obtaining a face picture sample and a frame picture of a first person
And acquiring a face picture sample of the first person from a file for storing the sample picture, and then acquiring a first frame picture sample from 1800 frames of pictures. The first frame picture sample contains a face image of a second person.
And S52, aligning the face image sample of the first person with the face in the first frame image sample by adopting an ASM algorithm.
ASM (Active Shape Model) abstracts a target object through a shape model. ASM is an algorithm based on the Point Distribution Model (PDM). In the PDM, the geometry of objects with similar shapes, such as human faces, can be represented by concatenating the coordinates of several key feature points (landmarks) into a shape vector. A face under ASM is usually described by 68 calibrated key feature points, and the ASM algorithm is divided into a training process and a searching process.
In the training phase: mainly constructing position constraint of each characteristic point and local characteristics of each characteristic point
In the stage of searching: mainly to the corresponding feature points. Calculating the position of eyes (or mouth), making simple scale and rotation change, and aligning the face; then, searching near each aligned point, and matching each local key point (usually adopting the Mahalanobis distance) to obtain a preliminary shape; then correcting the matching result by using the average human face (shape model); iterate until convergence.
As shown in fig. 3, white points in the figure are the above-mentioned key feature points, and the purpose of marking out the key feature points is to further determine the specific positions of the human face features on the basis of human face detection.
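The alignment itself can be pictured with a short OpenCV sketch that warps five key points (eyes, nose tip, mouth corners) onto a canonical template; the template coordinates and the function name align_face are illustrative assumptions rather than values from the patent.

```python
import cv2
import numpy as np

# Illustrative canonical five-point template (eyes, nose tip, mouth corners) for a 256x256 crop;
# the coordinates below are assumptions, not values given in the patent.
TEMPLATE = np.float32([[89, 110], [167, 110], [128, 152], [99, 196], [157, 196]])

def align_face(image, landmarks):
    """Warp the face so its five key points (e.g. from an ASM fit) match the template."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), TEMPLATE)
    return cv2.warpAffine(image, matrix, (256, 256))
```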
S53. The picture samples after face alignment are input into the StyleGAN2 model for face fusion processing to obtain the fused image.
Inputting the face picture sample of the first person after face alignment and the frame image into a StyleGAN2 model, fusing the face of the first person to the face position of the second person in the frame image through face feature fusion, outputting a fused target face picture, and storing the target face picture.
S54. The remaining pictures among the 1800 frame pictures are batch-fused according to the above steps to obtain the target face picture data set.
In step S6, the target face picture data set is framed as a video.
And acquiring 1800 target face pictures from the data set, and combining the pictures into a video through opencv to output the video.
Each picture is one frame; the frames are taken in order starting from the first, and each frame picture is passed as a parameter into the network.
The video name is set, with a frame rate of 30 and a video size of 256x256.
write() is called until the last frame picture has been traversed, at which point the newly generated target face video is output to the folder.
Finally, the final face-synthesized video is obtained.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (7)

1. A face video synthesis method based on deep learning is characterized by comprising the following steps:
S1, obtaining a source picture and a driving video for face synthesis and storing them in a raw material library; the face picture of a first person is used as the source picture, and a video of a second person performing facial actions is used as the driving video; the facial actions of the second person in the driving video drive the first person in the source picture to move, so that the face of the first person is synthesized onto the body of the second person;
s2, performing data processing on the source picture and the driving video;
s3, building a StyleGAN2 model and optimizing the model;
S4, training a StyleGAN2 model;
s5, inputting the face picture sample and the frame picture of the first person into a StyleGAN2 model for face fusion processing;
and S6, combining the frames of the target face picture data set into a video.
2. The method for synthesizing facial video based on deep learning of claim 1, wherein the specific method in step S1 is as follows:
S11, acquiring a picture containing the complete face of a first person from a picture library, wherein the face image on the picture is the target face to be synthesized into the video, and the 7 defined facial gestures are: left eye open, right eye open, mouth open; left eye open, right eye closed, mouth open; left eye open, right eye closed, mouth closed; left eye closed, right eye open, mouth open; left eye closed, right eye open, mouth closed; left eye closed, right eye closed, mouth open; and left eye closed, right eye closed, mouth closed;
s12, acquiring a driving video for synthesis: acquiring a second person speech video from a video library, and intercepting the video into a driving video with the duration of 1 minute, wherein the video contains the actual face appearance to be used;
and S13, storing the acquired source picture and the acquired driving video in an original material library.
3. The method for synthesizing a face video based on deep learning according to claim 1, wherein the specific method in step S2 is:
S21, processing the source picture: adjusting the picture size and enhancing the picture; specifically, a picture processing tool adjusts the source picture to the 256x256 input size of the deep learning model; with the midpoint between the brows of the first person's face in the source picture as the picture center, the picture is scaled at its original aspect ratio, and scaling stops when either the height or the width of the source picture is reduced to 256; a 256x256 square frame then selects the scaled source picture, the part of the picture beyond the square is cropped away, and a cropped face picture sample of the first person is finally obtained;
performing picture enhancement on the cropped picture sample, specifically adopting a color image enhancement technique: converting the color image into a gray image, performing histogram equalization on the gray image, restoring the enhanced gray image back into a color image, and storing the color image in the sample picture folder;
S22, processing the video and extracting pictures frame by frame: scaling and cropping the driving video to 256x256, i.e. scaling the video to a suitable size, placing the face at the center of a 256x256 dashed selection frame for cropping, and outputting the cropped driving video sample;
extracting the video frame by frame with opencv2: obtaining the video sample file, obtaining the total number of frames of the video, reading each frame of the video, capturing every frame, converting each frame into an output picture, setting the path for storing the frame pictures, and storing the frame pictures in the sample picture folder.
4. The method for synthesizing facial video based on deep learning according to claim 1, wherein in step S3 the StyleGAN2 model is built and optimized; specifically, in the StyleGAN2 model, AdaIN is reconstructed as Weight Demodulation, and a progressive training method is adopted, i.e. an image of small resolution is trained first, then after that training is completed the model transitions step by step to a higher resolution, trains stably at the current resolution, and then transitions step by step to the next higher resolution; R1 regularization is used when the data distribution begins to deviate and is not used at other times, and the path length is regularized.
5. The method for synthesizing a face video based on deep learning according to claim 1, wherein the specific method in step S4 is:
S41, randomly generating a batch of latent codes (latents) and obtaining a batch of fake images fake_images_out through the generator G;
S42, taking a batch of real images reals from the training data set;
S43, feeding the real images and the fake images respectively to the discriminator D to compute scores, computing the global cross entropy of the scores, namely the negative of the sum of the logarithm of the probability that a fake image is judged fake and the logarithm of the probability that a real image is judged real, and summing the global cross entropy with a regularization term as the loss function of the discriminator D;
S44, the loss function of the generator G considers only its own cross entropy, namely the negative of the logarithm of the probability that a fake image is judged real, which is then summed with the perceptual path length regularization term as the loss function of the generator G;
S45, the optimization process minimizes the loss functions of D and G by gradient descent.
6. The method for synthesizing a face video based on deep learning of claim 1, wherein the specific method in step S5 is:
s51, acquiring a face image sample of a first person from a file for storing a sample image, and then acquiring a first frame image sample from the image folder acquired in the step S22, wherein the first frame image sample comprises a face image of a second person;
S52, aligning the face picture sample of the first person with the face in the first frame picture sample by adopting the ASM algorithm, wherein ASM is the Active Shape Model, i.e. a target object is abstracted through a shape model, and the ASM algorithm comprises a training process and a searching process:
in the training phase: constructing position constraints of all the characteristic points, and constructing local characteristics of all the characteristic points;
in the stage of searching: iterating to corresponding feature points, firstly calculating the positions of eyes or mouths, making scale and rotation changes, and aligning the face; then, searching near each aligned point, and matching each local key point to obtain a preliminary shape; then correcting the matching result by using the average human face; iterating until convergence;
and S53, inputting the face image sample of the first person with the aligned face and the frame image into a StyleGAN2 model, fusing the face of the first person to the face position of a second person in the frame image through face feature fusion, outputting a fused target face image, and storing the target face image.
and S54, performing batch fusion processing on the remaining pictures in the picture folder obtained in step S22 according to the above steps to obtain the target face picture data set.
7. The method for synthesizing facial video based on deep learning of claim 1, wherein the specific method of step S6 is:
S61, all target face pictures are obtained from the data set; each picture is one frame, the frames are obtained in order starting from the first frame, and each frame picture is passed as a parameter into opencv;
s62, setting a video name, wherein the frame rate is 30, and the video size is 256x 256;
s63, calling write () until the last frame of picture is traversed, and outputting a newly generated target face video to be output in a folder;
and S64, obtaining the final face synthetic video.
CN202210241632.1A 2022-03-11 2022-03-11 Face video synthesis method based on deep learning Withdrawn CN114677312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210241632.1A CN114677312A (en) 2022-03-11 2022-03-11 Face video synthesis method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210241632.1A CN114677312A (en) 2022-03-11 2022-03-11 Face video synthesis method based on deep learning

Publications (1)

Publication Number Publication Date
CN114677312A true CN114677312A (en) 2022-06-28

Family

ID=82072465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210241632.1A Withdrawn CN114677312A (en) 2022-03-11 2022-03-11 Face video synthesis method based on deep learning

Country Status (1)

Country Link
CN (1) CN114677312A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664456A (en) * 2023-08-02 2023-08-29 暨南大学 Picture reconstruction method and system based on gradient information and electronic equipment
CN116664456B (en) * 2023-08-02 2023-11-17 暨南大学 Picture reconstruction method and system based on gradient information and electronic equipment

Similar Documents

Publication Publication Date Title
CN111489287B (en) Image conversion method, device, computer equipment and storage medium
CN109376582B (en) Interactive face cartoon method based on generation of confrontation network
AU2017101166A4 (en) A Method For Real-Time Image Style Transfer Based On Conditional Generative Adversarial Networks
US20220121931A1 (en) Direct regression encoder architecture and training
CN109919830B (en) Method for restoring image with reference eye based on aesthetic evaluation
Tuzel et al. Global-local face upsampling network
CN112541864A (en) Image restoration method based on multi-scale generation type confrontation network model
CN113807265B (en) Diversified human face image synthesis method and system
CN111931908B (en) Face image automatic generation method based on face contour
US11587288B2 (en) Methods and systems for constructing facial position map
CN113112416B (en) Semantic-guided face image restoration method
CN115170559A (en) Personalized human head nerve radiation field substrate representation and reconstruction method based on multilevel Hash coding
CN115914505B (en) Video generation method and system based on voice-driven digital human model
CN110853119A (en) Robust reference picture-based makeup migration method
CN113343878A (en) High-fidelity face privacy protection method and system based on generation countermeasure network
CN115359534B (en) Micro-expression identification method based on multi-feature fusion and double-flow network
CN113781324A (en) Old photo repairing method
KR20230085931A (en) Method and system for extracting color from face images
Salmona et al. Deoldify: A review and implementation of an automatic colorization method
Xu et al. A high resolution grammatical model for face representation and sketching
CN117157673A (en) Method and system for forming personalized 3D head and face models
CN114677312A (en) Face video synthesis method based on deep learning
CN114820303A (en) Method, system and storage medium for reconstructing super-resolution face image from low-definition image
CN111062899A (en) Guidance-based blink video generation method for generating confrontation network
CN116523985B (en) Structure and texture feature guided double-encoder image restoration method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220628