WO2022205416A1

WO2022205416A1 - Generative adversarial network-based facial expression generation method

Info

Publication number: WO2022205416A1
Application number: PCT/CN2021/085263
Authority: WO
Inventors: 王蕊; 施璠; 曲强; 姜青山
Original assignee: 深圳先进技术研究院
Priority date: 2021-04-02
Filing date: 2021-04-02
Publication date: 2022-10-06

Abstract

Disclosed is a generative adversarial network-based facial expression generation method. The method comprises: constructing a deep learning network model, which comprises a recurrent neural network, a generator, an image discriminator, a first video discriminator, and a second video discriminator, wherein the recurrent neural network produces a time-related movement vector for an input image, the generator takes the movement vector and the input image as input, and outputs a corresponding video frame, the image discriminator is used for determining the authenticity of each video frame, the first video discriminator determines the authenticity of a video and performs classification, and the second video discriminator controls the realness and smoothness of a generated video change; training the deep learning network model using sample images containing different expression types as input; and using the trained generator to generate a facial video in real time. The present invention is able to generate an expression while retaining a facial feature, a generated video preserves continuity and realness, and same has a generalization ability for different human faces.

Description

A Generative Adversarial Network-Based Facial Expression Generation Method

technical field

The invention relates to the technical field of computer vision, and more particularly, to a method for generating facial expressions based on a generative confrontation network.

Background technique

In terms of face generation, 3DMM (Face 3D Deformation Statistical Model) generates faces by changing parameters such as shape, texture, pose, and illumination. DRAW (Deep Recursive Writer) uses Recurrent Neural Network (RNN) for image generation, and Pixel CNN uses Convolutional Neural Network (CNN) instead of RNN to achieve pixel-by-pixel image generation.

Generative Adversarial Networks (GANs) have been widely used in image generation after the emergence, and more and more GAN-based models are applied to facial expression conversion. For example, ExprGAN (Intensity-Controlled Expression Editing) combines conditional generative adversarial networks and adversarial auto-decoders to achieve facial expression translation. Another example, Facelet-Bank, based on the fixed decoder and decoder, trains a network representing the difference between the two domains according to the target input domain and output domain, so as to realize face image editing.

Currently, one of the main methods for generating video from images is motion sequence prediction. For example, ConvLSTM (Convolutional Long Short-Term Memory Network) predicts future video frames through a combination of recurrent neural networks and convolutional neural networks; VGAN (Vondrick C, etc.) uses GAN to realize video recognition in addition to expressive video recognition. Generation; TGAN (Turing Test-based Generative Adversarial Model) points out that video can be jointly generated by temporal generator and image generator, that is, a set of time-related sequence frames are generated, in addition, TGAN uses WGAN (Wasserstein GAN) structure to make training More stable; HP (Villegas R, etc.) divides the generation of the video into two independent steps. The first step uses a recurrent neural network to predict the key points, and the second step realizes the video frame by frame according to the position of the predicted key points. generate.

Another way to achieve image-to-video generation is frame-by-frame generation of video. This method no longer needs to consider the relationship between the video frames before and after, that is, the problem of video generation is converted into a simpler image generation problem, and the degree of change of each frame is controlled by coefficients. ExprGAN can control the expression level in the facial expression editing experiment, and can generate expression video by setting the continuously increasing expression level. Image2video (picture to video) combines the basic encoder and the residual encoder, and realizes the frame-by-frame generation of video by changing the coefficient size of the feature map obtained by the residual encoder, that is, the variable of the degree of change.

After analysis, in the existing deep learning-based expression video generation scheme, the video is usually generated based on noise, but because of the small expression database and other reasons, the generated face is relatively single, and the face cannot be specified; Models for videos are less effective in terms of facial expressions.

SUMMARY OF THE INVENTION

The purpose of the present invention is to overcome the above-mentioned defects of the prior art, and to provide a method for generating facial expressions based on a generative confrontation network, and the generated video maintains continuity and authenticity.

The technical scheme of the present invention is to provide a method for generating facial expressions based on a generative confrontation network, the method comprising the following steps:

constructing a deep learning network model, the deep learning network model includes a recurrent neural network, a generator, an image discriminator, a first video discriminator and a second video discriminator, wherein the recurrent neural network generates time-dependent motion vectors for the input image; generating The first video discriminator is used for judging the authenticity of the video and classifying the video; the image discriminator is used for judging the authenticity of each video frame; The second video discriminator assists the first video discriminator in controlling the authenticity and smoothness of the generated video changes;

Use sample images containing different expression categories as input, and train the deep learning network model with the set objective function as the optimization goal;

Generate face videos in real-time with a trained generator.

Compared with the prior art, the present invention has the advantages that, by improving the structure of the generator, the generation from the face image to the expression video can be better realized; The objective function is more suitable for the generation from face images to expression videos. It retains the facial features while generating expressions, and the generated video maintains the continuity and authenticity, and has the ability to generalize to different faces.

Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention.

1 is a flowchart of a method for generating facial expressions based on generative adversarial networks according to an embodiment of the present invention;

2 is a schematic diagram of the overall structure of a deep learning network model according to an embodiment of the present invention;

3 is a schematic diagram of a generator network structure according to an embodiment of the present invention;

4 is an effect diagram of different people doing "happy" expressions according to an embodiment of the present invention;

5 is an effect diagram of the same person doing three expressions of joy, sadness and surprise according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a video change curve according to an embodiment of the present invention.

Detailed ways

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the invention unless specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and apparatus should be considered part of the specification.

In all examples shown and discussed herein, any specific values should be construed as illustrative only and not limiting. Accordingly, other instances of the exemplary embodiment may have different values.

It should be noted that like numerals and letters refer to like items in the following figures, so once an item is defined in one figure, it need not be discussed further in subsequent figures.

Referring to Fig. 1, the method for generating facial expressions based on generative confrontation network provided by the present invention includes: step S110, constructing a deep learning network model, the deep learning network model includes a recurrent neural network, a generator, an image discriminator, The first video discriminator and the second video discriminator, wherein the cyclic neural network generates time-related motion vectors for the input image; the generator is used to splicing the motion vector generated by the cyclic neural network and the input image in the channel layer as input, and the output corresponding The image discriminator judges each video frame; the first video discriminator is used to judge whether the video is real and used for classification; the second video discriminator assists the first video discriminator to control the authenticity and smoothness of the generated video changes Step S120, using sample images containing different expression categories as input, and training the deep learning network model with the set objective function as the optimization target; Step S130, using the trained generator to generate face video in real time.

In the following, we will take the improvement of the MoCoGAN framework (Motion and Content Decomposed GAN) as an example to illustrate. MoCoGAN is based on Recurrent Neural Network and infoGAN (Information Generative Adversarial Network) to realize the generation from noise to video. On this basis, the invention modifies the structure of the generator, so that it can better realize the generation from the face image to the expression video, and adds a local video discriminator and a conditional image discriminator, and redefines the target function to adapt it to the generation problem from face images to expression videos.

Figure 2 is the overall structure of the constructed deep learning network model, the main part of which includes a recurrent neural network (taken as an example with multiple gated recurrent units, marked as GRU cell), a generator (marked as G), three a discriminator. A recurrent neural network is used to generate time-dependent motion sequences; the input image and the resulting motion sequences are used as inputs to the generator, resulting in video frames; the image discriminator (labeled _{Di img} ) judges each video frame; the first A video discriminator (labeled as D _V or D _V /Q) is used to judge whether the video is real and used for classification, and another video discriminator (labeled as D _patch , also known as local video discriminator) assists the first discriminator for Control the realism and smoothness of changes in the generated video.

Specific embodiments of the recurrent neural network, the generator and each discriminator involved in FIG. 2 will be introduced below.

1) Recurrent Neural Network

The generation of video is mainly controlled by content and motion. In a short video clip, when the time is short enough, the content of the video can be regarded as unchanged (that is, the picture in the video is not switched, and the people, objects, and scenes in the picture do not occur. change), the change of the motion sequence causes the dynamic change of the video (such as the movement and deformation of the people, objects and scenes in the picture). In a generative adversarial network, random noise is required for each generation to generate a different output. In this embodiment, the video content is controlled by the input image, and the resulting changes are controlled by a vector representing motion.

When using the same generator to generate different frames of a video, it is necessary to ensure that the content of different frames of the same video is the same and the motion vector changes. The same content between video frames is guaranteed by inputting the same image. At this time, if the motion vector is completely random, the video frame generated each time will also map the content to a random distribution, which cannot guarantee that the video is continuous. There is no guarantee that it will make sense in reality. To ensure continuity between these video frames and the resulting video is meaningful, the motion sequences between different frames need to be correlated.

In one embodiment, a Recurrent Neural Network (RNN) is used to process sequence data to solve context-related or time-related problems. The recurrent neural network has memory, the previous state information will be remembered and passed on, thus affecting the next output, that is, each output is determined by the state of the previous step and the current input. A recurrent neural network consists of multiple cells, each of which shares weights and is only connected to the cells before and after, and the hidden states are transmitted between the connected cells in the direction. In the video generation problem, the recurrent neural network can be used to generate the motion state sequence, so as to control the correlation in the generated video timing and ensure the continuous change of the video. Gated Recurrent Unit (GRU) is one of the variants of RNN, which solves the problem of gradient disappearance and forgetting information caused by long sequence network. On the basis of retaining the characteristics of the combination of short-term memory and long-term memory of the network The network structure is simplified.

In this embodiment, a gated recurrent neural network is used to map class labels and t independent and identically distributed noises into t sequences representing motion relationships, which are used to control expression changes in the video.

h(k+1)=GRUcell(h(k),[z[k],c])k=0,1,…,t-1 (1)

Where h(k+1) represents the motion vector of the kth frame, which is also the hidden state passed to the next frame, h(0) is the random initial state, and z[k] is the random noise obeying the N(0,1) distribution , c is the class label, GRU cell is the gated recurrent unit, z[k] and c are concatenated to become the current input of the kth gated recurrent unit.

2) Generator

In one embodiment, the generator that encodes and decodes the input image is based on the U-net structure. As shown in FIG. 3 , the generator uses the motion vector and the input image to be spliced at the channel layer as input, and outputs the corresponding video frame. For example, the generator includes seven layers of convolution for downsampling and a corresponding seven layers of deconvolution for upsampling.

The generator outputs one image at a time, and each frame of the generated video shares a generator. For a video, the content x in the video is the same, each time you only need to change the motion vector h(k) to get different video frames, and the continuity of the video frames is controlled by the motion sequence h, the same video motion sequence The correlation of is guaranteed by the recurrent neural network, that is, h=R(z,c), where R is the recurrent neural network, c is the category label (for example, including three expressions of joy, surprise, and sadness), and z is random noise. generated video

If the recurrent neural network is also considered as part of the generator, then there is

In video generation, a face image with neutral expression is input, and the output video will be changed on this basis, that is, it will be regarded as the first frame of the video through reconstruction. In addition to the adversarial loss function mentioned below, here the pixel-level reconstruction error is used as the objective function, so that the first frame of the output video is consistent with the input image, and the loss function uses the L1 norm (sum of absolute errors). As a rule of thumb, the L2 norm (square root of the sum of squared errors) is more prone to blurring than the L1 norm for reconstructing images.

For example, the reconstruction loss function of the generator is expressed as:

in

Represents the first frame of the generated video, x is the input face image, and P(.) represents the corresponding distribution.

In this embodiment, the generator adopts the U-net structure, which is beneficial to not only transfer the features extracted by each layer of the encoder to the next layer, but also directly transfer important information to the corresponding decoder layer, avoiding In order to lose some information in downsampling, it can well preserve the features of shallow and deep layers.

3) Image discriminator

Traditional generative adversarial networks achieve target output from noise, which cannot effectively control the type of output or enable image editing. On this basis, Conditional Generative Adversarial Networks (CGAN) proposes a semi-supervised learning method. In addition to input random noise, a new condition is added as a constraint, and this condition is used in the judgment of the discriminator. The conditions here can be class labels, feature vectors, or even images. The objective function of CGAN is:

where P _data (x) is the true distribution of the domain to which the input data belongs, P _z (z) is the distribution of random noise, P _y (y) is the conditional distribution, G is the generator, and G(z, y) is the input noise and the generated data obtained under certain conditions, D is the discriminator, which is used to judge whether the data and labels are real.

In an embodiment of the present invention, the image discriminator adopts the structure of CGAN, takes the first frame of the video in the data set as the condition, and the input when generating the video as the target condition. The image discriminator constrains each frame of the output video individually, regardless of the relationship between the preceding and following video frames. The image discriminator is trained by splicing the first frame of the video in the training data and any frame in the middle as a real sample, and using the generator's input image and any frame in the output video as a fake sample. The image discriminator can not only judge whether a video frame is a real image, but also constrain the relationship between the generated video frame and the input image.

For example, the loss function of the image discriminator _{Di img} is expressed as:

where P _video (v) is the distribution of the real video, v[0] represents the first frame of video v, v[t] represents the (t+1)th frame of the video, P _z (z) is random noise, and c is The target category, G is the generator, and D _img is the image discriminator.

4) Video Discriminator D _V

In the first video discriminator _DV , the input is no longer a two-dimensional image, but a time dimension is added, that is, a three-dimensional data spliced by multiple video frames of a video segment. The two-dimensional convolution structure in the traditional discriminator is no longer applicable, and three-dimensional convolution is needed to deal with spatio-temporal related problems. The idea of three-dimensional convolution is the same as that of two-dimensional convolution. The filter is controlled by three-dimensional convolution kernel, stride, and padding, and the feature map is obtained by sliding the global.

The video discriminator DV classifies _videos while judging the authenticity. In one embodiment, instead of using the structure of CGAN, the idea of adding a classifier to the infoGAN is adopted, so that the network can be tuned by changing the weight of the classification error in the objective function. The classifier shares the weights with the video discriminator, and only increases the number of channels in the output to represent the category, which simplifies the network model. The input of the infoGAN discriminator no longer needs labels, only real data and generated data, and the output of the video discriminator should classify the input in addition to judging the authenticity. In order to calculate the cross-entropy loss function, the category labels are, for example, one-hot encoding, which maps N category labels to N-dimensional 0-1 vectors. When calculating the loss function, one-hot encoding can eliminate the influence of category numbers. , which is beneficial to measure the distance between different categories.

In the process of training the video discriminator, on the one hand, it is necessary to distinguish the authenticity of the video. The adversarial loss function of the video discriminator D _V is expressed as:

On the other hand, it is necessary to classify videos, i.e. to train a video classifier. The videos in the training set are all labeled with categories. The videos in the training set are classified, and the cross entropy is calculated for the predicted category and the actual category obtained by the classifier as a loss function, and the classifier is optimized by reducing the classification error rate. In the generated result, the cross-entropy of the predicted category and the target category obtained by the classifier is calculated as the loss function, so that the generator can generate the expressions of the specified category and achieve the purpose of optimizing the generator.

In one embodiment, the objective function for training the classifier Q is expressed as:

The classification loss function for training the generator G is expressed as:

where P _video (v, c) is the distribution of real videos and their labels, Q is the classification network, P _z (z) is random noise, c is the target category, and G is the generator.

5) Local video discriminator

In addition to the overall discriminator, a video discriminator D _patch for judging local regions is introduced. At this time, the discriminator no longer needs to deal with classification tasks. The local video discriminator D _patch is used to ensure the smoothness of video changes and the authenticity of video frames. Sex, the structure is simpler and easier to train. The introduction of a local video discriminator can make the training of the generator and the video discriminator D _V tend to balance, and provide the generator with an optimization space. Prevent the video discriminator D _V from being over-trained in the early stage of training, that is, the video discriminator D _V can accurately separate the correct and wrong samples, which makes the generator difficult to train.

In one embodiment, the adversarial loss function of the local video discriminator D _patch is expressed as:

In one embodiment, the overall objective function of the provided deep learning network model is expressed as:

min _G max _D loss=l _{img_adv} +λ ₁ l _{vid_adv} +λ ₂ l _{patch_adv} +λ ₃ l _cat +λ ₄ l _rec (9)

D includes image discriminator D _img , video discriminator D _V , local video discriminator D _patch , G includes generator and recurrent neural network, λ ₁ , λ ₂ , λ ₃ , λ ₄ are self-defined parameters, It can be determined by experience or simulation.

When training the above deep learning network model, sample images containing different expression categories are used as the training set. For example, select a frontal face, use the automatic crop tool to crop the database image to a size of 128*128 pixels, and the face in the center accounts for 80% of the entire image; in the training process, the batch size is 16, the video length is 8, and the input image size is Take 128*128 pixels; for the hyperparameters in the objective function, take λ ₁ =1, λ ₂ =1, λ ₃ =10, λ ₄ =10; choose Adam optimizer, the learning rate is set to 0.0002, β ₁ =0.5 , β ₂ =0.999, and the weight decay is set to 0.00001. The specific training process will not be repeated in the present invention.

In order to verify the effect of the present invention, a qualitative analysis was first carried out. Select some frontally identifiable faces from the CelebA database as the test set, and then use them as input after cropping. Visualize different people's "happy" expressions (as shown in Figure 4) and the same person's three expressions of happiness, sadness, and surprise (as shown in Figure 5). The experimental results show that the generation from face image to face expression video can be realized through the control of different expression tags, and the generated video changes continuously, clearly and with obvious expressions.

Further, quantitative analysis was performed. In order to judge whether the change of the video is smooth, that is, the temporal continuity, the offset distance of marker points is introduced as a metric. The specific method is: based on the face detection dlib library, detect the position of 68 key points of the face for each frame in the video, and calculate the L1 norm of the key point position of each frame and the position of the key point of the first frame as the distance. , with the time (ie the number of frames) as the abscissa and the distance (lanmark distance) as the ordinate, draw the curve as shown in Figure 6 .

In the training data, the offset distances of the first 8 frames of CK+ (expression database) and the first 8 frames of MMI are calculated respectively, and it can be found that they are all in a state of gradual rise, indicating that a person moves from a neutral expression to an expression. In the peak state, the offset of the marker point changes continuously and increases gradually. The variation of CK+ exceeds twice that of MMI, because during data extraction, the frames of CK+ are more sparse, while MMI is denser, that is, the difference in the number of frames extracted per unit time results in the difference in the offset distance of the marker points. . Validating CK+ with the trained model will find that its change is less obvious than the training set. During the CelebA test, it was found that its changes were generally consistent with the changes in the MMI training set and CK+ verification, indicating that the changes in the generated video were gentle and continuous, and there were no sudden changes or discontinuous pictures.

To sum up, the present invention designs a deep learning network model including a cyclic neural network, a generator and three discriminators, so that the generated video changes continuously, clearly, with obvious expressions, and there is no sudden change, discontinuous picture, etc.; By designing the reconstruction loss function of the generator, the loss function of the image discriminator D _img , the adversarial loss function of the video discriminator D _V , the objective function of the classifier Q, the classification loss function of the generator G, the local video discriminator D _patch The adversarial loss function and the overall objective function improve the accuracy of facial expression generation; in addition, by designing a U-net-based generator, the shallow and deep features can be well preserved, which further improves the performance of the generated video. clarity.

The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.

A computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (non-exhaustive list) of computer readable storage media include: portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM) or flash memory), static random access memory (SRAM), portable compact disk read only memory (CD-ROM), digital versatile disk (DVD), memory sticks, floppy disks, mechanically coded devices, such as printers with instructions stored thereon Hole cards or raised structures in grooves, and any suitable combination of the above. Computer-readable storage media, as used herein, are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (eg, light pulses through fiber optic cables), or through electrical wires transmitted electrical signals.

The computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .

The computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages. Source or object code written in any combination, including object-oriented programming languages, such as Smalltalk, C++, Python, etc., and conventional procedural programming languages, such as the "C" language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server implement. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (eg, using an Internet service provider through the Internet connect). In some embodiments, custom electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized by utilizing state information of computer readable program instructions. Computer readable program instructions are executed to implement various aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer or other programmable data processing apparatus to produce a machine that causes the instructions when executed by the processor of the computer or other programmable data processing apparatus , resulting in means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagrams. These computer readable program instructions can also be stored in a computer readable storage medium, these instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer readable medium on which the instructions are stored includes An article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagrams.

Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process , thereby causing instructions executing on a computer, other programmable data processing apparatus, or other device to implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more functions for implementing the specified logical function(s) executable instructions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions , or can be implemented in a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.

Various embodiments of the present invention have been described above, and the foregoing descriptions are exemplary, not exhaustive, and not limiting of the disclosed embodiments. Numerous modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims

A method for generating facial expressions based on generative adversarial networks, comprising the following steps:

constructing a deep learning network model, the deep learning network model includes a recurrent neural network, a generator, an image discriminator, a first video discriminator and a second video discriminator, wherein the recurrent neural network generates time-dependent motion vectors for the input image; generating The first video discriminator is used for judging the authenticity of the video and classifying the video; the image discriminator is used for judging the authenticity of each video frame; The second video discriminator assists the first video discriminator in controlling the authenticity and smoothness of the generated video changes;

Use sample images containing different expression categories as input, and train the deep learning network model with the set objective function as the optimization goal;

Generate face videos in real-time with a trained generator.
The method of claim 1, wherein the generator is constructed based on a U-net structure, comprising a multi-layer convolutional layer for downsampling, and a multi-layer deconvolution corresponding to the multi-layer convolutional layer Used to implement upsampling.
The method of claim 1, wherein the objective function is set to:

where D includes the image discriminator, the first video discriminator and the second video discriminator, G includes the generator and the recurrent neural network, λ 1 , λ 2 , λ 3 , λ 4 are hyperparameters, and limg_adv is the image discriminator’s Loss function, l vid_adv is the adversarial loss function of the first video discriminator, l patch_adv is the adversarial loss function of the second video discriminator, l cat is the classification loss function of the generator, and l rec is the reconstruction loss function of the generator.
The method according to claim 3, wherein the loss function of the image discriminator is expressed as:

where P video (v) is the distribution of the real video, v[0] represents the first frame of video v, v[t] represents the (t+1)th frame of the video, P z (z) is random noise, and c is The target category, G represents the generator, D img represents the image discriminator, and P data (x), P z (z), and P c (c) represent the distributions of x, z, and c, respectively.
The method according to claim 3, wherein the adversarial loss function of the first video discriminator D V is expressed as:

Among them, c is the target category, P z (z) is random noise, x is the input face image, v is the video frame, P video (v) is the distribution of the video, P data (x), P z (z) and P c (c) denote the distribution of x, z and c, respectively.
The method according to claim 3, wherein the adversarial loss function of the second video discriminator D patch is expressed as:

Among them, c is the target category, P z (z) is random noise, z is random noise, x is the input face image, P video (v) is the distribution of the video, P data (x), P z (z) and P c (c) denote the distribution of x, z and c, respectively.
The method according to claim 3, wherein the classification loss function of the generator is expressed as:

where Q is the classification network, P z (z) is random noise, c is the target category, G is the generator, z is random noise, x is the input face image, P data (x), P(z) and P c (c) represents the distribution of x, z and c, respectively.
The method of claim 3, wherein the reconstruction loss function of the generator is expressed as:

in
Represents the first frame of the generated video, x is the input face image,
represents the generated video,
express
The distribution of , P x (x) represents the distribution of x.
A computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
A computer device, comprising a memory and a processor, a computer program that can be run on the processor is stored in the memory, and characterized in that, when the processor executes the program, any one of claims 1 to 8 is implemented The steps of the method described in item.