WO2022205416A1 - Generative adversarial network-based facial expression generation method - Google Patents

Generative adversarial network-based facial expression generation method Download PDF

Info

Publication number
WO2022205416A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
discriminator
generator
image
loss function
Prior art date
Application number
PCT/CN2021/085263
Other languages
French (fr)
Chinese (zh)
Inventor
王蕊
施璠
曲强
姜青山
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Priority to PCT/CN2021/085263 priority Critical patent/WO2022205416A1/en
Publication of WO2022205416A1 publication Critical patent/WO2022205416A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • the invention relates to the technical field of computer vision, and more particularly, to a method for generating facial expressions based on a generative adversarial network.
  • 3DMM Face 3D Deformation Statistical Model
  • DRAW Deep Recursive Writer
  • RNN Recurrent Neural Network
  • CNN Convolutional Neural Network
  • GANs Generative Adversarial Networks
  • ExprGAN Intensity-Controlled Expression Editing
  • Facelet-Bank, based on a fixed encoder and decoder, trains a network representing the difference between the target input domain and output domain, so as to realize face image editing.
  • ConvLSTM Convolutional Long Short-Term Memory Network
  • VGAN Vondrick C et al.
  • TGAN Turing Test-based Generative Adversarial Model
  • HP Villegas R et al.
  • Another way to achieve image-to-video generation is frame-by-frame generation of video.
  • This method no longer needs to consider the relationship between the video frames before and after, that is, the problem of video generation is converted into a simpler image generation problem, and the degree of change of each frame is controlled by coefficients.
  • ExprGAN can control the expression level in the facial expression editing experiment, and can generate expression video by setting the continuously increasing expression level.
  • Image2video (picture to video) combines the basic encoder and the residual encoder, and realizes the frame-by-frame generation of video by changing the coefficient size of the feature map obtained by the residual encoder, that is, the variable of the degree of change.
  • the video is usually generated from noise, but because expression databases are small, among other reasons, the generated faces are relatively uniform and a specific face cannot be designated; models that generate video from an image perform poorly on facial expressions.
  • the purpose of the present invention is to overcome the above-mentioned defects of the prior art and to provide a method for generating facial expressions based on a generative adversarial network in which the generated video maintains continuity and authenticity.
  • the technical solution of the present invention is to provide a method for generating facial expressions based on a generative adversarial network, the method comprising the following steps:
  • the deep learning network model includes a recurrent neural network, a generator, an image discriminator, a first video discriminator and a second video discriminator, wherein the recurrent neural network generates time-related motion vectors for the input image; the generator takes the motion vectors and the input image as input and outputs the corresponding video frames; the image discriminator judges the authenticity of each video frame; the first video discriminator judges the authenticity of the video and classifies it; and the second video discriminator assists the first video discriminator in controlling the authenticity and smoothness of the generated video changes;
  • the present invention has the advantages that, by improving the structure of the generator, the generation from the face image to the expression video can be better realized;
  • the objective function is more suitable for the generation from face images to expression videos. It retains the facial features while generating expressions, and the generated video maintains the continuity and authenticity, and has the ability to generalize to different faces.
  • FIG. 1 is a flowchart of a method for generating facial expressions based on generative adversarial networks according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of the overall structure of a deep learning network model according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a generator network structure according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a video change curve according to an embodiment of the present invention.
  • the method for generating facial expressions based on a generative adversarial network includes: step S110, constructing a deep learning network model, the deep learning network model including a recurrent neural network, a generator, an image discriminator, a first video discriminator and a second video discriminator, wherein the recurrent neural network generates time-related motion vectors for the input image, and the generator concatenates the motion vectors generated by the recurrent neural network with the input image along the channel dimension as input and outputs the corresponding video frames;
  • the image discriminator judges each video frame; the first video discriminator judges whether the video is real and performs classification; the second video discriminator assists the first video discriminator in controlling the authenticity and smoothness of the generated video changes;
  • step S120, using sample images containing different expression categories as input and training the deep learning network model with the set objective function as the optimization target; step S130, using the trained generator to generate face videos in real time.
  • MoCoGAN is based on Recurrent Neural Network and infoGAN (Information Generative Adversarial Network) to realize the generation from noise to video.
  • infoGAN Information Generative Adversarial Network
  • the invention modifies the structure of the generator, so that it can better realize the generation from the face image to the expression video, and adds a local video discriminator and a conditional image discriminator, and redefines the target function to adapt it to the generation problem from face images to expression videos.
  • Figure 2 shows the overall structure of the constructed deep learning network model, the main parts of which include a recurrent neural network (illustrated with multiple gated recurrent units, marked as GRU cells), a generator (marked as G), and three discriminators.
  • a recurrent neural network is used to generate time-dependent motion sequences; the input image and the resulting motion sequence are used as inputs to the generator to obtain video frames; the image discriminator (labeled D_img) judges each video frame; the first video discriminator (labeled D_V or D_V/Q) is used to judge whether the video is real and to classify it, and another video discriminator (labeled D_patch, also known as the local video discriminator) assists the first discriminator in controlling the realism and smoothness of the changes in the generated video.
  • the generation of video is mainly controlled by content and motion.
  • the content of the video can be regarded as unchanged (that is, the scene in the video does not switch, and the people, objects and scenery in the picture do not change)
  • the change of the motion sequence causes the dynamic change of the video (such as the movement and deformation of the people, objects and scenes in the picture).
  • random noise is required for each generation to generate a different output.
  • the video content is controlled by the input image, and the resulting changes are controlled by a vector representing motion.
  • a Recurrent Neural Network is used to process sequence data to solve context-related or time-related problems.
  • the recurrent neural network has memory, the previous state information will be remembered and passed on, thus affecting the next output, that is, each output is determined by the state of the previous step and the current input.
  • a recurrent neural network consists of multiple cells, each of which shares weights and is only connected to the cells before and after, and the hidden states are transmitted between the connected cells in the direction.
  • the recurrent neural network can be used to generate the motion state sequence, so as to control the correlation in the generated video timing and ensure the continuous change of the video.
  • the gated recurrent unit (GRU) is a variant of the RNN that solves the problems of vanishing gradients and forgotten information caused by long sequences, and simplifies the network structure while retaining the network's combination of short-term and long-term memory.
  • a gated recurrent neural network is used to map class labels and t independent and identically distributed noises into t sequences representing motion relationships, which are used to control expression changes in the video.
  • h(k+1) represents the motion vector of the kth frame, which is also the hidden state passed to the next frame
  • h(0) is the random initial state
  • z[k] is the random noise obeying the N(0,1) distribution
  • c is the class label
  • GRU cell is the gated recurrent unit
  • z[k] and c are concatenated to become the current input of the kth gated recurrent unit.
  • the generator that encodes and decodes the input image is based on the U-net structure. As shown in FIG. 3, the generator takes the motion vector and the input image, concatenated along the channel dimension, as input and outputs the corresponding video frame. For example, the generator includes seven convolution layers for downsampling and seven corresponding deconvolution layers for upsampling.
  • the generator outputs one image at a time, and each frame of the generated video shares a generator.
  • the content x in the video is the same; each time, only the motion vector h(k) needs to be changed to obtain different video frames, the continuity of the video frames is controlled by the motion sequence h, and the correlation of the motion sequence within the same video is guaranteed by the recurrent neural network.
  • the output video varies on the basis of the input image, which is treated as the first frame of the video through reconstruction.
  • the pixel-level reconstruction error is used as the objective function, so that the first frame of the output video is consistent with the input image, and the loss function uses the L1 norm (sum of absolute errors).
  • the L2 norm is the square root of the sum of squared errors
  • the L2 norm tends to produce blurrier reconstructed images than the L1 norm.
  • the reconstruction loss function of the generator is expressed as:
  • the generator adopts the U-net structure, which allows the features extracted by each encoder layer not only to be passed to the next layer but also to be passed directly to the corresponding decoder layer, avoiding the loss of information during downsampling and preserving both shallow and deep features well.
  • CGAN (Conditional Generative Adversarial Networks) proposes a semi-supervised learning method: in addition to the input random noise, a condition is added as a constraint, and this condition is used in the discriminator's judgment.
  • the conditions here can be class labels, feature vectors, or even images.
  • the objective function of CGAN is:
  • P data (x) is the true distribution of the domain to which the input data belongs
  • P z (z) is the distribution of random noise
  • P y (y) is the conditional distribution
  • G is the generator
  • G(z, y) is the data generated from the input noise z under condition y
  • D is the discriminator, which is used to judge whether the data and labels are real.
  • the image discriminator adopts the structure of CGAN, taking the first frame of a video in the dataset as the condition for real samples and the input image used when generating the video as the target condition.
  • the image discriminator constrains each frame of the output video individually, regardless of the relationship between the preceding and following video frames.
  • the image discriminator is trained by splicing the first frame of the video in the training data and any frame in the middle as a real sample, and using the generator's input image and any frame in the output video as a fake sample.
  • the image discriminator can not only judge whether a video frame is a real image, but also constrain the relationship between the generated video frame and the input image.
  • the loss function of the image discriminator D img is expressed as:
  • P video (v) is the distribution of the real video
  • v[0] represents the first frame of video v
  • v[t] represents the (t+1)th frame of the video
  • P z (z) is random noise
  • c is the target category
  • G is the generator
  • D img is the image discriminator.
  • the input is no longer a two-dimensional image, but a time dimension is added, that is, a three-dimensional data spliced by multiple video frames of a video segment.
  • the two-dimensional convolution structure in the traditional discriminator is no longer applicable, and three-dimensional convolution is needed to deal with spatio-temporal related problems.
  • the idea of three-dimensional convolution is the same as that of two-dimensional convolution.
  • the filter is controlled by a three-dimensional convolution kernel, stride and padding, and the feature map is obtained by sliding it over the whole input.
  • the video discriminator DV classifies videos while judging the authenticity.
  • the idea of adding a classifier to the infoGAN is adopted, so that the network can be tuned by changing the weight of the classification error in the objective function.
  • the classifier shares the weights with the video discriminator, and only increases the number of channels in the output to represent the category, which simplifies the network model.
  • the input of the infoGAN discriminator no longer needs labels, only real data and generated data, and the output of the video discriminator should classify the input in addition to judging the authenticity.
  • the category labels use, for example, one-hot encoding, which maps N category labels to N-dimensional 0-1 vectors; when calculating the loss function, one-hot encoding eliminates the influence of the category numbering, which is beneficial for measuring the distance between different categories.
  • the adversarial loss function of the video discriminator D V is expressed as:
  • the videos in the training set are all labeled with categories.
  • the videos in the training set are classified, and the cross entropy is calculated for the predicted category and the actual category obtained by the classifier as a loss function, and the classifier is optimized by reducing the classification error rate.
  • the cross-entropy of the predicted category and the target category obtained by the classifier is calculated as the loss function, so that the generator can generate the expressions of the specified category and achieve the purpose of optimizing the generator.
  • the objective function for training the classifier Q is expressed as:
  • the classification loss function for training the generator G is expressed as:
  • P video (v, c) is the distribution of real videos and their labels
  • Q is the classification network
  • P z (z) is random noise
  • c is the target category
  • G is the generator.
  • a video discriminator D patch for judging local regions is introduced.
  • the local video discriminator D patch is used to ensure the smoothness of the video changes and the authenticity of the video frames; its structure is simpler and easier to train.
  • the introduction of a local video discriminator balances the training of the generator and the video discriminator D V and gives the generator room for optimization; it prevents the video discriminator D V from being trained too well in the early stage of training, i.e., from accurately separating correct and incorrect samples, which would make the generator difficult to train.
  • the adversarial loss function of the local video discriminator D patch is expressed as:
  • the overall objective function of the provided deep learning network model is expressed as:
  • D includes the image discriminator D img, the video discriminator D V, and the local video discriminator D patch; G includes the generator and the recurrent neural network; λ1, λ2, λ3 and λ4 are user-defined parameters that can be determined by experience or simulation.
  • the specific training process will not be repeated in the present invention.
  • the offset distance of marker points is introduced as a metric.
  • the specific method is: based on the dlib face-detection library, the positions of 68 facial key points are detected for each frame in the video, the L1 norm between the key-point positions of each frame and those of the first frame is calculated as the distance, and the curve shown in Figure 6 is drawn with time (i.e. the frame number) as the abscissa and the landmark distance as the ordinate.
  • the offset distances of the first 8 frames of CK+ (an expression database) and of the first 8 frames of MMI are calculated respectively; both rise gently, indicating that as a person moves from a neutral expression to the peak of an expression, the landmark offset changes continuously and grows gradually.
  • the variation of CK+ is more than twice that of MMI because, during data extraction, the frames of CK+ are sparser while those of MMI are denser; that is, the different number of frames extracted per unit time causes the difference in the landmark offset distance.
  • validating CK+ with the trained model shows that its variation is less pronounced than on the training set.
  • in the CelebA test, the variation was found to be generally consistent with the levels observed on the MMI training set and in the CK+ verification, indicating that the changes in the generated videos are gentle and continuous, with no abrupt changes or discontinuous pictures.
  • the present invention designs a deep learning network model including a recurrent neural network, a generator and three discriminators, so that the generated video changes continuously and clearly, with obvious expressions and without abrupt changes or discontinuous pictures;
  • by designing the reconstruction loss function of the generator, the loss function of the image discriminator D img, the adversarial loss function of the video discriminator D V, the objective function of the classifier Q, the classification loss function of the generator G, the adversarial loss function of the local video discriminator D patch, and the overall objective function, the accuracy of facial expression generation is improved; in addition, by designing a U-net-based generator, shallow and deep features are well preserved, which further improves the clarity of the generated video.
  • the present invention may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the above.
  • RAM random access memory
  • ROM read only memory
  • EPROM erasable programmable read only memory
  • SRAM static random access memory
  • CD-ROM compact disk read only memory
  • DVD digital versatile disk
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through electrical wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • the computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++ and Python, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • LAN local area network
  • WAN wide area network
  • custom electronic circuits such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs)
  • FPGAs field programmable gate arrays
  • PLAs programmable logic arrays
  • Computer readable program instructions are executed to implement various aspects of the present invention.
  • these computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • these computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • computer-readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, so that the instructions executing on the computer, other programmable data processing apparatus, or other equipment implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a generative adversarial network-based facial expression generation method. The method comprises: constructing a deep learning network model, which comprises a recurrent neural network, a generator, an image discriminator, a first video discriminator, and a second video discriminator, wherein the recurrent neural network produces a time-related motion vector for an input image, the generator takes the motion vector and the input image as input and outputs a corresponding video frame, the image discriminator is used for determining the authenticity of each video frame, the first video discriminator determines the authenticity of a video and performs classification, and the second video discriminator controls the realness and smoothness of the changes in a generated video; training the deep learning network model using sample images containing different expression types as input; and using the trained generator to generate a facial video in real time. The present invention is able to generate an expression while retaining facial features, the generated video preserves continuity and realness, and the method has the ability to generalize to different human faces.

Description

A Generative Adversarial Network-Based Facial Expression Generation Method
Technical Field
The invention relates to the technical field of computer vision, and more particularly to a method for generating facial expressions based on a generative adversarial network.
Background Art
In terms of face generation, 3DMM (a statistical 3D face deformation model) generates faces by changing parameters such as shape, texture, pose, and illumination. DRAW (deep recursive writer) uses a recurrent neural network (RNN) for image generation, and Pixel CNN replaces the RNN with a convolutional neural network (CNN) to achieve pixel-by-pixel image generation.
Since their appearance, generative adversarial networks (GANs) have been widely applied to image generation, and more and more GAN-based models are applied to facial expression transformation. For example, ExprGAN (intensity-controllable expression editing) combines a conditional generative adversarial network with an adversarial auto-decoder to realize facial expression transformation. As another example, Facelet-Bank, on the basis of a fixed encoder and decoder, trains a network representing the difference between a target input domain and output domain, thereby realizing face image editing.
At present, one of the main methods for generating video from images is motion-sequence prediction. For example, ConvLSTM (convolutional long short-term memory network) predicts future video frames by combining recurrent and convolutional neural networks; VGAN (Vondrick C et al.), in addition to expression video recognition, uses a GAN to realize video generation; TGAN (a Turing-test-based generative adversarial model) points out that a video can be generated jointly by a temporal generator and an image generator, that is, a set of time-related sequence frames is generated, and TGAN uses the WGAN (Wasserstein GAN) structure to make training more stable; HP (Villegas R et al.) divides video generation into two independent steps: the first step predicts key points with a recurrent neural network, and the second step generates the video frame by frame according to the predicted key-point positions.
Another way to achieve image-to-video generation is frame-by-frame generation of the video. This approach no longer needs to consider the relationship between successive video frames; that is, the video generation problem is converted into a simpler image generation problem, and the degree of change of each frame is controlled by a coefficient. ExprGAN can control the expression intensity in facial expression editing experiments, and an expression video can be generated by setting a continuously increasing expression intensity. Image2video combines a basic encoder and a residual encoder, and realizes frame-by-frame video generation by changing the coefficient of the feature map obtained by the residual encoder, i.e., a variable representing the degree of change.
Analysis shows that existing deep-learning schemes for generating expression videos usually generate the video from noise; however, because expression databases are small, among other reasons, the generated faces are relatively uniform and a specific face cannot be designated, while models that generate video from an image perform poorly on facial expressions.
Summary of the Invention
The purpose of the present invention is to overcome the above defects of the prior art and to provide a method for generating facial expressions based on a generative adversarial network, in which the generated video maintains continuity and authenticity.
The technical solution of the present invention is to provide a method for generating facial expressions based on a generative adversarial network, the method comprising the following steps:
constructing a deep learning network model, the deep learning network model comprising a recurrent neural network, a generator, an image discriminator, a first video discriminator and a second video discriminator, wherein the recurrent neural network generates time-related motion vectors for an input image; the generator takes the motion vectors generated by the recurrent neural network and the input image as input and outputs corresponding video frames; the image discriminator is used to judge the authenticity of each video frame; the first video discriminator is used to judge the authenticity of the video and to classify the video; and the second video discriminator assists the first video discriminator in controlling the authenticity and smoothness of the changes in the generated video;
using sample images containing different expression categories as input, training the deep learning network model with a set objective function as the optimization target;
using the trained generator to generate face videos in real time.
Compared with the prior art, the present invention has the advantages that, by improving the structure of the generator, generation from a face image to an expression video can be better realized; by introducing a local video discriminator and a conditional image discriminator, the objective function is redefined and made more suitable for generation from face images to expression videos; facial identity is retained while the expression is generated, the generated video maintains continuity and authenticity, and the model generalizes to different faces.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings.
Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a method for generating facial expressions based on a generative adversarial network according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the overall structure of a deep learning network model according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the generator network structure according to an embodiment of the present invention;
Fig. 4 shows the effect of different people making a "happy" expression according to an embodiment of the present invention;
Fig. 5 shows the effect of the same person making the three expressions happy, sad and surprised according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of video change curves according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or its uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
In all examples shown and discussed herein, any specific value should be construed as merely illustrative and not limiting. Accordingly, other instances of the exemplary embodiments may have different values.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be discussed further in subsequent figures.
Referring to Fig. 1, the method for generating facial expressions based on a generative adversarial network provided by the present invention comprises: step S110, constructing a deep learning network model, the deep learning network model comprising a recurrent neural network, a generator, an image discriminator, a first video discriminator and a second video discriminator, wherein the recurrent neural network generates time-related motion vectors for an input image; the generator concatenates the motion vectors generated by the recurrent neural network with the input image along the channel dimension as input and outputs the corresponding video frames; the image discriminator judges each video frame; the first video discriminator judges whether the video is real and performs classification; and the second video discriminator assists the first video discriminator in controlling the authenticity and smoothness of the changes in the generated video; step S120, using sample images containing different expression categories as input and training the deep learning network model with a set objective function as the optimization target; step S130, using the trained generator to generate face videos in real time.
In the following, an improvement of the MoCoGAN framework (Motion and Content Decomposed GAN) is taken as an example for description. MoCoGAN realizes generation from noise to video based on a recurrent neural network and infoGAN (information generative adversarial network). On this basis, the invention modifies the structure of the generator so that it can better realize generation from a face image to an expression video, adds a local video discriminator and a conditional image discriminator, and redefines the objective function to adapt it to the problem of generating expression videos from face images.
Fig. 2 shows the overall structure of the constructed deep learning network model. Its main parts include a recurrent neural network (illustrated here with multiple gated recurrent units, marked as GRU cells), a generator (marked as G), and three discriminators. The recurrent neural network is used to generate a time-related motion sequence; the input image and the resulting motion sequence are fed to the generator to obtain video frames; the image discriminator (marked as D_img) judges each video frame; the first video discriminator (marked as D_V or D_V/Q) is used to judge whether the video is real and to classify it, and the other video discriminator (marked as D_patch, also called the local video discriminator) assists the first discriminator in controlling the authenticity and smoothness of the changes in the generated video.
Specific embodiments of the recurrent neural network, the generator and the discriminators shown in Fig. 2 are described below.
1) Recurrent neural network
Video generation is mainly controlled by content and motion. In a short video clip, when the time is short enough, the content of the video can be regarded as unchanged (that is, the scene in the video does not switch, and the people, objects and scenery in the picture do not change), while changes in the motion sequence cause the dynamic changes of the video (such as the movement and deformation of the people, objects and scenery in the picture). In a generative adversarial network, random noise is needed for each generation to produce a different output. In this embodiment, the video content is controlled by the input image, and the resulting changes are controlled by a vector representing motion.
When the same generator is used to generate different frames of a video, it must be ensured that the content of different frames of the same video is the same while the motion vector changes. Identical content between video frames is guaranteed by inputting the same image; at this point, if the motion vectors were completely random, each generated video frame would map the content to a random distribution, so the video could neither be guaranteed to be continuous nor be guaranteed to be meaningful in reality. To ensure continuity between the video frames and a meaningful generated video, the motion sequences of different frames need to be correlated.
In one embodiment, a recurrent neural network (RNN) is used to process sequence data and to solve problems that are context-related or related in the time domain. A recurrent neural network has memory: the previous state information is remembered and passed on, thereby affecting the next output; that is, each output is determined jointly by the state of the previous step and the current input. A recurrent neural network consists of multiple cells; each cell shares weights and is connected only to the preceding and following cells, and hidden states are passed between connected cells along one direction. In the video generation problem, a recurrent neural network can be used to generate the motion-state sequence, thereby controlling the temporal correlation of the generated video and ensuring that the video changes continuously. The gated recurrent unit (GRU) is a variant of the RNN that solves the problems of vanishing gradients and forgotten information caused by long sequences, and it simplifies the network structure while retaining the network's combination of short-term and long-term memory.
In this embodiment, a gated recurrent neural network maps the class label and t independent, identically distributed noise samples into t sequences representing the motion relationship, which are used to control the expression changes in the video.
h(k+1) = GRUcell(h(k), [z[k], c]),  k = 0, 1, …, t-1     (1)
where h(k+1) denotes the motion vector of the k-th frame, which is also the hidden state passed to the next frame; h(0) is a random initial state; z[k] is random noise obeying the N(0,1) distribution; c is the class label; GRUcell is the gated recurrent unit; and z[k] and c are concatenated to form the current input of the k-th gated recurrent unit.
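As an illustration of how such a motion generator could be implemented, the following sketch maps a one-hot expression label and t i.i.d. noise samples to t correlated motion vectors with a GRU cell, following Eq. (1). The framework (PyTorch), dimensions and module names are assumptions for illustration, not the patent's implementation.

```python
# Illustrative sketch only (PyTorch assumed; dimensions and names are not from the patent):
# a GRU cell maps a one-hot expression label and t i.i.d. noise samples to t correlated
# motion vectors, following Eq. (1).
import torch
import torch.nn as nn

class MotionRNN(nn.Module):
    def __init__(self, noise_dim=10, num_classes=3, hidden_dim=16):
        super().__init__()
        self.noise_dim = noise_dim
        self.hidden_dim = hidden_dim
        self.cell = nn.GRUCell(noise_dim + num_classes, hidden_dim)

    def forward(self, c_onehot, t):
        # c_onehot: (batch, num_classes) one-hot expression label; t: number of frames
        batch = c_onehot.size(0)
        h = torch.randn(batch, self.hidden_dim, device=c_onehot.device)     # h(0): random initial state
        motion = []
        for _ in range(t):
            z = torch.randn(batch, self.noise_dim, device=c_onehot.device)  # z[k] ~ N(0, 1)
            h = self.cell(torch.cat([z, c_onehot], dim=1), h)                # h(k+1) = GRUcell(h(k), [z[k], c])
            motion.append(h)
        return torch.stack(motion, dim=1)                                    # (batch, t, hidden_dim)
```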
2) Generator
In one embodiment, the generator that encodes and decodes the input image is based on the U-net structure. As shown in Fig. 3, the generator takes the motion vector and the input image, concatenated along the channel dimension, as input and outputs the corresponding video frame. For example, the generator includes seven convolution layers for downsampling and seven corresponding deconvolution layers for upsampling.
The generator outputs one image at a time, and all frames of the generated video share one generator. For a single video, the content x is the same; each time, only the motion vector h(k) needs to be changed to obtain a different video frame. The continuity of the video frames is controlled by the motion sequence h, and the correlation of the motion sequence within one video is guaranteed by the recurrent neural network, i.e., h = R(z, c), where R is the recurrent neural network, c is the class label (for example, the three expressions happy, surprised and sad), and z is random noise. The generated video is the sequence of frames produced by the generator from x and h(1), …, h(t); if the recurrent neural network is also regarded as part of the generator, the generated video can be written directly as a function of x, z and c.
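A minimal sketch of this frame-by-frame scheme is given below under assumed shapes: the motion vector h(k) is tiled spatially, concatenated with the input image along the channel dimension, and decoded into one frame; all frames share the same generator. The two-layer network is only a stand-in for the seven-layer U-net described above.

```python
# Minimal sketch under assumed shapes (PyTorch): the motion vector h(k) is tiled spatially,
# concatenated with the input image along the channel dimension, and decoded into one frame.
# The two-layer network below is only a stand-in for the seven-layer U-net described above.
import torch
import torch.nn as nn

class FrameGenerator(nn.Module):
    def __init__(self, img_channels=3, motion_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + motion_dim, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, img_channels, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, x, h_k):
        # x: (batch, 3, H, W) input face; h_k: (batch, motion_dim) motion vector of frame k
        b, _, height, width = x.shape
        h_map = h_k.view(b, -1, 1, 1).expand(b, h_k.size(1), height, width)
        return self.net(torch.cat([x, h_map], dim=1))      # one generated video frame

def generate_video(gen, x, motion):
    # motion: (batch, t, motion_dim); every frame shares the same generator
    frames = [gen(x, motion[:, k]) for k in range(motion.size(1))]
    return torch.stack(frames, dim=2)                      # (batch, 3, t, H, W)
```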
In video generation, a face image with a neutral expression is input, and the output video varies on the basis of this image; that is, through reconstruction the input image is treated as the first frame of the video. In addition to the adversarial loss functions described below, a pixel-level reconstruction error is used as an objective so that the first frame of the output video is consistent with the input image; the loss function uses the L1 norm (the sum of absolute errors). Empirically, the L2 norm (the square root of the sum of squared errors) tends to produce blurrier reconstructed images than the L1 norm.
For example, the reconstruction loss function of the generator is expressed as:
l_rec = E_{x~P(x)} [ || ṽ[0] − x ||_1 ]
where ṽ[0] denotes the first frame of the generated video, x is the input face image, and P(·) denotes the corresponding distribution.
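Under the same assumed tensor layout (batch, channels, time, height, width), the reconstruction term can be computed as the L1 distance between the first generated frame and the input image, for example:

```python
# Sketch of the reconstruction term under the video layout (batch, channels, time, H, W):
# the L1 distance between the first generated frame and the input image.
import torch.nn.functional as F

def reconstruction_loss(fake_video, x):
    # fake_video: (batch, 3, t, H, W); x: (batch, 3, H, W)
    return F.l1_loss(fake_video[:, :, 0], x)
```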
In this embodiment, the generator adopts the U-net structure, which allows the features extracted by each encoder layer not only to be passed to the next layer but also to be passed directly to the corresponding decoder layer. This avoids losing information during downsampling and preserves both shallow and deep features well.
3) Image discriminator
A traditional generative adversarial network generates the target output from noise, which can neither effectively control the type of output nor enable image editing. On this basis, conditional generative adversarial networks (CGAN) proposed a semi-supervised learning approach: in addition to the random noise input, a condition is added as a constraint, and this condition is also used in the discriminator's judgment. The condition can be a class label, a feature vector, or even an image. The objective function of CGAN is:
min_G max_D V(D, G) = E_{x~P_data(x)} [log D(x|y)] + E_{z~P_z(z)} [log(1 − D(G(z, y)|y))]
where P_data(x) is the real distribution of the domain to which the input data belong, P_z(z) is the distribution of the random noise, P_y(y) is the distribution of the condition, G is the generator, G(z, y) is the data generated from the input noise under the given condition, and D is the discriminator, which judges whether the data and the label are real.
In an embodiment of the present invention, the image discriminator adopts the CGAN structure, taking the first frame of a video in the dataset as the condition for real samples and the input image used when generating the video as the target condition. The image discriminator constrains each frame of the output video individually, without considering the relationship between successive video frames. The first frame of a training video concatenated with any intermediate frame of the same video serves as a real sample, and the generator's input image concatenated with any frame of the output video serves as a fake sample for training the image discriminator. The image discriminator can therefore not only judge whether a video frame is a real image but also constrain the relationship between the generated video frames and the input image.
For example, the loss function of the image discriminator D_img is expressed as:
l_img_adv = E_{v~P_video(v)} [log D_img(v[0], v[t])] + E_{x, z~P_z(z)} [log(1 − D_img(x, ṽ[t]))]
where P_video(v) is the distribution of real videos, v[0] denotes the first frame of a video v, v[t] denotes the (t+1)-th frame of the video, ṽ denotes the video generated by G from the input image x, P_z(z) is the random noise, c is the target category, G is the generator, and D_img is the image discriminator.
4) Video discriminator D_V
In the first video discriminator D_V, the input is no longer a two-dimensional image; a time dimension is added, giving three-dimensional data formed by stacking multiple video frames of a video clip. The two-dimensional convolution structure of a traditional discriminator is no longer applicable, and three-dimensional convolution is needed to handle the spatio-temporal problem. The idea of three-dimensional convolution is the same as that of two-dimensional convolution: the filter is controlled by a three-dimensional kernel, stride and padding, and the feature map is obtained by sliding it over the whole input.
The video discriminator D_V classifies the video while judging its authenticity. In one embodiment, instead of the CGAN structure, the infoGAN idea of adding a classifier is adopted, so that the network can be tuned by changing the weight of the classification error in the objective function. The classifier shares weights with the video discriminator and only adds output channels to represent the category, which simplifies the network model. The input of the infoGAN-style discriminator no longer needs labels, only real data and generated data, and the output of the video discriminator must classify the input in addition to judging its authenticity. To compute the cross-entropy loss, the category labels use, for example, one-hot encoding, which maps N category labels to N-dimensional 0-1 vectors; when computing the loss function, one-hot encoding removes the influence of the category numbering, which is beneficial for measuring the distance between different categories.
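A possible realization of this shared-weight design, with assumed layer sizes, is a 3D-convolutional discriminator whose final layer outputs one real/fake logit plus N class logits, as sketched below.

```python
# Sketch of D_V with a shared-weight classifier head Q (layer sizes assumed): 3D convolutions
# process the (channel, time, height, width) video, and the final linear layer simply carries
# extra output channels for the N expression classes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoDiscriminator(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(128, 1 + num_classes)        # 1 real/fake logit + class logits

    def forward(self, video):                              # video: (batch, 3, t, H, W)
        out = self.head(self.features(video).flatten(1))
        return out[:, :1], out[:, 1:]                      # adversarial logit, class logits

labels = F.one_hot(torch.tensor([0, 2, 1]), num_classes=3).float()  # one-hot expression labels
```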
When training the video discriminator, on the one hand it is necessary to distinguish real videos from fake ones. The adversarial loss function of the video discriminator D_V is expressed as:
l_vid_adv = E_{v~P_video(v)} [log D_V(v)] + E_{x, z~P_z(z)} [log(1 − D_V(ṽ))]
On the other hand, the videos need to be classified, i.e., a video classifier is trained. The videos in the training set all carry category labels; the training videos are classified, the cross-entropy between the predicted category given by the classifier and the actual category is computed as a loss function, and the classifier is optimized by reducing the classification error rate. For the generated results, the cross-entropy between the predicted category given by the classifier and the target category is computed as a loss function, so that the generator can generate expressions of the specified category, thereby optimizing the generator.
In one embodiment, the objective function for training the classifier Q is expressed as:
l_Q = E_{(v,c)~P_video(v,c)} [ −log Q(c | v) ]
and the classification loss function for training the generator G is expressed as:
l_cat = E_{x, z~P_z(z)} [ −log Q(c | ṽ) ]
where P_video(v, c) is the joint distribution of real videos and their labels, Q is the classification network, P_z(z) is the random noise, c is the target category, and G is the generator.
5)局部视频判别器5) Local video discriminator
除了整体的判别器外,引入一个判断局部区域的视频判别器D patch,此时的判别器不再需要处理分类任务,局部视频判别器D patch用于保证视频变化的平滑性和视频帧的真实性,结构更为简单,更容易训练。引入局部视频判别器能够使生成器和视频判别器D V的训练趋于平衡,给生成器提供可优化的空间。防止在训练初期视频判别器D V训练得过好,即视频判别器D V可以准确分离正确和错误样本导致生成器难以训练。 In addition to the overall discriminator, a video discriminator D patch for judging local regions is introduced. At this time, the discriminator no longer needs to deal with classification tasks. The local video discriminator D patch is used to ensure the smoothness of video changes and the authenticity of video frames. Sex, the structure is simpler and easier to train. The introduction of a local video discriminator can make the training of the generator and the video discriminator D V tend to balance, and provide the generator with an optimization space. Prevent the video discriminator D V from being over-trained in the early stage of training, that is, the video discriminator D V can accurately separate the correct and wrong samples, which makes the generator difficult to train.
In one embodiment, the adversarial loss function of the local video discriminator D_patch is expressed as:
Figure PCTCN2021085263-appb-000010
In one embodiment, the overall objective function of the provided deep learning network model is expressed as:
min_G max_D loss = l_img_adv + λ_1 l_vid_adv + λ_2 l_patch_adv + λ_3 l_cat + λ_4 l_rec    (9)
where D comprises the image discriminator D_img, the video discriminator D_V and the local video discriminator D_patch; G comprises the generator and the recurrent neural network; and λ_1, λ_2, λ_3, λ_4 are user-defined parameters that can be determined empirically or by simulation.
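As an illustrative sketch (not the original implementation), the weighted combination in objective (9) could be written as follows, with the default λ values taken from the hyperparameters given below:

```python
def total_objective(l_img_adv, l_vid_adv, l_patch_adv, l_cat, l_rec,
                    lambda1=1.0, lambda2=1.0, lambda3=10.0, lambda4=10.0):
    """Weighted sum of the five terms in objective (9); D maximizes the adversarial terms
    while G (generator plus recurrent network) minimizes the whole expression."""
    return l_img_adv + lambda1 * l_vid_adv + lambda2 * l_patch_adv + lambda3 * l_cat + lambda4 * l_rec
```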
When training the above deep learning network model, sample images containing different expression categories are used as the training set. For example, frontal faces are selected and the database images are cropped with an automatic cropping tool to 128×128 pixels, with the centered face occupying 80% of the whole image. During training the batch size is 16, the video length is 8 and the input image size is 128×128 pixels; for the hyperparameters in the objective function, λ_1 = 1, λ_2 = 1, λ_3 = 10 and λ_4 = 10; the Adam optimizer is used with a learning rate of 0.0002, β_1 = 0.5, β_2 = 0.999 and a weight decay of 0.00001. The specific training procedure is not described further here.
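A minimal sketch of this training configuration, assuming PyTorch and hypothetical module objects for the generator, the recurrent network and the three discriminators, might look as follows; the hyperparameter values are the ones listed above.

```python
import torch

def build_optimizers(generator, motion_rnn, image_d, video_d, patch_d):
    """Adam optimizers with the hyperparameters listed above:
    learning rate 0.0002, betas (0.5, 0.999), weight decay 0.00001."""
    g_params = list(generator.parameters()) + list(motion_rnn.parameters())
    d_params = (list(image_d.parameters()) + list(video_d.parameters())
                + list(patch_d.parameters()))
    opt_g = torch.optim.Adam(g_params, lr=2e-4, betas=(0.5, 0.999), weight_decay=1e-5)
    opt_d = torch.optim.Adam(d_params, lr=2e-4, betas=(0.5, 0.999), weight_decay=1e-5)
    return opt_g, opt_d

BATCH_SIZE, VIDEO_LENGTH, IMAGE_SIZE = 16, 8, 128  # training batch size, frames per video, input resolution
```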
To verify the effect of the present invention, a qualitative analysis was first carried out. Some frontally identifiable faces were selected from the CelebA database as the test set and, after cropping, used as input. The results were visualized for different people performing the "happy" expression (as shown in Figure 4) and for the same person performing the three expressions happy, sad and surprised (as shown in Figure 5). The experimental results show that the generation of a facial expression video from a face image can be controlled by different expression labels, and that the generated videos change continuously, are clear and show obvious expressions.
Further, a quantitative analysis was performed. To judge whether the video changes smoothly, i.e. its temporal continuity, the landmark offset distance is introduced as a metric. Specifically, based on the dlib face detection library, the positions of 68 facial key points are detected for each frame of the video, and the L1 norm between the key point positions of each frame and those of the first frame is computed as the distance; the curve shown in Figure 6 is then drawn with time (i.e. the frame number) on the horizontal axis and the landmark distance on the vertical axis.
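A minimal sketch of this landmark-offset metric is given below, assuming the dlib frontal face detector, the standard dlib 68-point shape predictor model file, and video frames supplied as RGB uint8 NumPy arrays each containing one detectable face; it is illustrative rather than the original evaluation code.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
# Standard dlib 68-point landmark model file (must be available on disk)
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks(frame):
    """Return the 68 facial key points of one frame as a (68, 2) array (first detected face)."""
    rects = detector(frame, 1)
    shape = predictor(frame, rects[0])
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], dtype=np.float64)

def landmark_offset_curve(frames):
    """L1 distance between each frame's landmarks and the landmarks of the first frame."""
    reference = landmarks(frames[0])
    return [np.abs(landmarks(frame) - reference).sum() for frame in frames]
```

The resulting values, plotted against the frame index, give a curve of the kind shown in Figure 6.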
In the training data, the landmark offset distances of the first 8 frames of CK+ (an expression database) and of the first 8 frames of MMI were computed separately. Both rise gently, indicating that as a person moves from a neutral expression to the peak of an expression, the landmark offsets change continuously and grow gradually. The variation of CK+ is more than twice that of MMI because, during data extraction, the CK+ frames are sparser while the MMI frames are denser; the different number of frames extracted per unit time leads to different landmark offset distances. Validating CK+ with the trained model shows that its variation is less pronounced than in the training set. In the CelebA test, the variation is generally consistent with the levels observed on the MMI training set and in the CK+ validation, indicating that the changes in the generated videos are gentle and continuous, without abrupt changes or discontinuous frames.
In summary, by designing a deep learning network model comprising a recurrent neural network, a generator and three discriminators, the present invention makes the generated videos change continuously and clearly with obvious expressions, without abrupt changes or discontinuous frames. By designing the reconstruction loss function of the generator, the loss function of the image discriminator D_img, the adversarial loss function of the video discriminator D_V, the objective function of the classifier Q, the classification loss function of the generator G, the adversarial loss function of the local video discriminator D_patch and the overall objective function, the accuracy of facial expression generation is improved. In addition, by designing a U-net-based generator, shallow and deep features are well preserved, further improving the clarity of the generated videos.
The present invention may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to the respective computing/processing devices, or downloaded to an external computer or external storage device via a network such as the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++ and Python, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs) or programmable logic arrays (PLAs), may be personalized by utilizing state information of the computer-readable program instructions, and these electronic circuits may execute the computer-readable program instructions, thereby implementing various aspects of the present invention.
Aspects of the present invention are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus and/or other devices to operate in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus or another device, so that a series of operational steps are performed on the computer, the other programmable data processing apparatus or the other device to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable data processing apparatus or the other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.
Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or the technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

  1. A method for generating facial expressions based on a generative adversarial network, comprising the following steps:
    constructing a deep learning network model, the deep learning network model comprising a recurrent neural network, a generator, an image discriminator, a first video discriminator and a second video discriminator, wherein the recurrent neural network generates time-dependent motion vectors for an input image; the generator takes the motion vectors generated by the recurrent neural network and the input image as input and outputs corresponding video frames; the image discriminator is used for judging the authenticity of each video frame; the first video discriminator is used for judging the authenticity of the video and classifying the video; and the second video discriminator assists the first video discriminator in controlling the realism and smoothness of the changes in the generated video;
    using sample images containing different expression categories as input and training the deep learning network model with a set objective function as the optimization target;
    generating face videos in real time with the trained generator.
  2. The method according to claim 1, wherein the generator is constructed based on a U-net structure and comprises multiple convolutional layers for downsampling and, corresponding to the multiple convolutional layers, multiple deconvolution layers for upsampling.
  3. The method according to claim 1, wherein the objective function is set to:
    min_G max_D loss = l_img_adv + λ_1 l_vid_adv + λ_2 l_patch_adv + λ_3 l_cat + λ_4 l_rec
    wherein D comprises the image discriminator, the first video discriminator and the second video discriminator; G comprises the generator and the recurrent neural network; λ_1, λ_2, λ_3 and λ_4 are hyperparameters; l_img_adv is the loss function of the image discriminator; l_vid_adv is the adversarial loss function of the first video discriminator; l_patch_adv is the adversarial loss function of the second video discriminator; l_cat is the classification loss function of the generator; and l_rec is the reconstruction loss function of the generator.
  4. The method according to claim 3, wherein the loss function of the image discriminator is expressed as:
    Figure PCTCN2021085263-appb-100002
    wherein P_video(v) is the distribution of real videos, v[0] denotes the first frame of video v, v[t] denotes the (t+1)-th frame of the video, z is random noise, c is the target category, G denotes the generator, D_img denotes the image discriminator, and P_data(x), P_z(z) and P_c(c) denote the distributions of x, z and c, respectively.
  5. The method according to claim 3, wherein the adversarial loss function of the first video discriminator D_V is expressed as:
    Figure PCTCN2021085263-appb-100003
    wherein c is the target category, z is random noise, x is the input face image, v denotes a video frame, P_video(v) denotes the distribution of videos, and P_data(x), P_z(z) and P_c(c) denote the distributions of x, z and c, respectively.
  6. The method according to claim 3, wherein the adversarial loss function of the second video discriminator D_patch is expressed as:
    Figure PCTCN2021085263-appb-100004
    wherein c is the target category, z is random noise, x is the input face image, P_video(v) denotes the distribution of videos, and P_data(x), P_z(z) and P_c(c) denote the distributions of x, z and c, respectively.
  7. The method according to claim 3, wherein the classification loss function of the generator is expressed as:
    Figure PCTCN2021085263-appb-100005
    wherein Q is the classification network, z is random noise, c is the target category, G is the generator, x is the input face image, and P_data(x), P_z(z) and P_c(c) denote the distributions of x, z and c, respectively.
  8. The method according to claim 3, wherein the reconstruction loss function of the generator is expressed as:
    Figure PCTCN2021085263-appb-100006
    wherein Figure PCTCN2021085263-appb-100007 denotes the first frame of the generated video, x is the input face image, Figure PCTCN2021085263-appb-100008 denotes the generated video, Figure PCTCN2021085263-appb-100009 denotes the distribution of Figure PCTCN2021085263-appb-100010, and P_x(x) denotes the distribution of x.
  9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
  10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 8.
PCT/CN2021/085263 2021-04-02 2021-04-02 Generative adversarial network-based facial expression generation method WO2022205416A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/085263 WO2022205416A1 (en) 2021-04-02 2021-04-02 Generative adversarial network-based facial expression generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/085263 WO2022205416A1 (en) 2021-04-02 2021-04-02 Generative adversarial network-based facial expression generation method

Publications (1)

Publication Number Publication Date
WO2022205416A1 true WO2022205416A1 (en) 2022-10-06

Family

ID=83457562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/085263 WO2022205416A1 (en) 2021-04-02 2021-04-02 Generative adversarial network-based facial expression generation method

Country Status (1)

Country Link
WO (1) WO2022205416A1 (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180288431A1 (en) * 2017-03-31 2018-10-04 Nvidia Corporation System and method for content and motion controlled action video generation
CN108268845A (en) * 2018-01-17 2018-07-10 深圳市唯特视科技有限公司 A kind of dynamic translation system using generation confrontation network synthesis face video sequence
CN109726654A (en) * 2018-12-19 2019-05-07 河海大学 A kind of gait recognition method based on generation confrontation network
CN110210429A (en) * 2019-06-06 2019-09-06 山东大学 A method of network is generated based on light stream, image, movement confrontation and improves anxiety, depression, angry facial expression recognition correct rate
US10671838B1 (en) * 2019-08-19 2020-06-02 Neon Evolution Inc. Methods and systems for image and voice processing
CN111028305A (en) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 Expression generation method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TULYAKOV SERGEY; LIU MING-YU; YANG XIAODONG; KAUTZ JAN: "MoCoGAN: Decomposing Motion and Content for Video Generation", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 1526 - 1535, XP033476116, DOI: 10.1109/CVPR.2018.00165 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116074577A (en) * 2022-12-23 2023-05-05 北京生数科技有限公司 Video processing method, related device and storage medium
CN116074577B (en) * 2022-12-23 2023-09-26 北京生数科技有限公司 Video processing method, related device and storage medium
CN116188684A (en) * 2023-01-03 2023-05-30 中国电信股份有限公司 Three-dimensional human body reconstruction method based on video sequence and related equipment
CN116502548A (en) * 2023-06-29 2023-07-28 湖北工业大学 Three-dimensional toy design method based on deep learning
CN116502548B (en) * 2023-06-29 2023-09-15 湖北工业大学 Three-dimensional toy design method based on deep learning

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
US20220014807A1 (en) Method, apparatus, device and medium for generating captioning information of multimedia data
US11200424B2 (en) Space-time memory network for locating target object in video content
WO2022205416A1 (en) Generative adversarial network-based facial expression generation method
CN109598231B (en) Video watermark identification method, device, equipment and storage medium
CN112990078B (en) Facial expression generation method based on generation type confrontation network
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
Saxena et al. Monocular depth estimation using diffusion models
CN109168003B (en) Method for generating neural network model for video prediction
US20240119697A1 (en) Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling
JP2022161564A (en) System for training machine learning model recognizing character of text image
CN116982089A (en) Method and system for image semantic enhancement
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN112131429A (en) Video classification method and system based on depth prediction coding network
Fan et al. [Retracted] Accurate Recognition and Simulation of 3D Visual Image of Aerobics Movement
Wang et al. Feature enhancement: predict more detailed and crisper edges
US20230254230A1 (en) Processing a time-varying signal
Ciamarra et al. Forecasting future instance segmentation with learned optical flow and warping
CN115082840A (en) Action video classification method and device based on data combination and channel correlation
CN110969187B (en) Semantic analysis method for map migration
CN116601682A (en) Improved processing of sequential data via machine learning model featuring temporal residual connection
Chan et al. A combination of background modeler and encoder-decoder CNN for background/foreground segregation in image sequence
KR102685693B1 (en) Methods and Devices for Deepfake Video Detection, Computer Programs
Li et al. A learnable motion preserving pooling for fine-grained video classification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21934073

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21934073

Country of ref document: EP

Kind code of ref document: A1
