WO2021027759A1 - Facial image processing - Google Patents

Facial image processing

Info

Publication number: WO2021027759A1
Application number: PCT/CN2020/108121
Authority: WO (WIPO (PCT))
Prior art keywords: neural network, facial image, image, parameters, input
Other languages: French (fr)
Inventors: Evangelos VERVERAS, Stefanos ZAFEIRIOU
Original Assignee: Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021027759A1

Classifications

    • G06V40/174 Facial expression recognition
    • G06V40/175 Static expression
    • G06T11/00 2D [Two Dimensional] image generation
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06N3/02 Neural networks
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • This specification relates to methods for generating facial images with a given facial expression using neural networks, and methods of training neural networks for generating facial images with a given facial expression.
  • Image-to-image translation is a ubiquitous problem in image processing, in which an input image of a source domain is transformed to a synthetic image of a target domain, where the synthetic image maintains some properties of the original input image.
  • Examples of image-to-image translation include converting images from black-and-white to colour, turning daylight scenes into night-time scenes, increasing the quality of images and/or manipulating facial attributes of an image.
  • However, many methods of performing image-to-image translation require aligned image pairs.
  • Methods of performing image-to-image translation in the absence of aligned images are limited in that they may not generate synthetic samples of sufficient quality when the input image is recorded under unconstrained conditions (e.g. in the wild) .
  • In such methods, the target domain is usually limited to one of a discrete set of domains. Hence, these methods are limited in their capabilities for generating facial images with a given facial expression.
  • A method of training a generator neural network to generate a facial image with a target expression from a facial image and a set of target expression parameters, the method comprising: receiving training data comprising a plurality of training examples, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters; applying the generator neural network to an input facial image and a corresponding set of target expression parameters to generate an output facial image; applying the generator neural network to the first output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image; applying a discriminator neural network to the input facial image to generate a first predicted classification; applying the discriminator neural network to the output facial image to generate a second predicted classification; and updating parameters of the generator neural network in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image; and/or updating parameters of the discriminator neural network in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification.
  • The first predicted classification and/or the second predicted classification may comprise a probability distribution over image patches in the input facial image and/or output facial image, indicative of whether the image patches are real or synthetic.
  • the generator neural network and/or discriminator neural network may comprise one or more convolutional layers.
  • the set of initial expression parameters and/or the set of target expression parameters may correspond to continuous parameters of a linear three-dimensional blendshape model.
  • One or more of the training examples may further comprise a target facial image corresponding to the set of target expression parameters and the generator loss function may further depend on a comparison of the target facial image with the output facial image when using training examples comprising a target facial image.
  • One or more of the training examples may comprise a target facial image generated synthetically.
  • the method may further comprise: applying a face recognition module to the input facial image to generate an input embedding; and applying the face recognition module to the output facial image to generate an output embedding, and the generator loss function may further depend on a comparison of the input embedding with the output embedding.
  • the face recognition module may comprise a pre-trained neural network.
  • the generator loss function and/or discriminator loss function may comprise a gradient penalty term based on a gradient of the second classifier with respect to the input facial image.
  • Updating parameters of the generator neural network and/or parameters of the discriminator neural network may be performed using backpropagation.
  • the discriminator neural network may comprise a regression layer, and the discriminator neural network may further generate a set of predicted expression parameters from the input facial image and/or the output facial image.
  • the discriminator loss function may further depend on a comparison of the set of predicted expression parameters for an input image with the set of initial expression parameters for the input image.
  • Updating parameters of the generator neural network may further be based on a comparison of the set of predicted expression parameters for an output image with the set of target expression parameters used to generate the output image.
  • the generator neural network may further generate an attention mask from the input facial data and a reconstructed attention mask from the first output facial image and a corresponding set of initial expression parameters. Updating parameters of the generator neural network may further be in dependence on a comparison of the attention mask and the reconstructed attention mask.
  • the generator neural network may further generate a deformation image from the input facial data.
  • the output facial image may be generated by combining the deformation image, the attention mask and the input facial image.
  • the reconstructed facial image may be generated by combining a reconstructed deformation image (generated by the generator neural network from the output facial image) , the reconstructed attention mask and the output facial image
  • the discriminator neural network may be a relativistic discriminator.
  • a method of generating a target image from an input image and a set of target expression parameters comprising: receiving the set of target expression parameters, the target expression parameters taken from a continuous range of target expression parameters; receiving the input image; applying a generator neural network to the input image and the set of target expression parameters to generate the target image, wherein the generator neural network is trained according to the method of any preceding claim.
  • a method of determining a set of expression parameters from an input facial image comprising: receiving an input image; and applying a discriminator neural network to the input image to generate the set of expression parameters, wherein the discriminator neural network has been trained according to the methods described above.
  • Figure 1 shows an overview of an example method of generating a facial image with a given expression using a trained generator neural network
  • Figure 2 shows an overview of an example method of training a generator neural network for generating a facial image with a given expression
  • Figure 3 shows an overview of a further example method of training a generator neural network for generating a facial image with a given expression
  • Figure 4 shows a flow diagram of an example method of training a generator neural network for generating a facial image with a given expression
  • Figures 5a and 5b show an example structure of a generator neural network for generating a facial image with a given expression and an example structure of a discriminator neural network for predicting the expression parameters and calculating an adversarial loss for a facial image;
  • Figure 6 shows an overview of an example method of predicting the expression parameters of a facial image using a trained discriminator neural network
  • Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein.
  • Example implementations provide system (s) and methods for generating facial images with a given facial expression defined by a set of expression parameters, and methods of determining facial expression parameters of a facial model from an input facial image.
  • Interactive editing of facial expressions has wide ranging applications including, but not limited to, post-production of movies, computational photography, and face recognition. Improving editing of facial expressions may improve systems directed at these applications.
  • face recognition systems using the system (s) and method (s) described herein may be able to more realistically neutralise the expression of a user’s face, and thus verify the user’s identity under less constrained environments than systems and methods in the state of the art.
  • methods and systems disclosed herein may utilise continuous expression parameters, in contrast to discrete expression parameters which use, for example, labels indicating an emotion, such as “happy” , “fearful” , or “sad” .
  • the expression parameters may correspond with expression parameters used by models to vary expression when generating three-dimensional face meshes. These models may include, for example, linear 3D blendshape models. These models may also be able to represent any deformation in facial images that occur, for example deformations as a result of speech. Therefore, it will be appreciated that reference to expression parameters herein refer to any parameter that captures variation in facial images caused by a deformation to the face in the facial image.
  • methods disclosed herein may provide a finer level of control when editing the expressions of facial images than methods using discrete expression parameters.
  • Using the same expression parameters as a 3D facial mesh model may allow fine-tuning of the desired target output by editing an input facial image to depict any expression capable of being produced by the 3D facial mesh model.
  • the target expression may be fine-tuned by modifying the continuous expression parameters.
  • the desired target expressions may be previewed through rendering a 3D facial mesh with the selected expression parameters, prior to generating the edited facial image.
  • Figure 1 shows an overview of an example method of generating a facial image with a given expression using a trained generator neural network.
  • the method 100 takes an input facial image 102 and a set of target expression parameters 104 as input and forms an output facial image 108 as output using a trained generator neural network 106.
  • the input facial image 102, I org is an image comprising one or more faces.
  • the facial expression of each face in the input image is described by an initial set of expression parameters, p org .
  • the input facial image 102 is processed by the trained generator neural network 106 in order to produce an output facial image 108, I gen , comprising one or more faces corresponding to faces in the input image 102, but with an expression described by the target expression parameters 104, p trg .
  • the output facial image 108 generated by the trained generator neural network 106 may retain many aspects of the input facial image including identity, angle, lighting, and background elements, while changing the expression of the face to correspond with the target expression parameters 104.
  • the trained generator neural network 106 may produce an output facial image 108 with different mouth and lip motion to the input facial image 102. In this way, the trained generator neural network 106 may be used for expression and/or speech synthesis.
  • The input facial image 102 and/or output facial image 108 comprise a set of pixel values in a two-dimensional array. For example, a colour image may be represented as an array of dimensions H × W × 3, where H is the height of the image in pixels, W is the width of the image in pixels, and the image has three colour channels (e.g. RGB or CIELAB).
  • The facial images 102, 108 may, in some embodiments, be in black-and-white/greyscale.
  • the target expression parameters 104 encode a target facial expression of the output image 108 (i.e. the desired facial expression of the output facial image 108) . The target expression parameters are typically different to initial expression parameters associated with the input facial image 102.
  • the target expression parameters 104 are processed by the trained generator neural network 106 along with the input facial image 102 in order to produce the output facial image 108.
  • the initial expression and/or target expression parameters 104 may correspond with the expression parameters used by models to vary expression when generating 3D face meshes.
  • the expression parameters may be regularised to lie within a predefined range. For example, the expression parameters may be regularised to lie in the range [-1, 1] .
  • A zero expression parameter vector (i.e. a set of expression parameters where each parameter is set to 0) may correspond to a three-dimensional model with a neutral expression.
  • the magnitude of an expression parameter may correspond with an intensity of a corresponding expression depicted in the 3D facial mesh.
  • the trained generator neural network 106 is a neural network configured to process the input facial image 102 and target expression parameters 104 to generate an output facial image 108.
  • the generator neural network 106 may generate an attention mask 112 (also referred to as a smooth deformation mask) .
  • the mask 112 may have the same spatial dimension as the input facial image 102.
  • the generator neural network may also generate a deformation image, 110.
  • the output facial image 108 may then be generated as a combination of the input facial image 102 and the generator output 110.
  • the trained generator neural network 106 may determine which regions of the input facial image 102 should be modified in order to generate an output facial image 108 corresponding to target expression parameters 104.
  • the trained generator neural network 106 comprises a plurality of layers of nodes, each node associated with one or more parameters.
  • the parameters of each node of the neural network may comprise one or more weights and/or biases.
  • the nodes take as input one or more outputs of nodes in the previous layer.
  • the one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network.
  • One or more of the layers of the trained generator neural network 106 may be convolutional layers. Examples of generator neural network architectures are described below in relation to Figure 5a.
  • the parameters of the trained generator neural network 106 may be trained using generative adversarial training, and the trained generator neural network 106 may therefore be referred to as a Generative Adversarial Network (GAN) .
  • the trained generator neural network 106 may be the generator network of the generative adversarial training.
  • Generator neural networks 106 trained in an adversarial manner may output more realistic images than other methods. Examples of training methods are described below in relation to Figures 2 and 4.
  • FIG. 2 shows an overview of an example method of training a generator neural network for generating a facial image with a given expression.
  • the method 200 comprises jointly training a generator neural network 208 and a discriminator neural network 216.
  • the generator neural network 208 is trained using a generator loss function 214.
  • the discriminator neural network 216 is trained using a discriminator loss function 222.
  • the objective of the generator neural network 208 during training is to learn to generate realistic output facial images 210 with a given target expression, which is defined by the target expression parameters 206.
  • the generator loss function 214 is used to update parameters of the generator neural network 208 during training with the aim of producing more realistic output facial images 210.
  • the objective of the discriminator neural network 216 during training is to learn to distinguish between real facial images 202 and output/generated facial images 210.
  • the discriminator loss function 222 is used to update the discriminator neural network 216 with the aim of better discriminating between output (i.e. “fake” ) facial images 210 and input (i.e. “real” ) facial images 202.
  • the discriminator neural network may include a regression layer which estimates expression parameter vectors for an image, which may be a real image or a generated image. This regression layer may be parallel to a classification layer which may output probabilities over image patches in the image indicative of whether the image is real or synthetic.
  • the discriminator neural network 216 may be a relativistic discriminator, generating predicted classifications indicating the probability of an image being relatively more realistic than a generated one. Example structures of the discriminator neural network are described below, in relation to Figure 5b.
  • the generator neural network 208 and the discriminator neural network compete against each other until they reach a threshold/equilibrium condition.
  • the generator neural network 208 and the discriminator neural network compete with each other until the discriminator neural network can no longer distinguish between real and synthetic facial images.
  • the adversarial loss calculated by the discriminator neural network may be used to determine when to terminate the training procedure.
  • the adversarial loss of the discriminator neural network may correspond with a distance metric between the distributions of real images and generated images.
  • the training procedure may be terminated when this adversarial loss reaches a threshold.
  • the training procedure may be terminated after a fixed number of training epochs, for example 80 epochs.
  • the training uses a set of training data (also referred to herein as a training set) .
  • The training set may comprise a set of K training examples of the form (I org, p org, p trg), where I org is an input facial image 202, p org is an initial expression parameter vector 204, and p trg is a target expression parameter vector 206.
  • the initial expression parameters 204 correspond to the expression of the input facial image 202.
  • a 3D facial mesh model which is configured to receive expression parameters may depict the same expression as the expression of the input facial image 202 when instantiated with the initial expression parameters 204.
  • The training set may additionally or alternatively comprise a set of L training examples of the form (I org, p org, p trg, I trg), where I org, p org, and p trg are defined as above, and I trg is a known target facial image.
  • the target expression parameters 206 may correspond with the target facial image.
  • the method of training the generator neural network 208 where at least one example of the training set includes a target facial image is described below in more detail with respect to Figure 4.
  • the generator neural network 208 is applied to an input facial image 202, I org , and a corresponding set of target expression parameters 206, p trg , taken from a training example included in the training data.
  • the output of the generator neural network 208 is a corresponding output facial image 210, I gen .
  • The output facial image 210 may be represented symbolically as I gen = G (I org, p trg), where G represents the mapping performed by the generator neural network 208.
  • the generator neural network 208 may generate an attention mask (also referred to as a smooth deformation mask) .
  • the mask may have the same spatial dimension as the input facial image 202.
  • The generator neural network may also generate a deformation image.
  • The output facial image 210 may then be generated as a combination of the input facial image 202 and the generator output, i.e. the deformation image and the attention mask (one possible combination is sketched below).
  • the values of the mask may be constrained to lie between zero and one. This may be implemented, for example, by the use of a sigmoid activation function.
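  • As a hedged illustration only, the following sketch (in PyTorch, which is an assumed framework; the specification does not name one) shows one way the attention mask and deformation image could be combined with the input facial image. The convention that mask values near 1 preserve the input pixel is an assumption, chosen to be consistent with the attention mask loss described below, which discourages masks that saturate to 1.

    import torch

    def combine(i_org: torch.Tensor, deformation: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """Blend a deformation image with the input image using an attention mask.

        i_org, deformation: tensors of shape (B, 3, H, W); mask: (B, 1, H, W) with
        values in [0, 1] (e.g. produced by a sigmoid activation). Mask values near 1
        keep the original pixel, values near 0 take the deformed pixel (assumed
        convention).
        """
        return mask * i_org + (1.0 - mask) * deformation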
  • the input facial image 202 and the generated facial image 210 may additionally be processed by a face recognition module (not shown) .
  • the input facial image 202 may be processed by the face recognition module in order to produce an input embedding.
  • the output/generated facial image 210 may be processed by the face recognition module in order to produce an output embedding.
  • the input/output embeddings may represent the identity of faces in facial images.
  • the face recognition module may be a component of the loss function 214, as described below in further detail.
  • the generator neural network 208 is then applied to the output facial image 210, and the set of initial expression parameters 204, p org , taken from the same training example of the training data.
  • the initial expression parameters 204 correspond with the expression depicted in the input facial image 202.
  • the output of the generator neural network in this case is a reconstructed facial image 212, I rec .
  • The generator neural network 208 may generate a mask (also referred to herein as an attention mask) along with a deformation image. The mask may have the same spatial dimension as the input facial image 202.
  • The reconstructed facial image 212 may then be generated as a combination of the generated facial image 210 and the generator output, in the same manner as the output facial image 210 is generated from the input facial image 202.
  • The generator neural network 208 is configured to process the output facial image 210 and the corresponding set of initial expression parameters 204 in an attempt to produce a reconstructed facial image 212 that replicates the input facial image 202.
  • the use of this step during training of the generator neural network 208 can result in the contents of the input facial image 202 (e.g. background elements, presence of glasses and other accessories) being retained when generating output facial images 210 with a trained generator neural network 208.
  • The discriminator neural network 216 is applied to the input facial image 202 to generate a first predicted classification 218. The discriminator neural network is also applied to the output facial image 210 to generate a second predicted classification 220.
  • The predicted classifications may be represented, for example, as D (I org) and D (I gen), where D denotes the mapping performed by the discriminator neural network 216.
  • classifications 218, 220 may comprise a probability distribution over image patches in the input facial image 202 and/or the output facial image 210 indicative of whether the image patches are real or synthetic.
  • the discriminator neural network 216 may be a relativistic discriminator.
  • The relativistic discriminator generates predicted classifications indicative of the probability of an image being relatively more realistic than a generated one. Examples of relativistic discriminators can be found in "The relativistic discriminator: a key element missing from standard GAN" (Alexia Jolicoeur-Martineau, arXiv: 1807.00734), the contents of which are incorporated by reference. For example, the predicted classification may be represented by σ (C (I org) − C (I gen)), where C denotes the raw (non-transformed) discriminator output and σ is the sigmoid function.
  • the relativistic discriminator may be a relativistic average discriminator, which averages over a mini-batch of generated images.
  • The classifications generated by a relativistic average discriminator may be dependent on whether the input to the relativistic average discriminator is a real facial image or a generated facial image.
  • The classifications for a real image I may be given by σ (C (I) − E [C (I gen)]), and the classifications for a generated image I gen by σ (C (I gen) − E [C (I)]), where the expectations E [·] are taken over a mini-batch of generated images and real images respectively.
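  • For illustration, a minimal sketch of the relativistic average formulation referenced above (following the cited Jolicoeur-Martineau paper; the exact form used in this specification is not reproduced here), computing both classifications from raw discriminator outputs over a mini-batch:

    import torch

    def relativistic_average(c_real: torch.Tensor, c_fake: torch.Tensor):
        """Relativistic average classifications from raw (pre-sigmoid) outputs.

        c_real: raw discriminator outputs for a mini-batch of real images.
        c_fake: raw discriminator outputs for a mini-batch of generated images.
        Returns the probability that each real image is more realistic than the
        average generated image, and vice versa.
        """
        d_real = torch.sigmoid(c_real - c_fake.mean())
        d_fake = torch.sigmoid(c_fake - c_real.mean())
        return d_real, d_fake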
  • The discriminator neural network 216 may also output a set of predicted expression parameters from the input facial image 202 and/or the output facial image 210. These predicted parameters may be represented as a mapping from an image I to a vector of expression parameters.
  • The discriminator neural network 216 may include one or more regression layers to determine this mapping. These one or more regression layers may be parallel to a classification layer which may output probabilities over image patches in the image indicative of whether the image is real or synthetic. Example structures of the discriminator neural network 216 are described below, in relation to Figure 5b.
  • the input facial image 202, the reconstructed facial image 212, the first predicted classification 218 and second predicted classification 220 are used to calculate a generator loss function 214.
  • Calculating the generator loss function 214 may comprise comparing the input facial image 202 and the reconstructed facial image 212.
  • the generator loss function 214 may include a plurality of component loss functions, as will be described in greater detail below.
  • the generator loss is used to update the parameters of the generator neural network 208.
  • the parameters of the generator neural network 208 may be updated using an optimisation procedure that aims to optimise the generator loss function 214.
  • the optimisation procedure may, for example, be a gradient descent algorithm.
  • the optimisation procedure may use backpropagation, e.g. by backpropagating the generator loss function 214 to the parameters of the generator neural network 208.
  • the first predicted classification 218 and second predicted classification 220 are used to calculate a discriminator loss function 222.
  • a comparison of the initial expression parameters 204 and the predicted expression parameters of the input image 202 may also be used to determine the discriminator loss function 222.
  • the discriminator loss function 222 may include a plurality of component loss functions, as will be described in greater detail below.
  • the discriminator loss function 222 is used to update the parameters of the discriminator neural network 216.
  • the parameters of the discriminator neural network 216 may be updated using an optimisation procedure that aims to optimise the discriminator loss function 222.
  • the optimisation procedure may, for example, be a gradient descent algorithm.
  • The optimisation procedure may use backpropagation, e.g. by backpropagating the discriminator loss function 222 to the parameters of the discriminator neural network 216.
  • the generator loss function 214 and/or discriminator loss function 222 may comprise an adversarial loss function.
  • An adversarial loss function depends on the first predicted classification and the second predicted classification.
  • the adversarial loss function may further comprise a gradient penalty term.
  • the gradient penalty term may be based on a gradient of the second classification with respect to the input facial image.
  • Adversarial loss functions may correspond with a distance metric between the distributions of the generated images and the real images.
  • Expectation operators may be approximated by sampling from the respective distributions from which the expectations are taken, performing the calculation inside the expectation operator, and then dividing by the number of samples taken. In some embodiments, the discriminator classifications may be replaced by those of the relativistic discriminator described above.
  • A gradient penalty term may be based on a gradient of the second classification with respect to the input facial image. While the norm in this example is an L2 norm, it will be appreciated that alternative norms may be used.
  • A coefficient λ gp may control the contribution of the gradient penalty term to the adversarial loss function. For example, this may be set to a number in the range 1-100, such as between 5 and 15, for example 10.
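  • A minimal sketch of a gradient penalty term is given below, assuming PyTorch and a discriminator that returns a (patch classification, predicted expression parameters) pair. The (‖gradient‖ − 1)² form follows the common WGAN-GP recipe and is an assumption; the specification only requires a penalty based on an L2 gradient norm, weighted by λ gp.

    import torch

    def gradient_penalty(discriminator, images: torch.Tensor, lambda_gp: float = 10.0) -> torch.Tensor:
        """Penalty on the gradient of the discriminator classification w.r.t. its input image."""
        images = images.clone().requires_grad_(True)
        classification, _ = discriminator(images)  # assumed (classification, parameters) output
        grads = torch.autograd.grad(outputs=classification.sum(), inputs=images,
                                    create_graph=True, retain_graph=True)[0]
        grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
        return lambda_gp * ((grad_norm - 1.0) ** 2).mean()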
  • adversarial loss functions may be used, wherein the adversarial loss function depends on an output of a discriminator neural network 216.
  • Generator neural networks 208 trained with adversarial loss functions may generate synthesised images with greater photorealism than when trained with other loss functions.
  • When the discriminator neural network 216 is a relativistic discriminator, both real and generated images are included in the generator part of the adversarial loss function. This may allow the generator to benefit from the gradients of both real and fake images, generating output facial images 210 with sharper edges and more detail which also better represent the distribution of the real data.
  • An example discriminator loss function 222 may then be defined in terms of these relativistic predicted classifications.
  • One or more additional loss functions may be included in the generator loss function 214 and/or discriminator loss function, as described below.
  • generator loss function 214 may further comprise a reconstruction loss function.
  • the reconstruction loss function provides a measure of the difference between the input facial image 202 and the reconstructed facial image 212.
  • The reconstruction loss function is based on a comparison between the input facial image 202 and the reconstructed facial image 212. For an input facial image 202 with width W and height H, an example of a reconstruction loss function is the mean absolute (L1) difference L rec = (1/ (W·H)) Σ |I org − I rec|, where the sum runs over the pixels of the image.
  • I rec is the reconstructed facial image 212, and can be represented symbolically as I rec = G (I gen, p org) = G (G (I org, p trg), p org).
  • λ rec controls the contribution of the reconstruction loss function to the generator loss function.
  • λ rec may be set to a number in the range 1-100, such as between 5 and 15, for example 10.
  • Generator neural networks 208 trained using reconstruction loss functions may generate output facial images 210 which better preserve the contents of the input facial image 202 than when trained without reconstruction loss functions. For example, background elements and the presence of accessories (e.g. sunglasses, hats, jewellery) in the input facial image 202 may be retained in the output facial image 210.
  • Other reconstruction loss functions may be used. For example, different norms may be used in the calculation instead of the L1 norm, or the images may be pre-processed before calculating the reconstruction loss.
  • generator loss function 214 may further comprise an attention mask loss function.
  • The attention mask loss function compares the attention mask generated from the input facial image 202 to a reconstructed attention mask generated from the output facial image 210 and a corresponding set of initial expression parameters 204.
  • the attention mask loss may encourage the generator to produce attention masks which are sparse and do not saturate to 1.
  • The attention mask loss may minimise the L1 norm of the produced masks for both the generated and reconstructed images. For an input facial image 202 with width W and height H, an example of an attention mask loss function is L att = (1/ (W·H)) (‖A gen‖1 + ‖A rec‖1), where A gen and A rec are the attention mask and the reconstructed attention mask respectively.
  • norms other than an L1 may alternatively be used.
  • λ att controls the contribution of the attention mask loss function to the generator loss function.
  • λ att may be set to a number greater than zero, such as between 0.005 and 50, for example 0.3.
  • Generator neural networks 208 trained with an attention mask loss function may generate output facial images 210 which better preserve the content and the colour of the input facial image 202.
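  • The reconstruction and attention mask losses described above might be implemented as follows; this is a sketch under the assumption that per-pixel averaging replaces the explicit 1/(W·H) normalisation, and that the masks already lie in [0, 1].

    import torch

    def reconstruction_loss(i_org: torch.Tensor, i_rec: torch.Tensor) -> torch.Tensor:
        """Mean absolute (L1) difference between the input and reconstructed images."""
        return (i_org - i_rec).abs().mean()

    def attention_mask_loss(mask_gen: torch.Tensor, mask_rec: torch.Tensor) -> torch.Tensor:
        """L1 penalty encouraging sparse attention masks that do not saturate to 1."""
        return mask_gen.abs().mean() + mask_rec.abs().mean()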
  • the generator loss function 214 may include an identity loss function.
  • An identity loss function depends on the input facial image 202 and the output facial image 210.
  • The identity loss may be calculated using a face recognition module.
  • The face recognition module may be a pre-trained neural network configured to produce identity embeddings. Identity embeddings represent the identity of people in facial images. For example, two images of the same person may be processed by the face recognition module to produce embeddings that are closer to each other (as defined by some metric, such as Euclidean distance) than to identity embeddings corresponding to different people.
  • the input facial image 202 may be processed by the face recognition module to produce an input embedding, e org .
  • the output facial image 210 may be processed by the face recognition module to produce an output embedding, e gen .
  • The identity loss function may depend on the input embedding and the output embedding. An example identity loss function is a distance between the input embedding e org and the output embedding e gen, such as their Euclidean distance.
  • The generator loss function 214 may then be given as the sum of the adversarial loss and the identity loss weighted by a coefficient λ id.
  • λ id controls the contribution of the identity loss function to the generator loss function.
  • λ id may be set to a number in the range 1-100, such as between 2 and 10, for example 5.
  • the loss function may additionally include a reconstruction loss term and/or attention mask loss function, as described above.
  • Generator neural networks 208 trained using identity loss functions may generate output facial images 210 which better maintain the identity of the face in the input facial image 202 than when trained without identity loss functions.
  • Other identity loss functions may be used, for example, different similarity/distance metrics may be used to compare the identity embeddings.
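  • A sketch of an identity loss is shown below; face_rec is a hypothetical pre-trained, frozen face recognition network mapping images to identity embeddings, and the Euclidean distance is one possible choice of metric (the specification notes that other similarity/distance metrics may be used).

    import torch

    def identity_loss(face_rec, i_org: torch.Tensor, i_gen: torch.Tensor) -> torch.Tensor:
        """Distance between identity embeddings of the input and generated images."""
        with torch.no_grad():
            e_org = face_rec(i_org)        # embedding of the input image (no gradient needed)
        e_gen = face_rec(i_gen)            # gradients flow back to the generator through i_gen
        return (e_org - e_gen).pow(2).sum(dim=-1).sqrt().mean()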
  • the generator loss function 214 and/or discriminator loss function may further include an expression loss function.
  • An expression loss is calculated based on an output of the discriminator neural network.
  • the discriminator neural network may output a set of predicted expression parameters from the input facial image 202 and/or the output facial image 210.
  • a regression layer may be used to determine the expression parameters from an input image.
  • The predicted expression parameters may be compared with the initial and/or target expression parameters using an expression parameter loss function. These predicted parameters may be represented as a mapping from an image I to a vector of expression parameters.
  • a different expression loss function may be used in the generator loss function 214 and the discriminator loss function 222.
  • An example expression loss function for the generator loss function 214 compares the expression parameters predicted by the discriminator neural network for the output facial image 210 with the target expression parameters 206, for example using a mean squared error.
  • A generator loss function may then be given as the sum of the adversarial loss and this expression loss weighted by a coefficient λ exp.
  • λ exp controls the contribution of the expression loss function to the generator loss function.
  • λ exp may be set to a number in the range 1-100, such as between 5 and 15, for example 10.
  • the generator loss function 214 may additionally include a reconstruction loss term and/or attention mask loss term and/or an identity loss term, as described above.
  • Generator neural networks 208 trained using expression loss functions in the generator loss function 214 and/or discriminator loss function 222 may generate output facial images 210 which more accurately replicate the expression corresponding to the target expression parameters 206 than when trained without expression loss functions.
  • the generator neural network 208 may generate output facial images which have an accurate expression according to the discriminator neural network.
  • An example expression loss function for the discriminator loss function 222 compares the expression parameters predicted by the discriminator neural network for the real input facial image 202 with the initial expression parameters 204, for example using a mean squared error.
  • The discriminator loss function may then be given as the sum of the adversarial loss and this expression loss weighted by a coefficient λ exp.
  • λ exp controls the contribution of the expression loss function to the discriminator loss function.
  • λ exp may be set to a number in the range 1-100, such as between 5 and 15, for example 10.
  • Discriminator neural networks 216 trained with expression loss functions may accurately estimate the expression parameters of an input facial image.
  • the generator neural network may be configured to generate output facial images which, according to the discriminator neural network, accurately depict the expression corresponding to the target expression parameters.
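  • The two expression loss terms might look as follows, assuming the discriminator returns a (patch classification, predicted expression parameters) pair; the mean-squared-error comparison is an assumption, as the specification does not fix a particular distance.

    import torch
    import torch.nn.functional as F

    def expression_loss_generator(discriminator, i_gen, p_trg):
        """Compare parameters predicted for the generated image with the target parameters."""
        _, p_pred = discriminator(i_gen)
        return F.mse_loss(p_pred, p_trg)

    def expression_loss_discriminator(discriminator, i_org, p_org):
        """Compare parameters predicted for the real input image with its initial parameters."""
        _, p_pred = discriminator(i_org)
        return F.mse_loss(p_pred, p_org)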
  • A full generator loss function may be given as the sum of the adversarial loss and the weighted reconstruction, attention mask, identity and expression loss terms described above.
  • one or more of the components of the above generator loss function may be omitted.
  • The training set may comprise a set of L training examples of the form (I org, p org, p trg, I trg), with L being at least one.
  • one or more of the training examples may further comprise a target facial image 302, denoted by I trg , in addition to the input image 202, target expression parameters 206 and original expression parameters 204.
  • This training set may be in addition to or as an alternative to the training set K.
  • the method 300 of Figure 3 proceeds in substantially the same manner as the method 200 shown in Figure 2.
  • The target facial image 302 comprises a set of pixel values in a two-dimensional array.
  • The target facial image 302 may have the same dimensions as the input facial image 202 and output facial image 210. For example, a colour image may be represented as an array of dimensions H × W × 3, where H is the height of the image in pixels, W is the width of the image in pixels, and the image has three colour channels (e.g. RGB or CIELAB).
  • The target facial images 302 may, in some embodiments, be in black-and-white/greyscale.
  • the target facial image 302 may depict a face with the same identity as the face in input facial image 202, but with expression corresponding with the target expression parameters 206.
  • the target facial image 302 may be generated synthetically. Using methods described below, models for 3D face reconstruction from 2D images may be fitted to a set of facial images. The input facial image 202 may be processed by the fitted 3D face reconstruction model in order to produce a target facial image 302 with expression corresponding to target expression parameters 206. In this way, the target expression parameters 206 corresponding to the target facial image 302 are known, and thus the target facial image 302 may be referred to as a ground truth output facial image.
  • the generator loss 214 may include a generation loss function.
  • the generation loss function compares the target facial image with the output facial image.
  • the generator loss may be based on a difference between the target image 302 and the generated image 210. The difference may be measured by a distance metric.
  • An example generation loss function is a pixel-wise difference between the target facial image 302 and the output facial image 210, for example their mean absolute (L1) difference.
  • λ gen controls the contribution of the generation loss function to the generator loss function.
  • λ gen may be set to a number in the range 1-100, such as between 5 and 15, for example 10.
  • the generator loss function 214 may additionally include an expression loss function, a reconstruction loss term, attention mask loss function and/or an identity loss term, as described above in relation to Figure 2.
  • Generator neural networks 208 trained using generation loss functions may generate output facial images 210 which more accurately replicate the expression corresponding to the target expression parameters 206.
  • the generator neural network 208 may generate output facial images with fewer artefacts than when trained without generation loss functions.
  • the generator neural network 208 may be trained by separate loss functions depending on whether the training example includes a target facial image 302 or not.
  • a generator loss function 214 is selected based on whether the training data for the current iteration is taken from set K or set L.
  • Training examples including a target facial image 302 may be referred to as paired data, and training examples without target facial images 302 may be referred to as unpaired data.
  • Depending on whether a training example is paired or unpaired, one or more terms of the generator loss function 214 may be omitted.
  • the training process of the method 200 and/or 300 may update the generator neural network 208 after one or more updates made to the discriminator neural network.
  • the updates to the parameters of the generator and/or discriminator neural networks may be determined using backpropagation.
  • Since the adversarial loss corresponds with a distance metric between the distributions of real and generated images, updating the discriminator neural network more often than the generator neural network 208 may lead to the adversarial loss better approximating this distance metric. This in turn may provide a more accurate training signal for the generator neural network 208, so that output facial images 210 are generated with more realism.
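  • An outline of such an alternating training loop is sketched below; the choice of five discriminator updates per generator update and 80 epochs is illustrative, and generator_loss/discriminator_loss stand for callables implementing the loss functions described above.

    def train(generator, discriminator, loader, g_opt, d_opt,
              generator_loss, discriminator_loss, n_critic: int = 5, epochs: int = 80):
        """Alternate updates, training the discriminator n_critic times per generator step."""
        for _ in range(epochs):
            for step, (i_org, p_org, p_trg) in enumerate(loader):
                d_opt.zero_grad()
                discriminator_loss(generator, discriminator, i_org, p_org, p_trg).backward()
                d_opt.step()
                if (step + 1) % n_critic == 0:
                    g_opt.zero_grad()
                    generator_loss(generator, discriminator, i_org, p_org, p_trg).backward()
                    g_opt.step()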
  • initial and/or target expression parameters 204, 206 may correspond with the expression parameters used to vary expression in 3D facial mesh models.
  • Expression parameters used for training the methods described herein may be extracted by first fitting models for 3D reconstruction of faces to a set of facial images. These models may include 3D Morphable Models (3DMMs), neural networks, and other machine-learned models. These models for 3D face reconstruction may be configured to optimise three parametric models: a shape model, a texture model, and a camera model, in order to render a 2D instance of the 3D facial reconstruction as close to the initial facial image as possible.
  • the 3D facial mesh may be generated by finding the product of each of the shape parameters and a respective basis vector, and summing these products with a mean shape vector.
  • A 3D facial mesh comprising N vertices may be represented as a vector of length 3N containing the x, y and z coordinates of each vertex, e.g. s = [x 1, y 1, z 1, …, x N, y N, z N].
  • An identity parameter vector, p s, id may control variations in identity in 3D facial shapes.
  • The 3D facial mesh may be calculated as s = m s + U s, id p s, id, where m s is a mean shape vector and U s, id is a matrix of basis vectors corresponding to the principal components of the identity subspace.
  • basis vectors may be learned from 3D facial scans displaying a neutral expression and the identity parameters may be used to represent identity variations by instantiating a 3D shape instance.
  • Expression parameters may also be derived using principal component analysis and are used to generate the 3D facial mesh.
  • the 3D facial mesh may be generated by finding the product of each of the identity parameters and a respective basis vector and each of the expression parameters and a respective basis vector, and summing these products with a mean shape vector.
  • The 3D facial mesh may be calculated as s = m s + U s, id p s, id + U s, exp p s, exp, where m s is a mean shape vector, U s, id is a matrix of basis vectors for identity variations, U s, exp is a matrix of basis vectors for expression variations, p s, id are identity parameters controlling identity variations in the 3D facial mesh, and p s, exp are expression parameters controlling expression variations in the 3D facial mesh.
  • the expression basis vectors may be learned from displacement vectors calculated by comparing shape vectors from 3D facial scans depicting a neutral expression and from 3D facial scans depicting a posed expression. These expression parameters may be used to represent expression variations by, in addition to the identity parameters, instantiating a 3D shape instance.
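  • Instantiating a 3D shape instance from the identity and expression parameters then reduces to the linear combination above; a minimal NumPy sketch (variable names are illustrative):

    import numpy as np

    def instantiate_mesh(m_s: np.ndarray, u_id: np.ndarray, u_exp: np.ndarray,
                         p_id: np.ndarray, p_exp: np.ndarray) -> np.ndarray:
        """Return a 3D face mesh as a 3N-vector: s = m_s + U_id @ p_id + U_exp @ p_exp.

        m_s: mean shape (3N,); u_id: identity basis (3N, n_id);
        u_exp: expression basis (3N, n_exp); p_id, p_exp: parameter vectors.
        """
        return m_s + u_id @ p_id + u_exp @ p_exp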
  • methods described herein may not require the 3D facial mesh model to generate a facial image with a given expression; instead the expression parameters may be processed by a generator neural network trained by the methods described herein. Additionally or alternatively, the expression parameters may be predicted by a discriminator neural network trained by the methods described herein.
  • Identity and expression parameters p s, id and p s, exp may be extracted from any input facial image 202.
  • expression parameters may be extracted to compose an annotated dataset of K images and their corresponding vector of expression parameters with no manual annotation cost. This dataset may be used in part to produce a training set for training the generator neural network 208.
  • the target expression parameters 206 may be expression parameters determined (using the methods described herein or otherwise) from a facial image different to the input facial image 202. Additionally or alternatively, the target expression parameters 206 may be selected by generating a 3D facial mesh with a desired target expression, and using the corresponding expression parameters that produce the desired target expression. Additionally or alternatively, the target expression parameters 104 may be randomly selected, for example they may be sampled from a multivariate Gaussian distribution.
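  • For the random option, target expression parameters could, for example, be sampled from a Gaussian and clipped to the regularised range [-1, 1]; the standard deviation below is an illustrative choice, not one taken from the specification.

    import numpy as np

    def sample_target_parameters(n_params: int, rng=None) -> np.ndarray:
        """Sample target expression parameters and clip them to the range [-1, 1]."""
        rng = np.random.default_rng() if rng is None else rng
        return np.clip(rng.normal(loc=0.0, scale=0.5, size=n_params), -1.0, 1.0)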
  • Figure 4 shows a flow diagram of an example method of training a generator neural network for generating a facial image with a given expression.
  • Training data comprising a plurality of training examples, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters, is received by a generator neural network.
  • the expression parameters may correspond to continuous parameters of a 3D facial model.
  • 3D facial models may include 3D blendshape models and/or 3D morphable models.
  • the training data further comprises a target image corresponding to the target expression parameters.
  • the target image may be a ground truth image from which the target expression parameters were extracted.
  • the target facial image may be a synthetic image.
  • the generator neural network is applied to an input facial image and a corresponding set of target parameters to generate an output facial image.
  • the generator neural network is applied to the first output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image.
  • A discriminator neural network is applied to the input facial image to generate a first predicted classification.
  • the discriminator neural network is applied to the output facial image to generate a second predicted classification.
  • the second predicted classification may comprise a probability distribution over image patches of the output image indicative of whether those patches are real (i.e. from ground truth images) or synthetic (i.e. generated by the generator neural network)
  • Operations 4.1 to 4.5 may be iterated over the training examples in the training data (e.g. K and/or L described above) to form an ensemble of training examples.
  • the parameters of the generator neural network are updated in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image.
  • the update may be performed after each iteration of operations 4.1 to 4.5. Alternatively, the update may be performed after a number of iterations of operations 4.1 to 4.5.
  • the generator loss function may comprise one or more expectation values taken over the ensemble of training examples.
  • the parameters of the generator neural network may also be updated in dependence on a comparison of the target facial image with the output facial image.
  • the parameters of the discriminator neural network are updated in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification.
  • the update may be performed after each iteration of operations 4.1 to 4.5. Alternatively, the update may be performed after a number of iterations of operations 4.1 to 4.5.
  • the discriminator loss function may comprise one or more expectation values taken over the ensemble of training examples.
  • the parameters of the generator neural network and/or discriminator neural network may be updated using an optimisation procedure applied to the generator loss function and/or discriminator loss function respectively.
  • Examples of an optimisation procedure include, but are not limited to, gradient descent methods and/or gradient-free methods.
  • The training process may be repeated until a threshold condition is met. The threshold condition may be, for example, a threshold number of training epochs.
  • the threshold number of training epochs may lie in the range 50-150, such as 70-90, for example 80.
  • a predetermined number of epochs may be used for each training set. For example, a first number of training epochs may be performed on the training set K, followed by a second number of training epochs on training set L.
  • the number of training epochs for each training set may lie in the range 20-100, such as 30-50, for example 40.
  • the training set K (or some subset of it) may be used in one or more of the iterations.
  • the training set L (or some subset of it) may be used in one or more of the other iterations.
  • the training sets used in each iteration may have a predefined batch size.
  • the batch size may, for example lie in the range 5-100, such as between 10 and 20, for example 16.
  • Figures 5a and 5b show an example structure of a generator neural network 500 for generating a facial image with a given expression and an example structure of a discriminator neural network 512 for predicting the expression parameters and calculating an adversarial loss for a facial image.
  • FIG. 5a shows an example structure of a generator neural network 500.
  • the generator neural network 500 is a neural network configured to process an input facial image 502 and expression parameters 504 to generate an output facial image 508.
  • the output facial image 508 corresponds to the input facial image 502, but with an expression dependent on the expression parameters 504.
  • The generator neural network 500 may generate an attention mask 520 (also referred to as a smooth deformation mask).
  • the mask 520 may have the same spatial dimension as the input facial image 502.
  • the generator neural network 500 may also generate a deformation image, 522.
  • the output facial image 508 may then be generated as a combination of the input facial image 502 and the generator output 522.
  • the generator neural network 500 comprises a plurality of layers of nodes 506, each node associated with one or more parameters.
  • the parameters of each node of the neural network may comprise one or more weights and/or biases.
  • the nodes take as input one or more outputs of nodes in the previous layer.
  • the one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network.
  • One or more of the layers of the generator neural network 500 may be convolutional layers.
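  • For illustration, a minimal convolutional generator might tile the expression parameters over the spatial dimensions, concatenate them with the image channels, and output a deformation image and an attention mask; the layer widths and kernel sizes below are arbitrary assumptions, not taken from the specification.

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        """Illustrative convolutional generator producing a deformation image and a mask."""
        def __init__(self, n_params: int, base: int = 64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(3 + n_params, base, 7, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
            )
            self.deformation = nn.Conv2d(base, 3, 7, padding=3)  # deformation image head
            self.attention = nn.Conv2d(base, 1, 7, padding=3)    # attention mask head

        def forward(self, image: torch.Tensor, params: torch.Tensor):
            # Tile the expression parameters over the spatial dimensions and
            # concatenate them with the image channels.
            b, _, h, w = image.shape
            tiled = params.view(b, -1, 1, 1).expand(b, params.size(1), h, w)
            features = self.body(torch.cat([image, tiled], dim=1))
            mask = torch.sigmoid(self.attention(features))       # values constrained to [0, 1]
            deformation = torch.tanh(self.deformation(features))
            return deformation, mask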
  • Figure 5b shows an example structure of a discriminator neural network 512.
  • The discriminator neural network 512 is a neural network configured to process a facial image to output a classification 516.
  • The discriminator neural network 512 may, in some embodiments, further output a set of predicted expression parameters 518.
  • the predicted expression parameters 518 may be determined by one or more regression layers (not shown) .
  • The regression layers may be in parallel to the other layers 514 of the discriminator neural network 512.
  • determination of the predicted expression parameters 518 may be performed as part of the other layers 514 of the discriminator neural network 512.
  • the discriminator neural network 512 comprises a plurality of layers of nodes 514, each node associated with one or more parameters.
  • the parameters of each node of the neural network may comprise one or more weights and/or biases.
  • the nodes take as input one or more outputs of nodes in the previous layer.
  • the one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network.
  • One or more of the layers of the discriminator neural network 512 may be convolutional layers.
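  • Similarly, a minimal discriminator with a patch classification output and a parallel regression head for the expression parameters could look as follows; again the layer sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class Discriminator(nn.Module):
        """Illustrative patch discriminator with a parallel expression-regression head."""
        def __init__(self, n_params: int, base: int = 64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, base, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
                nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
            )
            self.classification = nn.Conv2d(base * 2, 1, 3, padding=1)  # per-patch realism
            self.regression = nn.Sequential(                            # expression parameters
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(base * 2, n_params))

        def forward(self, image: torch.Tensor):
            features = self.features(image)
            return self.classification(features), self.regression(features)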
  • Figure 6 shows an overview of an example method 600 of predicting the expression parameters 606 of a facial image 602 using a trained discriminator neural network 604.
  • the discriminator neural network may, for example, be trained by the methods described in relation to Figure 2.
  • the input image 602 comprises a set of pixel values in a two-dimensional array. For example, a colour image may be represented as $I \in \mathbb{R}^{H \times W \times 3}$, where H is the height of the image in pixels, W is the width of the image in pixels and the image has three colour channels (i.e. RGB).
  • the input image 602 may, in some embodiments, be in black-and-white/greyscale.
  • Expression parameters 606 are a set of continuous variables that encode a facial expression (also referred to herein as an expression vector). Additionally, the expression parameters may encode facial deformations that occur as a result of speech. Expression parameters 606 may be represented by an N-dimensional vector, e.g. $p_i = [p_{i,1}, p_{i,2}, \ldots, p_{i,N}]^T$. The expression parameters may correspond to parameters of a 3D facial model, such as a 3DMM and/or a linear 3D blendshape model, as described above.
  • the trained discriminator neural network may be trained to minimise an expression loss function, so as to accurately regress the expression parameters 606 to correspond with the input image 602.
  • Discriminator and generator neural networks trained by the methods disclosed herein may be used for expression transfer, i.e. the transfer of an expression from a source facial image to a target facial image.
  • a trained generator neural network may process the expression parameters from the source facial image, and the target facial image.
  • the expression parameters from the source facial image may be extracted by a trained discriminator neural network 604.
  • the output of the trained generator neural network may be a facial image depicting the identity and other elements of the target facial image but with the expression depicted in the source facial image (an illustrative code sketch of this expression-transfer pipeline is given after this list).
  • Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein.
  • the system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system.
  • the apparatus (or system) 700 comprises one or more processors 702.
  • the one or more processors control operation of other components of the system/apparatus 700.
  • the one or more processors 702 may, for example, comprise a general purpose processor.
  • the one or more processors 702 may be a single core device or a multiple core device.
  • the one or more processors 702 may comprise a Central Processing Unit (CPU) or a graphical processing unit (GPU) .
  • the one or more processors 702 may comprise specialised processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be included.
  • the system/apparatus comprises a working or volatile memory 704.
  • the one or more processors may access the volatile memory 704 in order to process data and may control the storage of data in memory.
  • the volatile memory 704 may comprise RAM of any type, for example Static RAM (SRAM) , Dynamic RAM (DRAM) , or it may comprise Flash memory, such as an SD-Card.
  • the system/apparatus comprises a non-volatile memory 706.
  • the non-volatile memory 706 stores a set of operation instructions 708 for controlling the operation of the processors 702 in the form of computer readable instructions.
  • the non-volatile memory 706 may be a memory of any kind such as a Read Only Memory (ROM) , a Flash memory or a magnetic drive memory.
  • the one or more processors 702 are configured to execute operating instructions 708 to cause the system/apparatus to perform any of the methods described herein.
  • the operating instructions 708 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 700, as well as code relating to the basic operation of the system/apparatus 700.
  • the one or more processors 702 execute one or more instructions of the operating instructions 708, which are stored permanently or semi-permanently in the non-volatile memory 706, using the volatile memory 704 to temporarily store data generated during execution of said operating instructions 708.
  • Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
  • These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 7, cause the computer to perform one or more of the methods described herein.
  • Any system feature as described herein may also be provided as a method feature, and vice versa.
  • means-plus-function features may be expressed alternatively in terms of their corresponding structure.
  • method aspects may be applied to system aspects, and vice versa.
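By way of illustration only, the expression-transfer use described in the list above (extracting expression parameters from a source facial image with a trained discriminator neural network and applying them to a target facial image with a trained generator neural network) may be sketched as follows. This is a minimal sketch, assuming hypothetical callables `generator(image, params)` and `discriminator(image)`, the latter returning a patch classification together with regressed expression parameters; none of the names are taken from the figures.

```python
# Minimal expression-transfer sketch (hypothetical interfaces, illustrative only).
import torch

def transfer_expression(generator, discriminator, source_image, target_image):
    """Apply the expression of `source_image` to the face in `target_image`.

    Both images are float tensors of shape (1, 3, H, W). `discriminator` is
    assumed to return (patch_classification, predicted_expression_parameters);
    `generator` is assumed to map (image, expression_parameters) to an image.
    """
    with torch.no_grad():
        # Regress the continuous expression parameters of the source face.
        _, source_params = discriminator(source_image)
        # Condition the generator on the target face and the source expression.
        return generator(target_image, source_params)
```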

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Methods for generating facial images with a given facial expression using neural networks, and methods of training neural networks for generating facial images with a given facial expression are provided. A method of training a generator neural network to generate a facial image with a target expression from a facial image and a set of target expression parameters, wherein the training comprises: receiving training data comprising a plurality of training examples, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters; applying the generator neural network to an input facial image and a corresponding set of target parameters to generate an output facial image; applying the generator neural network to the first output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image; applying the discriminator neural network to the input facial image to generate a first predicted classification; applying the discriminator neural network to the output facial image to generate a second predicted classification; and updating parameters of the generator neural network in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image; and/or updating parameters of the discriminator neural network in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification.

Description

Facial Image Processing Field of the Invention
This specification relates to methods for generating facial images with a given facial expression using neural networks, and methods of training neural networks for generating facial images with a given facial expression.
Background
Image-to-image translation is a ubiquitous problem in image processing, in which an input image of a source domain is transformed to a synthetic image of a target domain, where the synthetic image maintains some properties of the original input image. Examples of image-to-image translation include converting images from black-and-white to colour, turning daylight scenes into night-time scenes, increasing the quality of images and/or manipulating facial attributes of an image. However, many methods of performing image-to-image translation require aligned image pairs.
Methods of performing image-to-image translation in the absence of aligned images are limited in that they may not generate synthetic samples of sufficient quality when the input image is recorded under unconstrained conditions (e.g. in the wild) . Additionally, the target domain is usually limited to be one out of a discrete set of domains. Hence, these methods are limited in their capabilities for generating facial images with a given facial expression.
Summary
According to a first aspect of this disclosure, there is described a method of training a generator neural network to generate a facial image with a target expression from a facial image and a set of target expression parameters, wherein the training comprises: receiving training data comprising a plurality of training examples, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters; applying the generator neural network to an input facial image and a corresponding set of target parameters to generate an output facial image; applying the generator neural network to the first output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image; applying the discriminator neural network to the input facial image to generate a first predicted classification; applying the discriminator neural network to the output facial image to generate a second predicted classification; and updating parameters of the generator neural network in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image; and/or updating parameters of the discriminator neural network in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification.
The first classifier and/or the second classifier may comprise a probability distribution over image patches in the input facial image and/or output facial  image indicative of whether the image patches are real or synthetic. The generator neural network and/or discriminator neural network may comprise one or more convolutional layers.
The set of initial expression parameters and/or the set of target expression parameters may correspond to continuous parameters of a linear three-dimensional blendshape model.
One or more of the training examples may further comprise a target facial image corresponding to the set of target expression parameters and the generator loss function may further depend on a comparison of the target facial image with the output facial image when using training examples comprising a target facial image. One or more of the training examples may comprise a target facial image generated synthetically.
The method may further comprise: applying a face recognition module to the input facial image to generate an input embedding; and applying the face recognition module to the output facial image to generate an output embedding, and the generator loss function may further depend on a comparison of the input embedding with the output embedding. The face recognition module may comprise a pre-trained neural network.
The generator loss function and/or discriminator loss function may comprise a gradient penalty term based on a gradient of the second classifier with respect to the input facial image.
Updating parameters of the generator neural network and/or parameters of the discriminator neural network may be performed using backpropagation.
The discriminator neural network may comprise a regression layer, and the discriminator neural network may further generate a set of predicted expression parameters from the input facial image and/or the output facial image. The discriminator loss function may further depend on a comparison of the set of predicted expression parameters for an input image with the set of initial expression parameters for the input image.
Updating parameters of the generator neural network may further be based on a comparison of the set of predicted expression parameters for an output image with the set of target expression parameters used to generate the output image.
The generator neural network may further generate an attention mask from the input facial data and a reconstructed attention mask from the first output facial image and a corresponding set of initial expression parameters. Updating parameters of the generator neural network may further be in dependence on a comparison of the attention mask and the reconstructed attention mask. The generator neural network may further generate a deformation image from the input facial data. The output facial image may be generated by combining the deformation image, the attention mask and the input facial image. The reconstructed facial image may be generated by combining a reconstructed deformation image (generated by the generator neural network from the output facial image) , the reconstructed attention mask and the output facial image
The discriminator neural network may be a relativistic discriminator.
According to a further aspect of this disclosure, there is described a method of generating a target image from an input image and a set of target expression parameters, the method comprising: receiving the set of target expression parameters, the target expression parameters taken from a continuous range of target expression parameters; receiving the input image; applying a generator neural network to the input image and the set of target expression parameters to generate the target image, wherein the generator neural network is trained according to the method of any preceding claim.
According to a further aspect of this disclosure, there is described a method of determining a set of expression parameters from an input facial image, the method comprising: receiving an input image; and applying a discriminator neural network to the input image to generate the set of expression parameters, wherein the discriminator neural network has been trained according to the methods described above.
Brief Description of the Drawings
Embodiments will now be described by way of non-limiting examples with reference to the accompanying drawings, in which:
Figure 1 shows an overview of an example method of generating a facial image with a given expression using a trained generator neural network;
Figure 2 shows an overview of an example method of training a generator neural network for generating a facial image with a given expression;
Figure 3 shows an overview of a further example method of training a generator neural network for generating a facial image with a given expression;
Figure 4 shows a flow diagram of an example method of training a generator neural network for generating a facial image with a given expression;
Figures 5a and 5b show an example structure of a generator neural network for generating a facial image with a given expression and an example structure of a discriminator neural network for predicting the expression parameters and calculating an adversarial loss for a facial image;
Figure 6 shows an overview of an example method of predicting the expression parameters of a facial image using a trained discriminator neural network; and
Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein.
Detailed Description
Example implementations provide system (s) and methods for generating facial images with a given facial expression defined by a set of expression parameters, and methods of determining facial expression parameters of a facial model from an input facial image.
Interactive editing of facial expressions has wide ranging applications including, but not limited to, post-production of movies, computational photography, and face recognition. Improving editing of facial expressions may improve systems directed at these applications. For example, face recognition systems using the system (s) and method (s) described herein may be able to more realistically  neutralise the expression of a user’s face, and thus verify the user’s identity under less constrained environments than systems and methods in the state of the art.
Moreover, methods and systems disclosed herein may utilise continuous expression parameters, in contrast to discrete expression parameters which use, for example, labels indicating an emotion, such as “happy” , “fearful” , or “sad” . The expression parameters may correspond with expression parameters used by models to vary expression when generating three-dimensional face meshes. These models may include, for example, linear 3D blendshape models. These models may also be able to represent any deformation in facial images that occur, for example deformations as a result of speech. Therefore, it will be appreciated that reference to expression parameters herein refer to any parameter that captures variation in facial images caused by a deformation to the face in the facial image.
Consequently, methods disclosed herein may provide a finer level of control when editing the expressions of facial images than methods using discrete expression parameters. Using the same expression parameters as a 3D facial mesh models may allow fine-tuning of the desired target output by editing an input facial image to depict any expression capable of being produced by the 3D facial mesh model. The target expression may be fine-tuned by modifying the continuous expression parameters. The desired target expressions may be  previewed through rendering a 3D facial mesh with the selected expression parameters, prior to generating the edited facial image.
Figure 1 shows an overview of an example method of generating a facial image with a given expression using a trained generator neural network. The method 100 takes an input facial image 102 and a set of target expression parameters 104 as input and forms an output facial image 108 as output using a trained generator neural network 106.
The input facial image 102, I org, is an image comprising one or more faces. The facial expression of each face in the input image is described by an initial set of expression parameters, p org. The input facial image 102 is processed by the trained generator neural network 106 in order to produce an output facial image 108, I gen, comprising one or more faces corresponding to faces in the input image 102, but with an expression described by the target expression parameters 104, p trg.
The output facial image 108 generated by the trained generator neural network 106 may retain many aspects of the input facial image including identity, angle, lighting, and background elements, while changing the expression of the face to correspond with the target expression parameters 104. In some embodiments where the expression parameters additionally or alternatively control the variation in facial deformations that occur as a result of speech, the trained generator neural network 106 may produce an output facial image 108 with  different mouth and lip motion to the input facial image 102. In this way, the trained generator neural network 106 may be used for expression and/or speech synthesis.
The input facial image 102 and/or output facial image 108 comprise a set of pixel values in a two-dimensional array. For example, a colour image may be represented as $I \in \mathbb{R}^{H \times W \times 3}$, where H is the height of the image in pixels, W is the width of the image in pixels and the image has three colour channels (e.g. RGB or CIELAB). The facial images 102, 108 may, in some embodiments, be in black-and-white/greyscale.
Expression parameters (also referred to herein as an expression vector) are a set of continuous variables that encode a facial expression. Additionally, the expression parameters may encode facial deformations that occur as a result of speech. Expression parameters may be represented by an N-dimensional vector, e.g. p i = [p i, 1, p i, 2 …p i, NT. The expression parameters may correspond to parameters of a 3D model, such as a linear 3D blendshape model, as described below. The target expression parameters 104 encode a target facial expression of the output image 108 (i.e. the desired facial expression of the output facial image 108) . The target expression parameters are typically different to initial expression parameters associated with the input facial image 102. The target expression parameters 104 are processed by the trained generator neural network 106 along with the input facial image 102 in order to produce the output facial image 108.
The initial expression and/or target expression parameters 104 may correspond with the expression parameters used by models to vary expression when generating 3D face meshes. The expression parameters may be regularised to lie within a predefined range. For example, the expression parameters may be regularised to lie in the range [-1, 1] . In some embodiments, a zero expression parameter vector (i.e. a set of expression parameters where each parameter is set to 0) is defined that corresponds to a neutral expression. For example, the zero expression parameter vector may correspond to a three-dimensional model with a neutral expression. In some embodiments, the magnitude of an expression parameter may correspond with an intensity of a corresponding expression depicted in the 3D facial mesh.
The trained generator neural network 106 is a neural network configured to process the input facial image 102 and target expression parameters 104 to generate an output facial image 108. In some embodiments, instead of generating the output image 108 directly, the generator neural network 106 may generate an attention mask $A \in [0, 1]^{H \times W}$ 112 (also referred to as a smooth deformation mask). The mask 112 may have the same spatial dimension as the input facial image 102. In these embodiments, the generator neural network may also generate a deformation image $C \in \mathbb{R}^{H \times W \times 3}$ 110. The output facial image 108 may then be generated as a combination of the input facial image 102 and the generator output 110. By being configured to output an attention mask 112, the trained generator neural network 106 may determine which regions of the input facial image 102 should be modified in order to generate an output facial image 108 corresponding to the target expression parameters 104.
The trained generator neural network 106 comprises a plurality of layers of nodes, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the trained generator neural network 106 may be convolutional layers. Examples of generator neural network architectures are described below in relation to Figure 5a.
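A minimal sketch of such a convolutional generator is given below, assuming the attention-mask formulation described above: the expression vector is tiled over the spatial dimensions and concatenated with the input image, and the network outputs a deformation image and an attention mask that are combined with the input. The layer sizes, activations and two-headed output are illustrative assumptions and are not the architecture of Figure 5a.

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """Illustrative generator: image + expression vector -> edited image."""

    def __init__(self, num_params: int = 30, base_channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 + num_params, base_channels, kernel_size=7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(base_channels, base_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Two heads: a 3-channel deformation image C and a 1-channel attention mask A.
        self.deformation_head = nn.Conv2d(base_channels, 3, kernel_size=7, padding=3)
        self.attention_head = nn.Conv2d(base_channels, 1, kernel_size=7, padding=3)

    def forward(self, image: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
        b, _, h, w = image.shape
        # Tile the N expression parameters into a (B, N, H, W) map and concatenate.
        param_map = params.view(b, -1, 1, 1).expand(-1, -1, h, w)
        features = self.body(torch.cat([image, param_map], dim=1))
        deformation = torch.tanh(self.deformation_head(features))
        # Sigmoid keeps the mask values between zero and one.
        attention = torch.sigmoid(self.attention_head(features))
        # Attended regions come from the deformation image; the rest is copied
        # from the input image.
        return attention * deformation + (1.0 - attention) * image

# Example usage:
# generator = GeneratorSketch(num_params=30)
# output = generator(torch.rand(1, 3, 128, 128), torch.zeros(1, 30))
```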
The parameters of the trained generator neural network 106 may be trained using generative adversarial training, and the trained generator neural network 106 may therefore be referred to as a Generative Adversarial Network (GAN) . The trained generator neural network 106 may be the generator network of the generative adversarial training. Generator neural networks 106 trained in an adversarial manner may output more realistic images than other methods. Examples of training methods are described below in relation to Figures 2 and 4.
Figure 2 shows an overview of an example method of training a generator neural network for generating a facial image with a given expression. The method 200  comprises jointly training a generator neural network 208 and a discriminator neural network 216. The generator neural network 208 is trained using a generator loss function 214. The discriminator neural network 216 is trained using a discriminator loss function 222.
The objective of the generator neural network 208 during training is to learn to generate realistic output facial images 210 with a given target expression, which is defined by the target expression parameters 206. The generator loss function 214 is used to update parameters of the generator neural network 208 during training with the aim of producing more realistic output facial images 210. The objective of the discriminator neural network 216 during training is to learn to distinguish between real facial images 202 and output/generated facial images 210. The discriminator loss function 222 is used to update the discriminator neural network 216 with the aim of better discriminating between output (i.e. “fake” ) facial images 210 and input (i.e. “real” ) facial images 202.
The objective of the discriminator neural network during training is to learn to distinguish between real facial images 202 and output/generated facial images 210. Additionally, the discriminator neural network may include a regression layer which estimates expression parameter vectors for an image, which may be a real image or a generated image. This regression layer may be parallel to a classification layer which may output probabilities over image patches in the image indicative of whether the image is real or synthetic. In some embodiments, the discriminator neural network 216 may be a relativistic  discriminator, generating predicted classifications indicating the probability of an image being relatively more realistic than a generated one. Example structures of the discriminator neural network are described below, in relation to Figure 5b.
During the training process, the generator neural network 208 and the discriminator neural network compete against each other until they reach a threshold/equilibrium condition. For example, the generator neural network 208 and the discriminator neural network compete with each other until the discriminator neural network can no longer distinguish between real and synthetic facial images. Additionally or alternatively, the adversarial loss calculated by the discriminator neural network may be used to determine when to terminate the training procedure. For example, the adversarial loss of the discriminator neural network may correspond with a distance metric between the distributions of real images and generated images. The training procedure may be terminated when this adversarial loss reaches a threshold. Alternatively, the training procedure may be terminated after a fixed number of training epochs, for example 80 epochs.
The training uses a set of training data (also referred to herein as a training set). The training set may comprise a set of K training examples of the form $\{(I_{org}^{(k)}, p_{org}^{(k)}, p_{trg}^{(k)})\}_{k=1}^{K}$, where $I_{org}$ is an input facial image 202, $p_{org}$ is an initial expression parameter vector 204, and $p_{trg}$ is a target expression parameter vector 206. The initial expression parameters 204 correspond to the expression of the input facial image 202. For example, a 3D facial mesh model which is configured to receive expression parameters may depict the same expression as the expression of the input facial image 202 when instantiated with the initial expression parameters 204.
In some embodiments, the training set may additionally or alternatively comprise a set of L training examples of the form $\{(I_{org}^{(l)}, p_{org}^{(l)}, p_{trg}^{(l)}, I_{trg}^{(l)})\}_{l=1}^{L}$, where $I_{org}$, $p_{org}$, and $p_{trg}$ are defined as above, and $I_{trg}$ is a known target facial image. The target expression parameters 206 may correspond with the target facial image. The method of training the generator neural network 208 where at least one example of the training set includes a target facial image is described below in more detail with respect to Figure 4.
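The two kinds of training example described above (the unpaired set K and the paired set L) might be represented as follows; the field names are illustrative assumptions rather than terminology from this description.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class TrainingExample:
    """One training example; `target_image` is present only for the paired set L."""
    input_image: torch.Tensor                    # I_org, shape (3, H, W)
    initial_params: torch.Tensor                 # p_org, shape (N,)
    target_params: torch.Tensor                  # p_trg, shape (N,)
    target_image: Optional[torch.Tensor] = None  # I_trg, paired examples only

def is_paired(example: TrainingExample) -> bool:
    # Paired examples (set L) carry a ground-truth target facial image.
    return example.target_image is not None
```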
During the training process, the generator neural network 208 is applied to an input facial image 202, $I_{org}$, and a corresponding set of target expression parameters 206, $p_{trg}$, taken from a training example included in the training data. The output of the generator neural network 208 is a corresponding output facial image 210, $I_{gen}$. The output facial image 210 may be represented symbolically as $I_{gen} = G(I_{org} \mid p_{trg})$, where $G(\cdot \mid \cdot)$ represents the mapping performed by the generator neural network 208.
In some embodiments, instead of generating the output image directly, the generator neural network 208 may generate an attention mask $A \in [0, 1]^{H \times W}$ (also referred to as a smooth deformation mask). The mask may have the same spatial dimension as the input facial image 202. In these embodiments, the generator neural network may also generate a deformation image, $C \in \mathbb{R}^{H \times W \times 3}$. The output facial image 210 may then be generated as a combination of the input facial image 202 and the generator output. For example, the output facial image 210 may be given by:
$$I_{gen} = A \odot C + (1 - A) \odot I_{org},$$
where $\odot$ denotes element-wise multiplication, with the mask broadcast over the colour channels.
The values of the mask may be constrained to lie between zero and one. This may be implemented, for example, by the use of a sigmoid activation function.
In some embodiments, the input facial image 202 and the generated facial image 210 may additionally be processed by a face recognition module (not shown) . The input facial image 202 may be processed by the face recognition module in order to produce an input embedding. Similarly, the output/generated facial image 210 may be processed by the face recognition module in order to produce an output embedding. The input/output embeddings may represent the identity of faces in facial images. The face recognition module may be a component of the loss function 214, as described below in further detail.
The generator neural network 208 is then applied to the output facial image 210, and the set of initial expression parameters 204, p org, taken from the same training example of the training data. In other words, the initial expression parameters 204 correspond with the expression depicted in the input facial image 202. The output of the generator neural network in this case is a  reconstructed facial image 212, I rec. Similarly, as before, this may be represented symbolically by the mapping
$I_{rec} = G(I_{gen} \mid p_{org})$.
In some embodiments, the generator neural network 208 may generate a mask $A_{rec} \in [0, 1]^{H \times W}$ (also referred to herein as an attention mask) along with a deformation image $C_{rec} \in \mathbb{R}^{H \times W \times 3}$. The mask may have the same spatial dimension as the input facial image 202. The reconstructed facial image 212 may then be generated as a combination of the generated facial image 210 and the generator output. For example, the reconstructed facial image 212 may be given by
$$I_{rec} = A_{rec} \odot C_{rec} + (1 - A_{rec}) \odot I_{gen}.$$
The generator neural network 208 is configured to process the output facial image 210 and the corresponding set of initial expression parameters 204 in an attempt to produce a reconstructed facial image 212 that replicates the input facial image 202. The use of this step during training of the generator neural network 208 can result in the contents of the input facial image 202 (e.g. background elements, presence of glasses and other accessories) being retained when generating output facial images 210 with a trained generator neural network 208.
The discriminator neural network 216 is applied to the input facial image 202 to generate a first predicted classification 218, $C(I_{org})$. The discriminator neural network is also applied to the output facial image 210 to generate a second predicted classification 220, $C(I_{gen})$. The predicted classifications may be represented as:
$$C(I) = \sigma\big(D(I)\big),$$
with $D(\cdot)$ representing the processing of the facial image by the discriminator neural network 216 prior to the classification layer and $\sigma$ representing the action of the classification layer (for example, an activation function). These classifications 218, 220 may comprise a probability distribution over image patches in the input facial image 202 and/or the output facial image 210 indicative of whether the image patches are real or synthetic.
In some embodiments, the discriminator neural network 216 may be a relativistic discriminator. The relativistic discriminator generates predicted classifications indicative of the probability of an image being relatively more realistic than a generated one. Examples of relativistic discriminators can be found in “The relativistic discriminator: a key element missing from standard GAN” (Alexia Jolicoeur-Martineau, arXiv: 1807.00734), the contents of which are incorporated by reference. For example, the predicted classifications may be represented by:
$$\tilde{C}(I_{org}, I_{gen}) = \sigma\big(D(I_{org}) - D(I_{gen})\big).$$
The relativistic discriminator may be a relativistic average discriminator, which averages over a mini-batch of generated images. The classifications generated by a relativistic average discriminator may be dependent on whether the input to the relativistic average discriminator is a real facial image or a generated facial image. For example, the classifications $\tilde{C}(I)$ for a real image $I$ may be given by:
$$\tilde{C}(I) = \sigma\big(D(I) - \overline{D}_{gen}\big),$$
and for generated images, $I$, by:
$$\tilde{C}(I) = \sigma\big(D(I) - \overline{D}_{real}\big),$$
where $\overline{D}_{gen}$ and $\overline{D}_{real}$ are averages of $D(\cdot)$ over the generated images and the real images in a mini-batch respectively.
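A sketch of the relativistic average classification is given below, assuming the raw (pre-activation) discriminator scores D(I) have already been computed for a mini-batch of real images and a mini-batch of generated images.

```python
import torch

def relativistic_average_logits(real_scores: torch.Tensor, fake_scores: torch.Tensor):
    """Relativistic average discriminator outputs, before the sigmoid.

    `real_scores` and `fake_scores` are raw discriminator outputs D(I) for
    mini-batches of real and generated images respectively.
    """
    real_rel = real_scores - fake_scores.mean()   # real vs. the average generated score
    fake_rel = fake_scores - real_scores.mean()   # generated vs. the average real score
    return real_rel, fake_rel

# The classifications described above are then torch.sigmoid(real_rel)
# and torch.sigmoid(fake_rel).
```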
In some embodiments, in addition to generating these classifications 218, 220, the discriminator neural network 216 may also output a set of predicted expression parameters from the input facial image 202 and/or the output facial image 210. These predicted parameters may be represented by the mapping $I \mapsto \hat{p}(I)$ for an image $I$. The discriminator neural network 216 may include one or more regression layers to determine this mapping. These one or more regression layers may be parallel to a classification layer which may output probabilities over image patches in the image indicative of whether the image is real or synthetic. Example structures of the discriminator neural network 216 are described below, in relation to Figure 5b.
The input facial image 202, the reconstructed facial image 212, the first predicted classification 218 and second predicted classification 220 are used to calculate a generator loss function 214. Calculating the generator loss function 214 may comprise comparing the input facial image 202 and the reconstructed facial image 212. The generator loss function 214 may include a plurality of component loss functions, as will be described in greater detail below. The generator loss is used to update the parameters of the generator neural network 208. The parameters of the generator neural network 208 may be updated using an optimisation procedure that aims to optimise the generator loss function 214.
The optimisation procedure may, for example, be a gradient descent algorithm. The optimisation procedure may use backpropagation, e.g. by backpropagating the generator loss function 214 to the parameters of the generator neural network 208.
The first predicted classification 218 and second predicted classification 220 are used to calculate a discriminator loss function 222. In some embodiments, a comparison of the initial expression parameters 204 and the predicted expression parameters of the input image 202 (not shown) may also be used to determine the discriminator loss function 222. The discriminator loss function 222 may include a plurality of component loss functions, as will be described in greater detail below. The discriminator loss function 222 is used to update the parameters of the discriminator neural network 216. The parameters of the discriminator neural network 216 may be updated using an optimisation procedure that aims to optimise the discriminator loss function 222. The optimisation procedure may, for example, be a gradient descent algorithm. The optimisation procedure may use backpropagation, e.g. by backpropagating the discriminator loss function 222 to the parameters of the discriminator neural network 216.
In some embodiments, the generator loss function 214 and/or discriminator loss function 222 may comprise an adversarial loss function. An adversarial loss function depends on the first predicted classification and the second predicted classification. The adversarial loss function may further comprise a gradient  penalty term. The gradient penalty term may be based on a gradient of the second classification with respect to the input facial image. Adversarial loss functions may correspond with a distance metric between the distributions of the generated images and the real images.
An example of such an adversarial loss function is given by:
$$\mathcal{L}_{adv} = \mathbb{E}_{I_{org}}\big[D(I_{org})\big] - \mathbb{E}_{I_{org}, p_{trg}}\big[D\big(G(I_{org} \mid p_{trg})\big)\big] - \lambda_{gp}\, \mathbb{E}_{I_{gen}}\Big[\big(\lVert \nabla_{I_{gen}} D(I_{gen}) \rVert_2 - 1\big)^2\Big],$$
where $\mathbb{E}_{I_{org}}$ indicates an expectation value taken over a set of input facial images 202, $\mathbb{E}_{I_{org}, p_{trg}}$ indicates an expectation value taken over a set of input facial images and corresponding target expression parameters, and $\mathbb{E}_{I_{gen}}$ indicates an expectation value taken over a set of generated facial images 210. Expectation operators may be approximated by sampling from the respective distributions from which the expectations are taken, performing the calculation inside the expectation operator, and then dividing by the number of samples taken. In some embodiments, $D(\cdot)$ may be replaced by the relativistic discriminator, $\tilde{C}(\cdot)$, described above.
The term $\lambda_{gp}\, \mathbb{E}_{I_{gen}}\big[\big(\lVert \nabla_{I_{gen}} D(I_{gen}) \rVert_2 - 1\big)^2\big]$ in the equation above may be referred to as a gradient penalty term. A gradient penalty term may be based on a gradient of the second classification with respect to the input facial image. While the norm in this example is an $L_2$ norm, it will be appreciated that alternative norms may be used. A coefficient $\lambda_{gp}$ may control the contribution of the gradient penalty term to the adversarial loss function. For example, this may be set to a number in the range 1-100, such as between 5 and 15, for example 10.
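A sketch of how such a gradient penalty term may be computed with automatic differentiation is given below. The images at which the gradient is evaluated are left to the caller (for example the generated images, or interpolates between real and generated images as in WGAN-GP), and the discriminator is assumed here to return only its patch scores; these choices are illustrative assumptions.

```python
import torch

def gradient_penalty(discriminator, images: torch.Tensor) -> torch.Tensor:
    """Average of (||grad_x D(x)||_2 - 1)^2 over the batch of `images`."""
    images = images.detach().requires_grad_(True)
    scores = discriminator(images)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=images,
                                create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

# The penalty is scaled by a coefficient lambda_gp (for example 10) inside the
# adversarial loss.
```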
It will be appreciated that other adversarial loss functions may be used, wherein the adversarial loss function depends on an output of a discriminator neural network 216. A generator neural networks 208 trained with adversarial loss functions may generate synthesised images with greater photorealism than when trained with other loss functions. In some embodiments, when the discriminator neural network 216 is a relativistic discriminator, both real and generated images are included in the generator part of the adversarial loss function. This may allow the generator to benefit by the gradients of both real and fake images, generating output facial images 210 with sharper edges and more detail which also better represent the distribution of the real data.
An example of a generator loss function 214 may then be given by:
$$\mathcal{L}_{G} = \mathcal{L}_{adv}.$$
An example of a discriminator loss function 222 may then be given by:
$$\mathcal{L}_{D} = -\mathcal{L}_{adv}.$$
One or more additional loss functions may be included in the generator loss function 214 and/or discriminator loss function, as described below.
In some embodiments, the generator loss function 214 may further comprise a reconstruction loss function. The reconstruction loss function provides a measure of the difference between the input facial image 202 and the reconstructed facial image 212. The reconstruction loss function is based on a comparison between the input facial image 202 and the reconstructed facial image 212. For an input facial image 202 with width W and height H, an example of a reconstruction loss function is given by:
$$\mathcal{L}_{rec} = \frac{1}{HW} \lVert I_{org} - I_{rec} \rVert_1,$$
where $I_{rec}$ is the reconstructed facial image 212, and can be represented symbolically as $I_{rec} = G\big(G(I_{org} \mid p_{trg}) \mid p_{org}\big)$.
An example of a generator loss function 214 incorporating the reconstruction loss is given by
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{rec}\, \mathcal{L}_{rec},$$
where $\lambda_{rec}$ controls the contribution of the reconstruction loss function to the generator loss function. For example, $\lambda_{rec}$ may be set to a number in the range 1-100, such as between 5 and 15, for example 10.
Generator neural networks 208 trained using reconstruction loss functions may generate output facial images 210 which better preserve the contents of the input facial image 202 than when trained without reconstruction loss functions. For example, background elements and the presence of accessories (e.g. sunglasses, hats, jewellery) in the input facial image 202 may be retained in the output facial image 210. Other reconstruction loss functions may be used. For example, different norms may be used in the calculation instead of the L1 norm, or the images may be pre-processed before calculating the reconstruction loss.
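A sketch of the reconstruction (cycle) computation is given below, assuming a generator callable of the form generator(image, expression_parameters) and the L1 comparison of the example above.

```python
import torch.nn.functional as F

def reconstruction_loss(generator, input_image, initial_params, target_params):
    """Edit to the target expression, edit back, and compare with the input."""
    generated = generator(input_image, target_params)      # I_gen = G(I_org | p_trg)
    reconstructed = generator(generated, initial_params)   # I_rec = G(I_gen | p_org)
    # Mean absolute (L1) difference between the input and its reconstruction.
    return F.l1_loss(reconstructed, input_image)
```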
In some embodiments, the generator loss function 214 may further comprise an attention mask loss function. The attention mask loss function compares the attention mask generated from the input facial data 202 to a reconstructed attention mask generated from the first output facial image 212 and a corresponding set of initial expression parameters 204. The attention mask loss may encourage the generator to produce attention masks which are sparse and do not saturate to 1. The attention mask loss may minimise the L1 norm of the produced masks for both the generated and reconstructed images. For an input facial image 202 with width W and height H, an example of an attention mask loss function is given by
$$\mathcal{L}_{att} = \frac{1}{HW} \big( \lVert A \rVert_1 + \lVert A_{rec} \rVert_1 \big).$$
It will be appreciated that norms other than an L1 may alternatively be used.
An example of a generator loss function 214 incorporating the attention mask loss is given by
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{att}\, \mathcal{L}_{att},$$
where $\lambda_{att}$ controls the contribution of the attention mask loss function to the generator loss function. For example, $\lambda_{att}$ may be set to a number greater than zero, such as between 0.005 and 50, for example 0.3.
Generator neural networks 208 trained with an attention mask loss function may generate output facial images 210 which better preserve the content and the colour of the input facial image 202.
In some embodiments, the generator loss function 214 may include an identity loss function. An identity loss function depends on the input facial image 202 and the output facial image 210. In some embodiments, the identity loss calculation may include a face recognition module. The face recognition module may be a pre-trained neural network configured to produce identity embeddings. Identity embeddings represent the identity of people in facial images. For example, two images of the same person may be processed by the face recognition module to produce similar embeddings that are closer to each other (as defined by some metric, such as Euclidean distance) than to identity embeddings corresponding to different people. The input facial image 202 may be processed by the face recognition module to produce an input embedding, $e_{org}$. The output facial image 210 may be processed by the face recognition module to produce an output embedding, $e_{gen}$. The identity loss function may depend on the input embedding and the output embedding. An example of an identity loss function is:
$$\mathcal{L}_{id} = \lVert e_{org} - e_{gen} \rVert_2.$$
An example of a generator loss function 214 may then be given as
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{id}\, \mathcal{L}_{id},$$
where $\lambda_{id}$ controls the contribution of the identity loss function to the generator loss function. For example, $\lambda_{id}$ may be set to a number in the range 1-100, such as between 2 and 10, for example 5. The loss function may additionally include a reconstruction loss term and/or attention mask loss function, as described above.
Generator neural networks 208 trained using identity loss functions may generate output facial images 210 which better maintain the identity of the face in the input facial image 202 than when trained without identity loss functions. Other identity loss functions may be used, for example, different similarity/distance metrics may be used to compare the identity embeddings.
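A sketch of an identity loss based on a Euclidean comparison of embeddings is given below. The face recognition module is assumed to be a pre-trained, frozen embedding network, and the Euclidean distance is only one of the possible metrics mentioned above.

```python
import torch

def identity_loss(face_recognition_module, input_image, output_image):
    """Distance between identity embeddings of the input and generated images."""
    with torch.no_grad():
        e_org = face_recognition_module(input_image)   # input embedding (fixed)
    e_gen = face_recognition_module(output_image)      # output embedding (kept in the graph)
    return torch.norm(e_org - e_gen, p=2, dim=-1).mean()
```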
In various embodiments, the generator loss function 214 and/or discriminator loss function may further include an expression loss function. An expression loss is calculated based on an output of the discriminator neural network. The discriminator neural network may output a set of predicted expression parameters from the input facial image 202 and/or the output facial image 210. For example, a regression layer may be used to determine the expression parameters from an input image. The predicted expression parameters may be compared with the initial and/or target expression parameters using an expression parameter loss function. These predicted parameters may be represented by the mapping $I \mapsto \hat{p}(I)$ for an image $I$. A different expression loss function may be used in the generator loss function 214 and the discriminator loss function 222.
For an N-dimensional parameter vector, an example expression loss function for the generator loss function 214 is:
$$\mathcal{L}_{exp}^{G} = \frac{1}{N} \sum_{i=1}^{N} \big( \hat{p}(I_{gen})_i - p_{trg, i} \big)^2.$$
A generator loss function may then be given as
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{exp}\, \mathcal{L}_{exp}^{G},$$
where $\lambda_{exp}$ controls the contribution of the expression loss function to the generator loss function. For example, $\lambda_{exp}$ may be set to a number in the range 1-100, such as between 5 and 15, for example 10. The generator loss function 214 may additionally include a reconstruction loss term and/or attention mask loss term and/or an identity loss term, as described above.
Generator neural networks 208 trained using expression loss functions in the generator loss function 214 and/or discriminator loss function 222 may generate output facial images 210 which more accurately replicate the expression corresponding to the target expression parameters 206 than when trained without expression loss functions. In detail, the generator neural network 208 may generate output facial images which have an accurate expression according to the discriminator neural network.
For an N-dimensional parameter vector, an example expression loss function for the discriminator loss function 222 is:
$$\mathcal{L}_{exp}^{D} = \frac{1}{N} \sum_{i=1}^{N} \big( \hat{p}(I_{org})_i - p_{org, i} \big)^2.$$
The discriminator loss function may then be given as
$$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{exp}\, \mathcal{L}_{exp}^{D},$$
where $\lambda_{exp}$ controls the contribution of the expression loss function to the discriminator loss function. For example, $\lambda_{exp}$ may be set to a number in the range 1-100, such as between 5 and 15, for example 10.
Discriminator neural networks 216 trained with expression loss functions may accurately estimate the expression parameters of an input facial image. In turn, the generator neural network may be configured to generate output facial images which, according to the discriminator neural network, accurately depict the expression corresponding to the target expression parameters.
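A sketch of the two expression losses is given below, assuming the discriminator's regression head returns an N-dimensional parameter vector and using a mean-squared-error comparison as one possible choice of distance.

```python
import torch.nn.functional as F

def expression_loss_discriminator(predicted_params, initial_params):
    # Parameters regressed from a *real* image compared with its known
    # initial expression parameters p_org.
    return F.mse_loss(predicted_params, initial_params)

def expression_loss_generator(predicted_params, target_params):
    # Parameters regressed from a *generated* image compared with the target
    # expression parameters p_trg used to generate it.
    return F.mse_loss(predicted_params, target_params)
```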
It will be appreciated that any combination of the loss functions described above may be used to create a generator loss function. For example, a generator loss function may be given as:
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{rec}\, \mathcal{L}_{rec} + \lambda_{att}\, \mathcal{L}_{att} + \lambda_{id}\, \mathcal{L}_{id} + \lambda_{exp}\, \mathcal{L}_{exp}^{G}.$$
Alternatively, one or more of the components of the above generator loss function may be omitted.
Referring now to Figure 3, an overview of an example method 300 of training a generator neural network 208 for generating a facial image with a given expression is shown. In some embodiments, the training set may comprise a set of L training examples of the form $\{(I_{org}^{(l)}, p_{org}^{(l)}, p_{trg}^{(l)}, I_{trg}^{(l)})\}_{l=1}^{L}$, with L being at least one. In other words, one or more of the training examples may further comprise a target facial image 302, denoted by $I_{trg}$, in addition to the input image 202, target expression parameters 206 and original expression parameters 204. This training set may be in addition to or as an alternative to the training set K.
In many respects, the method 300 of Figure 3 proceeds in substantially the same manner as the method 200 shown in Figure 2.
The target facial image 302 comprises a set of pixel values in a two-dimensional array. The target facial image 302 may have the same dimensions as the input facial image 202 and output facial image 210. For example, a colour image may be represented as $I_{trg} \in \mathbb{R}^{H \times W \times 3}$, where H is the height of the image in pixels, W is the width of the image in pixels and the image has three colour channels (e.g. RGB or CIELAB). The target facial images 302 may, in some embodiments, be in black-and-white/greyscale.
The target facial image 302 may depict a face with the same identity as the face in input facial image 202, but with expression corresponding with the target expression parameters 206.
The target facial image 302 may be generated synthetically. Using methods described below, models for 3D face reconstruction from 2D images may be fitted to a set of facial images. The input facial image 202 may be processed by the fitted 3D face reconstruction model in order to produce a target facial image 302 with expression corresponding to target expression parameters 206. In this way, the target expression parameters 206 corresponding to the target facial image 302 are known, and thus the target facial image 302 may be referred to as a ground truth output facial image.
In some embodiments, the generator loss 214 may include a generation loss function. The generation loss function compares the target facial image with the output facial image. For example, the generation loss may be based on a difference between the target image 302 and the generated image 210. The difference may be measured by a distance metric. For input/output facial images with width W and height H, an example generation loss function may be given as:
$$\mathcal{L}_{gen} = \frac{1}{HW} \lVert I_{trg} - I_{gen} \rVert_1,$$
where the generated image 210 may be represented symbolically as $I_{gen} = G(I_{org} \mid p_{trg})$. A generator loss function 214 may then be given as
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{gen}\, \mathcal{L}_{gen},$$
where $\lambda_{gen}$ controls the contribution of the generation loss function to the generator loss function. For example, $\lambda_{gen}$ may be set to a number in the range 1-100, such as between 5 and 15, for example 10. The generator loss function 214 may additionally include an expression loss function, a reconstruction loss term, an attention mask loss function and/or an identity loss term, as described above in relation to Figure 2.
Generator neural networks 208 trained using generation loss functions may generate output facial images 210 which more accurately replicate the expression corresponding to the target expression parameters 206. In detail, the  generator neural network 208 may generate output facial images with fewer artefacts than when trained without generation loss functions.
In various embodiments, the generator neural network 208 may be trained by separate loss functions depending on whether the training example includes a target facial image 302 or not. In other words, when the training data comprises the set K and the set L, a generator loss function 214 is selected based on whether the training data for the current iteration is taken from set K or set L. Training examples including a target facial image 302 may be referred to as paired data, and training examples without target facial images 302 may be referred to as unpaired data.
An example of such a generator loss function is given by:
$$\mathcal{L}_{G} = \begin{cases} \mathcal{L}_{adv} + \lambda_{rec}\, \mathcal{L}_{rec} + \lambda_{att}\, \mathcal{L}_{att} + \lambda_{id}\, \mathcal{L}_{id} + \lambda_{exp}\, \mathcal{L}_{exp}^{G} + \lambda_{gen}\, \mathcal{L}_{gen}, & \text{for paired training examples (set } L\text{)} \\ \mathcal{L}_{adv} + \lambda_{rec}\, \mathcal{L}_{rec} + \lambda_{att}\, \mathcal{L}_{att} + \lambda_{id}\, \mathcal{L}_{id} + \lambda_{exp}\, \mathcal{L}_{exp}^{G}, & \text{for unpaired training examples (set } K\text{)} \end{cases}$$
Alternatively, one or more of the terms in the above generator loss function 214 may be omitted.
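A sketch of how the weighted generator loss terms above might be combined is given below; the default weights are the example values quoted in the preceding paragraphs, and the generation loss term is included only when a paired training example provides a ground-truth target image.

```python
def generator_loss(adv, rec, att, ident, exp, gen=None,
                   lambda_rec=10.0, lambda_att=0.3, lambda_id=5.0,
                   lambda_exp=10.0, lambda_gen=10.0):
    """Weighted sum of the generator loss terms described above."""
    total = (adv + lambda_rec * rec + lambda_att * att
             + lambda_id * ident + lambda_exp * exp)
    if gen is not None:
        # Generation loss is only available for paired (set L) examples.
        total = total + lambda_gen * gen
    return total
```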
The training process of the method 200 and/or 300 may update the generator neural network 208 after one or more updates made to the discriminator neural  network. The updates to the parameters of the generator and/or discriminator neural networks may be determined using backpropagation. Where the adversarial loss corresponds with a distance metric between distributions, updating the discriminator neural network more often than the generator neural network 208 may lead to the adversarial loss better approximating the distance metric between the distributions of the real and generated images. This in turn may provide a more accurate signal for the generator neural network 208 so that output facial images 210 are generated with more realism.
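A sketch of such an alternating update schedule is given below; the optimiser, learning rate and the ratio of discriminator updates to generator updates are illustrative assumptions rather than values taken from this description.

```python
import torch

def train(generator, discriminator, data_loader, compute_d_loss, compute_g_loss,
          n_critic: int = 5, lr: float = 1e-4, epochs: int = 1):
    """Alternating optimisation: n_critic discriminator updates per generator update.

    `compute_d_loss(batch)` and `compute_g_loss(batch)` stand in for the
    discriminator and generator loss functions described above.
    """
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))

    for _ in range(epochs):
        for step, batch in enumerate(data_loader):
            # Update the discriminator on every batch.
            d_opt.zero_grad()
            compute_d_loss(batch).backward()
            d_opt.step()

            # Update the generator once every n_critic batches.
            if step % n_critic == 0:
                g_opt.zero_grad()
                compute_g_loss(batch).backward()
                g_opt.step()
```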
As described above, initial and/or target  expression parameters  204, 206 may correspond with the expression parameters used to vary expression in 3D facial mesh models. Expression parameters used for training the methods described herein may be extracted by first fitting models for 3D reconstruction of faces to a set of facial images. These models may include 3D Morphable Models, or 3DMM, neural networks, and other machine-learned models. These models for 3D face reconstruction may be configured to optimize three parametric models: a shape model, a texture model, and a camera model, in order to render a 2D instance of the 3D facial reconstruction as close to the initial facial image as possible.
Focussing on a shape model, the shape parameters may be derived using principal component analysis. The 3D facial mesh may be generated by finding the product of each of the shape parameters and a respective basis vector, and summing these products with a mean shape vector. For example, a 3D facial mesh comprising N vertices may be represented as a vector:
$$\mathbf{s} = [x_1, y_1, z_1, \ldots, x_N, y_N, z_N]^T \in \mathbb{R}^{3N}.$$
An identity parameter vector, $p_{s,id}$, may control variations in identity in 3D facial shapes. The 3D facial mesh may be calculated as:
$$\mathbf{s} = \mathbf{m}_s + \mathbf{U}_{s,id}\, p_{s,id},$$
where $\mathbf{m}_s$ is a mean shape vector and $\mathbf{U}_{s,id}$ is a matrix of basis vectors corresponding to the principal components of the identity subspace. These basis vectors may be learned from 3D facial scans displaying a neutral expression and the identity parameters may be used to represent identity variations by instantiating a 3D shape instance.
Expression parameters may also be derived using principal component analysis and are used to generate the 3D facial mesh. The 3D facial mesh may be generated by finding the product of each of the identity parameters and a respective basis vector and each of the expression parameters and a respective basis vector, and summing these products with a mean shape vector. For example, the 3D facial mesh may be calculated as:
$$\mathbf{s} = \mathbf{m}_s + \mathbf{U}_{s,id}\, p_{s,id} + \mathbf{U}_{s,exp}\, p_{s,exp},$$
or equivalently,
$$\mathbf{s} = \mathbf{m}_s + [\mathbf{U}_{s,id}, \mathbf{U}_{s,exp}]\, [p_{s,id}^T, p_{s,exp}^T]^T,$$
where $\mathbf{m}_s$ is a mean shape vector, $\mathbf{U}_{s,id}$ is a matrix of basis vectors for identity variations and $\mathbf{U}_{s,exp}$ is a matrix of basis vectors for expression variations, and $p_{s,id}$ are identity parameters controlling identity variations in the 3D facial mesh and $p_{s,exp}$ are expression parameters controlling expression variations in the 3D facial mesh. The expression basis vectors may be learned from displacement vectors calculated by comparing shape vectors from 3D facial scans depicting a neutral expression and from 3D facial scans depicting a posed expression. These expression parameters may be used to represent expression variations by, in addition to the identity parameters, instantiating a 3D shape instance.
Expression parameters may be configured to be in the range [-1, 1]. Where expression parameters are derived using principal component analysis, the expression parameters may be normalised by the square roots of the eigenvalues $e_i$, $i = 1, \ldots, N$, of the PCA blendshape model. Additionally, the zero expression parameter vector, where each element of the vector is set to 0, may correspond with a 3D facial mesh depicting a neutral expression. Moreover, the magnitude of an expression parameter may correspond with the intensity of the expression depicted in the 3D facial mesh. For example, setting a certain expression parameter to -1 may correspond with an intense frown in the 3D facial model. The same expression parameter set to -0.5 may correspond with a more moderate frown. The 3D facial mesh may depict an intense smile when the same parameter is set to 1.
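A sketch of instantiating a 3D shape from the blendshape model above, and of normalising PCA-derived expression parameters towards the range [-1, 1], is given below; the explicit clipping step is an assumption about how the range may be enforced.

```python
import numpy as np

def instantiate_shape(mean_shape, identity_basis, expression_basis,
                      identity_params, expression_params):
    """s = m_s + U_id p_id + U_exp p_exp for a mesh flattened to length 3N."""
    return (mean_shape
            + identity_basis @ identity_params
            + expression_basis @ expression_params)

def normalise_expression_params(raw_params, eigenvalues):
    """Scale PCA expression coefficients by the square roots of their
    eigenvalues, then clip to [-1, 1] so that 0 is the neutral expression."""
    return np.clip(raw_params / np.sqrt(eigenvalues), -1.0, 1.0)
```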
However, methods described herein may not require the 3D facial mesh model to generate a facial image with a given expression; instead the expression parameters may be processed by a generator neural network trained by the methods described herein. Additionally or alternatively, the expression parameters may be predicted by a discriminator neural network trained by the methods described herein.
Thus, by first fitting models for 3D reconstruction of faces to a set of facial images, identity and expression parameters p_{s,id}, p_{s,exp} may be extracted from any input facial image 202. Based on the independent shape parameters for identity and expression, expression parameters may be extracted to compose an annotated dataset of K images and their corresponding vectors of expression parameters {p_{s,exp}^(k)}, k = 1, …, K, with no manual annotation cost. This dataset may be used in part to produce a training set for training the generator neural network 208.
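A hedged sketch of assembling such an annotated dataset is shown below; it assumes a hypothetical fit_3dmm(image) routine returning identity and expression parameters for a facial image, which is not defined in this disclosure.

```python
def build_expression_dataset(images, fit_3dmm):
    """Pair each facial image with the expression parameters recovered by a 3D fit."""
    dataset = []
    for image in images:
        p_id, p_exp = fit_3dmm(image)    # 3D face reconstruction fit per image
        dataset.append((image, p_exp))   # keep only the expression annotation
    return dataset
```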
The target expression parameters 206 may be expression parameters determined (using the methods described herein or otherwise) from a facial image different from the input facial image 202. Additionally or alternatively, the target expression parameters 206 may be selected by generating a 3D facial mesh with a desired target expression, and using the corresponding expression parameters that produce the desired target expression. Additionally or alternatively, the target expression parameters 206 may be randomly selected; for example, they may be sampled from a multivariate Gaussian distribution.
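One possible way of obtaining random target expression parameters, as mentioned above, is sketched here; the dimensionality, covariance and clipping are illustrative choices, not values prescribed by the disclosure.

```python
import numpy as np

def sample_target_expression(dim, scale=0.5, seed=None):
    """Draw target expression parameters from a zero-mean Gaussian and keep them in [-1, 1]."""
    rng = np.random.default_rng(seed)
    p_target = rng.multivariate_normal(mean=np.zeros(dim), cov=(scale ** 2) * np.eye(dim))
    return np.clip(p_target, -1.0, 1.0)

print(sample_target_expression(dim=30, seed=42)[:5])
```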
Figure 4 shows a flow diagram of an example method of training a generator neural network for generating a facial image with a given expression.
At operation 4.1, training data comprising a plurality of training examples is received by a generator neural network, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters. The expression parameters may correspond to continuous parameters of a 3D facial model. Examples of 3D facial models may include 3D blendshape models and/or 3D morphable models.
In some embodiments, the training data further comprises a target image corresponding to the target expression parameters. The target image may be a ground truth image from which the target expression parameters were extracted. Alternatively or additionally, the target facial image may be a synthetic image.
At operation 4.2, the generator neural network is applied to an input facial image and a corresponding set of target expression parameters to generate an output facial image.
At operation 4.3, the generator neural network is applied to the output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image.
At operation 4.4, a discriminator neural network is applied to the input facial image to generate a first predicted classification.
At operation 4.5, the discriminator neural network is applied to the output facial image to generate a second predicted classification. The second predicted classification may comprise a probability distribution over image patches of the output image indicative of whether those patches are real (i.e. from ground truth images) or synthetic (i.e. generated by the generator neural network).
Operations 4.1 to 4.5 may be iterated over the training examples in the training data (e.g. K and/or L described above) to form an ensemble of training examples.
At operation 4.6, the parameters of the generator neural network are updated in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image. The update may be performed after each iteration of operations 4.1 to 4.5. Alternatively, the update may be performed after a number of iterations of operations 4.1 to 4.5. In these embodiments, the generator loss function may comprise one or more expectation values taken over the ensemble of training examples.
In embodiments where a target image is used, the parameters of the generator neural network may also be updated in dependence on a comparison of the target facial image with the output facial image.
At operation 4.7, the parameters of the discriminator neural network are updated in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification. The update may be performed after each iteration of operations 4.1 to 4.5. Alternatively, the update may be performed after a number of iterations of operations 4.1 to 4.5. In these embodiments, the discriminator loss function may comprise one or more expectation values taken over the ensemble of training examples.
The parameters of the generator neural network and/or discriminator neural network may be updated using an optimisation procedure applied to the generator loss function and/or discriminator loss function respectively. Examples of such an optimisation include, but are not limited to, gradient descent methods and/or gradient free methods.
Operations 4.1 to 4.7 may be iterated until a threshold condition is met. The threshold condition may be, for example, a threshold number of training epochs. The threshold number of training epochs may lie in the range 50-150, such as 70-90, for example 80. A predetermined number of epochs may be used for each training set. For example, a first number of training epochs may be performed on the training set K, followed by a second number of training  epochs on training set L. The number of training epochs for each training set may lie in the range 20-100, such as 30-50, for example 40.
Different training examples from the training dataset may be utilised for each iteration of operations 4.1 to 4.7. For example, the training set K (or some subset of it) may be used in one or more of the iterations. The training set L (or some subset of it) may be used in one or more of the other iterations. The training sets used in each iteration may have a predefined batch size. The batch size may, for example, lie in the range 5-100, such as between 10 and 20, for example 16.
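The following PyTorch-style sketch compresses operations 4.1 to 4.7 into a single training step. It assumes a generator callable taking (image, expression parameters) and a discriminator callable returning (classification, predicted expression parameters); the particular adversarial and reconstruction losses and the weight lambda_rec are illustrative examples, not the specific loss functions of the disclosure.

```python
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, opt_g, opt_d, batch, lambda_rec=10.0):
    x, p_init, p_target = batch             # input image, initial and target expression parameters

    # Operations 4.2-4.3: generate the output image, then reconstruct the input.
    x_out = generator(x, p_target)
    x_rec = generator(x_out, p_init)

    # Operations 4.4-4.5: classify the real input and the generated output.
    cls_real, _ = discriminator(x)
    cls_fake, _ = discriminator(x_out.detach())

    # Operation 4.7: update the discriminator (standard GAN losses used as an example).
    loss_d = (F.binary_cross_entropy_with_logits(cls_real, torch.ones_like(cls_real))
              + F.binary_cross_entropy_with_logits(cls_fake, torch.zeros_like(cls_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Operation 4.6: update the generator with adversarial and cycle-reconstruction terms.
    cls_gen, _ = discriminator(x_out)
    loss_g = (F.binary_cross_entropy_with_logits(cls_gen, torch.ones_like(cls_gen))
              + lambda_rec * F.l1_loss(x_rec, x))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    return loss_g.item(), loss_d.item()
```

In practice such a step would be iterated over mini-batches of the training sets until the epoch thresholds described above are reached.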
Figures 5a and 5b show an example structure of a generator neural network 500 for generating a facial image with a given expression and an example structure of a discriminator neural network 512 for predicting the expression parameters and calculating an adversarial loss for a facial image.
Figure 5a shows an example structure of a generator neural network 500. The generator neural network 500 is a neural network configured to process an input facial image 502 and expression parameters 504 to generate an output facial image 508. The output facial image 508 corresponds to the input facial image 502, but with an expression dependent on the expression parameters 504. In some embodiments, instead of generating the output image directly, the generator neural network 500 may generate an attention mask 520 (also referred to as a smooth deformation mask). The mask 520 may have the same spatial dimension as the input facial image 502. In these embodiments, the generator neural network 500 may also generate a deformation image 522. The output facial image 508 may then be generated as a combination of the input facial image 502 and the deformation image 522, for example blended according to the attention mask 520.
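One plausible form of the combination mentioned above is a per-pixel blend of the input facial image and the deformation image weighted by the attention mask; the exact blending formula below is an assumption, since the text only states that a combination is formed.

```python
import torch

def combine(input_image, deformation_image, attention_mask):
    """Blend input and deformation images with a mask of shape (H, W) and values in [0, 1]."""
    a = attention_mask.unsqueeze(0)                # broadcast the mask over the channel dimension
    return a * deformation_image + (1.0 - a) * input_image

x = torch.rand(3, 128, 128)    # input facial image (C, H, W)
d = torch.rand(3, 128, 128)    # deformation image produced by the generator
m = torch.rand(128, 128)       # smooth attention/deformation mask
print(combine(x, d, m).shape)  # torch.Size([3, 128, 128])
```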
The generator neural network 500 comprises a plurality of layers of nodes 506, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the generator neural network 500 may be convolutional layers.
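A common way for a convolutional generator to consume both an image and an expression-parameter vector is to tile the parameters spatially and concatenate them to the image channels. The sketch below uses that conditioning scheme as an assumption; the layer sizes and activation functions are arbitrary and do not describe the disclosed network.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, num_expression_params=30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_expression_params, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh(),
        )

    def forward(self, image, p_exp):
        b, _, h, w = image.shape
        p_map = p_exp.view(b, -1, 1, 1).expand(b, p_exp.shape[1], h, w)  # tile parameters spatially
        return self.net(torch.cat([image, p_map], dim=1))

g = ConditionalGenerator()
out = g(torch.rand(2, 3, 128, 128), torch.rand(2, 30) * 2 - 1)
print(out.shape)  # torch.Size([2, 3, 128, 128])
```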
Figure 5b shows an example structure of a discriminator neural network 512. The discriminator neural network 512 is a neural network configured to process a facial image to output a classification 516. The discriminator neural network 512 may, in some embodiments, further output a set of predicted expression parameters 518. The predicted expression parameters 518 may be determined by one or more regression layers (not shown). The regression layers may operate in parallel with the other layers 514 of the discriminator neural network 512. Alternatively, determination of the predicted expression parameters 518 may be performed as part of the other layers 514 of the discriminator neural network 512. The discriminator neural network 512 comprises a plurality of layers of nodes 514, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the discriminator neural network 512 may be convolutional layers.
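For illustration, a discriminator with a shared convolutional trunk, a patch-wise classification head and a parallel regression head for the predicted expression parameters might be sketched as follows; the layer sizes are assumptions and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, num_expression_params=30):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
        )
        self.cls_head = nn.Conv2d(256, 1, 3, padding=1)          # one real/synthetic logit per patch
        self.reg_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(256, num_expression_params))

    def forward(self, x):
        features = self.trunk(x)
        return self.cls_head(features), self.reg_head(features)

d = PatchDiscriminator()
logits, p_pred = d(torch.rand(2, 3, 128, 128))
print(logits.shape, p_pred.shape)  # torch.Size([2, 1, 16, 16]) torch.Size([2, 30])
```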
Figure 6 shows an overview of an example method 600 of predicting the expression parameters 606 of a facial image 602 using a trained discriminator neural network 604.
The discriminator neural network may, for example, be trained by the methods described in relation to Figure 2.
The input image 602 comprises a set of pixel values in a two-dimensional array. For example, a colour image may be represented as an element of R^{H×W×3}, where H is the height of the image in pixels, W is the width of the image in pixels and the image has three colour channels (i.e. RGB). The input image 602 may, in some embodiments, be in black-and-white/greyscale.
Expression parameters 606 are a set of continuous variables that encode a facial expression (also referred to herein as an expression vector). Additionally, the expression parameters may encode facial deformations that occur as a result of speech. Expression parameters 606 may be represented by an N-dimensional vector, e.g. p_{s,exp} ∈ R^N. The expression parameters may correspond to parameters of a 3D facial model, such as a 3DMM and/or a linear 3D blendshape model, as described above. The discriminator neural network may be trained to minimise an expression loss function, so as to accurately regress the expression parameters 606 to correspond with the input image 602.
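A minimal sketch of such an expression loss is given below, assuming a simple mean-squared error between the predicted parameters and the parameters fitted from the image; the disclosure does not fix the exact form of the loss here.

```python
import torch
import torch.nn.functional as F

def expression_loss(p_predicted, p_ground_truth):
    """Penalise the distance between regressed and fitted expression parameters."""
    return F.mse_loss(p_predicted, p_ground_truth)

print(expression_loss(torch.zeros(4, 30), torch.ones(4, 30)))  # tensor(1.)
```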
Discriminator and generator neural networks trained by the methods disclosed herein may be used for expression transfer, i.e. the transfer of an expression from a source facial image to a target facial image. For example, a trained generator neural network may process the expression parameters from the source facial image, and the target facial image. The expression parameters from the source facial image may be extracted by a trained discriminator neural network 604. The output of the trained generator neural network may be a facial image depicting the identity and other elements of the target facial image but with the expression depicted in the source facial image.
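Putting the two trained networks together, expression transfer might look like the following sketch, which assumes the discriminator returns (classification, expression parameters) and the generator accepts (image, expression parameters); both interfaces are assumptions made for the example.

```python
import torch

def transfer_expression(generator, discriminator, source_image, target_image):
    """Re-render the target identity with the expression regressed from the source image."""
    with torch.no_grad():
        _, p_source = discriminator(source_image)    # extract expression parameters from the source
        return generator(target_image, p_source)     # apply them to the target facial image
```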
Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein. The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system.
The apparatus (or system) 700 comprises one or more processors 702. The one or more processors control operation of other components of the  system/apparatus 700. The one or more processors 702 may, for example, comprise a general purpose processor. The one or more processors 702 may be a single core device or a multiple core device. The one or more processors 702 may comprise a Central Processing Unit (CPU) or a graphical processing unit (GPU) . Alternatively, the one or more processors 702 may comprise specialised processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be included.
The system/apparatus comprises a working or volatile memory 704. The one or more processors may access the volatile memory 704 in order to process data and may control the storage of data in memory. The volatile memory 704 may comprise RAM of any type, for example Static RAM (SRAM) , Dynamic RAM (DRAM) , or it may comprise Flash memory, such as an SD-Card.
The system/apparatus comprises a non-volatile memory 706. The non-volatile memory 706 stores a set of operating instructions 708 for controlling the operation of the processors 702 in the form of computer readable instructions. The non-volatile memory 706 may be a memory of any kind such as a Read Only Memory (ROM), a Flash memory or a magnetic drive memory.
The one or more processors 702 are configured to execute the operating instructions 708 to cause the system/apparatus to perform any of the methods described herein. The operating instructions 708 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 700, as well as code relating to the basic operation of the system/apparatus 700. Generally speaking, the one or more processors 702 execute one or more instructions of the operating instructions 708, which are stored permanently or semi-permanently in the non-volatile memory 706, using the volatile memory 704 to temporarily store data generated during execution of said operating instructions 708.
Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 7, cause the computer to perform one or more of the methods described herein.
Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.
Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features  described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.
Although several embodiments have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims.

Claims (18)

  1. A method of training a generator neural network to generate a facial image with a target expression from a facial image and a set of target expression parameters, wherein the training comprises:
    receiving training data comprising a plurality of training examples, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters;
    applying the generator neural network to an input facial image and a corresponding set of target expression parameters to generate an output facial image;
    applying the generator neural network to the output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image;
    applying a discriminator neural network to the input facial image to generate a first predicted classification;
    applying the discriminator neural network to the output facial image to generate a second predicted classification; and
    updating parameters of the generator neural network in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image; and/or
    updating parameters of the discriminator neural network in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification.
  2. The method of claim 1, wherein the first classification and/or the second classification comprises a probability distribution over image patches in the input facial image and/or output facial image indicative of whether the image patches are real or synthetic.
  3. The method of claims 1 or 2, wherein the generator neural network and/or discriminator neural network comprises one or more convolutional layers.
  4. The method of any preceding claim, wherein the set of initial expression parameters and/or the set of target expression parameters correspond to continuous parameters of a linear three-dimensional blendshape model.
  5. The method of any preceding claim, wherein one or more of the training examples further comprises a target facial image corresponding to the set of target expression parameters and wherein the generator loss function further depends on a comparison of the target facial image with the output facial image when using training examples comprising a target facial image.
  6. The method of any preceding claim, wherein, for one or more of the training examples comprising a target facial image, the target facial image is generated synthetically.
  7. The method of any preceding claim, wherein the method further comprises:
    applying a face recognition module to the input facial image to generate an input embedding; and
    applying the face recognition module to the output facial image to generate an output embedding, and
    wherein the generator loss function further depends on a comparison of the input embedding with the output embedding.
  8. The method of claim 7, wherein the face recognition module comprises a pre-trained neural network.
  9. The method of any preceding claim, wherein the generator loss function and/or discriminator loss function comprises a gradient penalty term based on a gradient of the second classification with respect to the input facial image.
  10. The method of any preceding claim, wherein updating parameters of the generator neural network and/or parameters of the discriminator neural network is performed using backpropagation.
  11. The method of any preceding claim, wherein the discriminator neural network comprises a regression layer, and wherein the discriminator neural network further generates a set of predicted expression parameters from the input facial image and/or the output facial image.
  12. The method of claim 11, wherein the discriminator loss function further depends on a comparison of the set of predicted expression parameters for an input image with the set of initial expression parameters for the input image.
  13. The method of any of claims 11 or 12, wherein updating parameters of the generator neural network is further based on a comparison of the set of predicted expression parameters for an output image with the set of target expression parameters used to generate the output image.
  14. The method of any preceding claim, wherein:
    the generator neural network further generates an attention mask from the input facial data and a reconstructed attention mask from the first output facial image and a corresponding set of initial expression parameters; and
    wherein updating parameters of the generator neural network is further in dependence on a comparison of the attention mask and the reconstructed attention mask.
  15. The method of claim 14, wherein the generator neural network further generates a deformation image from the input facial data, and wherein the output image is generated by combining the deformation image, the attention mask and the input facial image.
  16. The method of any preceding claim, wherein the discriminator neural network is a relativistic discriminator.
  17. A method of generating a target image from an input image and a set of target expression parameters, the method comprising:
    receiving the set of target expression parameters, the target expression parameters taken from a continuous range of target expression parameters;
    receiving the input image;
    applying a generator neural network to the input image and the set of target expression parameters to generate the target image,
    wherein the generator neural network is trained according to the method of any preceding claim.
  18. A method of determining a set of expression parameters from an input facial image, the method comprising:
    receiving an input image; and
    applying a discriminator neural network to the input image to generate the set of expression parameters,
    wherein the discriminator neural network has been trained according to the method of any of claims 11 to 13.
PCT/CN2020/108121 2019-08-15 2020-08-10 Facial image processing WO2021027759A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1911689.6A GB2586260B (en) 2019-08-15 2019-08-15 Facial image processing
GB1911689.6 2019-08-15

Publications (1)

Publication Number Publication Date
WO2021027759A1 true WO2021027759A1 (en) 2021-02-18

Family

ID=68099558

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/108121 WO2021027759A1 (en) 2019-08-15 2020-08-10 Facial image processing

Country Status (2)

Country Link
GB (1) GB2586260B (en)
WO (1) WO2021027759A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144348A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN112508239A (en) * 2020-11-22 2021-03-16 国网河南省电力公司电力科学研究院 Energy storage output prediction method based on VAE-CGAN
CN113284059A (en) * 2021-04-29 2021-08-20 Oppo广东移动通信有限公司 Model training method, image enhancement method, device, electronic device and medium
CN113642467B (en) * 2021-08-16 2023-12-01 江苏师范大学 Facial expression recognition method based on improved VGG network model
CN115984947B (en) * 2023-02-21 2023-06-27 北京百度网讯科技有限公司 Image generation method, training device, electronic equipment and storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102387570B1 (en) * 2016-12-16 2022-04-18 삼성전자주식회사 Method and apparatus of generating facial expression and learning method for generating facial expression
CN108171770B (en) * 2018-01-18 2021-04-06 中科视拓(北京)科技有限公司 Facial expression editing method based on generative confrontation network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373999A1 (en) * 2017-06-26 2018-12-27 Konica Minolta Laboratory U.S.A., Inc. Targeted data augmentation using neural style transfer
CN108647560A (en) * 2018-03-22 2018-10-12 中山大学 A kind of face transfer method of the holding expression information based on CNN
CN109308725A (en) * 2018-08-29 2019-02-05 华南理工大学 A kind of system that expression interest figure in mobile terminal generates
CN109934767A (en) * 2019-03-06 2019-06-25 中南大学 A kind of human face expression conversion method of identity-based and expressive features conversion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113706428A (en) * 2021-07-02 2021-11-26 杭州海康威视数字技术股份有限公司 Image generation method and device
CN113706428B (en) * 2021-07-02 2024-01-05 杭州海康威视数字技术股份有限公司 Image generation method and device
CN113870399A (en) * 2021-09-23 2021-12-31 北京百度网讯科技有限公司 Expression driving method and device, electronic equipment and storage medium
CN113989103A (en) * 2021-10-25 2022-01-28 北京字节跳动网络技术有限公司 Model training method, image processing method, device, electronic device and medium
CN113989103B (en) * 2021-10-25 2024-04-26 北京字节跳动网络技术有限公司 Model training method, image processing device, electronic equipment and medium
CN114399593A (en) * 2021-12-23 2022-04-26 北京航空航天大学 Face glasses removing and three-dimensional model generating method based on deep learning
CN114399593B (en) * 2021-12-23 2024-05-14 北京航空航天大学 Face glasses removing and three-dimensional model generating method based on deep learning
CN116229214A (en) * 2023-03-20 2023-06-06 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN116229214B (en) * 2023-03-20 2023-12-01 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN117974853A (en) * 2024-03-29 2024-05-03 成都工业学院 Self-adaptive switching generation method, system, terminal and medium for homologous micro-expression image

Also Published As

Publication number Publication date
GB2586260B (en) 2021-09-15
GB2586260A (en) 2021-02-17
GB201911689D0 (en) 2019-10-02

Similar Documents

Publication Publication Date Title
WO2021027759A1 (en) Facial image processing
US10796414B2 (en) Kernel-predicting convolutional neural networks for denoising
Tran et al. On learning 3d face morphable model from in-the-wild images
Meng et al. Sdedit: Guided image synthesis and editing with stochastic differential equations
US10424087B2 (en) Systems and methods for providing convolutional neural network based image synthesis using stable and controllable parametric models, a multiscale synthesis framework and novel network architectures
CN112215050A (en) Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
CN110097609B (en) Sample domain-based refined embroidery texture migration method
CA3144236A1 (en) Real-time video ultra resolution
Li et al. Exploring compositional high order pattern potentials for structured output learning
CN110084193B (en) Data processing method, apparatus, and medium for face image generation
WO2018203549A1 (en) Signal conversion device, method, and program
CA3137297C (en) Adaptive convolutions in neural networks
EP4377898A1 (en) Neural radiance field generative modeling of object classes from single two-dimensional views
CN114746904A (en) Three-dimensional face reconstruction
US20230130281A1 (en) Figure-Ground Neural Radiance Fields For Three-Dimensional Object Category Modelling
WO2023129190A1 (en) Generative modeling of three dimensional scenes and applications to inverse problems
CN116416376A (en) Three-dimensional hair reconstruction method, system, electronic equipment and storage medium
Wang et al. Learning to hallucinate face in the dark
CN113763535A (en) Characteristic latent code extraction method, computer equipment and storage medium
Costigan et al. Facial retargeting using neural networks
US20220172421A1 (en) Enhancement of Three-Dimensional Facial Scans
CN116030181A (en) 3D virtual image generation method and device
CN114764746A (en) Super-resolution method and device for laser radar, electronic device and storage medium
Saval-Calvo et al. Evaluation of sampling method effects in 3D non-rigid registration
CN113592970B (en) Method and device for generating hair styling, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20852928

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20852928

Country of ref document: EP

Kind code of ref document: A1