WO2021027759A1

WO2021027759A1 - Facial image processing

Info

Publication number: WO2021027759A1
Application number: PCT/CN2020/108121
Authority: WO
Inventors: Evangelos VERVERAS; Stefanos ZAFEIRIOU
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2019-08-15
Filing date: 2020-08-10
Publication date: 2021-02-18
Also published as: GB2586260B; GB2586260A; GB201911689D0

Abstract

Methods for generating facial images with a given facial expression using neural networks, and methods of training neural networks for generating facial images with a given facial expression are provided. A method of training a generator neural network to generate a facial image with a target expression from a facial image and a set of target expression parameters, wherein the training comprises: receiving training data comprising a plurality of training examples, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters; applying the generator neural network to an input facial image and a corresponding set of target parameters to generate an output facial image; applying the generator neural network to the first output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image; applying the discriminator neural network to the input facial image to generate a first predicted classification; applying the discriminator neural network to the output facial image to generate a second predicted classification; and updating parameters of the generator neural network in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image; and/or updating parameters of the discriminator neural network in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification.

Description

Facial Image Processing

Field of the Invention

This specification relates to methods for generating facial images with a given facial expression using neural networks, and methods of training neural networks for generating facial images with a given facial expression.

Background

Image-to-image translation is a ubiquitous problem in image processing, in which an input image of a source domain is transformed to a synthetic image of a target domain, where the synthetic image maintains some properties of the original input image. Examples of image-to-image translation include converting images from black-and-white to colour, turning daylight scenes into night-time scenes, increasing the quality of images and/or manipulating facial attributes of an image. However, many methods of performing image-to-image translation require aligned image pairs.

Methods of performing image-to-image translation in the absence of aligned images are limited in that they may not generate synthetic samples of sufficient quality when the input image is recorded under unconstrained conditions (e.g. in the wild) . Additionally, the target domain is usually limited to be one out of a discrete set of domains. Hence, these methods are limited in their capabilities for generating facial images with a given facial expression.

Summary

According to a first aspect of this disclosure, there is described a method of training a generator neural network to generate a facial image with a target expression from a facial image and a set of target expression parameters, wherein the training comprises: receiving training data comprising a plurality of training examples, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters; applying the generator neural network to an input facial image and a corresponding set of target parameters to generate an output facial image; applying the generator neural network to the first output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image; applying the discriminator neural network to the input facial image to generate a first predicted classification; applying the discriminator neural network to the output facial image to generate a second predicted classification; and updating parameters of the generator neural network in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image; and/or updating parameters of the discriminator neural network in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification.

The first classifier and/or the second classifier may comprise a probability distribution over image patches in the input facial image and/or output facial image indicative of whether the image patches are real or synthetic. The generator neural network and/or discriminator neural network may comprise one or more convolutional layers.

The set of initial expression parameters and/or the set of target expression parameters may correspond to continuous parameters of a linear three-dimensional blendshape model.

One or more of the training examples may further comprise a target facial image corresponding to the set of target expression parameters and the generator loss function may further depend on a comparison of the target facial image with the output facial image when using training examples comprising a target facial image. One or more of the training examples may comprise a target facial image generated synthetically.

The method may further comprise: applying a face recognition module to the input facial image to generate an input embedding; and applying the face recognition module to the output facial image to generate an output embedding, and the generator loss function may further depend on a comparison of the input embedding with the output embedding. The face recognition module may comprise a pre-trained neural network.

The generator loss function and/or discriminator loss function may comprise a gradient penalty term based on a gradient of the second classifier with respect to the input facial image.

Updating parameters of the generator neural network and/or parameters of the discriminator neural network may be performed using backpropagation.

The discriminator neural network may comprise a regression layer, and the discriminator neural network may further generate a set of predicted expression parameters from the input facial image and/or the output facial image. The discriminator loss function may further depend on a comparison of the set of predicted expression parameters for an input image with the set of initial expression parameters for the input image.

Updating parameters of the generator neural network may further be based on a comparison of the set of predicted expression parameters for an output image with the set of target expression parameters used to generate the output image.

The generator neural network may further generate an attention mask from the input facial data and a reconstructed attention mask from the first output facial image and a corresponding set of initial expression parameters. Updating parameters of the generator neural network may further be in dependence on a comparison of the attention mask and the reconstructed attention mask. The generator neural network may further generate a deformation image from the input facial data. The output facial image may be generated by combining the deformation image, the attention mask and the input facial image. The reconstructed facial image may be generated by combining a reconstructed deformation image (generated by the generator neural network from the output facial image) , the reconstructed attention mask and the output facial image

The discriminator neural network may be a relativistic discriminator.

According to a further aspect of this disclosure, there is described a method of generating a target image from an input image and a set of target expression parameters, the method comprising: receiving the set of target expression parameters, the target expression parameters taken from a continuous range of target expression parameters; receiving the input image; applying a generator neural network to the input image and the set of target expression parameters to generate the target image, wherein the generator neural network is trained according to the method of any preceding claim.

According to a further aspect of this disclosure, there is described a method of determining a set of expression parameters from an input facial image, the method comprising: receiving an input image; and applying a discriminator neural network to the input image to generate the set of expression parameters, wherein the discriminator neural network has been trained according to the methods described above.

Brief Description of the Drawings

Embodiments will now be described by way of non-limiting examples with reference to the accompanying drawings, in which:

Figure 1 shows an overview of an example method of generating a facial image with a given expression using a trained generator neural network;

Figure 2 shows an overview of an example method of training a generator neural network for generating a facial image with a given expression;

Figure 3 shows an overview of a further example method of training a generator neural network for generating a facial image with a given expression;

Figure 4 shows a flow diagram of an example method of training a generator neural network for generating a facial image with a given expression;

Figures 5a and 5b show an example structure of a generator neural network for generating a facial image with a given expression and an example structure of a discriminator neural network for predicting the expression parameters and calculating an adversarial loss for a facial image;

Figure 6 shows an overview of an example method of predicting the expression parameters of a facial image using a trained discriminator neural network; and

Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein.

Detailed Description

Example implementations provide system (s) and methods for generating facial images with a given facial expression defined by a set of expression parameters, and methods of determining facial expression parameters of a facial model from an input facial image.

Interactive editing of facial expressions has wide ranging applications including, but not limited to, post-production of movies, computational photography, and face recognition. Improving editing of facial expressions may improve systems directed at these applications. For example, face recognition systems using the system (s) and method (s) described herein may be able to more realistically neutralise the expression of a user’s face, and thus verify the user’s identity under less constrained environments than systems and methods in the state of the art.

Moreover, methods and systems disclosed herein may utilise continuous expression parameters, in contrast to discrete expression parameters which use, for example, labels indicating an emotion, such as “happy” , “fearful” , or “sad” . The expression parameters may correspond with expression parameters used by models to vary expression when generating three-dimensional face meshes. These models may include, for example, linear 3D blendshape models. These models may also be able to represent any deformation in facial images that occur, for example deformations as a result of speech. Therefore, it will be appreciated that reference to expression parameters herein refer to any parameter that captures variation in facial images caused by a deformation to the face in the facial image.

Consequently, methods disclosed herein may provide a finer level of control when editing the expressions of facial images than methods using discrete expression parameters. Using the same expression parameters as a 3D facial mesh models may allow fine-tuning of the desired target output by editing an input facial image to depict any expression capable of being produced by the 3D facial mesh model. The target expression may be fine-tuned by modifying the continuous expression parameters. The desired target expressions may be previewed through rendering a 3D facial mesh with the selected expression parameters, prior to generating the edited facial image.

Figure 1 shows an overview of an example method of generating a facial image with a given expression using a trained generator neural network. The method 100 takes an input facial image 102 and a set of target expression parameters 104 as input and forms an output facial image 108 as output using a trained generator neural network 106.

The input facial image 102, I _org, is an image comprising one or more faces. The facial expression of each face in the input image is described by an initial set of expression parameters, p _org. The input facial image 102 is processed by the trained generator neural network 106 in order to produce an output facial image 108, I _gen, comprising one or more faces corresponding to faces in the input image 102, but with an expression described by the target expression parameters 104, p _trg.

The output facial image 108 generated by the trained generator neural network 106 may retain many aspects of the input facial image including identity, angle, lighting, and background elements, while changing the expression of the face to correspond with the target expression parameters 104. In some embodiments where the expression parameters additionally or alternatively control the variation in facial deformations that occur as a result of speech, the trained generator neural network 106 may produce an output facial image 108 with different mouth and lip motion to the input facial image 102. In this way, the trained generator neural network 106 may be used for expression and/or speech synthesis.

The input facial image 102 and/or output facial image 108 comprise a set of pixel values in a two-dimensional array. For example, in a colour image,

where H is the height of the image in pixels, W is the height of the image in pixels and the image has three colour channels (e.g. RGB or CIELAB) . The

facial images

102, 108 may, in some embodiments, be in black-and-white/greyscale.

Expression parameters (also referred to herein as an expression vector) are a set of continuous variables that encode a facial expression. Additionally, the expression parameters may encode facial deformations that occur as a result of speech. Expression parameters may be represented by an N-dimensional vector, e.g. p _i = [p _i, 1, p _i, 2 …p _i, N] ^T. The expression parameters may correspond to parameters of a 3D model, such as a linear 3D blendshape model, as described below. The target expression parameters 104 encode a target facial expression of the output image 108 (i.e. the desired facial expression of the output facial image 108) . The target expression parameters are typically different to initial expression parameters associated with the input facial image 102. The target expression parameters 104 are processed by the trained generator neural network 106 along with the input facial image 102 in order to produce the output facial image 108.

The initial expression and/or target expression parameters 104 may correspond with the expression parameters used by models to vary expression when generating 3D face meshes. The expression parameters may be regularised to lie within a predefined range. For example, the expression parameters may be regularised to lie in the range [-1, 1] . In some embodiments, a zero expression parameter vector (i.e. a set of expression parameters where each parameter is set to 0) is defined that corresponds to a neutral expression. For example, the zero expression parameter vector may correspond to a three-dimensional model with a neutral expression. In some embodiments, the magnitude of an expression parameter may correspond with an intensity of a corresponding expression depicted in the 3D facial mesh.

The trained generator neural network 106 is a neural network configured to process the input facial image 102 and target expression parameters 104 to generate an output facial image 108. In some embodiments, instead of generating the output image 108 directly, the generator neural network 106 may generate an attention mask

112 (also referred to as a smooth deformation mask) . The mask 112 may have the same spatial dimension as the input facial image 102. In these embodiments, the generator neural network may also generate a deformation image,

110. The output facial image 108 may then be generated as a combination of the input facial image 102 and the generator output 110. By being configured to output an attention mask 112, the trained generator neural network 106 may determine which regions of the input facial image 102 should be modified in order to generate an output facial image 108 corresponding to target expression parameters 104.

The trained generator neural network 106 comprises a plurality of layers of nodes, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the trained generator neural network 106 may be convolutional layers. Examples of generator neural network architectures are described below in relation to Figure 6a.

The parameters of the trained generator neural network 106 may be trained using generative adversarial training, and the trained generator neural network 106 may therefore be referred to as a Generative Adversarial Network (GAN) . The trained generator neural network 106 may be the generator network of the generative adversarial training. Generator neural networks 106 trained in an adversarial manner may output more realistic images than other methods. Examples of training methods are described below in relation to Figures 2 and 4.

Figure 2 shows an overview of an example method of training a generator neural network for generating a facial image with a given expression. The method 200 comprises jointly training a generator neural network 208 and a discriminator neural network 216. The generator neural network 208 is trained using a generator loss function 214. The discriminator neural network 216 is trained using a discriminator loss function 222.

The objective of the generator neural network 208 during training is to learn to generate realistic output facial images 210 with a given target expression, which is defined by the target expression parameters 206. The generator loss function 214 is used to update parameters of the generator neural network 208 during training with the aim of producing more realistic output facial images 210. The objective of the discriminator neural network 216 during training is to learn to distinguish between real facial images 202 and output/generated facial images 210. The discriminator loss function 222 is used to update the discriminator neural network 216 with the aim of better discriminating between output (i.e. “fake” ) facial images 210 and input (i.e. “real” ) facial images 202.

The objective of the discriminator neural network during training is to learn to distinguish between real facial images 202 and output/generated facial images 210. Additionally, the discriminator neural network may include a regression layer which estimates expression parameter vectors for an image, which may be a real image or a generated image. This regression layer may be parallel to a classification layer which may output probabilities over image patches in the image indicative of whether the image is real or synthetic. In some embodiments, the discriminator neural network 216 may be a relativistic discriminator, generating predicted classifications indicating the probability of an image being relatively more realistic than a generated one. Example structures of the discriminator neural network are described below, in relation to Figure 5b.

During the training process, the generator neural network 208 and the discriminator neural network compete against each other until they reach a threshold/equilibrium condition. For example, the generator neural network 208 and the discriminator neural network compete with each other until the discriminator neural network can no longer distinguish between real and synthetic facial images. Additionally or alternatively, the adversarial loss calculated by the discriminator neural network may be used to determine when to terminate the training procedure. For example, the adversarial loss of the discriminator neural network may correspond with a distance metric between the distributions of real images and generated images. The training procedure may be terminated when this adversarial loss reaches a threshold. Alternatively, the training procedure may be terminated after a fixed number of training epochs, for example 80 epochs.

The training uses a set of training data (also referred to herein as a training set) . The training set may comprise of a set of K training examples of the form

where I _org is an input facial image 202, p _org is an initial expression parameter vector 204, and p _trg is a target expression parameter vector 206. The initial expression parameters 204 correspond to the expression of the input facial image 202. For example, a 3D facial mesh model which is configured to receive expression parameters may depict the same expression as the expression of the input facial image 202 when instantiated with the initial expression parameters 204.

In some embodiments, the training set may additionally or alternatively comprise a set of L training examples of the form

where I _org, p _org, and p _trg are defined as above, and I _trg is a known target facial image. The target expression parameters 206 may correspond with the target facial image. The method of training the generator neural network 208 where at least one example of the training set includes a target facial image is described below in more detail with respect to Figure 4.

During the training process, the generator neural network 208 is applied to an input facial image 202, I _org, and a corresponding set of target expression parameters 206, p _trg, taken from a training example included in the training data. The output of the generator neural network 208 is a corresponding output facial image 210, I _gen. The output facial image 210 may be represented symbolically as

where

represents the mapping performed by the generator neural network 208.

In some embodiments, instead of generating the output image directly, the generator neural network 208 may generate an attention mask

(also referred to as a smooth deformation mask) . The mask may have the same spatial dimension as the input facial image 202. In these embodiments, the generator neural network may also generate a deformation image,

The output facial image 210 may then be generated as a combination of the input facial image 202 and the generator output. For example, the output facial image 210 may be given by:

The values of the mask may be constrained to lie between zero and one. This may be implemented, for example, by the use of a sigmoid activation function.

In some embodiments, the input facial image 202 and the generated facial image 210 may additionally be processed by a face recognition module (not shown) . The input facial image 202 may be processed by the face recognition module in order to produce an input embedding. Similarly, the output/generated facial image 210 may be processed by the face recognition module in order to produce an output embedding. The input/output embeddings may represent the identity of faces in facial images. The face recognition module may be a component of the loss function 214, as described below in further detail.

The generator neural network 208 is then applied to the output facial image 210, and the set of initial expression parameters 204, p _org, taken from the same training example of the training data. In other words, the initial expression parameters 204 correspond with the expression depicted in the input facial image 202. The output of the generator neural network in this case is a reconstructed facial image 212, I _rec. Similarly, as before, this may be represented symbolically by the mapping

In some embodiments, the generator neural network 208 may generate a mask

(also referred to herein as an attention mask) along with a deformation image

The mask may have the same spatial dimension as the input facial image 202. The reconstructed facial image 212 may then be generated as a combination of the generated facial image 210 and the generator output. For example, the reconstructed facial image 212 may be given by

The generator neural network 208 is configured to process the output facial image 210 and the corresponding set of initial expression parameters 204 in an attempt to produce a reconstructed facial image 212 replicates the input facial image 202. The use of this step during training of the generator neural network 208 can result in the contents of the input facial image 202 (e.g. background elements, presence of glasses and other accessories) being retained when generating output facial images 210 with a trained generator neural network 208.

The discriminator neural network 216 is applied to the input facial image 202 to generate a first predicted classification 218,

The discriminator neural network is also applied to the output facial image 202 to generate a second predicted classification 220,

The predicted classifications may be represented as:

with

representing the processing of the facial image by the discriminator neural network 216 prior to the classification layer and σ represents the action of the classification layer (for example, an activation function) . These

classifications

218, 220 may comprise a probability distribution over image patches in the input facial image 202 and/or the output facial image 210 indicative of whether the image patches are real or synthetic.

In some embodiments, the discriminator neural network 216 may be a relativistic discriminator. The relativistic discriminator generates predicted classifications indicative of the probability of an image being relatively more realistic than a generated one. Examples of relativistic discriminators can be found in “The relativistic discriminator: a key element missing from standard GAN” (Alexia Jolicoeur-Martineau, arXiv: 1807.00734) , the contents of which are incorporated by reference. For example, the predicted classifications may be represented by;

The relativistic discriminator may be a relativistic average discriminator, which averages over a mini-batch of generated images. The classifications generated by a relative average discriminator may be dependent on whether the input to the relativistic average discriminator is a real facial image or a generated facial image. For example, the classifications

for a real image I may be given by:

and for generated images, I, by:

where

and

are averages over generated images and real images in a mini batch respectively.

In some embodiments, in addition to generating these

classifications

218, 220, the discriminator neural network 216 may also output a set of predicted expression parameters from the input facial image 202 and/or the output facial image 210. These predicted parameters may be represented by the mapping

for an image I. The discriminator neural network 216 may include one or more regression layers to determine this mapping. These one or more regression layers may be parallel to a classification layer which may output probabilities over image patches in the image indicative of whether the image is real or synthetic. Example structures of the discriminator neural network 216 are described below, in relation to Figure 6b.

The input facial image 202, the reconstructed facial image 212, the first predicted classification 218 and second predicted classification 220 are used to calculate a generator loss function 214. Calculating the generator loss function 214 may comprise comparing the input facial image 202 and the reconstructed facial image 212. The generator loss function 214 may include a plurality of component loss functions, as will be described in greater detail below. The generator loss is used to update the parameters of the generator neural network 208. The parameters of the generator neural network 208 may be updated using an optimisation procedure that aims to optimise the generator loss function 214.

The optimisation procedure may, for example, be a gradient descent algorithm. The optimisation procedure may use backpropagation, e.g. by backpropagating the generator loss function 214 to the parameters of the generator neural network 208.

The first predicted classification 218 and second predicted classification 220 are used to calculate a discriminator loss function 222. In some embodiments, a comparison of the initial expression parameters 204 and the predicted expression parameters of the input image 202 (not shown) may also be used to determine the discriminator loss function 222. The discriminator loss function 222 may include a plurality of component loss functions, as will be described in greater detail below. The discriminator loss function 222 is used to update the parameters of the discriminator neural network 216. The parameters of the discriminator neural network 216 may be updated using an optimisation procedure that aims to optimise the discriminator loss function 222. The optimisation procedure may, for example, be a gradient descent algorithm. The optimisation procedure may use backpropagation, e.g. by backpropagating the discriminator loss function 222to the parameters of the discriminator neural network 216.

In some embodiments, the generator loss function 214 and/or discriminator loss function 222 may comprise an adversarial loss function. An adversarial loss function depends on the first predicted classification and the second predicted classification. The adversarial loss function may further comprise a gradient penalty term. The gradient penalty term may be based on a gradient of the second classification with respect to the input facial image. Adversarial loss functions may correspond with a distance metric between the distributions of the generated images and the real images.

An example of such an adversarial loss function is given by:

Where

indicates an expectation value taken over a set of input facial images 202,

indicates an expectation value taken over a set of input facial images and corresponding target expression parameters, and

indicates an expectation value taken over a set of generated facial images 210. Expectation operators may be approximated by sampling from the respective distributions from which the expectations are taken, performing the calculation inside the expectation operator, and then dividing by the number of samples taken. In some embodiments,

may be replaced by the relativistic discriminator,

described above.

The term

in the equation above may be referred to as a gradient penalty term. A gradient penalty term may be based on a gradient of the second classification with respect to the input facial image. While the norm in this example is a L ² norm, it will be appreciated that alternative norms may be used. A coefficient λ _gp may control the contribution of the gradient penalty term to the adversarial loss function. For example, this may be set to a number in the range 1-100, such as between 5 and 15, for example 10.

It will be appreciated that other adversarial loss functions may be used, wherein the adversarial loss function depends on an output of a discriminator neural network 216. A generator neural networks 208 trained with adversarial loss functions may generate synthesised images with greater photorealism than when trained with other loss functions. In some embodiments, when the discriminator neural network 216 is a relativistic discriminator, both real and generated images are included in the generator part of the adversarial loss function. This may allow the generator to benefit by the gradients of both real and fake images, generating output facial images 210 with sharper edges and more detail which also better represent the distribution of the real data.

An example of a generator loss function 214 may then be given by:

An example of a discriminator loss function 214 may then be given by:

One or more additional loss functions may be included in the generator loss function 214 and/or discriminator loss function, as described below.

In some embodiments, generator loss function 214 may further comprise a reconstruction loss function. The reconstruction loss function provides a measure of the difference between the input facial image 202 and the reconstructed facial image 212. The reconstruction loss function is based on a comparison between the input facial image 202 and the reconstructed facial image 212. For an input facial image 202 with width W and height H, an example of a reconstruction loss function is given by:

where I _rec is the reconstructed facial image 212, and can be represented symbolically as

An example of a generator loss function 214 incorporating the reconstruction loss is given by

where λ _rec controls the contribution of the reconstruction loss function to the generator loss function. For example, λ _rec may be set to a number in the range 1-100, such as between 5 and 15, for example 10.

Generator neural networks 208 trained using reconstruction loss functions may generate output facial images 210 which better preserve the contents of the input facial image 202 than when trained without reconstruction loss functions. For example, background elements and the presence of accessories (e.g. sunglasses, hats, jewellery) in the input facial image 202 may be retained in the output facial image 210. Other reconstruction loss functions may be used. For example, different norms may be used in the calculation instead of the L1 norm, or the images may be pre-processed before calculating the reconstruction loss.

In some embodiments, generator loss function 214 may further comprise an attention mask loss function. The attention mask loss function compares the attention mask generated from the input facial data 202 to a reconstructed attention mask generated from the first output facial image 212 and a corresponding set of initial expression parameters 204. The attention mask loss may encourage the generator to produce attention masks which are sparse and do not saturate to 1. The attention mask may minimise the L1 norm of the produced masks for both the generated and reconstructed images. For an input facial image 202 with width W and height H, an example of an attention mask loss function is given by

It will be appreciated that norms other than an L1 may alternatively be used.

An example of a generator loss function 214 incorporating the attention mask loss is given by

where λ _att controls the contribution of the attention mask loss function to the generator loss function. For example, λ _att may be set to a number greater than zero, such as between 0.005 and 50, for example 0.3.

Generator neural networks 208 trained with an attention mask loss function may generate output facial images 210 which better preserve the content and the colour of the input facial image 202.

In some embodiments, the generator loss function 214 may include an identity loss function. An identity loss function depends on the input facial image 202 and the output facial image 210. In some embodiments, the identity loss calculation system 216 may include a face recognition module. The face recognition module may be a pre-trained neural network configured to produce identity embeddings. Identity embeddings represent the identity of people in facial images. For example, two images of the same person may be processed by the face recognition module to produce similar embeddings that are closer (as defined by some metric, such as Euclidean distance) to identity embeddings corresponding to different people. The input facial image 202 may be processed by the face recognition module to produce an input embedding, e _org. The output facial image 210 may be processed by the face recognition module to produce an output embedding, e _gen. The identity loss function may depend on the input embedding and the output embedding. An example of an identity loss function is:

An example of generator loss function 214 may then be given as

where λ _id controls the contribution of the identity loss function to the generator loss function. For example, λ _id may be set to a number in the range 1-100, such as between 2 and 10, for example 5. The loss function may additionally include a reconstruction loss term and/or attention mask loss function, as described above.

Generator neural networks 208 trained using identity loss functions may generate output facial images 210 which better maintain the identity of the face in the input facial image 202 than when trained without identity loss functions. Other identity loss functions may be used, for example, different similarity/distance metrics may be used to compare the identity embeddings.

In various embodiments, the generator loss function 214 and/or discriminator loss function may further include an expression loss function. An expression loss is calculated based on an output of the discriminator neural network. The discriminator neural network may output a set of predicted expression parameters from the input facial image 202 and/or the output facial image 210. For example, a regression layer may be used to determine the expression parameters from an input image. The predicted expression parameters may be compared with the initial and/or target expression parameters using an expression parameter loss function. These predicted parameters may be represented by the mapping

for an image I. A different expression loss function may be used in the generator loss function 214 and the discriminator loss function 222.

For an N-dimensional parameter vector, an example expression loss function for the generator loss function 214 is:

A generator loss function may then be given as

where λ _exp controls the contribution of the expression loss function to the generator loss function. For example, λ _exp may be set to a number in the range 1-100, such as between 5 and 15, for example 10. The generator loss function 214 may additionally include a reconstruction loss term and/or attention mask loss term and/or an identity loss term, as described above.

Generator neural networks 208 trained using expression loss functions in the generator loss function 214 and/or discriminator loss function 222 may generate output facial images 210 which more accurately replicate the expression corresponding to the target expression parameters 206 than when trained without expression loss functions. In detail, the generator neural network 208 may generate output facial images which have an accurate expression according to the discriminator neural network.

For an N-dimensional parameter vector, an example expression loss function for the discriminator loss function 222 is:

The discriminator loss function may then be given as

where λ _exp controls the contribution of the expression loss function to the discriminator loss function. For example, λ _exp may be set to a number in the range 1-100, such as between 5 and 15, for example 10.

Discriminator neural networks 508 trained with expression loss functions may accurately estimate the expression parameters of an input facial image. In turn, the generator neural network may be configured to generate output facial images which, according to the discriminator neural network, accurately depict the expression corresponding to the target expression parameters.

It will be appreciated that any combination of the loss functions described above may be used to create a generator loss function. For example, a generator loss function may be given as:

Alternatively, one or more of the components of the above generator loss function may be omitted.

Referring now to Figure 3, an overview of an example method 300 of training a generator neural network 208 for generating a facial image with a given expression is shown. In some embodiments, the training set may comprise a set of L training examples of the form

with L being at least one. In other words, one or more of the training examples may further comprise a target facial image 302, denoted by I _trg, in addition to the input image 202, target expression parameters 206 and original expression parameters 204. This training set may be in addition to or as an alternative to the training set K.

In many respects, the method 300 of Figure 3 proceeds in substantially the same manner as the method 200 shown in Figure 2.

The target facial image 302 comprise a set of pixel values in a two-dimensional array. The target facial image 302 may have the same dimensions as the input facial image 202 and output facial image 210. For example, in a colour image,

where H is the height of the image in pixels, W is the height of the image in pixels and the image has three colour channels (e.g. RGB or CIELAB) . The target facial images 302 may, in some embodiments, be in black-and-white/greyscale.

The target facial image 302 may depict a face with the same identity as the face in input facial image 202, but with expression corresponding with the target expression parameters 206.

The target facial image 302 may be generated synthetically. Using methods described below, models for 3D face reconstruction from 2D images may be fitted to a set of facial images. The input facial image 202 may be processed by the fitted 3D face reconstruction model in order to produce a target facial image 302 with expression corresponding to target expression parameters 206. In this way, the target expression parameters 206 corresponding to the target facial image 302 are known, and thus the target facial image 302 may be referred to as a ground truth output facial image.

In some embodiments, the generator loss 214 may include a generation loss function. The generation loss function compares the target facial image with the output facial image. For example, the generator loss may be based on a difference between the target image 302 and the generated image 210. The difference may be measured by a distance metric. For input/output facial images with width W and height H, an example generation loss function may be given as:

where the generated image 210 may be represented symbolically as

An generator loss function 214 may then be given as

where λ _gen controls the contribution of the generation loss function to the generator loss function. For example, λ _gen may be set to a number in the range 1-100, such as between 5 and 15, for example 10. The generator loss function 214 may additionally include an expression loss function, a reconstruction loss term, attention mask loss function and/or an identity loss term, as described above in relation to Figure 2.

Generator neural networks 208 trained using generation loss functions may generate output facial images 210 which more accurately replicate the expression corresponding to the target expression parameters 206. In detail, the generator neural network 208 may generate output facial images with fewer artefacts than when trained without generation loss functions.

In various embodiments, the generator neural network 208 may be trained by separate loss functions depending on whether the training example includes a target facial image 302 or not. In other words, when the training data comprises the set K and the set L, a generator loss function 214 is selected based on whether the training data for the current iteration is taken from set K or set L. Training examples including a target facial image 302 may be referred to as paired data, and training examples without target facial images 302 may be referred to as unpaired data.

An example of such a generator loss function is given by:

Alternatively, one or more of the terms in the above generator loss function 214 may be omitted.

The training process of the method 200 and/or 300 may update the generator neural network 208 after one or more updates made to the discriminator neural network. The updates to the parameters of the generator and/or discriminator neural networks may be determined using backpropagation. Where the adversarial loss corresponds with a distance metric between distributions, updating the discriminator neural network more often than the generator neural network 208 may lead to the adversarial loss better approximating the distance metric between the distributions of the real and generated images. This in turn may provide a more accurate signal for the generator neural network 208 so that output facial images 210 are generated with more realism.

As described above, initial and/or target

expression parameters

204, 206 may correspond with the expression parameters used to vary expression in 3D facial mesh models. Expression parameters used for training the methods described herein may be extracted by first fitting models for 3D reconstruction of faces to a set of facial images. These models may include 3D Morphable Models, or 3DMM, neural networks, and other machine-learned models. These models for 3D face reconstruction may be configured to optimize three parametric models: a shape model, a texture model, and a camera model, in order to render a 2D instance of the 3D facial reconstruction as close to the initial facial image as possible.

Focussing on a shape model, where the shape parameters are derived using principle component analysis. The 3D facial mesh may be generated by finding the product of each of the shape parameters and a respective basis vector, and summing these products with a mean shape vector. For example, a 3D facial mesh comprising N vertices may be represented as a vector:

An identity parameter vector, p _s, id, may control variations in identity in 3D facial shapes. The 3D facial mesh may be calculated as:

s= m _s+ U _s, idp _s, id

where m _s is a mean shape vector and U _s, id is a matrix of basis vectors corresponding to the principle components of the identity subspace. These basis vectors may be learned from 3D facial scans displaying a neutral expression and the identity parameters may be used to represent identity variations by instantiating a 3D shape instance.

Expression parameters may also be derived using principle component analysis and are used to generate the 3D facial mesh. The 3D facial mesh may be generated by finding the product of each of the identity parameters and a respective basis vector and each of the expression parameters and a respective basis vector, and summing these products with a mean shape vector. For example, the 3D facial mesh may be calculated as:

s=m _s+ U _s, idp _s, id+ U _s, exop _s, exp

or equivalently,

s=m _s+ [U _s, id, U _s, exp] [ _s, id ^T, p _s, exp ^T] ^T

where m _s is a mean shape vector, where U _s, id is a matrix of basis vectors for identity variations and U _s, exp is a matrix of basis vectors for expression variations, and p _s, id are identity parameters controlling identity variations in the 3D facial mesh and p _s, exp are expression parameters controlling expression variations in the 3D facial mesh. The expression basis vectors may be learned from displacement vectors calculated by comparing shape vectors from 3D facial scans depicting a neutral expression and from 3D facial scans depicting a posed expression. These expression parameters may be used to represent expression variations by, in addition to the identity parameters, instantiating a 3D shape instance.

Expression parameters may be configured to be in the range [-1, 1] . Where expression parameters are derived using principle component analysis, the expression parameters may be normalised by the square root of the eigenvalues e_i, i=1, …, N of the PCA blendshape model. Additionally, the zero expression parameter vector, where each element of the vector is set to 0, may correspond with a 3D facial mesh depicting a neutral expression. Moreover, the magnitude of an expression parameter may correspond with the intensity of the expression depicted in the 3D facial mesh. For example, setting a certain expression parameter to -1 may correspond with an intense frown in the 3D facial model. The same expression parameter set to -0.5 may correspond with a more moderate frown. The 3D facial mesh may depict an intense smile when the same parameter is set to 1.

However, methods described herein may not require the 3D facial mesh model to generate a facial image with a given expression; instead the expression parameters may be processed by a generator neural network trained by the methods described herein. Additionally or alternatively, the expression parameters may be predicted by a discriminator neural network trained by the methods described herein.

Thus, by first fitting models for 3D reconstruction of faces from a set of facial images, identity and expression parameters p _s, id , p _s, exp may be extracted from any input facial image 202. Based on the independent shape parameters for identity and expression, expression parameters may be extracted to compose an annotated dataset of K images and their corresponding vector of expression parameters

with no manual annotation cost. This dataset may be used in part to produce a training set for training the generator neural network 208.

The target expression parameters 206 may be expression parameters determined (using the methods described herein or otherwise) from a facial image different to the input facial image 202. Additionally or alternatively, the target expression parameters 206 may be selected by generating a 3D facial mesh with a desired target expression, and using the corresponding expression parameters that produce the desired target expression. Additionally or alternatively, the target expression parameters 104 may be randomly selected, for example they may be sampled from a multivariate Gaussian distribution.

Figure 4 shows a flow diagram of an example method of training a generator neural network for generating a facial image with a given expression.

At operation 4.1, training data comprising a plurality of training examples, each training example comprising an input facial image a set of initial expression parameters and a set of target expression parameters is received by a generator neural network. The expression parameters may correspond to continuous parameters of a 3D facial model. Examples of 3D facial models may include 3D blendshape models and/or 3D morphable models.

In some embodiments, the training data further comprises a target image corresponding to the target expression parameters. The target image may be a ground truth image from which the target expression parameters were extracted. Alternatively or additionally, the target facial image may be a synthetic image.

At operation 4.2, the generator neural network is applied to an input facial image and a corresponding set of target parameters to generate an output facial image.

At operation 4.3, the generator neural network is applied to the first output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image.

At operation 4.4, a discriminator neural network is applied to the output facial image to generate a second predicted classification.

At operation 4.5, the discriminator neural network is applied to the output facial image to generate a second predicted classification. The second predicted classification may comprise a probability distribution over image patches of the output image indicative of whether those patches are real (i.e. from ground truth images) or synthetic (i.e. generated by the generator neural network)

Operations 4.1 to 4.5 may be iterated over the training examples in the training data (e.g. K and/or L described above) to form an ensemble of training examples.

At operation 4.6, the parameters of the generator neural network are updated in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image. The update may be performed after each iteration of operations 4.1 to 4.5. Alternatively, the update may be performed after a number of iterations of operations 4.1 to 4.5. In these embodiments, the generator loss function may comprise one or more expectation values taken over the ensemble of training examples.

In embodiments where a target image is used, the parameters of the generator neural network may also be updated in dependence on a comparison of the target facial image with the output facial image.

At operation 4.7, the parameters of the discriminator neural network are updated in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification. The update may be performed after each iteration of operations 4.1 to 4.5. Alternatively, the update may be performed after a number of iterations of operations 4.1 to 4.5. In these embodiments, the discriminator loss function may comprise one or more expectation values taken over the ensemble of training examples.

The parameters of the generator neural network and/or discriminator neural network may be updated using an optimisation procedure applied to the generator loss function and/or discriminator loss function respectively. Examples of such an optimisation include, but are not limited to, gradient descent methods and/or gradient free methods.

Operations 4.1 to 4.7 may be iterated until a threshold condition is met. The threshold condition may be, for example, a threshold number of training epochs. The threshold number of training epochs may lie in the range 50-150, such as 70-90, for example 80. A predetermined number of epochs may be used for each training set. For example, a first number of training epochs may be performed on the training set K, followed by a second number of training epochs on training set L. The number of training epochs for each training set may lie in the range 20-100, such as 30-50, for example 40.

Different training examples from the training dataset may be utilised for each iteration of operations 4.1 to 4.7. For example, the training set K (or some subset of it) may be used in one or more of the iterations. The training set L (or some subset of it) may be used in one or more of the other iterations. The training sets used in each iteration may have a predefined batch size. The batch size may, for example lie in the range 5-100, such as between 10 and 20, for example 16.

Figures 5a and 5b show an example structure of a generator neural network 500 for generating a facial image with a given expression and an example structure of a discriminator neural network 512 for predicting the expression parameters and calculating an adversarial loss for a facial image.

Figure 5a shows an example structure of a generator neural network 500. The generator neural network 500 is a neural network configured to process an input facial image 502 and expression parameters 504 to generate an output facial image 508. The output facial image 508 corresponds to the input facial image 502, but with an expression dependent on the expression parameters 504. In some embodiments, instead of generating the output image directly, the generator neural network 106 may generate an attention mask

520 (also referred to as a smooth deformation mask) . The mask 520 may have the same spatial dimension as the input facial image 502. In these embodiments, the generator neural network 500 may also generate a deformation image,

522. The output facial image 508 may then be generated as a combination of the input facial image 502 and the generator output 522.

The generator neural network 500 comprises a plurality of layers of nodes 506, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the generator neural network 500 may be convolutional layers.

Figure 5b shows an example structure of a discriminator neural network 512. The discriminator neural network 512 is a neural network configured to process a facial image 512 to output a classification 516. The generator neural network may, in some embodiments, further output a set of predicted expression parameters 518. The predicted expression parameters 518 may be determined by one or more regression layers (not shown) . The regression layers may in parallel to the other layers 514 of the discriminator neural network 512. Alternatively, determination of the predicted expression parameters 518 may be performed as part of the other layers 514 of the discriminator neural network 512. The discriminator neural network 512 comprises a plurality of layers of nodes 514, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the discriminator neural network 512 may be convolutional layers.

Figure 6 shows an overview of an example method 600 of predicting the expression parameters 606 of a facial image 602 using a trained discriminator neural network 604.

The discriminator neural network may, for example, be trained by the methods described in relation to Figure 2.

The input image 602 comprise a set of pixel values in a two-dimensional array. For example, in a colour image,

where H is the height of the image in pixels, W is the height of the image in pixels and the image has three colour channels (i.e. RGB) . The input image 602 may, in some embodiments, be in black-and-white/greyscale.

Expression parameters 606 are a set of continuous variables that encode a facial expression (also referred to herein as an expression vector) . Additionally, the expression parameters may encode facial deformations that occur as a result of speech. Expression parameters 606 may be represented by an N-dimensional vector, e.g.

The expression parameters may correspond to parameters of a 3D facial model, such as a 3DMM and/or a linear 3D blendshape model, as described above. The trained discriminator neural network may be trained to minimise an expression loss function, so as to accurately regress the expression parameters 606 to correspond with the input image 602.

Discriminator and generator neural networks trained by the methods disclosed herein may be used for expression transfer, i.e. the transfer of an expression from a source facial image to a target facial image. For example, a trained generator neural network may process the expression parameters from the source facial image, and the target facial image. The expression parameters from the source facial image may be extracted by a trained discriminator neural network 604. The output of the trained generator neural network may be a facial image depicting the identity and other elements of the target facial image but with the expression depicted in the source facial image.

Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein. The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system.

The apparatus (or system) 700 comprises one or more processors 702. The one or more processors control operation of other components of the system/apparatus 700. The one or more processors 702 may, for example, comprise a general purpose processor. The one or more processors 702 may be a single core device or a multiple core device. The one or more processors 702 may comprise a Central Processing Unit (CPU) or a graphical processing unit (GPU) . Alternatively, the one or more processors 702 may comprise specialised processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be included.

The system/apparatus comprises a working or volatile memory 704. The one or more processors may access the volatile memory 704 in order to process data and may control the storage of data in memory. The volatile memory 704 may comprise RAM of any type, for example Static RAM (SRAM) , Dynamic RAM (DRAM) , or it may comprise Flash memory, such as an SD-Card.

The system/apparatus comprises a non-volatile memory 706. The non-volatile memory 706 stores a set of operation instructions 708 for controlling the operation of the processors 702 in the form of computer readable instructions. The non-volatile memory 706 may be a memory of any kind such as a Read Only Memory (ROM) , a Flash memory or a magnetic drive memory.

The one or more processors 702 are configured to execute operating instructions 808 to cause the system/apparatus to perform any of the methods described herein. The operating instructions 708 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 700, as well as code relating to the basic operation of the system/apparatus 700. Generally speaking, the one or more processors 702 execute one or more instructions of the operating instructions 708, which are stored permanently or semi-permanently in the non-volatile memory 706, using the volatile memory 704 to temporarily store data generated during execution of said operating instructions 708.

Implementations of the methods described herein may be realised as in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits) , computer hardware, firmware, software, and/or combinations thereof. These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 8, cause the computer to perform one or more of the methods described herein.

Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.

Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.

Although several embodiments have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims.

Claims

A method of training a generator neural network to generate a facial image with a target expression from a facial image and a set of target expression parameters, wherein the training comprises:

receiving training data comprising a plurality of training examples, each training example comprising an input facial image, a set of initial expression parameters and a set of target expression parameters;

applying the generator neural network to an input facial image and a corresponding set of target parameters to generate an output facial image;

applying the generator neural network to the first output facial image and a corresponding set of initial expression parameters to generate a reconstructed facial image;

applying the discriminator neural network to the input facial image to generate a first predicted classification;

applying the discriminator neural network to the output facial image to generate a second predicted classification; and

updating parameters of the generator neural network in dependence on a generator loss function depending on the first predicted classification, the second predicted classification and a comparison between the input facial image and the reconstructed facial image; and/or

updating parameters of the discriminator neural network in dependence on a discriminator loss function depending on the first predicted classification and the second predicted classification.
The method of claim 1, wherein the first classification and/or the second classification comprises a probability distribution over image patches in the input facial image and/or output facial image indicative of whether the image patches are real or synthetic.
The method of claims 1 or 2, wherein the generator neural network and/or discriminator neural network comprises one or more convolutional layers.
The method of any preceding claim, wherein the set of initial expression parameters and/or the set of target expression parameters correspond to continuous parameters of a linear three-dimensional blendshape model.
The method of any preceding claim, wherein one or more of the training examples further comprises a target facial image corresponding to the set of target expression parameters and wherein the generator loss function further depends on a comparison of the target facial image with the output facial image when using training examples comprising a target facial image.
The method of any preceding claim, wherein one or more of the training examples comprising a target facial image is generated synthetically.
The method of any preceding claim, wherein the method further comprises:

applying a face recognition module to the input facial image to generate an input embedding; and

applying the face recognition module to the output facial image to generate an output embedding, and

wherein the generator loss function further depends on a comparison of the input embedding with the output embedding.
The method of claim 7, wherein the face recognition module comprises a pre-trained neural network.
The method of any preceding claim, wherein the generator loss function and/or discriminator loss function comprises a gradient penalty term based on a gradient of the second classification with respect to the input facial image.
The method of any preceding claim, wherein updating parameters of the generator neural network and/or parameters of the discriminator neural network is performed using backpropagation.
The method of any preceding claim, wherein the discriminator neural network comprises a regression layer, and wherein the discriminator neural network further generates a set of predicted expression parameters from the input facial image and/or the output facial image.
The method of claim 11, wherein the discriminator loss function further depends on a comparison of the set of predicted expression parameters for an input image with the set of initial expression parameters for the input image.
The method of any of claims 11 or 12, wherein updating parameters of the generator neural network is further based on a comparison of the set of predicted expression parameters for an output image with the set of target expression parameters used to generate the output image.
The method of any preceding claim, wherein:

the generator neural network further generates an attention mask from the input facial data and a reconstructed attention mask from the first output facial image and a corresponding set of initial expression parameters; and

wherein updating parameters of the generator neural network is further in dependence on a comparison of the attention mask and the reconstructed attention mask.
The method of claim 14, wherein the generator neural network further generates deformation image from the input facial data, and wherein the output image is generated by combining the deformation image, the attention mask and the input facial image.
The method of any preceding claim, wherein the discriminator neural network is a relativistic discriminator.
A method of generating a target image from an input image and a set of target expression parameters, the method comprising:

receiving the set of target expression parameters, the target expression parameters taken from a continuous range of target expression parameters;

receiving the input image;

applying a generator neural network to the input image and the set of target expression parameters to generate the target image,

wherein the generator neural network is trained according to the method of any preceding claim.
A method of determining a set of expression parameters from an input facial image, the method comprising:

receiving an input image; and

applying a discriminator neural network to the input image to generate the set of expression parameters,

wherein the discriminator neural network has been trained according to the method of any of claims 11 to 13.