WO2021008444A1

WO2021008444A1 - Generating three-dimensional facial data

Info

Publication number: WO2021008444A1
Application number: PCT/CN2020/101206
Authority: WO
Inventors: Baric GECER; Stefanos ZAFEIRIOU
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2019-07-15
Filing date: 2020-07-10
Publication date: 2021-01-21
Also published as: GB2585708B; GB2585708A; GB201910114D0

Abstract

Generating 3D facial data using a neural network. According to a first aspect, this specification describes a computer implemented method of generating three-dimensional facial data using a generator neural network, the method comprising: inputting, into the generator neural network, initialization data, the initialization data comprising noise data (2.1); processing the initialization data through a plurality of neural network layers of the generator neural network to generate UV maps in a plurality of modalities, the UV maps comprising a shape UV map of a face, a texture UV map of the face, and a normal UV map of the face (2.2); and outputting, from the generator neural network, facial data comprising the UV maps in the plurality of modalities (2.3). The generator neural network (302) comprises: an initial set (320) of neural network layers configured to generate a plurality of feature maps (322) from the initialization data; a first branch (326) of neural network layers configured to generate the shape UV map (314) of the face from one or more of the plurality of feature maps (322); a second branch (328) of neural network layers configured to generate the texture UV map (316) of the face from one or more of the plurality of feature maps (322); and a third branch (330) of neural network layers configured to generate the normal UV map (318) of the face from one or more of the plurality of feature maps (322).

Description

Generating Three-Dimensional Facial Data

Field of the Invention

This specification relates to generating 3D facial data using a neural network and methods for training a neural network to generate three dimensional facial data.

Background

Generating realistic 3D faces is of high importance in a number of fields, including computer graphics, movie postproduction and computer games. Generally, research on 3D face generation revolves around linear statistical models of the facial surface, which fail to fully capture aspects of the facial geometry and textures that lead to realistic results.

Currently, 3D face generation in computer games and movies is performed by expensive capturing systems or by professional technical artists. The current state-of-the art methods generate faces, which can be suitable for applications such as caricature avatar creation in mobile devices but do not generate high-quality photo-realistic faces.

Summary

According to a first aspect, this specification describes a computer implemented method of generating three-dimensional facial data using a generator neural network, the method comprising: inputting, into the generator neural network, initialization data, the initialization data comprising noise data; processing the initialization data through a plurality of neural network layers of the generator neural network to generate UV maps in a plurality of modalities, the UV maps comprising a shape UV map of a face, a texture UV map of the face, and a normal UV map of the face; and outputting, from the generator neural network, facial data comprising the UV maps in the plurality of modalities. The generator neural network comprises: an initial set of neural network layers configured to generate a plurality of feature maps from the initialization data; a first branch of neural network layers configured to generate the shape UV map of the face from the plurality of feature maps; a second branch of neural network layers configured to generate the texture UV map of the face from the plurality of feature maps; and a third branch of neural network layers configured to generate the normal UV map of the face from the plurality of feature maps.

The generator neural network may comprise one or more further neural network branches, each further neural network branch configured to generate a UV map of the face having a further modality.

The further modalities may comprise one or more of: a cavity UV map of the face; a gloss UV map of the face; or a scatter UV map of the face; a specular albedo UV map of the face; a detail normal UV map of the face; a translucency UV map of the face; a roughness UV map of the face; and/or a detail weight UV map of the face.

The initialization data may further comprise one or more emotional parameters, and wherein the facial data has an expression corresponding to the expression parameters.

The neural network layers of the generator neural network may comprise one or more convolutional layers. The neural network layers of the generator neural network may comprise neural network comprise one or more upscaling layers.

The noise data may comprise Gaussian noise.

The method may further comprise generating a three dimensional model using the shape UV map of the face, the texture UV map of the face, and the normal UV map of the face, the three dimensional model comprising the face. Generating a three dimensional model may comprise applying an identity generic rendering map to the face. The three dimensional model may comprise a full head model.

According to a further aspect, this specification describes a computer implemented method of training a generator neural network to generate three-dimensional facial data, the method comprising: generating facial data using any of the methods described herein; inputting the generated facial data into a discriminator neural network; processing the generated facial data through a plurality of neural network layers of the discriminator neural network to generate a first realism score; inputting training facial data into the discriminator neural network, the training facial data comprising a shape UV map of a facial scan, a texture UV map of the facial scan, and a normal UV map of the facial scan; processing the training facial data through a plurality of neural network layers of the discriminator neural network to generate a second realism score; updating parameters of the generator neural network in dependence on the first realism score; updating parameters of the discriminator neural network in dependence on the first realism score and the second realism score; and iterating the method until a threshold condition is met. The discriminator neural network comprises: a plurality of input branches configured to generate a plurality of feature maps from the facial data, each input branch receiving a UV map in one of the plurality of modalities and comprising a plurality of neural network layers; a combined set of neural network layers configured to jointly process the features maps from the first input branch, second input brand and third input branch to generate a realism score.

Updating parameters of the generator neural network in dependence on the first realism score and/or updating parameters of the discriminator neural network in dependence on the first realism score and the second realism score may be performed using one or more loss functions. The one or more loss functions may comprise a WGAN-GP Wasserstein loss function.

The discriminator neural network may generate one or more predicted emotional labels from an input shape UV map, a texture UV map and normal UV map, and updating parameters of the generator neural network is further in dependence on a comparison of the predicted emotional label to a known emotional label used by the generator neural network and/or updating parameters of the discriminator neural network is further in dependence on a comparison of the predicted emotional label to a known emotional label of the training facial data. The known emotional label of the training facial data may be determined from an expression recognition neural network.

The neural network layers of the discriminator neural network may comprise one or more convolutional layers. The neural network layers of the discriminator neural network may comprise neural network comprises one or more downsampling layers.

According to a further aspect, this specification describes a computer implemented method of training a facial recognition neural network, the method comprising: applying the facial recognition neural network to a plurality facial images from a training dataset to generate a set of features for each of the three-dimensional facial images; updating parameters of the facial recognition neural network in dependence on a comparison of the sets of features generated for the three-dimensional facial images to corresponding known sets of features for the three-dimensional facial images. The plurality of facial images comprises: a first plurality of facial images, each generated from a three-dimensional model that has been generated according to any of the methods described herein; and a second plurality of facial images generated captured from real world-images.

According to a further aspect, this specification describes apparatus comprising one or more processors and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, causes the apparatus to perform any of the methods described herein.

According to a further aspect, this specification describes a computer program product comprising computer readable code that, when executed by a computer, causes the computer to perform any of the methods described herein.

As used herein, the term facial modality is preferably used to connote a representation of one or more properties of a facial image. Facial modalities may include, but are not limited to: shape; texture; normal directions; scatter; translucency; specular albedo; roughness; detail normal maps; and detail weight maps.

Brief Description of the Drawings

Embodiments will now be described by way of non-limiting examples with reference to the accompanying drawings, in which:

Figure 1 shows an example of a method of generating 3D facial data using a neural network;

Figure 2 shows a flow diagram of a method of generating 3D facial data using a neural network;

Figure 3 shows an example of a method of training a neural network to generate 3D facial data;

Figure 4 shows a flow diagram of a method of training a neural network to generate 3D facial data;

Figure 5 shows an example of a method of training a neural network to generate 3D facial data with a given expression;

Figure 6 shows an example of generating a UV map from a head scan;

Figure 7 shows an example of a method of generating a 3D render of a face using 3D facial data; and

Figure 8 shows an example of a system/apparatus for performing the method described herein.

Detailed Description

A neural network is used to synthesize realistic 3D facial data for use in the rendering of 3D facial images. The neural network has a trunk-branch based GAN architecture that is jointly trained to generate a plurality of facial modalities that comprise shape, texture and normal modalities. The neural network architecture allows correlations between facial data in different modalities to be maintained, while tolerating domain specific differences in the modalities. The combination of shape, texture and normal modalities can allow for more realistic 3D facial renders to be generated.

In some embodiments, the neural network can be conditioned with expression parameters to generate 3D facial data with a given expression. The trunk-branch based GAN architecture of the neural network allows texture and normal modalities to be correlated with the expression parameters in addition to the shape modality, resulting in more realistic 3D facial renders of expressions.

Figure 1 shows an example of a method 100 of generating 3D facial data using a neural network 102 (also referred to herein as a “generator neural network” ) . Initialisation data 104 (z, also referred to herein as “input data” ) is fed into the neural network 102 and processed through a plurality of neural network layers to generate a set of UV maps 106 (also referred to herein as “coupled UV maps” ) of 3D facial data in a plurality of facial modalities. The plurality of facial modalities comprises a shape UV map 108 of a face, a texture UV map 110 of the face, and a normal UV map 112 of the face. The set of UV maps 106 of 3D facial data may be used to generate a 3D model of a face 114, for example as described below in relation to Figure 7. The set of UV maps 106 refer to the same underlying facial image.

The initialisation data 104 may comprise one or more random numbers. For example, the initialisation data may comprise a set of random noise, such as Gaussian random noise. In some embodiments, the initialisation data may further comprise expression data, as described in more detail below with reference to Figure 5.

A UV map is a 2D representation of a 3D surface or mesh, and thus may be considered to be 3D data. Points in 3D space (for example described by (x, y, z) co-ordinates) are mapped onto a 2D space (described by (u, v) co-ordinates) . A UV map may be formed by unwrapping a 3D mesh in a 3D space onto the u-v plane in the 2D UV space, and storing parameters associated with the 3D surface at each point in UV space. For example, a shape UV map 108 may be formed by storing the (x, y, z) co-ordinates of vertices of the 3D surface/mesh in the 3D space at corresponding points in the UV space. A texture UV map 110 may be formed by storing colour values of the vertices of the 3D surface/mesh in the 3D space at corresponding points in the UV space. A normal UV map 112 may be formed by storing the normal orientation (n _x, n _y, n _z) of the vertices of the 3D surface/mesh in the 3D space at corresponding points in the UV space. Other examples of properties/modalities that can be stored in a UV map are also possible.

The shape modality describes the location of points in a mesh of a 3D face (for example, the (x, y, z) coordinates of mesh points) . The texture modality describes the colour of the mesh points (and associated regions) of the 3D face. These may, for example, be RGB values or CIELAB values. The normal modality provides surface normal for the points in a mesh of the 3D face that can be used to perform lighting calculations when rendering the face.

Further UV maps in further modalities that may output by additional branches of the generator neural network may comprise one or more of: scattering that defines the intensity of subsurface scattering of the skin; translucency that defines an amount of light that travels inside the skin and which may be emitted in different directions; specular albedo that gives an intensity of specular highlights (which may differ between different areas of the face, such as hair-covered areas, the eyes and the teeth) ; roughness that describes the scattering of specular highlights and controls a glossiness of the skin; a detail normal map, which may supplement the normal map to provide additional fine details (for example, to mimic skin pores) ; and a detail weight map that controls the appearance and location of the detail normals (for example, so that they do not appear on the eyes, lips and hair) .

While the set of UV maps 106 has been described as comprising multiple UV maps, it will be understood that the UV maps may be combined into a single UV map (e.g. by being concatenated) , with points in the single UV map being associated with data from multiple modalities.

The neural network 102 comprises a plurality of layers of nodes, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. The neural network 102 with L layers may be represented symbolically as

where x is the input to the neural network 102.

One or more of the layers of the neural network 102 may be convolutional layers, e.g. layers configured to apply a 2D convolutional filter to the output of a previous layer or the initialisation data. One or more of the layers of the neural network 102 may be an upscaling layer, e.g. a layer configured to increase the dimensionality of the output of a previous layer or the initialisation data. One or more of the layers of the neural network 102 may be a fully connected layer.

The plurality of layers of nodes of the neural network 102 comprise an initial set of neural network layers 116 (also referred to herein as “modality correlation layers” and/or “trunk layers” ) to generate a plurality of feature maps 118 from the initialisation data 104. The initial set of neural network layers 116 may be represented symbolically as

where x is the input to the neural network 102, and d < L is the number of layers of the initial set of neural network layers 116.

The neural network 102 further comprises a plurality of branches of neural network layers 120 (also referred to herein as “modality-specific layers” ) . Each branch of neural network layers takes as input one or more of the feature maps 118 and generates a UV map of facial data in a given modality. The plurality of branches of neural network layers comprises a first branch 122 of neural network layers (denoted

with L-d being the number of layers of the first branch) configured to generate the shape UV map 108 of the face from the plurality of feature maps 118, a second branch 124 of neural network layers (denoted

with L-d being the number of layers of the second branch) configured to generate the texture UV map 110 of the face from the plurality of feature maps 118 and a third branch 126 of neural network layers (denoted

with L-d being the number of layers of the third branch) configured to generate the normal UV map 112 of the face from the plurality of feature maps 118. While the branches described above each have the same number of layers, it will be appreciated that one or more of the branches may have a different number of layers to another branch.

The plurality of branches of neural network layers may further comprise branches associated with other modalities. These branches may be configured to generate a UV map of the face in a further modality from the plurality of feature maps.

By sharing the initial set of layers before splitting into a plurality of branches, correlated sets of UV maps in different modalities can be generated. The branches each specialise in a given modality, while the trunk network maintains local correspondences among them. Facial renders/models produced from these coupled UV maps can show enhanced realism when compared to other methods.

Figure 2 shows a flow diagram of a method of generating 3D facial data using a neural network. The method may be implemented on a computer.

At operation 2.1, initialization data is input into a generator neural network. The initialization data may comprise noise data. An example of such noise data is Gaussian noise.

At operation 2.2, the initialization data is processed through a plurality of neural network layers of the generator neural network to generate UV maps in a plurality of modalities. The UV maps comprise a shape UV map of a face, a texture UV map of the face, and a normal UV map of the face.

The generator neural network comprises: an initial set of neural network layers configured to generate a plurality of feature maps from the initialization data; a first branch of neural network layers configured to generate the shape UV map of the face from the plurality of feature maps; a second branch of neural network layers configured to generate the texture UV map of the face from the plurality of feature maps; and a third branch of neural network layers configured to generate the normal UV map of the face from the plurality of feature maps. It may further comprise one or more further neural network branches, each further neural network branch configured to generate a UV map of the face having a further modality. The further modalities may be one or more of: a cavity UV map of the face; a gloss UV map of the face; or a scatter UV map of the face; a specular albedo UV map of the face; a detail normal UV map of the face; a translucency UV map of the face; a roughness UV map of the face; and/or a detail weight UV map of the face.

At operation 2.3, facial data comprising the UV maps in the plurality of modalities is output from the generator neural network. The UV maps may be used to generate a three dimensional model of a face.

Figure 3 shows an example of a method 300 of training a neural network 302 to generate 3D facial data. Once trained, the neural network 302 may be used in the methods described above in relation to Figures 1 and 2. The method uses a generative adversarial training process to train the generator neural network 302 to produce realistic facial data. A discriminator neural network 304 is jointly trained with the generator neural network 304 to distinguish between “real” facial data 306 (also referred to herein as “training facial data” ) taken from a training set 308 and “fake” facial data 310 (also referred to herein as “generated facial data” ) generated by the generator neural network 302. During training, the generator neural network 302 and discriminator neural network 304 compete until some threshold condition is met.

The generator neural network 302 generates fake facial data 310 as described above in relation to Figures 1 and 2. Initialisation data 312, z, is fed into the generator neural network 302 and processed through a plurality of neural network layers to generate a set of fake UV maps 310 of 3D facial data in a plurality of facial modalities. The plurality of facial modalities comprises a shape UV map 314 of a face, a texture UV map 316 of the face, and a normal UV map 318 of the face.

The plurality of layers of nodes of the generator neural network 302 comprises an initial set 320 of neural network layers to generate a plurality of feature maps 322 from the initialisation data. The initial set 320 of neural network layers may comprise one or more convolutional layers configured to apply a convolutional filter to the input 312 and/or the output of a previous layer. The initial set 320 of neural network layers may comprise one or more upscaling layers configured to upscale the input 312 and/or the output of a previous layer.

The generator neural network 302 further comprises a plurality of branches 324 of neural network layers. Each branch of neural network layers takes as input one or more of the feature maps 322 and generates a UV map of facial data in a given modality. The plurality of branches of neural network layers comprises a first branch 326 of neural network layers configured to generate the shape UV map 314 of the face from the plurality of feature maps 322, a second branch 328 of neural network layers configured to generate the texture UV map 316 of the face from the plurality of feature maps 322 and a third branch 330 of neural network layers configured to generate the normal UV map 318 of the face from the plurality of feature maps 322. The plurality of branches 324 of neural network layers may further comprise branches associated with other modalities. These branches may be configured to generate a UV map of the face in a further modality from the plurality of feature maps 322. The each branch of neural network layers may comprise one or more convolutional layers configured to apply a convolutional filter to one or more of the feature maps 322 and/or the output of a previous layer. The each branch of neural network layers may comprise one or more upscaling layers configured to upscale the input 312 and/or the output of a previous layer.

The fake facial data 310 and/or real facial data 306 is input into the discriminator neural network 304 and processed through a plurality of neural network layers to generate an output 332 indicative of whether said input is fake or real. The output 332 may be one or more realism scores. The realism score may indicate a probability that the input is a real or fake set of facial data. The realism score may be regression of a real/fake score, for example with a score of one indicting realism. When the input is fake facial data 310, the corresponding realism score 332 may be referred to as a “first realism score” . When the input is real facial data 306, the corresponding realism score 332 may be referred to as a “second realism score” .

The discriminator neural network 304 may be denoted

where x is the input and L is the number of layers.

The discriminator neural network 304 comprises a set of input branches 334 of neural network layers. Each input branch receives as input a UV map in one of the facial modalities and process the UV map through a plurality of neural network layers to generate one or more corresponding feature maps for that modality (also referred to herein as modality feature maps) . The input branches comprise a first input branch 336 of neural network layers (denoted

with L-d being the number of layers of the first input branch) configured to receive a shape UV map as input, a second input branch 338 of neural network layers (denoted

with L-d being the number of layers of the first input branch) configured to receive a texture UV map as input, and a third input branch 340 of neural network layers (denoted

with L-d being the number of layers of the first input branch) configured to receive a normal UV map as input. The input braches may further comprise one or more further input branches that each take as input a UV map in a further facial modality. The each input branch of neural network layers 334 may comprise one or more convolutional layers configured to apply a convolutional filter to the corresponding input UV map and/or the output of a previous layer. The each branch of neural network layers 334 may comprise one or more down-sampling layers configured to down-sample the input UV map and/or the output of a previous layer. While the input branches described above each have the same number of layers, it will be appreciated that one or more of the input branches may have a different number of layers to another input branch.

The discriminator neural network 304 further comprises a combined set of neural network layers 342 (denoted

with d < L being the number of layers of the combined set of neural network layers) configured to jointly process the features maps output by the input branches 334 to generate a realism score. The combined set 342 of neural network layers (also referred to herein as “modality correlation layers” and/or “discriminator trunk layers” ) may comprise one or more convolutional layers configured to apply a convolutional filter to the modality feature maps and/or the output of a previous layer. The combined set 342 of neural network layers may comprise one or more down-sampling layers configured to down-sample the feature maps and/or the output of a previous layer. The combined set 342 of neural network layers may comprise one or more fully connected layers. For example, the final (i.e. output) layer may be a fully connected layer.

The generator neural network 302 is trained in dependence on the realism scores generated from the fake facial data 310 (i.e. the first realism scores) . Parameters of the generator neural network are updated based on the first realism score. A generator loss function (

also referred to as a “generator objective function” ) may be used to update the parameters of the generator neural network. The parameters of the generator neural network may be updated using an optimisation procedure on the generator loss function. An example of such an optimisation procedure is gradient descent, though other optimisation procedures may alternatively be used. An example of a generator loss function is the generator loss from the WGAN-GP Wasserstein loss function:

The discriminator neural network 304 is trained in dependence on the realism scores generated from the fake facial data 310 (i.e. the first realism scores) and realism scores generated from the training data 310 (i.e. the second realism scores) . Parameters of the discriminator neural 304 network are updated based on the first realism score and the second realism score. A discriminator loss function (

also referred to as a “discriminator objective function” ) may be used to update the parameters of the discriminator neural network. The parameters of the discriminator neural network may be updated using an optimisation procedure on the discriminator loss function. An example of such an optimisation procedure is gradient descent, though other optimisation procedures may alternatively be used. An example of a discriminator loss function is the discriminator loss from the WGAN-GP Wasserstein loss function:

where

where a denotes uniform random numbers between 0 and 1, and λ is a balancing parameter.

It will be appreciated that other GAN loss functions may alternatively be used to train the generator and discriminator neural networks.

Figure 4 shows a flow diagram of a method of training a neural network to generate 3D facial data. The method may be implemented on a computer. The method may be iterated until a threshold condition is met.

At operation 4.1, facial data is generated using a generator neural network. This generated “fake” facial comprises a plurality of UV maps. The plurality of UV maps comprises UV maps in a plurality of modalities. The plurality of modalities comprises a shape modality, a texture modality and a normal modality. The facial data may be generated using any of the methods descr4ibed above in relation to Figures 1-3. For each iteration, new input data (e.g. a new random noise) may be input into the generator neural network to generate the facial data.

In some embodiments, a plurality of sets of facial data is generated during each iteration of the method. Each set of facial data comprises a plurality of UV maps in a plurality of modalities, and is generated from a different set of input data.

At operation 4.2, the generated facial data is input into a discriminator neural network. The discriminator neural network comprises a plurality of input branches configured to generate a plurality of feature maps from the facial data, each input branch receiving a UV map in one of a plurality of modalities and comprising a plurality of neural network layers. The discriminator neural network further comprises a combined set of neural network layers configured to jointly process the features maps from the plurality of input branches to generate a realism score. The discriminator neural network may be the discriminator neural network described above in relation to Figure 3.

At operation 4.3, the generated facial data is processed through a plurality of neural network layers to generate a first realism score.

In embodiments where the facial data comprises a plurality of sets of facial data, a first realism score may be generated for each of the sets of facial data.

At operation 4.4, training facial data is input into the discriminator neural network. The training facial data comprises a plurality of UV maps in the same modalities as the fake UV maps generated by the generator neural network. The training data may be generated from ground-truth facial scans, for example using the method described in relation to Figure 6. Each sample in the training data may comprise a shape UV map of a given facial scan, a texture UV map of the facial scan, and a normal UV map of the facial scan.

In some embodiments, a plurality of training samples from the training data is input into the discriminator neural network at each iteration.

At operation 4.5, the input training data is processed through a plurality of layers to generate a second realism score.

In embodiments where a plurality of training samples is input into the discriminator neural network, a second realism score may be generated for each of the training samples.

At operation 4.6, parameters of the generator neural network are updated in dependence on the first realism score. A generator objective/loss function may be used to determine how to update the parameters of the generator neural network. An optimisation procedure, such as gradient descent, may be applied to the loss function in order to determine the updated parameters of the generator neural network.

At operation 4.7, parameters of the discriminator neural network are updated in dependence on the first realism score and the second realism score. A discriminator objective/loss function may be used to determine how to update the parameters of the discriminator neural network. An optimisation procedure, such as gradient descent, may be applied to the loss function in order to determine the updated parameters of the discriminator neural network.

At operation 4.8, a threshold condition is checked. If the threshold condition is satisfied, the training procedure is terminated at operation 4.9, and the trained generator and discriminator neural networks are output. If the threshold condition is not satisfied, the method returns to operation 3.1 (i.e. another iteration is performed) .

Examples of threshold conditions include one or more of: a threshold number of iterations/training epochs; an equilibrium condition between the generator loss function and discriminator loss function; and/or a threshold change in the values of the generator loss function and/or discriminator loss function falling below a predefined value.

Figure 5 shows an example of a method 500 of training a neural network to generate 3D facial data with a given expression. The method 500 is an extension of the methods described above to allow generation of faces with a given expression using a generator neural network 502, i.e. producing a conditional GAN conditioned on expression paramters. The generator neural network may be trained using a generative-adversarial approach, where the generator neural network is in completion with a discriminator neural network 504.

In addition to random noise 506, input data to the generator neural network 502 comprises one or more expression parameters 508. The expression parameters 508 encode a facial expression. Examples of expression parameters include: discrete expression labels (such as happy, sad, angry etc. ) ; a vector encoding a facial expression; and/or continuous facial expression parameters, such as activation units and/or valence/arousal values. In general, any facial expression parametrisation may be used.

The generator neural network 504 is configured to jointly process the random noise 506 and expression parameters 508 to generate UV maps (not shown) with an expression relating to the input expression parameters 508 in a plurality of modalities, the UV maps comprising a shape UV map of a face, a texture UV map of the face, and a normal UV map of the face. The generator neural network 502 may have a similar structure to the generator neural networks described above in relation to Figures 1-4, with the input/initial layers configured to receive one or more expression parameters 508 in addition to the random noise 506.

The generated UV maps are input into the discriminator neural network 504 and processed through a plurality of neural network layers to generate one or more realism scores 510 and a one or more sets of predicted expression parameters 512. The discriminator neural network 504 may have a similar structure to the discriminator neural networks described above in relation to Figures 3-4, with the output/combined set of neural network layers configured to output one or more predicted expression parameters 512 in addition to the realism score 510. Realism scores 510 generated from the “fake” UV maps may be referred to as first realism scores.

Training data 514 comprising “real” UV maps of facial scans is also input into the discriminator neural network and processed through a plurality of neural network layers to generate one or more further realism scores 510 and a one or more further sets of predicted expression parameters 512. Realism scores 510 generated from the “real” UV maps may be referred to as second realism scores.

In some embodiments, the training data 512 is also input into a pre-trained expression recognition neural network 516. The expression recognition neural network 516 processes the training data and outputs a ground-truth expression parameter 518. In some embodiments, the training data 514 is pre-processed to generate a facial render 520 that is input into the expression recognition neural network 516.

Alternatively or additionally, the training data may be labelled with the ground-truth expression parameters 516 manually.

During training, the parameters of the generator neural network 502 are updated in dependence on the first realism scores and a comparison of the corresponding predicted expression parameters 512 and input expression parameters 508. A generator objective/loss function may be used to determine the parameter updates as described above in relation to Figures 3-4, with an additional term comparing the predicted expression parameters 512 and input expression parameters 508 included.

Parameters of the of the discriminator neural network 504 are updated in dependence on the second realism scores and a comparison of the corresponding predicted expression parameters 512 and ground truth expression parameters 518. The parameters of the discriminator neural network 504 may be updated further in dependence on the first realism scores and a comparison of the corresponding predicted expression parameters 512 and input expression parameters 508. A discriminator objective/loss function may be used to determine the parameter updates as described above in relation to Figures 3-4, with an additional terms comparing the predicted expression parameters 510 and ground-truth expression parameters 518 and/or the predicted expression parameters 512 and input expression parameters 508 included.

The resulting generator neural network 502 is capable of generating coupled texture, shape and normal UV maps (and potentially maps in further modalities) with a controlled expression. Identity-expression correlation is maintained due to the correlated supervision provided by the training data 514.

Figure 6 shows an example of a method 600 for generating a UV map from a head scan. This method may be used to generate the training data used in the generative adversarial taring described above from 3D head/facial scans. The inverse of this method 600 may be used to generate 3D facial/head renders from UV maps.

The method 600 is used to convert raw 3D facial scans 602 into UV facial maps 604 in one or more modalities that can be processed by the discriminator neural networks.

The raw 3D facial scans 602 are each mapped to a template mesh (T) that describes them both with the same topology to produce a registered 3D scan 606. An example of such a template is the LSFM model, and may correspond to the mean face of the LSFM data set. The template comprises a plurality of vertices sufficient to depict high levels of facial detail (in the example of the LSFM model, 54,000 vertices) . The template is non-rigidly morphed to each facial scan 602, for example using a non-rigid iterative closest point algorithm to generate the registered 3D scan 606.

The meshes of the registered 3D scan 606 are then converted to a sparse spatial UV map 608. UV maps are usually utilised to store texture information. The mesh of the registered 3D scan 606 is unwrapped into UV space to acquire UV coordinates of the mesh vertices. The mesh may be unwrapped, for example, using an optimal cylindrical unwrapping technique. Points in the UV map are associated with modality values of the corresponding points in the template mesh 606. For example, a shape UV spatial map has (x, y, z) coordinates associated with each point in the UV map, a texture UV spatial map has (R, G, B) values associated with each point in the UV map, and a normal UV spatial map has components of a normal vector associated with each point in the UV map.

In some embodiments, prior to storing the 3D co-ordinates in UV space, the mesh is aligned by performing a General Procrustes Analysis (GPA) . The meshes may also be normalised to a [-1, 1] scale.

The sparse spatial UV map 608 is then converted to the UV maps 604 with a higher number of vertices. Two-dimensional interpolation may be used in the UV domain to fill out the missing areas to produce a dense illustration of the originally sparse UV map 608. Examples of such interpolation methods include two-dimensional nearest point interpolation or barycentric interpolation.

In embodiments where the number of vertices is more than 50,000, the UV map size may be chosen to be 256x256x3, which can assist in retrieving a high precision point cloud with negligible resampling errors.

Figure 7 shows an example of a method of generating a 3D render of a face using 3D facial data.

A 3D facial shape 702 is generated from a shape UV spatial map using, for example, the inverse of the process described in relation to Figure 6. The corresponding texture and normal UV maps (collectively referred to as ID-specific UV maps 704) are similarly combined with the 3D facial shape 702 using the inverse of the process described in Figure 6 to generate a 3D facial render 706. Any other UV spatial maps of further modalities output by the generator neural network may similarly be combined to generate the 3D facial render 706.

In some embodiments, one or more generic UV maps 708 of further spatial modalities may additionally be used to generate the 3D facial render 706. This can improve the quality of the 3D facial render 706 without the need for the generator neural network to generate additional UV maps in these modalities. This can be useful where there is a lack of datasets in these further modalities to train the generator neural network on. The ID-generic UV maps may include one or more of: a scattering UV map; a translucency UV map; a specular albedo UV map; a roughness UV map; a detail normal UV map; and/or a detail weight map.

3D facial renders generated from the UV maps output by the generator neural network may be used to train a facial recognition neural network. A training set of 3D facial images is generated. The training set comprises a plurality of facial images for each of a plurality of facial identities. For example, the training set may comprise 10,000 identities, each with 50 facial images. The facial images associated with each identity are generated from a 3D render of a face generated using any of the methods described herein.

A plurality of 3D renders/models is randomly synthesized from the proposed shape and texture generation models described herein. For each identity, a plurality of facial images, each with random camera and illumination parameters is generated. For example, a Gaussian distribution of the 300W-LP dataset may be used to generate the random facial images. The plurality of facial images generated in this way form a generated training set over a plurality of facial identities with a plurality of poses. This effectively provides a training set with a wider range of facial poses than is typically available from real-world collected data.

In some embodiments, the generated training set may be augmented with real-world facial images to create an augmented training set.

The training dataset (whether generated or augmented) can be used to train a pose invariant facial recognition neural network. The facial recognition neural network may comprise an embedding network (e.g. a ResNet, such as ResNet 50) . The embedding network may comprise one or more convolutional layers. The facial recognition neural network may further comprise a BN-Dropout-FC-BN structure.

The facial recognition neural network may be applied to the plurality facial images from a training dataset to generate a set of features for each of the facial images. This generated set of features may be compared to a corresponding known set of features for that image in order to update the parameters of the facial recognition neural network. For example, an objective/loss function may be used to compare the generated feature embedding to the corresponding known feature embedding, and an optimisation procedure applied to the objective/loss function in order to determine the parameter updates.

Figure 8 shows a schematic example of a system/apparatus for performing any of the methods described herein. The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system.

The apparatus (or system) 800 comprises one or more processors 802. The one or more processors control operation of other components of the system/apparatus 800. The one or more processors 802 may, for example, comprise a general purpose processor. The one or more processors 802 may be a single core device or a multiple core device. The one or more processors 802 may comprise a Central Processing Unit (CPU) or a graphical processing unit (GPU) . Alternatively, the one or more processors 802 may comprise specialised processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be included.

The system/apparatus comprises a working or volatile memory 804. The one or more processors may access the volatile memory 804 in order to process data and may control the storage of data in memory. The volatile memory 804 may comprise RAM of any type, for example Static RAM (SRAM) , Dynamic RAM (DRAM) , or it may comprise Flash memory, such as an SD-Card.

The system/apparatus comprises a non-volatile memory 806. The non-volatile memory 806 stores a set of operation instructions 808 for controlling the operation of the processors 802 in the form of computer readable instructions. The non-volatile memory 806 may be a memory of any kind such as a Read Only Memory (ROM) , a Flash memory or a magnetic drive memory.

The one or more processors 802 are configured to execute operating instructions 808 to cause the system/apparatus to perform any of the methods described herein. The operating instructions 808 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 800, as well as code relating to the basic operation of the system/apparatus 800. Generally speaking, the one or more processors 802 execute one or more instructions of the operating instructions 808, which are stored permanently or semi-permanently in the non-volatile memory 806, using the volatile memory 804 to temporarily store data generated during execution of said operating instructions 808.

Implementations of the methods described herein may be realised as in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits) , computer hardware, firmware, software, and/or combinations thereof. These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 7, cause the computer to perform one or more of the methods described herein.

Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.

Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.

Although several embodiments have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims.

Claims

A computer implemented method of generating three-dimensional facial data using a generator neural network, the method comprising:

inputting, into the generator neural network, initialization data, the initialization data comprising noise data;

processing the initialization data through a plurality of neural network layers of the generator neural network to generate UV maps in a plurality of modalities, the UV maps comprising a shape UV map of a face, a texture UV map of the face, and a normal UV map of the face; and

outputting, from the generator neural network, facial data comprising the UV maps in the plurality of modalities,

wherein the generator neural network comprises:

an initial set of neural network layers configured to generate a plurality of feature maps from the initialization data;

a first branch of neural network layers configured to generate the shape UV map of the face from one or more of the plurality of feature maps;

a second branch of neural network layers configured to generate the texture UV map of the face from one or more of the plurality of feature maps; and

a third branch of neural network layers configured to generate the normal UV map of the face from one or more of the plurality of feature maps.
The method of claim 1, wherein the generator neural network comprises one or more further neural network branches, each further neural network branch configured to generate a UV map of the face having a further modality.
The method of claim 2, wherein the further modalities are one or more of: a cavity UV map of the face; a gloss UV map of the face; or a scatter UV map of the face; a specular albedo UV map of the face; a detail normal UV map of the face; a translucency UV map of the face; a roughness UV map of the face; and/or a detail weight UV map of the face.
The method of any preceding claim, wherein the initialization data further comprises one or more emotional parameters, and wherein the facial data has an expression corresponding to the expression parameters.
The method of any preceding claim, wherein the neural network layers of the generator neural network comprise one or more convolutional layers.
The method of any preceding claim, wherein the neural network layers of the generator neural network comprise neural network comprises one or more upscaling layers.
The method of any preceding claim, wherein the noise data comprises Gaussian noise.
The method of any preceding claim, wherein the method further comprises generating a three dimensional model using the shape UV map of the face, the texture UV map of the face, and the normal UV map of the face, the three dimensional model comprising the face.
The method of claim 8, wherein generating a three dimensional model comprises applying an identity generic rendering map to the face.
The method of any of claims 8 or 9, wherein the three dimensional model comprises a full head model.
A computer implemented method of training a generator neural network to generate three-dimensional facial data, the method comprising:

generating facial data using the method of any of claims 1-7;

inputting the generated facial data into a discriminator neural network;

processing the generated facial data through a plurality of neural network layers of the discriminator neural network to generate a first realism score;

inputting training facial data into the discriminator neural network, the training facial data comprising a shape UV map of a facial scan, a texture UV map of the facial scan, and a normal UV map of the facial scan;

processing the training facial data through a plurality of neural network layers of the discriminator neural network to generate a second realism score;

updating parameters of the generator neural network in dependence on the first realism score;

updating parameters of the discriminator neural network in dependence on the first realism score and the second realism score; and

iterating the method until a threshold condition is met,

wherein the discriminator neural network comprises:

a plurality of input branches configured to generate a plurality of feature maps from the facial data, each input branch receiving a UV map in one of the plurality of modalities and comprising a plurality of neural network layers;

a combined set of neural network layers configured to jointly process the features maps from the first input branch, second input brand and third input branch to generate a realism score.
The method of claim 11, wherein updating parameters of the generator neural network in dependence on the first realism score and/or updating parameters of the discriminator neural network in dependence on the first realism score and the second realism score is performed using one or more loss functions.
The method of claim 12, wherein the one or more loss functions comprises a WGAN-GP Wasserstein loss function.
The method of any of claims 11-13, wherein the discriminator neural network generates one or more predicted emotional labels from an input shape UV map, a texture UV map and normal UV map, and wherein:

updating parameters of the generator neural network is further in dependence on a comparison of the predicted emotional label to a known emotional label used by the generator neural network; and/or

updating parameters of the discriminator neural network is further in dependence on a comparison of the predicted emotional label to a known emotional label of the training facial data.
The method of claim 14, wherein the known emotional label of the training facial data is determined from an expression recognition neural network.
The method of any preceding claim, wherein the neural network layers of the discriminator neural network comprise one or more convolutional layers.
The method of any of claims 11-15, wherein the neural network layers of the discriminator neural network comprise neural network comprises one or more downsampling layers.
A computer implemented method of training a facial recognition neural network, the method comprising:

applying the facial recognition neural network to a plurality facial images from a training dataset to generate a set of features for each of the three-dimensional facial images;

updating parameters of the facial recognition neural network in dependence on a comparison of the sets of features generated for the three-dimensional facial images to corresponding known sets of features for the three-dimensional facial images,

wherein the plurality of facial images comprises:

a first plurality of facial images, each generated from a three-dimensional model that has been generated according to any of claims 8-10; and

a second plurality of facial images generated captured from real world-images.
Apparatus comprising one or more processors and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, causes the apparatus to perform the method of any preceding claim.
A computer program product comprising computer readable code that, when executed by a computer, causes the computer to perform the method of any of claims 1-18.