GB2581374A - 3D Face reconstruction system and method - Google Patents
- Publication number
- GB2581374A (application GB1902067.6A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- facial
- image
- parameters
- reconstruction
- rendered
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/10—Geometric effects
- G06T15/20—Perspective computation
- G06T15/205—Image-based rendering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/04—Texture mapping
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2215/00—Indexing scheme for image rendering
- G06T2215/16—Using real world measurements to influence rendering
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Graphics (AREA)
- General Physics & Mathematics (AREA)
- Geometry (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Image Analysis (AREA)
- Image Generation (AREA)
Abstract
A computer implemented method 500 for generating a 3D facial reconstruction based on an input facial image is presented. The method comprises generating an initial reconstruction 510 and iteratively updating 520 the reconstruction. The initial reconstruction is generated by generating 512 a facial mesh, using a shape model, and a texture map 514. The reconstruction is iteratively updated by using a differentiable renderer to project 522 the 3D reconstruction onto a 2D plane to form a rendered facial image, calculating 524 a loss function based on a comparison of the rendered facial image and the input facial image, and using gradient descent to generate 526 an updated facial reconstruction. The facial mesh may use an expression model (figure 1, 102) and the projection may be affected by illumination (figure 1, 103) and camera (figure 1, 104) parameters. The generation of the reconstruction may comprise the use of a generative adversarial network (GAN). The loss function calculation may use a comparison of the presence and/or position of facial landmarks between the input and rendered images. The loss function calculation may use the pixel-to-pixel dissimilarity (figure 3, 320) between the images. There may be more than one input image.
Description
Intellectual Property Office Application No. GB1902067.6 RTM Date: 31 October 2019. The following term is a registered trade mark and should be read as such wherever it occurs in this document: TensorFlow (pages 7, 11, 13, and 24). Intellectual Property Office is an operating name of the Patent Office (www.gov.uk/ipo).
3D Face Reconstruction System and Method
Field of the Invention
The present invention relates to 3D face reconstruction, and in particular to systems and methods for reconstructing 3D faces from one or more images of a face using a neural network system.
Background
3D face reconstruction relates to reconstructing shape and texture of a face from one or more images of the face. 3D face reconstruction has wide ranging applications including, but not limited to, rendering the face with different expressions and illumination, and from different camera angles and positions; enabling faces to be represented in 3D media, such as 3D video games and animations; and compact representation of facial characteristics.
Summary
According to an embodiment, a computer implemented method for generating a 3D facial reconstruction based on an input facial image is provided, the method including generating an initial 3D facial reconstruction by: generating, using a shape model, a facial mesh based on a plurality of shape parameters; and generating a texture map based on a plurality of texture parameters; iteratively updating the 3D facial reconstruction by: projecting, using a differentiable renderer, the 3D facial reconstruction onto a 2D plane to form a rendered facial image; calculating a loss function based on a comparison of the rendered facial image and the input facial image; and generating an updated 3D facial reconstruction by updating the shape parameters and texture parameters based on the calculated loss function using gradient descent.
Generating the updated 3D facial reconstruction may include updating both the shape parameters and texture parameters in a single iteration.
Generating the facial mesh may use an expression model and may be based on a plurality of expression parameters.
Generating the texture map may use a generator network of a generative adversarial network.
The generative adversarial network may be a progressive growing generative adversarial network.
Projecting the 3D facial reconstruction onto a 2D plane may be based on a plurality of illumination parameters and camera parameters.
Generating the updated 3D facial reconstruction may include updating the camera parameters and illumination parameters.
The camera parameters may define a camera position and a focal length. The illumination parameters may define a direct light source colour, a direct light source position and an ambient lighting colour.
The method may include performing landmark detection on the input facial image using a landmark detection network to output a plurality of facial landmark locations; and performing landmark detection on the rendered facial image using the landmark detection network to output a plurality of facial landmark locations.
Calculating the loss function may include calculating a landmark loss function based on a comparison of the facial landmark locations of the input facial image and the facial landmark locations of the rendered facial image.
The method may include, prior to generating the initial 3D facial reconstruction, performing landmark detection on the input facial image using a landmark detection network to output a plurality of facial landmark locations; generating a prior 3D facial reconstruction based on shape parameters; projecting, using the differentiable renderer, the prior 3D facial reconstruction onto a 2D plane to form a prior rendered facial image based on camera parameters; performing landmark detection on the prior rendered facial image using a landmark detection network to output a plurality of facial landmark locations; comparing the facial landmark locations of the input facial image and the facial landmark locations of the prior rendered facial image; and optimising the shape and camera parameters to align a facial shape and facial pose of the prior rendered facial image with the input facial image based on the comparison of the facial landmark locations.
Calculating the loss function may include calculating a pixel loss function based on a comparison of pixel values for the input facial image and pixel values for the rendered facial image.
The method may include performing facial recognition using a facial recognition network on the input facial image and rendered facial image.
Calculating the loss function may include calculating an identity loss function based on a comparison of the facial recognition network output for the input facial image and the facial recognition network output for the rendered facial image.
Calculating the loss function may include calculating a content loss function based on a comparison of at least one intermediate layer output of the facial recognition network output for the input facial image and a corresponding intermediate layer output of the facial recognition network output for the rendered facial image.
The method may include forming a rendered sampled facial image by projecting the 3D facial reconstruction onto a 2D plane with a random camera positioning, a random expression and random lighting; and performing facial recognition using the facial recognition network on the rendered sampled facial image.
Calculating the loss function may include calculating an additional identity loss function based on a comparison of the facial recognition network output for the input facial image and the facial recognition network output for the rendered sampled facial image.
The method may include receiving at least one additional input facial image; generating an initial 3D facial reconstruction for the at least one additional input facial image; generating a combined 3D facial reconstruction by averaging shape parameters and texture parameters corresponding to the input facial image and the at least one additional input facial image; iteratively updating the 3D facial reconstruction for the at least one additional input facial image; and updating the combined 3D facial reconstruction based on the updated 3D facial reconstruction for the at least one additional input facial image and the updated 3D facial reconstruction for the input facial image.
According to another aspect, a computer-readable storage medium is provided according to claim 15.
According to another aspect, a data processing apparatus is provided according to claim 16.
Brief Description of the Drawings
Certain embodiments of the present invention will now be described, by way of example, with reference to the following figures.
Figure 1 is a schematic block diagram illustrating an example of a system for generating a face image;
Figure 2 is a schematic block diagram illustrating an example of a system for fitting parameters based on an input face image;
Figure 3 is a schematic block diagram illustrating an example of a system for calculating a loss function based on a comparison of an input face image and one or more rendered face images;
Figure 4 is a schematic block diagram illustrating an example of the progressive growing of a generative adversarial network for generating textures;
Figure 5 is a flow diagram of an example method for generating a 3D facial reconstruction; and
Figure 6 is a flow diagram of an example method for initially aligning a 3D facial reconstruction with an input facial image.
Detailed Description
Example implementations provide system(s) and method(s) for improved 3D face reconstruction from one or more images. 3D face reconstruction has wide ranging applications including, but not limited to, rendering the face with different expressions and illumination, and from different camera angles and positions; enabling faces to be represented in 3D media, such as 3D video games and animations; and compact representation of facial characteristics. Therefore, improving 3D face reconstruction may improve systems directed at these applications. For example, 3D video games using the system(s) and method(s) described herein may be able to more realistically represent players' faces than systems and methods in the state of the art.
Improved 3D face reconstruction may be achieved by various example implementations comprising a 3D facial reconstruction method, a loss calculation system and/or a texture generation neural network.
The 3D face reconstruction method is a method for deriving parameters for 3D face reconstruction based on a comparison of an input image and a rendered image. The method includes an initial 3D facial reconstruction component and an iterative update component.
In the initial 3D facial reconstruction component, a facial mesh is generated based on shape parameters, and a texture map is generated based on texture parameters. The facial mesh may include a number of vertices which represent the shape of the face. The shape parameters may be a latent representation of the face shape. A shape generator may generate the facial mesh from the shape parameters. The texture map may be an image in the UV space for mapping onto the facial mesh. The texture parameters may be a latent representation of the texture map. A texture generator may generate the texture map from the texture parameters.
In the iterative update component, the 3D facial reconstruction is projected onto a 2D plane using a differentiable renderer to produce a rendered image; a loss function is calculated based on a comparison of a rendered image and an input image; an updated 3D facial reconstruction is generated by updating the shape parameters and the texture parameters based on the loss function. The projection of the 3D facial reconstruction onto a 2D plane may use lighting and camera parameters. Said lighting and camera parameters may indicate to the differentiable renderer the lighting conditions for and the pose of the projected facial reconstruction. The loss function may be calculated using the loss calculation system. The shape parameters and texture parameters may be updated by backpropagating the loss function through the loss calculation system, differentiable renderer, shape generator and texture generator to derive one or more gradients. Said gradients may be used to update the shape and texture parameters by gradient descent.
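As an illustration only, the iterative update can be sketched as a gradient-descent loop in which the loss is backpropagated through the loss calculation, the differentiable renderer and the generators to the latent parameters. The sketch below assumes PyTorch and hypothetical callables (shape_gen, texture_gen, renderer, loss_fn) standing in for the components described here; it is not the patented implementation.

```python
# Minimal sketch (assumed components, not the patented implementation):
# fit shape and texture parameters by gradient descent through a
# differentiable rendering and loss pipeline.
import torch

def fit_reconstruction(input_image, shape_gen, texture_gen, renderer, loss_fn,
                       n_shape=100, n_texture=100, steps=200, lr=0.05):
    # Latent parameter vectors to be optimised.
    p_shape = torch.zeros(n_shape, requires_grad=True)
    p_texture = torch.randn(n_texture, requires_grad=True)
    optimiser = torch.optim.Adam([p_shape, p_texture], lr=lr)

    for _ in range(steps):
        mesh = shape_gen(p_shape)            # 3D facial mesh from shape parameters
        texture = texture_gen(p_texture)     # UV texture map from texture parameters
        rendered = renderer(mesh, texture)   # projection onto a 2D image plane
        loss = loss_fn(rendered, input_image)

        optimiser.zero_grad()
        loss.backward()                      # backpropagate through renderer and generators
        optimiser.step()                     # gradient-descent update of the parameters

    return p_shape.detach(), p_texture.detach()
```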
The loss calculation system is a system for calculating a loss function based on an input face image and a rendered face image. The loss calculation system may include a face recognition network. The face recognition network may be used to derive respective embeddings for the input face image and the rendered face image. An 'identity loss' may be derived based on the respective embeddings for the input face image and the rendered face image. The embeddings for the rendered face image may be compared with the embeddings for the input face image in order to calculate the 'identity loss'.
Mid-level features of the face recognition network may be extracted for the input face image and the rendered face image. A 'content loss' may be derived based on the respective mid-level features for the input face image and the rendered face image. The mid-level features extracted for the rendered face image may be compared to the mid-level features extracted for the input face image to calculate the 'content loss'.
The loss calculation system may include a landmark detector. The landmark detector may be used to derive facial landmark positions for the input image and the rendered image. The difference between these positions may be used to derive a 'landmark loss'.
The pixels of the input image and the rendered image may be directly compared to derive a 'pixel loss'.
The calculated loss function may be a weighted sum of one or more of the identity loss, content loss, landmark loss and pixel loss.
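For illustration, the weighted combination might be computed as below; the individual loss terms are assumed to have been calculated already, and the weight values shown are arbitrary placeholders rather than values taken from this disclosure.

```python
# Sketch of combining the loss terms into a single weighted objective.
# The weights are illustrative placeholders, not values from the patent.
def combined_loss(identity_loss, content_loss, landmark_loss, pixel_loss,
                  w_id=1.0, w_con=1.0, w_lan=1.0, w_pix=1.0):
    return (w_id * identity_loss
            + w_con * content_loss
            + w_lan * landmark_loss
            + w_pix * pixel_loss)
```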
The texture generation neural network may be the generator component of a progressively grown generative adversarial network. The training data used to train the progressively grown generative adversarial network may be a number of UV texture maps for faces. During training, the generator and discriminator networks are initially configured to generate and process, respectively, small texture maps, e.g. 4x4 texture maps. Over the training process, layers may be progressively added to both the generator and discriminator networks to allow them to handle larger texture maps until the generator can generate texture maps of the desired size, e.g. 1024x1024 texture maps. The resulting generator network can generate high fidelity facial texture maps from a latent vector.
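A schematic sketch of the progressive-growing idea is given below, assuming PyTorch: the generator starts at a 4x4 output and doubles its resolution by appending an upsampling block at each stage, with a training stage (against a correspondingly grown discriminator) run between growth steps. The layer sizes and channel counts are assumptions for the example, not the trained network.

```python
# Schematic sketch of progressively growing a texture generator (assumptions
# only): start at 4x4 and append upsampling blocks until the target
# texture-map resolution is reached. Training between growth steps is omitted.
import torch.nn as nn

def grow_generator(latent_dim=512, channels=64, target_size=1024):
    blocks = [nn.Sequential(
        nn.ConvTranspose2d(latent_dim, channels, kernel_size=4),  # 1x1 latent -> 4x4 features
        nn.LeakyReLU(0.2))]
    size = 4
    while size < target_size:
        blocks.append(nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),          # double spatial resolution
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2)))
        size *= 2
        # a training stage with a correspondingly grown discriminator would run here
    to_rgb = nn.Conv2d(channels, 3, kernel_size=1)                # features -> RGB texture map
    return nn.Sequential(*blocks, to_rgb)
```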
Face Image Generation System
Referring to Figure 1, a system for generating a face image is shown.
The face image generation system 100 may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones. Where the face image generation system 100 is implemented on a plurality of suitable computing devices, the computing devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks. The face image generation system 100 may be implemented using a machine learning framework such as MXNet, TensorFlow, or PyTorch.
The face image generation system 100 includes a face image generator 110 which generates a rendered face image 120 based on parameters 101-105. The parameter sets 101-105 may include shape parameters 101, expression parameters 102, lighting parameters 103, camera parameters 104 and/or texture parameters 105. The parameter sets 101-105 may be represented as parameter vectors.
The shape parameters 101, the expression parameters 102 and the texture parameters 105 may be latent representations of the shape, expression and texture of the face respectively. The latent representations may be derived using a statistical inference method, e.g. principal component analysis (PCA), or by using a suitable neural network, e.g. a generator network of a generative adversarial network or a decoder network of an autoencoder.
The lighting parameters 103 may be a parameter vector indicating the position and/or colour of a direct light source. The lighting parameters 103 may further include a colour of an ambient light source. The position of the direct light source may be represented as Cartesian coordinates indicating the position of the direct light source with respect to the face. The direct light source colour and the ambient light source colour may each be represented in the parameter vector as RGB colour components.
The camera parameters 104 may be a parameter vector indicating the position of the camera, the position at which the camera is directed and the focal length. The position of the camera and the position at which the camera is directed may be represented as Cartesian coordinates indicating the respective position with respect to the face. The focal length may be represented as an integer, a floating point number or a binary-coded decimal number.
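Purely as an illustration, the lighting and camera parameter vectors might be packed as below; the field ordering and the example values are assumptions, not a layout specified in this disclosure.

```python
# Illustrative packing of the lighting and camera parameter vectors.
# Field ordering and values are assumptions for the example.
import numpy as np

# Lighting: direct light position (x, y, z), direct light colour (r, g, b),
# ambient light colour (r, g, b).
lighting_params = np.array([0.0, 0.5, 2.0,    # direct light position
                            1.0, 1.0, 1.0,    # direct light colour (RGB)
                            0.2, 0.2, 0.2])   # ambient light colour (RGB)

# Camera: camera position (x, y, z), position the camera is directed at
# (x, y, z), focal length.
camera_params = np.array([0.0, 0.0, 5.0,      # camera position
                          0.0, 0.0, 0.0,      # look-at position
                          1000.0])            # focal length
```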
The face image generator 110 includes a facial mesh generator 112, a texture generator 114 and a differentiable renderer 116.
The facial mesh generator 112 receives shape parameters 101 and may receive expression parameters 102. The parameters are used by the facial mesh generator 112 to generate a 3D facial mesh. The 3D facial mesh may be represented using N vertices, where each of the N vertices has Cartesian coordinates.
Where the shape parameters 101 are derived using principal component analysis, the 3D facial mesh may be generated by finding the product of each of the shape parameters and a respective basis vector, and summing these products with a mean shape vector. For example, a 3D facial mesh comprising N vertices may be represented as a vector:

s = [x_1, y_1, z_1, ..., x_N, y_N, z_N]^T

A shape parameter vector, p_s, may comprise principal components of variations in 3D facial shape. The 3D facial mesh may be calculated as:

s = m_s + U_s p_s

where m_s is a mean shape vector and U_s is a matrix of basis vectors corresponding to the principal components.
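A minimal sketch of this linear shape model, assuming the mean shape and basis matrix have already been learned, is:

```python
# Sketch of the linear (PCA) shape model s = m_s + U_s p_s. The mean shape
# m_s (length 3N) and basis U_s (3N x K) are assumed to be given.
import numpy as np

def generate_facial_mesh(p_s, m_s, U_s):
    """Return an N x 3 array of vertex coordinates from shape parameters p_s."""
    s = m_s + U_s @ p_s       # flattened vertex vector [x1, y1, z1, ..., xN, yN, zN]
    return s.reshape(-1, 3)   # one (x, y, z) row per vertex
```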
Where expression parameters 102 are also derived using principal component analysis and are used to generate the 3D facial mesh, the shape parameters 101 may be learned from facial scans displaying a neutral expression and represent identity variations. The expression parameters 102 may be learned from displacement vectors and represent expression variations. The 3D facial mesh may be generated by finding the product of each of the shape parameters and a respective basis vector and each of the expression parameters and a respective basis vector, and summing these products with a mean shape vector. For example, the 3D facial mesh may be calculated as:

s = m_s,e + U_s,e p_s,e

where m_s,e is a mean shape vector, U_s,e = [U_s, U_e] where U_s is a matrix of basis vectors for identity variations and U_e is a matrix of basis vectors for expression variations, and p_s,e = [p_s, p_e] where p_s are identity principal components and p_e are expression principal components.

Alternatively, the facial mesh generator 112 may use a neural network. For example, the facial mesh generator 112 may comprise a generator network of a generative adversarial network or a decoder network of an autoencoder. The shape parameters 101 may be the input to the network. The output of the network may be a vector representation of the coordinates of the vertices of a 3D facial mesh. Alternatively, the output of the network may be an intermediate representation of the 3D facial mesh, e.g. a UV map representing spatial deviations from a mean 3D facial mesh. The facial mesh generator 112 may perform additional processing to convert the intermediate representation into the coordinates of the vertices. The neural network may accept the expression parameters 102 as an input to the network in addition to the shape parameters, and use these in generating the vector representation or intermediate representation.

In some embodiments, the facial mesh generator 112 may include a shape generator for generating a neutral 3D facial mesh based on the shape parameters 101, and an expression generator for generating an expression representation, e.g. an expression displacement, based on the expression parameters. The shape generator may use a neural network. The expression generator may use a neural network or a linear model, such as PCA. The facial mesh generator 112 may output a 3D facial mesh by combining the neutral 3D facial mesh and expression representation using one or more linear operations or using a combiner neural network.

The texture generator 114 receives texture parameters 105. The texture parameters 105 are used by the texture generator 114 to generate a texture map for mapping onto a 3D facial mesh.

The texture parameters 105 may be derived using PCA. For example, the texture map may be generated by finding the product of each of the texture parameters and a respective basis vector, and summing these products with a mean texture vector. For example, a texture parameter vector, p_t, may comprise principal components of variations in facial texture, and a texture map, T(p_t), may be calculated as:

T(p_t) = m_t + U_t p_t

where m_t is a mean texture vector and U_t is a matrix of basis vectors corresponding to the principal components.

In some embodiments, the texture generator 114 may use a neural network, such as a generator network of a generative adversarial network or a decoder network of an autoencoder.
The texture generator may be the generator network of a progressively grown generative adversarial network such as generator network 410-N of Figure 4. The texture parameters 105 may be the input to the network. The output of the network may be a vector representation of the texture map, e.g. a vector comprising the RGB components for each pixel of the texture map, or a matrix representation of the texture map.

The differentiable renderer 116 receives the 3D facial mesh from the facial mesh generator 112, the texture map from the texture generator 114, the lighting parameters 103 and the camera parameters 104. These are used by the differentiable renderer 116 to produce a rendered face image 120.

Prior to rendering, the texture map from the texture generator 114 may be used to texture the 3D facial mesh from the facial mesh generator 112 to create a 3D textured facial mesh. The differentiable renderer 116 may render the 3D textured facial mesh, centred at the Cartesian origin, by projecting it onto a 2D image plane. The differentiable renderer 116 may use a pinhole camera model with the camera standing at, and directed towards, the positions indicated by the camera parameters 104, and with the focal length indicated by the camera parameters 104. The illumination of the rendering may be modelled by Phong shading given the positioning of the direct light source, the colour of the direct light source and the colour of the ambient lighting as indicated by the lighting parameters 103. The differentiable renderer 116 may use barycentric interpolation to interpolate the colour and normal attributes at each vertex at the corresponding pixels.

Gradients may be easily backpropagated through the differentiable renderer 116. Furthermore, the differentiable renderer 116 may be implemented using a machine learning framework, e.g. TensorFlow or MXNet, to facilitate backpropagation through the renderer. As gradients may be easily backpropagated through the differentiable renderer 116, the differentiable renderer 116 may be easily incorporated into methods and systems using gradient descent or another gradient-based optimisation method. Systems using gradient descent include, but are not limited to, neural network training systems and parameter fitting systems. Gradient descent may be significantly more computationally efficient than other optimisation methods.

The rendered face image 120 may be a vector or matrix representation of the rendered face image, e.g. a matrix or vector containing RGB components for each pixel of the rendered face image. Alternatively, the rendered face image 120 may be in a suitable image format, e.g. JPEG, PNG or TIFF.

Parameter Fitting System
Referring to Figure 2, a parameter fitting system 200 for fitting parameters, based on an input face image, for use in generating a 3D facial reconstruction is shown.

The parameter fitting system 200 may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones. Where the parameter fitting system 200 is implemented on a plurality of suitable computing devices, the computing devices may be configured to communicate with each other over a suitable network.
Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks. The parameter fitting system may be implemented using a suitable machine learning framework such as MXNet, TensorFlow, or PyTorch.

The parameter fitting system 200 includes the face image generator 110, as described in relation to Fig. 1; the loss calculator 230; and a parameter optimiser 240.

The face image generator 110 receives foundational parameters 202 and instance parameters 204. The parameters are used by the face image generator 110 to generate the rendered face image 224. The foundational parameters 202 are parameters describing the fixed properties of a face, e.g. the shape parameters 101 and the texture parameters 105. The instance parameters 204 are parameters describing changeable properties of a rendering of a 3D facial reconstruction, e.g. the expression parameters 102, the lighting parameters 103 and the camera parameters 104.

A parameter sampler 210 may be used to generate sampled instance parameters 206. The sampled instance parameters 206 may be (pseudo-)random parameters describing changeable properties of a face. The parameter sampler 210 may generate the sampled instance parameters 206 by sampling parameters based on one or more distributions. For example, where the sampled instance parameters 206 include expression, lighting and camera parameters, the expression parameters may be generated by sampling from a zero-mean normal distribution with a variance of 0.5; and the lighting and camera parameters may be generated by sampling from the Gaussian distribution of the 300W-3D dataset.

The face image generator 110 may receive the foundational parameters 202 and the sampled instance parameters 206, and use them to generate the sampled rendered face image. The face image generator 110 may comprise a differentiable renderer and be differentiable.

The loss calculator 230 may receive the input face image 222, the rendered face image 224, and, if used, the sampled rendered face image 226, and use them to calculate a loss. A loss function is calculated based on a comparison of a rendered image and an input image. The loss function may be calculated using a loss calculation system. The loss calculator 230 may be differentiable.

The parameter optimiser 240 receives the loss from the loss calculator 230. Based on the loss, the parameter optimiser may update the foundational parameters 202 and/or instance parameters 204 using gradient descent. The derivatives used to update the parameters by gradient descent may be derived by backpropagating the loss through the loss calculator 230 and the face image generator 110. The parameter optimiser 240 may use a suitable optimisation algorithm to update the parameters, e.g. the Adam optimisation algorithm.

By using a differentiable loss calculator and a differentiable face image generator, derivatives of the loss function relative to the instance and foundational parameters may be computed efficiently using backpropagation. The derivatives may be used to update the foundational parameters and instance parameters by gradient descent. Gradient descent locates values of the parameters which minimise the loss function in a computationally efficient manner.

Loss Calculation System
Referring to Figure 3, a system 300 for calculating a loss function based on a comparison of an input face image and one or more rendered face images is shown.
The loss calculation system 300 may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones. Where the loss calculation system 300 is implemented on a plurality of suitable computing devices, the computing devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks. The loss calculation system may be implemented using a suitable machine learning framework such as MXNet, TensorFlow, or PyTorch.

The loss calculation system includes a loss calculator 230 which calculates a loss based on an input face image 222, a rendered face image 224 and, optionally, a rendered sampled face image 226. The input face image 222 is an image of a face from which a 3D face reconstruction is to be derived. The rendered face image 224 is a rendering of a 3D face reconstruction by a face image generator derived using parameters being optimised to match the input face image 222. The rendered sampled face image 226 is a rendering of a 3D face reconstruction by a face image generator derived using shape and texture parameters being optimised to match the input face image 222 and other (pseudo-)randomly sampled parameters, e.g. lighting, camera and/or expression parameters.

The loss calculator 230 includes a landmark detector 310, a landmark loss calculator 316, a pixel loss calculator 320, a face recognition network 330, a content loss calculator 350, an identity loss calculator 370 and a loss combiner 380.

The landmark detector 310 receives an image and processes the received image to detect landmark locations in that image. The landmark detector 310 may be a deep face alignment network, e.g. a cascade multi-view hourglass model. The detected landmark locations may be indicated by a matrix containing the 2D positions of each of a number of landmarks in the image. The detected landmark locations output by the landmark detector 310 for an image I may be denoted as M(I). The landmark detector 310 processes the input face image 222, I^0, to obtain input face landmark locations 312, M(I^0), and processes the rendered face image 224, I^R, to obtain rendered face landmark locations 314, M(I^R).

The landmark loss calculator 316 receives the input face landmark locations 312, M(I^0), and the rendered face landmark locations 314, M(I^R), and compares them to compute a landmark loss. The landmark loss, L_lan, may be the Euclidean distance between the input face landmark locations 312 and the rendered face landmark locations 314, i.e.:

L_lan = ||M(I^0) - M(I^R)||_2

The landmark loss is sensitive to misalignment between the input face image 222 and the rendered face image 224. The landmark loss may be particularly useful in optimising the camera parameters used to render the rendered face image.

The pixel loss calculator 320 receives the input face image 222 and the rendered face image 224 and uses them to calculate a pixel loss. The pixel loss may indicate the distance between the components of each of the corresponding pixels of the input face image 222, I^0, and the rendered face image 224, I^R.
For example, the pixel loss may be the pixel-level l1 loss function:

L_pix = ||I^0 - I^R||_1

which, for RGB colour images, may be calculated as:

L_pix = Σ_{x,y} ( |I^0_{x,y,r} - I^R_{x,y,r}| + |I^0_{x,y,g} - I^R_{x,y,g}| + |I^0_{x,y,b} - I^R_{x,y,b}| )

where I_{x,y,r}, I_{x,y,g} and I_{x,y,b} are the red, green and blue components respectively of the pixel at position (x, y) of an image I. The pixel loss may be particularly useful in optimising the lighting parameters used to render the rendered face image. The pixel loss may be of a higher resolution than one or more of the other losses and may be particularly useful in optimising the shape and texture parameters used to generate the 3D face reconstruction such that the 3D face reconstruction reflects high resolution details of the input face image.

The face recognition network 330 receives an image as an input and processes the received image to produce an embedding indicating identity related features of that image. As part of the processing of the received image, the face recognition network 330 may also generate intermediate representations of the image. Such intermediate representations may be robust to pixel-level deformations but, unlike the embedding, are not so abstract that they miss non-identity related details, e.g. variations that depend on age. The intermediate representations may be activations of one or more hidden layers of the face recognition network 330.

The face recognition network 330 may be a convolutional neural network. The convolutional neural network may be a high capacity deep convolutional neural network, e.g. a variant of the ResNet architecture, trained using an additive angular margin (ArcFace) loss function.

The activations of the face recognition network 330 at a layer j for an image I may be denoted as F^j(I). The face recognition network 330 processes the input face image 222, I^0, to produce the activations, F^j(I^0), at the jth layer. The input face intermediate representation 342 comprises these activations for one or more layers of the network. The face recognition network 330 processes the rendered face image 224, I^R, to produce the activations, F^j(I^R), at the jth layer. The rendered face intermediate representation 344 comprises these activations for one or more layers of the network.

The content loss calculator 350 receives the input face intermediate representation 342 and the rendered face intermediate representation 344, and compares them to compute a content loss. The content loss, L_con, may be the normalized Euclidean distance between the intermediate activations for the input face image 222 and the rendered face image 224, i.e.:

L_con = Σ_{j=k}^{l} ||F^j(I^0) - F^j(I^R)||_2 / (H_F^j × W_F^j × C_F^j)

where the activations of each of the layers from the kth layer to the lth layer are being used to calculate the content loss. H_F^j, W_F^j and C_F^j denote the height, width and number of channels, respectively, of layer j of the face recognition network 330.

The content loss may be robust to pixel-level deformations while also reflecting non-identity related differences, e.g. variations that depend on age. The content loss may be particularly useful in optimising the shape and texture parameters used to generate the 3D face reconstruction such that the 3D face reconstruction reflects these non-identity related differences.

Where the face recognition network 330 has n layers, the output of the nth layer of the network, i.e. the embedding indicating identity related features, for an image I may be denoted as F^n(I).
The face recognition network 330 processes the input face image 222, I^0, to produce the input face embedding 362, F^n(I^0), and processes the rendered face image 224, I^R, to produce the rendered face embedding 364, F^n(I^R). Where a rendered sampled face image 226, Î^R, is used, it is also processed by the face recognition network 330 to produce a rendered sampled face embedding 366, F^n(Î^R).

The identity loss calculator 370 receives the input face embedding 362 and the rendered face embedding 364, and compares them to calculate an identity loss. The identity loss, L_id, may be the cosine distance between the input face embedding 362, F^n(I^0), and the rendered face embedding 364, F^n(I^R), i.e.:

L_id = 1 - (F^n(I^0) · F^n(I^R)) / (||F^n(I^0)||_2 ||F^n(I^R)||_2)

The identity loss may measure the extent to which identity-related features of the rendered face image differ from those of the input face image. The identity loss may be particularly useful in optimising the shape and texture parameters used to generate the 3D face reconstruction such that the 3D face reconstruction is recognised as being the face of the individual captured in the input face image.

The identity loss calculator 370 may also receive the input face embedding 362 and the rendered sampled face embedding 366, and compare them to calculate an additional identity loss. The additional identity loss, L_id', may be the cosine distance between the input face embedding 362, F^n(I^0), and the rendered sampled face embedding 366, F^n(Î^R), i.e.:

L_id' = 1 - (F^n(I^0) · F^n(Î^R)) / (||F^n(I^0)||_2 ||F^n(Î^R)||_2)

The additional identity loss may measure the extent to which identity-related features of the rendered sampled face image differ from those of the input face image. The additional identity loss may be particularly useful in optimising the shape and texture parameters used to generate the 3D face reconstruction such that the 3D face reconstruction resembles the face of the individual captured in the input face image under a range of different conditions, e.g. with different expressions, poses and/or lighting.

The loss combiner 380 calculates an overall loss, L, based on any of the landmark loss, the pixel loss, the content loss and the identity loss. The overall loss may also be based on the additional identity loss and/or a regularization term for regularizing at least some of the parameters used to generate the rendered face image 224. The overall loss, L, may be a weighted sum, e.g.:

L = λ_id L_id + λ_id' L_id' + λ_con L_con + λ_pix L_pix + λ_reg Reg({p_s,e, p_l})

where the λ terms are weighting parameters and Reg({p_s,e, p_l}) is a regularization term for regularizing the shape, expression and lighting parameters.

By using a loss calculation system incorporating a face recognition network, a combined loss function reflecting identity related features and other facial characteristics may be calculated which is robust to pixel level distortions. The quality of 3D face reconstructions produced by systems utilising such loss calculation systems may be enhanced.

Using a differentiable face image generator, such as that described in relation to Figures 1 and 2, may facilitate the use of such a loss calculation system. Where the face image generator and the loss calculation system are differentiable, derivatives of the loss function relative to the parameters used to generate a 3D facial reconstruction and render an associated face image may be efficiently computed. The derivatives may then be used to update the parameters by gradient descent.
Gradient descent locates values of the parameters which minimise the loss function in a computationally efficient manner.

Progressive Growing of Texture Generative Adversarial Network
Referring to Figure 4, a diagram 400 illustrating the progressive growing of a generative adversarial network for generating textures is presented.

A texture generator, e.g. the texture generator 114, may be the generator network of a trained generative adversarial network. A number of parameters, e.g. a latent vector, may be the input to the generator network. The output of the generator network may be a vector representation of the texture map, e.g. a vector comprising the RGB components for each pixel of the texture map, or a matrix representation of the texture map. The generator network may have been trained using a plurality of UV texture maps for a facial mesh. Using the generator network of a generative adversarial network to generate texture maps may be advantageous as the generated textures may preserve high-frequency details. The generator network may also be able to generate realistic UV texture maps resembling faces differing substantially from those used for training, e.g. the generator may generalise well to unseen data.

The use of the generator network for texture generation may be facilitated by using a differentiable renderer and a differentiable loss calculation system. Where the renderer and the loss calculation system are differentiable, derivatives of the loss function relative to the input parameters for the generator network, e.g. the latent vector, may be efficiently computed. The derivatives may then be used to update the parameters by gradient descent. Gradient descent locates values of the input parameters which minimise the loss function in a computationally efficient manner.

While a progressively grown generative adversarial network is presented, other forms of generative adversarial network may be used and achieve the stated advantages.

The growing of the generative adversarial network begins with a generator 410-1 having a small convolutional layer 412-1, e.g. a 4x4 convolutional layer, and a discriminator 430-1 having a small convolutional layer 432-1, e.g. a 4x4 convolutional layer.

The generator 410-1 receives a latent vector 402-1 and processes it using the small single convolutional layer 412-1 to produce a generator output 422-1. The latent vector 402-1 may be a (pseudo-)randomly generated vector of a fixed size. The generator output 422-1 may be a small image, e.g. a 4x4 image.

The discriminator 430-1 alternately receives a generator output 422-1 and a downscaled texture map 424-1. The downscaled texture map may be a downscaling of a larger texture map, e.g. a 512x512 texture map, where the larger texture map is a UV texture map for a facial mesh. The discriminator 430-1 processes the input using the small convolutional layer 432-1 to produce an output, where the output of the discriminator is a number indicating the probability, according to the discriminator, that the input is a downscaled texture map, rather than a generator output.

A loss function may be calculated based on the correctness of this probability. A high loss may be calculated when the input is a downscaled texture map and the output indicates a low (or zero) probability of the input being a downscaled texture map, and when the input is a generator output and the output indicates a high probability of the input being a downscaled texture map.
A low loss may be calculated when the input is a generator output and the output indicates a low (or zero) probability of the input being a downscaled texture map, and when the input is a downscaled texture map and the output indicates a high probability of the input being a downscaled texture map.

The calculated loss is used to update the parameters, i.e. weights and/or biases, of the generator 410-1 and the discriminator 430-1. The above steps of receiving a latent vector 402-1, generating a generator output 422-1, processing a generator output 422-1 and a downscaled texture map 424-1 using the discriminator, calculating a loss, and updating the network constitute a training iteration of the network. Training iterations may be repeated until a maximum iteration threshold is reached and/or one or more criteria are satisfied. The criteria may indicate that no significant improvement is likely with further training and/or that the discriminator is able to adequately discriminate between downscaled texture maps and generator outputs. A suitable criterion for indicating that no significant improvement is likely with further training may be the value of the loss function not decreasing or only decreasing by less than a threshold amount over a given number of iterations. A suitable criterion for indicating that the discriminator is able to adequately discriminate between downscaled texture maps and generator outputs may be the loss being below a desired value. On completion of training, the generator 410-1 should be able to generate outputs from a latent vector which resemble downscaled texture maps, e.g. 4x4 images resembling downscaled texture maps.

The generator and discriminator may be grown by adding a layer to each. For example, a larger convolutional layer 412-2, e.g. an 8x8 layer, is added to the generator 410-2. The generator output 422-2 produced by the grown generator 410-2 is larger than the generator output 422-1 produced by the pre-growth generator 410-1. For example, the grown generator output 422-2 may be an 8x8 image while the pre-growth generator output 422-1 may have been a 4x4 image.

A larger convolutional layer 432-2, e.g. an 8x8 layer, may be added to the discriminator 430-2. The grown discriminator is capable of processing the larger generator outputs 422-2 and correspondingly larger texture maps 424-2. The grown generator 410-2 and discriminator 430-2 are then trained according to the process previously described. On completion of training, the grown generator 410-2 should be able to generate outputs from a latent vector which resemble the larger downscaled texture maps, e.g. 8x8 images resembling downscaled texture maps.

The generator and discriminator can be progressively grown and trained, as described, until they reach a desired size. The generator 410-N and discriminator 430-N of the desired size are trained using texture maps 424-N, e.g. 512x512 texture maps. On completion of training, the generator 410-N should be able to generate outputs from a latent vector which resemble the texture maps, e.g. 512x512 images resembling texture maps.

3D Facial Reconstruction Method
Figure 5 is a flow diagram illustrating an example method 500 for generating a 3D facial reconstruction based on an input facial image. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the parameter fitting system 200.
In submethod 510, an initial 3D facial reconstruction is generated. The initial generation submethod 510 includes a facial mesh generation step 512, and a texture map generation step 514.

In the facial mesh generation step 512, a facial mesh is generated based on shape parameters. The facial mesh may include a number of vertices which represent the shape of the face. The shape parameters may be a latent representation of the face shape. The generation of the facial mesh may also be based on expression parameters. The facial mesh may be a 3D facial mesh represented using N vertices, where each of the N vertices has Cartesian coordinates.

Where the shape parameters have been derived using principal component analysis, the facial mesh may be generated by finding the product of each of the shape parameters and a respective basis vector, and summing these products with a mean shape vector. For example, a 3D facial mesh comprising N vertices may be represented as a vector:

s = [x_1, y_1, z_1, ..., x_N, y_N, z_N]^T

A shape parameter vector, p_s, may comprise principal components of variations in 3D facial shape. The 3D facial mesh may then be calculated as:

s = m_s + U_s p_s

where m_s is a mean shape vector and U_s is a matrix of basis vectors corresponding to the principal components.

Where expression parameters have also been derived using principal component analysis and are used to generate the facial mesh, the shape parameters may be learned from facial scans displaying a neutral expression and represent identity variations, and the expression parameters may be learned from displacement vectors and represent expression variations. The facial mesh may be generated by finding the product of each of the shape parameters and a respective basis vector and each of the expression parameters and a respective basis vector, and summing these products with a mean shape vector. For example, the facial mesh may be calculated as:

s = m_s,e + U_s,e p_s,e

where m_s,e is a mean shape vector, U_s,e = [U_s, U_e] where U_s is a matrix of basis vectors for identity variations and U_e is a matrix of basis vectors for expression variations, and p_s,e = [p_s, p_e] where p_s are identity principal components and p_e are expression principal components.
Alternatively, the facial mesh may be generated using a neural network. The facial mesh may be generated using a generator network of a generative adversarial network or a decoder network of an autoencoder. The shape parameters may be the input to the network. The output of the network may be a vector representation of the coordinates of the vertices of a 3D facial mesh. Alternatively, the output of the network may be an intermediate representation of the 3D facial mesh, e.g. a UV map representing spatial deviations from a mean 3D facial mesh, with additional processing being performed to convert the intermediate representation into the coordinates of the vertices. The neural network may accept the expression parameters as an input to the network in addition to the shape parameters, and use these in generating the vector representation or intermediate representation.
In some embodiments, the facial mesh may be generated by generating a neutral 3D facial mesh based on the shape parameters and an expression representation, e.g. an expression displacement, based on the expression parameters, and combining the results. The neutral 3D facial mesh may be generated using a neural network. The expression representation may be generated using a neural network or a linear model, such as PCA. The neutral 3D facial mesh and the expression representation may be combined using one or more linear operations or using a combiner neural network.
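Purely as an illustration of a neural-network mesh generator of the kind alluded to above, a small decoder could look like the following; the architecture, layer sizes and vertex count are assumptions, not the network used in this disclosure.

```python
# Illustrative decoder-style mesh generator (an assumption, not the patented
# network): a small MLP maps shape and expression parameters to flattened
# vertex coordinates of an N-vertex mesh.
import torch
import torch.nn as nn

class MeshDecoder(nn.Module):
    def __init__(self, n_shape=100, n_expr=29, n_vertices=5000, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_shape + n_expr, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_vertices * 3),   # x, y, z per vertex
        )

    def forward(self, p_shape, p_expr):
        params = torch.cat([p_shape, p_expr], dim=-1)
        return self.net(params).reshape(-1, 3)   # N x 3 vertex coordinates
```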
In the texture generation step 514, a texture map for mapping onto the facial mesh is generated based on texture parameters.
The texture map may be generated using a neural network. The texture map may be generated using a generator network of a generative adversarial network or a decoder network of an autoencoder. In particular, the texture may be generated using the generator network of a progressively grown generative adversarial network such as generator network 410-N of Figure 4. The texture parameters may be the input to the network. The output of the network may be a vector representation of the texture map, e.g. a vector comprising the RGB components for each pixel of the texture map, or a matrix representation of the texture map.
Alternatively, where the texture parameters have been derived using PCA, the texture map may be generated by finding the product of each of the texture parameters and a respective basis vector, and summing these products with a mean texture vector. For example, a texture parameter vector, p_t, may comprise principal components of variations in facial texture, and a texture map, T(p_t), may be calculated as:

T(p_t) = m_t + U_t p_t

where m_t is a mean texture vector and U_t is a matrix of basis vectors corresponding to the principal components.
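A minimal sketch of this linear texture model, assuming the mean texture and basis have already been learned from UV texture maps, is:

```python
# Sketch of the linear (PCA) texture model T(p_t) = m_t + U_t p_t. The mean
# texture m_t (length H*W*3) and basis U_t (H*W*3 x K) are assumed given.
import numpy as np

def generate_texture_map(p_t, m_t, U_t, height=512, width=512):
    """Return an H x W x 3 UV texture map from texture parameters p_t."""
    t = m_t + U_t @ p_t                  # flattened RGB texture vector
    return t.reshape(height, width, 3)   # image in UV space
```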
In submethod 520, the 3D facial reconstruction is iteratively updated. The iterative update submethod 520 includes a 3D facial reconstruction projection step 522, a loss function calculation step 524, and an updated 3D facial reconstruction generation step 526. These steps may be repeated a plurality of times until a maximum iteration threshold is reached and/or one or more criteria are satisfied. The criteria may indicate that no significant improvement is likely with further iterative updates and/or that the updated 3D facial reconstruction is of a desired quality. A suitable criterion for indicating that no significant improvement is likely with further iterative updates may be the value of the loss function not decreasing or only decreasing by less than a threshold amount over a given number of iterations. A suitable criterion for indicating that the updated 3D facial reconstruction is of a desired quality may be the loss function being below a desired value.
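As a sketch of the iteration-control logic described above (the iteration limit, improvement threshold and window size are example values, not values from this disclosure):

```python
# Sketch of the stopping check: stop when a maximum number of iterations is
# reached, or when the loss has stopped improving by more than a threshold
# over a recent window of iterations. All thresholds are example values.
def should_stop(loss_history, max_iters=500, min_improvement=1e-4, window=10):
    if len(loss_history) >= max_iters:
        return True
    if len(loss_history) > window:
        recent_improvement = loss_history[-window - 1] - loss_history[-1]
        return recent_improvement < min_improvement
    return False
```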
In the 3D facial reconstruction projection step 522, the 3D facial reconstruction is projected onto a 2D plane, using a differentiable renderer, to form a rendered facial image. The differentiable renderer may use any of the facial mesh, the texture map, and/or lighting and camera parameters to form the rendered facial image.
Prior to projection, the texture map may be used to texture the facial mesh to create a 3D textured facial mesh. Then, the 3D textured facial mesh, centred at the Cartesian origin, may be projected onto the 2D plane by a pinhole camera model with the camera standing at, and directed towards, the positions indicated by the camera parameters, and with a focal length indicated by the camera parameters. The illumination of the projection may be modelled by Phong shading given the positioning of the direct light source, the colour of the direct light source and the colour of the ambient lighting as indicated by the lighting parameters. Barycentric interpolation may be used to interpolate the colour and normal attributes at each vertex at the corresponding pixels, so that gradients may be easily backpropagated through the differentiable renderer. Furthermore, the differentiable renderer may be implemented using a machine learning framework, e.g. TensorFlow or MXNet, to facilitate backpropagation through the renderer. The rendered facial image may be a vector or matrix representation of the rendered facial image, e.g. a matrix or vector containing RGB components for each pixel of the rendered facial image.
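A simplified sketch of the pinhole projection of mesh vertices is shown below. It assumes the vertices have already been transformed into the camera frame (camera at the origin looking along the negative z-axis); a full differentiable renderer would additionally rasterise the triangles, interpolate colours and normals barycentrically, and apply Phong shading.

```python
# Simplified pinhole-camera projection of mesh vertices onto the 2D image
# plane (a sketch under the stated assumptions, not a full renderer).
import numpy as np

def project_vertices(vertices_cam, focal_length, cx=0.0, cy=0.0):
    """vertices_cam: N x 3 array of (x, y, z) in camera coordinates."""
    x, y, z = vertices_cam[:, 0], vertices_cam[:, 1], vertices_cam[:, 2]
    u = focal_length * x / -z + cx      # perspective divide
    v = focal_length * y / -z + cy
    return np.stack([u, v], axis=1)     # N x 2 image-plane coordinates
```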
In the loss function calculation step 524, a loss function is calculated based on a comparison between the input facial image and the rendered facial image. The calculation of the loss function may be further based on a comparison between the input facial image and a rendered sampled facial image, i.e. a rendered facial image produced using the same shape and texture parameters as the rendered facial image while using (pseudo-)randomly sampled expression, camera and/or lighting parameters.
The calculation of the loss function may include detecting landmark locations in the input facial image and the rendered facial image, and using the detected landmark locations to compute a landmark loss.
The landmark locations may be detected using a deep face alignment network, e.g. a cascade multi-view hourglass model. The landmark locations may be indicated by a matrix containing the 2D positions of each of a number of landmarks in the image. The landmark locations for an image I may be denoted as M(I). Therefore, the landmark locations for the input facial image, I^0, may be denoted as M(I^0), and the landmark locations for the rendered facial image, I^R, may be denoted as M(I^R).
The input facial image landmark locations, M(I^0), and the rendered facial image landmark locations, M(I^R), may be compared to compute a landmark loss. The landmark loss, L_lan, may be the Euclidean distance between the input facial image landmark locations and the rendered facial image landmark locations, i.e.:

L_lan = ||M(I^0) - M(I^R)||_2

The landmark loss may be sensitive to misalignment between the input facial image and the rendered facial image. The landmark loss may be particularly useful in optimising the camera parameters used to form the rendered facial image.
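A minimal sketch of this landmark loss, assuming the landmark matrices have already been produced by the landmark detection network, is:

```python
# Sketch of the landmark loss: Euclidean (Frobenius) distance between the
# detected landmark matrices M(I^0) and M(I^R).
import numpy as np

def landmark_loss(landmarks_input, landmarks_rendered):
    """Both arguments are L x 2 arrays of 2D landmark positions."""
    return np.linalg.norm(landmarks_input - landmarks_rendered)
```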
The calculation of the loss function may also include computing a pixel loss based on a comparison between the pixel values for the input facial image and the pixel values for the rendered facial image.
The pixel loss may indicate the distance between the components of each of the corresponding pixels of the input facial image, I^0, and the rendered facial image, I^R. For example, the pixel loss may be the pixel-level l1 loss function:

L_pix = ||I^0 - I^R||_1

which, for RGB colour images, may be calculated as:

L_pix = Σ_{x,y} ( |I^0_{x,y,r} - I^R_{x,y,r}| + |I^0_{x,y,g} - I^R_{x,y,g}| + |I^0_{x,y,b} - I^R_{x,y,b}| )

where I_{x,y,r}, I_{x,y,g} and I_{x,y,b} are the red, green and blue components respectively of the pixel at position (x, y) of an image I.

The calculation of the loss function may also include computing a content loss based on a comparison of an intermediate representation of the input facial image and an intermediate representation of the rendered facial image. The intermediate representations may be robust to pixel-level deformations.
The intermediate representations may be generated by a face recognition network, and may be the activations of one or more hidden layers of the face recognition network for the respective image. The face recognition network may be a convolutional neural network. The convolutional neural network may be a high capacity deep convolutional neural network, e.g. a variant of the ResNet architecture, trained using an additive angular margin (ArcFace) loss function.
The activations of the face recognition network at a layer j for an image I may be denoted as F_j(I). The face recognition network may process the input facial image, I^0, to produce the activations, F_j(I^0), at the jth layer. The input facial image intermediate representation may comprise these activations for one or more layers of the network. The face recognition network may process the rendered facial image, I^R, to produce the activations, F_j(I^R), at the jth layer. The rendered facial image intermediate representation may comprise these activations for one or more layers of the network.
The content loss, L_con, may be computed as the normalized Euclidean distance between the activations constituting the input facial image intermediate representation and the activations constituting the rendered facial image intermediate representation, i.e.:

L_con = Σ_{j=k}^{l} ( ||F_j(I^0) − F_j(I^R)||_2 / (H_j × W_j × C_j) )

where the activations of each of the layers from the kth layer to the lth layer of the face recognition network are used to calculate the content loss, and H_j, W_j and C_j denote the height, width and number of channels, respectively, of layer j of the face recognition network.
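The sketch below illustrates a content loss of this form. A torchvision resnet18 backbone is used purely as a stand-in for the ArcFace-trained face recognition network described above, and the choice of layer1 to layer3 as the "kth to lth" layers is an assumption for the example.

```python
# Normalized Euclidean distances between hidden-layer activations of a stand-in network
from typing import List
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights=None).eval()
layers: List[nn.Module] = [backbone.layer1, backbone.layer2, backbone.layer3]

def activations(image: torch.Tensor) -> List[torch.Tensor]:
    """Collect per-layer feature maps for a 1x3xHxW image tensor."""
    feats = []
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(image))))
    for layer in layers:
        x = layer(x)
        feats.append(x)
    return feats

def content_loss(image_input: torch.Tensor, image_rendered: torch.Tensor) -> torch.Tensor:
    loss = torch.zeros(())
    for f_in, f_ren in zip(activations(image_input), activations(image_rendered)):
        _, c, h, w = f_in.shape
        loss = loss + torch.norm(f_in - f_ren, p=2) / (h * w * c)
    return loss

print(content_loss(torch.rand(1, 3, 112, 112), torch.rand(1, 3, 112, 112)))
```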
The calculation of the loss function may also include computing an identity loss based on a comparison of an embedding of the input facial image and an embedding of the rendered facial image. The embeddings may indicate identity related features.
The embeddings may be generated by a face recognition network, and may be the output of the face recognition network. The face recognition network may be a convolutional neural network. The convolutional neural network may be a high capacity deep convolutional neural network, e.g. a variant of the ResNet architecture, trained using an additive angular margin (ArcFace) loss function. The face recognition network may be the same face recognition network as that used to generate the intermediate representations.
Where the face recognition network has n layers, the output of the nth layer of the network, i.e. the embedding, for an image I may be denoted as F^n(I). The face recognition network may process the input facial image, I^0, to produce the input facial image embedding, F^n(I^0), and may process the rendered facial image 204, I^R, to produce the rendered facial image embedding, F^n(I^R).

The identity loss, L_id, may be computed as the cosine distance between the input facial image embedding, F^n(I^0), and the rendered facial image embedding, F^n(I^R), i.e.:

L_id = 1 − ( F^n(I^0) · F^n(I^R) ) / ( ||F^n(I^0)||_2 ||F^n(I^R)||_2 )

Where the calculation of the loss function is further based on a comparison between the input facial image and a rendered sampled facial image, the calculation of the loss function may also include computing an additional identity loss based on a comparison of the embedding of the input facial image and an embedding of the rendered sampled facial image. The embedding of the rendered sampled facial image may have the same form as the embedding of the input facial image and may be produced by the same face recognition network. The face recognition network may process the rendered sampled facial image, Î^R, to produce a rendered sampled facial image embedding, F^n(Î^R).
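A minimal sketch of the cosine-distance identity loss follows. The 512-dimensional embedding size is an assumption for the example, and the embeddings would in practice come from the face recognition network, which is not shown here.

```python
# One minus the cosine similarity of two embedding vectors, as in L_id above
import torch
import torch.nn.functional as F

def identity_loss(emb_input: torch.Tensor, emb_rendered: torch.Tensor) -> torch.Tensor:
    """L_id = 1 - cos(F^n(I^0), F^n(I^R))."""
    return 1.0 - F.cosine_similarity(emb_input, emb_rendered, dim=0)

# Example with 512-dimensional embeddings
print(identity_loss(torch.randn(512), torch.randn(512, requires_grad=True)))
```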
The additional identity loss, L_id', may be computed as the cosine distance between the input facial image embedding, F^n(I^0), and the rendered sampled facial image embedding, F^n(Î^R), i.e.:

L_id' = 1 − ( F^n(I^0) · F^n(Î^R) ) / ( ||F^n(I^0)||_2 ||F^n(Î^R)||_2 )

The calculation of the loss function may culminate in the calculation of an overall loss, L, which may be output as the result of the loss function calculation. The overall loss may be based on all or any combination of the computed losses. The overall loss may also be based on a regularization term for regularizing at least some parameters used to generate the rendered facial image. The overall loss, L, may be a weighted sum of the computed losses, e.g.:

L = λ_id L_id + λ_id' L_id' + λ_con L_con + λ_pix L_pix + λ_lan L_lan + λ_reg Reg({p_s, p_e, p_l})

where the λ terms are weighting parameters and Reg({p_s, p_e, p_l}) is a regularization term for regularizing the shape, expression and lighting parameters.
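The sketch below combines the individual losses into a weighted sum with a simple L2 regularizer on the shape, expression and lighting parameters. The weight values, parameter dimensions and dictionary keys are illustrative placeholders, not values taken from the document.

```python
# Weighted sum of the computed losses plus an L2 regularization term
from typing import Dict
import torch

def overall_loss(losses: Dict[str, torch.Tensor],
                 params: Dict[str, torch.Tensor],
                 weights: Dict[str, float]) -> torch.Tensor:
    reg = sum(torch.sum(params[k] ** 2) for k in ("shape", "expression", "lighting"))
    total = weights["reg"] * reg
    for name in ("id", "id_sampled", "con", "pix", "lan"):
        total = total + weights[name] * losses[name]
    return total

# Example usage with dummy scalar losses and parameter vectors (sizes are arbitrary)
losses = {k: torch.rand(()) for k in ("id", "id_sampled", "con", "pix", "lan")}
params = {"shape": torch.randn(157, requires_grad=True),
          "expression": torch.randn(28, requires_grad=True),
          "lighting": torch.randn(9, requires_grad=True)}
weights = {"id": 1.0, "id_sampled": 1.0, "con": 50.0, "pix": 1.0, "lan": 0.001, "reg": 1e-4}
overall_loss(losses, params, weights).backward()
```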
In step 526, an updated 3D facial reconstruction is generated. To generate the updated 3D facial reconstruction, the shape and texture parameters used to generate the initial 3D facial reconstruction may be updated using gradient descent based on the calculated loss function. Both the shape and texture parameters may be updated in a single iteration. The camera, lighting and/or expression parameters may also be updated using gradient descent. The updated 3D facial reconstruction may be generated by using the updated shape parameters and, if used, the updated expression parameters to generate an updated facial mesh in the same way as in facial mesh generation step 512, and using the updated texture parameters to generate a texture map in the same way as in texture map generation step 514.
The derivatives used to update the parameters by gradient descent may be derived by backpropagating the loss function through the systems used to calculate the loss and to render the rendered facial image. The updates by gradient descent may be performed using any suitable optimiser, e.g. the Adam Solver.
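A sketch of this iterative update loop is given below, assuming stand-in callables render and compute_losses for the differentiable renderer and the loss calculation described above; the parameter names, sizes and hyperparameters are illustrative only.

```python
# All reconstruction parameters optimised jointly with Adam, with gradients
# backpropagated through the (assumed differentiable) renderer and loss.
import torch

def fit(input_image, render, compute_losses, num_iters: int = 200, lr: float = 0.01):
    params = {"shape": torch.zeros(157, requires_grad=True),
              "texture": torch.zeros(512, requires_grad=True),
              "expression": torch.zeros(28, requires_grad=True),
              "camera": torch.zeros(6, requires_grad=True),
              "lighting": torch.zeros(9, requires_grad=True)}
    optimiser = torch.optim.Adam(params.values(), lr=lr)
    for _ in range(num_iters):
        optimiser.zero_grad()
        rendered = render(params)                        # differentiable rendering
        loss = compute_losses(input_image, rendered, params)
        loss.backward()                                  # backprop through renderer and losses
        optimiser.step()                                 # shape and texture updated together
    return params
```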
The proposed approach may also be applied to iteratively update a 3D facial reconstruction based on multiple input facial images. Where multiple input facial images are used, at least one additional input facial image is received. The at least one additional input facial image is used to generate an initial 3D facial reconstruction for the at least one additional input facial image using the initial generation submethod 310. A combined 3D facial reconstruction is then generated by averaging the foundational parameters, i.e. the identity reconstruction parameters, for the 3D facial reconstruction corresponding to the input facial image and the 3D facial reconstruction(s) corresponding to each of the at least one additional input facial image(s). The foundational parameters may be the shape and texture parameters. The 3D facial reconstruction for each of the at least one additional input facial image(s) is iteratively updated using the iterative update submethod 520. The combined 3D facial reconstruction is then updated based on the updated 3D facial reconstruction(s) corresponding to each of the at least one additional input facial image(s) and the updated 3D facial reconstruction corresponding to the input facial image. The combined 3D facial reconstruction may be updated by averaging the foundational parameters for the updated 3D facial reconstruction corresponding to the input facial image and the updated 3D facial reconstruction(s) corresponding to each of the at least one additional input facial image(s).
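As a sketch of the combination step only, the following assumes each image has already been fitted independently and averages the shape and texture parameters; the parameter dimensions and function name are illustrative assumptions.

```python
# Averaging the foundational (identity) parameters across several fitted images
from typing import Dict, List
import torch

def combine_reconstructions(per_image_params: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Average shape and texture parameters over all fitted images."""
    combined = {}
    for key in ("shape", "texture"):
        combined[key] = torch.stack([p[key] for p in per_image_params]).mean(dim=0)
    return combined

# Example: three images fitted independently, then combined
fits = [{"shape": torch.randn(157), "texture": torch.randn(512)} for _ in range(3)]
print({k: v.shape for k, v in combine_reconstructions(fits).items()})
```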
Initial Alignment Method

Figure 6 is a flow diagram illustrating an example method 600 for initially aligning a 3D facial reconstruction with an input facial image. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the parameter fitting system 400. The method may be performed prior to the example method 300 for generating a 3D facial reconstruction based on an input facial image.
In step 610, landmark detection is performed on the input facial image to detect landmark locations of the input facial image.
The landmark locations may be detected using a deep face alignment network, e.g. a cascade multi-view hourglass model. The landmark locations may be indicated by a matrix containing the 2D positions of each of a number of landmarks in the image. The landmark locations for an image I may be denoted as M(I). Therefore, the landmark locations for the input facial image, I^0, may be denoted as M(I^0).
In step 620, a prior 3D facial reconstruction is generated. The prior 3D facial reconstruction may be generated in the same way as the initial 3D facial reconstruction, e.g. using the submethod 310. The generation of the prior 3D facial reconstruction is based on shape parameters. The generation of the prior 3D facial reconstruction may also be based on expression parameters and texture parameters.
In step 630, the prior 3D facial reconstruction is projected onto a 2D plane, using a differentiable renderer, to form a prior rendered facial image. The differentiable renderer uses the prior 3D facial reconstruction, camera parameters, and, optionally, lighting parameters to form the prior rendered facial image. The differentiable renderer may be the same differentiable renderer as that used to project the 3D facial reconstruction in the 3D facial reconstruction projection step 322 of the 3D facial reconstruction method 300. The prior 3D facial reconstruction may be projected in the same way as in the 3D facial reconstruction projection step 322.
In step 640, landmark detection is performed on the prior rendered facial image to detect landmark locations of the prior rendered facial image.
As in step 610, the landmark locations may be detected using a deep face alignment network, e.g. a cascade multi-view hourglass model, and indicated by a matrix containing the 2D positions of each of a number of landmarks in the image. The landmark locations for the prior rendered facial image, I^P, may be denoted as M(I^P).
In step 650, the facial landmark locations of the input facial image and the facial landmark locations of the prior rendered facial image are compared. The facial landmark locations may be compared by calculating a landmark loss. The landmark loss, L_lan, may be the Euclidean distance between the input facial image landmark locations and the prior rendered facial image landmark locations, i.e.:

L_lan = ||M(I^0) − M(I^P)||_2

In step 660, the shape and camera parameters are optimised to align the facial shape and facial pose of the prior rendered facial image with the input facial image based on the comparison of the facial landmark locations. The expression parameters may also be similarly optimised. These parameters may be optimised by gradient descent using the landmark loss. The derivatives used for gradient descent may be obtained by backpropagation through the systems used to calculate the landmark loss, detect the landmarks and render the prior rendered facial image. The optimisation may be performed using a suitable optimisation algorithm, e.g. the Adam optimisation algorithm.
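A minimal sketch of this alignment stage is shown below, assuming differentiable stand-ins render and detect_landmarks for the renderer and the face alignment network; only shape, expression and camera parameters are optimised against the landmark loss, and all sizes and hyperparameters are illustrative.

```python
# Landmark-driven pre-alignment of shape, expression and camera parameters
import torch

def align(input_landmarks, render, detect_landmarks, num_iters: int = 100, lr: float = 0.05):
    shape = torch.zeros(157, requires_grad=True)        # illustrative parameter sizes
    expression = torch.zeros(28, requires_grad=True)
    camera = torch.tensor([0.0, 0.0, 3.0, 0.0, 0.0, 1.0], requires_grad=True)
    optimiser = torch.optim.Adam([shape, expression, camera], lr=lr)
    for _ in range(num_iters):
        optimiser.zero_grad()
        prior_rendered = render(shape, expression, camera)
        loss = torch.norm(input_landmarks - detect_landmarks(prior_rendered), p=2)
        loss.backward()
        optimiser.step()
    return shape, expression, camera
```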
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the invention, the scope of which is defined in the appended claims. Various components of different embodiments may be combined where the principles underlying the embodiments are compatible.
Claims (16)
- 1. A computer implemented method for generating a 3D facial reconstruction based on an input facial image comprising: generating an initial 3D facial reconstruction by: generating, using a shape model, a facial mesh based on a plurality of shape parameters; and generating a texture map based on a plurality of texture parameters; iteratively updating the 3D facial reconstruction by: projecting, using a differentiable renderer, the 3D facial reconstruction onto a 2D plane to form a rendered facial image; calculating a loss function based on a comparison of the rendered facial image and the input facial image; and generating an updated 3D facial reconstruction by updating the shape parameters and texture parameters based on the calculated loss function using gradient descent.
- 2. The method of claim 1, wherein generating the updated 3D facial reconstruction comprises updating both the shape parameters and texture parameters in a single iteration.
- 3. The method of any preceding claim, wherein generating the facial mesh further uses an expression model and is further based on a plurality of expression parameters.
- 4. The method of any preceding claim, wherein generating the texture map uses a generator network of a generative adversarial network.
- 5. The method of claim 4, wherein the generative adversarial network is a progressive growing generative adversarial network.
- 6. The method of any preceding claim, wherein projecting the 3D facial reconstruction onto a 2D plane is based on a plurality of illumination parameters and camera parameters; and wherein generating the updated 3D facial reconstruction further comprises updating the camera parameters and illumination parameters.
- 7. The method of claim 6, wherein the camera parameters define a camera position and a focal length, and the illumination parameters define a direct light source colour, a direct light source position and an ambient lighting colour.
- 8. The method of any preceding claim, further comprising: performing landmark detection on the input facial image using a landmark detection network to output a plurality of facial landmark locations; and performing landmark detection on the rendered facial image using the landmark detection network to output a plurality of facial landmark locations, wherein calculating the loss function comprises calculating a landmark loss function based on a comparison of the facial landmark locations of the input facial image and the facial landmark locations of the rendered facial image.
- 9. The method of any preceding claim further comprising, prior to generating the initial 3D facial reconstruction: performing landmark detection on the input facial image using a landmark detection network to output a plurality of facial landmark locations; generating a prior 3D facial reconstruction based on shape parameters; projecting, using the differentiable renderer, the prior 3D facial reconstruction onto a 2D plane to form a prior rendered facial image based on camera parameters; performing landmark detection on the prior rendered facial image using a landmark detection network to output a plurality of facial landmark locations; comparing the facial landmark locations of the input facial image and the facial landmark locations of the prior rendered facial image; and optimising the shape and camera parameters to align a facial shape and facial pose of the prior rendered facial image with the input facial image based on the comparison of the facial landmark locations.
- 10. The method of any preceding claim, wherein calculating the loss function comprises calculating a pixel loss function based on a comparison of pixel values for the input facial image and pixel values for the rendered facial image.
- 11. The method of any preceding claim, further comprising: performing facial recognition using a facial recognition network on the input facial image and rendered facial image; wherein calculating the loss function comprises calculating an identity loss function based on a comparison of the facial recognition network output for the input facial image and the facial recognition network output for the rendered facial image.
- 12. The method of claim 11, wherein calculating the loss function comprises calculating a content loss function based on a comparison of at least one intermediate layer output of the facial recognition network output for the input facial image and a corresponding intermediate layer output of the facial recognition network output for the rendered facial image.
- 13. The method of claim 11 or claim 12, further comprising: forming a rendered sampled facial image by projecting the 3D facial reconstruction onto a 2D plane with a random camera positioning, a random expression and random lighting; and performing facial recognition using the facial recognition network on the rendered sampled facial image; wherein calculating the loss function comprises calculating an additional identity loss function based on a comparison of the facial recognition network output for the input facial image and the facial recognition network output for the rendered sampled facial image.
- 14. The method of any preceding claim, further comprising: receiving at least one additional input facial image; generating an initial 3D facial reconstruction for the at least one additional input facial image; generating a combined 3D facial reconstruction by averaging shape parameters and texture parameters corresponding to the input facial image and the at least one additional input facial image; iteratively updating the 3D facial reconstruction for the at least one additional input facial image; and updating the combined 3D facial reconstruction based on the updated 3D facial reconstruction for the at least one additional input facial image and the updated 3D facial reconstruction for the input facial image.
- 15. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 14.
- 16. A data processing apparatus comprising means for carrying out the method of any one of claims 1 to 14.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1902067.6A GB2581374B (en) | 2019-02-14 | 2019-02-14 | 3D face reconstruction system and method |
PCT/GB2020/050274 WO2020165557A1 (en) | 2019-02-14 | 2020-02-06 | 3d face reconstruction system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1902067.6A GB2581374B (en) | 2019-02-14 | 2019-02-14 | 3D face reconstruction system and method |
Publications (3)
Publication Number | Publication Date |
---|---|
GB201902067D0 GB201902067D0 (en) | 2019-04-03 |
GB2581374A true GB2581374A (en) | 2020-08-19 |
GB2581374B GB2581374B (en) | 2022-05-11 |
Family
ID=65998547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1902067.6A Active GB2581374B (en) | 2019-02-14 | 2019-02-14 | 3D face reconstruction system and method |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2581374B (en) |
WO (1) | WO2020165557A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20240221316A1 (en) * | 2022-12-28 | 2024-07-04 | De-Identification Ltd. | System and method for reconstruction of a three-dimensional human head model from a single image |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110706339B (en) * | 2019-09-30 | 2022-12-06 | 北京市商汤科技开发有限公司 | Three-dimensional face reconstruction method and device, electronic equipment and storage medium |
CN112686083A (en) * | 2019-10-18 | 2021-04-20 | 复旦大学 | Face micro-expression emotion depth learning identification system based on combined confrontation generation network |
CN110880018B (en) * | 2019-10-29 | 2023-03-14 | 北京邮电大学 | Convolutional neural network target classification method |
CN111626091B (en) * | 2020-03-09 | 2023-09-22 | 咪咕文化科技有限公司 | Face image labeling method and device and computer readable storage medium |
CN111524216B (en) * | 2020-04-10 | 2023-06-27 | 北京百度网讯科技有限公司 | Method and device for generating three-dimensional face data |
CN111710035B (en) * | 2020-07-16 | 2023-11-07 | 腾讯科技(深圳)有限公司 | Face reconstruction method, device, computer equipment and storage medium |
US20220027720A1 (en) * | 2020-07-22 | 2022-01-27 | Itseez3D, Inc. | Method to parameterize a 3d model |
CN112002009B (en) * | 2020-08-04 | 2022-10-28 | 中国科学技术大学 | Unsupervised three-dimensional face reconstruction method based on generation of confrontation network |
JP7386888B2 (en) | 2020-10-08 | 2023-11-27 | グーグル エルエルシー | Two-shot composition of the speaker on the screen |
CN112530019B (en) * | 2020-12-11 | 2023-03-14 | 中国科学院深圳先进技术研究院 | Three-dimensional human body reconstruction method and device, computer equipment and storage medium |
CN112633191B (en) * | 2020-12-28 | 2024-09-06 | 百果园技术(新加坡)有限公司 | Three-dimensional face reconstruction method, device, equipment and storage medium |
CN114005153A (en) * | 2021-02-01 | 2022-02-01 | 南京云思创智信息科技有限公司 | Real-time personalized micro-expression recognition method for face diversity |
CN113095149A (en) * | 2021-03-18 | 2021-07-09 | 西北工业大学 | Full-head texture network structure based on single face image and generation method |
CN113298936B (en) * | 2021-06-01 | 2022-04-29 | 浙江大学 | Multi-RGB-D full-face material recovery method based on deep learning |
CN113538682B (en) * | 2021-07-19 | 2022-05-31 | 合肥的卢深视科技有限公司 | Model training method, head reconstruction method, electronic device, and storage medium |
CN113902848A (en) * | 2021-10-14 | 2022-01-07 | 北京达佳互联信息技术有限公司 | Object reconstruction method and device, electronic equipment and storage medium |
CN114419297B (en) * | 2022-01-21 | 2024-10-15 | 吉林大学 | 3D target camouflage generation method based on background style migration |
CN116704103A (en) * | 2022-02-25 | 2023-09-05 | 腾讯科技(成都)有限公司 | Image rendering method, device, equipment, storage medium and program product |
CN114648613B (en) * | 2022-05-18 | 2022-08-23 | 杭州像衍科技有限公司 | Three-dimensional head model reconstruction method and device based on deformable nerve radiation field |
CN114998515B (en) * | 2022-05-19 | 2024-08-30 | 大连理工大学 | 3D human body self-supervision reconstruction method based on multi-view image |
CN115690327A (en) * | 2022-11-16 | 2023-02-03 | 广州大学 | Space-frequency decoupling weak supervision three-dimensional face reconstruction method |
CN116206035B (en) * | 2023-01-12 | 2023-12-01 | 北京百度网讯科技有限公司 | Face reconstruction method, device, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8260038B2 (en) * | 2009-02-25 | 2012-09-04 | Seiko Epson Corporation | Subdivision weighting for robust object model fitting |
-
2019
- 2019-02-14 GB GB1902067.6A patent/GB2581374B/en active Active
-
2020
- 2020-02-06 WO PCT/GB2020/050274 patent/WO2020165557A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
H Kato et al, "2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition", published 2018, IEEE, pp 3907-3916, Kato et al, "Neural 3D Mesh Renderer" * |
Also Published As
Publication number | Publication date |
---|---|
WO2020165557A1 (en) | 2020-08-20 |
GB2581374B (en) | 2022-05-11 |
GB201902067D0 (en) | 2019-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
GB2581374A (en) | 3D Face reconstruction system and method | |
CN111696196B (en) | Three-dimensional face model reconstruction method and device | |
CN111126412A (en) | Image key point detection method based on characteristic pyramid network | |
CN114596290A (en) | Defect detection method, defect detection device, storage medium, and program product | |
CN116129292B (en) | Infrared vehicle target detection method and system based on few sample augmentation | |
Rao et al. | Extreme feature regions detection and accurate quality assessment for point-cloud 3D reconstruction | |
CN116383470B (en) | Image searching method with privacy protection function | |
CN117726747A (en) | Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene | |
CN117576303A (en) | Three-dimensional image generation method, device, equipment and storage medium | |
KR102572415B1 (en) | Method and apparatus for creating a natural three-dimensional digital twin through verification of a reference image | |
CN114049423B (en) | Automatic realistic three-dimensional model texture mapping method | |
Zhang | Binocular Stereo Vision | |
CN115841546A (en) | Scene structure associated subway station multi-view vector simulation rendering method and system | |
Chang et al. | R2p: Recomposition and retargeting of photographic images | |
CN114723973A (en) | Image feature matching method and device for large-scale change robustness | |
Yang et al. | Exposing photographic splicing by detecting the inconsistencies in shadows | |
Han et al. | Learning residual color for novel view synthesis | |
CN112419172A (en) | Remote sensing image processing method for correcting and deblurring inclined image | |
Lv et al. | Optimisation of real‐scene 3D building models based on straight‐line constraints | |
Tang et al. | An Intelligent Registration Method of Heterogeneous Remote Sensing Images Based on Style Transfer | |
Ramadhani et al. | Virtual Avatar Representation in the Digital Twin: A Photogrammetric Three-Dimentional Modeling Approach | |
He | Research on outdoor garden scene reconstruction based on PMVS Algorithm | |
CN116433852B (en) | Data processing method, device, equipment and storage medium | |
CN114743013A (en) | Local descriptor generation method, device, electronic equipment and computer program product | |
Chen et al. | Structure guided image completion using texture synthesis and region segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
732E | Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977) |
Free format text: REGISTERED BETWEEN 20200820 AND 20200826 |