GB2596777A - Facial re-enactment - Google Patents

Facial re-enactment

Info

Publication number
GB2596777A
Authority
GB
United Kingdom
Prior art keywords
images
source
target
parameters
identity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2007052.0A
Other versions
GB202007052D0 (en)
Inventor
Zafeiriou Stefanos
Koujan Rami
Doukas Michail-Christos
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to GB2007052.0A priority Critical patent/GB2596777A/en
Publication of GB202007052D0 publication Critical patent/GB202007052D0/en
Priority to PCT/CN2021/093530 priority patent/WO2021228183A1/en
Publication of GB2596777A publication Critical patent/GB2596777A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/18 Image warping, e.g. rearranging pixels individually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Ophthalmology & Optometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A first plurality of sequential source face coordinate images and gaze tracking images are generated based on a plurality of source input images of a first source subject or actor 12. The first plurality of source face coordinate images comprise source identity parameters and source expression parameters, wherein the source expression parameters are represented as offsets from the source identity parameters. A plurality of sequential target face coordinate images and gaze tracking images of the first target subject or actor 14 are generated based on a plurality of target input images of a first target subject. Using a first neural network, a plurality of sequential output images 16 are generated using a mapping module 10, wherein the plurality of sequential output images are based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters. The neural network may be a generative adversarial network trained to generate the output images by inputting the source and target coordinate and gaze tracking images. A loss function may be used based on ground truth inputs and the sequential output images.

Description

Facial Re-enactment
Field
The present specification relates to facial re-enactment.
Background
Facial re-enactment aims at transferring the expression of a source actor to a target face. Example approaches include transferring the expressions of the source actor by modifying deformations within the inner facial region of the target actor. There remains a need for further developments in this area.
Summary
In a first aspect, this specification describes a method comprising: generating, based on a plurality of source input images of a first source subject, a first plurality of sequential source face coordinate images and gaze tracking images of the first source subject, wherein the first plurality of source face coordinate images comprise source identity parameters and source expression parameters, wherein the source expression parameters are represented as offsets from the source identity parameters; generating, based on a plurality of target input images of a first target subject, a plurality of sequential target face coordinate images and gaze tracking images of the first target subject; wherein the target face coordinate images comprise target identity parameters and target expression parameters, wherein the target expression parameters are represented as offsets from the target identity parameters; and generating, using a first neural network, a plurality of sequential output images, wherein the plurality of sequential output images are based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters.
In one example, the first plurality of source face coordinate images further comprises source imaging parameters, wherein the source imaging parameters represent at least one of a rotation, translation, and/or orthographic scale of the source subject in the plurality of source input images. The target face coordinate images may further comprise target imaging parameters, wherein the target imaging parameters represent at least one of a rotation, translation, and/or orthographic scale of the target subject in the plurality of target input images. Some examples include training the first neural network with the first plurality of source face coordinate images and the plurality of target face coordinate images. Some examples include training the first neural network with a plurality of source face coordinate images associated with a plurality of source subjects.
In some examples, the first neural network comprises a generative adversarial network, wherein the generative adversarial network comprises: a generator module for generating the plurality of output images; and at least one discriminator module for generating at least one loss function respectively based on the one or more ground truth inputs and the plurality of output images generated by the generator module.
Some examples include training the generative adversarial network, wherein the training comprises: receiving one or more ground truth inputs related to the target subject; and updating the generative adversarial network based on the at least one loss function.
In some examples, the at least one discriminator module comprises at least one of an image discriminator, a mouth discriminator, and a dynamics discriminator.
In some examples, training the first neural network further comprises determining an identity feature vector of the target subject based at least in part on the target face coordinate images and an identity coefficient of the target subject, wherein the generator module generates the plurality of output images based at least in part on the identity feature vector of the target subject. The identity feature vector may be determined by concatenating the target face coordinate images and an output of the identity embedder.
In some examples, generating the plurality of output images using the trained first neural network comprises: initializing the generator module; initializing the at least one discriminator module; and mapping the source expression parameters and the source gaze tracking images on the target identity parameters based on at least one determined identity feature vector of the target subject.
In some examples, the source face coordinate images and the target face coordinate images comprise normalized mean face coordinates, wherein the normalized mean face coordinates comprise an encoding function of a rasterizer and a normalized version of a three dimensional morphable model.
In some examples, the target face coordinate images are used as a conditional input to the first neural network.
In a second aspect, this specification describes a method of training a neural network to generate a plurality of sequential output images, the method comprising: inputting a first plurality of sequential source face coordinate images and gaze tracking images of a first source subject, wherein the sequential source face coordinate images and gaze tracking images are based on a plurality of source input images of the first source subject; inputting a plurality of sequential target face coordinate images and gaze tracking images of a first target subject, wherein the plurality of sequential target face coordinate images and gaze tracking images are based on a plurality of target input images of the first target subject; generating, at a generator module of the neural network, the plurality of sequential output images based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters; generating, at a discriminator module of the neural network, at least one loss function based on one or more ground truth inputs and the plurality of sequential output images; and updating one or more parameters of the neural network based on the at least one loss function.
Some examples include inputting a plurality of source face coordinate images associated with a plurality of source subjects, wherein the plurality of sequential output images is further based on the plurality of source face coordinate images associated with the plurality of source subjects.
In some examples, the discriminator module comprises at least one of an image discriminator, a mouth discriminator, and a dynamics discriminator.
Some examples include determining an identity feature vector of the target subject based at least in part on the target face coordinate images and an identity coefficient of the first target subject, wherein the generator module of the neural network generates the plurality of sequential output images based at least in part on the identity feature vector of the target subject.
In some examples, generating the plurality of sequential output images may comprise: initializing the generator module of the neural network; initializing the discriminator module of the neural network; and mapping the source expression parameters and the source gaze tracking images on the target identity parameters based on at least one determined identity feature vector of the first target subject.
In a third aspect, this specification describes a system comprising one or more processors and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform a method of the first aspect or the second aspect.
In a fourth aspect, this specification describes a computer program product comprising computer readable instructions that, when executed by a computing system, cause the computing system to perform a method of the first aspect or the second aspect.
Brief description of the drawings
Example embodiments will now be described, by way of example only, with reference to the following schematic drawings, in which:
FIG. 1 shows an image mapping module in accordance with an example embodiment;
FIG. 2 is a block diagram of a system in accordance with an example embodiment;
FIG. 3 is a block diagram of a system in accordance with an example embodiment;
FIG. 4 shows an example of feature extraction in accordance with an example embodiment;
FIG. 5 is a flow diagram showing an algorithm in accordance with an example embodiment;
FIG. 6 is a flow diagram showing an algorithm in accordance with an example embodiment;
FIG. 7 is a block diagram of a system in accordance with an example embodiment;
FIG. 8 is a data structure in accordance with an example embodiment;
FIG. 9 is a block diagram showing an example of a structure of a neural network for use in an example embodiment;
FIG. 10 shows an example discriminator module in accordance with an example embodiment;
FIG. 11 is a block diagram of a system in accordance with an example embodiment;
FIG. 12 is a flow diagram of an algorithm in accordance with an example embodiment;
FIG. 13 is a flow diagram showing an algorithm in accordance with an example embodiment; and
FIG. 14 is a block diagram of a system in accordance with an example embodiment.
Detailed description
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in the specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
In the description and drawings, like reference numerals refer to like elements throughout.
FIG. 1 shows an image mapping module, indicated generally by the reference numeral 10, in accordance with an example embodiment. The image mapping module 10 has a first input receiving images (e.g. video images) of a source actor 12, a second input receiving images (e.g. video images) of a target actor 14, and an output 16. The source may be considered to provide "expressions". The target may be considered to be an "identity" to which the expressions of the source actor are to be applied.
The image mapping module 10 seeks to transfer the expressions of the source actor 12 onto the target actor 14 to provide the output 16, thereby implementing facial re-enactment. The inputs 12 and 14 and the output 16 may all be videos.
As discussed below, the module 10 seeks to transfer the pose, facial expression, and/or eye gaze movement from the source actor 12 to the target actor 14, thereby providing the output 16. Given that people may easily detect mistakes in the appearance of a human face, specific attention may be given to the details of the mouth and the teeth such that photo-realistic and temporally consistent videos of faces are generated as outputs.
FIG. 2 is a block diagram of a system, indicated generally by the reference numeral 20, in accordance with an example embodiment. The system 20 comprises a first module 22, which module may be implemented as part of the image mapping module 10 described above. The first module 22 receives source input images 24 (e.g. source video images). The source input images 24 may be obtained from the source actor 12 described above. As described further below, the first module 22 generates source face coordinate images and gaze tracking images from the source input images 24.
FIG. 3 is a block diagram of a system, indicated generally by the reference numeral 30, in accordance with an example embodiment. The system 30 comprises a first module 32, which module may be implemented as part of the image mapping module 10 described above. The first module 32 receives target input images 34 (e.g. target video images). The target input images 34 may be obtained from the target actor 14 described above. As described further below, the first module 32 generates target face coordinate images and gaze tracking images from the target input images 34.
As described in detail below, the source face coordinate images and the target face coordinate images described above may comprise normalized mean face coordinates (NMFC images), wherein the normalized mean face coordinates comprise an encoding function of a rasterizer and a normalized version of a three dimensional morphable model.
FIG. 4 shows an example of feature extraction (e.g. extraction of pose, facial expression, and/or eye gaze movement of a source actor) in accordance with an example embodiment. More specifically, FIG. 4 shows an example face coordinate image 42 (such as the source coordinate images and target coordinate images described above) and an example gaze tracking image 44 (such as the gaze tracking images of the source actor 12 or the target actor 14 described above).
FIG. 5 is a flow diagram showing an algorithm, indicated generally by the reference numeral 50, in accordance with an example embodiment.
The algorithm 50 starts at operation 52, where facial reconstruction and tracking is carried out using the systems 20 and 30 described above. At operation 54, conditioning images are generated based on the outputs of the operation 52. Finally, at operation 56, video rendering is performed based on the conditioning images generated in the operation 54.
Example implementation details of the algorithm 50 are provided below.
In an example embodiment, the facial reconstruction and tracking at operation 52 may comprise generating target face coordinate images, gaze tracking images, source face coordinate images, and source gaze tracking images. For example, human head characteristics (e.g. related to the identity of a person) may be separated from facial expressions, pose, and/or eye gaze movements of the person. In one example, three dimensional morphable models (3DMMs) may be used for 3D reconstruction of faces (e.g. the face of a source actor and/or a target actor) in an input image sequence (e.g. an input video). For example, given a sequence of T frames, the 3D facial reconstruction and tracking at operation 52 may produce two sets of parameters comprising shape parameters and imaging parameters. The shape parameters may comprise at least one of identity parameters and expression parameters, and may be represented by S = {s_t, t = 1, ..., T}. The imaging parameters may be represented by P = {p_t, t = 1, ..., T}. The imaging parameters may represent at least one of a rotation, translation, and/or orthographic scale of a subject (source or target actor) in the plurality of input images.
In an example embodiment, with 3D morphable models, a 3D facial shape x_t ∈ R^{3N} may be represented mathematically according to equation 1 below:

x_t = x̄ + U_id s_id + U_exp s_exp,t    (1)

where x̄ ∈ R^{3N} may be an overall mean shape vector of the 3D morphable model, given by x̄ = x̄_id + x̄_exp, where x̄_id represents the identity part of the model and x̄_exp represents the expression part of the model. U_id ∈ R^{3N×n_id} may be an orthonormal basis for the identity parameters with n_id principal components, and U_exp ∈ R^{3N×n_exp} may be an orthonormal basis for the expression parameters with n_exp principal components (n_id, n_exp << 3N). The identity parameters are denoted by s_id ∈ R^{n_id} and the expression parameters are denoted by s_exp,t ∈ R^{n_exp}. The shape parameters (e.g. the joint identity and expression parameters) of a frame t may be denoted by s_t = [s_id, s_exp,t]. In an example embodiment, the face coordinate images (e.g. the source face coordinate images and/or the target face coordinate images), denoted by the 3D facial shape x_t, may be a function of the identity parameters and the expression parameters, x_t = x_t(s_id, s_exp,t). For example, expression variations may be represented as offsets from a given identity shape.
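By way of illustration only, the following Python sketch applies the linear model of equation (1) with randomly initialised placeholder bases. The vertex count N and the component counts n_id and n_exp are assumptions; a real implementation would load the actual 3DMM identity and expression bases rather than random matrices.

```python
import numpy as np

# Minimal sketch of the linear 3DMM shape model of equation (1), using
# randomly initialised placeholder bases instead of a real 3DMM.
N = 5000                 # number of mesh vertices (assumed)
n_id, n_exp = 157, 28    # numbers of principal components (assumed)

rng = np.random.default_rng(0)
x_mean = rng.standard_normal(3 * N)          # overall mean shape  x̄
U_id = rng.standard_normal((3 * N, n_id))    # identity basis      U_id
U_exp = rng.standard_normal((3 * N, n_exp))  # expression basis    U_exp

def reconstruct_shape(s_id: np.ndarray, s_exp_t: np.ndarray) -> np.ndarray:
    """Equation (1): x_t = x̄ + U_id s_id + U_exp s_exp,t."""
    return x_mean + U_id @ s_id + U_exp @ s_exp_t

# Expression offsets are applied on top of a fixed identity shape.
s_id = rng.standard_normal(n_id) * 0.1
neutral = reconstruct_shape(s_id, np.zeros(n_exp))
smiling = reconstruct_shape(s_id, rng.standard_normal(n_exp) * 0.1)
print(neutral.shape, np.abs(smiling - neutral).max())
```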
In an example embodiment, the identity parameters and expression parameters of a source actor and/or target actor may be determined based at least partially on a plurality of video frames from an input source video 24 and/or target video 34 respectively. Facial videos (e.g. videos of the face of a source actor or videos of the face of a target actor) may comprise dynamic information that may be useful for determining the 3D morphable model (e.g. the 3D facial shape x_t, for example, face coordinate images 42) of a source actor or a target actor. Facial landmarking may be used for achieving reliable landmark localisation, such that the landmark information may be combined with 3D morphable models in order to perform the facial reconstruction. For example, it may be assumed that the camera used for obtaining the videos performs scaled orthographic projection (SOP). The identity parameters may be fixed over all the video frames (as the video is assumed to be of a single person with the same identity, e.g. either the source actor or the target actor), and the expression parameters and/or imaging parameters may vary over the video frames (as the expression, pose, and/or camera or imaging angles may change through the video). For example, for a given image sequence (e.g. a video), a cost function may be minimized, where the cost function may comprise a) a sum of squared 2D landmark reprojection errors over all frames (E_l), b) a shape priors term (E_pr) that imposes a quadratic prior over the identity and per-frame expression parameters, and c) a temporal smoothness term (E_sm) that enforces smoothness of the expression parameters in time, by using a quadratic penalty on the second temporal derivatives of the expression vector, as shown in equation 2 below:

E(S, P) = E_l(S, P) + λ_pr E_pr(S) + λ_sm E_sm(S)    (2)

In an example embodiment, box constraints may be imposed on the identity parameters and expression parameters (e.g. per-frame expression parameters) in order to account for outliers (e.g. frames with strong occlusions that cause gross errors in the landmarks). For example, assuming that the camera parameters (P) in equation (2) have been estimated in an initialisation stage, the minimisation of the cost function results in a large-scale least squares problem with box constraints, which may be solved using a reflective Newton method. The initialisation stage of the 3D facial reconstruction and tracking step is described in further detail below.
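The box-constrained least squares step described above may be illustrated, under assumptions, by the following sketch. It uses SciPy's lsq_linear with its trust-region reflective solver as a stand-in for the reflective Newton method, and the design matrix A and target vector b are random placeholders for the stacked landmark, prior and smoothness terms.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Sketch of the box-constrained least-squares step: once the camera
# parameters P are fixed, the landmark, prior and smoothness terms of
# equation (2) are linear in the stacked identity/expression vector q,
# so the problem reduces to  min ||A q - b||^2  subject to  lb <= q <= ub.
# A and b here are random stand-ins for the stacked system.
rng = np.random.default_rng(0)
n_params = 157 + 28 * 10          # identity + per-frame expression (assumed sizes)
A = rng.standard_normal((4000, n_params))
b = rng.standard_normal(4000)

bounds = (-3.0 * np.ones(n_params), 3.0 * np.ones(n_params))  # box constraints
result = lsq_linear(A, b, bounds=bounds, method="trf")         # reflective trust-region solver
print(result.status, result.cost)
```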
In an example embodiment, identity parameters of the 3D morphable model may originate from the Large Scale Morphable Model (LSFM), for example, built from a plurality of (e.g. approximately 10,000) images of different people, with varied demographic information. The expression parameters of the 3D morphable model may originate from using the blendshapes model of FaceWarehouse and non-rigid ICP to register the blendshapes model with the LSFM model.
In an example embodiment, the conditioning image generation at operation 54 may comprise training a neural network (e.g. implemented in a video renderer module similar to the first module 32) with a plurality of images 34 of the target actor, such that the trained neural network may generate target face coordinate images and target gaze tracking images based on target input images, for example, similar to the first module 32 described above. For example, training the neural network may comprise providing a sequence of images (e.g. a video) of the target actor and the target face coordinate images (e.g. identity parameters, expression parameters, and/or imaging parameters) obtained in the facial reconstruction and tracking operation 52. The parameterisation obtained from the target face coordinate images separates the identity of the target actor from the expression of the target actor, such that the neural network that is trained on the specific target actor in a training phase may be used for transferring the expression and/or pose of another source actor with a different identity during a test phase.
In an example embodiment, the inputs to the neural network comprise the face coordinate images (e.g. outputs from system 20 or 30), which, for example, may be determined based on the shape parameters (s_t) (e.g. including identity parameters and expression parameters) and imaging parameters (p_t). The shape parameters and imaging parameters may be used for rasterizing the 3D facial reconstruction of the frame, such that a visibility mask M_t is produced in the image space. For example, each pixel of the visibility mask M_t may store an identity of a corresponding visible triangle on the 3D face from the respective pixel. The normalised x-y-z coordinates of the centre of this triangle may be encoded in another image, termed the Normalised Mean Face Coordinates (NMFC ∈ R^{H×W×3}) image, and the NMFC image may be utilised as a conditional input of the video rendering stage at operation 56. The NMFC image may be generated according to equation (3) below:

NMFC_t = ε(R(x_t, p_t), x̄_n)    (3)

where R is the rasterizer, ε is the encoding function and x̄_n is the normalised version of the utilised 3DMM mean face (see equation (1)), so that the x-y-z coordinates of this face lie in [0, 1]. In addition to the NMFC images, the input of the neural video renderer may further be conditioned on the gaze image (G_t) of the corresponding frame t.
In an example embodiment, the video rendering at operation 56 may comprise using a neural network for rendering a video of the target actor with the expression and/or pose of a source actor. For example, given the video of a source actor y_{1:T}, a sequence of NMFC frames c_{1:T} and a corresponding sequence of eye gaze frames e_{1:T} may be obtained for the source actor. A conditional input to the neural network x_{1:T} may be computed, for example, by concatenating the NMFC frame sequence of the source actor and the eye gaze frame sequence of the source actor at the channel dimension, such that x_t = [c_t, e_t] and x_t ∈ R^{H×W×6} for each time instance t. The neural network may learn to translate its conditional input video to a realistic and temporally coherent output video ŷ_{1:T}, which may show the target actor performing similar head motions, facial expressions, and/or eye blinks as the source actor in the source input video.
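As an illustrative sketch only, the conditional input x_t may be assembled by channel-wise concatenation as follows; the image size, clip length and random placeholder images are assumptions.

```python
import numpy as np

# Sketch of building the conditional input x_t of the video renderer by
# concatenating the NMFC image c_t (H x W x 3) and the eye gaze image
# e_t (H x W x 3) along the channel dimension, giving x_t in R^{H x W x 6}.
H, W, T = 256, 256, 8   # spatial size and clip length (assumed)
rng = np.random.default_rng(0)
nmfc = rng.random((T, H, W, 3)).astype(np.float32)   # c_{1:T}
gaze = rng.random((T, H, W, 3)).astype(np.float32)   # e_{1:T}

x = np.concatenate([nmfc, gaze], axis=-1)            # x_{1:T}, shape (T, H, W, 6)
print(x.shape)  # (8, 256, 256, 6)
```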
FIG. 6 is a flow diagram showing an algorithm, indicated generally by the reference numeral 60, in accordance with an example embodiment.
The algorithm 60 starts at operation 62, where a first plurality of sequential source face coordinate images (e.g. NMFC images) are generated. The first plurality of sequential source face coordinate images comprise face coordinate images and gaze tracking images of the first source subject that are generated based on a plurality of source input images of a first source subject (e.g. the source actor 12). The operation 62 may be implemented by the first module 22 of the system 20 described above with reference to FIG. 2. The operation 62 may be implemented using the techniques described with reference to the facial reconstruction and tracking stage 52 as described above. The first plurality of source face coordinate images may comprise source identity parameters and source expression parameters. The source expression parameters are represented as offsets from the source identity parameters.
At operation 64 (which may be carried out before, after, or at the same time as the operation 62), a plurality of sequential target face coordinate images (e.g. NMFC images) are generated. The plurality of sequential target face coordinate images comprises face coordinate images and gaze tracking images of the first target subject that are generated based on a plurality of target input images of a first target subject (e.g. the target 14). The operation 64 may be implemented by the first module 32 of the system 30 described above with reference to FIG. 3. The operation 64 may be implemented using the techniques described with reference to the facial reconstruction and tracking stage 52 as described above. The target face coordinate images may comprise target identity parameters and target expression parameters. The target expression parameters may be represented as offsets from the target identity parameters.
The plurality of source face coordinate images may further comprise source imaging parameters (e.g. camera parameters or imaging parameters (P) in equation (2) described above). The source imaging parameters may represent at least one of a rotation, translation, and/or orthographic scale of the source subject in the plurality of source input images. Similarly, the plurality of target face coordinate images may further comprise target imaging parameters (e.g. camera parameters or imaging parameters (P) in equation (2) described above). The target image parameters may represent at least one of a rotation, translation, and/or orthographic scale of the target subject in the plurality of target input images.
At operation 68, a plurality of sequential output images are generated. The plurality of sequential output images are based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters. As described in detail below, the plurality of sequential output images may be generated using a first neural network. Operation 68 may be implemented as part of the video rendering stage 56 described above with reference to FIG. 5.
The neural network used to generate the output images in the operation 68 described above may be trained in an operation 66 of the algorithm 60. The operation 66 is shown in dotted form between the operations 64 and 68 by way of example only. The operation 66 is optional, since the neural network may be trained at other times. The training of the first neural network at operation 66 may be performed at the conditioning image generation stage 54 described above with reference to FIG. 5.
FIG. 7 is a block diagram of a system, indicated generally by the reference numeral 70, in accordance with an example embodiment. The system 70 may be used to implement the operation 68 or the algorithm 60 described above.
The system 70 comprises a first neural network 72. As shown in the system 70, the first neural network 72 receives source face coordinate and gaze tracking images 73 (see the operation 62 of the algorithm 60 described above) and target face coordinate and gaze tracking images 74 (see the operation 64 of the algorithm 60 described above), and generates output images (see the operation 68 of the algorithm 60 described above).
The target face coordinate images 74 may be used as a conditional input to the first neural network 72. The output images may comprise a plurality of sequential output images that may be based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters.
As described above, the algorithm 60 may include the training of the first neural network 72. For example, the first neural network may be trained with the plurality of source face coordinate images and the plurality of target face coordinate images. The plurality of source face coordinate images may comprise face coordinate images of the source actor 12, and the plurality of target face coordinate images may comprise face coordinate images of the target actor 14.
In an example embodiment, the first neural network 72 may be trained, for example, with a plurality of source face coordinate images associated with a plurality of source subjects. As such, the first neural network 72 may be trained with data from a wider range of source subjects, which may allow the neural network to be used for facial re-enactment of the target actor even when the data available for the target actor (e.g. an input video of the target actor) is limited (e.g. a short video of 50 to 500 frames of the target actor's face).
FIG. 8 is a data structure 80 in accordance with an example embodiment. The data structure 80 shows example parameters of the face coordinate images described above (e.g. the source face coordinate images and the target face coordinate images). The data structure 80 of a face coordinate image may comprise identity parameters 82, expression parameters 84, and, optionally, imaging parameters 86. In an example embodiment, the first plurality of source face coordinate images comprises source identity parameters and source expression parameters, wherein the source expression parameters are represented as offsets from the source identity parameters. The target face coordinate images comprise target identity parameters and target expression parameters, wherein the target expression parameters are represented as offsets from the target identity parameters.
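A minimal sketch of how the parameters of the data structure 80 might be held in code is shown below; the field sizes and the dataclass layout are assumptions rather than a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

# Sketch of the per-frame parameters behind the face coordinate images of
# FIG. 8: identity parameters (fixed per subject), expression parameters
# (offsets from the identity shape) and optional imaging parameters
# (rotation, translation, orthographic scale). All sizes are assumptions.
@dataclass
class FaceCoordinateParams:
    identity: np.ndarray                       # s_id, shared across all frames
    expression: np.ndarray                     # s_exp,t, varies per frame
    rotation: Optional[np.ndarray] = None      # imaging: 3D rotation
    translation: Optional[np.ndarray] = None   # imaging: translation
    scale: Optional[float] = None              # imaging: orthographic scale

frame_params = FaceCoordinateParams(
    identity=np.zeros(157),
    expression=np.zeros(28),
    rotation=np.eye(3),
    translation=np.zeros(2),
    scale=1.0,
)
print(frame_params.expression.shape)
```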
FIG. 9 is a block diagram showing an example of a structure of the neural network 72 described above in accordance with an example embodiment. The neural network 72 comprises a generator module 92 and at least one discriminator module 94 that may collectively form a generative adversarial network (GAN). For example, a GAN framework may be adopted for video translation, where the generator G is trained in an adversarial manner, alongside an image discriminator D_I and a multiscale dynamics discriminator D_D, which ensures that the generated video (e.g. sequential output images) is realistic, temporally coherent and conveys the same dynamics as the target actor's input video. In an example embodiment, a further mouth discriminator D_M may be used for improved visual quality of the mouth area. The training of the neural network 72 at operation 66 may be implemented using the generative adversarial network.
For example, the source face coordinate images (e.g. as generated at system 20, at operation 62) and the target face coordinate images (e.g. as generated at system 30, at operation 64) may be provided as inputs to the generator module 92. The generator module 92 may further receive source gaze tracking images and target gaze tracking images. The generator module 92 may provide output images comprising a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters. The discriminator module 94 may receive ground truth images of the target actor as inputs, and also receive as input the output images generated by the generator module 92. The at least one discriminator module 94 may generate at least one loss function respectively based on the one or more ground truth inputs and the plurality of output images generated by the generator module 92. For example, the loss function may indicate how realistic the output images may be perceived to be. The loss function is then provided as feedback to the generator module 92, such that the following output images are generated based at least partially on the loss function and the previous output images. As such, the loss function may be used for the training of the generator module 92, such that the output of the generator module 92 is improved in subsequent iterations.
In an example embodiment, the neural network 72 (e.g. implemented by the generative adversarial network) may be trained and/or initialized based on one or more ground truth inputs. The neural network 72 may receive one or more ground truth inputs related to the target subject, and may then update the generative adversarial network based on the at least one loss function generated by the discriminator module.
In an example embodiment, dependence of the output frames on previous frames may be modelled by conditioning synthesis of the t-th frame ŷ_t on the conditional input x_t, the previous input images x_{t-1} and x_{t-2}, as well as the previously generated output frames ŷ_{t-1} and ŷ_{t-2}, as shown in equation 4 below:

ŷ_t = G(x_{t-2:t}, ŷ_{t-2:t-1})    (4)

In an example embodiment, the generator module 92 may generate output frames sequentially, such that the frames are produced one after the other, until the entire output sequence has been produced.
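The sequential synthesis of equation (4) may be sketched as follows; the single convolution stands in for the full generator, and the zero-initialised previous frames and the clamping of the input window at the start of the sequence are assumptions.

```python
import torch

# Sketch of the recurrence of equation (4): each output frame ŷ_t is produced
# from the conditional inputs x_{t-2:t} and the two previously generated
# frames ŷ_{t-2:t-1}. A tiny convolution stands in for the real generator.
T, H, W = 10, 64, 64
generator = torch.nn.Conv2d(3 * 6 + 2 * 3, 3, kernel_size=3, padding=1)

x = torch.rand(T, 6, H, W)                               # conditional inputs x_1..x_T
y_prev = [torch.zeros(3, H, W), torch.zeros(3, H, W)]    # ŷ_{t-2}, ŷ_{t-1} (zero start)
outputs = []
for t in range(T):
    x_window = [x[max(t - 2, 0)], x[max(t - 1, 0)], x[t]]   # x_{t-2:t}, clamped at start
    inp = torch.cat(x_window + y_prev, dim=0).unsqueeze(0)  # 1 x 24 x H x W
    y_t = torch.tanh(generator(inp)).squeeze(0)             # ŷ_t in [-1, 1]
    outputs.append(y_t)
    y_prev = [y_prev[1], y_t]                               # slide the window

video = torch.stack(outputs)   # ŷ_{1:T}, shape (T, 3, H, W)
print(video.shape)
```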
FIG. 10 shows features of an example discriminator module 94 in accordance with an example embodiment. The discriminator module 94 comprises at least one of an image discriminator 102, a mouth discriminator 104, and a dynamics discriminator 106. In an example embodiment, the image discriminator 102 may be used for distinguishing ground truth images from generated output images. The mouth discriminator 104 may be provided to distinguish between cropped mouth-area ground truth images and generated output images. The mouth discriminator 104 may be provided as a separate module from the image discriminator 102, as providing focussed training for the mouth region may provide more realistic results for the output images. The dynamics discriminator 106 may be provided to distinguish between sequential ground truth images (e.g. ground truth videos) and generated sequential output images (e.g. an output video). For example, the dynamics discriminator 106 may comprise a temporal discriminator that is trained to detect videos with unrealistic temporal dynamics (e.g. ensuring that the generated video is realistic, temporally coherent and conveys the same dynamics as the target video).

FIG. 11 is a block diagram of a system, indicated generally by the reference numeral 110, in accordance with an example embodiment. The system 110 may be an example implementation of the generator module 92. The system 110 may comprise encoding pipelines 111 and 112, and a decoding pipeline 116. The encoding pipeline 111 may receive concatenated NMFC images 113 and eye gaze images 114 (x_t), and the encoding pipeline 112 may receive at least two previously generated output images (ŷ_{t-1}, ŷ_{t-2}).
The encoding pipelines 111 and 112 may each comprise a convolution, batch normalization, and rectified linear unit module, a downsampling module (e.g. 3x), and a residual module (e.g. 4x). The resulting encoded features of the encoding pipelines 111 and 112 may be added and passed through the decoding pipeline 116, which provides the output ŷ_t in a normalised [-1, +1] range, using a tanh activation. For example, the decoding pipeline may comprise a residual module (e.g. 5x), an upsampling module (e.g. 3x), and a convolution-tanh module.
In an example embodiment, c_k may denote a 7x7 Convolution2D, BatchNorm, ReLU layer with k filters and unitary stride; d_k may denote a 3x3 Convolution2D, BatchNorm, ReLU layer with k filters and stride 2; and u_k may denote a 3x3 transpose Convolution2D, BatchNorm, ReLU layer with k filters and a stride of 1/2. The encoding architecture may therefore be c64, d128, d256, d512, R512, R512, R512, R512, and the decoding architecture may be R512, R512, R512, R512, R512, u256, u128, u64, conv, tanh.
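For illustration only, the c_k / d_k / u_k / R_k notation above may be realised in PyTorch roughly as follows. Only one encoding pipeline is shown (the embodiment sums the features of two such pipelines before decoding), and the 6-channel input and 256x256 resolution are assumptions.

```python
import torch
from torch import nn

# Sketch of the encoding/decoding blocks: c_k (7x7 conv), d_k (3x3 stride-2
# conv) and u_k (3x3 stride-1/2 transpose conv), each followed by BatchNorm
# and ReLU, plus R_k as a standard residual block with k channels.
def c(in_ch, k):   # c_k
    return nn.Sequential(nn.Conv2d(in_ch, k, 7, stride=1, padding=3),
                         nn.BatchNorm2d(k), nn.ReLU(inplace=True))

def d(in_ch, k):   # d_k
    return nn.Sequential(nn.Conv2d(in_ch, k, 3, stride=2, padding=1),
                         nn.BatchNorm2d(k), nn.ReLU(inplace=True))

def u(in_ch, k):   # u_k
    return nn.Sequential(nn.ConvTranspose2d(in_ch, k, 3, stride=2,
                                            padding=1, output_padding=1),
                         nn.BatchNorm2d(k), nn.ReLU(inplace=True))

class R(nn.Module):  # residual block R_k
    def __init__(self, k):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(k, k, 3, padding=1), nn.BatchNorm2d(k),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(k, k, 3, padding=1), nn.BatchNorm2d(k))
    def forward(self, x):
        return torch.relu(x + self.body(x))

# Encoder: c64, d128, d256, d512, R512 x4.  Decoder: R512 x5, u256, u128, u64, conv, tanh.
encoder = nn.Sequential(c(6, 64), d(64, 128), d(128, 256), d(256, 512),
                        *[R(512) for _ in range(4)])
decoder = nn.Sequential(*[R(512) for _ in range(5)], u(512, 256), u(256, 128),
                        u(128, 64), nn.Conv2d(64, 3, 7, padding=3), nn.Tanh())

y = decoder(encoder(torch.rand(1, 6, 256, 256)))
print(y.shape)   # torch.Size([1, 3, 256, 256])
```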
In an example embodiment, the generator module 92 receives a loss function from the discriminator module 94, which discriminator module may comprise the image discriminator 102, the mouth discriminator 104, and the dynamics discriminator 106. The image discriminator 102 (D_I) and the mouth discriminator 104 (D_M) may learn to distinguish real frames from synthesized frames (e.g. generated output images). During training, a time step t' in the range [1, T] may be uniformly sampled. The real pair (x_{t'}, y_{t'}) and the generated pair (x_{t'}, ŷ_{t'}) are provided as inputs to the image discriminator D_I. The corresponding mouth regions y^m_{t'} and ŷ^m_{t'} may be cropped and provided as inputs to the mouth discriminator 104 (D_M). In one example, in order to enable the generator module to create high-frequency details in local patches of the frames, a Markovian discriminator architecture (PatchGAN) may be used.
In an example embodiment, the dynamics discriminator 106 (D_D) may be trained to detect videos with unrealistic temporal dynamics. The dynamics discriminator 106 may comprise a network that receives a set of K = 3 consecutive real frames y_{t':t'+K-1} (e.g. ground truth frames) or generated frames ŷ_{t':t'+K-1} (e.g. synthesized frames) as inputs (e.g. the real or generated frames may be randomly drawn from the video). In addition to being conditioned on short video clips of length K, given the optical flow w_{1:T-1} of the ground truth video y_{1:T}, the dynamics discriminator 106 may further ensure that the flow w_{t':t'+K-2} corresponds to the given video clip. Therefore, the dynamics discriminator 106 should learn to identify the pair (w_{t':t'+K-2}, y_{t':t'+K-1}) as real and the pair (w_{t':t'+K-2}, ŷ_{t':t'+K-1}) as generated. In one example, a multiple-scale dynamics discriminator may be employed, which may perform the tasks described above at three different temporal scales. The first-scale dynamics discriminator receives sequences at the original frame rate. The two extra scales are then formed by choosing not subsequent frames in the sequence, but subsampling the frames by a factor of K for each scale.
In an example embodiment, the objective of the GAN-based framework can be expressed as an adversarial loss. For example, the Least Squares Generative Adversarial Networks (LSGAN) loss may be used, such that the adversarial objective of the generator may be given by equation 5 below, with a corresponding term for each of the discriminators (the image discriminator D_I is shown here):

L_adv(G) = E_t[(D_I(x_t, ŷ_t) - 1)^2]    (5)

In one example, two more losses may be added in the learning objective function of the generator, such as a VGG loss L_VGG and a feature matching loss L_feat, which losses may be based on the discriminators. Given a ground truth frame y_t and the synthesised frame ŷ_t, the VGG network may be used for extracting image features in different layers of these frames. The loss is then computed based on equation 6 provided below:

L_VGG(y_t, ŷ_t) = Σ_i 1/(H_i W_i C_i) ||F^(i)(y_t) - F^(i)(ŷ_t)||_1    (6)

with F^(i) being the i-th layer of VGG, with a feature map of size H_i x W_i x C_i. Similarly, the discriminator feature matching loss may be computed by extracting features with the two discriminators 102 (D_I) and 106 (D_D) and computing the l1 distance of these features for a generated frame and the corresponding ground truth frame. Therefore, the total objective for G may be given by equation (7) below:

L(G) = L_adv(G) + λ_VGG L_VGG + λ_feat L_feat    (7)

The image, mouth, and dynamics discriminators 102, 104, and 106 may be optimised under their corresponding adversarial objective functions. The image discriminator 102 may be optimized under the objective given by equation 2a below:

L(D_I) = E_t[(D_I(x_t, y_t) - 1)^2] + E_t[(D_I(x_t, ŷ_t))^2]    (2a)

The mouth discriminator 104 may be optimized under a similar adversarial loss, which results by replacing the whole images in equation (2a) with the cropped mouth areas, as shown in equation 3a below:

L(D_M) = E_t[(D_M(x^m_t, y^m_t) - 1)^2] + E_t[(D_M(x^m_t, ŷ^m_t))^2]    (3a)

The image cropping may be performed around the centre of the mouth on the real y_t, generated ŷ_t and NMFC images, thus providing images of size 64x64.
The dynamics discriminator 106 is trained to minimize the adversarial loss according to equation 4a below:

L(D_D) = E_t'[(D_D(w_{t':t'+K-2}, y_{t':t'+K-1}) - 1)^2] + E_t'[(D_D(w_{t':t'+K-2}, ŷ_{t':t'+K-1}))^2]    (4a)

In an example embodiment, in order to generate realistic dynamics (e.g. optical facial flow), the dynamics discriminator D_D (106) may be conditioned on a very accurate facial flow of the target video. Since human facial performances exhibit non-rigid and composite deformations as a result of very complex facial muscular interactions, their flow may be captured by using target facial videos to train a specific network for this task. For example, optical flow estimation may be performed using a network pretrained on publicly available images. The pretrained models may be used, and their network may be fine-tuned on further datasets (e.g. the 4DFAB dataset) comprising dynamic high-resolution 4D videos of subjects eliciting spontaneous and posed facial behaviours. To create the ground truth 2D flow, the provided camera parameters of the acquisition device (the device used for obtaining the target video or source video) may be used, and the 3D scans (e.g. 3D morphable models) of around 750K frames may be rasterized so that the difference between each pair of consecutive frames represents the 2D flow. For the background of the images, the 2D flow estimates of the same 750K frames may be generated using the original FlowNet2, and a masked End-Point-Error (EPE) loss may be used so that the background flow stays the same and the foreground follows the ground truth 2D flow coming from the 4DFAB dataset.
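The LSGAN terms of equations (2a) and (5) may be sketched as follows; the 1x1 convolution is a placeholder for a PatchGAN-style discriminator, and the tensor shapes are assumptions.

```python
import torch

# Sketch of the LSGAN objectives of equations (2a) and (5): the discriminator
# pushes real pairs towards 1 and generated pairs towards 0, while the
# generator pushes its outputs towards 1. `disc` is any network returning a
# realism map/score for a (conditional input, frame) pair.
def discriminator_loss(disc, x, y_real, y_fake):
    real_score = disc(torch.cat([x, y_real], dim=1))
    fake_score = disc(torch.cat([x, y_fake.detach()], dim=1))
    return ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()

def generator_adv_loss(disc, x, y_fake):
    fake_score = disc(torch.cat([x, y_fake], dim=1))
    return ((fake_score - 1) ** 2).mean()

# Toy usage with a 1x1 convolution standing in for a PatchGAN discriminator.
disc = torch.nn.Conv2d(6 + 3, 1, kernel_size=1)
x = torch.rand(2, 6, 64, 64)
y = torch.rand(2, 3, 64, 64)
y_hat = torch.rand(2, 3, 64, 64)
print(discriminator_loss(disc, x, y, y_hat).item(),
      generator_adv_loss(disc, x, y_hat).item())
```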
FIG. 12 is a flow diagram of an algorithm, indicated generally by the reference numeral 120, in accordance with an example embodiment. The algorithm 120 starts at operation 122, where facial reconstruction and tracking is carried out (e.g. using the systems 20 and 30 described above). At operation 124, conditioning images are generated based on the outputs of the operation 122. At operation 126, video rendering initialization is performed for initializing a neural network (e.g. the neural network 72) used for rendering output images (e.g. an output video). Finally, at operation 128, the neural network (e.g. the video rendering network 72) is updated based, at least in part, on the conditioning images generated in the operation 124, and on at least one loss function generated by at least one discriminator.
In an example embodiment, the facial reconstruction and tracking operation 122 may be performed similarly to operation 52 described above. A reliable estimation of the human facial geometry may be generated, for example, capturing the temporal dynamics while being separable in each frame into the identity and expression contributions of the videoed subject.
In an example embodiment, a sparse landmark-based method (e.g. extracting 68 landmarks from each frame) may be adopted, which method may capitalise on the rich temporal information in videos while performing the facial reconstruction. While carrying out the 3D facial reconstruction, a scaled orthographic projection (SOP) may be postulated, e.g. with the assumption that in each video the identity parameters s_id are fixed (yet unknown) throughout the entire video, while letting the expression parameters s_exp,t as well as the camera parameters (e.g. imaging parameters, scale, 3D pose) differ among frames. For a given sequence of frames, an energy equation that consists of three terms, as demonstrated in equation (2b), may be minimized. The terms may comprise 1) a data term penalising the l2 norm error of the projected landmarks over all frames (E_l), 2) a shape regularisation term (E_pr) that reinforces a quadratic prior over the identity and per-frame expression parameters, and 3) a temporal smoothness term (E_sm) that supports the smoothness of the expression parameters throughout the video, by employing a quadratic penalty on the second temporal derivatives of the expression vector. Equation 2b is provided below:

E(S, P) = E_l(S, P) + λ_pr E_pr(S) + λ_sm E_sm(S)    (2b)

In an example, gross occlusions may be handled by imposing box constraints on the identity and per-frame expression parameters. After estimating the camera parameters (P) in equation (2b) in an initialisation stage, the minimisation of the loss function may lead to a large-scale least squares problem with box constraints, which may be addressed by using the reflective Newton method.
In an example embodiment, training the first neural network may comprise determining an identity feature vector of the target subject based at least in part on the target face coordinate images and an identity coefficient of the target subject, wherein the generator module generates the plurality of output images based at least in part on the identity feature vector of the target subject. The identity feature vector may be determined by concatenating the target face coordinate images and an output of the identity embedder.
FIG. 13 is a flow diagram of an algorithm, indicated generally by the reference numeral 130, in accordance with an example embodiment. The algorithm 130 comprises generating the plurality of output images using the trained first neural network. The algorithm 130 starts at operation 132 for initializing the generator module 92. Then, at operation 134, at least one discriminator module 94 is initialized. Finally, the source expression parameters and the source gaze tracking images are mapped on the target identity parameters based on at least one determined identity feature vector of the target subject. The identity feature vector is described in further detail below.
FIG. 14 is a block diagram of a system, indicated generally by the reference numeral 140, in accordance with an example embodiment. System 140 shows a framework used during the initialization training stage. The framework comprises an identity embedder 144 receiving images 141 as inputs, a generator module 145 (similar to the generator module 92) receiving NMFC images 142 and previously generated (fake) frames 143, and at least one discriminator module comprising the dynamics discriminator 149, the image discriminator 150, and the mouth discriminator 151. The generator module 145 comprises encoding pipelines 146 and 147 and a decoding pipeline 148. Synthesis is conditioned both on NMFC images (142) and on frames generated (143) in previous time steps. The identity feature vector h_i computed by the identity embedder (identity embedding network) 144 is concatenated with the identity coefficients of person i, coming from the 3DMM, to form the final identity feature vector ĥ_i. This is then injected into the generator 145 through the adaptive instance normalization layers.
In an example embodiment, the neural network (e.g. video rendering network) initialization operation 126 may comprise a learning phase using a multi-person dataset with N identities. For example, the neural network is trained with a plurality of images of a plurality of source actors. Let y^i_{1:T} be a video of the i-th target person in the multi-person dataset, with removed background, and m^i_{1:T} the corresponding foreground masks, extracted from the original video. The facial reconstruction method described above may be used to obtain the corresponding NMFC frame sequence for each identity i = 1, ..., N. Then, the neural network (e.g. generative adversarial network G) aims to learn a mapping from this conditioning representation to a photo-realistic and temporally coherent video ŷ^i_{1:T}, as well as a prediction of the foreground mask m̂^i_{1:T}. In one example, the learning may be performed in a self-reenactment setting, thus the generated video should be a reconstruction of y^i_{1:T}, which serves as a ground truth.
In addition to face expression and head pose, further synthesis may be conditioned based on identity details of the target person i. Although the NMFC sequence contains identity-related information coming from the 3DMM, such as the head shape and geometry, the NMFC images may lack information regarding skin texture, hair, or upper body details. The identity may thus be learnt by incorporating an identity embedding network E_id (144) into the neural network (rather than learning identity from the NMFC input). The identity embedding network 144 learns to compute an identity feature vector h_i, which is passed to the generator to condition synthesis. The system 140 further comprises an image discriminator 150, a dynamics discriminator 149 and a dedicated mouth discriminator 151, which may ensure that the generated video looks realistic and temporally coherent.
In an example embodiment, the identity embedder may be used such that, given a target person i, M frames are randomly selected from the training video y^i_{1:T}. Each frame may be passed to the embedder, such that the identity feature vector is computed according to equation 4b below:

h_i = (1/M) Σ_{k=1}^{M} E_id(y^i_{t_k})    (4b)

Picking random frames from the training sequence i and averaging the embedding vectors may automatically render h_i independent from the person's head pose appearing in the M random input images. The identity coefficient vector s^i_id computed during 3D face reconstruction may be used by concatenating the two vectors to get the final identity feature vector for person i: ĥ_i = [h_i, s^i_id]. The generator of system 140 conditions on two separate information sources: a) x^i_t, the head pose and expression representation in the NMFC images 142, and b) ĥ_i, the identity representation (retrieved from the identity embedder 144) of the target person. A sequential generator may be employed such that frames and masks are produced one after the other. The dependence of the synthesized frame ŷ^i_t and the predicted foreground mask m̂^i_t on past time steps is modelled by conditioning synthesis on the two previously generated frames ŷ^i_{t-2:t-1}. The synthetic frame and the hallucinated foreground mask at time t may be computed by the generator, using equation 5b below:

(ŷ^i_t, m̂^i_t) = G(x^i_{t-2:t}, ŷ^i_{t-2:t-1}, ĥ_i)    (5b)

where x^i_{t-2:t} is the conditional input, corresponding to the current step and the two previous ones, while ĥ_i is the final identity feature vector, which is kept fixed for the synthesis of the entire sequence.
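Equation (4b) and the subsequent concatenation with the 3DMM identity coefficients may be sketched as follows; the embedder network, the embedding size and the coefficient size are placeholders.

```python
import torch

# Sketch of equation (4b): the identity embedding h_i is the average of the
# embedder outputs over M randomly chosen training frames of person i, and is
# then concatenated with the 3DMM identity coefficients s_id to give the
# final identity feature vector. The embedder here is a stand-in network.
embedder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))

M = 8
frames = torch.rand(M, 3, 64, 64)    # M random frames of person i
h_i = embedder(frames).mean(dim=0)   # average embedding, shape (128,)

s_id = torch.rand(157)               # 3DMM identity coefficients (assumed size)
h_final = torch.cat([h_i, s_id])     # final identity feature vector
print(h_final.shape)                 # torch.Size([285])
```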
The generator 145 may comprise two identical encoding blocks 146 and 147, and one decoding block 148 (similar to the generator modules 92 and 110). The conditional input x^i_{t-2:t} may be provided to the encoding pipeline 146, and the previously generated frames ŷ^i_{t-2:t-1} may be provided to the encoding pipeline 147. Then, their resulting features are summated and passed to the decoding pipeline 148, which outputs the generated frame ŷ^i_t and mask m̂^i_t. The identity feature vector ĥ_i may be injected into the generator through its normalization layers. For example, a layer modulation method may be applied to all normalisation layers, both in the encoding and decoding blocks of the generator G (145). More specifically, given a normalization layer in G with input x, the corresponding modulated output is calculated according to equation 6b below:

norm_mod(x) = (W_γ ĥ_i) ⊙ norm(x) + W_β ĥ_i    (6b)

where W_γ and W_β are the learnable matrices, which project the identity feature vector to the affine parameters of the normalization layer.
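A sketch of the modulated normalisation of equation (6b) is shown below; the use of a parameter-free instance normalisation and the dimensions are assumptions.

```python
import torch
from torch import nn

# Sketch of the modulated normalisation of equation (6b): the identity
# feature vector h is projected by learnable matrices to per-channel scale
# and shift parameters, which modulate a parameter-free normalisation of the
# layer input x, i.e. out = gamma(h) * norm(x) + beta(h).
class IdentityModulatedNorm(nn.Module):
    def __init__(self, channels: int, id_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Linear(id_dim, channels)
        self.to_beta = nn.Linear(id_dim, channels)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        gamma = self.to_gamma(h).view(1, -1, 1, 1)
        beta = self.to_beta(h).view(1, -1, 1, 1)
        return gamma * self.norm(x) + beta

layer = IdentityModulatedNorm(channels=512, id_dim=285)
out = layer(torch.rand(1, 512, 32, 32), torch.rand(285))
print(out.shape)
```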
In an example embodiment, the image discriminator 150 and the mouth discriminator 151 may be used for increasing the photo-realism of generated frames, as they learn to distinguish real from fake images. Given a uniformly sampled time step t ∈ [1, T], the convolutional part of the image discriminator 150 (D_I) receives the real frame 156 (y^i_t) or the synthesized frame 155 (ŷ^i_t), along with the corresponding conditional input 157 (x^i_t), and computes a feature vector d. The image discriminator 150 may further keep a learnable matrix W, with each one of its rows representing a different person in the dataset. Given the identity index i, the image discriminator 150 may choose the appropriate row w_i. Then, the realism score for person i may be calculated according to equation 7b below:

r = w_i^T d + w_0^T d + c    (7b)

where w_0 and c may be identity-independent learnable parameters of the image discriminator 150. In this way, r may reflect whether or not the head in the input frame is real and belongs to identity i and, at the same time, corresponds to the NMFC conditional input x^i_t. By selecting, at each iteration, the appropriate row vector of W, given the identity index i, the vectors w_i acquire person-specific values during training, resembling identity embeddings.
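The identity-conditioned realism score of equation (7b) may be sketched as follows; the feature dimension, the number of identities and the random initialisation are assumptions.

```python
import torch
from torch import nn

# Sketch of the identity-conditioned realism score of equation (7b): the
# convolutional part of the image discriminator produces a feature vector d,
# and the score combines an identity-specific row w_i of a learnable matrix W
# with identity-independent parameters w_0 and c:  r = w_i^T d + w_0^T d + c.
class ProjectionHead(nn.Module):
    def __init__(self, num_identities: int, feat_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_identities, feat_dim))  # one row per identity
        self.w0 = nn.Parameter(torch.randn(feat_dim))                 # identity independent
        self.c = nn.Parameter(torch.zeros(1))

    def forward(self, d: torch.Tensor, identity_index: int) -> torch.Tensor:
        w_i = self.W[identity_index]
        return w_i @ d + self.w0 @ d + self.c

head = ProjectionHead(num_identities=100, feat_dim=512)
score = head(torch.rand(512), identity_index=3)
print(score.shape)
```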
In an example embodiment, a mouth discriminator 151 (D_M) may be used to further improve the visual quality of the mouth area, for example, for improved teeth synthesis. This mouth discriminator 151 receives the cropped mouth regions 159 and 158 from the real frame 156 (y_t) or the synthesized frame 155 (ŷ_t) respectively, and may compute the realism score of the mouth area.
In an example embodiment, the dynamics discriminator 149 learns to distinguish between realistic and non-realistic temporal dynamics in videos. A sequence of K consecutive frames 153 (y_{t:t+K-1}) is randomly drawn from the ground truth video and provided to the dynamics discriminator 149, and a sequence of K frames 152 (ŷ_{t:t+K-1}) is randomly drawn from the generated video and also provided to the dynamics discriminator 149. The dynamics discriminator 149 may also receive and observe the optical flow 154 (w_{t:t+K-2}) extracted from the real frame sequence 153.
Therefore, the realism score reflects whether or not the optical flow agrees with the motion in the short input video, forcing the generator to synthesize temporally coherent frames.
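The following sketch shows one plausible way a dynamics discriminator could consume K frames together with the optical flow of the real sequence; the channel arithmetic and convolutional backbone are assumptions, not the exact architecture of discriminator 149.

```python
import torch
import torch.nn as nn

class DynamicsDiscriminator(nn.Module):
    """Scores the temporal realism of a short clip given the optical flow of the real sequence."""

    def __init__(self, num_frames, base_channels=64):
        super().__init__()
        in_ch = num_frames * 3 + (num_frames - 1) * 2   # K RGB frames plus (K-1) two-channel flow fields
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base_channels, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base_channels, 2 * base_channels, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(2 * base_channels, 1, 4, stride=1, padding=1),  # patch-wise realism map
        )

    def forward(self, clip, flow):
        # clip: (B, K, 3, H, W) real or generated frames; flow: (B, K-1, 2, H, W) from the real frames
        b = clip.shape[0]
        x = torch.cat([clip.reshape(b, -1, *clip.shape[-2:]),
                       flow.reshape(b, -1, *flow.shape[-2:])], dim=1)
        return self.net(x)
```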
In an example embodiment, the objective of the GAN-based framework, as shown by system 140, can be expressed as an adversarial loss. The parameters of both the video generator and the identity embedder are optimized under the adversarial loss L_adv = L_I + L_M + L_D, with each loss term coming from the corresponding discriminator network (e.g. 150, 151 or 149 respectively). An embedding matching loss may be added to the objective of the embedder E, taking advantage of the identity representation W learned by the image discriminator for each identity in the dataset. More specifically, given the person i, the identity feature vector h_E^i computed by the identity embedder and the corresponding row w_i of matrix W, the cosine distance between the identity features may be calculated according to equation 8b below:

L_emb = 1 − (w_i^T · h_E^i) / (‖w_i‖ · ‖h_E^i‖)   (8b)

Then, the total objective of the identity embedder becomes L_E = L_adv + λ_emb · L_emb, which is minimized. In an example embodiment, three more terms may be added in the loss function of the generator, such as a Visual Geometry Group (VGG) loss, a feature matching loss, and a mask reconstruction loss. Given a ground truth frame and the synthesised frame, a pretrained VGG network may be used to compute the VGG loss L_VGG. The feature matching loss L_FM is calculated by extracting features with the image and mouth discriminators and computing the ℓ1 distance between the features extracted from the fake frame and the corresponding ground truth frame. For the mask reconstruction loss, the simple ℓ1 distance between the ground truth foreground mask and the foreground mask predicted by the generator may be computed. The total objective of the generator G can be written as L_G = L_adv + λ_VGG · L_VGG + λ_FM · L_FM + λ_mask · L_mask. Finally, all discriminators are optimised alongside E and G, under the corresponding adversarial objective functions.
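The embedding matching term of equation 8b and the weighted generator objective may be sketched as follows; the helper names and the default λ values are placeholders chosen for the example, not values given in the description.

```python
import torch.nn.functional as F

def embedding_matching_loss(h_E, w_i):
    """Cosine distance between the embedder output and the discriminator's identity row (eq. 8b)."""
    return 1.0 - F.cosine_similarity(h_E, w_i, dim=-1).mean()

def mask_reconstruction_loss(pred_mask, gt_mask):
    """Simple L1 distance between the predicted and ground-truth foreground masks."""
    return F.l1_loss(pred_mask, gt_mask)

def generator_objective(adv_loss, vgg_loss, fm_loss, mask_loss,
                        lambda_vgg=10.0, lambda_fm=10.0, lambda_mask=10.0):
    """Weighted sum of the generator loss terms; the lambda weights are illustrative placeholders."""
    return adv_loss + lambda_vgg * vgg_loss + lambda_fm * fm_loss + lambda_mask * mask_loss
```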
In an example embodiment, the network updating stage at operation 128 comprises fine-tuning the networks (e.g. generator networks and/or discriminator networks) of the framework in system 140 to a new, unseen identity, based at least partially on the network parameters learned from the multiple-person dataset in the previous training stage. In this way, a strong person-specific generator may be obtained using a very small number of training samples, in a few epochs. Given the new short frame sequence y^new_{1:N} (e.g. comprising RGB frames of a new target actor) and the extracted foreground masks m^new_{1:N}, the facial reconstruction model described above may be used for computing the conditional input x_{1:N} (142) to generator 145. Then, each RGB frame may be passed through the identity embedder 144 and the average identity feature may be calculated according to equation 9b below:

h_E^new = (1/N) · Σ_{t=1..N} E(y^new_t)   (9b)

which serves as a representation of the new target actor. Again, h_E^new is concatenated with the identity coefficients of the new target actor from the 3DMM, yielding h^new. After computing h^new, the embedder E may not be further required in the fine-tuning of operation 128.
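A short sketch of computing the new actor's identity representation (equation 9b) is given below, under the same assumptions about the embedder interface as in the earlier sketches.

```python
import torch

@torch.no_grad()
def new_actor_identity(embedder, frames, identity_coeffs):
    """Average the embedder output over the new actor's frames (eq. 9b) and append the
    3DMM identity coefficients; after this step the embedder is no longer required."""
    h_E_new = embedder(frames).mean(dim=0)        # (D,) average identity feature over N frames
    return torch.cat([h_E_new, identity_coeffs])  # h^new used during fine-tuning
```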
In an example embodiment, the operation 128 comprises further steps specific to the new target actor, including a further generator initialization stage (e.g. operation 132), a discriminators initialization stage (e.g. operation 134), a training stage, and a synthesis stage (e.g. operation 136) for the new target actor.
For example, in the generator initialization stage, a generator initialization vector is used to initialize the normalization layers of G. Each normalization layer of G may be replaced with a simple instance normalization layer. Then, the identity projection matrices P_γ and P_β, learned during the first multi-person training stage, are multiplied with h^new, and the resulting vectors γ = P_γ · h^new and β = P_β · h^new are used as an initialization of the modulation parameters of the instance normalization layer. Other parameters of G may be initialized from the values learned in the first multi-person training stage.
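One possible way to carry out this initialization is sketched below, assuming the modulated layer exposes its projection matrices as linear modules (as in the earlier normalization sketch); the attribute names are assumptions.

```python
import torch
import torch.nn as nn

def init_person_specific_norm(mod_layer, h_new):
    """Replace an identity-modulated layer with a plain affine InstanceNorm2d whose scale and
    shift are initialized from P_gamma @ h_new and P_beta @ h_new (a sketch only)."""
    gamma0 = mod_layer.P_gamma(h_new).detach()      # initial per-channel scale
    beta0 = mod_layer.P_beta(h_new).detach()        # initial per-channel shift
    inst = nn.InstanceNorm2d(gamma0.numel(), affine=True)
    with torch.no_grad():
        inst.weight.copy_(gamma0)                   # overwrite the default affine parameters
        inst.bias.copy_(beta0)
    return inst
```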
In the discriminators initialization stage, the matrix W, which contains an identity representation vector for each person in the multi-person dataset, may be replaced with a single vector w, which plays the role of row w_i and is initialized with the values of h_E^new. The convolutional part of the image discriminator is initialized with the values learned from the previous training stage. Similar initialization may be performed for the mouth discriminator and the dynamics discriminator.
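A corresponding sketch for the discriminator initialization, reusing the projection-discriminator attributes assumed in the earlier example, might look as follows.

```python
import torch
import torch.nn as nn

def init_person_specific_discriminator(disc, h_E_new):
    """Collapse the identity matrix W to a single learnable row initialized from h_E_new
    (a sketch only; 'disc.W' follows the attribute naming assumed above)."""
    disc.W = nn.Parameter(h_E_new.detach().clone().unsqueeze(0))  # single row w for the new actor
    return disc
```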
In the training stage, after setting the initial values for the generator and the discriminators from the previous stage, the framework is trained in an adversarial manner, with the generator aiming to learn the mapping from the NMFC sequence x_{1:N} to the RGB video y^new_{1:N} and foreground masks m^new_{1:N}.

In the synthesis stage, the generator 145 can be used to perform source-to-target expression and pose transfer during test time. Given a sequence of frames from the source, first the 3D facial reconstruction is performed, and the NMFC images are computed for each time step, by adapting the reconstructed head shape to the identity parameters of the target person. These NMFC frames are fed to G, which synthesizes the desired photo-realistic frames one after the other.
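The synthesis stage may be sketched end-to-end as follows; reconstruct_3d and render_nmfc are hypothetical stand-ins for the 3D reconstruction and NMFC rasterisation steps described above, and the generator call convention matches the earlier sketches.

```python
import torch

@torch.no_grad()
def reenact(reconstruct_3d, render_nmfc, generator, source_frames,
            target_identity_params, h_new):
    """Test-time source-to-target expression and pose transfer (illustrative sketch only)."""
    nmfc = []
    for frame in source_frames:
        pose, expression = reconstruct_3d(frame)      # source head pose and expression
        # adapt the reconstructed shape to the target's identity parameters
        nmfc.append(render_nmfc(target_identity_params, pose, expression))
    nmfc = torch.stack(nmfc)                          # (T, 3, H, W) conditional NMFC input

    T, C, H, W = nmfc.shape
    x = torch.cat([torch.zeros(2, C, H, W), nmfc])    # pad so that x[t:t+3] is x_{t-2:t}
    prev = torch.zeros(2, 3, H, W)                    # two previously generated frames
    outputs = []
    for t in range(T):
        y_t, _ = generator(x[t:t + 3], prev, h_new)   # frame-by-frame synthesis
        prev = torch.cat([prev[1:], y_t.unsqueeze(0)])
        outputs.append(y_t)
    return torch.stack(outputs)
```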
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside in memory, or on any computer readable media.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagrams of Figures 5, 6, 12, and 13 are examples only and that various operations depicted therein may be omitted, reordered and/or combined.
It will be appreciated that the above described example embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present specification.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof, and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.
It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.

Claims (18)

1. A method comprising:
generating, based on a plurality of source input images of a first source subject, a first plurality of sequential source face coordinate images and gaze tracking images of the first source subject, wherein the first plurality of source face coordinate images comprise source identity parameters and source expression parameters, wherein the source expression parameters are represented as offsets from the source identity parameters;
generating, based on a plurality of target input images of a first target subject, a plurality of sequential target face coordinate images and gaze tracking images of the first target subject, wherein the target face coordinate images comprise target identity parameters and target expression parameters, wherein the target expression parameters are represented as offsets from the target identity parameters; and
generating, using a first neural network, a plurality of sequential output images, wherein the plurality of sequential output images are based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters.

2. A method of claim 1, wherein:
the first plurality of source face coordinate images further comprising source imaging parameters, wherein the source imaging parameters represent at least one of a rotation, translation, and/or orthographic scale of the source subject in the plurality of source input images; and/or
the target face coordinate images further comprising target imaging parameters, wherein the target imaging parameters represent at least one of a rotation, translation, and/or orthographic scale of the target subject in the plurality of target input images.

3. A method of any one of claims 1 and 2, further comprising: training the first neural network with the first plurality of source face coordinate images and the plurality of target face coordinate images.

4. A method of any one of claims 1 and 2, further comprising: training the first neural network with a plurality of source face coordinate images associated with a plurality of source subjects.

5. A method of any one of the preceding claims, wherein the first neural network comprises a generative adversarial network, wherein the generative adversarial network comprises: a generator module for generating the plurality of output images; and at least one discriminator module for generating at least one loss function respectively based on the one or more ground truth inputs and the plurality of output images generated by the generator module.

6. The method of claim 5, further comprising training the generative adversarial network, wherein the training comprises: receiving one or more ground truth inputs related to the target subject; and updating the generative adversarial network based on the at least one loss function.

7. The method of any of claims 5 to 6, wherein the at least one discriminator module comprises at least one of an image discriminator, a mouth discriminator, and a dynamics discriminator.

8. The method of any of the claims 5 to 7, wherein training the first neural network further comprises determining an identity feature vector of the target subject based at least in part on the target face coordinate images and an identity coefficient of the target subject, wherein the generator module generates the plurality of output images based at least in part on the identity feature vector of the target subject.

9. The method of any one of the claims 5 to 8, wherein generating the plurality of output images using the trained first neural network comprises: initializing the generator module; initializing the at least one discriminator module; and mapping the source expression parameters and the source gaze tracking images on the target identity parameters based on at least one determined identity feature vector of the target subject.

10. The method of any one of the preceding claims, wherein the source face coordinate images and the target face coordinate images comprise normalized mean face coordinates, wherein the normalized mean face coordinates comprise an encoding function of a rasterizer and a normalized version of a three dimensional morphable model.

11. The method of any one of the preceding claims, wherein the target face coordinate images are used as a conditional input to the first neural network.

12. A method of training a neural network to generate a plurality of sequential output images, the method comprising:
inputting a first plurality of sequential source face coordinate images and gaze tracking images of a first source subject, wherein the sequential source face coordinate images and gaze tracking images are based on a plurality of source input images of the first source subject;
inputting a plurality of sequential target face coordinate images and gaze tracking images of a first target subject, wherein the plurality of sequential target face coordinate images and gaze tracking images are based on a plurality of target input images of the first target subject;
generating, at a generator module of the neural network, the plurality of sequential output images based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters;
generating, at a discriminator module of the neural network, at least one loss function based on one or more ground truth inputs and the plurality of sequential output images; and
updating one or more parameters of the neural network based on the at least one loss function.

13. The method of claim 12, further comprising: inputting a plurality of source face coordinate images associated with a plurality of source subjects, wherein the plurality of sequential output images is further based on the plurality of source face coordinate images associated with the plurality of source subjects.

14. The method of any one of claims 12 and 13, wherein the discriminator module comprises at least one of an image discriminator, a mouth discriminator, and a dynamics discriminator.

15. The method of any one of claims 12 to 14, further comprising: determining an identity feature vector of the target subject based at least in part on the target face coordinate images and an identity coefficient of the first target subject, wherein the generator module of the neural network generates the plurality of sequential output images based at least in part on the identity feature vector of the target subject.

16. The method of any one of claims 12 to 15, wherein generating the plurality of sequential output images comprises: initializing the generator module of the neural network; initializing the discriminator module of the neural network; and mapping the source expression parameters and the source gaze tracking images on target identity parameters based on at least one determined identity feature vector of the first target subject.

17. A system comprising one or more processors and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform a method according to any preceding claim.

18. A computer program product comprising computer readable instructions that, when executed by a computing system, cause the computing system to perform a method according to any of claims 1 to 16.
GB2007052.0A 2020-05-13 2020-05-13 Facial re-enactment Pending GB2596777A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2007052.0A GB2596777A (en) 2020-05-13 2020-05-13 Facial re-enactment
PCT/CN2021/093530 WO2021228183A1 (en) 2020-05-13 2021-05-13 Facial re-enactment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2007052.0A GB2596777A (en) 2020-05-13 2020-05-13 Facial re-enactment

Publications (2)

Publication Number Publication Date
GB202007052D0 GB202007052D0 (en) 2020-06-24
GB2596777A true GB2596777A (en) 2022-01-12

Family

ID=71135033

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2007052.0A Pending GB2596777A (en) 2020-05-13 2020-05-13 Facial re-enactment

Country Status (2)

Country Link
GB (1) GB2596777A (en)
WO (1) WO2021228183A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184542A (en) * 2020-07-17 2021-01-05 湖南大学 Posture-guided style-preserving human body image generation method
CN113744129A (en) * 2021-09-08 2021-12-03 深圳龙岗智能视听研究院 Semantic neural rendering-based face image generation method and system
CN114373034A (en) * 2022-01-10 2022-04-19 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, image processing device, storage medium, and computer program
CN115984949B (en) * 2023-03-21 2023-07-04 威海职业学院(威海市技术学院) Low-quality face image recognition method and equipment with attention mechanism

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180068178A1 (en) * 2016-09-05 2018-03-08 Max-Planck-Gesellschaft Zur Förderung D. Wissenschaften E.V. Real-time Expression Transfer for Facial Reenactment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013138838A1 (en) * 2012-03-21 2013-09-26 Commonwealth Scientific And Industrial Research Organisation Method and system for facial expression transfer
US9378576B2 (en) * 2013-06-07 2016-06-28 Faceshift Ag Online modeling for real-time facial animation
CN103268623B (en) * 2013-06-18 2016-05-18 西安电子科技大学 A kind of Static Human Face countenance synthesis method based on frequency-domain analysis
CN104008564B (en) * 2014-06-17 2018-01-12 河北工业大学 A kind of human face expression cloning process
US9799096B1 (en) * 2014-07-08 2017-10-24 Carnegie Mellon University System and method for processing video to provide facial de-identification
CN106327482B (en) * 2016-08-10 2019-01-22 东方网力科技股份有限公司 A kind of method for reconstructing and device of the facial expression based on big data
CN108288072A (en) * 2018-01-26 2018-07-17 深圳市唯特视科技有限公司 A kind of facial expression synthetic method based on generation confrontation network
CN109934767A (en) * 2019-03-06 2019-06-25 中南大学 A kind of human face expression conversion method of identity-based and expressive features conversion

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180068178A1 (en) * 2016-09-05 2018-03-08 Max-Planck-Gesellschaft Zur Förderung D. Wissenschaften E.V. Real-time Expression Transfer for Facial Reenactment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2019, IEEE, Any-to-one Face Reenactment Based on Conditional Generative Adversarial Network, pages 1657-1664 *
IEEE Winter Conference on Applications of Computer Vision (WACV) 2020, Published 1 March 2020, IEEE, Kaur Harsimran, Manduchi Roberto EyeGAN: Gaze-Preserving, Mask-Mediated Eye Image Synthesis, pages 299-308 *
IEEE Winter Conference on Applications of Computer Vision (WACV) 2020, Published 1 March 2020, IEEE, Tripathy Soumya et al, ICface: Interpretable and Controllable Face Reenactment Using GANs, pages 3374-3383 *

Also Published As

Publication number Publication date
WO2021228183A1 (en) 2021-11-18
GB202007052D0 (en) 2020-06-24


Legal Events

Date Code Title Description
COOA Change in applicant's name or ownership of the application

Owner name: HUAWEI TECHNOLOGIES CO., LTD

Free format text: FORMER OWNER: FACESOFT LTD.