WO2021177596A1 - Fast bi-layer neural synthesis of one-shot realistic images of neural avatar - Google Patents

Fast bi-layer neural synthesis of one-shot realistic images of neural avatar

Info

Publication number
WO2021177596A1
Authority
WO
WIPO (PCT)
Prior art keywords
texture
image
avatar
neural
network
Application number
PCT/KR2021/000795
Other languages
French (fr)
Inventor
Egor Olegovich ZAKHAROV
Aleksei Aleksandrovich IVAKHNENKO
Aliaksandra Petrovna SHYSHEYA
Victor Sergeevich LEMPITSKY
Original Assignee
Samsung Electronics Co., Ltd.
Priority claimed from RU2020124828A external-priority patent/RU2764144C1/en
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2021177596A1 publication Critical patent/WO2021177596A1/en

Classifications

    • G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G06T 11/60 - Editing figures and text; Combining figures or text
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V 10/54 - Extraction of image or video features relating to texture
    • G06V 10/82 - Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/161 - Human faces: Detection; Localisation; Normalisation

Abstract

Proposed is a new architecture for neural avatars that improves the state of the art in several aspects. The proposed system creates neural avatars from a single photograph, provides an order-of-magnitude inference speedup over previous neural avatar models, and can scale neural avatar modeling to higher resolutions than the training set used to learn the model. The proposed approach models person appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a relatively small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline and is warped and added to the coarse image to ensure a high effective resolution of the synthesized head views.

Description

FAST BI-LAYER NEURAL SYNTHESIS OF ONE-SHOT REALISTIC IMAGES OF NEURAL AVATAR
The invention relates to the fields of computer graphics, deep learning, adversarial learning, talking head synthesis, neural avatars, neural rendering, face synthesis and face animation.
Personalized neural (head) avatars driven by keypoints or another mimics/pose representation are a technology with manifold applications in telepresence, gaming, AR/VR applications and the special effects industry. Modeling human head appearance is a daunting task. For at least two decades, creating such avatars (talking head models) was done with computer graphics tools using mesh-based surface models and texture maps. The resulting systems fall into two groups. Some are able to model specific people with very high realism after significant acquisition and design efforts are spent on those particular people. Others are able to create talking head models from as little as a single photograph, but fall short of photorealism.
In recent years, neural talking heads have emerged as an alternative to the classic computer graphics pipeline, striving to achieve both high realism and ease of acquisition. The first works required a video or even multiple videos to create a neural network that can synthesize talking head views of a person. Most recently, several works presented systems that create neural avatars from a handful of photographs (few-shot setting) or from as little as a single photograph (one-shot setting), causing both excitement and concerns about the potential misuse of such technology.
Methods for the neural synthesis of realistic talking head sequences can be divided into many-shot methods (i.e. requiring a video or multiple videos of the target person for learning the model) [11,16,18,27] and a more recent group of few-shot/single-shot methods capable of acquiring the model of a person from a single photograph or a handful of photographs [24,28,29]. The proposed method falls into the latter category, as the authors focus on the one-shot scenario (modeling from a single photograph).
Along another dimension, these methods can be divided according to the architecture of the generator network. Thus, several methods [16,24,27,29] use generators based on direct synthesis, where the image is generated using a sequence of convolutional operators, interleaved with elementwise non-linearities, and normalizations. Identity information may be injected into such architecture, either with a lengthy learning (in the many-shot scenario) [16,27] or by using adaptive normalizations conditioned on person embeddings [4,24,29]. The method [29] effectively combines both approaches by injecting identity through adaptive normalizations, and then fine-tuning the resulting generator on the few-shot learning set. The direct synthesis approach for human heads can be traced back to [23] that generated lips of a person (Obama) in the talking head sequence, and further towards first works on conditional convolutional neural synthesis of generic objects such as [2].
The alternative to direct image synthesis is to use differentiable warping [12] inside the architecture. The warping can be applied to one of the frames. The X2Face approach [28] applies warping twice, first from the source image to a standardized image (texture), and then to the target image. The codec avatar system [18] synthesizes a pose-dependent texture for simplified mesh geometry. The MarioNETte system [8] applies warping to the intermediate feature representations. The few-shot video-to-video system [25] combines direct synthesis with the warping of the previous frame in order to obtain temporal continuity. The first-order motion model system [21] learns to warp the intermediate feature representation of the generator based on "unsupervised" keypoints that are learned from data. Beyond heads, differentiable warping has recently been used for face rotation, face normalization and full-body rendering. Earlier, the DeepWarp system [5] used neural warping to alter the appearance of eyes for the purpose of gaze redirection, while other work used neural warping for the resynthesis of generic scenes. The proposed method combines direct image synthesis with warping in a new way, as the authors use an RGB pose-independent texture comprising fine details alongside a coarse-grained pose-dependent RGB component that is synthesized by a neural network.
Existing few-shot neural avatar systems achieve remarkable results but are still limited in two ways. First, they have a limited resolution (up to 256x256 pixels). This limitation stems from the need to collect a large and diverse dataset of in-the-wild videos, which is possible at such a low resolution and much harder at higher ones. Secondly, despite the low resolution, and unlike some of the graphics-based avatars, the neural systems are too slow to be deployed to mobile devices and require a high-end GPU to run in real time. The authors note that most application scenarios of neural avatars, especially those concerned with telepresence, would benefit greatly from the capability to run in real time on a mobile device.
In this invention, the authors address these two limitations of one-shot neural avatar systems and develop an approach that can run at a higher resolution and much faster than previous systems. To achieve this, the authors adopt a bi-layer representation, where the image of an avatar in a new pose is generated by summing two components: a coarse image directly predicted by a rendering network, and a warped texture image. While the warp of the texture is also predicted by the rendering network, the texture itself is estimated at the time of avatar generation and is fixed at runtime. To enable the few-shot capability, the authors use a meta-learning stage on a dataset of videos, where the authors (meta-)train the rendering network, the embedding network, as well as the texture generation network.
The separation of the target frames into two layers makes it possible both to improve the effective resolution and to increase the speed of neural rendering. This is because the authors can use an off-line avatar generation stage to synthesize a high-resolution texture, while at test time both the first component (the coarse image) and the warping of the texture need not contain high-frequency details and can therefore be predicted by a relatively small rendering network. These advantages of the proposed system are validated by extensive comparisons with previously proposed neural avatar systems. The authors also report on a smartphone-based real-time implementation of the proposed system, which was beyond the reach of previously proposed systems.
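For illustration, the bi-layer composition described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the tensor shapes, the bilinear sampling mode and the function names are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def compose_avatar(coarse_lf, texture, warp_field):
    """Bi-layer composition: output = low-frequency image + warped texture.

    coarse_lf:  (B, 3, H, W)   pose-dependent low-frequency image
    texture:    (B, 3, Ht, Wt) pose-independent high-frequency texture
    warp_field: (B, H, W, 2)   sampling grid in [-1, 1] texture coordinates
    """
    # Warp the texture into the image coordinate space (high-frequency layer).
    hf = F.grid_sample(texture, warp_field, mode="bilinear",
                       padding_mode="zeros", align_corners=False)
    # The sum of the two layers gives the final avatar image.
    return coarse_lf + hf

if __name__ == "__main__":
    coarse = torch.rand(1, 3, 256, 256)
    tex = torch.rand(1, 3, 512, 512)   # texture may be higher resolution than the frame
    # Identity grid as a stand-in for the predicted warping field.
    theta = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
    grid = F.affine_grid(theta, size=(1, 3, 256, 256), align_corners=False)
    out = compose_avatar(coarse, tex, grid)
    print(out.shape)  # torch.Size([1, 3, 256, 256])
```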
Proposed is hardware, comprising software products, that performs a method for the generation of photorealistic images of a neural avatar in one-shot mode by the following stages:
a stage of initialization for the creation of the neural avatar, comprising the following steps:
encoding a concatenation of a source image x(s) and a source pose y(s), encoded as a landmark image, into a stack of embeddings ê(s) by the embedder network E;
initializing adaptive parameters from the embeddings ê(s) and decoding an inpainted high-frequency texture X̂ of the source image by the texture generator network G_tex;
creating the neural avatar by initializing the adaptive parameters of the texture generator network G_tex using the embeddings ê(s), and predicting the texture X̂ by the texture generator network G_tex;
a stage of inference for the generation of images of the neural avatar, comprising the following steps:
initializing the adaptive parameters of the inference generator network G_inf using the embeddings ê(s), and using a target pose y(t) to predict a low-frequency component x̂_LF(t) of the image of the avatar and a warping field ω̂(t) by the inference generator network G_inf, which generates the high-frequency component x̂_HF(t) of the image of the avatar x̂(t) by applying the warping field to the texture X̂, namely x̂_HF(t) = ω̂(t) ∘ X̂,
wherein the image of the avatar x̂(t) is computed as the sum of the high-frequency component x̂_HF(t) and the low-frequency component x̂_LF(t), namely x̂(t) = x̂_LF(t) + x̂_HF(t).
Wherein the target pose y(t) is defined by the vector of face keypoint coordinates. The stage of initialization is only done once per avatar. Wherein the texture can be a high-frequency texture. The stage of initialization further comprises updating the high-frequency texture using a texture updater network that is trained to add person-specific details to the texture by observing the mismatch between the source image x(s) and the avatar image x̂(s) for the source pose obtained before the update of the texture. Wherein the warping field ω̂(t) is a mapping between the coordinate spaces of the texture and the image of the avatar. The embedder network E, the texture generator network G_tex and the inference generator network G_inf are trained in an end-to-end fashion. Wherein the method for the generation of photorealistic images of the neural avatar further comprises mapping a real or a synthesized target image, concatenated with the target pose, into realism scores by the discriminator network D. The target pose y(t) is obtained by an external landmark tracking process. The tracking process can be applied to another video sequence of the same or a different person, be driven by the voice signal of the person, or be created in some other way.
The above and/or other aspects will be more apparent by describing exemplary embodiments with reference to the accompanying drawings.
Figure 1 illustrates generation of the output image.
Figure 2 illustrates the general pipeline of the method.
The developed model can be used for the synthesis of artificial images of people, guided by the pose and expression. The model can run on cloud platforms, desktop solutions and mobile devices.
The proposed invention can be realized with a server performing the initialization and a smartphone performing the inference; namely, the output of the initialization component can be transferred to the smartphone.
The model produces a realistic image of a person given a single source image (which is called "one-shot learning") and a set of facial keypoints, which encode the face expression and head rotation ("talking head synthesis"). The key difference from other models is the capability to run in real time on mobile devices. The key novelty of the proposed method is to decompose the output image into low- and high-frequency components. The low-frequency component can therefore be synthesized in real time using conventional approaches, but with a much "faster" model compared to previous work. The high-frequency component is predicted via warping of the texture, with the texture fixed during inference.
This allows "offloading" some of the computation that is conventionally done during inference into a person-specific initialization stage. This stage accepts a single source image of a person and initializes the internal parameters of the model specific to that person. Another novelty is the application of an existing "learned gradient descent" method to the texture, which allows further tailoring it to a particular person during the initialization stage and reducing the identity gap of the produced avatar.
Methods
Video sequences annotated with keypoints and, optionally, segmentation masks are used for training. The t-th frame of the i-th video sequence is denoted as x_i(t), the corresponding keypoints as y_i(t), and the segmentation masks as m_i(t). The index t is used to denote a target frame and s a source frame. Also, all tensors related to generated images are marked with a hat symbol, e.g. x̂_i(t). The spatial size of all frames is assumed to be constant and is denoted as H x W. In some modules, input keypoints are encoded as an RGB image, which is a standard approach in a large body of previous works [8,25,29]; in this application it is called a landmark image. Contrary to these approaches, however, the authors input the keypoints into the inference generator directly as a vector. This significantly reduces the inference time of the method.
Architecture
As illustrated in Fig. 1, the output image is generated in two stages: initialization and inference. During initialization, the authors predict embeddings using a source frame, initialize the adaptive parameters of both the inference and texture generators, and predict a high-frequency texture. The initialization stage is only done once per avatar. During inference, the authors use target keypoints (the target pose) to predict a low-frequency component of the output image and a warping field, which, applied to the texture, gives the high-frequency component. These components, namely the predicted low-frequency image and the warped texture image, are added together to produce the output.
In the proposed approach, the following networks are trained in an end-to-end fashion:
- The embedder network E encodes a concatenation of a source image and a landmark image into a stack of embeddings ê(s), which are used for initialization of the adaptive parameters inside the generators.
- The texture generator network G_tex initializes its adaptive parameters from the embeddings and decodes an inpainted high-frequency component of the source image, which the authors call a texture X̂.
- The inference generator network G_inf maps target poses into images of the avatar x̂(t). This network consists of three parts. A pose embedder part maps a pose vector into a spatial tensor, which is used as an input for the convolutional part. The latter performs upsampling, guided by the adaptive parameters predicted from the embeddings. The output of the convolutional part is split into x̂_LF(t) (a low-frequency layer of the output image), which encodes basic facial features, skin color and light sources, and ω̂(t) (a mapping between the coordinate spaces of the texture and the output image). These outputs are combined in the composing part. The high-frequency layer of the output image is obtained by warping the predicted texture, x̂_HF(t) = ω̂(t) ∘ X̂, and is added to the low-frequency component to produce the image of the avatar:

x̂(t) = x̂_LF(t) + x̂_HF(t).    (1)

- Finally, the discriminator network D, which is a conditional [19] relativistic [14] PatchGAN [11], maps a real or a synthesized target image, concatenated with the target landmark image, into realism scores.
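To make the three-part structure of the inference generator concrete, the following is a much-simplified PyTorch sketch. The layer sizes, the plain (non-adaptive) convolutions, the number of upsampling steps and the 68-keypoint pose vector are illustrative assumptions; the actual generator uses adaptive residual blocks whose parameters are predicted from the embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyInferenceGenerator(nn.Module):
    def __init__(self, n_keypoints=68, base=64, start_size=4, out_size=256):
        super().__init__()
        self.start_size = start_size
        # Pose embedder part: keypoint vector -> small spatial tensor.
        self.pose_mlp = nn.Sequential(
            nn.Linear(2 * n_keypoints, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, base * start_size * start_size))
        # Convolutional part: repeated upsampling (adaptive parameters omitted).
        ups, size = [], start_size
        while size < out_size:
            ups += [nn.Upsample(scale_factor=2),
                    nn.Conv2d(base, base, 3, padding=1), nn.LeakyReLU(0.2)]
            size *= 2
        self.conv_part = nn.Sequential(*ups)
        # Heads: 3 channels for the low-frequency image, 2 for the warp field.
        self.to_lf = nn.Conv2d(base, 3, 3, padding=1)
        self.to_warp = nn.Conv2d(base, 2, 3, padding=1)

    def forward(self, pose, texture):
        b = pose.shape[0]
        h = self.pose_mlp(pose).view(b, -1, self.start_size, self.start_size)
        h = self.conv_part(h)
        x_lf = torch.tanh(self.to_lf(h))
        # Predicted warp expressed as a delta on top of an identity sampling grid.
        delta = self.to_warp(h).permute(0, 2, 3, 1)
        idt = F.affine_grid(
            torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]).repeat(b, 1, 1),
            size=x_lf.shape, align_corners=False)
        x_hf = F.grid_sample(texture, idt + delta, align_corners=False)
        return x_lf + x_hf  # composing part: sum of the two layers

pose = torch.rand(2, 2 * 68)
tex = torch.rand(2, 3, 256, 256)
print(TinyInferenceGenerator()(pose, tex).shape)  # torch.Size([2, 3, 256, 256])
```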
During training, an output image is generated in two stages: person-specific initialization and inference (see Figure 1). During the initialization stage, the authors first input a source image x(s) and a source pose y(s), encoded as a landmark image, into the embedder. The outputs of the embedder are the tensors ê(s), which are used to predict the adaptive parameters of the texture generator and the inference generator. Then, a high-frequency texture X̂ of the source image is synthesized by the texture generator, which concludes the initialization. During the inference stage, the authors only input the corresponding target pose y(t) into the inference generator. It predicts a low-frequency component of the output image x̂_LF(t) directly and a high-frequency component x̂_HF(t) by warping the texture with a predicted field ω̂(t). The image of the avatar x̂(t) is the sum of these two components.
It is important to note that, while the texture generator is explicitly forced to generate only a high-frequency component of the image via the design of the loss functions, the authors do not specifically constrain it to perform inpainting. This behavior is emergent from the fact that two different images with different poses are used for initialization and for loss calculation.
Figure 2 illustrates the general pipeline of the method. The initialization module receives an image of the user. It then takes 100 ms on an NVIDIA GPU to initialize an avatar, i.e. to precompute the weights of the inference generator network and the texture, as well as to adjust the texture. After such initialization, a new image of the avatar for a new pose defined by facial keypoint positions can be obtained by the inference module in a much shorter time (e.g. 42 ms on a mobile Snapdragon 855 GPU).
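The split between the one-time initialization and the lightweight per-frame inference can be sketched as follows. The function names and the toy stand-in networks are assumptions for illustration only; they show the structure of the computation, not the real modules.

```python
import torch
import torch.nn.functional as F

def initialize_avatar(embedder, texture_gen, source_image, source_landmarks):
    """One-time, per-person stage: compute embeddings and a fixed texture."""
    with torch.no_grad():
        embeddings = embedder(torch.cat([source_image, source_landmarks], dim=1))
        texture = texture_gen(embeddings)
    return embeddings, texture      # reused for every subsequent frame

def render_frame(inference_gen, embeddings, texture, target_pose):
    """Per-frame stage: only the small inference generator is evaluated."""
    with torch.no_grad():
        coarse, warp = inference_gen(target_pose, embeddings)
        hf = F.grid_sample(texture, warp, align_corners=False)
    return coarse + hf

# Toy stand-ins so the sketch executes end to end.
embedder = lambda x: x.mean(dim=(2, 3))                       # (B, C)
texture_gen = lambda e: torch.rand(e.shape[0], 3, 512, 512)
def inference_gen(pose, e):
    b = pose.shape[0]
    coarse = torch.zeros(b, 3, 256, 256)
    grid = F.affine_grid(torch.eye(2, 3).unsqueeze(0).repeat(b, 1, 1),
                         size=(b, 3, 256, 256), align_corners=False)
    return coarse, grid

emb, tex = initialize_avatar(embedder, texture_gen,
                             torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
frame = render_frame(inference_gen, emb, tex, torch.rand(1, 136))
print(frame.shape)  # torch.Size([1, 3, 256, 256])
```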
Training process
Multiple loss functions are used for training. The main loss function responsible for the realism of the outputs is trained in an adversarial way [7]. A pixelwise loss is also used to preserve the source lighting conditions, and a perceptual [13] loss to match the source identity in the outputs. Finally, a regularization of the texture mapping adds robustness to the random initialization of the model.
Adversarial loss
Adversarial loss is optimized by both the generator and the discriminator networks. Usually, it resembles a binary classification loss function between real and fake images, which the discriminator is optimized to minimize and the generator to maximize [7]. The authors follow a large body of previous works [1,8,25,29] and use a hinge loss as a substitute for the original binary cross-entropy loss. The authors also perform relativistic realism score calculation [14], following its recent success in tasks such as super-resolution [27] and denoising [15]. This addition is supposed to make the adversarial training more stable [14]. Therefore, the authors use equations (2) and (3) to calculate realism scores for real and fake images respectively, with i_n and t_n denoting the video and frame indices of the n-th mini-batch element and N denoting the mini-batch size:

s_real,n = D(x_{i_n}(t_n), y_{i_n}(t_n)) - (1/N) Σ_m D(x̂_{i_m}(t_m), y_{i_m}(t_m)),    (2)

s_fake,n = D(x̂_{i_n}(t_n), y_{i_n}(t_n)) - (1/N) Σ_m D(x_{i_m}(t_m), y_{i_m}(t_m)).    (3)

Moreover, the authors use the PatchGAN [11] formulation of adversarial learning. In it, the discriminator outputs a matrix of realism scores instead of a single prediction, and each element of this matrix is treated as a realism score for a corresponding patch of the input image. This formulation is also used in a large body of relevant works [8,25,26] and improves the stability of the adversarial training. If the size of the scores matrix is denoted as H_s x W_s, the resulting objectives can be written as follows:

L_adv^D = (1/(N·H_s·W_s)) Σ_n Σ_{h,w} [ max(0, 1 - s_real,n[h,w]) + max(0, 1 + s_fake,n[h,w]) ],    (4)

L_adv^G = -(1/(N·H_s·W_s)) Σ_n Σ_{h,w} s_fake,n[h,w].    (5)
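The adversarial formulation above can be sketched in PyTorch as follows. The toy conditional PatchGAN discriminator and the tensor shapes are assumptions, and the relativistic scores are written in the mini-batch-average form, which is one common reading of [14].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy conditional PatchGAN discriminator: outputs an Hs x Ws matrix of scores.
patch_d = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1))

def relativistic_scores(d, real, fake, cond):
    """Patch scores relative to the mini-batch average of the opposite class."""
    d_real = d(torch.cat([real, cond], dim=1))   # (N, 1, Hs, Ws)
    d_fake = d(torch.cat([fake, cond], dim=1))
    s_real = d_real - d_fake.mean(dim=0, keepdim=True)
    s_fake = d_fake - d_real.mean(dim=0, keepdim=True)
    return s_real, s_fake

def hinge_d_loss(s_real, s_fake):
    # Discriminator pushes real patches above +1 and fake patches below -1.
    return (F.relu(1.0 - s_real) + F.relu(1.0 + s_fake)).mean()

def hinge_g_loss(s_fake):
    # Generator raises the realism scores of its outputs.
    return -s_fake.mean()

real = torch.rand(4, 3, 64, 64)
fake = torch.rand(4, 3, 64, 64)
landmarks = torch.rand(4, 3, 64, 64)             # condition: landmark image
s_r, s_f = relativistic_scores(patch_d, real, fake, landmarks)
print(hinge_d_loss(s_r, s_f).item(), hinge_g_loss(s_f).item())
```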
Equation (4) is the only term that is used for the training of the discriminator. For the generator, the authors also calculate a feature matching loss [26], which has become a standard component of supervised image-to-image translation models. In this objective, the authors want to minimize the distance between the intermediate feature maps of the discriminator, calculated using the corresponding target and generated images. If the features of the discriminator at the k-th spatial resolution H_k x W_k are denoted as D_k, then the feature matching objective can be calculated as follows:

L_FM = Σ_k (1/(H_k·W_k)) || D_k(x(t), y(t)) - D_k(x̂(t), y(t)) ||_1.    (6)
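A sketch of the feature matching term is given below; the discriminator stages and the per-layer averaging are assumptions made so that the example is self-contained.

```python
import torch
import torch.nn as nn

# Discriminator split into stages so intermediate feature maps can be reused.
stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2)),
    nn.Sequential(nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2)),
    nn.Sequential(nn.Conv2d(64, 1, 4, 2, 1)),
])

def features(x):
    feats = []
    for stage in stages:
        x = stage(x)
        feats.append(x)
    return feats  # one feature map per spatial resolution

def feature_matching_loss(target, generated):
    loss = 0.0
    for f_t, f_g in zip(features(target), features(generated)):
        loss = loss + (f_t.detach() - f_g).abs().mean()  # mean L1 per feature map
    return loss

print(feature_matching_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)))
```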
Pixelwise and perceptual losses force the predicted images to match the ground truth, and are respectively applied to the low- and high-frequency components of the output images. Since the usage of pixelwise losses assumes that all pixels in the image are statistically independent, the optimization process empirically leads to blurry images [11], which is most suitable for the low-frequency component of the output. On the contrary, the optimization of a perceptual loss leads to crisper and more realistic images [13], which the authors utilize to train the high-frequency component. If this separation between the components is removed and they are trained jointly via a single objective, the method becomes unstable with respect to architecture choices and even the quality of the images in the dataset, with either the low- or the high-frequency component receiving all the training signal and the other close to none, which leads to suboptimal performance after convergence.
The pixelwise loss is calculated by simply measuring a mean L1 distance between the target image and the low-frequency component:

L_pix = (1/(H·W)) || x(t) - x̂_LF(t) ||_1.    (7)

To calculate the perceptual loss, the authors need to use a stop-gradient operator SG, which prevents gradient flow into the low-frequency component. The input generated image is therefore calculated as follows:

x̂(t) = SG(x̂_LF(t)) + x̂_HF(t).    (8)
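In an automatic-differentiation framework, the stop-gradient operator SG can be realized with detach(); the following sketch (with assumed tensor shapes) shows the pixelwise loss on the low-frequency layer and the composition used as input to the perceptual loss.

```python
import torch

def pixelwise_loss(target, x_lf):
    # Mean L1 distance between the target image and the low-frequency layer.
    return (target - x_lf).abs().mean()

def compose_for_perceptual(x_lf, x_hf):
    # SG(x_lf) + x_hf: losses computed on this image reach only the high-frequency layer.
    return x_lf.detach() + x_hf

target = torch.rand(1, 3, 256, 256)
x_lf = torch.rand(1, 3, 256, 256, requires_grad=True)
x_hf = torch.rand(1, 3, 256, 256, requires_grad=True)
loss = pixelwise_loss(target, x_lf) + compose_for_perceptual(x_lf, x_hf).mean()
loss.backward()
# x_lf receives gradient only from the pixelwise term; x_hf only from the composed image.
print(x_lf.grad.abs().sum() > 0, x_hf.grad.abs().sum() > 0)
```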
Following [8] and [29], the proposed variant of the perceptual loss consists of two components: features evaluated using an ILSVRC (ImageNet) pre-trained VGG19 network [22], and a VGGFace network [20] trained for face recognition. If the intermediate features of these networks are denoted as φ_k and ψ_k, and their spatial sizes as H_k x W_k, the objectives can be written as follows:

L_per^VGG = Σ_k (1/(H_k·W_k)) || φ_k(x(t)) - φ_k(x̂(t)) ||_1,    (9)

L_per^VGGFace = Σ_k (1/(H_k·W_k)) || ψ_k(x(t)) - ψ_k(x̂(t)) ||_1.    (10)
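A sketch of the VGG-based perceptual term is shown below. To keep the example self-contained, the torchvision VGG19 is instantiated without pretrained weights and the chosen layer indices are assumptions; in practice, an ImageNet-pretrained VGG19 and a VGGFace network would supply the features.

```python
import torch
from torchvision.models import vgg19

# Feature extractor: VGG19 conv stack cut at a few intermediate layers.
backbone = vgg19(weights=None).features.eval()   # pretrained weights would be loaded in practice
for p in backbone.parameters():
    p.requires_grad_(False)
FEATURE_LAYERS = {3, 8, 17, 26}                  # assumed relu1_2 .. relu4_4 indices

def vgg_features(x):
    feats = []
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in FEATURE_LAYERS:
            feats.append(x)
    return feats

def perceptual_loss(target, generated):
    loss = 0.0
    for f_t, f_g in zip(vgg_features(target), vgg_features(generated)):
        loss = loss + (f_t - f_g).abs().mean()   # L1 averaged over each feature map
    return loss

print(perceptual_loss(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)))
```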
It is important to note that, contrary to these pairwise losses, the adversarial loss is back-propagated into both the low- and high-frequency components, which leads to better realism and pose preservation in the predicted images.
Texture mapping regularization is proposed to improve the stability of the training. The training signal that the texture generator G_tex receives is first warped by the warping field ω̂(t) predicted by the inference generator. Because of this, random initializations of the networks typically lead to suboptimal textures, in which the face of the source person occupies a small fraction of the total area of the texture. As the training progresses, this leads to a lower effective resolution of the output image, since the optimization process is unable to escape this bad local optimum. The authors address the problem by treating the network's output as a delta to an identity mapping, and by applying a magnitude penalty on that delta in the early iterations. The weight of this penalty is multiplicatively reduced to zero during training, so it does not affect the final performance of the model. More formally, the authors decompose the output warping field into a sum of two terms, ω̂(t) = I + Δω̂(t), where I denotes an identity mapping, and apply a norm penalty, averaged over the number of spatial positions in the mapping, to the second term:

L_reg = (1/(H·W)) Σ_{h,w} || Δω̂(t)[h,w] ||.    (11)
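The regularization can be sketched as follows: the predicted warp is treated as a delta on top of an identity sampling grid, and the penalty weight is multiplicatively decayed (the 0.9-every-50-iterations schedule comes from the implementation details below; the initial weight and the toy update loop are assumptions).

```python
import torch
import torch.nn.functional as F

def identity_grid(batch, height, width):
    theta = torch.eye(2, 3).unsqueeze(0).repeat(batch, 1, 1)
    return F.affine_grid(theta, size=(batch, 1, height, width), align_corners=False)

def regularized_warp(texture, delta):
    """Output warp = identity + delta; penalty on the magnitude of the delta."""
    b, h, w, _ = delta.shape
    grid = identity_grid(b, h, w) + delta
    warped = F.grid_sample(texture, grid, align_corners=False)
    reg = delta.norm(dim=-1).mean()          # magnitude averaged over spatial positions
    return warped, reg

texture = torch.rand(1, 3, 256, 256)
delta = 0.01 * torch.randn(1, 256, 256, 2, requires_grad=True)

reg_weight = 10.0                            # toy initial penalty weight
for step in range(1, 201):
    _, reg = regularized_warp(texture, delta)
    (reg_weight * reg).backward()
    delta.data -= 1e-3 * delta.grad          # toy gradient step on the delta
    delta.grad = None
    if step % 50 == 0:
        reg_weight *= 0.9                    # multiplicative decay of the penalty weight
print(float(reg))
```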
The generator networks, i.e. the image embedder, the texture generator and the inference generator, are jointly trained on a single objective, which is a weighted sum of the objectives (5)-(7), (9)-(11) and, optionally, the segmentation objective described below.
Fine-tuning
Training on person-specific source data leads to a significant improvement in realism and identity preservation of the synthesized images [29], but is computationally expensive if it involves optimization of the model's parameters or the usage of "heavy" objectives like adversarial or perceptual losses. Moreover, when the source data is scarce, as in the one-shot scenario, fine-tuning may lead to overfitting and performance degradation, which is observed in [29].
The authors address both of these problems by using a learned gradient descent (LGD) method to optimize only the synthesized texture X̂. Optimizing with respect to the texture tensor prevents the model from overfitting, while LGD makes it possible to perform optimization with respect to computationally expensive objectives by doing forward passes through a pre-trained network.
Specifically, the authors introduce a lightweight loss function L_upd (a sum of squared errors is used), which measures the distance between a generated image and the ground truth in pixel space, and a texture updating network G_upd, which uses the current state of the texture and the gradient of this function with respect to the texture to produce an update ΔX̂. During fine-tuning, the authors recursively perform M update steps, each time measuring the gradients of L_upd with respect to the updated texture. More formally,

X̂_{k+1} = X̂_k + G_upd(X̂_k, ∇_{X̂_k} L_upd),

where k denotes the iteration number, with X̂_0 being the texture produced by the texture generator. During test time, the authors do the same M updates to the texture and use the obtained X̂_M for inference.
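The M-step learned update loop can be sketched as follows. The updater architecture and the toy "renderer" are stand-ins; only the structure of the loop (compute the lightweight loss, take its gradient with respect to the texture, let the updater network produce the additive update) follows the description above.

```python
import torch
import torch.nn as nn

class TextureUpdater(nn.Module):
    """Stand-in G_upd: maps (texture, gradient) to an additive texture update."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 32, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, texture, grad):
        return self.net(torch.cat([texture, grad], dim=1))

def lightweight_loss(generated, source):
    return ((generated - source) ** 2).sum()     # sum of squared errors

def lgd_finetune(texture0, render_source, source_image, updater, m_steps=4):
    texture = texture0
    for _ in range(m_steps):
        texture = texture.detach().requires_grad_(True)
        loss = lightweight_loss(render_source(texture), source_image)
        grad, = torch.autograd.grad(loss, texture)
        texture = texture + updater(texture, grad)   # learned, not hand-tuned, step
    return texture

# Toy usage: the "renderer" just blurs the texture to stand in for the generator.
render_source = lambda tex: nn.functional.avg_pool2d(tex, 3, stride=1, padding=1)
texture0 = torch.rand(1, 3, 64, 64)
source = torch.rand(1, 3, 64, 64)
updated = lgd_finetune(texture0, render_source, source, TextureUpdater())
print(updated.shape)  # torch.Size([1, 3, 64, 64])
```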
The network G_upd is trained by back-propagation through all M steps. For training, the authors use the same objective that was used during the training of the base model. The authors evaluate it using a target frame x(t) and a generated frame x̂(t). It is important to highlight that L_upd is not used as an objective for the training of G_upd, but simply guides the updates to the texture. Also, the gradients with respect to this loss are evaluated on the source image, while the objective is calculated on a target image, which implies that the network has to produce updates for the whole texture, not just for the region "visible" in the source image. Lastly, while the authors do not propagate any gradients into the generator part of the base model, they keep training the discriminator using the same adversarial objective (4). Even though training the updater network jointly with the base generator is possible, and could lead to better quality (following the success of the model-agnostic meta-learning [3] method), the authors resort to two-stage training due to memory constraints.
Segmentation
Empirically, the authors have observed that the average area which the face occupies in the target images affects the performance of the proposed method. The warping from the texture coordinate space to the image space is trained in an unsupervised way and is heavily influenced by the dataset. For example, if there is a strong correlation between the frames in the training videos, there is no incentive for the texture generator to produce a proper texture with hallucinated features, given the objective to match a target image. In that case, the model can simply decode the source image from the embedding and produce good results. This failure case leads to poor extrapolation to novel viewpoints and can be triggered if, for example, the area of the background is significantly increased, since the background is heavily correlated between the source and target frames.
Therefore, when enlarging the crop size so that it fits a full head, it is necessary to perform foreground segmentation in order to filter out the training signal related to the background. The authors use a state-of-the-art face and body segmentation model [6] to obtain the ground-truth masks. Then, the authors predict a mask m̂(t) via the inference generator alongside its other outputs, and train it via a binary cross-entropy loss L_seg to match the ground-truth mask m(t). To filter out the training signal related to the background, the authors have explored multiple options. The authors cannot simply mask the gradients that are fed into the generator, because this would lead to overfitting of the discriminator. The authors also cannot apply the ground-truth masks to all the images in the dataset, since the model [6] works so well that it produces a sharp border between the foreground and the background, leading to the occurrence of border artifacts during training.
Instead, the authors found that using the predictions m̂(t) works much better. They are smooth and prevent the discriminator from overfitting either to the lack of background or to the sharpness of the border. The stop-gradient of the predicted mask, SG(m̂(t)), is used to mask all the synthesized images and the ground-truth image before the image-based losses are applied. The stop-gradient operator ensures that the training does not converge to a degenerate state.
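The masking scheme, i.e. training the predicted mask with a binary cross-entropy loss and applying its stop-gradient to both the generated and the ground-truth images before the image-based losses, can be sketched as follows (shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def segmentation_loss(mask_logits, gt_mask):
    # Train the predicted mask against the ground-truth segmentation.
    return F.binary_cross_entropy_with_logits(mask_logits, gt_mask)

def masked_for_losses(pred_image, target_image, mask_logits):
    # Soft predicted mask with a stop-gradient, applied to both images.
    mask = torch.sigmoid(mask_logits).detach()
    return pred_image * mask, target_image * mask

mask_logits = torch.randn(1, 1, 256, 256, requires_grad=True)
gt_mask = (torch.rand(1, 1, 256, 256) > 0.5).float()
pred = torch.rand(1, 3, 256, 256, requires_grad=True)
target = torch.rand(1, 3, 256, 256)

seg = segmentation_loss(mask_logits, gt_mask)
masked_pred, masked_target = masked_for_losses(pred, target, mask_logits)
pix = (masked_pred - masked_target).abs().mean()
(seg + pix).backward()
# The mask logits learn only via the BCE term; image losses cannot reshape the mask.
print(mask_logits.grad.abs().sum() > 0, pred.grad.abs().sum() > 0)
```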
Implementation details
The proposed networks consist of pre-activation residual blocks [9] with leaky ReLU activations. The minimum number of features in these blocks is set to 64 and the maximum to 512. In the default configuration, the authors use half the number of features in the inference generator, but the authors also evaluate the proposed model with full- and quarter-capacity inference parts, with the results provided in the experiments.
The authors use batch normalization [10] in all the networks except for the embedder and the texture updater. Inside the texture generator, the authors pair batch normalization with adaptive SPADE layers [25]. These layers are modified to predict pixelwise scale and bias coefficients using feature maps that are treated as model parameters, instead of being input from a different network. This allows saving memory by removing additional networks and intermediate feature maps from the optimization process, and increasing the batch size. Also, following [25], the authors predict the weights for all 1 x 1 convolutions in the network from the embeddings ê(s), which includes the scale and bias mappings in the AdaSPADE layers and the skip connections in the residual upsampling blocks. In the inference generator, the authors use standard adaptive batch normalization layers [1], but also predict the weights for the skip connections from the embeddings. For the vector pose embedding, a multi-layer perceptron is used, with its output reshaped into the convolutional part's input.
Simultaneous gradient descent is performed on the parameters of the generator networks and the discriminator using Adam [17] with a fixed learning rate. The authors use a weight of 0.5 for the adversarial losses (eq. 4-5) and 10 for all other losses, except for the VGGFace perceptual loss (eq. 10), whose weight is set to 0.01. The weight of the regularizer (eq. 11) is additionally multiplicatively reduced by 0.9 every 50 iterations. The proposed models are trained on 8 NVIDIA P40 GPUs with a batch size of 48 for the base model and a batch size of 32 for the updater model. The authors set the unrolling depth M of the updater to 4 and use a sum of squared elements as the lightweight objective. Batch normalization statistics are synchronized across all GPUs during training. During inference, they are replaced with "standing" statistics, similar to [1], which significantly improves the quality of the outputs compared to the usage of running statistics. Spectral normalization is also applied in all linear and convolutional layers of all networks.
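The training setup can be sketched as follows. The tiny networks are placeholders, and the learning rate is an assumed placeholder value because the exact value is not reproduced in this text; the loss weights and the use of spectral normalization in all linear and convolutional layers follow the description above.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def add_spectral_norm(module):
    # Apply spectral normalization to every linear and convolutional layer.
    for name, child in module.named_children():
        if isinstance(child, (nn.Linear, nn.Conv2d)):
            setattr(module, name, spectral_norm(child))
        else:
            add_spectral_norm(child)
    return module

generator = add_spectral_norm(nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.LeakyReLU(0.2), nn.Conv2d(64, 3, 3, padding=1)))
discriminator = add_spectral_norm(nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2), nn.Conv2d(64, 1, 4, 2, 1)))

LR = 2e-4   # placeholder: the exact learning rate is not reproduced in this text
LOSS_WEIGHTS = {"adv": 0.5, "fm": 10.0, "pix": 10.0, "vgg": 10.0, "vggface": 0.01}

opt_g = torch.optim.Adam(generator.parameters(), lr=LR)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=LR)

# One gradient step for D and one for G per iteration (toy hinge losses for illustration).
x = torch.rand(2, 3, 64, 64)
fake = generator(x)
opt_d.zero_grad()
loss_d = (torch.relu(1 - discriminator(x)).mean()
          + torch.relu(1 + discriminator(fake.detach())).mean())
loss_d.backward()
opt_d.step()

opt_g.zero_grad()
loss_g = -LOSS_WEIGHTS["adv"] * discriminator(fake).mean()
loss_g.backward()
opt_g.step()
print(loss_d.item(), loss_g.item())
```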
Please refer to the supplementary material for a detailed description of proposed model's architecture, as well as the discussion of training and architectural features that the authors have adopted.
The foregoing exemplary embodiments are examples and are not to be construed as limiting. In addition, the description of the exemplary embodiments is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.
References
1. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: 7th International Conference on Learning Representations, ICLR 2019 (2019)
2. Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538-1546 (2015)
3. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017 (2017)
4. Fu, C., Hu, Y., Wu, X., Wang, G., Zhang, Q., He, R.: High fidelity face manipulation with extreme pose and expression. arXiv preprint arXiv: 1903.12003 (2019)
5. Ganin, Y., Kononenko, D., Sungatullina, D., Lempitsky, V.: DeepWarp: Photorealistic image resynthesis for gaze manipulation. In: European Conference on Computer Vision, pp. 311-326. Springer (2016)
6. Gong, K., Gao, Y., Liang, X., Shen, X., Wang, M., Lin, L.: Graphonomy: Universal human parsing via graph transfer learning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019 (2019)
7. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014 (2014)
8. Ha, S., Kersner, M., Kim, B., Seo, S., Kim, D.: MarioNETte: Few-shot face reenactment preserving identity of unseen targets. CoRR abs/1911.08139 (2019)
9. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Computer Vision - ECCV 2016 - 14th European Conference (2016)
10. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015 (2015)
11. Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (2017)
12. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp. 2017-2025 (2015)
13. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision - ECCV 2016 - 14th European Conference (2016)
14. Jolicoeur-Martineau, A.: The relativistic discriminator: a key element missing from standard GAN. In: 7th International Conference on Learning Representations, ICLR 2019 (2019)
15. Kim, D., Chung, J.R., Jung, S.: GRDN: grouped residual dense network for real image denoising and GAN-based real-world noise modeling. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019 (2019)
16. Kim, H., Garrido, P., Tewari, A., Xu, W., Thies, J., Nießner, M., Pérez, P., Richardt, C., Zollhöfer, M., Theobalt, C.: Deep video portraits. arXiv preprint arXiv:1805.11714 (2018)
17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)
18. Lombardi, S., Saragih, J., Simon, T., Sheikh, Y.: Deep appearance models for face rendering. ACM Transactions on Graphics (TOG) 37(4), 68 (2018)
19. Mirza, M., Osindero, S.: Conditional generative adversarial nets. CoRR abs/1411.1784 (2014)
20. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: Proceedings of the British Machine Vision Conference 2015, BMVC 2015 (2015)
21. Siarohin, A., Lathuiliere, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019 (2019)
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014), http://arxiv.org/abs/1409.1556
23. Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36(4), 95 (2017)
24. Tripathy, S., Kannala, J., Rahtu, E.: ICface: Interpretable and controllable face reenactment using GANs. CoRR abs/1904.01909 (2019), http://arxiv.org/abs/1904.01909
25. Wang, T., Liu, M., Tao, A., Liu, G., Catanzaro, B., Kautz, J.: Few-shot video-to-video synthesis. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019 (2019)
26. Wang, T., Liu, M., Zhu, J., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018 (2018)
27. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Loy, C.C.: ESRGAN: enhanced super-resolution generative adversarial networks. In: Computer Vision - ECCV 2018 Workshops (2018)
28. Wiles, O., Sophia Koepke, A., Zisserman, A.: X2face: A network for controlling face generation using images, audio, and pose codes. In: The European Conference on Computer Vision (ECCV) (September 2018)
29. Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.S.: Few-shot adversarial learning of realistic neural talking head models. In: IEEE International Conference on Computer Vision, ICCV 2019 (2019)

Claims (10)

  1. A hardware, comprising software products that perform a method for generation of photorealistic images of a neural avatar in a one-shot mode by the following stages:
    a stage of initialization for the creation of the neural avatar, comprising the following steps:
    encoding a concatenation of a source image $x_s$ and a source pose $y_s$, encoded as a landmark image, into a stack of predicted embeddings $\hat{e}$ by an embedder network $E$;
    initializing adaptive parameters from the predicted embeddings $\hat{e}$ and decoding an inpainted high-frequency texture $\hat{X}$ of the source image by a texture generator network $G_{tex}$;
    creating the neural avatar by initializing the adaptive parameters of the texture generator network $G_{tex}$ using the embeddings $\hat{e}$, and predicting the texture $\hat{X}$ by the texture generator network $G_{tex}$;
    a stage of inference for generation of images of the neural avatar, comprising the following steps:
    initializing adaptive parameters of an inference generator network $G_{img}$ using the embeddings $\hat{e}$, and using a target pose $y_t$ to predict a low-frequency component $\hat{x}_{LF}$ of the image of the avatar and a warping field $\hat{\omega}$ by the inference generator network $G_{img}$, which generates the high-frequency component $\hat{x}_{HF}$ of the image of the avatar $\hat{x}$ by applying the warping field $\hat{\omega}$ to the texture $\hat{X}$, namely $\hat{x}_{HF} = \hat{\omega} \circ \hat{X}$,
    wherein the image of the avatar $\hat{x}$ is computed as a sum of the high-frequency component $\hat{x}_{HF}$ and the low-frequency component $\hat{x}_{LF}$, namely $\hat{x} = \hat{x}_{HF} + \hat{x}_{LF}$.
  2. The hardware according to claim 1, wherein the target pose $y_t$ is defined by the vector of face keypoint coordinates.
  3. The hardware according to claim 1, wherein the stage of initialization is done only once per avatar.
  4. The hardware according to claim 1, wherein the texture can be a high-frequency texture.
  5. The hardware according to claim 1, wherein the stage of initialization further comprises updating the high-frequency texture using a texture updater network that is trained to add person-specific details to the texture by observing the mismatch between the source image $x_s$ and the avatar image for the source pose $y_s$ obtained before the update of the texture.
  6. The hardware according to claim 1, wherein the warping field $\hat{\omega}$ is a mapping between the coordinate spaces of the texture and the image of the avatar.
  7. The hardware according to claim 1, wherein the embedder network $E$, the texture generator network $G_{tex}$, and the inference generator network $G_{img}$ are trained in an end-to-end fashion.
  8. The hardware according to claim 1, further comprising mapping a real or a synthesized target image, concatenated with the target pose, into realism scores by the discriminator network $D$.
  9. The hardware according to claim 1, wherein the target pose $y_t$ is obtained by an external landmark tracking process.
  10. The hardware according to claim 9, wherein the tracking process can be applied to another video sequence of the same or a different person, based on the voice signal of the person.
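For readability, the two stages recited in claim 1 can be summarized by the following Python sketch. The network objects, the `warp` helper based on grid sampling, and all argument names are assumptions made for illustration only; the sketch does not restate or limit the claims.

```python
# A minimal sketch of the bi-layer avatar pipeline described in the claims.
# The four networks and the `warp` helper are assumed placeholders; only the
# data flow (one-time initialization, then per-frame inference) follows the text.
import torch
import torch.nn.functional as F


def warp(texture, field):
    # Bilinear resampling of the texture with the predicted warping field,
    # i.e. a mapping between texture and image coordinate spaces (claim 6).
    return F.grid_sample(texture, field, align_corners=False)


def initialize_avatar(embedder, texture_generator, source_image, source_pose):
    """Initialization stage: run once per avatar (claim 3)."""
    # Embedder E: encode the concatenated source image and landmark image.
    embeddings = embedder(torch.cat([source_image, source_pose], dim=1))
    # Texture generator G_tex: adaptive parameters come from the embeddings;
    # the output is the inpainted high-frequency texture.
    texture = texture_generator(embeddings)
    return embeddings, texture


def render_frame(inference_generator, embeddings, texture, target_pose):
    """Inference stage: run for every target pose."""
    # Inference generator G_img: predicts the low-frequency component and the
    # warping field from the target pose, with adaptive parameters set from
    # the embeddings.
    low_freq, warp_field = inference_generator(target_pose, embeddings)
    # High-frequency component: the warped texture.
    high_freq = warp(texture, warp_field)
    # Final avatar image: sum of the two components.
    return low_freq + high_freq
```

Per claim 3, initialize_avatar would be called once per avatar, after which render_frame can be invoked for every new target pose, for example one obtained from an external landmark tracker (claim 9).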
PCT/KR2021/000795 2020-03-03 2021-01-20 Fast bi-layer neural synthesis of one-shot realistic images of neural avatar WO2021177596A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2020109348 2020-03-03
RU2020109348 2020-03-03
RU2020124828 2020-07-27
RU2020124828A RU2764144C1 (en) 2020-07-27 2020-07-27 Rapid two-layer neural network synthesis of realistic images of a neural avatar based on a single image

Publications (1)

Publication Number Publication Date
WO2021177596A1 true WO2021177596A1 (en) 2021-09-10

Family

ID=77613640

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/000795 WO2021177596A1 (en) 2020-03-03 2021-01-20 Fast bi-layer neural synthesis of one-shot realistic images of neural avatar

Country Status (1)

Country Link
WO (1) WO2021177596A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596764A (en) * 2023-07-17 2023-08-15 华侨大学 Lightweight image super-resolution method based on transform and convolution interaction
CN117710449A (en) * 2024-02-05 2024-03-15 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374242A1 (en) * 2016-12-01 2018-12-27 Pinscreen, Inc. Avatar digitization from a single image for real-time rendering
US20200051303A1 (en) * 2018-08-13 2020-02-13 Pinscreen, Inc. Real-time avatars using dynamic textures
US20200066029A1 (en) * 2017-02-27 2020-02-27 Metail Limited Method of generating an image file of a 3d body model of a user wearing a garment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374242A1 (en) * 2016-12-01 2018-12-27 Pinscreen, Inc. Avatar digitization from a single image for real-time rendering
US20200066029A1 (en) * 2017-02-27 2020-02-27 Metail Limited Method of generating an image file of a 3d body model of a user wearing a garment
US20200051303A1 (en) * 2018-08-13 2020-02-13 Pinscreen, Inc. Real-time avatars using dynamic textures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUANGXIAO GU; YUQIAN ZHOU; THOMAS HUANG: "FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 November 2019 (2019-11-21), 201 Olin Library Cornell University Ithaca, NY 14853, XP081536788 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596764A (en) * 2023-07-17 2023-08-15 华侨大学 Lightweight image super-resolution method based on transform and convolution interaction
CN116596764B (en) * 2023-07-17 2023-10-31 华侨大学 Lightweight image super-resolution method based on transform and convolution interaction
CN117710449A (en) * 2024-02-05 2024-03-15 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method
CN117710449B (en) * 2024-02-05 2024-04-16 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method

Similar Documents

Publication Publication Date Title
Zakharov et al. Few-shot adversarial learning of realistic neural talking head models
US11775829B2 (en) Generative adversarial neural network assisted video reconstruction
Yi et al. Audio-driven talking face video generation with learning-based personalized head pose
US11861936B2 (en) Face reenactment
WO2020096403A1 (en) Textured neural avatars
WO2020190083A1 (en) Electronic device and controlling method thereof
CN112037320B (en) Image processing method, device, equipment and computer readable storage medium
Ma et al. Styletalk: One-shot talking head generation with controllable speaking styles
WO2021177596A1 (en) Fast bi-layer neural synthesis of one-shot realistic images of neural avatar
WO2020150689A1 (en) Systems and methods for realistic head turns and face animation synthesis on mobile device
CN112949535B (en) Face data identity de-identification method based on generative confrontation network
WO2023085624A1 (en) Method and apparatus for three-dimensional reconstruction of a human head for rendering a human image
CN111428575A (en) Tracking method for fuzzy target based on twin network
Lin et al. Reconstruction algorithm for lost frame of multiview videos in wireless multimedia sensor network based on deep learning multilayer perceptron regression
US20210264207A1 (en) Image editing by a generative adversarial network using keypoints or segmentation masks constraints
Zhang et al. Deep learning in face synthesis: A survey on deepfakes
CN115914505B (en) Video generation method and system based on voice-driven digital human model
EP3874415A1 (en) Electronic device and controlling method thereof
Huang et al. Low light image enhancement network with attention mechanism and retinex model
CA3180427A1 (en) Synthesizing sequences of 3d geometries for movement-based performance
CN113421185B (en) StyleGAN-based mobile terminal face age editing method
Sun et al. Learning adaptive patch generators for mask-robust image inpainting
CN111640172A (en) Attitude migration method based on generation of countermeasure network
RU2764144C1 (en) Rapid two-layer neural network synthesis of realistic images of a neural avatar based on a single image
Huang et al. Perceptual conversational head generation with regularized driver and enhanced renderer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21763915

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21763915

Country of ref document: EP

Kind code of ref document: A1