WO2021177596A1 - Fast bi-layer neural synthesis of one-shot realistic images of neural avatar - Google Patents

Fast bi-layer neural synthesis of one-shot realistic images of neural avatar

Info

Publication number
WO2021177596A1
Authority
WO
WIPO (PCT)
Prior art keywords
texture
image
avatar
neural
network
Application number
PCT/KR2021/000795
Other languages
French (fr)
Inventor
Egor Olegovich ZAKHAROV
Aleksei Aleksandrovich IVAKHNENKO
Aliaksandra Petrovna SHYSHEYA
Victor Sergeevich LEMPITSKY
Original Assignee
Samsung Electronics Co., Ltd.
Priority claimed from RU2020124828A external-priority patent/RU2764144C1/en
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2021177596A1 publication Critical patent/WO2021177596A1/en

Classifications

    • G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G06T 11/60 - Editing figures and text; Combining figures or text
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V 10/54 - Extraction of image or video features relating to texture
    • G06V 10/82 - Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/161 - Human faces: Detection; Localisation; Normalisation

Abstract

Proposed is a new architecture for neural avatars that improves the state of the art in several aspects. The proposed system creates neural avatars from a single photograph, provides an order-of-magnitude inference speedup over previous neural avatar models, and can scale neural avatar modeling to higher resolutions than the training set used to learn the model. The proposed approach models person appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a relatively small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline and is warped and added to the coarse image to ensure a high effective resolution of the synthesized head views.

Description

FAST BI-LAYER NEURAL SYNTHESIS OF ONE-SHOT REALISTIC IMAGES OF NEURAL AVATAR
The invention relates to the fields of computer graphics, deep learning, adversarial learning, talking head synthesis, neural avatars, neural rendering, face synthesis and face animation.
Personalized neural (head) avatars driven by keypoints or another mimics/pose representation are a technology with manifold applications in telepresence, gaming, AR/VR applications and the special effects industry. Modeling human head appearance is a daunting task. For at least two decades, creating such avatars (talking head models) was done with computer graphics tools using mesh-based surface models and texture maps. The resulting systems fall into two groups. Some are able to model specific people with very high realism after significant acquisition and design efforts are spent on those particular people. Others are able to create talking head models from as little as a single photograph, but fall short of photorealism.
In recent years, neural talking heads have emerged as an alternative to the classic computer graphics pipeline, striving to achieve both high realism and ease of acquisition. The first works required a video or even multiple videos to create a neural network that can synthesize talking head views of a person. Most recently, several works presented systems that create neural avatars from a handful of photographs (few-shot setting) or from as little as a single photograph (one-shot setting), causing both excitement and concerns about the potential misuse of such technology.
Methods for the neural synthesis of realistic talking head sequences can be divided into many-shot methods (i.e. requiring a video or multiple videos of the target person for learning the model) [11,16,18,27] and a more recent group of few-shot/single-shot methods capable of acquiring the model of a person from a single photograph or a handful of photographs [24,28,29]. The proposed method falls into the latter category, as the authors focus on the one-shot scenario (modeling from a single photograph).
Along another dimension, these methods can be divided according to the architecture of the generator network. Thus, several methods [16,24,27,29] use generators based on direct synthesis, where the image is generated using a sequence of convolutional operators, interleaved with elementwise non-linearities, and normalizations. Identity information may be injected into such architecture, either with a lengthy learning (in the many-shot scenario) [16,27] or by using adaptive normalizations conditioned on person embeddings [4,24,29]. The method [29] effectively combines both approaches by injecting identity through adaptive normalizations, and then fine-tuning the resulting generator on the few-shot learning set. The direct synthesis approach for human heads can be traced back to [23] that generated lips of a person (Obama) in the talking head sequence, and further towards first works on conditional convolutional neural synthesis of generic objects such as [2].
The alternative to direct image synthesis is to use differentiable warping [12] inside the architecture. The warping can be applied to one of the frames. The X2Face approach [28] applies warping twice, first from the source image to a standardized image (texture), and then to the target image. The codec avatar system [18] synthesizes a pose-dependent texture for simplified mesh geometry. The MarioNETte system [8] applies warping to the intermediate feature representations. The few-shot video-to-video system [25] combines direct synthesis with the warping of the previous frame in order to obtain temporal continuity. The first-order motion model system [21] learns to warp the intermediate feature representation of the generator based on "unsupervised" keypoints that are learned from data. Beyond heads, differentiable warping has recently been used for face rotation, face normalization and full-body rendering. Earlier, the DeepWarp system [5] used neural warping to alter the appearance of eyes for the purpose of gaze redirection, while other work used neural warping for the resynthesis of generic scenes. The proposed method combines direct image synthesis with warping in a new way, as the authors use an RGB pose-independent texture comprising fine details alongside a coarse-grained pose-dependent RGB component that is synthesized by a neural network.
Existing few-shot neural avatar systems achieve remarkable results but are still limited in two ways. First, they have a limited resolution (up to 256x256 pixels). This limitation stems from the need to collect a large and diverse dataset of in-the-wild videos, which is possible at such a low resolution and much harder at higher ones. Secondly, despite the low resolution, and unlike some of the graphics-based avatars, the neural systems are too slow to be deployed to mobile devices and require a high-end GPU to run in real time. The authors note that most application scenarios of neural avatars, especially those concerned with telepresence, would benefit greatly from the capability to run in real time on a mobile device.
In this invention, the authors address these two limitations of one-shot neural avatar systems and develop an approach that can run at a higher resolution and much faster than previous systems. To achieve this, the authors adopt a bi-layer representation, where the image of an avatar in a new pose is generated by summing two components: a coarse image directly predicted by a rendering network, and a warped texture image. While the warp of the texture is also predicted by the rendering network, the texture itself is estimated at the time of avatar generation and is fixed at runtime. To enable the few-shot capability, the authors use a meta-learning stage on a dataset of videos, where the authors (meta-)train the rendering network, the embedding network, as well as the texture generation network.
The separation of the target frames into two layers makes it possible both to improve the effective resolution and to increase the speed of neural rendering. This is because the authors can use an off-line avatar generation stage to synthesize a high-resolution texture, while at test time both the first component (the coarse image) and the warping of the texture need not contain high-frequency details and can therefore be predicted by a relatively small rendering network. These advantages of the proposed system are validated by extensive comparisons with previously proposed neural avatar systems. The authors also report on a smartphone-based real-time implementation of the proposed system, which was beyond the reach of previously proposed systems.
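For illustration, the bi-layer composition described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the tensor shapes, the bilinear sampling mode and the function names are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def compose_avatar(coarse_lf, texture, warp_field):
    """Bi-layer composition: output = low-frequency image + warped texture.

    coarse_lf:  (B, 3, H, W)   pose-dependent low-frequency image
    texture:    (B, 3, Ht, Wt) pose-independent high-frequency texture
    warp_field: (B, H, W, 2)   sampling grid in [-1, 1] texture coordinates
    """
    # Warp the texture into the image coordinate space (high-frequency layer).
    hf = F.grid_sample(texture, warp_field, mode="bilinear",
                       padding_mode="zeros", align_corners=False)
    # The sum of the two layers gives the final avatar image.
    return coarse_lf + hf

if __name__ == "__main__":
    coarse = torch.rand(1, 3, 256, 256)
    tex = torch.rand(1, 3, 512, 512)   # texture may be higher resolution than the frame
    # Identity grid as a stand-in for the predicted warping field.
    theta = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
    grid = F.affine_grid(theta, size=(1, 3, 256, 256), align_corners=False)
    out = compose_avatar(coarse, tex, grid)
    print(out.shape)  # torch.Size([1, 3, 256, 256])
```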
Proposed is hardware, comprising software products, that performs a method for the generation of photorealistic images of a neural avatar in one-shot mode by the following stages:
a stage of initialization for the creation of the neural avatar, comprising the following steps:
encoding a concatenation of a source image x(s) and a source pose y(s), encoded as a landmark image, into a stack of embeddings ê(s) by the embedder network E;
initializing adaptive parameters from the embeddings ê(s) and decoding an inpainted high-frequency texture X̂ of the source image by the texture generator network G_tex;
creating the neural avatar by initializing the adaptive parameters of the texture generator network G_tex using the embeddings ê(s), and predicting the texture X̂ by the texture generator network G_tex;
a stage of inference for the generation of images of the neural avatar, comprising the following steps:
initializing the adaptive parameters of the inference generator network G_inf using the embeddings ê(s), and using a target pose y(t) to predict a low-frequency component x̂_LF(t) of the image of the avatar and a warping field ω̂(t) by the inference generator network G_inf, which generates the high-frequency component x̂_HF(t) of the image of the avatar x̂(t) by applying the warping field to the texture X̂, namely x̂_HF(t) = ω̂(t) ∘ X̂,
wherein the image of the avatar x̂(t) is computed as the sum of the high-frequency component x̂_HF(t) and the low-frequency component x̂_LF(t), namely x̂(t) = x̂_LF(t) + x̂_HF(t).
Wherein the target pose y(t) is defined by the vector of face keypoint coordinates. The stage of initialization is only done once per avatar. Wherein the texture can be a high-frequency texture. The stage of initialization further comprises updating the high-frequency texture using a texture updater network that is trained to add person-specific details to the texture by observing the mismatch between the source image x(s) and the avatar image x̂(s) for the source pose obtained before the update of the texture. Wherein the warping field ω̂(t) is a mapping between the coordinate spaces of the texture and the image of the avatar. The embedder network E, the texture generator network G_tex and the inference generator network G_inf are trained in an end-to-end fashion. Wherein the method for the generation of photorealistic images of the neural avatar further comprises mapping a real or a synthesized target image, concatenated with the target pose, into realism scores by the discriminator network D. The target pose y(t) is obtained by an external landmark tracking process. The tracking process can be applied to another video sequence of the same or a different person, be driven by the voice signal of the person, or be created in some other way.
The above and/or other aspects will be more apparent by describing exemplary embodiments with reference to the accompanying drawings.
Figure 1 illustrates generation of the output image.
Figure 2 illustrates the general pipeline of the method.
The developed model can be used for the synthesis of artificial images of people, guided by the pose and expression. The model can run on cloud platforms, desktop solutions and mobile devices.
The proposed invention can be realized with a server performing the initialization and a smartphone performing the inference; namely, the output of the initialization component can be transferred to the smartphone.
The model produces a realistic image of a person given a single source image (which is called "one-shot learning") and a set of facial keypoints, which encode the face expression and head rotation ("talking head synthesis"). The key difference from other models is the capability to run in real time on mobile devices. The key novelty of the proposed method is to decompose the output image into low- and high-frequency components. The low-frequency component can therefore be synthesized in real time using conventional approaches, but with a much "faster" model compared to previous work. The high-frequency component is predicted via warping of the texture, with the texture fixed during inference.
This allows "offloading" some of the computation that is conventionally done during inference into a person-specific initialization stage. This stage accepts a single source image of a person and initializes the internal parameters of the model specific to that person. Another novelty is the application of an existing "learned gradient descent" method to the texture, which allows further tailoring it to a particular person during the initialization stage and reducing the identity gap of the produced avatar.
Methods
Video sequences annotated with keypoints and, optionally, segmentation masks are used for training. The t-th frame of the i-th video sequence is denoted as x_i(t), the corresponding keypoints as y_i(t), and the segmentation masks as m_i(t). The index t is used to denote a target frame and s a source frame. Also, all tensors related to generated images are marked with a hat symbol, e.g. x̂_i(t). The spatial size of all frames is assumed to be constant and is denoted as H x W. In some modules, input keypoints are encoded as an RGB image, which is a standard approach in a large body of previous works [8,25,29]; in this application it is called a landmark image. Contrary to these approaches, however, the authors input the keypoints into the inference generator directly as a vector. This significantly reduces the inference time of the method.
Architecture
As illustrated in Fig. 1, the output image is generated in two stages: initialization and inference. During initialization, the authors predict embeddings using a source frame, initialize the adaptive parameters of both the inference and texture generators, and predict a high-frequency texture. The initialization stage is only done once per avatar. During inference, the authors use target keypoints (the target pose) to predict a low-frequency component of the output image and a warping field, which, applied to the texture, gives the high-frequency component. These components, namely the predicted low-frequency image and the warped texture image, are added together to produce the output.
In the proposed approach, the following networks are trained in an end-to-end fashion:
- The embedder network E encodes a concatenation of a source image and a landmark image into a stack of embeddings ê(s), which are used for initialization of the adaptive parameters inside the generators.
- The texture generator network G_tex initializes its adaptive parameters from the embeddings and decodes an inpainted high-frequency component of the source image, which the authors call a texture X̂.
- The inference generator network G_inf maps target poses into images of the avatar x̂(t). This network consists of three parts. A pose embedder part maps a pose vector into a spatial tensor, which is used as an input for the convolutional part. The latter performs upsampling, guided by the adaptive parameters predicted from the embeddings. The output of the convolutional part is split into x̂_LF(t) (a low-frequency layer of the output image), which encodes basic facial features, skin color and light sources, and ω̂(t) (a mapping between the coordinate spaces of the texture and the output image). These outputs are combined in the composing part. The high-frequency layer of the output image is obtained by warping the predicted texture, x̂_HF(t) = ω̂(t) ∘ X̂, and is added to the low-frequency component to produce the image of the avatar:

x̂(t) = x̂_LF(t) + x̂_HF(t).    (1)

- Finally, the discriminator network D, which is a conditional [19] relativistic [14] PatchGAN [11], maps a real or a synthesized target image, concatenated with the target landmark image, into realism scores.
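To make the three-part structure of the inference generator concrete, the following is a much-simplified PyTorch sketch. The layer sizes, the plain (non-adaptive) convolutions, the number of upsampling steps and the 68-keypoint pose vector are illustrative assumptions; the actual generator uses adaptive residual blocks whose parameters are predicted from the embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyInferenceGenerator(nn.Module):
    def __init__(self, n_keypoints=68, base=64, start_size=4, out_size=256):
        super().__init__()
        self.start_size = start_size
        # Pose embedder part: keypoint vector -> small spatial tensor.
        self.pose_mlp = nn.Sequential(
            nn.Linear(2 * n_keypoints, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, base * start_size * start_size))
        # Convolutional part: repeated upsampling (adaptive parameters omitted).
        ups, size = [], start_size
        while size < out_size:
            ups += [nn.Upsample(scale_factor=2),
                    nn.Conv2d(base, base, 3, padding=1), nn.LeakyReLU(0.2)]
            size *= 2
        self.conv_part = nn.Sequential(*ups)
        # Heads: 3 channels for the low-frequency image, 2 for the warp field.
        self.to_lf = nn.Conv2d(base, 3, 3, padding=1)
        self.to_warp = nn.Conv2d(base, 2, 3, padding=1)

    def forward(self, pose, texture):
        b = pose.shape[0]
        h = self.pose_mlp(pose).view(b, -1, self.start_size, self.start_size)
        h = self.conv_part(h)
        x_lf = torch.tanh(self.to_lf(h))
        # Predicted warp expressed as a delta on top of an identity sampling grid.
        delta = self.to_warp(h).permute(0, 2, 3, 1)
        idt = F.affine_grid(
            torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]).repeat(b, 1, 1),
            size=x_lf.shape, align_corners=False)
        x_hf = F.grid_sample(texture, idt + delta, align_corners=False)
        return x_lf + x_hf  # composing part: sum of the two layers

pose = torch.rand(2, 2 * 68)
tex = torch.rand(2, 3, 256, 256)
print(TinyInferenceGenerator()(pose, tex).shape)  # torch.Size([2, 3, 256, 256])
```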
During training, an output image is generated in two stages: person-specific initialization and inference (see Figure 1). During the initialization stage, the authors first input a source image x(s) and a source pose y(s), encoded as a landmark image, into the embedder. The outputs of the embedder are the tensors ê(s), which are used to predict the adaptive parameters of the texture generator and the inference generator. Then, a high-frequency texture X̂ of the source image is synthesized by the texture generator, which concludes the initialization. During the inference stage, the authors only input the corresponding target pose y(t) into the inference generator. It predicts a low-frequency component of the output image x̂_LF(t) directly and a high-frequency component x̂_HF(t) by warping the texture with a predicted field ω̂(t). The image of the avatar x̂(t) is the sum of these two components.
It is important to note that, while the texture generator is explicitly forced to generate only a high-frequency component of the image via the design of the loss functions, the authors do not specifically constrain it to perform inpainting. This behavior is emergent from the fact that two different images with different poses are used for initialization and for loss calculation.
Figure 2 illustrates the general pipeline of the method. The initialization module receives an image of the user. It then takes 100 ms on an NVIDIA GPU to initialize an avatar, i.e. to precompute the weights of the inference generator network and the texture, as well as to adjust the texture. After such initialization, a new image of the avatar for a new pose defined by facial keypoint positions can be obtained by the inference module in a much shorter time (e.g. 42 ms on a mobile Snapdragon 855 GPU).
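The split between the one-time initialization and the lightweight per-frame inference can be sketched as follows. The function names and the toy stand-in networks are assumptions for illustration only; they show the structure of the computation, not the real modules.

```python
import torch
import torch.nn.functional as F

def initialize_avatar(embedder, texture_gen, source_image, source_landmarks):
    """One-time, per-person stage: compute embeddings and a fixed texture."""
    with torch.no_grad():
        embeddings = embedder(torch.cat([source_image, source_landmarks], dim=1))
        texture = texture_gen(embeddings)
    return embeddings, texture      # reused for every subsequent frame

def render_frame(inference_gen, embeddings, texture, target_pose):
    """Per-frame stage: only the small inference generator is evaluated."""
    with torch.no_grad():
        coarse, warp = inference_gen(target_pose, embeddings)
        hf = F.grid_sample(texture, warp, align_corners=False)
    return coarse + hf

# Toy stand-ins so the sketch executes end to end.
embedder = lambda x: x.mean(dim=(2, 3))                       # (B, C)
texture_gen = lambda e: torch.rand(e.shape[0], 3, 512, 512)
def inference_gen(pose, e):
    b = pose.shape[0]
    coarse = torch.zeros(b, 3, 256, 256)
    grid = F.affine_grid(torch.eye(2, 3).unsqueeze(0).repeat(b, 1, 1),
                         size=(b, 3, 256, 256), align_corners=False)
    return coarse, grid

emb, tex = initialize_avatar(embedder, texture_gen,
                             torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
frame = render_frame(inference_gen, emb, tex, torch.rand(1, 136))
print(frame.shape)  # torch.Size([1, 3, 256, 256])
```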
Training process
Multiple loss functions are used for training. The main loss function responsible for the realism of the outputs is trained in an adversarial way [7]. A pixelwise loss is also used to preserve the source lighting conditions, and a perceptual [13] loss to match the source identity in the outputs. Finally, a regularization of the texture mapping adds robustness to the random initialization of the model.
Adversarial loss
Adversarial loss is optimized by both the generator and the discriminator networks. Usually, it resembles a binary classification loss function between real and fake images, which the discriminator is optimized to minimize and the generator to maximize [7]. The authors follow a large body of previous works [1,8,25,29] and use a hinge loss as a substitute for the original binary cross-entropy loss. The authors also perform relativistic realism score calculation [14], following its recent success in tasks such as super-resolution [27] and denoising [15]. This addition is supposed to make the adversarial training more stable [14]. Therefore, the authors use equations (2) and (3) to calculate realism scores for real and fake images respectively, with i_n and t_n denoting the video and frame indices of the n-th mini-batch element and N denoting the mini-batch size:

s_real,n = D(x_{i_n}(t_n), y_{i_n}(t_n)) - (1/N) Σ_m D(x̂_{i_m}(t_m), y_{i_m}(t_m)),    (2)

s_fake,n = D(x̂_{i_n}(t_n), y_{i_n}(t_n)) - (1/N) Σ_m D(x_{i_m}(t_m), y_{i_m}(t_m)).    (3)

Moreover, the authors use the PatchGAN [11] formulation of adversarial learning. In it, the discriminator outputs a matrix of realism scores instead of a single prediction, and each element of this matrix is treated as a realism score for a corresponding patch of the input image. This formulation is also used in a large body of relevant works [8,25,26] and improves the stability of the adversarial training. If the size of the scores matrix is denoted as H_s x W_s, the resulting objectives can be written as follows:

L_adv^D = (1/(N·H_s·W_s)) Σ_n Σ_{h,w} [ max(0, 1 - s_real,n[h,w]) + max(0, 1 + s_fake,n[h,w]) ],    (4)

L_adv^G = -(1/(N·H_s·W_s)) Σ_n Σ_{h,w} s_fake,n[h,w].    (5)
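The adversarial formulation above can be sketched in PyTorch as follows. The toy conditional PatchGAN discriminator and the tensor shapes are assumptions, and the relativistic scores are written in the mini-batch-average form, which is one common reading of [14].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy conditional PatchGAN discriminator: outputs an Hs x Ws matrix of scores.
patch_d = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1))

def relativistic_scores(d, real, fake, cond):
    """Patch scores relative to the mini-batch average of the opposite class."""
    d_real = d(torch.cat([real, cond], dim=1))   # (N, 1, Hs, Ws)
    d_fake = d(torch.cat([fake, cond], dim=1))
    s_real = d_real - d_fake.mean(dim=0, keepdim=True)
    s_fake = d_fake - d_real.mean(dim=0, keepdim=True)
    return s_real, s_fake

def hinge_d_loss(s_real, s_fake):
    # Discriminator pushes real patches above +1 and fake patches below -1.
    return (F.relu(1.0 - s_real) + F.relu(1.0 + s_fake)).mean()

def hinge_g_loss(s_fake):
    # Generator raises the realism scores of its outputs.
    return -s_fake.mean()

real = torch.rand(4, 3, 64, 64)
fake = torch.rand(4, 3, 64, 64)
landmarks = torch.rand(4, 3, 64, 64)             # condition: landmark image
s_r, s_f = relativistic_scores(patch_d, real, fake, landmarks)
print(hinge_d_loss(s_r, s_f).item(), hinge_g_loss(s_f).item())
```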
Equation (4) is the only term that is used for the training of the discriminator. For the generator, the authors also calculate a feature matching loss [26], which has become a standard component of supervised image-to-image translation models. In this objective, the authors want to minimize the distance between the intermediate feature maps of the discriminator, calculated using the corresponding target and generated images. If the features of the discriminator at the k-th spatial resolution H_k x W_k are denoted as D_k, then the feature matching objective can be calculated as follows:

L_FM = Σ_k (1/(H_k·W_k)) || D_k(x(t), y(t)) - D_k(x̂(t), y(t)) ||_1.    (6)
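A sketch of the feature matching term is given below; the discriminator stages and the per-layer averaging are assumptions made so that the example is self-contained.

```python
import torch
import torch.nn as nn

# Discriminator split into stages so intermediate feature maps can be reused.
stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2)),
    nn.Sequential(nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2)),
    nn.Sequential(nn.Conv2d(64, 1, 4, 2, 1)),
])

def features(x):
    feats = []
    for stage in stages:
        x = stage(x)
        feats.append(x)
    return feats  # one feature map per spatial resolution

def feature_matching_loss(target, generated):
    loss = 0.0
    for f_t, f_g in zip(features(target), features(generated)):
        loss = loss + (f_t.detach() - f_g).abs().mean()  # mean L1 per feature map
    return loss

print(feature_matching_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)))
```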
Pixelwise and perceptual losses force the predicted images to match the ground truth, and are respectively applied to the low- and high-frequency components of the output images. Since the usage of pixelwise losses assumes that all pixels in the image are statistically independent, the optimization process empirically leads to blurry images [11], which is most suitable for the low-frequency component of the output. On the contrary, the optimization of a perceptual loss leads to crisper and more realistic images [13], which the authors utilize to train the high-frequency component. If this separation between the components is removed and they are trained jointly via a single objective, the method becomes unstable with respect to architecture choices and even the quality of the images in the dataset, with either the low- or the high-frequency component receiving all the training signal and the other close to none, which leads to suboptimal performance after convergence.
The pixelwise loss is calculated by simply measuring a mean L1 distance between the target image and the low-frequency component:

L_pix = (1/(H·W)) || x(t) - x̂_LF(t) ||_1.    (7)

To calculate the perceptual loss, the authors need to use a stop-gradient operator SG, which prevents gradient flow into the low-frequency component. The input generated image is therefore calculated as follows:

x̂(t) = SG(x̂_LF(t)) + x̂_HF(t).    (8)
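In an automatic-differentiation framework, the stop-gradient operator SG can be realized with detach(); the following sketch (with assumed tensor shapes) shows the pixelwise loss on the low-frequency layer and the composition used as input to the perceptual loss.

```python
import torch

def pixelwise_loss(target, x_lf):
    # Mean L1 distance between the target image and the low-frequency layer.
    return (target - x_lf).abs().mean()

def compose_for_perceptual(x_lf, x_hf):
    # SG(x_lf) + x_hf: losses computed on this image reach only the high-frequency layer.
    return x_lf.detach() + x_hf

target = torch.rand(1, 3, 256, 256)
x_lf = torch.rand(1, 3, 256, 256, requires_grad=True)
x_hf = torch.rand(1, 3, 256, 256, requires_grad=True)
loss = pixelwise_loss(target, x_lf) + compose_for_perceptual(x_lf, x_hf).mean()
loss.backward()
# x_lf receives gradient only from the pixelwise term; x_hf only from the composed image.
print(x_lf.grad.abs().sum() > 0, x_hf.grad.abs().sum() > 0)
```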
Following [8] and [29], the proposed variant of the perceptual loss consists of two components: features evaluated using an ILSVRC (ImageNet) pre-trained VGG19 network [22], and a VGGFace network [20] trained for face recognition. If the intermediate features of these networks are denoted as φ_k and ψ_k, and their spatial sizes as H_k x W_k, the objectives can be written as follows:

L_per^VGG = Σ_k (1/(H_k·W_k)) || φ_k(x(t)) - φ_k(x̂(t)) ||_1,    (9)

L_per^VGGFace = Σ_k (1/(H_k·W_k)) || ψ_k(x(t)) - ψ_k(x̂(t)) ||_1.    (10)
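A sketch of the VGG-based perceptual term is shown below. To keep the example self-contained, the torchvision VGG19 is instantiated without pretrained weights and the chosen layer indices are assumptions; in practice, an ImageNet-pretrained VGG19 and a VGGFace network would supply the features.

```python
import torch
from torchvision.models import vgg19

# Feature extractor: VGG19 conv stack cut at a few intermediate layers.
backbone = vgg19(weights=None).features.eval()   # pretrained weights would be loaded in practice
for p in backbone.parameters():
    p.requires_grad_(False)
FEATURE_LAYERS = {3, 8, 17, 26}                  # assumed relu1_2 .. relu4_4 indices

def vgg_features(x):
    feats = []
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in FEATURE_LAYERS:
            feats.append(x)
    return feats

def perceptual_loss(target, generated):
    loss = 0.0
    for f_t, f_g in zip(vgg_features(target), vgg_features(generated)):
        loss = loss + (f_t - f_g).abs().mean()   # L1 averaged over each feature map
    return loss

print(perceptual_loss(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)))
```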
It is important to note that, contrary to these pairwise losses, the adversarial loss is back-propagated into both the low- and high-frequency components, which leads to better realism and pose preservation in the predicted images.
Texture mapping regularization is proposed to improve the stability of the training. The training signal that the texture generator G_tex receives is first warped by the warping field ω̂(t) predicted by the inference generator. Because of this, random initializations of the networks typically lead to suboptimal textures, in which the face of the source person occupies a small fraction of the total area of the texture. As the training progresses, this leads to a lower effective resolution of the output image, since the optimization process is unable to escape this bad local optimum. The authors address the problem by treating the network's output as a delta to an identity mapping, and by applying a magnitude penalty on that delta in the early iterations. The weight of this penalty is multiplicatively reduced to zero during training, so it does not affect the final performance of the model. More formally, the authors decompose the output warping field into a sum of two terms, ω̂(t) = I + Δω̂(t), where I denotes an identity mapping, and apply a norm penalty, averaged over the number of spatial positions in the mapping, to the second term:

L_reg = (1/(H·W)) Σ_{h,w} || Δω̂(t)[h,w] ||.    (11)
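The regularization can be sketched as follows: the predicted warp is treated as a delta on top of an identity sampling grid, and the penalty weight is multiplicatively decayed (the 0.9-every-50-iterations schedule comes from the implementation details below; the initial weight and the toy update loop are assumptions).

```python
import torch
import torch.nn.functional as F

def identity_grid(batch, height, width):
    theta = torch.eye(2, 3).unsqueeze(0).repeat(batch, 1, 1)
    return F.affine_grid(theta, size=(batch, 1, height, width), align_corners=False)

def regularized_warp(texture, delta):
    """Output warp = identity + delta; penalty on the magnitude of the delta."""
    b, h, w, _ = delta.shape
    grid = identity_grid(b, h, w) + delta
    warped = F.grid_sample(texture, grid, align_corners=False)
    reg = delta.norm(dim=-1).mean()          # magnitude averaged over spatial positions
    return warped, reg

texture = torch.rand(1, 3, 256, 256)
delta = 0.01 * torch.randn(1, 256, 256, 2, requires_grad=True)

reg_weight = 10.0                            # toy initial penalty weight
for step in range(1, 201):
    _, reg = regularized_warp(texture, delta)
    (reg_weight * reg).backward()
    delta.data -= 1e-3 * delta.grad          # toy gradient step on the delta
    delta.grad = None
    if step % 50 == 0:
        reg_weight *= 0.9                    # multiplicative decay of the penalty weight
print(float(reg))
```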
The generator networks, i.e. the image embedder, the texture generator and the inference generator, are jointly trained on a single objective, which is a weighted sum of the objectives (5)-(7), (9)-(11) and, optionally, the segmentation objective described below.
Fine-tuning
Training on person-specific source data leads to a significant improvement in realism and identity preservation of the synthesized images [29], but is computationally expensive if it involves optimization of the model's parameters or the usage of "heavy" objectives like adversarial or perceptual losses. Moreover, when the source data is scarce, as in the one-shot scenario, fine-tuning may lead to overfitting and performance degradation, which is observed in [29].
The authors address both of these problems by using a learned gradient descent (LGD) method to optimize only the synthesized texture X̂. Optimizing with respect to the texture tensor prevents the model from overfitting, while LGD makes it possible to perform optimization with respect to computationally expensive objectives by doing forward passes through a pre-trained network.
Specifically, the authors introduce a lightweight loss function L_upd (a sum of squared errors is used), which measures the distance between a generated image and the ground truth in pixel space, and a texture updating network G_upd, which uses the current state of the texture and the gradient of this function with respect to the texture to produce an update ΔX̂. During fine-tuning, the authors recursively perform M update steps, each time measuring the gradients of L_upd with respect to the updated texture. More formally,

X̂_{k+1} = X̂_k + G_upd(X̂_k, ∇_{X̂_k} L_upd),

where k denotes the iteration number, with X̂_0 being the texture produced by the texture generator. During test time, the authors do the same M updates to the texture and use the obtained X̂_M for inference.
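The M-step learned update loop can be sketched as follows. The updater architecture and the toy "renderer" are stand-ins; only the structure of the loop (compute the lightweight loss, take its gradient with respect to the texture, let the updater network produce the additive update) follows the description above.

```python
import torch
import torch.nn as nn

class TextureUpdater(nn.Module):
    """Stand-in G_upd: maps (texture, gradient) to an additive texture update."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 32, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, texture, grad):
        return self.net(torch.cat([texture, grad], dim=1))

def lightweight_loss(generated, source):
    return ((generated - source) ** 2).sum()     # sum of squared errors

def lgd_finetune(texture0, render_source, source_image, updater, m_steps=4):
    texture = texture0
    for _ in range(m_steps):
        texture = texture.detach().requires_grad_(True)
        loss = lightweight_loss(render_source(texture), source_image)
        grad, = torch.autograd.grad(loss, texture)
        texture = texture + updater(texture, grad)   # learned, not hand-tuned, step
    return texture

# Toy usage: the "renderer" just blurs the texture to stand in for the generator.
render_source = lambda tex: nn.functional.avg_pool2d(tex, 3, stride=1, padding=1)
texture0 = torch.rand(1, 3, 64, 64)
source = torch.rand(1, 3, 64, 64)
updated = lgd_finetune(texture0, render_source, source, TextureUpdater())
print(updated.shape)  # torch.Size([1, 3, 64, 64])
```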
The network G_upd is trained by back-propagation through all M steps. For training, the authors use the same objective that was used during the training of the base model. The authors evaluate it using a target frame x(t) and a generated frame x̂(t). It is important to highlight that L_upd is not used as an objective for the training of G_upd, but simply guides the updates to the texture. Also, the gradients with respect to this loss are evaluated on the source image, while the objective is calculated on a target image, which implies that the network has to produce updates for the whole texture, not just for the region "visible" in the source image. Lastly, while the authors do not propagate any gradients into the generator part of the base model, they keep training the discriminator using the same adversarial objective (4). Even though training the updater network jointly with the base generator is possible, and could lead to better quality (following the success of the model-agnostic meta-learning [3] method), the authors resort to two-stage training due to memory constraints.
Segmentation
Empirically, the authors have observed that the average area which the face occupies in the target images affects the performance of the proposed method. The warping from the texture coordinate space to the image space is trained in an unsupervised way and is heavily influenced by the dataset. For example, if there is a strong correlation between the frames in the training videos, there is no incentive for the texture generator to produce a proper texture with hallucinated features, given the objective to match a target image. In that case, the model can simply decode the source image from the embedding and produce good results. This failure case leads to poor extrapolation to novel viewpoints and can be triggered if, for example, the area of the background is significantly increased, since the background is heavily correlated between the source and target frames.
Therefore, when enlarging the crop size so that it fits a full head, it is necessary to perform foreground segmentation in order to filter out the training signal related to the background. The authors use a state-of-the-art face and body segmentation model [6] to obtain the ground-truth masks. Then, the authors predict a mask m̂(t) via the inference generator alongside its other outputs, and train it via a binary cross-entropy loss L_seg to match the ground-truth mask m(t). To filter out the training signal related to the background, the authors have explored multiple options. The authors cannot simply mask the gradients that are fed into the generator, because this would lead to overfitting of the discriminator. The authors also cannot apply the ground-truth masks to all the images in the dataset, since the model [6] works so well that it produces a sharp border between the foreground and the background, leading to the occurrence of border artifacts during training.
Instead, the authors found that using the predictions m̂(t) works much better. They are smooth and prevent the discriminator from overfitting either to the lack of background or to the sharpness of the border. The stop-gradient of the predicted mask, SG(m̂(t)), is used to mask all the synthesized images and the ground-truth image before the image-based losses are applied. The stop-gradient operator ensures that the training does not converge to a degenerate state.
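The masking scheme, i.e. training the predicted mask with a binary cross-entropy loss and applying its stop-gradient to both the generated and the ground-truth images before the image-based losses, can be sketched as follows (shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def segmentation_loss(mask_logits, gt_mask):
    # Train the predicted mask against the ground-truth segmentation.
    return F.binary_cross_entropy_with_logits(mask_logits, gt_mask)

def masked_for_losses(pred_image, target_image, mask_logits):
    # Soft predicted mask with a stop-gradient, applied to both images.
    mask = torch.sigmoid(mask_logits).detach()
    return pred_image * mask, target_image * mask

mask_logits = torch.randn(1, 1, 256, 256, requires_grad=True)
gt_mask = (torch.rand(1, 1, 256, 256) > 0.5).float()
pred = torch.rand(1, 3, 256, 256, requires_grad=True)
target = torch.rand(1, 3, 256, 256)

seg = segmentation_loss(mask_logits, gt_mask)
masked_pred, masked_target = masked_for_losses(pred, target, mask_logits)
pix = (masked_pred - masked_target).abs().mean()
(seg + pix).backward()
# The mask logits learn only via the BCE term; image losses cannot reshape the mask.
print(mask_logits.grad.abs().sum() > 0, pred.grad.abs().sum() > 0)
```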
Implementation details
The proposed networks consist of pre-activation residual blocks [9] with leaky ReLU activations. The minimum number of features in these blocks is set to 64 and the maximum to 512. In the default configuration, the authors use half the number of features in the inference generator, but the authors also evaluate the proposed model with full- and quarter-capacity inference parts, with the results provided in the experiments.
The authors use batch normalization [10] in all the networks except for the embedder and the texture updater. Inside the texture generator, the authors pair batch normalization with adaptive SPADE layers [25]. These layers are modified to predict pixelwise scale and bias coefficients using feature maps that are treated as model parameters, instead of being input from a different network. This allows saving memory by removing additional networks and intermediate feature maps from the optimization process, and increasing the batch size. Also, following [25], the authors predict the weights for all 1 x 1 convolutions in the network from the embeddings ê(s), which includes the scale and bias mappings in the AdaSPADE layers and the skip connections in the residual upsampling blocks. In the inference generator, the authors use standard adaptive batch normalization layers [1], but also predict the weights for the skip connections from the embeddings. For the vector pose embedding, a multi-layer perceptron is used, with its output reshaped into the convolutional part's input.
Simultaneous gradient descent is performed on the parameters of the generator networks and the discriminator using Adam [17] with a fixed learning rate. The authors use a weight of 0.5 for the adversarial losses (eq. 4-5) and 10 for all other losses, except for the VGGFace perceptual loss (eq. 10), whose weight is set to 0.01. The weight of the regularizer (eq. 11) is additionally multiplicatively reduced by 0.9 every 50 iterations. The proposed models are trained on 8 NVIDIA P40 GPUs with a batch size of 48 for the base model and a batch size of 32 for the updater model. The authors set the unrolling depth M of the updater to 4 and use a sum of squared elements as the lightweight objective. Batch normalization statistics are synchronized across all GPUs during training. During inference, they are replaced with "standing" statistics, similar to [1], which significantly improves the quality of the outputs compared to the usage of running statistics. Spectral normalization is also applied in all linear and convolutional layers of all networks.
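The training setup can be sketched as follows. The tiny networks are placeholders, and the learning rate is an assumed placeholder value because the exact value is not reproduced in this text; the loss weights and the use of spectral normalization in all linear and convolutional layers follow the description above.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def add_spectral_norm(module):
    # Apply spectral normalization to every linear and convolutional layer.
    for name, child in module.named_children():
        if isinstance(child, (nn.Linear, nn.Conv2d)):
            setattr(module, name, spectral_norm(child))
        else:
            add_spectral_norm(child)
    return module

generator = add_spectral_norm(nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.LeakyReLU(0.2), nn.Conv2d(64, 3, 3, padding=1)))
discriminator = add_spectral_norm(nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2), nn.Conv2d(64, 1, 4, 2, 1)))

LR = 2e-4   # placeholder: the exact learning rate is not reproduced in this text
LOSS_WEIGHTS = {"adv": 0.5, "fm": 10.0, "pix": 10.0, "vgg": 10.0, "vggface": 0.01}

opt_g = torch.optim.Adam(generator.parameters(), lr=LR)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=LR)

# One gradient step for D and one for G per iteration (toy hinge losses for illustration).
x = torch.rand(2, 3, 64, 64)
fake = generator(x)
opt_d.zero_grad()
loss_d = (torch.relu(1 - discriminator(x)).mean()
          + torch.relu(1 + discriminator(fake.detach())).mean())
loss_d.backward()
opt_d.step()

opt_g.zero_grad()
loss_g = -LOSS_WEIGHTS["adv"] * discriminator(fake).mean()
loss_g.backward()
opt_g.step()
print(loss_d.item(), loss_g.item())
```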
Please refer to the supplementary material for a detailed description of proposed model's architecture, as well as the discussion of training and architectural features that the authors have adopted.
The foregoing exemplary embodiments are examples and are not to be construed as limiting. In addition, the description of the exemplary embodiments is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.
References
1. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: 7th International Conference on Learning Representations, ICLR 2019 (2019)
2. Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538-1546 (2015)
3. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017 (2017)
4. Fu, C., Hu, Y., Wu, X., Wang, G., Zhang, Q., He, R.: High fidelity face manipulation with extreme pose and expression. arXiv preprint arXiv: 1903.12003 (2019)
5. Ganin, Y., Kononenko, D., Sungatullina, D., Lempitsky, V.: DeepWarp: Photorealistic image resynthesis for gaze manipulation. In: European Conference on Computer Vision, pp. 311-326. Springer (2016)
6. Gong, K., Gao, Y., Liang, X., Shen, X., Wang, M., Lin, L.: Graphonomy: Universal human parsing via graph transfer learning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019 (2019)
7. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014 (2014)
8. Ha, S., Kersner, M., Kim, B., Seo, S., Kim, D.: MarioNETte: Few-shot face reenactment preserving identity of unseen targets. CoRR abs/1911.08139 (2019)
9. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Computer Vision - ECCV 2016 - 14th European Conference (2016)
10. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015 (2015)
11. Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (2017)
12. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp. 2017-2025 (2015)
13. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision - ECCV 2016 - 14th European Conference (2016)
14. Jolicoeur-Martineau, A.: The relativistic discriminator: a key element missing from standard GAN. In: 7th International Conference on Learning Representations, ICLR 2019 (2019)
15. Kim, D., Chung, J.R., Jung, S.: GRDN: grouped residual dense network for real image denoising and GAN-based real-world noise modeling. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019 (2019)
16. Kim, H., Garrido, P., Tewari, A., Xu, W., Thies, J., Nießner, M., Pérez, P., Richardt, C., Zollhöfer, M., Theobalt, C.: Deep video portraits. arXiv preprint arXiv:1805.11714 (2018)
17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)
18. Lombardi, S., Saragih, J., Simon, T., Sheikh, Y.: Deep appearance models for face rendering. ACM Transactions on Graphics (TOG) 37(4), 68 (2018)
19. Mirza, M., Osindero, S.: Conditional generative adversarial nets. CoRR abs/1411.1784 (2014)
20. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: Proceedings of the British Machine Vision Conference 2015, BMVC 2015 (2015)
21. Siarohin, A., Lathuiliere, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019 (2019)
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014), http://arxiv.org/abs/1409.1556
23. Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36(4), 95 (2017)
24. Tripathy, S., Kannala, J., Rahtu, E.: ICface: Interpretable and controllable face reenactment using GANs. CoRR abs/1904.01909 (2019), http://arxiv.org/abs/1904.01909
25. Wang, T., Liu, M., Tao, A., Liu, G., Catanzaro, B., Kautz, J.: Few-shot video-to-video synthesis. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019 (2019)
26. Wang, T., Liu, M., Zhu, J., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018 (2018)
27. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Loy, C.C.: ESRGAN: enhanced super-resolution generative adversarial networks. In: Computer Vision - ECCV 2018 Workshops (2018)
28. Wiles, O., Sophia Koepke, A., Zisserman, A.: X2face: A network for controlling face generation using images, audio, and pose codes. In: The European Conference on Computer Vision (ECCV) (September 2018)
29. Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.S.: Few-shot adversarial learning of realistic neural talking head models. In: IEEE International Conference on Computer Vision, ICCV 2019 (2019)

Claims (10)

  1. A hardware, comprising software products that perform a method for generation of photorealistic images of a neural avatar in a one-shot mode by the following stages:
    a stage of initialization for the creation of the neural avatar, comprising the following steps:
    encoding a concatenation of a source image $x_s$ and a source pose $y_s$, encoded as a landmark image, into a stack of predicted embeddings $\hat{e}$ by an embedder network $E$;
    initializing adaptive parameters from the predicted embeddings $\hat{e}$ and decoding an inpainted high-frequency texture $\hat{X}$ of the source image by a texture generator network $G_{tex}$;
    creating the neural avatar by initializing the adaptive parameters of the texture generator network $G_{tex}$ using the embeddings $\hat{e}$, and predicting the texture $\hat{X}$ by the texture generator network $G_{tex}$;
    a stage of inference for generation of images of the neural avatar, comprising the following steps:
    initializing adaptive parameters of an inference generator network $G_{img}$ using the embeddings $\hat{e}$, and using a target pose $y_t$ to predict a low-frequency component $\hat{x}_{LF}$ of the image of the avatar and a warping field $\hat{\omega}$ by the inference generator network $G_{img}$, which generates the high-frequency component $\hat{x}_{HF}$ of the image of the avatar $\hat{x}$ by applying the warping field $\hat{\omega}$ to the texture $\hat{X}$, namely $\hat{x}_{HF} = \hat{\omega} \circ \hat{X}$,
    wherein the image of the avatar $\hat{x}$ is computed as a sum of the high-frequency component $\hat{x}_{HF}$ and the low-frequency component $\hat{x}_{LF}$, namely $\hat{x} = \hat{x}_{HF} + \hat{x}_{LF}$.
  2. The hardware according to claim 1, wherein the target pose $y_t$ is defined by the vector of face keypoint coordinates.
  3. The hardware according to claim 1, wherein the stage of initialization is done only once per avatar.
  4. The hardware according to claim 1, wherein the texture can be a high-frequency texture.
  5. The hardware according to claim 1, wherein the stage of initialization further comprises updating the high-frequency texture using a texture updater network that is trained to add person-specific details to the texture by observing the mismatch between the source image $x_s$ and the avatar image for the source pose $y_s$ obtained before the update of the texture.
  6. The hardware according to claim 1, wherein the warping field $\hat{\omega}$ is a mapping between the coordinate spaces of the texture and the image of the avatar.
  7. The hardware according to claim 1, wherein the embedder network $E$, the texture generator network $G_{tex}$, and the inference generator network $G_{img}$ are trained in an end-to-end fashion.
  8. The hardware according to claim 1, further comprising mapping a real or a synthesized target image, concatenated with the target pose, into realism scores by the discriminator network $D$.
  9. The hardware according to claim 1, wherein the target pose $y_t$ is obtained by an external landmark tracking process.
  10. The hardware according to claim 9, wherein the tracking process can be applied to another video sequence of the same or a different person, based on the voice signal of the person.
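For readability, the two stages recited in claim 1 can be summarized by the following Python sketch. The network objects, the `warp` helper based on grid sampling, and all argument names are assumptions made for illustration only; the sketch does not restate or limit the claims.

```python
# A minimal sketch of the bi-layer avatar pipeline described in the claims.
# The four networks and the `warp` helper are assumed placeholders; only the
# data flow (one-time initialization, then per-frame inference) follows the text.
import torch
import torch.nn.functional as F


def warp(texture, field):
    # Bilinear resampling of the texture with the predicted warping field,
    # i.e. a mapping between texture and image coordinate spaces (claim 6).
    return F.grid_sample(texture, field, align_corners=False)


def initialize_avatar(embedder, texture_generator, source_image, source_pose):
    """Initialization stage: run once per avatar (claim 3)."""
    # Embedder E: encode the concatenated source image and landmark image.
    embeddings = embedder(torch.cat([source_image, source_pose], dim=1))
    # Texture generator G_tex: adaptive parameters come from the embeddings;
    # the output is the inpainted high-frequency texture.
    texture = texture_generator(embeddings)
    return embeddings, texture


def render_frame(inference_generator, embeddings, texture, target_pose):
    """Inference stage: run for every target pose."""
    # Inference generator G_img: predicts the low-frequency component and the
    # warping field from the target pose, with adaptive parameters set from
    # the embeddings.
    low_freq, warp_field = inference_generator(target_pose, embeddings)
    # High-frequency component: the warped texture.
    high_freq = warp(texture, warp_field)
    # Final avatar image: sum of the two components.
    return low_freq + high_freq
```

Per claim 3, initialize_avatar would be called once per avatar, after which render_frame can be invoked for every new target pose, for example one obtained from an external landmark tracker (claim 9).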
PCT/KR2021/000795 2020-03-03 2021-01-20 Fast bi-layer neural synthesis of one-shot realistic images of neural avatar WO2021177596A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2020109348 2020-03-03
RU2020109348 2020-03-03
RU2020124828 2020-07-27
RU2020124828A RU2764144C1 (en) 2020-07-27 2020-07-27 Rapid two-layer neural network synthesis of realistic images of a neural avatar based on a single image

Publications (1)

Publication Number Publication Date
WO2021177596A1 true WO2021177596A1 (en) 2021-09-10

Family

ID=77613640

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/000795 WO2021177596A1 (en) 2020-03-03 2021-01-20 Fast bi-layer neural synthesis of one-shot realistic images of neural avatar

Country Status (1)

Country Link
WO (1) WO2021177596A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596764A (en) * 2023-07-17 2023-08-15 华侨大学 Lightweight image super-resolution method based on transform and convolution interaction
CN117710449A (en) * 2024-02-05 2024-03-15 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374242A1 (en) * 2016-12-01 2018-12-27 Pinscreen, Inc. Avatar digitization from a single image for real-time rendering
US20200051303A1 (en) * 2018-08-13 2020-02-13 Pinscreen, Inc. Real-time avatars using dynamic textures
US20200066029A1 (en) * 2017-02-27 2020-02-27 Metail Limited Method of generating an image file of a 3d body model of a user wearing a garment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374242A1 (en) * 2016-12-01 2018-12-27 Pinscreen, Inc. Avatar digitization from a single image for real-time rendering
US20200066029A1 (en) * 2017-02-27 2020-02-27 Metail Limited Method of generating an image file of a 3d body model of a user wearing a garment
US20200051303A1 (en) * 2018-08-13 2020-02-13 Pinscreen, Inc. Real-time avatars using dynamic textures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUANGXIAO GU; YUQIAN ZHOU; THOMAS HUANG: "FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 November 2019 (2019-11-21), 201 Olin Library Cornell University Ithaca, NY 14853, XP081536788 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596764A (en) * 2023-07-17 2023-08-15 华侨大学 Lightweight image super-resolution method based on transform and convolution interaction
CN116596764B (en) * 2023-07-17 2023-10-31 华侨大学 Lightweight image super-resolution method based on transform and convolution interaction
CN117710449A (en) * 2024-02-05 2024-03-15 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method
CN117710449B (en) * 2024-02-05 2024-04-16 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method

Similar Documents

Publication Publication Date Title
Zakharov et al. Few-shot adversarial learning of realistic neural talking head models
US11775829B2 (en) Generative adversarial neural network assisted video reconstruction
Yi et al. Audio-driven talking face video generation with learning-based personalized head pose
US11861936B2 (en) Face reenactment
WO2020096403A1 (en) Textured neural avatars
WO2020190083A1 (en) Electronic device and controlling method thereof
CN112037320B (en) Image processing method, device, equipment and computer readable storage medium
Ma et al. Styletalk: One-shot talking head generation with controllable speaking styles
WO2021177596A1 (en) Fast bi-layer neural synthesis of one-shot realistic images of neural avatar
WO2020150689A1 (en) Systems and methods for realistic head turns and face animation synthesis on mobile device
CN112949535B (en) Face data identity de-identification method based on generative confrontation network
WO2023085624A1 (en) Method and apparatus for three-dimensional reconstruction of a human head for rendering a human image
CN111428575A (en) Tracking method for fuzzy target based on twin network
Lin et al. Reconstruction algorithm for lost frame of multiview videos in wireless multimedia sensor network based on deep learning multilayer perceptron regression
US20210264207A1 (en) Image editing by a generative adversarial network using keypoints or segmentation masks constraints
Zhang et al. Deep learning in face synthesis: A survey on deepfakes
CN115914505B (en) Video generation method and system based on voice-driven digital human model
EP3874415A1 (en) Electronic device and controlling method thereof
Huang et al. Low light image enhancement network with attention mechanism and retinex model
CA3180427A1 (en) Synthesizing sequences of 3d geometries for movement-based performance
CN113421185B (en) StyleGAN-based mobile terminal face age editing method
Sun et al. Learning adaptive patch generators for mask-robust image inpainting
CN111640172A (en) Attitude migration method based on generation of countermeasure network
RU2764144C1 (en) Rapid two-layer neural network synthesis of realistic images of a neural avatar based on a single image
Huang et al. Perceptual conversational head generation with regularized driver and enhanced renderer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21763915

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21763915

Country of ref document: EP

Kind code of ref document: A1