WO2021096192A1 - Neural facial expressions and head poses reenactment with latent pose descriptors - Google Patents

Neural facial expressions and head poses reenactment with latent pose descriptors

Info

Publication number
WO2021096192A1
Authority
WO
WIPO (PCT)
Prior art keywords
pose
person
identity
reenactment
image
Prior art date
Application number
PCT/KR2020/015688
Other languages
French (fr)
Inventor
Egor Andreevich BURKOV
Victor Sergeevich LEMPITSKY
Igor Igorevich PASECHNIK
Original Assignee
Samsung Electronics Co., Ltd.
Priority date
Filing date
Publication date
Priority claimed from RU2020119034A external-priority patent/RU2755396C1/en
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2021096192A1 publication Critical patent/WO2021096192A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition

Definitions

  • The invention relates to computer graphics, telepresence (e.g. video conferencing), human head pose estimation, video production, face tracking, and augmented/virtual reality.
  • Face/head reenactment is an active area of research. A distinction is drawn here between works where changes and augmentations are localized within faces (face reenactment), e.g. [33, 30], and more ambitious approaches that model extended regions including a significant portion of the clothing, neck, and upper garment (head reenactment), e.g. [22, 37, 42].
  • Pose representation is an important aspect of reenactment systems. As mentioned above, most works drive reenactment using landmarks [33, 22, 37, 42, 10, 36]. Another approach is to use facial action units (AU) [9], as is done in face reenactment [30] and head reenactment [35]. Detecting action units still requires manual annotation and supervised learning. The X2Face system [39] uses latent vectors that are learned to be predictive of warping fields.
  • A more classic approach is to model face/head pose in the 3D morphable model (3DMM) framework [1] or using a similar approach in 2D (e.g. an active appearance model) [6]. Still, learning a 3DMM and fitting a learned 3DMM almost invariably involves detecting landmarks, thus inheriting many of the landmark deficiencies. Alternatively, a dataset of 3D scans is required to build a model for pose/identity disentanglement in the 3DMM framework.
  • Disentanglement can also be obtained by directly fitting factorized distributions to data (e.g. [23]).
  • Here and below, 'head pose' means the combination of head orientation and position, as well as facial expression.
  • The representation of the pose plays a key role in the quality of reenactment.
  • Most reenactment systems are based on a keypoint (landmark) representation.
  • The main advantage of such a representation is that robust and efficient "off-the-shelf" landmark detectors are now available [21, 2].
  • Face landmarks suffer from several shortcomings.
  • First, learning a landmark detector requires excessive annotation effort, and the sets of annotated landmarks often miss important aspects of the pose.
  • For example, many landmark annotations do not include eye pupils, and as a consequence, the reenactment will not have full control of the gaze.
  • Second, many of the landmarks do not have an anatomical basis, and their annotation is ambiguous and prone to errors, especially when they are occluded. In practice, such annotation ambiguity often translates into temporal instability of keypoint detection, which in turn shows up in the reenactment results.
  • Finally, landmarks are person-specific, as they contain a considerable amount of information about pose-independent head geometry.
  • The previous state-of-the-art reenactment system [42] is augmented with the ability to predict foreground segmentation. Such prediction is needed in scenarios such as telepresence, where transferring the original background to the new environment can be undesirable.
  • Proposed is an alternative to the warping-based approach [39, 38].
  • The proposed approach learns low-dimensional person-agnostic pose descriptors alongside medium-dimensional person-specific pose-independent descriptors, by imposing a set of reconstruction losses on video frames over a large collection of videos.
  • Proposed system modifies and expands the reenactment model of Zakharov et al. [42]. First, the ability to predict the segmentation is added. Second, the system learns to perform reenactment based on latent pose vectors rather than keypoints.
  • A simple learning framework based on sampling multiple random frames from the same video, paired with the large size of the video dataset, makes it possible to learn extractors for both descriptors that work very well for reenactment tasks, including cross-person reenactment.
  • The proposed reenactment based on the new latent pose representation preserves the identity of the target person much better than when the FAb-Net [38] and X2Face [39] pose descriptors are used. Additionally, the quality of the learned latent pose descriptors is analyzed for tasks such as landmark prediction and pose-based retrieval.
  • A rich representation of a human head/face pose and expression is a crucial ingredient in a large number of human-centric computer vision and graphics tasks. Learning such representations from data without relying on human annotation is an attractive approach, as it may utilize vast unlabeled datasets of human videos.
  • Proposed is a new and simple method that performs such learning. Unlike previous works, the proposed learning results in person-agnostic descriptors that capture human pose yet can be transferred from person to person, which is particularly useful for applications such as face reenactment. Alongside face reenactment, the proposed descriptors are also shown to be useful for other downstream tasks, such as face orientation estimation.
  • proposed system can successfully be used for generating videos as well. Even when each video frame is generated independently and without any temporal processing, proposed system exhibits temporally smooth facial expressions, thanks to pose augmentations and latent pose representation. Previous methods that used facial keypoints to drive reenactment would inherit shakiness from keypoint detectors.
  • Proposed is a hardware device comprising a software product that performs a method for neural facial expression and head pose reenactment, comprising: an identity encoder unit configured to obtain the identity descriptor from person A's image; a pose encoder unit configured to obtain the descriptor of head pose and facial expression from person B's image, wherein the output of the pose encoder unit contains information neither about person A's identity nor about person B's identity; and a generator unit which receives the outputs of the identity encoder unit and the pose encoder unit, wherein the generator unit is configured to synthesize the avatar of person A having the head pose and facial expression of person B.
  • The pose encoder unit is a convolutional neural network, which takes a human image as input and outputs a vector that describes the head pose and facial expression and does not describe the person's identity.
  • person B's identity refers to person B's skin color, facial shape, eye color, clothing, and adornments.
  • Proposed is a method for synthesizing a photorealistic avatar of a person, comprising: obtaining the identity descriptor from person A's image using the identity encoder unit; obtaining the descriptor of head pose and facial expression from person B's image using the pose encoder unit; and synthesizing the avatar of person A, having the head pose and facial expression of person B, by the generator unit, which receives the outputs of the identity encoder unit and the pose encoder unit.
  • The pose encoder unit is a convolutional neural network, which takes a human image as input and outputs a vector that describes the head pose and facial expression and does not describe the person's identity.
  • The identity comprises skin color, facial shape, eye color, clothing, and adornments.
  • proposed system can successfully be used for generating videos as well. Even when each video frame is generated independently and without any temporal processing, proposed system exhibits temporally smooth facial expressions, thanks to pose augmentations and latent pose representation. Previous methods that used facial keypoints to drive reenactment would inherit shakiness from keypoint detectors.
  • Figure 1 illustrates using arbitrary people as facial expression and head pose drivers (top row) for generating realistic reenactments of arbitrary talking heads (such as Mona Lisa, bottom row).
  • Figure 2A illustrates proposed training pipeline (discriminator not shown for simplicity).
  • FIG. 2B illustrates using proposed method.
  • Figure 3 illustrates evaluation of reenactment systems in terms of their ability to represent the driver pose and to preserve the reference identity (arrows point towards improvement).
  • Figure 4 illustrates comparison of cross-person reenactment for several systems on VoxCeleb2 test set.
  • Figure 5 illustrates reenactment by interpolation between two pose vectors along a spherical trajectory in the pose descriptor space.
  • Figure 6 illustrates additional comparison of cross-person reenactment for several systems on the VoxCeleb2 test set.
  • Figure 7 illustrates quantitative evaluation of how ablating several important features of the training setup impacts proposed system.
  • Figure 8 illustrates comparison of cross-person reenactment for proposed best model and its ablated versions.
  • Figure 9 illustrates the effect of pose augmentations on X2Face+ and FAb-Net+ models. Without augmentations, the identity gap becomes conspicuous.
  • Proposed is a Neural facial expressions and head poses Reenactment system, which is driven by a latent pose representation and is capable of predicting the foreground segmentation alongside the RGB image.
  • the latent pose representation is learned as a part of the entire reenactment system, and the learning process is based solely on image reconstruction losses.
  • the learned descriptors are useful for other pose-related tasks, such as keypoint prediction and pose-based retrieval.
  • Figure 1 illustrates using arbitrary people as facial expression and head pose drivers (top row) for generating realistic reenactments of arbitrary talking heads (such as Mona Lisa, bottom row).
  • Proposed system can generate realistic reenactments of arbitrary talking heads (such as Mona Lisa) using arbitrary people as facial expression and head pose drivers (top row).
  • the method can successfully decompose pose and identity, so that the identity of the reenacted person is preserved.
  • The invention provides fast estimation of a comprehensive and flexible representation (descriptor) of head pose and facial expression from a single human head image.
  • The invention provides estimation of a person's head pose (yaw/roll/pitch angles) and facial expression (including eye gaze direction) from an image of that person.
  • the invention can be applied in:
  • AR/VR systems to determine head pose or eye gaze for rendering of virtual objects from a correct viewpoint.
  • The proposed system modifies and expands the reenactment model of Zakharov et al. [42]. First, the ability to predict the segmentation is added.
  • Second, the system learns to perform reenactment based on latent pose vectors rather than keypoints.
  • there is a "meta-learning" stage when a big model responsible for reproducing all people in the dataset is trained through a sequence of training episodes, and a fine-tuning stage, when that "meta-model" is fine-tuned to a tuple of images (or a single image) of a particular person.
  • FIG. 2A: at each step of meta-learning, the proposed system samples a set of frames from a video of a person.
  • the frames are processed by two encoders.
  • the bigger identity encoder is applied to several frames of the video, while the smaller pose embedder is applied to a hold-out frame.
  • Hold-out data is usually defined as a part of data that is purposely not shown to (i.e. kept away from) a certain model. In this case, a hold-out frame is just another frame of the same person.
  • Term "hold-out” emphasizes that this frame is ensured to not being fed into the identity encoder, only to pose encoder.
  • the obtained embeddings are passed to the generator network, whose goal is to reconstruct the last (hold-out) frame.
  • Having an identity encoder that is more capacious than the pose encoder is very important for disentangling pose from identity in the latent space of pose embeddings, which is a key component of the present invention. It is no less important to have a very tight bottleneck in the pose encoder, which in the proposed case is implemented via a lower-capacity neural network in the pose encoder and a smaller dimensionality of the pose embeddings than that of the identity embeddings. This forces the pose encoder to encode only the pose information into the pose embeddings and to disregard the identity information. Since the capacity of the pose encoder is limited, and since its input does not exactly match the other frames w.r.t. identity (thanks to data augmentation), the system learns to extract all pose-independent information through the identity encoder and uses the smaller encoder to capture only pose-related information, thus achieving pose-identity disentanglement.
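  • As a non-limiting illustration of the meta-learning step described above, one training episode could be organized along the following lines. This is a PyTorch-style sketch under assumed interfaces for the encoder, generator and augmentation units; the adversarial and perceptual loss terms of the full system are omitted, and the predicted mask would be supervised by the dice loss sketched further below.

```python
import torch
import torch.nn.functional as F

def meta_learning_step(identity_encoder, pose_encoder, generator,
                       identity_frames, holdout_frame, holdout_mask,
                       pose_augment, optimizer):
    """One meta-learning episode (illustrative sketch, not the exact implementation).

    identity_frames: (K, 3, 256, 256) identity source frames of one person
    holdout_frame:   (1, 3, 256, 256) pose source (hold-out) frame of the same person
    holdout_mask:    (1, 1, 256, 256) ground-truth foreground mask of that frame
    """
    # Identity embeddings x_1..x_K (d_i = 512) are averaged into a single identity vector.
    identity_embedding = identity_encoder(identity_frames).mean(dim=0, keepdim=True)

    # The hold-out frame is pose-augmented before entering the low-capacity pose encoder (d_p = 256).
    pose_embedding = pose_encoder(pose_augment(holdout_frame))

    # The generator reconstructs the hold-out frame (RGB) and its foreground mask
    # from the two embeddings.
    rgb_pred, mask_pred = generator(identity_embedding, pose_embedding)

    # The reconstruction loss is evaluated on the foreground only, so that background
    # clutter does not affect the learned descriptors (here: a plain L1 term; the
    # adversarial and perceptual terms, and the dice loss on mask_pred, are omitted).
    loss = F.l1_loss(rgb_pred * holdout_mask, holdout_frame * holdout_mask)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```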
  • Proposed is a method for neural facial expression and head pose reenactment, i.e. an algorithm for synthesizing a photorealistic avatar of person A in which the facial expression and head pose are taken from an image of person B (as illustrated in Fig. 2A).
  • Figure 2B illustrates the inference pipeline of the proposed method, i.e. the algorithm of prediction/rendering with proposed system after it has been trained.
  • the generator unit is trained simultaneously with all other units.
  • the identity encoder unit predicts identity embeddings from several images of a person
  • the pose encoder unit predicts a pose embedding from an extra image of that person.
  • the generator unit consumes averaged identity embeddings and the pose embedding, and is trained to output the image that was fed into the pose encoder unit, as well as the foreground mask for that image.
  • Novel in this pipeline are (1) the pose encoder unit that is not directly supervised during training, (2) the use of foreground mask as a target, (3) pose augmentations (identity-preserving distortions of the input into the pose encoder unit) as a technique to reach pose-identity disentanglement.
  • The identity embedding is computed as an average of the identity encoder outputs over all available images of person A. After that, the identity encoder is discarded and the rest of the system is fine-tuned on the images of person A just like during the meta-learning stage, with the following differences: (1) there is only one person in the training set, (2) the pose encoder's weights are kept frozen, (3) the identity embedding is converted to a trainable parameter. Finally, an image of any person B is passed through the pose encoder, and the predicted rendering and segmentation are generated as usual (i.e. as during training) from the resulting pose embedding and the fine-tuned identity embedding.
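  • The fine-tuning stage described above could be set up as in the following sketch (the unit names, the optimizer and the learning rate are assumptions; note that the description also states that keeping the identity embedding fixed gives equivalent results).

```python
import itertools
import torch

def prepare_fine_tuning(identity_encoder, pose_encoder, generator, adain_mlp,
                        person_a_images, lr=1e-4):
    """Set up one-person fine-tuning: average the identity embedding, freeze the
    pose encoder, and optimize the generator, the AdaIN MLP and the embedding."""
    with torch.no_grad():
        # (1) The identity embedding is the average of the identity encoder outputs
        # over all available images of person A; the identity encoder is then discarded.
        identity_embedding = identity_encoder(person_a_images).mean(dim=0, keepdim=True)
    identity_embedding = torch.nn.Parameter(identity_embedding.clone())

    # (2) The pose encoder's weights are kept frozen during fine-tuning.
    for parameter in pose_encoder.parameters():
        parameter.requires_grad_(False)

    # (3) The identity embedding becomes a trainable parameter, optimized together
    # with the generator and the MLP that produces the AdaIN coefficients.
    trainable = itertools.chain(generator.parameters(), adain_mlp.parameters(),
                                [identity_embedding])
    optimizer = torch.optim.Adam(trainable, lr=lr)
    return identity_embedding, optimizer
```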
  • The "identity encoder" program unit is used to obtain the identity descriptor (skin color, facial shape, eye color, clothing, adornments, ...) from person A's image (for example, using the system described in the "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models" paper).
  • The "pose encoder" unit is a trainable machine learning model (e.g. a convolutional neural network). It takes a human image as input and outputs a vector that describes the head pose and facial expression and does not describe the person's identity.
  • Training algorithm for that model: in the training pipeline of the "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models" method, replace the "RGB & landmarks" input unit with the proposed "pose encoder" unit. Next, train the whole system as described in that paper, with the additional techniques described below.
  • The generator has no downsampling blocks (it does not follow an image-to-image architecture), as it starts from a constant learnable tensor rather than from a full-sized image; the pose embedding is additionally concatenated to the input of the MLP that predicts the AdaIN parameters; the identity encoder is of a different architecture (ResNeXt-50 32x4d).
  • Proposed are random pose augmentations applied to the facial expression and head pose driver; the generator predicts the foreground segmentation mask in an extra output channel, and a dice loss is applied to match that prediction to the ground truth.
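  • For reference, a standard soft dice loss of the kind mentioned above can be written as follows; the exact variant and smoothing constant used in the described system are not specified, so this is an illustrative sketch.

```python
import torch

def dice_loss(pred_mask: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft dice loss between the predicted foreground-mask channel and the ground-truth mask.

    Both inputs are expected to contain values in [0, 1] and to have the same shape,
    e.g. (N, 1, 256, 256).
    """
    pred = pred_mask.flatten()
    gt = gt_mask.flatten()
    intersection = (pred * gt).sum()
    dice_coefficient = (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
    return 1.0 - dice_coefficient
```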
  • The output of the "pose encoder" does not contain information about person B's identity; that output is therefore depersonalized, which is better in terms of information security.
  • The meta-learning stage is described in the discussion below.
  • The first K images I_1, ..., I_K are then fed into a relatively high-capacity convolutional network F, which is called the identity encoder. It is analogous to the embedder network in [42], with the exception that it does not accept keypoints as an input.
  • The identity embeddings are expected to contain the pose-independent information about the person (including lighting, clothing, etc.). Given K frames, a single identity vector is obtained by taking the mean of x_1, ..., x_K.
  • The remaining image I_{K+1} (the pose source) first undergoes a random pose augmentation transformation A, which is described below. Then, A(I_{K+1}) is passed through a network of much lower capacity, which is called the pose encoder and denoted G.
  • The pose encoder outputs a d_p-dimensional pose embedding y, which is intended to be a person-agnostic pose descriptor.
  • The transformation A mentioned above is important for pose-identity disentanglement. It keeps the person's pose intact but may alter their identity. Namely, it randomly scales the image independently along the horizontal and the vertical axes, and randomly applies content-preserving operations such as blur, sharpening, contrast change, or JPEG compression.
  • A is called a pose augmentation since it is applied to the pose source, and it can be regarded as a form of data augmentation.
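  • An illustrative sketch of such a pose augmentation is given below; only the types of operations are taken from the description, while the probabilities, parameter ranges and border handling are assumptions.

```python
import io
import random
from PIL import Image, ImageEnhance, ImageFilter

def pose_augmentation(image: Image.Image) -> Image.Image:
    """Identity-altering but pose-preserving distortions of the pose source frame."""
    width, height = image.size

    # Random anisotropic scaling: stretch the two axes independently so that facial
    # proportions (identity cues) change while the pose stays the same, then paste
    # the result back onto a canvas of the original resolution (clipping or padding).
    sx, sy = random.uniform(0.8, 1.2), random.uniform(0.8, 1.2)
    stretched = image.resize((max(1, int(width * sx)), max(1, int(height * sy))), Image.BILINEAR)
    canvas = Image.new("RGB", (width, height))
    canvas.paste(stretched, ((width - stretched.width) // 2, (height - stretched.height) // 2))
    image = canvas

    # Content-preserving photometric operations, each applied with some probability:
    # blur, sharpening, contrast change, JPEG compression.
    if random.random() < 0.3:
        image = image.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 2.0)))
    if random.random() < 0.3:
        image = image.filter(ImageFilter.UnsharpMask())
    if random.random() < 0.3:
        image = ImageEnhance.Contrast(image).enhance(random.uniform(0.7, 1.3))
    if random.random() < 0.3:
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG", quality=random.randint(30, 90))
        buffer.seek(0)
        image = Image.open(buffer).convert("RGB")
    return image
```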
  • The pose and the identity embeddings are passed to the generator network, which tries to reconstruct the image I_{K+1} as accurately as possible.
  • The authors of [42] used rasterized keypoints (stickman images) to pass the pose into their generator network.
  • In contrast, the proposed system relies entirely on the AdaIN [16] mechanism to pass both the pose and the identity embeddings to the generator.
  • The proposed upsampling generator starts with a constant learnable tensor of size 512 × 4 × 4 and outputs two tensors, of size 3 × 256 × 256 and of size 1 × 256 × 256, which it tries to match to the foreground part of the image I_{K+1} and to its segmentation mask S_{K+1}, respectively. This is achieved by simply predicting a single 4 × 256 × 256 tensor.
  • The AdaIN blocks are inserted after each convolution.
  • The AdaIN coefficients are produced by taking the concatenated pose and identity embeddings and passing this (d_i + d_p)-dimensional vector through an MLP with learnable parameters, in the spirit of StyleGAN [20].
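  • A minimal sketch of an AdaIN block of this kind is shown below. In the described system, the scales and biases for all AdaIN layers come from a single shared spectral-normalized ReLU MLP (one hidden layer of 768 units) applied to the concatenated (d_i + d_p)-dimensional vector; the per-block linear head used here is a simplification for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """Adaptive instance normalization: each feature map is instance-normalized and then
    re-modulated with a per-channel scale and bias produced from a style vector."""
    def __init__(self, num_channels: int, style_dim: int):
        super().__init__()
        self.to_scale_bias = nn.utils.spectral_norm(nn.Linear(style_dim, 2 * num_channels))

    def forward(self, features: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        normalized = F.instance_norm(features)                   # (N, C, H, W)
        scale, bias = self.to_scale_bias(style).chunk(2, dim=1)  # (N, C) each
        return normalized * (1.0 + scale[:, :, None, None]) + bias[:, :, None, None]

# Usage sketch: the style vector is the concatenation of the identity (d_i = 512)
# and pose (d_p = 256) embeddings.
style = torch.cat([torch.randn(1, 512), torch.randn(1, 256)], dim=1)
adain = AdaIN(num_channels=64, style_dim=512 + 256)
modulated = adain(torch.randn(1, 64, 32, 32), style)
```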
  • the model can be used to fit new identities unseen during meta-learning.
  • Their identity vector can be extracted by passing those images through the identity encoder and averaging the results element-wise. Then, by plugging in a pose vector y extracted from an image of the same or of a different person, the person can be reenacted by computing the output image and its foreground mask from the identity vector and y.
  • The estimated identity embedding is kept fixed during the fine-tuning (including it in the optimization did not result in any difference in the experiments, since the number of parameters in the embedding is much smaller than in the MLP and the generator network).
  • the pose embedding network G is also kept fixed during the fine-tuning.
  • The proposed training dataset is a collection of YouTube videos from VoxCeleb2 [4]. There are on the order of 100,000 videos of about 6,000 people. One of every 25 frames is sampled from each video, leaving around seven million training images in total. In each image, the annotated face is re-cropped by first capturing its bounding box with the S3FD detector [43], then making that box square by enlarging the smaller side, growing the box's sides by 80% while keeping the center, and finally resizing the cropped image to 256 × 256.
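  • The re-cropping procedure described above could be implemented as in the following sketch; the face box is assumed to come from an external detector such as S3FD, and the behaviour at image borders (black padding) is an assumption.

```python
from PIL import Image

def recrop_face(image: Image.Image, box, out_size: int = 256) -> Image.Image:
    """Re-crop a detected face: make the detector box square by enlarging its smaller
    side, grow both sides by 80% around the same center, and resize to out_size.

    box: (left, top, right, bottom) face bounding box from a detector such as S3FD.
    """
    left, top, right, bottom = box
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0

    # Square box from the larger side, then grow both sides by 80% (factor 1.8).
    side = max(right - left, bottom - top) * 1.8

    crop_box = (round(cx - side / 2), round(cy - side / 2),
                round(cx + side / 2), round(cy + side / 2))
    # PIL fills the parts of the crop that fall outside the image with black.
    return image.crop(crop_box).resize((out_size, out_size), Image.BILINEAR)
```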
  • The pose encoder has the MobileNetV2 architecture [31] and the identity encoder is a ResNeXt-50 (32 × 4d) [41]. Neither has been tweaked, and so they include batch normalization [18].
  • The pose and identity embedding sizes, d_p and d_i, are 256 and 512, respectively. No normalization or regularization is applied to the embeddings.
  • the module that transforms them into AdaIN parameters is a ReLU perceptron with spectral normalization and one hidden layer of 768 neurons.
  • The proposed generator is based on that of [42], but without downsampling blocks, since all inputs are delegated to the AdaINs, which are located after each convolution. More precisely, a 512 × 4 × 4 learnable constant tensor is transformed by 2 constant-resolution residual blocks, followed by 6 upsampling residual blocks. The number of channels starts halving from the fourth upsampling block, so that the tensor of final resolution (256 × 256) has 64 channels. That tensor is passed through an AdaIN layer, a ReLU, a 1 × 1 convolution and a tanh, becoming a 4-channel image. Unlike [42], self-attention is not used. Spectral normalization [28] is employed everywhere in the generator, the discriminator and the MLP.
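  • The layout above can be summarized by the following skeleton. Spectral normalization and the AdaIN conditioning after every convolution are omitted, and the channel counts of the intermediate blocks are assumptions; only the 512-channel 4 × 4 constant input, the 2 + 6 block structure and the 64-channel 256 × 256 output stated above are taken from the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualUpBlock(nn.Module):
    """Simplified residual block; optionally upsamples its input by a factor of 2."""
    def __init__(self, in_channels: int, out_channels: int, upsample: bool):
        super().__init__()
        self.upsample = upsample
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.skip = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):
        if self.upsample:
            x = F.interpolate(x, scale_factor=2, mode="nearest")
        residual = self.conv2(F.relu(self.conv1(F.relu(x))))
        return residual + self.skip(x)

class GeneratorSkeleton(nn.Module):
    """Constant 512x4x4 input, 2 constant-resolution blocks, 6 upsampling blocks
    (channels halve from the fourth one), final ReLU / 1x1 conv / tanh head producing
    a 4-channel (RGB + mask) 256x256 output. AdaIN conditioning on the identity and
    pose embeddings is omitted in this sketch."""
    def __init__(self):
        super().__init__()
        self.constant = nn.Parameter(torch.randn(1, 512, 4, 4))
        channels = [512, 512, 512, 512, 256, 128, 64]   # resolutions 4 -> 8 -> ... -> 256
        blocks = [ResidualUpBlock(512, 512, upsample=False) for _ in range(2)]
        blocks += [ResidualUpBlock(channels[i], channels[i + 1], upsample=True) for i in range(6)]
        self.blocks = nn.ModuleList(blocks)
        self.to_rgb_and_mask = nn.Conv2d(64, 4, kernel_size=1)

    def forward(self, batch_size: int = 1):
        x = self.constant.expand(batch_size, -1, -1, -1)
        for block in self.blocks:
            x = block(x)                     # in the full system, AdaIN follows each convolution
        out = torch.tanh(self.to_rgb_and_mask(F.relu(x)))
        rgb, mask = out[:, :3], out[:, 3:]   # the mask channel would typically be remapped to [0, 1]
        return rgb, mask
```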
  • Proposed quantitative evaluation assesses both the relative performance of the pose descriptors using auxiliary tasks, and the quality of cross-person reenactment.
  • the ablation study in the supplementary material shows the effect of different components of proposed method.
  • Proposed invention: the 256-dimensional latent pose descriptors learned within the proposed system.
  • FAb-Net: the 256-dimensional FAb-Net descriptors [38], evaluated as a pose representation. They are related to the proposed invention in that, although not person-agnostic, they are also learned in an unsupervised way from the VoxCeleb2 video collection.
  • 3DMM: this system extracts a decomposed rigid pose, a face expression, and a shape descriptor using a deep network.
  • The pose descriptor is obtained by concatenating the rigid pose rotation (represented as a quaternion) and the face expression parameters (29 coefficients).
  • The proposed descriptor is learned from the VoxCeleb2 dataset.
  • The X2Face descriptor is trained on the smaller VoxCeleb1 dataset [29], and FAb-Net is learned from both.
  • the 3DMM descriptors are most heavily supervised, as the 3DMM is learned from 3D scans and requires a landmark detector (which is in turn learned in a supervised setting).
  • X2Face: the X2Face system [39] with its native descriptors and warping-based reenactment.
  • X2Face+: the frozen pre-trained driving network of X2Face (up to the driving vector) is used instead of the proposed pose encoder, and the rest of the architecture is kept unchanged from the proposed system. The identity encoder, the generator (conditioned on the X2Face latent pose vector and the proposed identity embedding), and the projection discriminator are trained.
  • FAb-Net+: same as X2Face+ but with a frozen FAb-Net in place of the proposed pose encoder.
  • 3DMM+: same as X2Face+ but with a frozen ExpNet [3] in place of the proposed pose encoder, and with pose augmentations disabled.
  • the pose descriptor is constructed from ExpNet's outputs as described above.
  • The evaluation uses the Multi-PIE dataset [13], which is not used for training any of the descriptors but has six emotion class annotations for people in various poses.
  • the dataset is restricted to near-frontal and half-profile camera orientations (namely 08_0, 13_0, 14_0, 05_1, 05_0, 04_1, 19_0), leaving 177,280 images.
  • The authors randomly choose a query image from it and fetch the closest N images from the same group using the cosine similarity of descriptors. A returned image is counted as a correct match if it shows a person with the same emotion label.
  • Table 1 shows the overall ratio of correct matches within top-10, top-20, top-50, and top-100 lists.
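  • Under one possible reading of the protocol above, the per-query retrieval accuracy could be computed as follows (an illustrative sketch; the grouping of images and the averaging over queries are omitted).

```python
import numpy as np

def retrieval_accuracy(descriptors: np.ndarray, emotion_labels: np.ndarray,
                       query_index: int, top_n: int) -> float:
    """Fraction of the top_n nearest neighbours (by cosine similarity of pose descriptors)
    that share the query's emotion label.

    descriptors:    (num_images, descriptor_dim) pose descriptors of one group
    emotion_labels: (num_images,) integer emotion labels
    """
    normalized = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    similarities = normalized @ normalized[query_index]
    similarities[query_index] = -np.inf                      # exclude the query itself

    top_matches = np.argsort(-similarities)[:top_n]
    return float(np.mean(emotion_labels[top_matches] == emotion_labels[query_index]))
```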
  • For the 3DMM descriptor, only the 29 face expression coefficients are considered, and the rigid pose information is ignored as irrelevant for emotions.
  • Table 1 The accuracy of pose (expression)-based retrieval results using different pose descriptors on the Multi-PIE dataset. See text for more details.
  • The identity error I_T estimates how closely the resulting talking heads resemble the original person k that the model was learned for.
  • The ArcFace [7] face recognition network R, which outputs identity descriptors (vectors), is used.
  • The averaged reference descriptor is computed from the fine-tuning dataset, and the cosine similarity (csim) is used to compare it with the descriptors obtained from the cross-person reenactment results.
  • Cross-person reenactment is performed by driving T_k with all of the other 29 people. To obtain the final error, (one minus) the similarity is averaged over all 30 people in the test set.
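  • The identity error I_T for a single target person could thus be computed as in the sketch below; obtaining the ArcFace descriptors themselves is assumed to happen outside this function.

```python
import numpy as np

def identity_error(reference_descriptors: np.ndarray,
                   reenactment_descriptors: np.ndarray) -> float:
    """One minus the cosine similarity between the averaged ArcFace reference descriptor
    (computed from the fine-tuning frames) and the ArcFace descriptors of the
    cross-person reenactment results, averaged over those results.

    reference_descriptors:   (num_reference_frames, d) ArcFace descriptors of person k
    reenactment_descriptors: (num_reenactments, d) ArcFace descriptors of reenactments of T_k
    """
    reference = reference_descriptors.mean(axis=0)
    reference /= np.linalg.norm(reference)

    normalized = reenactment_descriptors / np.linalg.norm(reenactment_descriptors,
                                                          axis=1, keepdims=True)
    return float(np.mean(1.0 - normalized @ reference))
```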
  • The pose reconstruction error P_T is designed to quantify how well the system replays the driver's pose and facial expression, and is defined in terms of facial landmarks. Since sets of landmarks can only be compared directly for the same person, the test dataset is restricted to self-reenactment pairs, i.e. T_k is driven with images of person k. However, because T_k has been learned on the fine-tuning frames, another 32 hold-out frames from the same video are used to avoid overfitting. An off-the-shelf 2D facial landmark prediction algorithm [2], denoted L, is employed to obtain landmarks in both the driver and the reenactment result.
  • A measure d_landmarks(l_1, l_2) of how closely landmarks l_2 approximate reference landmarks l_1 is the average distance between corresponding landmarks, normalized by the inter-ocular distance.
  • d_landmarks is computed for all drivers and averaged across all 30 people to obtain P_T.
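  • The normalized landmark distance d_landmarks can be written as in the sketch below; which indices correspond to the eyes depends on the landmark layout of the detector and is an assumption here.

```python
import numpy as np

def landmark_distance(reference_landmarks: np.ndarray, predicted_landmarks: np.ndarray,
                      left_eye_index: int, right_eye_index: int) -> float:
    """d_landmarks(l_1, l_2): mean distance between corresponding 2D landmarks,
    normalized by the inter-ocular distance of the reference landmarks.

    Both landmark arrays have shape (num_landmarks, 2).
    """
    per_point = np.linalg.norm(reference_landmarks - predicted_landmarks, axis=1)
    inter_ocular = np.linalg.norm(reference_landmarks[left_eye_index]
                                  - reference_landmarks[right_eye_index])
    return float(per_point.mean() / inter_ocular)
```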
  • Figure 3 illustrates evaluation of reenactment systems in terms of their ability to represent the driver pose and to preserve reference identity (arrows point towards improvement).
  • The horizontal and the vertical axes correspond to the identity error I_T and the pose reconstruction error P_T, respectively, both computed from a subset of the test part of the VoxCeleb2 dataset.
  • Figure 4 illustrates comparison of cross-person reenactment for several systems on VoxCeleb2 test set.
  • the top left image is one of the 32 identity source frames
  • The top left image shows one of the images defining the target identity from the VoxCeleb2 test split, i.e. the person whom we would like to render with a different facial expression and head pose.
  • Other images in this row are also from the VoxCeleb2 test split, and define the pose to transfer to the target person.
  • the other images in the top row are facial expression and head pose drivers. Proposed method better preserves the identity of the target person and successfully transfers the mimics from the driver person.
  • Each of the remaining rows corresponds to one of the compared neural facial expression and head pose reenactment methods, with "Proposed invention" being the proposed method and the other rows corresponding to the methods in Figure 3. Shown are the reenactment outputs of the respective method.
  • Figure 4 gives a qualitative comparison of the reenactment systems described above. It is evident that FSTH, being driven by rasterized landmarks, relies heavily on the driver's facial proportions, and thus is not person-agnostic. Its modified version FSTH+ does a better job, having more representational power around vectorized keypoints; still, there is visible "identity bleeding" (e.g. compare the head width in columns 1 and 2) and there are errors in prominent facial expressions, such as closing the eyes. The warping-based method X2Face already fails on slight rotations.
  • The 3DMM+ method has a very tight bottleneck of interpretable parameters, and therefore its identity gap is very small. However, apparently for the same reason, it is not as good at rendering correct subtle facial expressions. The proposed full system is able to accurately represent the driver's facial expression and head pose while preserving the identity of the target person.
  • Figure 5 illustrates reenactment by interpolation between two pose vectors along a spherical trajectory in the pose descriptor space.
  • Each row shows reenactment outputs from the proposed invention for some person from the VoxCeleb2 test split.
  • the first and the last images in each row are computed by using some pose source frames from VoxCeleb2 test split (not shown in the figure), mapped by pose encoder to pose embeddings A and B respectively.
  • the images in between are obtained by running the proposed invention as usual, except that the pose embedding is obtained by spherical interpolation between A and B, rather than being computed by pose encoder from some image.
  • the pose embeddings in columns 2-5 are slerp(A, B; 0.2), slerp(A, B; 0.4), slerp(A, B; 0.6), slerp(A, B; 0.8).
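  • The spherical interpolation used for these trajectories follows the standard slerp formula, sketched below; whether the pose embeddings are normalized before interpolation is not specified in this document and is left as an assumption.

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-7) -> np.ndarray:
    """Spherical linear interpolation between two pose embeddings a and b at parameter t."""
    a_unit = a / (np.linalg.norm(a) + eps)
    b_unit = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(np.dot(a_unit, b_unit), -1.0, 1.0))  # angle between the vectors
    if omega < eps:               # nearly parallel vectors: fall back to linear interpolation
        return (1.0 - t) * a + t * b
    sin_omega = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / sin_omega) * a + (np.sin(t * omega) / sin_omega) * b

# The intermediate columns of Figure 5 correspond to slerp(A, B, t) for t = 0.2, 0.4, 0.6, 0.8.
```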
  • Figure 6 illustrates additional comparison of cross-person reenactment for several systems on VoxCeleb2 test set.
  • The top left image shows one of the images defining the target identity from the VoxCeleb2 test split, i.e. the person who is to be rendered with a different facial expression and head pose.
  • Other images in this row are also from the VoxCeleb2 test split, and define the pose to transfer to the target person.
  • Each of the remaining rows corresponds to one of the compared neural facial expression and head pose reenactment methods, with "Proposed invention" being the proposed method and the other rows corresponding to the methods in Figure 3. Shown are the reenactment outputs of the respective method.
  • Temporal smoothness: the supplementary video demonstrates the capability of the proposed descriptor to create temporally smooth reenactment without any temporal smoothing of the extracted pose (provided that the results of bounding box detection are temporally smooth). At the same time, achieving temporally smooth reenactment with keypoint-driven systems (FSTH, FSTH+) requires a lot of keypoint smoothing.
  • X2Face+ and FAb-Net+ are altered by removing pose augmentations, and the observed effects are discussed below.
  • Table 2 A summary of systems compared in the ablation study.
  • Pose vector dimensionality: d_p is reduced from 256 to 64 simply by changing the number of channels in the last trainable layer of the pose encoder.
  • The proposed base model with pose vectors constrained to 64 dimensions is labeled -PoseDim in Figure 7.
  • Figure 7 illustrates quantitative evaluation of how ablating several important features of the training setup impacts proposed system. In addition, the impact of pose augmentation during training is illustrated for X2Face+ and FAb-Net+.
  • Pose encoder capacity: second, the pose encoder is replaced with a stronger network, namely a ResNeXt-50 (32 × 4d), which makes it even with the identity encoder (which is of the same architecture). The proposed best model with this modification is denoted +PoseEnc. As indicated above, the pose and the identity encoders are intentionally unbalanced so that the former is weaker, causing the optimization process to favor extracting person-specific information from the identity source frames rather than from the driving frame. Both the metrics and the reenactment samples for +PoseEnc confirm the importance of this choice: a more capacious pose encoder starts piping person-specific features from the facial expression and head pose driver.
  • Pose augmentation: a model is retrained without random pose augmentations, i.e. A is set to the identity transformation.
  • In this case, the system is trained to exactly reconstruct the facial expression and head pose driver image, and is therefore more likely to degrade into an autoencoder (provided the pose encoder is trained along with the whole system).
  • Figure 8 illustrates comparison of cross-person reenactment for proposed best model and its ablated versions.
  • The top left image in Fig. 8 shows one of the images defining the target identity from the VoxCeleb2 test split, i.e. the person whom the authors would like to render with a different facial expression and head pose.
  • Other images in this row are also from the VoxCeleb2 test split, and define the pose to transfer to the target person.
  • Figure 7 shows a severe growth in identity error and a sharp drop in pose error for those two models. This once again demonstrates that the X2Face and FAb-Net descriptors are not person-agnostic.
  • Figure 9 illustrates the effect of pose augmentations on X2Face+ and FAb-Net+ models. Without augmentations, the identity gap becomes conspicuous.
  • The top left image in Fig. 9 shows one of the images defining the target identity from the VoxCeleb2 test split, i.e. the person whom the authors would like to render with a different facial expression and head pose. Other images in this row are also from the VoxCeleb2 test split, and define the pose to transfer to the target person.
  • Each of the remaining rows corresponds to one of the compared neural facial expression and head pose reenactment methods.
  • The compared methods are enumerated in the four bottom rows of Table 2, namely X2Face+ and FAb-Net+, each evaluated with and without pose augmentations. Shown are the reenactment outputs of the respective methods. Note, e.g., how the glasses from driver #7 or the facial shape from driver #1 are transferred to the result.
  • The proposed system learns the pose descriptors without explicit supervision, purely based on reconstruction losses. The only weak form of supervision comes from the segmentation masks.
  • Proposed learned head pose descriptors outperform previous unsupervised descriptors at the tasks of pose-based retrieval, as well as cross-person reenactment.
  • At least one of the plurality of units may be implemented through an AI model.
  • a function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor.
  • the processor may include one or a plurality of processors.
  • one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
  • the one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory.
  • the predefined operating rule or artificial intelligence model is provided through training or learning.
  • learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made.
  • the learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
  • The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation between the result of the previous layer and the plurality of weight values.
  • Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
  • the learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction.
  • Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
  • a method for recognizing the facial expression and head pose may obtain output data recognizing an image or a facial expression and head pose in the image by using image data as input data for an artificial intelligence model.
  • the artificial intelligence model may be obtained by training.
  • "obtained by training” means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm.
  • the artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
  • Visual understanding is a technique for recognizing and processing things as does human vision and includes, e.g., object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
  • the above-described method performed by the electronic device may be performed using an artificial intelligence model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A rich representation of a human head/face pose and expression is a crucial ingredient in a large number of human-centric computer vision and graphics tasks. Learning such representations from data without relying on human annotation is an attractive approach, as it may utilize vast unlabeled datasets of human videos. In this work, the inventors propose a new and simple method that performs such learning. Unlike previous works, here learning results in person-agnostic descriptors that capture human pose yet can be transferred from person to person, which is particularly useful for such applications as face reenactment. Alongside face reenactment, the inventors also show that these descriptors are useful for other downstream tasks, such as face orientation estimation. Meanwhile, the above-described method performed by the electronic device may be performed using an artificial intelligence model.

Description

NEURAL FACIAL EXPRESSIONS AND HEAD POSES REENACTMENT WITH LATENT POSE DESCRIPTORS
The invention relates to computer graphics, telepresence (e.g. video conferencing), human head pose estimation, video production, face tracking, and augmented/virtual reality.
Head video reenactment has seen dramatic progress in quality and robustness over the recent years. Current state-of-the-art systems [33, 22, 37, 30, 39, 42, 10, 36, 35] demonstrate compelling nearly photorealistic "talking head" reenactments. The most recent ones are able to accomplish this even when a single image of the target person is available [30, 42, 10, 36, 35], by using deep neural generative networks.
Face/head reenactment is an active area of research. A distinction is drawn here between works where changes and augmentations are localized within faces (face reenactment), e.g. [33, 30], and more ambitious approaches that model extended regions including a significant portion of the clothing, neck, and upper garment (head reenactment), e.g. [22, 37, 42].
Pose representation is an important aspect of reenactment systems. As mentioned above, most works drive reenactment using landmarks [33, 22, 37, 42, 10, 36]. Another approach is to use facial action units (AU) [9], as is done in face reenactment [30] and head reenactment [35]. Detecting action units still requires manual annotation and supervised learning. The X2Face system [39] uses latent vectors that are learned to be predictive of warping fields.
A more classic approach is to model face/head pose in the 3D morphable model (3DMM) framework [1] or using a similar approach in 2D (e.g. an active appearance model) [6]. Still, learning a 3DMM and fitting a learned 3DMM almost invariably involves detecting landmarks, thus inheriting many of the landmark deficiencies. Alternatively, a dataset of 3D scans is required to build a model for pose/identity disentanglement in the 3DMM framework.
Several recent works have investigated how landmarks can be learned in an unsupervised way [44, 19]. While generally very promising, unsupervised keypoints still contain person-specific information just like supervised keypoints, and therefore are not generally suitable for cross-person reenactment. Same applies to dense, high-dimensional descriptors such as DensePose body descriptor [14], and dense face-only descriptors [15, 34]. Finally, Codec avatars [26] learn person-specific latent pose descriptors and extractors based on the reconstruction losses. However, the transfer of such descriptors from person to person was not considered.
The recent and parallel work [32] has demonstrated that relative motion of unsupervised keypoints can be used to transfer animations at least in the absence of strong head rotation. Full-fledged comparison of proposed approach to [32] is left for future work.
Beyond head/face reenactment, there is a very large body of work on learning disentangled representations. Some representative works that learn latent pose or shape descriptors for arbitrary classes of objects using datasets of videos include [8, 40]. Some approaches (e.g. [24]) aim to learn content-style disentanglement (which may roughly correspond to shape-texture disentanglement) using adversarial [12] and cycle-consistency [45, 17] losses. Alternatively, disentanglement can be obtained by the direct fitting of factorized distributions to data (e.g. [23]).
More importantly, proposed is a new pose representation (here and below, 'head pose' means the combination of head orientation and position, as well as facial expression) for neural facial expression and head pose reenactment. The representation of the pose plays a key role in the quality of reenactment. Most systems, including [33, 22, 37, 42, 10, 36], are based on a keypoint (landmark) representation. The main advantage of such a representation is that robust and efficient "off-the-shelf" landmark detectors are now available [21, 2].
The emergence of large unlabeled datasets of human videos, such as [29, 4, 5], makes it possible to learn latent pose-expression descriptors in an unsupervised way. This approach was first explored in [39, 38], where the latent pose descriptors were learned such that the dense flow between different frames can be inferred from the learned descriptors.
Face landmarks, however, suffer from several shortcomings. First, learning a landmark detector requires excessive annotation effort, and the sets of annotated landmarks often miss important aspects of the pose. E.g. many landmark annotations do not include eye pupils, and as a consequence, the reenactment will not have full control of the gaze. Second, many of the landmarks do not have an anatomical basis, and their annotation is ambiguous and prone to errors, especially when they are occluded. In practice, such annotation ambiguity often translates into temporal instability of keypoint detection, which in turn shows up in the reenactment results. Finally, as a representation, landmarks are person-specific, as they contain a considerable amount of information about pose-independent head geometry.
This may be highly undesirable for head reenactment, e.g. if one wants to drive an iconic photograph or painting with the target person having a different head geometry.
This invention proposes improving pre-existing neural one-shot head reenactment systems in two important ways. First, rather straightforwardly, the previous state-of-the-art reenactment system [42] is augmented with the ability to predict foreground segmentation. Such prediction is needed in scenarios such as telepresence, where transferring the original background to the new environment can be undesirable.
Proposed is an alternative to the warping-based approach [39, 38]. The proposed approach learns low-dimensional person-agnostic pose descriptors alongside medium-dimensional person-specific pose-independent descriptors, by imposing a set of reconstruction losses on video frames over a large collection of videos.
Importantly, when evaluating the reconstruction losses, the background is segmented out, so that the background clutter and its change across frames do not affect the learned descriptors.
Proposed system modifies and expands the reenactment model of Zakharov et al. [42]. First, the ability to predict the segmentation is added. Second, the system learns to perform reenactment based on latent pose vectors rather than keypoints.
A simple learning framework based on sampling multiple random frames from the same video, paired with the large size of the video dataset, makes it possible to learn extractors for both descriptors that work very well for reenactment tasks, including cross-person reenactment.
In particular, the proposed reenactment based on the new latent pose representation preserves the identity of the target person much better than when the FAb-Net [38] and X2Face [39] pose descriptors are used. Additionally, the quality of the learned latent pose descriptors is analyzed for tasks such as landmark prediction and pose-based retrieval.
A rich representation of a human head/face pose and expression is a crucial ingredient in a large number of human-centric computer vision and graphics tasks. Learning such representations from data without relying on human annotation is an attractive approach, as it may utilize vast unlabeled datasets of human videos. In this work, proposed is a new and simple method that performs such learning. Unlike previous works, the proposed learning results in person-agnostic descriptors that capture human pose yet can be transferred from person to person, which is particularly useful for such applications as face reenactment. Alongside face reenactment, the proposed descriptors are also shown to be useful for other downstream tasks, such as face orientation estimation.
Importantly, proposed system can successfully be used for generating videos as well. Even when each video frame is generated independently and without any temporal processing, proposed system exhibits temporally smooth facial expressions, thanks to pose augmentations and latent pose representation. Previous methods that used facial keypoints to drive reenactment would inherit shakiness from keypoint detectors.
Proposed is a hardware device comprising a software product that performs a method for neural facial expression and head pose reenactment, comprising: an identity encoder unit configured to obtain the identity descriptor from person A's image; a pose encoder unit configured to obtain the descriptor of head pose and facial expression from person B's image, wherein the output of the pose encoder unit contains information neither about person A's identity nor about person B's identity; and a generator unit which receives the outputs of the identity encoder unit and the pose encoder unit, wherein the generator unit is configured to synthesize the avatar of person A having the head pose and facial expression of person B. The pose encoder unit is a convolutional neural network, which takes a human image as input and outputs a vector that describes the head pose and facial expression and does not describe the person's identity. Person B's identity refers to person B's skin color, facial shape, eye color, clothing, and adornments.
Proposed is a method for synthesizing a photorealistic avatar of a person, comprising: obtaining the identity descriptor from person A's image using the identity encoder unit; obtaining the descriptor of head pose and facial expression from person B's image using the pose encoder unit; and synthesizing the avatar of person A, having the head pose and facial expression of person B, by the generator unit, which receives the outputs of the identity encoder unit and the pose encoder unit. The pose encoder unit is a convolutional neural network, which takes a human image as input and outputs a vector that describes the head pose and facial expression and does not describe the person's identity. The identity comprises skin color, facial shape, eye color, clothing, and adornments.
Importantly, proposed system can successfully be used for generating videos as well. Even when each video frame is generated independently and without any temporal processing, proposed system exhibits temporally smooth facial expressions, thanks to pose augmentations and latent pose representation. Previous methods that used facial keypoints to drive reenactment would inherit shakiness from keypoint detectors.
The above and/or other aspects will be more apparent by describing exemplary embodiments with reference to the accompanying drawing.
Figure 1 illustrates using arbitrary people as facial expression and head pose drivers (top row) for generating realistic reenactments of arbitrary talking heads (such as Mona Lisa, bottom row).
Figure 2A illustrates proposed training pipeline (discriminator not shown for simplicity).
Figure 2B illustrates using proposed method.
Figure 3 illustrates evaluation of reenactment systems in terms of their ability to represent the driver pose and to preserve the reference identity (arrows point towards improvement).
Figure 4 illustrates comparison of cross-person reenactment for several systems on VoxCeleb2 test set.
Figure 5 illustrates reenactment by interpolation between two pose vectors along a spherical trajectory in the pose descriptor space.
Figure 6 illustrates additional comparison of cross-person reenactment for several systems on the VoxCeleb2 test set.
Figure 7 illustrates quantitative evaluation of how ablating several important features of the training setup impacts proposed system.
Figure 8 illustrates comparison of cross-person reenactment for proposed best model and its ablated versions.
Figure 9 illustrates the effect of pose augmentations on X2Face+ and FAb-Net+ models. Without augmentations, the identity gap becomes conspicuous.
Proposed is a Neural facial expressions and head poses Reenactment system, which is driven by a latent pose representation and is capable of predicting the foreground segmentation alongside the RGB image. The latent pose representation is learned as a part of the entire reenactment system, and the learning process is based solely on image reconstruction losses. Despite its simplicity, with a large and diverse enough training dataset, such learning successfully decomposes pose from identity. The resulting system can then reproduce mimics of the driving person and, furthermore, can perform cross-person reenactment. Additionally, the learned descriptors are useful for other pose-related tasks, such as keypoint prediction and pose-based retrieval.
Figure 1 illustrates using arbitrary people as facial expression and head pose drivers(top row) for generating realistic reenactments of arbitrary talking heads (such as Mona Lisa, bottom row). Proposed system can generate realistic reenactments of arbitrary talking heads (such as Mona Lisa) using arbitrary people as facial expression and head pose drivers (top row). Despite learning in an unsupervised setting, the method can successfully decompose pose and identity, so that the identity of the reenacted person is preserved.
The invention provides fast estimation of comprehensive and flexible representation (descriptor) of head pose and facial expression from one human head image.
The invention provides estimation of human head pose (yaw/roll/pitch angles) and facial expression (including eye gaze direction) from an image of a person. The invention can be applied in:
telepresence (video conferencing) systems;
face-based remote control systems (in TVs, smartphones, robots, etc.);
entertainment systems, games (for controlling/driving a virtual avatar, or for creating it);
messengers (e.g. for creating animated stickers/emojis);
AR/VR systems (to determine head pose or eye gaze for rendering of virtual objects from a correct viewpoint).
Proposed system modifies and expands the reenactment model of Zakharov et al. [42]. First, the ability to predict the segmentation is added. Second, the system learns to perform reenactment based on latent pose vectors rather than keypoints. As in [42], learning is performed on the VoxCeleb2 dataset [4] of video sequences. Each sequence contains a talking person and is obtained from a raw sequence by running a face detector, cropping the resulting face and resizing it to a fixed size (256×256 in the proposed case). Also, as in [42], there is a "meta-learning" stage, when a big model responsible for reproducing all people in the dataset is trained through a sequence of training episodes, and a fine-tuning stage, when that "meta-model" is fine-tuned to a tuple of images (or a single image) of a particular person.
As illustrated in Figure 2A: At each step of meta-learning, the proposed system samples a set of frames from a video of a person. The frames are processed by two encoders. The bigger identity encoder is applied to several frames of the video, while the smaller pose embedder is applied to a hold-out frame. Hold-out data is usually defined as a part of the data that is purposely not shown to (i.e. kept away from) a certain model. In this case, a hold-out frame is just another frame of the same person; the term "hold-out" emphasizes that this frame is guaranteed not to be fed into the identity encoder, only into the pose encoder. The obtained embeddings are passed to the generator network, whose goal is to reconstruct the last (hold-out) frame. Having an identity encoder that is more capacious than the pose encoder is very important for disentangling pose from identity in the latent space of pose embeddings, which is a key component of the present invention. It is no less important to have a very tight bottleneck in the pose encoder, which in the proposed case is implemented via a lower-capacity neural network in the pose encoder and a smaller dimensionality of the pose embeddings than that of the identity embeddings. This forces the pose encoder to encode only the pose information into the pose embeddings and to disregard the identity information. Since the capacity of the pose encoder is limited, and since its input does not exactly match the other frames w.r.t. identity (thanks to data augmentation), the system learns to extract all pose-independent information through the identity encoder and uses the smaller encoder to capture only pose-related information, thus achieving pose-identity disentanglement.
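As an illustration only, the meta-learning step described above could be sketched as follows in PyTorch-style code; the module and helper names (identity_encoder, pose_encoder, generator, pose_augment, reconstruction_loss) are hypothetical placeholders rather than the actual implementation.

```python
import torch

def meta_learning_step(frames, holdout_frame, holdout_mask,
                       identity_encoder, pose_encoder, generator,
                       pose_augment, reconstruction_loss, optimizer):
    """One training episode: K frames feed the identity encoder,
    a hold-out frame of the same person feeds the pose encoder."""
    # Identity embedding: average the per-frame identity vectors (d_i-dimensional).
    x = identity_encoder(frames).mean(dim=0, keepdim=True)        # (1, d_i)

    # Pose embedding from the augmented hold-out frame (d_p-dimensional, d_p < d_i).
    y = pose_encoder(pose_augment(holdout_frame).unsqueeze(0))    # (1, d_p)

    # The generator reconstructs the hold-out frame and its foreground mask
    # from the concatenated identity and pose embeddings.
    rgb_pred, mask_pred = generator(torch.cat([x, y], dim=1))

    loss = reconstruction_loss(rgb_pred, mask_pred, holdout_frame, holdout_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```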
Proposed is a method for neural facial expression and head pose reenactment, i.e. an algorithm for synthesizing a photorealistic avatar of person A in which the facial expression and head pose are taken from an image of person B (as illustrated in Fig. 2A).
Figure 2B illustrates the inference pipeline of the proposed method, i.e. the algorithm of prediction/rendering with proposed system after it has been trained.
It is necessary to note that the generator unit is trained simultaneously with all other units. During training, the identity encoder unit predicts identity embeddings from several images of a person, and the pose encoder unit predicts a pose embedding from an extra image of that person. The generator unit consumes the averaged identity embeddings and the pose embedding, and is trained to output the image that was fed into the pose encoder unit, as well as the foreground mask for that image. Novel in this pipeline are (1) the pose encoder unit, which is not directly supervised during training, (2) the use of the foreground mask as a target, and (3) pose augmentations (identity-preserving distortions of the input to the pose encoder unit) as a technique to reach pose-identity disentanglement.
First of all, the identity embedding is computed as an average of identity encoder outputs over all available images of person A. After that, the identity encoder is discarded and the rest of the system is fine-tuned on the images of person A just like during the meta-learning stage, with the following differences: (1) there is only one person in the training set, (2) the pose encoder's weights are kept frozen, (3) the identity embedding is converted into a trainable parameter. Finally, an image of any person B is passed through the pose encoder, and the predicted rendering and segmentation are generated as usual (i.e. as during training) from the resulting pose embedding and the fine-tuned identity embedding.
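A minimal sketch of this fine-tuning preparation and inference procedure, under the same hypothetical module names as in the training sketch above, could look as follows.

```python
import torch

def prepare_avatar(images_of_A, identity_encoder):
    """Average identity embeddings over all images of person A, then
    convert the result into a trainable parameter for fine-tuning."""
    with torch.no_grad():
        x_hat = identity_encoder(images_of_A).mean(dim=0, keepdim=True)
    return torch.nn.Parameter(x_hat)      # the identity embedding becomes trainable

def reenact(image_of_B, identity_embedding, pose_encoder, generator):
    """Render person A with the head pose and facial expression of person B."""
    with torch.no_grad():
        y = pose_encoder(image_of_B.unsqueeze(0))             # frozen pose encoder
        rgb, mask = generator(torch.cat([identity_embedding, y], dim=1))
    return rgb, mask
```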
1. Use the ≪identity encoder≫ program unit to obtain the identity descriptor (skin color, facial shape, eye color, clothing, adornments, …) from person A's image (for example, using the system described in the ≪Few-Shot Adversarial Learning of Realistic Neural Talking Head Models≫ paper).
2. Use the ≪pose encoder≫ program unit to obtain the descriptor of head pose and facial expression from person B's image (dashed line in Figure 2A).
3. Feed the outputs of those two blocks into the ≪generator≫ program unit to synthesize the desired avatar.
The ≪pose encoder≫ unit is a trainable machine learning model (e.g. a convolutional neural network). It takes a human image as input and outputs a vector that describes head pose and facial expression and does not describe the person's identity.
Training algorithm for that model: in the training pipeline of the ≪Few-Shot Adversarial Learning of Realistic Neural Talking Head Models≫ method, replace the input unit ≪RGB & landmarks≫ with the proposed ≪pose encoder≫ unit. Next, train the whole system as described in the ≪Few-Shot Adversarial Learning of Realistic Neural Talking Head Models≫ paper, with the additional techniques described below. Namely: the generator has no downsamplings (it is not an image-to-image architecture), as it starts from a constant learnable tensor rather than from a full-sized image; the pose embedding is additionally concatenated to the input of the MLP that predicts the AdaIN parameters; and the identity encoder is of a different architecture (ResNeXt-50 32x4d).
Also proposed are random pose augmentations applied to the facial expression and head pose driver; in addition, the generator predicts the foreground segmentation mask in an extra output channel, and a dice loss is applied to match that prediction with the ground truth.
Novelty in comparison to other current solutions for the same task:
● the output of the ≪pose encoder≫ does not contain information about person B's identity, therefore, that output is depersonalized, which is better in terms of information security;
● for the same reason as above, proposed approach does not require a unit that adapts the head pose and facial expression descriptor to person A's identity (inside or outside of the ≪generator≫), and therefore leads to better computational efficiency.
The meta-learning step is described in the below discussion.
During each episode of meta-learning, a single video sequence is considered. Then K + 1 random frames I_1, ..., I_{K+1} are fetched from this sequence, as well as S_{K+1} - a foreground segmentation map for I_{K+1}, which is precomputed using an off-the-shelf semantic segmentation network.
The first K images I_1, ..., I_K are then fed into a relatively high-capacity convolutional network F, which is called the identity encoder. It is analogous to the embedder network in [42], with the exception that it does not accept keypoints as an input.
For each image I_i, the identity encoder outputs a d_i-dimensional vector x_i = F(I_i), which is called the identity embedding of I_i. Identity embeddings are expected to contain the pose-independent information about the person (including lighting, clothing, etc.). Given K frames, a single identity vector is obtained as the mean of the per-frame embeddings:

x̂ = (1/K) · (x_1 + ... + x_K).
The remaining image I_{K+1} (the pose source) first undergoes a random pose augmentation transformation A, which is described below. Then, A(I_{K+1}) is passed through a network of much lower capacity, which is called the pose encoder and denoted G. The pose encoder outputs a d_p-dimensional pose embedding

y = G(A(I_{K+1})),

which is intended to be a person-agnostic pose descriptor.
The transformation A mentioned above is important for pose-identity disentanglement. It keeps the person's pose intact but may alter their identity. Namely, it randomly scales the image independently along the horizontal and the vertical axes, and randomly applies content-preserving operations such as blur, sharpening, contrast change, or JPEG compression. A is called a pose augmentation since it is applied to the pose source, and it can be regarded as a form of data augmentation.
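For illustration, one possible implementation of such pose augmentations is sketched below using PIL; the exact operations and parameter ranges are assumptions rather than the settings used in the proposed system.

```python
import io
import random
from PIL import Image, ImageEnhance, ImageFilter

def pose_augment(img: Image.Image) -> Image.Image:
    """Identity-altering, pose-preserving augmentation of the pose source frame."""
    w, h = img.size

    # Anisotropic scaling: crop a centered rectangle with a randomly distorted
    # aspect ratio and stretch it back to the original resolution, altering the
    # facial proportions (an identity cue) while keeping the pose.
    sx, sy = random.uniform(0.8, 1.0), random.uniform(0.8, 1.0)
    cw, ch = max(1, int(w * sx)), max(1, int(h * sy))
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch)).resize((w, h), Image.BILINEAR)

    # One random content-preserving distortion.
    op = random.choice(["blur", "sharpen", "contrast", "jpeg", "none"])
    if op == "blur":
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 2.0)))
    elif op == "sharpen":
        img = img.filter(ImageFilter.UnsharpMask())
    elif op == "contrast":
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.7, 1.3))
    elif op == "jpeg":
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=random.randint(30, 80))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img
```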
The pose and the identity embeddings are passed to the generator network that tries to reconstruct the image I_{K+1} as accurately as possible. Whereas [42] used rasterized keypoints (stickman images) to pass the pose into their generator networks, here the AdaIN [16] mechanism is relied upon entirely to pass both the pose and the identity embeddings to the generator. More specifically, the proposed upsampling generator starts with a constant learnable tensor of size 512×4×4 and outputs two tensors: the reconstructed image Î_{K+1} of size 3×256×256 and the segmentation Ŝ_{K+1} of size 1×256×256, which it tries to match to the foreground part of the image I_{K+1} and to its segmentation mask S_{K+1}, respectively. This is achieved by simply predicting a 4×256×256 tensor in the final layer. The AdaIN blocks are inserted after each convolution. The AdaIN coefficients are produced by taking the concatenated pose and identity embeddings and passing this (d_i + d_p)-dimensional vector through an MLP with learnable parameters, in the spirit of StyleGAN [20].
The image Î_{K+1} and the segmentation Ŝ_{K+1} produced by the generator are expected to be as close as possible to the masked image I_{K+1} ⊙ S_{K+1} (the target image with the background blacked out) and to S_{K+1}, respectively. This is achieved with the help of several loss functions. Segmentation maps are matched with the help of the dice coefficient loss [27]. Head images with the background blacked out, on the other hand, are matched using the same combination of losses as in [42]. Namely, there are content losses based on matching ConvNet activations of a VGG-19 model trained for ImageNet classification and of a VGGFace model trained for face recognition. Also, Î_{K+1} and I_{K+1} ⊙ S_{K+1} are passed through a projection discriminator (the difference from [42] here is that rasterized keypoints are again not provided to it) to compute the adversarial loss that pushes the generated images towards realism, the discriminator feature matching loss, and an embedding match term.
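As an illustration of how these terms could be combined, below is a small sketch of the dice coefficient loss and of a combined reconstruction objective; the perceptual and adversarial terms are only referenced through a hypothetical perceptual_loss callable, and the loss weights are assumptions.

```python
import torch

def dice_loss(pred_mask, gt_mask, eps=1e-6):
    """Dice coefficient loss between a predicted soft mask and a binary ground-truth mask."""
    pred, gt = pred_mask.flatten(1), gt_mask.flatten(1)
    inter = (pred * gt).sum(dim=1)
    union = pred.sum(dim=1) + gt.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def reconstruction_loss(rgb_pred, mask_pred, target_rgb, target_mask,
                        perceptual_loss, w_perc=1.0, w_dice=1.0):
    """Total reconstruction objective: perceptual terms are computed against the
    foreground part of the target image, the dice term against its mask."""
    foreground_target = target_rgb * target_mask      # black out the background
    loss = w_perc * perceptual_loss(rgb_pred, foreground_target)
    loss = loss + w_dice * dice_loss(mask_pred, target_mask)
    return loss
```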
Reenactment and fine-tuning. Once the model has been meta-learned, it can be used to fit new identities unseen during meta-learning. Thus, given one or more images of a new person, their identity vector x̂ can be extracted by passing those images through the identity encoder and averaging the results element-wise. Then, by plugging in a pose vector y extracted from an image of the same or of a different person, the person can be reenacted by computing the image Î and its foreground mask Ŝ.
To further reduce the identity gap, it is proposed to follow [42] and fine-tune the model (namely, the weights of the MLP, the generator, and the discriminator) with the same set of losses as in [42], plus the dice coefficient loss, treating the provided set of images of the new person and their segmentations as the ground truth. The estimated identity embedding x̂ is kept fixed during the fine-tuning (including it into the optimization did not result in any difference in the experiments, since the number of parameters in the embedding x̂ is much smaller than in the MLP and the generator network). The pose embedding network G is also kept fixed during the fine-tuning.
A key finding is that, when applied to a person X, the reenactment model trained as discussed above can successfully reproduce the mimics of a person in image I when the pose vector y = G(I) is extracted from an image of the same person X. More surprisingly, the model can also reproduce the mimics when the pose vector is extracted from an image of a different person Y. In this case, the bleeding of identity from this different person is kept to a minimum, i.e. the resulting image still looks like an image of person X.
A priori, one might expect that such disentanglement of pose and identity should not happen, and that some form of adversarial training [12] or cycle-consistency [45, 17] would be necessary to ensure the disentanglement. It turns out that with (i) low enough capacity of the pose extractor network G, (ii) pose augmentations applied, and (iii) the background segmented out, disentanglement happens automatically, and experiments with extra loss terms such as in e.g. [8] did not produce any further improvement. Apparently, with the three techniques above, the model prefers to extract all person-specific details from the identity source frames using the higher-capacity identity extractor network.
Below, this disentanglement effect, which came as a "pleasant surprise", is evaluated, and it is shown to be indeed stronger than in the case of other related approaches (i.e. the proposed system supports cross-person reenactment better, with less identity bleeding).
Below, ablation studies are additionally conducted to investigate how pose encoder capacity, pose augmentations, segmentation, and latent pose vector dimensionality d_p affect the ability of the proposed reenactment system to preserve pose and identity.
The proposed training dataset is a collection of YouTube videos from VoxCeleb2 [4]. There are on the order of 100,000 videos of about 6,000 people. One of every 25 frames is sampled from each video, leaving around seven million training images in total. In each image, the annotated face is re-cropped by first capturing its bounding box with the S3FD detector [43], then making that box square by enlarging the smaller side, growing the box's sides by 80% while keeping the center, and finally resizing the cropped image to 256×256.
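As an illustration, a minimal sketch of this re-cropping procedure is given below; the detect_face helper standing in for the S3FD detector is a hypothetical placeholder, and boundary handling is simplified.

```python
from PIL import Image

def recrop_face(img: Image.Image, detect_face) -> Image.Image:
    """Re-crop a face: square the detected box, grow it by 80% around its center,
    and resize the crop to 256x256 (out-of-bounds regions are filled with black)."""
    x0, y0, x1, y1 = detect_face(img)            # bounding box from the detector
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    side = max(x1 - x0, y1 - y0)                 # square box: enlarge the smaller side
    side *= 1.8                                  # grow the sides by 80%, keeping the center
    box = (int(cx - side / 2), int(cy - side / 2),
           int(cx + side / 2), int(cy + side / 2))
    return img.crop(box).resize((256, 256), Image.BILINEAR)
```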
Human segmentation is obtained with the Graphonomy model [11]. As in [42], K = 8 is used; thus the identity vector extracted from eight random frames of a video is used in order to reconstruct a ninth frame based on its pose descriptor.
In the proposed best model, the pose encoder has the MobileNetV2 architecture [31] and the identity encoder is a ResNeXt-50 (32×4d) [41]. Neither has been modified, and so they include batch normalization [18]. The pose and identity embedding sizes, d_p and d_i, are 256 and 512 respectively. No normalization or regularization is applied to the embeddings. The module that transforms them into AdaIN parameters is a ReLU perceptron with spectral normalization and one hidden layer of 768 neurons.
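For illustration, the sketch below shows one way such off-the-shelf backbones could be adapted to the stated embedding sizes with torchvision; the way the classification heads are replaced and the total number of AdaIN parameters are assumptions.

```python
import torch.nn as nn
from torchvision import models

# Pose encoder: MobileNetV2 backbone with a 256-dimensional output head.
pose_encoder = models.mobilenet_v2()
pose_encoder.classifier = nn.Linear(pose_encoder.last_channel, 256)

# Identity encoder: ResNeXt-50 (32x4d) backbone with a 512-dimensional output head.
identity_encoder = models.resnext50_32x4d()
identity_encoder.fc = nn.Linear(identity_encoder.fc.in_features, 512)

# MLP mapping the concatenated (512 + 256)-dimensional embedding to AdaIN parameters.
num_adain_params = 4096            # hypothetical total count of AdaIN scales and biases
adain_mlp = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(512 + 256, 768)),
    nn.ReLU(inplace=True),
    nn.utils.spectral_norm(nn.Linear(768, num_adain_params)),
)
```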
The proposed generator is based on that of [42], but without downsampling blocks, since all inputs are delegated to AdaINs, which are located after each convolution. More precisely, a 512×4×4 learnable constant tensor is transformed by 2 constant-resolution residual blocks, followed by 6 upsampling residual blocks. The number of channels starts halving from the fourth upsampling block, so that the tensor of final resolution (256×256) has 64 channels. That tensor is passed through an AdaIN layer, a ReLU, a 1×1 convolution and a tanh, becoming a 4-channel image. Unlike [42], self-attention is not used. Spectral normalization [28] is employed everywhere in the generator, the discriminator and the MLP.
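To make the AdaIN mechanism concrete, a simplified sketch of an adaptive instance normalization layer and of an upsampling residual block with AdaIN after each convolution is given below; the block structure is an illustrative assumption, not the exact generator of the proposed system.

```python
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize each channel of the feature map,
    then modulate it with a per-channel scale and bias predicted by the MLP."""
    def __init__(self, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)

    def forward(self, feat, style):
        # style: (B, 2 * num_channels) slice of the MLP output -> scale and bias
        scale, bias = style.chunk(2, dim=1)
        return self.norm(feat) * (1 + scale[..., None, None]) + bias[..., None, None]

class UpsampleBlock(nn.Module):
    """Simplified upsampling residual block with AdaIN after each convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.utils.spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1))
        self.conv2 = nn.utils.spectral_norm(nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.skip = nn.utils.spectral_norm(nn.Conv2d(in_ch, out_ch, 1))
        self.adain1, self.adain2 = AdaIN(out_ch), AdaIN(out_ch)

    def forward(self, x, style1, style2):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        h = F.relu(self.adain1(self.conv1(x), style1))
        h = self.adain2(self.conv2(h), style2)
        return F.relu(h + self.skip(x))
```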
Instead of alternating generator and discriminator updates, a single weight update is carried out for all networks after accumulating gradients from all loss terms.
The model is trained for 1,200,000 iterations with a minibatch of 8 samples spread over two NVIDIA P40 GPUs, which in total takes about two weeks.
The proposed quantitative evaluation assesses both the relative performance of the pose descriptors on auxiliary tasks and the quality of cross-person reenactment. Qualitatively, examples of reenactment in the same-person and cross-person scenarios are shown, as well as interpolation results in the learned pose space. The ablation study in the supplementary material shows the effect of different components of the proposed method.
Below, the results of the proposed invention are compared with the results of known methods and systems. The following pose descriptors, based on various degrees of supervision, are considered:
Proposed invention. 256-dimensional latent pose descriptors learned within proposed system.
X2Face. 128-dimensional driving vectors learned within the X2Face reenactment system [39].
FAb-Net. The 256-dimensional FAb-Net descriptors [38] are evaluated as a pose representation. These are related to the proposed invention in that, although not person-agnostic, they are also learned in an unsupervised way from the VoxCeleb2 video collection.
3DMM. Considered is a state-of-the-art 3DMM system [3]. This system extracts decomposed rigid pose, face expression, and a shape descriptor using a deep network. The pose descriptor is obtained by concatenating the rigid pose rotation (represented as a quaternion), and the face expression parameters (29 coefficients).
The proposed descriptor is learned from the VoxCeleb2 dataset. The X2Face descriptor is trained on the smaller VoxCeleb1 dataset [29], and FAb-Net is learned from both. The 3DMM descriptors are the most heavily supervised, as the 3DMM is learned from 3D scans and requires a landmark detector (which is, in turn, learned in a supervised setting).
In addition, considered are the following head reenactment systems based on these pose descriptors:
X2Face. The X2Face system [39] based on native descriptors and warping-based reenactment.
X2Face+. In this variant, the frozen pre-trained X2Face driving network (up to the driving vector) is used instead of the proposed pose encoder, and the rest of the architecture is kept unchanged. Trained are the identity encoder, the generator conditioned on the X2Face latent pose vector and the proposed identity embedding, and the projection discriminator.
FAb-Net+. Same as X2Face+ but with frozen FAb-Net in place of the proposed pose encoder.
3DMM+. Same as X2Face+ but with frozen ExpNet [3] in place of the proposed pose encoder, and with pose augmentations disabled. The pose descriptor is constructed from ExpNet's outputs as described above.
Additionally, these 35-dimensional descriptors are normalized by per-element mean and standard deviation computed over the VoxCeleb2 training set.
FSTH. The original few-shot talking head system of [42] driven by rasterized keypoints.
FSTH+. The system of [42] is retrained with several changes that make it more comparable with the proposed system and the other baselines. The raw keypoint coordinates are fed into the generator through the AdaIN mechanism (just like in the proposed system). The generator predicts segmentation alongside the image. The same crops are also used, which are different from [42].
To understand how good the learned pose descriptors are at matching different people in the same pose, the Multi-PIE dataset [13] is used; it is not used for training any of the descriptors, but has six emotion class annotations for people in various poses. The dataset is restricted to near-frontal and half-profile camera orientations (namely 08_0, 13_0, 14_0, 05_1, 05_0, 04_1, 19_0), leaving 177,280 images. Within each camera orientation group, a query image is randomly chosen and the closest N images from the same group are fetched using cosine similarity of descriptors. A returned image is considered a correct match if its person has the same emotion label as the query.
This procedure is repeated 100 times for each group. Table 1 shows the overall ratio of correct matches within the top-10, top-20, top-50, and top-100 lists. For the 3DMM descriptor, only the 29 face expression coefficients are considered, and the rigid pose information is ignored as irrelevant for emotions.
Table 1: The accuracy of pose (expression)-based retrieval results using different pose descriptors on the Multi-PIE dataset. See text for more details.
In this comparison, it can be observed that the latent space of the proposed pose embeddings is better grouped with respect to emotion classes than those of the other facial expression descriptors, as the proposed result is much better for the top-10 and top-20 metrics, while being similar to X2Face and better than the rest for top-50 and top-100. FAb-Net's and X2Face's vectors contain identity information, so they are more likely to be close to vectors representing the same or a similar person. As for 3DMM, by construction it requires different latent expression vectors to turn different shapes (persons) into the same facial expression; therefore, expression coefficients may easily coincide for different people showing different facial expressions.
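For illustration, a sketch of this retrieval protocol is given below, assuming the descriptors and emotion labels of one camera-orientation group are already collected in NumPy arrays; the function and array names are hypothetical.

```python
import numpy as np

def retrieval_accuracy(descriptors, emotion_labels, top_n=10, n_queries=100, seed=0):
    """Fraction of top-N retrieved images (by cosine similarity of pose descriptors)
    that share the emotion label of a randomly chosen query image."""
    rng = np.random.default_rng(seed)
    normed = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    correct = total = 0
    for _ in range(n_queries):
        q = rng.integers(len(normed))
        sims = normed @ normed[q]
        sims[q] = -np.inf                         # exclude the query itself
        top = np.argsort(-sims)[:top_n]
        correct += int(np.sum(emotion_labels[top] == emotion_labels[q]))
        total += top_n
    return correct / total
```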
Keypoint regression is not among the proposed target applications, since keypoints contain person-specific information. However, this is a popular task on which unsupervised pose descriptors have been compared in the past, so the proposed method is run on a standard benchmark, the MAFL [25] test set. To predict keypoints, a ReLU MLP with one hidden layer of size 768 is used, and in the proposed case both the pose and the identity embeddings serve as the input. Using the standard error normalized by inter-ocular distance, a distance error of 2.63 is obtained. This is smaller than the error of 3.44 obtained by FAb-Net, though behind the state of the art of [19] (2.54) for this task.
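A sketch of such a landmark-regression probe on top of the frozen embeddings is shown below; the use of the 5-landmark MAFL annotation and the probe details are assumptions for illustration.

```python
import torch.nn as nn

# Probe: frozen (pose + identity) embeddings -> 5 MAFL landmarks, (x, y) each.
landmark_mlp = nn.Sequential(
    nn.Linear(256 + 512, 768),
    nn.ReLU(inplace=True),
    nn.Linear(768, 5 * 2),
)

def normalized_landmark_error(pred, gt):
    """Mean landmark distance normalized by the inter-ocular distance.
    pred, gt: tensors of shape (B, 5, 2); landmarks 0 and 1 are assumed to be the eyes."""
    interocular = (gt[:, 0] - gt[:, 1]).norm(dim=-1, keepdim=True)   # (B, 1)
    dist = (pred - gt).norm(dim=-1)                                  # (B, 5)
    return (dist / interocular).mean()
```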
Quantitative evaluation. The performance of the seven reenactment systems listed above is compared in the cross-person setting. To do this, 30 people are randomly chosen from the test split of VoxCeleb2 and talking head models T_1, …, T_30 are learned for them. Each model T_k is created from 32 random frames of a video v_k. All models except X2Face are fine-tuned to those 32 frames for 600 optimization steps. Using these models, two metrics are computed per system: the identity error I_T and the pose reconstruction error P_T.
The identity error I_T estimates how closely the resulting talking heads resemble the original person k that the model was learned for. For that, the ArcFace [7] face recognition network R, which outputs identity descriptors (vectors), is used. The averaged reference descriptor r̂_k is computed from the fine-tuning frames of the video v_k, and the cosine similarity (csim) is used to compare it with the descriptors obtained from cross-person reenactment results. Cross-person reenactment is performed by driving T_k with all other 29 people. To obtain the final error, (one minus) the similarities are averaged over all 30 people in the test set. Formally,

I_T = (1/30) · Σ_{k=1..30} (1/29) · Σ_{j ≠ k} [ 1 − csim( r̂_k, R(T_k(I_j)) ) ],

where I_j denotes a driving frame of person j.
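For illustration, the identity error could be computed as in the sketch below; arcface_embed and reenact are hypothetical helpers standing in for the recognition network R and for driving a fine-tuned model T_k with a frame.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_error(models, finetune_frames, driver_frames, arcface_embed, reenact):
    """I_T: average of (1 - csim) between each person's reference ArcFace descriptor
    and the descriptors of that person's model driven by the other 29 people."""
    errors = []
    for k, model in enumerate(models):
        # Averaged reference descriptor from the fine-tuning frames of person k.
        ref = np.mean([arcface_embed(f) for f in finetune_frames[k]], axis=0)
        for j, drivers in enumerate(driver_frames):
            if j == k:
                continue                       # cross-person driving only
            for frame in drivers:
                out = reenact(model, frame)    # rendered image of person k
                errors.append(1.0 - cosine_similarity(ref, arcface_embed(out)))
    return float(np.mean(errors))
```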
The pose reconstruction error P_T, on the other hand, is designed to quantify how well the system replays the driver's pose and facial expression, and is defined in terms of facial landmarks. Since sets of landmarks can only be compared directly for the same person, the test dataset is restricted to self-reenactment pairs, i.e. T_k is driven with frames of person k. However, because T_k has been learned on frames of the video v_k, another 32 hold-out frames from the same video are used to avoid overfitting. An off-the-shelf 2D facial landmark prediction algorithm [2], denoted L, is employed to obtain landmarks in both the driver frame and the reenactment result.
In the proposed case, the measure d_landmarks(l_1, l_2) of how closely landmarks l_2 approximate reference landmarks l_1 is the average distance between corresponding landmarks normalized by the inter-ocular distance. As before, d_landmarks is computed for all drivers and averaged across all 30 people:

P_T = (1/30) · Σ_{k=1..30} (1/32) · Σ_{I ∈ hold-out(v_k)} d_landmarks( L(I), L(T_k(I)) ).
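A corresponding sketch of the pose reconstruction error is given below; detect_landmarks is a hypothetical stand-in for the landmark predictor L, and the 68-point eye indexing is an assumption.

```python
import numpy as np

def landmark_distance(ref, pred):
    """Average landmark distance normalized by the inter-ocular distance of the reference.
    ref, pred: arrays of shape (68, 2); indices 36-47 are the eye landmarks in the
    standard 68-point scheme."""
    interocular = np.linalg.norm(ref[36:42].mean(axis=0) - ref[42:48].mean(axis=0))
    return float(np.mean(np.linalg.norm(ref - pred, axis=1)) / interocular)

def pose_error(models, holdout_frames, detect_landmarks, reenact):
    """P_T: self-reenactment landmark error averaged over hold-out frames and people."""
    per_person = []
    for k, model in enumerate(models):
        dists = []
        for frame in holdout_frames[k]:
            out = reenact(model, frame)   # drive T_k with its own hold-out frame
            dists.append(landmark_distance(detect_landmarks(frame), detect_landmarks(out)))
        per_person.append(np.mean(dists))
    return float(np.mean(per_person))
```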
Figure 3 illustrates the evaluation of reenactment systems in terms of their ability to represent the driver pose and to preserve the reference identity (arrows point towards improvement). The horizontal and the vertical axes correspond to the identity error I_T and the pose reconstruction error P_T respectively, both computed from a subset of the test part of the VoxCeleb2 dataset. These metrics are described in detail in the "Quantitative evaluation" discussion. Each point on the plot corresponds to one of the prior head reenactment systems with which the proposed system is compared, and the plot shows the two metrics evaluated for the compared models. A perfect system T would have I_T = P_T = 0, i.e. the closer to the lower left corner, the better. In these terms, the proposed full model is strictly better than all systems except FSTH+, which is slightly better in one of the metrics but much worse in the other, and which benefits from an external keypoint detector.
Qualitative comparison. Figure 4 illustrates a comparison of cross-person reenactment for several systems on the VoxCeleb2 test set. The top left image is one of the 32 identity source frames from the VoxCeleb2 test split, i.e. it shows the target person to be rendered with a different facial expression and head pose. The other images in the top row are also from the VoxCeleb2 test split; they are the facial expression and head pose drivers and define the pose to transfer to the target person. The proposed method better preserves the identity of the target person and successfully transfers the mimics from the driver person. Each of the remaining rows corresponds to one of the compared neural facial expression and head pose reenactment methods, with "Proposed invention" being the proposed invention and the other methods corresponding to the ones in Figure 3. Shown are the reenactment outputs of the respective methods.
Figure 4 gives a qualitative comparison of the reenactment systems described above. It is evident that FSTH, being driven by rasterized landmarks, relies heavily on the driver's facial proportions and thus is not person-agnostic. Its modified version FSTH+ does a better job, having more representational power around vectorized keypoints; still, there is visible "identity bleeding" (e.g. compare head width in columns 1 and 2) and there are errors in prominent facial expressions, such as closing the eyes. The warping-based method X2Face fails already on slight rotations.
Two similar methods, X2Face+ and FAb-Net+, both provide strong baselines despite some signs of identity mismatch, for example, traces of eyeglasses in column 7 and long hair seeping in from the facial expression and head pose driver in column 5. It is important to note that although the pose descriptors of those methods are not person-agnostic, pose augmentations are still applied during training. In the ablation study below, it is demonstrated that cross-person reenactment performance drops dramatically when pose augmentations are removed from these two methods.
The 3DMM+ method has a very tight bottleneck of interpretable parameters, and therefore its identity gap is very small. However, apparently for the same reason, it is not as good at rendering correct subtle facial expressions. The proposed full system is able to accurately represent the driver's facial expression while preserving the identity of the target person.
In addition, reenactment by interpolation in the pose space is shown for the proposed system in Figure 5, which demonstrates smooth pose changes. Figure 5 illustrates reenactment by interpolation between two pose vectors along a spherical trajectory in the pose descriptor space. Each row shows reenactment outputs of the proposed invention for some person from the VoxCeleb2 test split. The first and the last images in each row are computed by using pose source frames from the VoxCeleb2 test split (not shown in the figure), mapped by the pose encoder to pose embeddings A and B respectively. The images in between are obtained by running the proposed invention as usual, except that the pose embedding is obtained by spherical interpolation between A and B rather than being computed by the pose encoder from some image. Namely, the pose embeddings in columns 2-5 are slerp(A, B, 0.2), slerp(A, B, 0.4), slerp(A, B, 0.6), slerp(A, B, 0.8).
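For illustration, a standard spherical linear interpolation (slerp) routine that could be used for such pose-space traversal is sketched below.

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between pose embeddings a and b, 0 <= t <= 1."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))   # angle between the vectors
    if np.isclose(omega, 0.0):
        return (1.0 - t) * a + t * b                          # nearly parallel: fall back to lerp
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Pose embeddings for columns 2-5 of each row in Figure 5:
# [slerp(A, B, t) for t in (0.2, 0.4, 0.6, 0.8)]
```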
Figure 6 illustrates an additional comparison of cross-person reenactment for several systems on the VoxCeleb2 test set. The top left image shows one of the images defining the target identity from the VoxCeleb2 test split, i.e. the person to be rendered with a different facial expression and head pose. The other images in this row are also from the VoxCeleb2 test split and define the pose to transfer to the target person.
Each of the remaining rows corresponds to one of the compared neural facial expression and head pose reenactment methods, with "Proposed invention" being the proposed invention and the other methods corresponding to the ones in Figure 3. Shown are the reenactment outputs of the respective methods.
Temporal smoothness. The supplementary video demonstrates the capability of proposed descriptor to create temporally smooth reenactment without any temporal smoothing of the extracted pose (provided that the results of bounding box detection are temporally smooth). At the same time, achieving temporally smooth reenactment with keypoint-driven systems (FSTH, FSTH+) requires a lot of keypoint smoothing.
Considered are reducing pose vector dimensionality, increasing pose encoder capacity, keeping the background in images, and removing pose augmentation.
The proposed best model is retrained with different subsets of these changes. In addition, X2Face+ and FAb-Net+ are altered by removing pose augmentations, and the observed effects are discussed below.
All the resulting models are listed in Table 2. They are compared both quantitatively and qualitatively, in the same way as in the comparison above.
There are four ablation dimensions to explore in proposed study, and these correspond to the columns in Table 2.
Table 2: A summary of systems compared in the ablation study.
Pose vector dimensionality d_p. First, d_p is reduced from 256 to 64 simply by changing the number of channels in the last trainable layer of the pose encoder. The proposed base model with pose vectors constrained to 64 dimensions is labeled -PoseDim in Figure 7. Figure 7 illustrates a quantitative evaluation of how ablating several important features of the training setup impacts the proposed system. In addition, the impact of pose augmentation during training is illustrated for X2Face+ and FAb-Net+.
Intuitively, a tighter bottleneck like this should both limit the ability to represent diverse poses and force the generator to take person-specific information from the richer identity embedding. According to the plot, indeed, the pose reconstruction error increases slightly, while the system stays person-agnostic to a similar degree. Qualitatively, however, the difference in pose is negligible.
Pose encoder capacity. Second, the pose encoder is replaced with a stronger network, namely ResNeXt-50 (32×4d), which makes it even with the identity encoder (which is of the same architecture). The proposed best model with this modification is denoted +PoseEnc. As indicated above, the pose and the identity encoder are intentionally unbalanced so that the former is weaker, causing the optimization process to favor extracting person-specific information from the identity source frames rather than from the driving frame. Both the metrics and the reenactment samples for +PoseEnc suggest that this idea was not pointless: a more capacious pose encoder starts piping person-specific features from the facial expression and head pose driver. In Figure 8, the +PoseEnc result is influenced by clothing from driver #1, hair from drivers #6-#8, and facial shape from driver #9. A huge increase in the identity error in Figure 7 also confirms this. On the other hand, such a system reconstructs the pose with better accuracy, which may indicate that it is a better choice for self-reenactment, where "identity bleeding" is less of an issue.
Erasing the background. Third, the proposed system is modified so that it does not predict foreground segmentation, does not use segmentation to compute loss functions, and thus becomes fully unsupervised. This change is denoted -Segm. The pose encoder in such a system spends its capacity on encoding the driving image's background rather than on estimating the subtle details of facial expressions. This happens because the perceptual loss functions are too sensitive to discrepancies between generated and target backgrounds compared to facial expression differences. More importantly, because the background often changes within a video, reconstructing the target image's background is too difficult by just looking at the identity source images. Therefore, the optimization algorithm is tempted to offload the identity encoder's job onto the pose encoder. This is evident from the plot and the samples, where introducing backgrounds contributes a lot to the identity gap, even more obviously when combined with a stronger pose encoder (model +PoseEnc -Segm).
Pose augmentation. Finally, a model is retrained without random pose augmentations, i.e. A is set to the identity transformation. In this setup, the system is trained to exactly reconstruct the facial expression and head pose driver image, and is therefore more likely to degrade into an autoencoder (provided the pose encoder is trained along with the whole system).
As can be seen from Figure 7 (Proposed invention → -Augm, +PoseEnc → +PoseEnc -Augm), although this further improves the ability to represent poses, it also hurts identity preservation a lot. In fact, a system with a powerful ResNeXt-50 pose encoder trained without pose augmentations (+PoseEnc -Augm) turned out to be the worst of the proposed models in terms of the identity error I_T, but at the same time the best model in terms of pose reconstruction quality. Such a model, again, may be very useful for self-reenactment, but terrible for "puppeteering" (cross-person reenactment). Still, even in self-reenactment, one has to be careful, as this model can give undesired effects such as image quality transfer (e.g. from driver #8 in Figure 8). Figure 8 illustrates a comparison of cross-person reenactment for the proposed best model and its ablated versions. The top left image in Fig. 8 shows one of the images defining the target identity from the VoxCeleb2 test split, i.e. the person to be rendered with a different facial expression and head pose. Other images in this row are also from the VoxCeleb2 test split and define the pose to transfer to the target person.
This effect is once again confirmed by removing pose augmentations from X2Face+ and FAb-Net+ (the -Augm suffix is added to each model). With random augmentations on, despite the person-specific nature of the X2Face and FAb-Net pose descriptors, the generator still develops robustness to person-specific features of the facial expression and head pose drivers. Without augmentations, however, the degree of "identity bleeding" becomes fully explained by the identity-specificity of those off-the-shelf descriptors. Also, the pose reconstruction error should decrease, given that the generator no longer has to be that robust to drivers, so some of its capacity frees up and may be devoted to rendering more accurate poses. As expected, Figure 7 shows a severe growth in identity error and a sharp drop in pose error for those two models. This once more shows that the X2Face and FAb-Net descriptors are not person-agnostic. In addition, the identity gap can be observed visually in Figure 9. Figure 9 illustrates the effect of pose augmentations on the X2Face+ and FAb-Net+ models; without augmentations, the identity gap becomes conspicuous. The top left image in Fig. 9 shows one of the images defining the target identity from the VoxCeleb2 test split, i.e. the person to be rendered with a different facial expression and head pose. Other images in this row are also from the VoxCeleb2 test split and define the pose to transfer to the target person.
Each of the remaining rows corresponds to one of the compared methods. The compared methods are enumerated in the four bottom rows of Table 2, namely X2Face+ and FAb-Net+, each evaluated with and without pose augmentations. Shown are the reenactment outputs of the respective methods. The identity gap is visible, e.g., in how glasses from driver #7 or facial shape from driver #1 are transferred to the result.
In conclusion, there is a trade-off between the identity preservation error I_T and the pose reconstruction error P_T. This trade-off can be adjusted by applying the above changes, depending on whether the self-reenactment scenario or the cross-person driving scenario is more important. The latter is the case for the proposed best model PROPOSED, while a good candidate for the former setting might be +PoseEnc or +PoseEnc -Segm.
Presented and evaluated is a neural facial expression and head pose reenactment system that uses latent pose descriptors and is able to achieve realistic reenactment. Unlike the predecessor system [42], which used keypoints as the pose descriptor, the proposed system learns the pose descriptors without explicit supervision, purely based on the reconstruction losses. The only weak form of supervision comes from the segmentation masks. The learned head pose descriptors outperform previous unsupervised descriptors at the task of pose-based retrieval, as well as at cross-person reenactment.
The main, perhaps surprising, finding is that limiting the capacity of the pose extraction network in the proposed scheme is sufficient for pose/identity disentanglement. At the same time, it may be that appropriate use of cyclic and/or adversarial losses could improve disentanglement even further.
Perhaps because of the constraint on the network capacity, the proposed pose descriptors and reenactment system have problems with capturing some subtle mimics, especially gaze direction (though they still do a better job than keypoint descriptors, which lack gaze representation altogether). Another obvious avenue of research is learning the pose descriptor and the entire system in a semi-supervised way.
At least one of the plurality of units may be implemented through an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor.
The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning.
Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
According to the disclosure, in a method of an electronic device, a method for recognizing the facial expression and head pose may obtain output data recognizing an image or a facial expression and head pose in the image by using image data as input data for an artificial intelligence model. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
Visual understanding is a technique for recognizing and processing things as does human vision and includes, e.g., object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
Meanwhile, the above-described method performed by the electronic device may be performed using an artificial intelligence model.
The foregoing exemplary embodiments are examples and are not to be construed as limiting. In addition, the description of the exemplary embodiments is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.
References
[1] V. Blanz, T. Vetter, et al. A morphable model for the synthesis of 3d faces. In Proc. SIGGRAPH, volume 99, pages 187-194, 1999.
[2] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proc. ICCV, pages 1021-1030, 2017.
[3] F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni. Expnet: Landmark-free, deep, 3d facial expressions. In Proc. FG, pages 122-129. IEEE, 2018.
[4] J. S. Chung, A. Nagrani, and A. Zisserman. Voxceleb2: Deep speaker recognition. In INTERSPEECH, 2018.
[5] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In Proc. CVPR, pages 3444-3453. IEEE, 2017.
[6] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. T-PAMI, (6):681-685, 2001.
[7] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, 2019.
[8] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Proc. NeurIPS, pages 4414-4423, 2017.
[9] P. Ekman. Facial action coding system. 1977.
[10] C. Fu, Y. Hu, X. Wu, G. Wang, Q. Zhang, and R. He. High fidelity face manipulation with extreme pose and expression. arXiv preprint arXiv:1903.12003, 2019.
[11] K. Gong, Y. Gao, X. Liang, X. Shen, M. Wang, and L. Lin. Graphonomy: Universal human parsing via graph transfer learning. In CVPR, 2019.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D.Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. NIPS, 2014.
[13] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE Computer Society, September 2008.
[14] R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. In Proc. CVPR, June 2018.
[15] R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. DenseReg: Fully convolutional dense shape regression in-the-wild. In CVPR, volume 2, page 5, 2017.
[16] X. Huang and S. Belongie. Arbitrary style transfer in realtime with adaptive instance normalization. In Proc. ICCV, 2017.
[17] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In Proc. ECCV, 2018.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, ICML' 15, pages 448-456, 2015.
[19] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks through conditional image generation. In Proc. NeurIPS, pages 4016-4027, 2018.
[20] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, June 2019.
[21] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. Proc. CVPR, pages 1867-1874, 2014.
[22] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video portraits. In Proc. SIGGRAPH, 2018.
[23] H. Kim and A. Mnih. Disentangling by factorising. In Proc. ICML, pages 2654-2663, 2018.
[24] M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised image-to-image translation. In Proc. ICCV, 2019.
[25] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proc. ICCV, December 2015.
[26] S. Lombardi, J. Saragih, T. Simon, and Y. Sheikh. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG), 37(4):68, 2018.
[27] F. Milletari, N. Navab, and S.-A. Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation, pages 565-571, October 2016.
[28] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
[29] A. Nagrani, J. S. Chung, and A. Zisserman. Voxceleb: a large-scale speaker identification dataset. In INTERSPEECH, 2017.
[30] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In Proc. ECCV, pages 818-833, 2018.
[31] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proc. CVPR, June 2018.
[32] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe. First order motion model for image animation. In Proc. NeurIPS, pages 7135-7145, 2019.
[33] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.
[34] J. Thewlis, S. Albanie, H. Bilen, and A. Vedaldi. Unsupervised learning of landmarks by descriptor vector exchange. In Proc. ICCV, 2019.
[35] S. Tripathy, J. Kannala, and E. Rahtu. Icface: Interpretable and controllable face reenactment using gans. CoRR, abs/1904.01909, 2019.
[36] T.Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro. Few-shot video-to-video synthesis. CoRR, abs/1910.12713, 2019.
[37] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. Proc. NeurIPS, 2018.
[38] O. Wiles, A. Koepke, and A. Zisserman. Self-supervised learning of a facial attribute embedding from video. In Proc. BMVC, 2018.
[39] O. Wiles, A. Sophia Koepke, and A. Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In Proc. ECCV, September 2018.
[40] F. Xiao, H. Liu, and Y. J. Lee. Identity from here, pose from there: Self-supervised disentanglement and generation of objects using unlabeled videos. In Proc. ICCV, October 2019.
[41] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, July 2017.
[42] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proc. ICCV, October 2019.
[43] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3fd: Single shot scale-invariant face detector. In Proc. ICCV, Oct 2017.
[44] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee. Unsupervised discovery of object landmarks as structural representations. In Proc. CVPR, pages 2694-2703, 2018.
[45] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. ICCV, 2017.

Claims (6)

  1. A hardware device comprising a software product that performs a method for neural facial expression and head pose reenactment, comprising:
    an identity encoder unit configured to obtain the identity descriptor from person A's image, wherein the output of the pose encoder unit does not contain information about person A's identity;
    a pose encoder unit configured to obtain the descriptor of head pose and facial expression from person B's image, wherein the output of the pose encoder unit does not contain information about person B's identity;
    a generator unit which receives the outputs of the identity encoder unit and the pose encoder unit, wherein the generator unit is configured to synthesize the avatar of person A, having the head pose and facial expression of person B.
  2. The hardware device according to claim 1, wherein the pose encoder unit is a convolutional neural network, which takes a human image as input and outputs a vector that describes head pose and facial expression and does not describe the person's identity.
  3. The hardware device according to claim 1, wherein person B's identity refers to person B's skin color, facial shape, eye color, clothing, and adornments.
  4. A method for synthesizing a photorealistic avatar of a person, comprising:
    obtaining the identity descriptor from person A's image by using an identity encoder unit;
    obtaining the descriptor of head pose and facial expression from person B's image by using a pose encoder unit;
    synthesizing the avatar of person A, having the head pose and facial expression of person B, by a generator unit which receives the outputs of the identity encoder unit and the pose encoder unit.
  5. The method according to claim 4, wherein the pose encoder unit is a convolutional neural network, which takes a human image as input and outputs a vector that describes head pose and facial expression and does not describe the person's identity.
  6. The method according to claim 4, wherein the identity comprises skin color, facial shape, eye color, clothing, and adornments.
PCT/KR2020/015688 2019-11-12 2020-11-10 Neural facial expressions and head poses reenactment with latent pose descriptors WO2021096192A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2019136336 2019-11-12
RU2019136336 2019-11-12
RU2020119034 2020-06-09
RU2020119034A RU2755396C1 (en) 2020-06-09 2020-06-09 Neural network transfer of the facial expression and position of the head using hidden position descriptors

Publications (1)

Publication Number Publication Date
WO2021096192A1 true WO2021096192A1 (en) 2021-05-20

Family

ID=75913114

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/015688 WO2021096192A1 (en) 2019-11-12 2020-11-10 Neural facial expressions and head poses reenactment with latent pose descriptors

Country Status (1)

Country Link
WO (1) WO2021096192A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626860A (en) * 2022-05-12 2022-06-14 武汉和悦数字科技有限公司 Dynamic identity identification method and device for online commodity payment
CN116311477A (en) * 2023-05-15 2023-06-23 华中科技大学 Cross-identity consistency-oriented face movement unit detection model construction method
EP4136574A4 (en) * 2021-06-14 2023-07-05 Tencent America Llc Video conferencing based on adaptive face re-enactment and face restoration
CN116796196A (en) * 2023-08-18 2023-09-22 武汉纺织大学 Co-language gesture generation method based on multi-mode joint embedding
CN117036620A (en) * 2023-10-07 2023-11-10 中国科学技术大学 Three-dimensional face reconstruction method based on single image

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030215115A1 (en) * 2002-04-27 2003-11-20 Samsung Electronics Co., Ltd. Face recognition method and apparatus using component-based face descriptor
US20060192785A1 (en) * 2000-08-30 2006-08-31 Microsoft Corporation Methods and systems for animating facial features, and methods and systems for expression transformation
US20080298643A1 (en) * 2007-05-30 2008-12-04 Lawther Joel S Composite person model from image collection
US20120309520A1 (en) * 2011-06-06 2012-12-06 Microsoft Corporation Generation of avatar reflecting player appearance
KR20160033552A (en) * 2014-09-18 2016-03-28 한화테크윈 주식회사 Face recognizing system using keypoint descriptor matching and majority vote and method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060192785A1 (en) * 2000-08-30 2006-08-31 Microsoft Corporation Methods and systems for animating facial features, and methods and systems for expression transformation
US20030215115A1 (en) * 2002-04-27 2003-11-20 Samsung Electronics Co., Ltd. Face recognition method and apparatus using component-based face descriptor
US20080298643A1 (en) * 2007-05-30 2008-12-04 Lawther Joel S Composite person model from image collection
US20120309520A1 (en) * 2011-06-06 2012-12-06 Microsoft Corporation Generation of avatar reflecting player appearance
KR20160033552A (en) * 2014-09-18 2016-03-28 한화테크윈 주식회사 Face recognizing system using keypoint descriptor matching and majority vote and method thereof

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4136574A4 (en) * 2021-06-14 2023-07-05 Tencent America Llc Video conferencing based on adaptive face re-enactment and face restoration
CN114626860A (en) * 2022-05-12 2022-06-14 武汉和悦数字科技有限公司 Dynamic identity identification method and device for online commodity payment
CN116311477A (en) * 2023-05-15 2023-06-23 华中科技大学 Cross-identity consistency-oriented face movement unit detection model construction method
CN116796196A (en) * 2023-08-18 2023-09-22 武汉纺织大学 Co-language gesture generation method based on multi-mode joint embedding
CN116796196B (en) * 2023-08-18 2023-11-21 武汉纺织大学 Co-language gesture generation method based on multi-mode joint embedding
CN117036620A (en) * 2023-10-07 2023-11-10 中国科学技术大学 Three-dimensional face reconstruction method based on single image
CN117036620B (en) * 2023-10-07 2024-03-01 中国科学技术大学 Three-dimensional face reconstruction method based on single image

Similar Documents

Publication Publication Date Title
WO2021096192A1 (en) Neural facial expressions and head poses reenactment with latent pose descriptors
Burkov et al. Neural head reenactment with latent pose descriptors
Zhang et al. Cross-domain correspondence learning for exemplar-based image translation
Tomei et al. Art2real: Unfolding the reality of artworks via semantically-aware image-to-image translation
Wang et al. Region attention networks for pose and occlusion robust facial expression recognition
Liu et al. Two-stream transformer networks for video-based face alignment
Qian et al. Unsupervised face normalization with extreme pose and expression in the wild
Masi et al. Do we really need to collect millions of faces for effective face recognition?
WO2020190083A1 (en) Electronic device and controlling method thereof
Khabarlak et al. Fast facial landmark detection and applications: A survey
WO2020096403A1 (en) Textured neural avatars
WO2022250408A1 (en) Method and apparatus for video recognition
US20220207847A1 (en) Adjusting a Digital representation of a Head Region
US20200272806A1 (en) Real-Time Tracking of Facial Features in Unconstrained Video
CN108363973B (en) Unconstrained 3D expression migration method
Yu et al. Heatmap regression via randomized rounding
EP3874415A1 (en) Electronic device and controlling method thereof
Xu et al. Designing one unified framework for high-fidelity face reenactment and swapping
Weber et al. High-level geometry-based features of video modality for emotion prediction
Wang et al. Facial expression-aware face frontalization
He et al. Autolink: Self-supervised learning of human skeletons and object outlines by linking keypoints
Ren et al. HR-Net: a landmark based high realistic face reenactment network
Yu et al. VTON-MP: Multi-Pose Virtual Try-On via Appearance Flow and Feature Filtering
Cai et al. Cascading scene and viewpoint feature learning for pedestrian gender recognition
Shih et al. Video interpolation and prediction with unsupervised landmarks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20888360

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20888360

Country of ref document: EP

Kind code of ref document: A1