WO2021228183A1 - Facial re-enactment - Google Patents

Facial re-enactment

Info

Publication number
WO2021228183A1
WO2021228183A1 (PCT/CN2021/093530)
Authority
WO
WIPO (PCT)
Prior art keywords
images
source
target
parameters
identity
Prior art date
Application number
PCT/CN2021/093530
Other languages
English (en)
Inventor
Stefanos ZAFEIRIOU
Rami KOUJAN
Michail-Christos DOUKAS
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021228183A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/18Image warping, e.g. rearranging pixels individually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/60Editing figures and text; Combining figures or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Definitions

  • the present specification relates to facial re-enactment.
  • Facial re-enactment aims at transferring the expression of a source actor to a target face.
  • Example approaches include transferring the expressions of the source actor by modifying deformations within the inner facial region of the target actor.
  • this specification describes a method comprising: generating, based on a plurality of source input images of a first source subject, a first plurality of sequential source face coordinate images and gaze tracking images of the first source subject, wherein the first plurality of source face coordinate images comprise source identity parameters and source expression parameters, wherein the source expression parameters are represented as offsets from the source identity parameters; generating, based on a plurality of target input images of a first target subject, a plurality of sequential target face coordinate images and gaze tracking images of the first target subject; wherein the target face coordinate images comprise target identity parameters and target expression parameters, wherein the target expression parameters are represented as offsets from the target identity parameters; and generating, using a first neural network, a plurality of sequential output images, wherein the plurality of sequential output images are based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters.
  • the first plurality of source face coordinate images may further comprise source imaging parameters, wherein the source imaging parameters represent at least one of a rotation, translation, and/or orthographic scale of the source subject in the plurality of source input images.
  • the target face coordinate images may further comprise target imaging parameters, wherein the target imaging parameters represent at least one of a rotation, translation, and/or orthographic scale of the target subject in the plurality of target input images;
  • Some examples include training the first neural network with the first plurality of source face coordinate images and the plurality of target face coordinate images;
  • Some examples include training the first neural network with a plurality of source face coordinate images associated with a plurality of source subjects.
  • the first neural network comprises a generative adversarial network
  • the generative adversarial network comprises: a generator module for generating the plurality of output images; and at least one discriminator module for generating at least one loss function respectively based on the one or more ground truth inputs and the plurality of output images generated by the generator module.
  • Some examples include training the generative adversarial network, wherein the training comprises: receiving one or more ground truth inputs related to the target subject; updating the generative adversarial network based on the at least one loss function.
  • the at least one discriminator module comprises at least one of an image discriminator, a mouth discriminator, and a dynamics discriminator.
  • training the first neural network further comprises determining an identity feature vector of the target subject based at least in part on the target face coordinate images and an identity coefficient of the target subject, wherein the generator module generates the plurality of output images based at least in part on the identity feature vector of the target subject.
  • the identity feature vector may be determined from concatenating target face coordinate images and an output of the identity embedder.
  • generating the plurality of output images using the trained first neural network comprises: initializing the generator module; initializing the at least one discriminator module; and mapping the source expression parameters and the source gaze tracking images on the target identity parameters based on at least one determined identity feature vector of the target subject.
  • the source face coordinate images and the target face coordinate images comprise normalized mean face coordinates, wherein the normalized mean face coordinates comprise an encoding function of a rasterizer and a normalized version of a three dimensional morphable model.
  • the target face coordinate images are used as a conditional input to the first neural network.
  • this specification describes a method of training a neural network to generate a plurality of sequential output images, the method comprising: inputting a first plurality of sequential source face coordinate images and gaze tracking images of a first source subject, wherein the sequential source face coordinate images and gaze tracking images are based on a plurality of source input images of the first source subject; inputting a plurality of sequential target face coordinate images and gaze tracking images of a first target subject, wherein the plurality of sequential target face coordinate images and gaze tracking images are based on a plurality of target input images of the first target subject; generating, at a generator module of the neural network, the plurality of sequential output images based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters; generating, at a discriminator module of the neural network, at least one loss function based on one or more ground truth inputs and the plurality of sequential output images; and updating one or more parameters of the neural network based on the at least one loss function.
  • Some examples include inputting a plurality of source face coordinate images associated with a plurality of source subjects, wherein the plurality of sequential output images is further based on the plurality of source face coordinate images associated with the plurality of source subjects.
  • the discriminator module comprises at least one of an image discriminator, a mouth discriminator, and a dynamics discriminator.
  • Some examples include determining an identity feature vector of the target subject based at least in part on the target face coordinate images and an identity coefficient of the first target subject, wherein the generator module of the neural network generates the plurality of sequential output images based at least in part on the identity feature vector of the target subject.
  • generating the plurality of sequential output images may comprise: initializing the generator module of the neural network; initializing the discriminator module of the neural network; mapping the source expression parameters and the source gaze tracking images on target identity parameters based on at least one determined identity feature vector of the first target subject.
  • this specification describes a system comprising one or more processors and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform a method of the first aspect or the second aspect.
  • this specification describes a computer program product comprising computer readable instructions that, when executed by a computing system, cause the computing system to perform a method of the first aspect or the second aspect.
  • FIG. 1 shows an image mapping module in accordance with an example embodiment.
  • FIG. 2 is a block diagram of a system in accordance with an example embodiment
  • FIG. 3 is a block diagram of a system in accordance with an example embodiment
  • FIG. 4 shows an example of feature extraction in accordance with an example embodiment
  • FIG. 5 is a flow diagram showing an algorithm in accordance with an example embodiment
  • FIG. 6 is a flow diagram showing an algorithm in accordance with an example embodiment
  • FIG. 7 is a block diagram of a system in accordance with an example embodiment
  • FIG. 8 is a data structure in accordance with an example embodiment
  • FIG. 9 is a block diagram showing an example of a structure of a neural network for use in an example embodiment
  • FIG. 10 shows an example discriminator module in accordance with an example embodiment
  • FIG. 11 is a block diagram of a system in accordance with an example embodiment
  • FIG. 12 is a flow diagram of an algorithm in accordance with an example embodiment
  • FIG. 13 is a flow diagram showing an algorithm in accordance with an example embodiment.
  • FIG. 14 is a block diagram of a system in accordance with an example embodiment.
  • FIG. 1 shows an image mapping module, indicated generally by the reference numeral 10, in accordance with an example embodiment.
  • the image mapping module 10 has a first input receiving images (e.g. video images) of a source actor 12, a second input receiving images (e.g. video images) of a target actor 14, and an output 16.
  • the source may be considered to provide “expressions” .
  • the target may be considered to be an “identity” to which the expressions of the source actor are to be applied.
  • the image mapping module 10 seeks to transfer the expressions of the source actor 12 onto the target actor 14 to provide the output 16, thereby implementing facial re-enactment.
  • the inputs 12 and 14 and the output 16 may all be videos.
  • the module 10 seeks to transfer the pose, facial expression, and/or eye gaze movement from the source actor 12 to the target actor 14, thereby providing the output 16. Given that people may easily detect mistakes in the appearance of a human face, specific attention may be given to the details of the mouth and the teeth such that photo-realistic and temporally consistent videos of faces are generated as outputs.
  • FIG. 2 is a block diagram of a system, indicated generally by the reference numeral 20, in accordance with an example embodiment.
  • the system 20 comprises a first module 22, which module may be implemented as part of the image mapping module 10 described above.
  • the first module 22 receives source input images 24 (e.g. source video images) .
  • the source input images 24 may be obtained from the source actor 12 described above.
  • the first module 22 generates source face coordinate images and gaze tracking images from the source input images 24.
  • FIG. 3 is a block diagram of a system, indicated generally by the reference numeral 30, in accordance with an example embodiment.
  • the system 30 comprises a first module 32, which module may be implemented as part of the image mapping module 10 described above.
  • the first module 32 receives target input images 34 (e.g. target video images) .
  • the target input images 34 may be obtained from the target actor 14 described above.
  • the first module 32 generates target face coordinate images and gaze tracking images from the target input images 34.
  • the source face coordinate images and the target face coordinate images described above may comprise normalized mean face coordinates (NMFC images) , wherein the normalized mean face coordinates comprise an encoding function of a rasterizer and a normalized version of a three dimensional morphable model.
  • FIG. 4 shows an example of feature extraction (e.g. extraction of pose, facial expression, and/or eye gaze movement of a source actor) in accordance with an example embodiment. More specifically FIG. 4 shows an example face coordinate image 42 (such as the source coordinate images and target coordinate images described above) and an example gaze tracking image 44 (such as the gaze tracking images of the source actor 12 or the target 14 described above) .
  • FIG. 5 is a flow diagram showing an algorithm, indicated generally by the reference numeral 50, in accordance with an example embodiment.
  • the algorithm 50 starts at operation 52, where facial reconstruction and tracking is carried out using the systems 20 and 30 described above.
  • conditioning images are generated based on the outputs of the operation 52.
  • video rendering is performed based on the conditional images generated in the operation 54.
  • the facial reconstruction and tracking at operation 52 may comprise generating target face coordinate images, gaze tracking images, source face coordinate images, and source gaze tracking images.
  • human head characteristics (e.g. related to the identity of a person) may be modelled using three dimensional morphable models (3DMMs), which may be used for 3D reconstruction of faces (e.g. the face of a source actor and/or a target actor) from an input image sequence (e.g. an input video).
  • the 3D facial reconstruction and tracking at operation 52 may produce two sets of parameters comprising shape parameters and imaging parameters.
  • the shape parameters may comprise at least one of identity parameters and the expression parameters.
  • the shape parameters may be represented by s_t, and the imaging parameters may be represented by p_t.
  • the imaging parameters may represent at least one of a rotation, translation, and/or orthographic scale of a subject (source or target actor) in the plurality of input images.
  • a 3D facial shape may be represented mathematically according to the equation 1 below:
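  • Equation 1 is not reproduced in this extract. A plausible reconstruction, assuming the standard linear 3DMM formulation (the mean shape and basis matrices named below are illustrative, not taken from the text), is:

```latex
% Equation (1), sketched under the standard 3DMM assumption:
% x is the 3D facial shape, \bar{x} the mean face shape,
% U_{id} and U_{exp} the identity and expression bases, and
% s_{id}, s_{exp} the identity and expression parameter vectors.
x = \bar{x} + U_{id}\, s_{id} + U_{exp}\, s_{exp}
```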
  • the face coordinate images (e.g. the source face coordinate images and/or the target face coordinate images) , denoted by the 3D facial shape x, may be a function of the identity parameters and the expression parameters
  • expression variations may be represented as offsets from a given identity shape.
  • the identity parameters and expression parameters of a source actor and/or target actor may be determined based at least partially on a plurality of video frames from an input source video 24 and/or target video 34 respectively.
  • Facial videos e.g. videos of the face of a source actor or videos of the face of a target actor
  • Facial landmarking may be used for achieving reliable landmark localisation such that the landmark information may be combined with 3D morphable models in order to perform the facial reconstruction.
  • the camera used for obtaining the videos may perform scaled orthographic projection (SOP) .
  • the identity parameters may be fixed over all the video frames (as the video is assumed to be of a single person with the same identity, e.g. either the source actor or the target actor) , and the expression parameters and/or imaging parameters may vary over the video frames (as the expression, pose, and/or camera or imaging angles may change through the video) .
  • for a given image sequence, a cost function may be minimized, where the cost function may comprise a) a sum of squared 2D landmark reprojection errors over all frames (E_l), b) a shape priors term (E_pr) that imposes a quadratic prior over the identity and per-frame expression parameters, and c) a temporal smoothness term (E_sm) that enforces smoothness of the expression parameters in time, by using a quadratic penalty of the second temporal derivatives of the expression vector, as shown in equation 2 below:
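  • Equation 2 is likewise missing from the extract; a hedged reconstruction from the three terms named above (the weighting factors are assumptions) is:

```latex
% Equation (2), sketched from the description above:
% E_l  - sum of squared 2D landmark reprojection errors over all frames,
% E_pr - quadratic prior on the identity and per-frame expression parameters,
% E_sm - quadratic penalty on the second temporal derivatives of the expression vectors.
E = E_l + \lambda_{pr}\, E_{pr} + \lambda_{sm}\, E_{sm}
```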
  • box constraints may be imposed on identity parameters and expression parameters (e.g. per frame expression parameters) in order to account for outliers (e.g. frames with strong occlusions that cause gross errors in the landmarks) .
  • identity parameters of the 3D morphable model may originate from the Large Scale Morphable Model (LSFM) , for example, built from a plurality of (e.g. approximately 10,000) images of different people, with varied demographic information.
  • the expression parameters of the 3D morphable model may originate from using the blendshapes model of FaceWarehouse and non-rigid ICP to register the blendshapes model with the LSFM model.
  • the conditioning of image generation at operation 54 may comprise training a neural network (e.g. implemented in a video renderer module similar to the first module 32) with a plurality of images 34 of the target actor, such that the trained neural network may generate target face coordinate images and target gaze tracking images based on target input images, for example, similar to the first module 32 described above.
  • training the neural network may comprise providing a sequence of images (e.g. a video) of the target actor and the target face coordinate images (e.g. identity parameters, expression parameters, and/or imaging parameters) obtained in the facial reconstruction and tracking operation 52.
  • the parameterisation obtained from the target face coordinate images separates the identity of the target actor from the expression of the target actor, such that the neural network that is trained on the specific target actor in a training phase may be used for transferring the expression and/or pose of another source actor with a different identity during a test phase.
  • the inputs to the neural network comprise the face coordinate images (e.g. outputs from system 20 or 30), which, for example, may be determined based on the shape parameters (s_t) (e.g. including identity parameters and expression parameters) and the imaging parameters (p_t).
  • the shape parameters and imaging parameters may be used for rasterizing the 3D facial reconstruction of the frame, such that a visibility mask is produced in the image space.
  • each pixel of the visibility mask M may store an identity of a corresponding visible triangle on the 3D face from the respective pixel.
  • the normalised x-y-z coordinates of the centre of this triangle may be encoded in another image, termed the Normalised Mean Face Coordinates (NMFC) image, and the NMFC image may be utilised as a conditional input of the video rendering stage at operation 56.
  • the NMFC image may be generated according to equation (3) below:
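  • Equation (3) is not reproduced here; the following is a hedged sketch of the NMFC encoding described above, where M_t(p) denotes the index of the triangle visible at pixel p (this notation is an assumption):

```latex
% NMFC image C_t: each visible pixel p stores the normalised x-y-z coordinates
% of the centre of the mean-face triangle indexed by the visibility mask M_t(p).
C_t(p) =
\begin{cases}
\mathcal{N}\!\left(\bar{x}_{M_t(p)}\right) & \text{if a triangle is visible at pixel } p,\\[2pt]
0 & \text{otherwise,}
\end{cases}
\qquad \mathcal{N}(\cdot)\ \text{scales coordinates to } [0, 1].
```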
  • the input of the neural video renderer may further be conditioned on the gaze image (G) of the corresponding frame t.
  • the video rendering at operation 56 may comprise using a neural network for rendering a video of the target actor with the expression and/or pose of a source actor.
  • a sequence of NMFC frames c_{1:T} and a corresponding sequence of eye gaze frames e_{1:T} may be obtained for the source actor.
  • a conditional input to the neural network x_{1:T} may be computed, for example, by concatenating the NMFC frame sequence of the source actor and the eye gaze frame sequence of the source actor at the channel dimension, such that each conditional frame x_t combines the NMFC frame c_t and the eye gaze frame e_t.
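  • As a minimal illustration of this channel-wise concatenation (a sketch only; the tensor layout and the helper name are assumptions, not taken from the patent):

```python
import torch

def build_conditional_input(nmfc_frames: torch.Tensor,
                            gaze_frames: torch.Tensor) -> torch.Tensor:
    """Concatenate NMFC and eye-gaze frames along the channel dimension.

    nmfc_frames: (T, 3, H, W) sequence c_1:T of NMFC images.
    gaze_frames: (T, 3, H, W) sequence e_1:T of eye-gaze images.
    Returns x_1:T with shape (T, 6, H, W), used to condition the video renderer.
    """
    return torch.cat([nmfc_frames, gaze_frames], dim=1)

# Example: T = 10 conditioning frames of size 256x256.
x = build_conditional_input(torch.rand(10, 3, 256, 256),
                            torch.rand(10, 3, 256, 256))
assert x.shape == (10, 6, 256, 256)
```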
  • the neural network may learn to translate its conditional input video to a realistic and temporally coherent output video which may show the target actor performing similar head motions, facial expressions, and/or eye blinks as the source actor in the source input video.
  • FIG. 6 is a flow diagram showing an algorithm, indicated generally by the reference numeral 60, in accordance with an example embodiment.
  • the algorithm 60 starts at operation 62 where a first plurality of sequential source face coordinate images (e.g. NMFC images) are generated.
  • the first plurality of sequential source face coordinate images comprise face coordinate images and gaze tracking images of the first source subject that are generated based on a plurality of source input images of a first source subject (e.g. the source actor 12) .
  • the operation 62 may be implemented by the first module 22 of the system 20 described above with reference to FIG. 2.
  • the operation 62 may be implemented using the techniques described with reference to the facial reconstruction and tracking stage 52 as described above.
  • the first plurality of source face coordinate images may comprise source identity parameters and source expression parameters.
  • the source expression parameters are represented as offsets from the source identity parameters.
  • a plurality of sequential target face coordinate images (e.g. NMFC images) are generated.
  • the plurality of sequential target face coordinate images comprises face coordinate images and gaze tracking images of the first target subject that are generated based on a plurality of target input images of a first target subject (e.g. the target 14) .
  • the operation 64 may be implemented by the first module 32 of the system 30 described above with reference to FIG. 3.
  • the operation 64 may be implemented using the techniques described with reference to the facial reconstruction and tracking stage 52 as described above.
  • the target face coordinate images may comprise target identity parameters and target expression parameters.
  • the target expression parameters may be represented as offsets from the target identity parameters.
  • the plurality of source face coordinate images may further comprise source imaging parameters (e.g. camera parameters or imaging parameters (P) in equation (2) described above) .
  • the source imaging parameters may represent at least one of a rotation, translation, and/or orthographic scale of the source subject in the plurality of source input images.
  • the plurality of target face coordinate images may further comprise target imaging parameters (e.g. camera parameters or imaging parameters (P) in equation (2) described above) .
  • the target imaging parameters may represent at least one of a rotation, translation, and/or orthographic scale of the target subject in the plurality of target input images.
  • a plurality of sequential output images are generated.
  • the plurality of sequential output images are based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters.
  • the plurality of sequential output images may be generated using a first neural network. Operation 68 may be implemented as part of the video rendering stage 56 described above with reference to FIG. 5.
  • the neural network used to generate the output images in the operation 68 described above may be trained in an operation 66 of the algorithm 60.
  • the operation 66 is shown in dotted form between the operation 64 and 68 by way of example only.
  • the operation 66 is optional, since the neural network may be trained at other times.
  • the training of the first neural network at operation 66 may be performed at the conditioning image generation stage 54 described above with reference to FIG. 5.
  • FIG. 7 is a block diagram of a system, indicated generally by the reference numeral 70, in accordance with an example embodiment.
  • the system 70 may be used to implement the operation 68 or algorithm 60 described above.
  • the system 70 comprises a first neural network 72.
  • the first neural network 72 receives source face coordinates and gaze tracking images 73 (see the operation 62 of the algorithm 60 described above) and target face coordinate and gaze tracking images 74 (see the operation 64 of the algorithm 60 described above) and generates output images (see the operation 68 of the algorithm 60 described above) .
  • the target face coordinate images 74 may be used as a conditional input to the first neural network 72.
  • the output images may comprise a plurality of sequential output images that may be based on a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters.
  • the algorithm 60 may include the training of the first neural network 72.
  • the first neural network may be trained with the plurality of source face coordinate images and the plurality of target face coordinate images.
  • the plurality of source face coordinate images may comprise face coordinate images of the source actor 12, and the plurality of target face coordinate images may comprise face coordinate images of the target actor 14.
  • the first neural network 72 may be trained, for example, with a plurality of source face coordinate images associated with a plurality of source subjects. As such, the first neural network 72 may be trained with data from a wider range of source subjects, which may allow the neural network to be used for facial re-enactment of the target actor even when data available for the target actor (e.g. input video of the target actor) is limited (e.g. a short video of 50 to 500 frames of the target actor’s face) .
  • FIG. 8 is a data structure 80 in accordance with an example embodiment.
  • the data structure 80 shows example parameters of the face coordinate images described above (e.g. the source face coordinate images and the target face coordinate images) .
  • the data structure 80 of a face coordinate image may comprise identity parameters 82, expression parameters 84, and, optionally, imaging parameters 86.
  • the first plurality of source face coordinate images comprises source identity parameters and source expression parameters, wherein the source expression parameters are represented as offsets from the source identity parameters.
  • the target face coordinate images comprise target identity parameters and target expression parameters, wherein the target expression parameters are represented as offsets from the target identity parameters.
  • FIG. 9 is a block diagram showing an example of a structure of the neural network 72 described above in accordance with an example embodiment.
  • the neural network 72 comprises a generator module 92 and at least one discriminator module 94 that may collectively form a generative adversarial network (GAN) .
  • a GAN framework may be adopted for video translation, where the generator G is trained in an adversarial manner, alongside an image discriminator D_I and a multiscale dynamics discriminator D_D, which ensures that the generated video (e.g. sequential output images) is realistic, temporally coherent and conveys the same dynamics as the target actor’s input video.
  • a further mouth discriminator D_M may be used for improved visual quality of the mouth area.
  • the training of the neural network 72 at operation 66 may be implemented using the generative adversarial network.
  • the source face coordinate images (e.g. as generated at system 20, at operation 62) and the target face coordinate images (e.g. as generated at system 30, at operation 64) may be provided as inputs to the generator module 92.
  • the generator module 92 may further receive source gaze tracking images and target gaze tracking images.
  • the generator module 92 may provide output images comprising a mapping of the source expression parameters and the source gaze tracking images on the target identity parameters.
  • the discriminator module 94 may receive ground truth images of the target actor as inputs, and also receive as input the output images generated by the generator module 92.
  • the at least one discriminator module 94 may generate at least one loss function respectively based on the one or more ground truth inputs and the plurality of output images generated by the generator module 92.
  • the loss function may indicate how realistic the output images may be perceived to be.
  • the loss function is then provided as a feedback to the generator module 92, such that the following output images are generated based at least partially on the loss function and the previous output images.
  • the loss function may be used for the training of the generator module 92, such that the output of the generator module 92 is improved in subsequent iterations.
  • the neural network 72 (e.g. implemented by the generative adversarial network) may be trained and/or initialized based on one or more ground truth inputs.
  • the neural network 72 may receive one or more ground truth inputs related to the target subject, and may then update the generative adversarial network based on the at least one loss function generated by the discriminator module.
  • dependence of the output frames on previous frames may be modelled by conditioning synthesis of the t-th frame using the conditional input x_t, the previous input images x_{t-1} and x_{t-2}, as well as the previously generated output frames ŷ_{t-1} and ŷ_{t-2}, as shown in equation 4 below:
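  • A hedged reconstruction of equation 4, using ŷ to denote generated frames (the hat notation is not in the extract):

```latex
% Equation (4), sketched from the description above: the t-th output frame is
% synthesised from the conditional inputs x_{t-2:t} and the two previously
% generated output frames.
\hat{y}_t = G\!\left(x_{t-2:t},\; \hat{y}_{t-2:t-1}\right)
```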
  • the generator module 92 may generate output frames sequentially, such that the frames are produced one after the other, until the entire output sequence has been produced.
  • FIG. 10 shows features of an example discriminator module 94 in accordance with an example embodiment.
  • the discriminator module 94 comprises at least one of an image discriminator 102, a mouth discriminator 104, and a dynamics discriminator 106.
  • the image discriminator 102 may be used for distinguishing ground truth images from generated output images.
  • the mouth discriminator 104 may be provided to distinguish between cropped mouth-area ground truth images and generated output images.
  • the mouth discriminator 104 may be provided as a separate module from the image discriminator 102, as providing focused training for the mouth region may provide more realistic results for the output images.
  • the dynamics discriminator 106 may be provided to distinguish between sequential ground truth images (e.g. ground truth videos) and generated sequential output images (e.g. output video) .
  • the dynamics discriminator 106 may comprise a temporal discriminator that is trained to detect videos with unrealistic temporal dynamics (e.g. ensuring that the generated video is realistic, temporally coherent and conveys the same dynamics as the target video).
  • FIG. 11 is a block diagram of a system, indicated generally by the reference numeral 110, in accordance with an example embodiment.
  • the system 110 may be an example implementation of the generator module 92.
  • the system 110 may comprise encoding pipelines 111 and 112, and decoding pipeline 116.
  • the encoding pipeline 111 may receive concatenated NMFC images 113 and eye gaze images 114 (x_{t-2:t}).
  • the encoding pipeline 112 may receive at least two previously generated output images
  • the encoding pipelines 111 and 112 may each comprise a convolution, batch normalization, and rectified linear unit module, a downsampling module (e.g. 3x) , and a residual module (e.g. 4x) .
  • the resulting encoded features of the encoding pipelines 111 and 112 may be added and passed through the decoding pipeline 116, which provides the output in a normalised [-1, +1] range, using a tanh activation.
  • the decoding pipeline may comprise a residual module (e.g. 5x), an upsampling module (e.g. 3x), and a convolution tanh module.
  • ck may denote a 7x7 Convolution2D, BatchNorm, ReLU layer with k filters and unitary stride
  • dk may denote a 3x3 Convolution, BatchNorm, ReLU layer with k filters, and stride 2
  • uk may denote a 3x3 transpose Convolution2D, BatchNorm, ReLU layer with k filters, and stride of 1/2.
  • the encoding architecture may therefore be c64, d128, d256, d512, R512, R512, R512, R512 and the decoding architecture is R512, R512, R512, R512, u256, u128, u64, conv, tanh.
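  • The listing below is a minimal PyTorch sketch of one encoding pipeline and the decoding pipeline following the layer layout listed above; the input channel counts, padding values and other hyper-parameters not stated in the text are assumptions, so this is an illustration rather than the patented implementation.

```python
import torch
import torch.nn as nn

def ck(in_c, k):   # 7x7 Convolution2D, BatchNorm, ReLU with unitary stride
    return nn.Sequential(nn.Conv2d(in_c, k, 7, 1, 3), nn.BatchNorm2d(k), nn.ReLU(True))

def dk(in_c, k):   # 3x3 Convolution, BatchNorm, ReLU with stride 2 (downsampling)
    return nn.Sequential(nn.Conv2d(in_c, k, 3, 2, 1), nn.BatchNorm2d(k), nn.ReLU(True))

def uk(in_c, k):   # 3x3 transpose Convolution2D, BatchNorm, ReLU with stride 1/2 (upsampling)
    return nn.Sequential(nn.ConvTranspose2d(in_c, k, 3, 2, 1, output_padding=1),
                         nn.BatchNorm2d(k), nn.ReLU(True))

class R(nn.Module):   # R512 residual block
    def __init__(self, c=512):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1), nn.BatchNorm2d(c), nn.ReLU(True),
                                  nn.Conv2d(c, c, 3, 1, 1), nn.BatchNorm2d(c))
    def forward(self, x):
        return x + self.body(x)

class EncodingPipeline(nn.Module):   # c64, d128, d256, d512, R512 x4
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(ck(in_channels, 64), dk(64, 128), dk(128, 256), dk(256, 512),
                                 *[R() for _ in range(4)])
    def forward(self, x):
        return self.net(x)

class DecodingPipeline(nn.Module):   # R512 x4, u256, u128, u64, conv, tanh
    def __init__(self, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(*[R() for _ in range(4)],
                                 uk(512, 256), uk(256, 128), uk(128, 64),
                                 nn.Conv2d(64, out_channels, 7, 1, 3), nn.Tanh())
    def forward(self, x):
        return self.net(x)

# Features of the conditioning-stream encoder and the previous-frame encoder are
# added and decoded to an image in the normalised [-1, +1] range.
enc_cond = EncodingPipeline(18)   # assumed: 3 conditional frames x (3 NMFC + 3 gaze) channels
enc_prev = EncodingPipeline(6)    # assumed: 2 previously generated frames x 3 channels
decoder = DecodingPipeline()
y = decoder(enc_cond(torch.rand(1, 18, 256, 256)) + enc_prev(torch.rand(1, 6, 256, 256)))
assert y.shape == (1, 3, 256, 256)
```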
  • generator module 92 receives a loss function from the discriminator module 94, which discriminator module may comprise the image discriminator 102, mouth discriminator 104, and dynamics discriminator 106.
  • the image discriminator 102 (D_I) and the mouth discriminator 104 (D_M) may learn to distinguish real frames from synthesized frames (e.g. generated output images).
  • a given time step t' in the range [1, T] may be uniformly sampled.
  • the real pair (x_t', y_t') and the generated pair (x_t', ŷ_t') are provided as inputs to the image discriminator D_I.
  • the corresponding mouth regions of y_t' and ŷ_t' may be cropped and provided as inputs to the mouth discriminator 104 (D_M).
  • a Markovian discriminator architecture may be used.
  • the dynamics discriminator 106 may be trained to detect videos with unrealistic temporal dynamics.
  • the dynamics discriminator 106 may receive K (e.g. K = 3) consecutive real frames y_{t':t'+K-1} (e.g. ground truth frames) or generated frames (e.g. synthesized frames) as inputs (e.g. the real or generated frames may be randomly drawn from the video).
  • the dynamics discriminator 106 may further ensure that the flow w_{t':t'+K-2} corresponds to the given video clip.
  • the dynamics discriminator 106 should learn to identify the pair (w_{t':t'+K-2}; y_{t':t'+K-1}) as real and the pair (w_{t':t'+K-2}; ŷ_{t':t'+K-1}), formed with the generated frames, as generated.
  • a multiple scale dynamics discriminator may be employed, which may perform the tasks described above at three different temporal scales.
  • the first scale dynamics discriminator receives sequences at the original frame rate. Then, the two extra scales are formed not by choosing subsequent frames in the sequence, but by subsampling the frames by a factor of K for each scale.
  • the objective of our GAN-based framework can be expressed as an adversarial loss.
  • the Least Squares Generative Adversarial Networks (LSGAN) loss may be used, such that the adversarial objective of the generator may be given by the equation 5 below:
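  • A hedged reconstruction of equation 5, assuming the usual LSGAN generator objective summed over the image, mouth and dynamics discriminators (the superscript m for mouth crops is an assumption):

```latex
% Equation (5), sketched: the generator is pushed to make each discriminator
% score its generated data as real (i.e. output 1).
\mathcal{L}^{G}_{adv} =
  \mathbb{E}\!\left[\left(D_I(x_t, \hat{y}_t) - 1\right)^2\right]
+ \mathbb{E}\!\left[\left(D_M(x^{m}_t, \hat{y}^{m}_t) - 1\right)^2\right]
+ \mathbb{E}\!\left[\left(D_D(w_{t:t+K-2}, \hat{y}_{t:t+K-1}) - 1\right)^2\right]
```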
  • two more losses may be added in the learning objective function of the generator, such as a VGG loss L^G_vgg and a feature matching loss L^G_feat, which may be based on the discriminators.
  • given a ground truth frame y_t and the synthesised frame ŷ_t, the VGG network may be used for extracting image features in different layers of these frames. The loss is then computed based on equation 6 provided below:
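  • A hedged reconstruction of equation 6 as a layer-wise l1 distance between VGG features (the layer weights 1/N_k are an assumption):

```latex
% Equation (6), sketched: F^{(k)} denotes the VGG features extracted at layer k.
\mathcal{L}^{G}_{vgg} = \sum_{k} \frac{1}{N_k}
  \left\| F^{(k)}(y_t) - F^{(k)}(\hat{y}_t) \right\|_1
```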
  • the discriminator feature matching loss may be computed by extracting features with the two discriminators 102 (D_I) and 106 (D_D) and computing the l1 distance of these features for a generated frame ŷ_t and the corresponding ground truth y_t. Therefore, the total objective for G may be given by equation (7) below:
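  • A hedged reconstruction of equation (7), combining the adversarial, VGG and feature matching terms (the λ weights are assumptions):

```latex
% Equation (7), sketched: total learning objective of the generator G.
\mathcal{L}^{G} = \mathcal{L}^{G}_{adv}
  + \lambda_{vgg}\, \mathcal{L}^{G}_{vgg}
  + \lambda_{feat}\, \mathcal{L}^{G}_{feat}
```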
  • the image, mouth, and dynamics discriminators 102, 104, and 106 may be optimised under their corresponding adversarial objective functions.
  • the image discriminator 102 may be optimized under the objective given by equation 2a below:
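  • A hedged reconstruction of equation 2a, assuming the LSGAN form (real pairs scored towards 1, generated pairs towards 0):

```latex
% Equation (2a), sketched: image discriminator objective.
\mathcal{L}_{D_I} = \mathbb{E}\!\left[\left(D_I(x_{t'}, y_{t'}) - 1\right)^2\right]
                  + \mathbb{E}\!\left[D_I(x_{t'}, \hat{y}_{t'})^2\right]
```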
  • the mouth discriminator 104 may be optimized under a similar adversarial loss, which results from replacing the whole images in equation (2a) with the cropped mouth areas, as shown in equation 3a below:
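  • The corresponding hedged sketch of equation 3a, with the superscript m denoting the cropped mouth areas (a notational assumption):

```latex
% Equation (3a), sketched: mouth discriminator objective on cropped mouth regions.
\mathcal{L}_{D_M} = \mathbb{E}\!\left[\left(D_M(x^{m}_{t'}, y^{m}_{t'}) - 1\right)^2\right]
                  + \mathbb{E}\!\left[D_M(x^{m}_{t'}, \hat{y}^{m}_{t'})^2\right]
```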
  • the image cropping may be performed around the centre of the mouth on the real (y_t'), generated, and NMFC images, thus providing images of size 64x64.
  • the dynamics discriminator 106 is trained to minimize the adversarial loss according to equation 4a below:
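  • A hedged sketch of equation 4a, conditioning the dynamics discriminator on the facial flow of the clip:

```latex
% Equation (4a), sketched: dynamics discriminator objective on short clips.
\mathcal{L}_{D_D} = \mathbb{E}\!\left[\left(D_D(w_{t':t'+K-2}, y_{t':t'+K-1}) - 1\right)^2\right]
                  + \mathbb{E}\!\left[D_D(w_{t':t'+K-2}, \hat{y}_{t':t'+K-1})^2\right]
```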
  • the dynamics discriminator D_D (106) may be conditioned on a very accurate facial flow of the target video.
  • since human facial performances exhibit non-rigid and composite deformations as a result of very complex facial muscular interactions, their flow may be captured by using target facial videos to train a specific network for this task.
  • optical flow estimation may be performed using a network pretrained on publicly available images. The pretrained models may be used, and the network may be fine-tuned on further datasets (e.g. the 4DFAB dataset) comprising dynamic high-resolution 4D videos of subjects eliciting spontaneous and posed facial behaviours.
  • the provided camera parameters of the acquisition device may be used, and the 3D scans (e.g. 3D morphable model) of around 750K frames may be rasterized so that the difference between each pair of consecutive frames represents the 2D flow.
  • 2D flow estimates of the same 750K frames may be generated using the original FlowNet2, and a masked End-Point-Error (EPE) loss may be used so that the background flow stays the same and the foreground follows the ground truth 2D flow coming from the 4DFAB dataset.
  • FIG. 12 is a flow diagram of an algorithm, indicated generally by the reference numeral 120, in accordance with an example embodiment.
  • the algorithm 120 starts at operation 122, where facial reconstruction and tracking is carried out (e.g. using the systems 20 and 30 described above) .
  • conditional images are generated based on the outputs of the operation 122.
  • video rendering initialization is performed for initializing a neural network (e.g. neural network 72) used for rendering output images (e.g. output video) .
  • the neural network (e.g. the video rendering network 72) is updated based, at least in part, on the conditioning images generated in the operation 124, and on at least one loss function generated by at least one discriminator.
  • facial reconstruction and tracking operation 122 may be performed similar to operation 52 described above.
  • a reliable estimation of the human facial geometry may be generated, for example, capturing the temporal dynamics while being separable in each frame into the identity and expression contributions of the videoed subject.
  • a sparse landmark-based method (e.g. extracting 68 landmarks from each frame) may be adopted, which method may capitalise on the rich temporal information in videos while performing the facial reconstruction.
  • a scaled orthographic projection (SOP) may be postulated, for example with the assumption that in each video the identity parameters are fixed (yet unknown) throughout the entire video, while the per-frame expression parameters as well as the camera parameters (e.g. imaging parameters, scale, 3D pose) are allowed to differ among frames.
  • an energy equation that consists of three terms as demonstrated in equation (2b) may be minimized.
  • the terms may comprise 1) a data term penalising the l2 norm error of the projected landmarks over all frames (E_l), 2) a shape regularisation term (E_pr) that reinforces a quadratic prior over the identity and per-frame expression parameters and 3) a temporal smoothness term (E_sm) that supports the smoothness of the expression parameters throughout the video, by employing a quadratic penalty on the second temporal derivatives of the expression vector. Equation 2b is provided below:
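  • Equation 2b is not reproduced in the extract; it takes the same three-term form as the reconstruction of equation (2) given earlier (the weighting factors are again assumptions):

```latex
% Equation (2b), sketched from the three terms described above.
E = E_l + \lambda_{pr}\, E_{pr} + \lambda_{sm}\, E_{sm}
```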
  • gross occlusions may be reduced by imposing box constraints on the identity and per-frame expression parameters.
  • the minimisation of the loss function may lead to a large-scale least squares problem with box constraints, which may be addressed by using the reflective Newton method.
  • training the first neural network may comprise determining an identity feature vector of the target subject based at least in part on the target face coordinate images and an identity coefficient of the target subject, wherein the generator module generates the plurality of output images based at least in part on the identity feature vector of the target subject.
  • the identity feature vector may be determined by concatenating the target face coordinate images and an output of the identity embedder.
  • FIG. 13 is a flow diagram of an algorithm, indicated generally by the reference numeral 130, in accordance with an example embodiment.
  • the algorithm 130 comprises generating the plurality of output images using the trained first neural network.
  • the algorithm 130 starts at operation 132 for initializing the generator module 92.
  • at operation 134, at least one discriminator module 94 is initialized.
  • the source expression parameters and the source gaze tracking images are mapped on the target identity parameters based on at least one determined identity feature vector of the target subject.
  • the identity feature vector is described in further detail below.
  • FIG. 14 is a block diagram of a system, indicated generally by the reference numeral 140, in accordance with an example embodiment.
  • System 140 shows a framework used during the initialization training stage.
  • the framework comprises an identity embedder 144 receiving images 141 as inputs, a generator module 145 (similar to the generator module 92) receiving NMFC images 142 and previously generated (fake) frames 143, and at least one discriminator module comprising the dynamics discriminator 149, the image discriminator 150, and the mouth discriminator 151.
  • the generator module 145 comprises encoding pipelines 146 and 147 and a decoding pipeline 148. Synthesis is conditioned both on NMFC images (142) and frames generated (143) in previous time steps.
  • the identity feature vector h_i computed by the identity embedder (identity embedding network) 144 is concatenated with the identity coefficients of person i, coming from the 3DMM, to form the final identity feature vector h*_i. This is then injected into the generator 145 through the adaptive instance normalization layers.
  • the neural network (e.g. video rendering network) initialization operation 126 may comprise a learning phase using a multi-person dataset with N identities.
  • the neural network is trained with a plurality of images of a plurality of source actors. Consider a video of the i-th target person in the multi-person dataset, with the background removed, together with the corresponding foreground masks extracted from the original video.
  • generative adversarial network G aims to learn a mapping from this conditioning representation to a photo-realistic and temporally coherent video as well as a prediction of the foreground mask
  • the learning may be performed in a self-reenactment setting, thus the generated video should be a reconstruction of the original target video, which serves as a ground truth.
  • NMFC sequence contains identity related information coming from the 3DMM, such as the head shape and geometry
  • the NMFC images may lack information regarding skin texture, hair, or upper body details.
  • the identity may thus be learnt by incorporating an identity embedding network E_id (144) into the neural network (rather than learning identity from the NMFC input).
  • the identity embedding network 144 learns to compute an identity feature vector which is passed to the generator to condition synthesis.
  • the system 140 further comprises image discriminator 150, a dynamics discriminator 149 and a dedicated mouth discriminator 151, which may ensure that the generated video looks realistic and temporally coherent.
  • the identity embedder may be used such that, given a target person i, M frames are randomly selected from the training video. Each frame may be passed to the embedder, such that the identity feature vector is computed according to equation 4b below:
  • Picking random frames from the training sequence i and averaging the embedding vectors may automatically render h_i independent of the person’s head pose appearing in the M random input images.
  • the identity coefficient vector s_i computed during 3D face reconstruction may be used by concatenating the two vectors to get the final identity feature vector h*_i for person i.
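  • A hedged reconstruction of equation 4b and of the final concatenation described in the two preceding bullets (the frame indices and the concatenation notation are assumptions):

```latex
% Equation (4b), sketched: the identity feature vector is the average of the
% embeddings of M randomly selected frames y^{i}_{t_m} of person i.
h_i = \frac{1}{M} \sum_{m=1}^{M} E_{id}\!\left(y^{i}_{t_m}\right)

% Final identity feature vector: concatenation with the 3DMM identity coefficients s_i.
h^{*}_i = \left[\, h_i \,;\, s_i \,\right]
```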
  • the generator of system 140 conditions on two separate information sources: a) X^i_{1:T}, the head pose and expression representation in NMFC images 142, and b) h*_i, the identity representation (retrieved from the identity embedder 144) of the target person.
  • a sequential generator may be employed such that frames and masks are produced one after the other.
  • the synthetic frame and the hallucinated foreground mask at t may be computed by the generator, using equation 5b below:
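  • A hedged reconstruction of equation 5b, following the conditioning described above (the hat notation for the generated frame and mask is illustrative):

```latex
% Equation (5b), sketched: the generator produces the synthetic frame and the
% hallucinated foreground mask at time t from the conditional NMFC/gaze input,
% the two previously generated frames, and the identity feature vector of person i.
\left(\hat{y}^{i}_t,\; \hat{m}^{i}_t\right) =
  G\!\left(X^{i}_{t-2:t},\; \hat{y}^{i}_{t-2:t-1},\; h^{*}_i\right)
```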
  • the generator 145 may comprise two identical encoding blocks 146 and 147, and one decoding block 148 (similar to generator module 92 and 110) .
  • the conditional input X^i_{t-2:t} may be provided to the encoding pipeline 146, and the previously generated frames may be provided to the encoding pipeline 147. Then, their resulting features are summed and passed to the decoding pipeline 148, which outputs the generated frame and mask.
  • the identity feature vector h*_i may be injected into the generator through its normalization layers. For example, a layer normalization method may be applied to all normalisation layers, both in the encoding and decoding blocks of generator G (145). More specifically, given a normalization layer in G with input x, the corresponding modulated output is calculated according to equation 6b below:
  • the matrices P_γ and P_β are the learnable parameters, which project the identity feature vector to the affine parameters of the normalization layer.
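  • A hedged reconstruction of equation 6b in the usual adaptive normalization form (μ and σ denote the per-channel statistics of the layer input; these symbols are assumptions):

```latex
% Equation (6b), sketched: the identity feature vector modulates each normalization
% layer; P_gamma and P_beta project h*_i to the per-layer scale and shift parameters.
\mathrm{out} = \left(P_{\gamma}\, h^{*}_i\right) \odot \frac{x - \mu(x)}{\sigma(x)}
             + P_{\beta}\, h^{*}_i
```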
  • the image discriminator 150 and mouth discriminator 151 may be used for increasing the photo-realism of generated frames, as they learn to distinguish real from fake images.
  • the convolutional part of the image discriminator 150 (D_I) receives the real frame 156 or the synthesized frame 155 along with the corresponding conditional input 157 (x^i_t), and computes a feature vector d.
  • the image discriminator 150 may further keep a learnable matrix W, with each one of its rows representing a different person in the dataset. Given the identity index i, the image discriminator 150 may choose the appropriate row w_i. Then, the realism score for person i may be calculated according to equation 7b below:
  • w_0 and c may be identity-independent learnable parameters of the image discriminator 150.
  • r may reflect whether or not the head in the input frame is real and belongs to identity i and at the same time corresponds to the NMFC conditional input x^i_t.
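  • A hedged reconstruction of equation 7b, in the spirit of a projection-style discriminator (the exact combination of w_i, w_0 and c is an assumption consistent with the description above):

```latex
% Equation (7b), sketched: realism score for person i, computed from the
% convolutional feature vector d, the person-specific row w_i of matrix W,
% and the identity-independent parameters w_0 and c.
r = d^{\top}\left(w_i + w_0\right) + c
```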
  • a mouth discriminator 151 may be used to further improve the visual quality of the mouth area, for example, for improved teeth synthesis.
  • This mouth discriminator 151 receives the cropped mouth regions 159 and 158 from the real frame 156 or the synthesized frame 155 respectively, and may compute the realism score of the mouth area.
  • the dynamics discriminator 149 learns to distinguish between realistic and non-realistic temporal dynamics in videos.
  • a sequence of K consecutive frames 153 is randomly drawn from the ground truth video and provided to the dynamics discriminator 149, and a sequence of K frames 152 is randomly drawn from the generated video and also provided to the dynamics discriminator 149.
  • the dynamics discriminator 149 may also receive and observe the optical flow 154 (V_{t:t+K-2}) extracted from the real frame sequence 153. Therefore, the realism score reflects whether or not the optical flow agrees with the motion in the short input video, forcing the generator to synthesize temporally coherent frames.
  • the objective of the GAN-based framework can be expressed as an adversarial loss.
  • the parameters of both the video generator and identity embedder are optimized under the adversarial loss with each loss term coming from the corresponding discriminator network (e.g. 150, 151, or 149 respectively) .
  • An embedding matching loss may be added to the objective of E_id, taking advantage of the identity representation W, learned by the image discriminator, for each identity in the dataset. More specifically, given the person i, the identity feature vector h_i computed by the identity embedder and the corresponding row w_i of matrix W, the cosine distance between the identity features may be calculated according to equation 8b below:
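  • A hedged reconstruction of equation 8b as a cosine distance between the two identity representations:

```latex
% Equation (8b), sketched: embedding matching loss between the embedder's
% identity vector h_i and the image discriminator's learned identity row w_i.
\mathcal{L}_{match} = 1 - \frac{h_i^{\top} w_i}{\left\| h_i \right\|_2 \left\| w_i \right\|_2}
```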
  • the total objective of the identity embedder becomes the sum of its adversarial loss and the embedding matching loss, which is minimized.
  • three more terms may be added in the loss function of the generator, such as a Visual Geometry Group (VGG) loss, a feature matching loss, and a mask reconstruction loss.
  • a pretrained VGG network may be used to compute the VGG loss L_vgg.
  • the feature matching loss L_feat is calculated by extracting features with the image and mouth discriminators and computing the l1 distance between the features extracted from the fake frame and the corresponding ground truth frame.
  • For the mask reconstruction loss we compute the simple l1 distance between the ground truth foreground mask and the foreground mask predicted by the generator.
  • the total objective of the generator G can be written as the sum of the adversarial loss, the VGG loss, the feature matching loss, and the mask reconstruction loss. Finally, all discriminators are optimised alongside E_id and G, under the corresponding adversarial objective functions.
  • the network updating stage at operation 128 comprises fine-tuning the networks (e.g. generator networks, and/or discriminator networks) of the framework in system 140 to a new unseen identity, based at least partially on the network parameters learned from the multiple person dataset, in the previous training stage.
  • a strong person specific generator may be obtained using a very small number of training samples, in a few epochs.
  • the facial reconstruction model described above may be used for computing the conditional input (142) to generator 145.
  • each RGB frame may be passed through the identity embedder 144 and the average identity feature may be calculated according to equation 9b below:
  • h_new is concatenated with the identity coefficients of the new target actor from the 3DMM, yielding h*_new.
  • the embedder E_id may not be required further in the fine-tuning of operation 128.
  • the operation 128 comprises further steps specific to the new target actor, including a further generator initialization stage (e.g. operation 132) , discriminators initialization stage (e.g. operation 134) , training stage, and synthesis stage (e.g. operation 136) for the new target actor.
  • a generator initialization vector is used to initialize the normalization layers of G.
  • Each normalization layer of G may be replaced with a simple instance normalization layer.
  • the identity projection matrices P_γ and P_β, learned during the initialization stage, are multiplied with h_new, and the resulting vectors are used as an initialization of the modulation parameters of the instance normalization layers.
  • Other parameters of G may be initialized from the values learned in the first multi-person training stage.
  • the matrix W that contains an identity representation vector for each person in the multi-person dataset may be replaced with a single vector w, which plays the role of row w_i and is initialized with the values of h_new.
  • the convolutional part of the image discriminator is initialized with the values learned from the previous training stage. Similar initialization may be performed for the mouth discriminator and the dynamics discriminator.
  • the framework is trained in an adversarial manner, with the generator aiming to learn the mapping from the NMFC sequence to the RGB video and foreground mask
  • the generator 145 can be used to perform source-to-target expression and pose transfer during test time. Given a sequence of frames from the source, the 3D facial reconstruction is performed first, and the NMFC images are computed for each time step, by adapting the reconstructed head shape to the identity parameters of the target person. These NMFC frames are fed to G, which synthesizes the desired photo-realistic frames one after the other.
  • Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic.
  • the software, application logic and/or hardware may reside in memory, or on any computer media.
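
As a concrete illustration of the mask reconstruction loss and the combined generator objective mentioned above, the following is a minimal sketch only; the weighting factor lambda_mask and the function names are illustrative assumptions rather than the patent's exact formulation, and the real objective may contain further loss terms.

```python
import torch
import torch.nn.functional as F

def mask_reconstruction_loss(pred_mask: torch.Tensor, gt_mask: torch.Tensor) -> torch.Tensor:
    """Simple l1 distance between the predicted and ground-truth foreground masks."""
    return F.l1_loss(pred_mask, gt_mask)

def generator_objective(adv_loss: torch.Tensor,
                        pred_mask: torch.Tensor,
                        gt_mask: torch.Tensor,
                        lambda_mask: float = 10.0) -> torch.Tensor:
    """Hypothetical total generator objective: an adversarial term plus a
    weighted mask reconstruction term (other terms omitted in this sketch)."""
    return adv_loss + lambda_mask * mask_reconstruction_loss(pred_mask, gt_mask)
```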
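The average identity feature h_new and its concatenation with the target actor's 3DMM identity coefficients can be sketched as below. Equation 9b itself is not reproduced here, so the averaging over frames and the tensor shapes are assumptions, and `embedder` stands in for the identity embedder E_id (144).

```python
import torch

def average_identity_feature(frames: torch.Tensor, embedder: torch.nn.Module) -> torch.Tensor:
    """Pass each RGB frame (N, 3, H, W) through the identity embedder and
    average the resulting identity vectors over the N frames."""
    with torch.no_grad():
        features = embedder(frames)   # assumed output shape: (N, D)
    return features.mean(dim=0)       # h_new, shape (D,)

def build_conditioning_vector(h_new: torch.Tensor, identity_coeffs_3dmm: torch.Tensor) -> torch.Tensor:
    """Concatenate h_new with the target actor's 3DMM identity coefficients,
    yielding the person-specific vector referred to above as h*_new."""
    return torch.cat([h_new, identity_coeffs_3dmm], dim=0)
```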
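The initialization of the instance normalization layers of G from the identity projections can be read as projecting h_new to per-channel scale and shift parameters. The sketch below assumes plain matrix projections and per-channel modulation; the class name and shapes are illustrative, not the patent's implementation.

```python
import torch
import torch.nn as nn

class PersonSpecificInstanceNorm(nn.Module):
    """Instance normalization whose scale/shift are learnable per-person
    parameters, initialised by projecting the identity vector with P_gamma and P_beta."""

    def __init__(self, num_channels: int, h_new: torch.Tensor,
                 P_gamma: torch.Tensor, P_beta: torch.Tensor):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Initial modulation parameters come from the identity projections
        # (P_gamma, P_beta assumed to have shape (num_channels, D), h_new shape (D,)).
        self.gamma = nn.Parameter((P_gamma @ h_new).view(1, num_channels, 1, 1))
        self.beta = nn.Parameter((P_beta @ h_new).view(1, num_channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize, then modulate with the person-specific scale and shift.
        return self.norm(x) * self.gamma + self.beta
```

During fine-tuning these parameters remain trainable, so they serve only as an identity-informed starting point rather than a fixed conditioning.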
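Finally, the test-time source-to-target transfer can be sketched as a simple per-frame loop; `reconstruct_3d_face`, `render_nmfc` and `generator` are placeholders for the 3D facial reconstruction, the NMFC rendering and the generator G (145), and their signatures are assumptions.

```python
import torch

def reenact(source_frames, target_identity_params, reconstruct_3d_face, render_nmfc, generator):
    """Source-to-target expression/pose transfer at test time (sketch).

    For every source frame: reconstruct expression and pose, swap in the target
    identity parameters, render the NMFC conditioning image, and let the
    generator synthesise the photo-realistic output frame."""
    outputs = []
    with torch.no_grad():
        for frame in source_frames:
            expression, pose, _ = reconstruct_3d_face(frame)                 # keep source expression/pose
            nmfc = render_nmfc(target_identity_params, expression, pose)     # adapt head shape to target identity
            outputs.append(generator(nmfc.unsqueeze(0)))                     # synthesise RGB frame (and mask)
    return outputs
```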

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Ophthalmology & Optometry (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A system, method and computer program are described, comprising: generating, on the basis of a plurality of source input images of a first source subject, a first plurality of sequential source face coordinate images and gaze-tracking images of the first source subject, the first plurality of source face coordinate images comprising source identity parameters and source expression parameters, the source expression parameters being represented as offsets from the source identity parameters; generating, on the basis of a plurality of target input images of a first target subject, a plurality of sequential target face coordinate images and gaze-tracking images of the first target subject, the target face coordinate images comprising target identity parameters and target expression parameters, the target expression parameters being represented as offsets from the target identity parameters; and generating, using a first neural network, a plurality of sequential output images, the plurality of sequential output images being based on a mapping of the source expression parameters and the source gaze-tracking images onto the target identity parameters.
PCT/CN2021/093530 2020-05-13 2021-05-13 Reconstitution faciale WO2021228183A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2007052.0 2020-05-13
GB2007052.0A GB2596777A (en) 2020-05-13 2020-05-13 Facial re-enactment

Publications (1)

Publication Number Publication Date
WO2021228183A1 true WO2021228183A1 (fr) 2021-11-18

Family

ID=71135033

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093530 WO2021228183A1 (fr) 2020-05-13 2021-05-13 Reconstitution faciale

Country Status (2)

Country Link
GB (1) GB2596777A (fr)
WO (1) WO2021228183A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373034A (zh) * 2022-01-10 2022-04-19 腾讯科技(深圳)有限公司 图像处理方法、装置、设备、存储介质及计算机程序

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184542A (zh) * 2020-07-17 2021-01-05 湖南大学 姿势导引的风格保持人体图像生成方法
CN113744129A (zh) * 2021-09-08 2021-12-03 深圳龙岗智能视听研究院 一种基于语义神经渲染的人脸图像生成方法及系统
CN115984949B (zh) * 2023-03-21 2023-07-04 威海职业学院(威海市技术学院) 一种带有注意力机制的低质量人脸图像识别方法及设备

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103268623A (zh) * 2013-06-18 2013-08-28 西安电子科技大学 一种基于频域分析的静态人脸表情合成方法
CN104008564A (zh) * 2014-06-17 2014-08-27 河北工业大学 一种人脸表情克隆方法
US20160004905A1 (en) * 2012-03-21 2016-01-07 Commonwealth Scientific And Industrial Research Organisation Method and system for facial expression transfer
CN105900144A (zh) * 2013-06-07 2016-08-24 费斯史福特股份公司 实时面部动画的在线建模
CN106327482A (zh) * 2016-08-10 2017-01-11 东方网力科技股份有限公司 一种基于大数据的面部表情的重建方法及装置
US9799096B1 (en) * 2014-07-08 2017-10-24 Carnegie Mellon University System and method for processing video to provide facial de-identification
US20180068178A1 (en) * 2016-09-05 2018-03-08 Max-Planck-Gesellschaft Zur Förderung D. Wissenschaften E.V. Real-time Expression Transfer for Facial Reenactment
CN108288072A (zh) * 2018-01-26 2018-07-17 深圳市唯特视科技有限公司 一种基于生成对抗网络的面部表情合成方法
CN109934767A (zh) * 2019-03-06 2019-06-25 中南大学 一种基于身份和表情特征转换的人脸表情转换方法


Also Published As

Publication number Publication date
GB2596777A (en) 2022-01-12
GB202007052D0 (en) 2020-06-24

Similar Documents

Publication Publication Date Title
Su et al. A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose
Zhao et al. Thin-plate spline motion model for image animation
Khakhulin et al. Realistic one-shot mesh-based head avatars
CN112889092B (zh) 有纹理的神经化身
WO2021228183A1 (fr) Reconstitution faciale
Noguchi et al. Unsupervised learning of efficient geometry-aware neural articulated representations
Ichim et al. Dynamic 3D avatar creation from hand-held video input
Corona et al. Lisa: Learning implicit shape and appearance of hands
Cao et al. 3D shape regression for real-time facial animation
US12067659B2 (en) Generating animated digital videos utilizing a character animation neural network informed by pose and motion embeddings
US11880935B2 (en) Multi-view neural human rendering
Su et al. Danbo: Disentangled articulated neural body representations via graph neural networks
RU2764144C1 (ru) Быстрый двухслойный нейросетевой синтез реалистичных изображений нейронного аватара по одному снимку
Crispell et al. Pix2face: Direct 3d face model estimation
Karpov et al. Exploring efficiency of vision transformers for self-supervised monocular depth estimation
US20230126829A1 (en) Point-based modeling of human clothing
Guo et al. HandNeRF: Neural radiance fields for animatable interacting hands
Zhi et al. Dual-space nerf: Learning animatable avatars and scene lighting in separate spaces
Venkat et al. HumanMeshNet: Polygonal mesh recovery of humans
Li et al. Spa: Sparse photorealistic animation using a single rgb-d camera
RU2713695C1 (ru) Текстурированные нейронные аватары
Huang et al. Object-occluded human shape and pose estimation with probabilistic latent consistency
Kabadayi et al. Gan-avatar: Controllable personalized gan-based human head avatar
CN117274446A (zh) 一种场景视频处理方法、装置、设备及存储介质
AU2022241513B2 (en) Transformer-based shape models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21803500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21803500

Country of ref document: EP

Kind code of ref document: A1