WO2020062120A1 - A method for generating facial animation from a single image - Google Patents

A method for generating facial animation from a single image

Info

Publication number
WO2020062120A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
face
feature points
area
facial
Prior art date
Application number
PCT/CN2018/108523
Other languages
English (en)
French (fr)
Inventor
周昆
耿佳豪
Original Assignee
浙江大学
杭州相芯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学, 杭州相芯科技有限公司 filed Critical 浙江大学
Priority to PCT/CN2018/108523 priority Critical patent/WO2020062120A1/zh
Priority to EP18935888.0A priority patent/EP3859681A4/en
Publication of WO2020062120A1 publication Critical patent/WO2020062120A1/zh
Priority to US17/214,931 priority patent/US11544887B2/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00: Animation
    • G06T13/80: 2D [Two Dimensional] animation, e.g. using sprites
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193: Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00: Animation
    • G06T13/20: 3D [Three Dimensional] animation
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T15/04: Texture mapping
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/70: Determining position or orientation of objects or cameras
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/59: Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G06V40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174: Facial expression recognition
    • G06V40/176: Dynamic expression
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30248: Vehicle exterior or interior
    • G06T2207/30268: Vehicle interior

Definitions

  • The present invention relates to the field of facial animation, and in particular to a method for editing the face region of a portrait image.
  • Some work is based on a video of the target person or a video of a driving person (Umar Mohammed, Simon JD Prince, and Jan Kautz. 2009. Visiolization: generating novel facial images. ACM Transactions on Graphics (TOG) 28, 3 (2009), 57.) (Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormahlen, Patrick Perez, and Christian Theobalt. 2014. Automatic face reenactment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4217–4224.). Relying on the face details in the target person's or driving person's video can alleviate the loss of detail to a certain extent, but such approaches also have some defects.
  • For example, Face2face (Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2387–2395.)
  • requires the target person's video to contain sufficient mouth-shape data.
  • For work that requires a driving video to enrich the face details (Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F Cohen. 2017. Bringing portraits to life. ACM Transactions on Graphics (TOG) 36, 6 (2017), 196.), the quality of the results degrades as the difference between the target person's and the driving person's images increases, and these approaches have no means of ensuring that the generated results conform to the distribution of real images.
  • An object of the present invention is to provide a method for generating facial animation from a single image in view of the shortcomings of the prior art.
  • The invention realizes the nonlinear geometric changes brought about by rigid and non-rigid motion through a global image deformation technique while ensuring continuity between the face region and the non-face region; it then uses a generative adversarial network to refine the face-region texture of the deformed image, and finally uses a generative adversarial network to fill the oral region, so that the final result preserves the target person's characteristics, matches the target feature point positions, maintains consistency between the face and non-face regions, and conforms to the distribution of real face images.
  • This method reaches the level of the most advanced portrait animation generation techniques and can run in real time, which gives it high practical value.
  • a method for generating a face animation from a single image includes the following steps:
  • Face feature point generation in the image: compute the feature points of the face and background regions in the image;
  • Oral region texture generation: synthesize the oral region texture with a generative adversarial network and generate the final face animation image.
  • step 1 includes the following sub-steps:
  • (1.1) Generation of face-region feature points: detect the two-dimensional facial feature points, identity coefficients, expression coefficients and rigid transformation coefficients of the target person's initial image; by transferring the driving person's expression coefficients and rigid transformation coefficients, generate the corresponding three-dimensional blendshape model and project it onto the two-dimensional plane to obtain the shifted facial feature points.
  • (1.2) Generation of background-region feature points: the non-face-region feature points in the driving video are detected, tracked and mapped into the target image through a rigid transformation, where s denotes the driving person, t denotes the target person, and the transformation is the rigid transformation matrix between the target person's initial facial feature points and the driving person's initial facial feature points.
  • Step 2 is specifically: compute the offset value of each feature point from the shifted target person feature points and the initial feature points. Using the face-region feature points and the background-region feature points as vertices, perform a triangulation, and interpolate the offset values of the vertices of each triangle to obtain an offset map. In addition, to eliminate the discontinuity of the offset values in the non-face region, the non-face region of the offset map is filtered with a Gaussian kernel. The radius of the Gaussian kernel increases with the distance from the face region, within the range [7, 32]. Finally, the pixels at the corresponding positions in the original image are transferred to the current image positions through the above offset map, so as to obtain the deformed image.
  • step 3 includes the following sub-steps:
  • step 4 includes the following sub-steps:
  • The beneficial effect of the present invention is that, for the first time, it proposes a method for generating facial animation from a single image that combines global deformation with generative adversarial networks.
  • The global deformation is used to realize the geometric changes caused by rigid and non-rigid motion and to ensure the continuity of the boundary between the face and non-face regions.
  • Two trained generative adversarial networks are used to refine the texture of the face region and to generate the texture of the oral region, so that the generated face conforms to the distribution of real face images.
  • This method reaches the level of current state-of-the-art facial image animation techniques and can run in real time.
  • the invention can be used in applications such as face image editing, portrait animation generation based on a single image, and facial expression editing in videos.
  • FIG. 1 is a diagram illustrating the results of each stage of editing a first target portrait image by applying the method of the present invention.
  • FIG. 2 is a diagram showing the results of each stage of editing the second target portrait image by applying the method of the present invention.
  • FIG. 3 is a diagram of generating results of each stage of editing a third target portrait image by applying the method of the present invention.
  • FIG. 4 is a diagram showing the results of each stage of editing the fourth target portrait image by applying the method of the present invention.
  • FIG. 5 is a diagram showing the results of each stage of editing the fifth target portrait image by applying the method of the present invention.
  • the core technology of the present invention uses global deformation technology to process geometric feature changes caused by rigid and non-rigid changes, and uses wg-GAN to optimize face area details (excluding oral cavity), and uses hrh-GAN to fill oral cavity area details.
  • the method is mainly divided into the following four main steps: the generation of portrait feature points, the global two-dimensional deformation of the image according to the change of the feature points, the optimization of the face area details (excluding the oral cavity area), and the generation of the oral cavity texture.
  • Figures 1-5 show the results of each stage of applying the method of the present invention to the five target person portrait pictures.
  • After the first arrow, the input image yields the global deformation result; after the second arrow, the result with refined face details (excluding the mouth) is obtained.
  • After the final arrow, the oral region is filled in and the final result is generated.
  • The present invention follows the algorithm of (Chen Cao, Qiming Hou, and Kun Zhou. 2014a. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics (TOG) 33, 4 (2014), 43.) to detect the initial two-dimensional facial feature points, identity coefficients, expression coefficients and rigid transformation coefficients of the target person's initial image.
  • By transferring the driving person's expression coefficients and rigid transformation coefficients, we can generate the corresponding three-dimensional blendshape model, project it onto the two-dimensional plane, and obtain the shifted facial feature points.
  • The non-face-region feature points of the driving video are detected, tracked and mapped into the target image through a rigid transformation, where s denotes the driving person, t denotes the target person, and the transformation is the rigid transformation matrix between the target person's initial facial feature points and the driving person's initial facial feature points.
  • Global two-dimensional image deformation: based on the initial feature points and the feature point changes specified by the user or the program, a global two-dimensional deformation is used to generate a deformed image that satisfies the feature point constraints.
  • The offset value of each feature point is computed from the shifted target person feature points and the initial feature points.
  • The feature points (non-face-region feature points and face-region feature points) are used as vertices for a triangulation.
  • The offset values of the vertices in each triangle are interpolated to obtain an offset map.
  • To eliminate discontinuities in the offset values of the non-face region, the non-face region in the offset map is filtered with Gaussian kernels.
  • The radius of the Gaussian kernel increases as the distance from the face region increases; we use Gaussian kernels of five different radii, in the range [7, 32].
  • Finally, the pixels at the corresponding positions in the original image are transferred to the current image positions through the above offset map, so as to obtain the deformed image; the effect can be seen in the results pointed to by the first arrow in FIG. 1 to FIG. 5.
  • Each video is sampled at intervals of 10 frames to obtain images I_i, and their facial feature points are detected to obtain P_i.
  • The neutral-expression image I_* is selected from {I_i | 0 < i < N}, and its corresponding feature points P_* are obtained.
  • The feature point offsets D_i are computed from P_* and P_i, and I_* is deformed by triangulating P_i and interpolating D_i to obtain the deformed image W_i corresponding to I_i.
  • The generator (optimizer) network structure is an encoder-decoder structure.
  • The network structure can be expressed as (C64,K7,S1,LReLU,Skip1)->(C128,K5,S2,LReLU,Skip2)->(C256,K3,S2,LReLU)->4*(RB256,K3,S1,LReLU)->(RC128,K3,R2,LReLU,Skip1)->(RC64,K3,R2,LReLU,Skip2)->(C3,K3,S1,Sigmoid), where C, RB and RC denote a convolution layer, a residual block and a resize-convolution layer respectively, and the number that follows denotes the depth of that layer's output; K denotes the kernel of the block, and the number that follows denotes the kernel size; the number after S denotes the stride of the convolution layer or residual block.
  • If the layer downsamples, it is S2; otherwise S1.
  • The number after R indicates the scaling factor of the resize-convolution layer, i.e. R2 when upsampling is required; Skip denotes a skip connection, and the number that follows is its index, with the same index indicating the same skip connection.
  • LReLU (Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30. 3.) and Sigmoid denote the activation functions used.
  • The discriminator network structure is an encoding structure that encodes the input into a feature vector through convolution layers and uses the output of a fully connected layer to measure the realism of the input. Its structure can be expressed as (C64,K7,S1,LReLU)->(C128,K5,S2,LReLU)->(C256,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(FC1), where FC denotes a fully connected layer, the number that follows indicates an output dimension of 1, and the fully connected layer has no activation function.
  • The optimizer is denoted by R(x_w, M), where x_w is the input deformed image and M is the offset map; the discriminator is denoted by D(x, M), where x is either the optimizer output R(x_w, M) or a real image x_g. The loss function used to train the network can be defined as: min_R max_D E[D(x_g, M) - D(R(x_w, M), M)] + α·L(R).
  • min_R means differentiating with respect to the parameters of the optimizer R so as to minimize the objective value.
  • max_D means differentiating with respect to the parameters of the discriminator D so as to maximize the objective value.
  • E denotes the expectation over each mini-batch; L(R) is the regularization term, the L1 loss between R(x_w, M) and x_g, used to constrain the optimizer's result.
  • α is a hyperparameter used to control the weight of L(R), which is equal to 0.004 in the present invention.
  • The deformed image and the initial face image are cropped to obtain their respective face-region images, and the two face-region images are aligned to obtain I_i and I_* together with their corresponding facial feature points P_i and P_*.
  • The feature point offsets D_i from I_* to I_i are obtained by computing the difference between P_i and P_*.
  • Oral region texture generation: generate the oral region texture with hrh-GAN.
  • Training data comes from MMI, MUG, CFD and Internet data.
  • the corresponding oral region mask maps are generated from the oral region feature points.
  • the face image and the corresponding oral area mask form the hrh-GAN training data.
  • The hrh-GAN network structure and training method in the present invention are based on the algorithm of (Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 107.).
  • A fully convolutional neural network is used as a generator to complete the image.
  • the combination of global discriminator and local discriminator guarantees the global and local rationality of the generated image.
  • the present invention uses a similar generator to generate the oral cavity area.
  • the global discriminator and local discriminator help the generator to generate reasonable details of the oral cavity area.
  • the loss function we use is the same as that of Iizuka.
  • Based on the facial feature points, the oral-region mask corresponding to the refined face image obtained in 3.2 is computed, and the face image is concatenated with the oral-region mask as the input of the hrh-GAN generator.
  • a face image is obtained after filling the oral cavity area.
  • the face image is aligned with the face position in the deformed image through translation and rotation, and combined with the non-face area in the deformed image to obtain the final target portrait image.
  • the inventor implemented an implementation example of the present invention on a machine equipped with an Intel Core i7-4790 central processor and an NVidia GTX1080Ti graphics processor (11GB).
  • the inventor used all the parameter values listed in the specific embodiment to obtain all the experimental results shown in FIG. 5.
  • The invention can effectively and naturally generate portrait animations from Internet portrait images according to a driving person.
  • For a 640*480 image, the entire processing pipeline takes about 55 milliseconds: feature point detection and tracking takes about 12 milliseconds; global image deformation takes about 12 milliseconds; refinement of the face-region texture takes about 11 milliseconds; oral cavity detail filling takes about 9 milliseconds; the remaining time is mainly spent on data transfer between the CPU and GPU. In addition, wg-GAN and hrh-GAN need to be trained for 12 hours and 20 hours respectively, and both only need to be trained once and can then be used for any target person's image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for generating facial animation from a single image. The method consists of four main steps: generation of facial feature points in the image, global two-dimensional deformation of the image, refinement of face-region details, and generation of the oral-region texture. The invention can generate facial animation in real time according to changes of facial feature points, with animation quality reaching the level of current state-of-the-art facial image animation techniques. The invention can be used in a range of applications, such as face image editing, portrait animation generation from a single image, and editing of facial expressions in video.

Description

A method for generating facial animation from a single image. Technical field
The present invention relates to the field of facial animation, and in particular to a method for editing the face region of a portrait image.
Background art
The field of face editing began with the work of Blanz and Vetter (Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 187–194.), which fits the parameters of a three-dimensional morphable model and its texture to a single image. This technique laid the foundation for later face editing work to produce more realistic results (Pia Breuer, Kwang-In Kim, Wolf Kienzle, Bernhard Scholkopf, and Volker Blanz. 2008. Automatic 3D face reconstruction from single images or video. In Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on. IEEE, 1–8.) (Marcel Piotraschke and Volker Blanz. 2016. Automated 3d face reconstruction from multiple images using quality measures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3418–3427.). These techniques usually lose details in the edited face because of the limited expressive power of the principal components.
Some work is based on a video of the target person or a video of a driving person (Umar Mohammed, Simon JD Prince, and Jan Kautz. 2009. Visiolization: generating novel facial images. ACM Transactions on Graphics (TOG) 28, 3 (2009), 57.) (Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormahlen, Patrick Perez, and Christian Theobalt. 2014. Automatic face reenactment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4217–4224.). Such approaches can use the face details in the target person's or driving person's video to alleviate the loss of detail to a certain extent, but they also have drawbacks. For example, Face2face (Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2387–2395.) requires the target person's video to contain sufficient mouth-shape data. For work that relies on a driving video to enrich the generated face details (Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F Cohen. 2017. Bringing portraits to life. ACM Transactions on Graphics (TOG) 36, 6 (2017), 196.), the quality of the results degrades as the difference between the target person's and the driving person's images increases. In addition, these approaches have no means of ensuring that the generated results conform to the distribution of real images.
Recently, the development of generative adversarial networks has provided new ideas for this field, for example using geometric information to guide network training and adopting adversarial training so that the generated results conform both to the geometric information and to the distribution of real face images (Fengchun Qiao, Naiming Yao, Zirui Jiao, Zhihao Li, Hui Chen, and Hongan Wang. 2018. Geometry-Contrastive Generative Adversarial Network for Facial Expression Synthesis. arXiv preprint arXiv:1802.01822 (2018).) (Lingxiao Song, Zhihe Lu, Ran He, Zhenan Sun, and Tieniu Tan. 2017. Geometry Guided Adversarial Facial Expression Synthesis. arXiv preprint arXiv:1712.03474 (2017).). However, these methods can generally only handle the cropped face region and cannot process the non-face region, and the quality of the results degrades as the difference between the target geometric information and the geometric information of the original image increases.
Summary of the invention
The object of the present invention is to provide a method for generating facial animation from a single image in view of the shortcomings of the prior art. The invention realizes the nonlinear geometric changes brought about by rigid and non-rigid motion through a global image deformation technique while ensuring the continuity between the face region and the non-face region; it then uses a generative adversarial network to refine the face-region texture of the deformed image, and finally uses a generative adversarial network to fill the oral region, thereby obtaining a final result that preserves the target person's characteristics, matches the target feature point positions, maintains consistency between the face and non-face regions, and conforms to the distribution of real face images. The method reaches the level of state-of-the-art portrait animation generation techniques and can run in real time, so it has high practical value.
The object of the present invention is achieved by the following technical solution: a method for generating facial animation from a single image, comprising the following steps:
(1) Generation of facial feature points in the image: compute the feature points of the face and background regions in the image;
(2) Global two-dimensional deformation of the image: based on the initial feature points obtained in step 1 and the feature point changes specified by the user or the program, generate a deformed image that satisfies the feature point constraints through a global two-dimensional deformation;
(3) Refinement of face-region details: refine the texture of the face region in the deformed image with a generative adversarial network, the face region excluding the oral region;
(4) Generation of the oral-region texture: synthesize the oral-region texture with a generative adversarial network and generate the final facial animation image.
Further, step 1 comprises the following sub-steps:
(1.1) Generation of face-region feature points: detect the two-dimensional facial feature points, identity coefficients, expression coefficients and rigid transformation coefficients of the target person's initial image; by transferring the driving person's expression coefficients and rigid transformation coefficients, generate the corresponding three-dimensional blendshape model and project it onto the two-dimensional plane to obtain the shifted facial feature points.
(1.2) Generation of background-region feature points: detect and track the non-face-region feature points in the driving video and map them into the target image by
q_t^i = M_ts · q_s^i
where s denotes the driving person, t denotes the target person, q_t^i are the shifted non-face-region feature points of the target person, q_s^i are the feature points corresponding to the current frame i of the driving person, and M_ts is the rigid transformation matrix between the target person's initial facial feature points and the driving person's initial facial feature points. Through the above formula, the non-face-region feature points of the target image can be obtained.
Further, step 2 is specifically: compute the offset value of each feature point from the shifted target person feature points and the initial feature points. Using the face-region feature points and the background-region feature points as vertices, perform a triangulation, and interpolate the offset values of the vertices of each triangle to obtain an offset map. In addition, to eliminate the discontinuity of the offset values in the non-face region, the non-face region of the offset map is filtered with a Gaussian kernel whose radius increases with the distance from the face region, within the range [7, 32]. Finally, through the above offset map, the pixels at the corresponding positions in the original image are transferred to the current image positions, thereby obtaining the deformed image.
Further, step 3 comprises the following sub-steps:
(3.1) Build and train the generator and discriminator of a warp-guided generative adversarial network (wg-GAN);
(3.2) Crop and align the face regions of the deformed image and of the initial image, generate an offset map from their (normalized) feature point offsets, and feed the face region of the deformed image together with the offset map into the wg-GAN optimizer to obtain a refined face image that does not include the oral region.
Further, step 4 comprises the following sub-steps:
(4.1) Build and train the generator and discriminator of a generative adversarial network suitable for synthesizing the texture inside the oral cavity (hrh-GAN).
(4.2) Generate the oral-region mask corresponding to the face image obtained in step 3.2 from the feature points, and feed the face image together with the mask into the hrh-GAN generator to obtain the complete face image with the oral texture filled in.
The beneficial effect of the present invention is that it proposes, for the first time, a method for generating facial animation from a single image that combines global deformation with generative adversarial networks. The global deformation realizes the geometric changes brought about by rigid and non-rigid motion and ensures the continuity of the boundary between the face and non-face regions; in addition, two trained generative adversarial networks are used to refine the face-region texture and to generate the oral-region texture, so that the generated face conforms to the distribution of real face images. The method reaches the level of current state-of-the-art facial image animation techniques and can run in real time. The invention can be used in applications such as face image editing, portrait animation generation from a single image, and editing of facial expressions in video.
Brief description of the drawings
FIG. 1 shows the results generated at each stage of editing the first target portrait image with the method of the present invention.
FIG. 2 shows the results generated at each stage of editing the second target portrait image with the method of the present invention.
FIG. 3 shows the results generated at each stage of editing the third target portrait image with the method of the present invention.
FIG. 4 shows the results generated at each stage of editing the fourth target portrait image with the method of the present invention.
FIG. 5 shows the results generated at each stage of editing the fifth target portrait image with the method of the present invention.
Detailed description of the embodiments
The core technique of the present invention uses global deformation to handle the geometric changes brought about by rigid and non-rigid motion, uses wg-GAN to refine the face-region details (excluding the oral cavity), and uses hrh-GAN to fill in the oral-region details. The method is divided into the following four main steps: generation of portrait feature points, global two-dimensional deformation of the image according to the feature point changes, refinement of face-region details (excluding the oral region), and generation of the oral-region texture.
The steps of the present invention are described in detail below with reference to FIGS. 1-5, which show the results generated at each stage of editing five target portrait images with the method of the present invention. After the first arrow, the input image yields the global deformation result; after the second arrow, the result with refined face details (excluding the mouth) is obtained; after the last arrow, the oral region is filled in and the final result is generated.
1. Portrait feature point generation: use a feature point detection algorithm to obtain the feature points of the face and background regions in the image.
1.1 Generation of face-region feature points
The present invention follows the algorithm of (Chen Cao, Qiming Hou, and Kun Zhou. 2014a. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics (TOG) 33, 4 (2014), 43.) to detect the two-dimensional facial feature points, identity coefficients, expression coefficients and rigid transformation coefficients of the target person's initial image. By transferring the driving person's expression coefficients and rigid transformation coefficients, we can generate the corresponding three-dimensional blendshape model, project it onto the two-dimensional plane, and obtain the shifted facial feature points.
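As a concrete illustration of this sub-step, the sketch below evaluates a linear blendshape model under transferred expression and rigid-motion coefficients and projects the landmark vertices to 2D. The function and argument names and the weak-perspective (scaled orthographic) projection are assumptions made for illustration; the patent itself obtains these quantities via the cited regression algorithm.

```python
import numpy as np

def shifted_landmarks(B0, B, exp_coeffs, R, t, scale, landmark_idx):
    """Minimal sketch of step 1.1: pose a 3D blendshape model with the
    driving person's expression coefficients and rigid transform, then
    project the landmark vertices to the 2D plane.

    B0: (V, 3) neutral mesh of the target person (from the identity fit).
    B:  (K, V, 3) expression blendshape offsets.
    exp_coeffs: (K,) expression coefficients transferred from the driver.
    R, t, scale: 3x3 rotation, 3-vector translation and scalar scale
                 transferred from the driver (weak-perspective projection
                 is an assumption here, not stated in the patent).
    landmark_idx: mesh-vertex indices matching the 2D feature points.
    """
    mesh = B0 + np.tensordot(exp_coeffs, B, axes=1)   # (V, 3) posed mesh
    posed = scale * (mesh @ R.T) + t                  # apply rigid motion
    return posed[landmark_idx, :2]                    # drop depth -> 2D points
```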
1.2 Generation of background-region feature points
The generation of non-face-region feature points in the present invention follows the algorithm of (Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F Cohen. 2017. Bringing portraits to life. ACM Transactions on Graphics (TOG) 36, 6 (2017), 196.). Because there is no robust correspondence between the driving person's image and the target person's image in the non-face region, this method detects and tracks the non-face-region feature points in the driving video and maps them into the target image by
q_t^i = M_ts · q_s^i
where s denotes the driving person, t denotes the target person, q_t^i are the shifted non-face-region feature points of the target person, q_s^i are the feature points corresponding to the current frame i of the driving person, and M_ts is the rigid transformation matrix between the target person's initial facial feature points and the driving person's initial facial feature points. Through the above formula, we can obtain the non-face-region feature points of the target image.
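One common way to obtain such a rigid transform from the two sets of initial facial feature points is a least-squares Procrustes (Kabsch) fit; the sketch below, written under that assumption, estimates the transform and applies it to the driver's non-face points. The estimation method and the function names are illustrative choices, not taken from the patent.

```python
import numpy as np

def rigid_transform_2d(src, dst):
    """Least-squares rigid transform (rotation + translation) mapping the
    2D point set `src` onto `dst` via the Kabsch/Procrustes solution.
    Both arrays are (N, 2) and assumed to be in correspondence."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    A, B = src - src_c, dst - dst_c
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - src_c @ R.T
    return R, t

def transfer_background_points(driver_pts_i, driver_face_init, target_face_init):
    """Map the driving person's non-face feature points of frame i into the
    target image with the rigid transform estimated between the two
    persons' *initial* facial feature points (sketch of the formula in 1.2)."""
    R, t = rigid_transform_2d(driver_face_init, target_face_init)
    return driver_pts_i @ R.T + t
```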
2. Global two-dimensional image deformation: based on the initial feature points and the feature point changes specified by the user or the program, generate a deformed image that satisfies the feature point constraints through a global two-dimensional deformation.
2.1 Deformation
The offset value of each feature point is computed from the shifted target person feature points and the initial feature points. Using the feature points (non-face-region feature points and face-region feature points) as vertices, a triangulation is performed, and the offset values of the vertices of each triangle are interpolated to obtain an offset map. In addition, to eliminate the discontinuity of the offset values in the non-face region, the non-face region of the offset map is filtered with Gaussian kernels whose radius increases with the distance from the face region; we use Gaussian kernels of five different radii in the range [7, 32]. Finally, through the above offset map, the pixels at the corresponding positions in the original image are transferred to the current image positions, thereby obtaining the deformed image; the effect can be seen in the results pointed to by the first arrow in FIGS. 1 to 5.
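A minimal sketch of this step is shown below. It uses SciPy's Delaunay-based linear interpolation for the per-triangle offset interpolation and OpenCV's remap for the backward warp; the single-sigma Gaussian smoothing of the non-face region is a simplification of the patent's distance-dependent kernels, and the function and argument names are assumptions.

```python
import cv2
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def global_warp(image, pts_init, pts_shifted, face_mask, sigma=8.0):
    """Sketch of step 2: interpolate the sparse feature-point offsets over
    the image, smooth them outside the face, and backward-warp the image.

    image: H x W x 3 source image.
    pts_init, pts_shifted: (N, 2) initial and shifted feature points (x, y).
    face_mask: H x W boolean mask of the face region.
    """
    h, w = image.shape[:2]
    offsets = pts_shifted - pts_init                     # per-point displacement
    # Piecewise-linear interpolation over the Delaunay triangulation of the points.
    interp = LinearNDInterpolator(pts_init, offsets, fill_value=0.0)
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    dense = interp(np.stack([xs.ravel(), ys.ravel()], axis=1)).reshape(h, w, 2)
    # Smooth the offset map outside the face to hide discontinuities.
    smoothed = cv2.GaussianBlur(dense.astype(np.float32), (0, 0), sigma)
    dense = np.where(face_mask[..., None], dense, smoothed)
    # Backward warp: each output pixel looks up the source pixel it came from
    # (an approximation that treats the offset as locally constant).
    map_x = (xs - dense[..., 0]).astype(np.float32)
    map_y = (ys - dense[..., 1]).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```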
3. Refinement of face-region details (excluding the oral region): refine the texture of the face region (excluding the oral region) in the deformed image with wg-GAN.
3.1 Training wg-GAN
Training data. The public datasets MMI (Maja Pantic, Michel Valstar, Ron Rademaker, and Ludo Maat. 2005. Web-based database for facial expression analysis. In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on. IEEE, 5–pp.), MUG (Niki Aifanti, Christos Papachristou, and Anastasios Delopoulos. 2010. The MUG facial expression database. In Image analysis for multimedia interactive services (WIAMIS), 2010 11th international workshop on. IEEE, 1–4.) and CFD (Debbie S Ma, Joshua Correll, and Bernd Wittenbrink. 2015. The Chicago face database: A free stimulus set of faces and norming data. Behavior research methods 47, 4 (2015), 1122–1135.) are used as data sources. Each video is sampled at intervals of 10 frames to obtain images I_i, and their facial feature points are detected to obtain P_i. A neutral-expression image I_* is selected from {I_i | 0 < i < N}, where N is a natural number, and its corresponding feature points P_* are obtained. The feature point offsets D_i are computed from P_* and P_i, and I_* is deformed by triangulating P_i and interpolating D_i to obtain the deformed image W_i corresponding to I_i. In addition, the standard deviation of the feature point offsets of each facial part is computed over all training data, and D_i is normalized per part with these standard deviations to obtain the normalized offsets D̂_i, from which the offset map M_i is generated; a training sample finally consists of (W_i, M_i, I_i). We also use flipping and cropping for data augmentation.
Network structure. The generator (optimizer) network is an encoder-decoder structure. To avoid the network compressing too much information during encoding, we only downsample the input image to one quarter of its original size, i.e. downsample twice, pass the downsampled feature maps through 4 residual blocks (Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.), and finally output an image of the original size through resize convolutions (Jon Gauthier. 2014. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 5 (2014), 2.). In addition, the network adds skip connections (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE conference on computer vision and pattern recognition (2017).) between the corresponding downsampling and upsampling layers to preserve the correctness of the image structure, i.e. there are two skip connections. The network structure can therefore be expressed as (C64,K7,S1,LReLU,Skip1)->(C128,K5,S2,LReLU,Skip2)->(C256,K3,S2,LReLU)->4*(RB256,K3,S1,LReLU)->(RC128,K3,R2,LReLU,Skip1)->(RC64,K3,R2,LReLU,Skip2)->(C3,K3,S1,Sigmoid), where C, RB and RC denote a convolution layer, a residual block and a resize-convolution layer respectively, and the number that follows denotes the depth of that layer's output; K denotes the kernel of the block and the number that follows denotes the kernel size; the number after S denotes the stride of the convolution layer or residual block: if the layer downsamples it is S2, otherwise S1; the number after R denotes the scaling factor of the resize-convolution layer, i.e. R2 when upsampling is required; Skip denotes a skip connection and the number that follows is its index, the same index indicating the same skip connection; LReLU (Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30. 3.) and Sigmoid denote the activation functions used. The discriminator network is an encoding structure that encodes the input into a feature vector through convolution layers and uses the output of a fully connected layer to measure the realism of the input; its structure can be expressed as (C64,K7,S1,LReLU)->(C128,K5,S2,LReLU)->(C256,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(FC1), where FC denotes a fully connected layer, the number that follows indicates an output dimension of 1, and the fully connected layer has no activation function.
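To make the layer listing concrete, here is one possible PyTorch reading of the optimizer architecture. Only the layer widths, kernel sizes and strides come from the description above; the input depth (a 3-channel warped face concatenated with a 2-channel offset map), the pairing of each skip with the decoder feature of matching resolution and width, the additive merging of skip features, the nearest-neighbour resize before each resize convolution and the LReLU slope of 0.2 are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Residual block (RB256, K3, S1, LReLU) used in the optimizer sketch."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, 1, 1)
        self.conv2 = nn.Conv2d(ch, ch, 3, 1, 1)

    def forward(self, x):
        h = F.leaky_relu(self.conv1(x), 0.2)
        return x + self.conv2(h)

class WgGanOptimizer(nn.Module):
    """Sketch of (C64,K7,S1)->(C128,K5,S2)->(C256,K3,S2)->4x(RB256,K3,S1)
    ->(RC128,K3,R2)->(RC64,K3,R2)->(C3,K3,S1,Sigmoid)."""
    def __init__(self, in_ch=5):
        super().__init__()
        self.enc1 = nn.Conv2d(in_ch, 64, 7, 1, 3)   # C64,K7,S1  -> skip @ full res
        self.enc2 = nn.Conv2d(64, 128, 5, 2, 2)     # C128,K5,S2 -> skip @ 1/2 res
        self.enc3 = nn.Conv2d(128, 256, 3, 2, 1)    # C256,K3,S2
        self.res = nn.Sequential(*[ResBlock(256) for _ in range(4)])
        self.dec1 = nn.Conv2d(256, 128, 3, 1, 1)    # RC128: resize x2 + conv
        self.dec2 = nn.Conv2d(128, 64, 3, 1, 1)     # RC64:  resize x2 + conv
        self.out = nn.Conv2d(64, 3, 3, 1, 1)        # C3,K3,S1,Sigmoid

    def forward(self, warped_face, offset_map):
        x = torch.cat([warped_face, offset_map], dim=1)
        s1 = F.leaky_relu(self.enc1(x), 0.2)        # full resolution, 64 ch
        s2 = F.leaky_relu(self.enc2(s1), 0.2)       # 1/2 resolution, 128 ch
        h = F.leaky_relu(self.enc3(s2), 0.2)        # 1/4 resolution, 256 ch
        h = self.res(h)
        h = F.interpolate(h, scale_factor=2, mode='nearest')
        h = F.leaky_relu(self.dec1(h), 0.2) + s2    # merge 1/2-res skip
        h = F.interpolate(h, scale_factor=2, mode='nearest')
        h = F.leaky_relu(self.dec2(h), 0.2) + s1    # merge full-res skip
        return torch.sigmoid(self.out(h))
```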
Loss function. The optimizer is denoted by the function R(x_w, M), where x_w is the input deformed image and M is the offset map. The discriminator is denoted by D(x, M), where x is either the optimizer output R(x_w, M) or a real image x_g. The loss function used to train the network can be defined as
min_R max_D E[ D(x_g, M) - D(R(x_w, M), M) ] + α·L(R)
where min_R means differentiating with respect to the parameters of the optimizer R so as to minimize the objective value, max_D means differentiating with respect to the parameters of the discriminator D so as to maximize the objective value, and E denotes the expectation over each mini-batch. L(R) is a regularization term, the L1 loss between R(x_w, M) and x_g, used to constrain the optimizer's result, specifically
L(R) = || R(x_w, M) - x_g ||_1
α is a hyperparameter used to control the weight of L(R), equal to 0.004 in the present invention. The term E[D(x_g, M) - D(R(x_w, M), M)] is the adversarial loss; we adopt the loss used in WGAN (Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875 (2017).). During training, to improve the effect of adversarial training, when training the discriminator we follow (Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. 2017. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 3. 6.) and use the results generated by the optimizer at the current iteration combined with the optimizer's historical results as the discriminator input.
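A hedged sketch of one training step for this objective is given below: a WGAN critic loss plus the L1 regularizer with α = 0.004. The weight clipping is the constraint used in the original WGAN paper and is an assumption here, and the history buffer of past optimizer outputs described above is omitted for brevity; R and D are assumed to be modules with the call signatures used in the text.

```python
import torch

def wg_gan_step(R, D, opt_R, opt_D, x_w, M, x_g, alpha=0.004, clip=0.01):
    """One training iteration for the wg-GAN objective (sketch)."""
    # --- discriminator (critic) update: maximise D(real) - D(fake) ---
    with torch.no_grad():
        fake = R(x_w, M)
    d_loss = -(D(x_g, M).mean() - D(fake, M).mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    for p in D.parameters():                  # WGAN weight clipping (assumed)
        p.data.clamp_(-clip, clip)

    # --- optimizer (generator) update: minimise -D(fake) + alpha * L1 ---
    fake = R(x_w, M)
    g_loss = -D(fake, M).mean() + alpha * torch.mean(torch.abs(fake - x_g))
    opt_R.zero_grad(); g_loss.backward(); opt_R.step()
    return d_loss.item(), g_loss.item()
```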
3.2 Refining the face-region details
Based on the facial feature points, the deformed image and the initial face image are cropped to obtain their respective face-region images, which are then aligned to obtain I_i and I_* together with their corresponding facial feature points P_i and P_*. The difference between P_i and P_* gives the feature point offsets D_i from I_* to I_i. While implementing the present invention we found that, both during training and at run time, if the offset map is generated directly from the raw D_i, the offsets of feature points such as the eyebrows are ignored by the network, because the offset range of the eyebrows, nose, eyes and similar parts is much smaller than that of the mouth feature points, yet these parts often produce obvious texture changes under small geometric changes. Therefore, both in training and at run time, D_i must be normalized per part. The normalization is as follows: the standard deviation of the offset values is computed per part over the entire training set, and the corresponding parts of D_i are normalized with these standard deviations to obtain D̂_i, which is turned into the offset map M_i through a triangulation with the feature points as vertices and interpolation. I_i is concatenated with M_i to form the network input. After passing through the network, the refined face image is obtained; the effect can be seen in the results after the second arrow in FIGS. 1 to 5.
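The per-part normalization can be sketched as follows. The landmark grouping (a hypothetical 68-point layout) and the small epsilon guard are illustrative assumptions, since the patent does not specify the landmark topology.

```python
import numpy as np

# Hypothetical grouping of landmarks into facial parts; the real indices
# depend on the landmark layout used by the detector.
PARTS = {"contour": range(0, 17), "brows": range(17, 27), "nose": range(27, 36),
         "eyes": range(36, 48), "mouth": range(48, 68)}

def part_stds(all_offsets):
    """Per-part standard deviation of landmark offsets over the training set.
    all_offsets: (num_samples, num_landmarks, 2)."""
    return {name: all_offsets[:, list(idx), :].std() + 1e-8
            for name, idx in PARTS.items()}

def normalize_offsets(D_i, stds):
    """Divide each part of one offset field D_i (num_landmarks, 2) by its
    training-set standard deviation, so that small but perceptually important
    motions (e.g. eyebrows) are not drowned out by large mouth motions."""
    D_hat = D_i.copy()
    for name, idx in PARTS.items():
        D_hat[list(idx), :] /= stds[name]
    return D_hat
```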
4. Generation of the oral-region texture: generate the oral-region texture with hrh-GAN.
4.1 Training hrh-GAN
Training data. The data come from MMI, MUG, CFD and Internet data. Face images are collected, their facial feature points are detected, and the corresponding oral-region masks are generated from the oral-region feature points. The face images and the corresponding oral-region masks form the hrh-GAN training data. In addition, as in the training of wg-GAN, we also use flipping and cropping for data augmentation.
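A minimal sketch of how such a mask and the paired generator input could be built from the feature points is shown below; the exact landmark subset used for the oral region is not specified by the patent and is assumed here.

```python
import cv2
import numpy as np

def oral_region_mask(image_shape, mouth_landmarks):
    """Binary oral-region mask built from the (inner) mouth feature points.
    image_shape: (H, W); mouth_landmarks: (K, 2) (x, y) points."""
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(mouth_landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    return mask

def hrh_gan_input(face_image, mask):
    """Concatenate the face image with its oral-region mask along the channel
    axis, matching the 'face image + mask' pairing described for training."""
    return np.concatenate([face_image, mask[..., None]], axis=-1)
```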
Training method. The hrh-GAN network structure and training method in the present invention are based on the algorithm of (Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 107.). In that algorithm, a fully convolutional neural network serves as the generator for image completion, and the combination of a global discriminator and a local discriminator guarantees the global and local plausibility of the generated image. The present invention uses a similar generator to generate the oral region, with the global discriminator and local discriminator helping the generator produce plausible oral-region details; the loss function we use is the same as Iizuka's. During our experiments we found that training in this way yields satisfactory results at low resolution, but on high-resolution datasets the tooth-region details obtained by direct training are unnatural, so we adopt a training strategy that progressively increases the resolution (Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).); the input-size ratio of our global discriminator to our local discriminator is 8:3. In the initial training stage at a resolution of 128*128, we use a network structure similar to Iizuka's at 128 resolution, but the first layer of the generator is changed from (C64,K5,S1) to (C64,K1,S1)->(C64,K3,S1), and the last two layers are changed from (C32,K3,S1)->(C3,K3,S1) to (C3,K1,S1); the first layer of the global discriminator is changed from (C64,K5,S2) to (C32,K1,S1)->(C64,K5,S2); the first layer of the local discriminator is changed in the same way, and its last convolution layer is removed. In the second stage, the first convolution layer of the generator from the first stage is changed from (C64,K1,S1) to three convolution layers (C16,K1,S1)->(C32,K5,S1)->(C64,K3,S2); the final output layer (C3,K1,S1) is changed to (DC32,K4,S2)->(C16,K3,S1)->(C3,K1,S1), where DC denotes a deconvolution and the number that follows denotes the output depth; the first layer of the global discriminator is changed from (C32,K1,S1) to (C16,K1,S1)->(C32,K5,S2); the first layer of the local discriminator is changed in the same way as the global discriminator. The intermediate layers of the network remain the same as in the first stage, and their parameter values are inherited from the first stage. Training in this way yields natural high-resolution oral-region texture; the results can be seen after the third arrow in FIGS. 1 to 5.
4.2 Generating the oral-region texture
Based on the facial feature points, the oral-region mask corresponding to the refined face image obtained in 3.2 is computed, and this face image is concatenated with the oral-region mask as the input of the hrh-GAN generator, thereby obtaining the face image with the oral region filled in. Finally, the face image is aligned, through translation and rotation, with the face position in the deformed image and combined with the non-face region of the deformed image to obtain the final target portrait image.
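The final compositing can be sketched as follows. The use of a 2x3 affine matrix (e.g. one estimated from the two landmark sets with cv2.estimateAffinePartial2D) and the hard mask blend are assumptions for illustration; the patent only states that the face is aligned by translation and rotation and combined with the non-face region of the deformed image.

```python
import cv2
import numpy as np

def composite_into_portrait(deformed_image, filled_face, face_mask, M_align):
    """Align the filled face crop back onto the face position of the deformed
    image and keep the deformed image everywhere outside the face region.
    M_align: 2x3 rigid/affine matrix from the face crop to the deformed image.
    face_mask: uint8 mask of the face region in the face crop."""
    h, w = deformed_image.shape[:2]
    warped_face = cv2.warpAffine(filled_face, M_align, (w, h))
    warped_mask = cv2.warpAffine(face_mask, M_align, (w, h))
    keep_face = (warped_mask > 0)[..., None]
    return np.where(keep_face, warped_face, deformed_image)
```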
Implementation example
The inventors implemented an example of the present invention on a machine equipped with an Intel Core i7-4790 CPU and an NVidia GTX1080Ti GPU (11GB). The inventors used all the parameter values listed in the detailed description and obtained all the experimental results shown in FIG. 5. The invention can effectively and naturally generate portrait animations for Internet portrait images according to a driving person. For a 640*480 image, the entire processing pipeline takes about 55 milliseconds: feature point detection and tracking takes about 12 milliseconds; global image deformation takes about 12 milliseconds; refinement of the face-region texture takes about 11 milliseconds; filling of the oral-region details takes about 9 milliseconds; the remaining time is mainly spent on data transfer between the CPU and the GPU. In addition, wg-GAN and hrh-GAN need to be trained for 12 hours and 20 hours respectively, and both only need to be trained once and can then be used for any target person's image.

Claims (5)

  1. A method for generating facial animation from a single image, characterized in that it comprises the following steps:
    (1) Generation of facial feature points in the image: compute the feature points of the face and background regions in the image.
    (2) Global two-dimensional deformation of the image: based on the initial feature points obtained in step 1 and the feature point changes specified by the user or the program, generate a deformed image that satisfies the feature point constraints through a global two-dimensional deformation.
    (3) Refinement of face-region details: refine the texture of the face region in the deformed image with a generative adversarial network, the face region excluding the oral region.
    (4) Generation of the oral-region texture: synthesize the oral-region texture with a generative adversarial network and generate the final facial animation image.
  2. The method for generating facial animation from a single image according to claim 1, characterized in that step 1 comprises the following sub-steps:
    (1.1) Generation of face-region feature points: detect the two-dimensional facial feature points, identity coefficients, expression coefficients and rigid transformation coefficients of the target person's initial image; by transferring the driving person's expression coefficients and rigid transformation coefficients, generate the corresponding three-dimensional blendshape model and project it onto the two-dimensional plane to obtain the shifted facial feature points.
    (1.2) Generation of background-region feature points: detect and track the non-face-region feature points in the driving video and map them into the target image by q_t^i = M_ts · q_s^i, where s denotes the driving person, t denotes the target person, q_t^i are the shifted non-face-region feature points of the target person, q_s^i are the feature points corresponding to the current frame i of the driving person, and M_ts is the rigid transformation matrix between the target person's initial facial feature points and the driving person's initial facial feature points. Through the above formula, the non-face-region feature points of the target image can be obtained.
  3. The method for generating facial animation from a single image according to claim 2, characterized in that step 2 is specifically: compute the offset value of each feature point from the shifted target person feature points and the initial feature points; using the face-region feature points and the background-region feature points as vertices, perform a triangulation, and interpolate the offset values of the vertices of each triangle to obtain an offset map; in addition, to eliminate the discontinuity of the offset values in the non-face region, the non-face region of the offset map is filtered with a Gaussian kernel whose radius increases with the distance from the face region, within the range [7, 32]; finally, through the above offset map, the pixels at the corresponding positions in the original image are transferred to the current image positions, thereby obtaining the deformed image.
  4. The method for generating facial animation from a single image according to claim 3, characterized in that step 3 comprises the following sub-steps:
    (3.1) Train the warp-guided generative adversarial network (wg-GAN), specifically as follows:
    (3.1.1) Training data: each video is sampled at intervals of 10 frames to obtain images I_i, and their facial feature points P_i are detected. A neutral-expression image I_* is selected from {I_i | 0 < i < N} and its corresponding feature points P_* are obtained. The feature point offsets D_i are computed from P_* and P_i, and I_* is deformed by triangulating P_i and interpolating D_i to obtain the deformed image W_i corresponding to I_i. In addition, the standard deviation of the feature point offsets of each facial part is computed over all training data, and D_i is normalized per part with these standard deviations to obtain the normalized D̂_i, from which the offset map M_i is generated; a training sample finally consists of (W_i, M_i, I_i). Flipping and cropping are used for data augmentation.
    (3.1.2) Network structure: the network structure of the adversarial network is an encoder-decoder structure. The input image is downsampled twice, the downsampled feature maps pass through 4 residual blocks, and an image of the original size is finally output through resize convolutions. In addition, the network adds skip connections between the corresponding downsampling and upsampling layers to preserve the correctness of the image structure, i.e. there are two skip connections, so the network structure can be expressed as (C64,K7,S1,LReLU,Skip1)->(C128,K5,S2,LReLU,Skip2)->(C256,K3,S2,LReLU)->4*(RB256,K3,S1,LReLU)->(RC128,K3,R2,LReLU,Skip1)->(RC64,K3,R2,LReLU,Skip2)->(C3,K3,S1,Sigmoid), where C, RB and RC denote a convolution layer, a residual block and a resize-convolution layer respectively, and the number that follows denotes the depth of that layer's output; K denotes the kernel of the block and the number that follows denotes the kernel size; the number after S denotes the stride of the convolution layer or residual block: if the layer downsamples it is S2, otherwise S1; the number after R denotes the scaling factor of the resize-convolution layer, i.e. R2 when upsampling is required; Skip denotes a skip connection and the number that follows is its index, the same index indicating the same skip connection. The discriminator network is an encoding structure that encodes the input into a feature vector through convolution layers and uses the output of a fully connected layer to measure the realism of the input; its structure can be expressed as (C64,K7,S1,LReLU)->(C128,K5,S2,LReLU)->(C256,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(C512,K3,S2,LReLU)->(FC1), where FC denotes a fully connected layer, the number that follows indicates an output dimension of 1, and the fully connected layer has no activation function.
    (3.1.3) Loss function: the optimizer is denoted by the function R(x_w, M), where x_w is the input deformed image and M is the offset map; the discriminator is denoted by D(x, M), where x is either the optimizer output R(x_w, M) or a real image x_g. The loss function used to train the network can be defined as min_R max_D E[D(x_g, M) - D(R(x_w, M), M)] + α·L(R), where min_R means differentiating with respect to the parameters of the optimizer R so as to minimize the objective value, max_D means differentiating with respect to the parameters of the discriminator D so as to maximize the objective value, and E denotes the expectation over each mini-batch. L(R) is a regularization term, the L1 loss function between R(x_w, M) and x_g, used to constrain the optimizer's result, specifically L(R) = ||R(x_w, M) - x_g||_1, where α is a hyperparameter used to control the weight of L(R). In addition, the term E[D(x_g, M) - D(R(x_w, M), M)] is the adversarial loss function; during training, to improve the effect of adversarial training, when training the discriminator the results generated by the optimizer at the current iteration are combined with the optimizer's historical results as the discriminator input.
    (3.2) Refine the face-region details: based on the facial feature points, crop the deformed image and the initial face image to obtain their respective face-region images and align them, obtaining I_i and I_* together with their corresponding facial feature points P_i and P_*. The difference between P_i and P_* gives the feature point offsets D_i from I_* to I_i. The feature point offsets D_i are normalized per part; the normalization is as follows: the standard deviation of the offset values is computed per part over the entire training set, and the corresponding parts of D_i are normalized with these standard deviations to obtain D̂_i, which is turned into the offset map M_i through a triangulation with the feature points as vertices and interpolation. I_i is concatenated with M_i to form the network input. After passing through the network, the refined face image is obtained.
  5. The method for generating facial animation from a single image according to claim 4, characterized in that step 4 comprises the following sub-steps:
    (4.1) Train the generative adversarial network suitable for synthesizing the texture inside the oral cavity (hrh-GAN), specifically as follows:
    (4.1.1) Training data: face images are collected, their facial feature points are detected, and the corresponding oral-region masks are generated from the oral-region feature points. The face images and the corresponding oral-region masks form the hrh-GAN training data. Flipping and cropping are used for data augmentation.
    (4.1.2) Training method: a fully convolutional neural network is used as the generator to generate the oral region, with a global discriminator and a local discriminator helping the generator produce plausible oral-region details; the input-size ratio of the global discriminator to the local discriminator is 8:3.
    (4.2) Generate the oral-region texture: based on the facial feature points, compute the oral-region mask corresponding to the refined face image obtained in 3.2, and concatenate this face image with the oral-region mask as the input of the hrh-GAN generator, thereby obtaining the face image with the oral region filled in. Finally, the face image is aligned, through translation and rotation, with the face position in the deformed image and combined with the non-face region of the deformed image to obtain the final target portrait image.
PCT/CN2018/108523 2018-09-29 2018-09-29 A method for generating facial animation from a single image WO2020062120A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/108523 WO2020062120A1 (zh) 2018-09-29 2018-09-29 A method for generating facial animation from a single image
EP18935888.0A EP3859681A4 (en) 2018-09-29 2018-09-29 METHOD FOR GENERATING FACIAL ANIMATION FROM AN INDIVIDUAL IMAGE
US17/214,931 US11544887B2 (en) 2018-09-29 2021-03-29 Method for generating facial animation from single image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/108523 WO2020062120A1 (zh) 2018-09-29 2018-09-29 A method for generating facial animation from a single image

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/214,931 Continuation US11544887B2 (en) 2018-09-29 2021-03-29 Method for generating facial animation from single image

Publications (1)

Publication Number Publication Date
WO2020062120A1 true WO2020062120A1 (zh) 2020-04-02

Family

ID=69949286

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/108523 WO2020062120A1 (zh) 2018-09-29 2018-09-29 A method for generating facial animation from a single image

Country Status (3)

Country Link
US (1) US11544887B2 (zh)
EP (1) EP3859681A4 (zh)
WO (1) WO2020062120A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111524216A (zh) * 2020-04-10 2020-08-11 北京百度网讯科技有限公司 Method and apparatus for generating three-dimensional face data
CN115272687A (zh) * 2022-07-11 2022-11-01 哈尔滨工业大学 Single-sample adaptive domain generator transfer method
CN117152311A (zh) * 2023-08-02 2023-12-01 山东财经大学 Three-dimensional expression animation editing method and system based on a dual-branch network

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11263525B2 (en) 2017-10-26 2022-03-01 Nvidia Corporation Progressive modification of neural networks
US11250329B2 (en) * 2017-10-26 2022-02-15 Nvidia Corporation Progressive modification of generative adversarial neural networks
WO2020180134A1 (ko) * 2019-03-06 2020-09-10 한국전자통신연구원 이미지 수정 시스템 및 이의 이미지 수정 방법
US20230035306A1 (en) * 2021-07-21 2023-02-02 Nvidia Corporation Synthesizing video from audio using one or more neural networks
CN113674385B (zh) * 2021-08-05 2023-07-18 北京奇艺世纪科技有限公司 Virtual expression generation method and apparatus, electronic device and storage medium
US11900519B2 (en) * 2021-11-17 2024-02-13 Adobe Inc. Disentangling latent representations for image reenactment
CN115345970B (zh) * 2022-08-15 2023-04-07 哈尔滨工业大学(深圳) Multimodal-input video conditional generation method based on generative adversarial networks
CN116386122B (zh) * 2023-06-02 2023-08-29 中国科学技术大学 High-fidelity face swapping method, system, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404091A (zh) * 2008-11-07 2009-04-08 重庆邮电大学 Three-dimensional face reconstruction method and system based on two-step shape modeling
CN107657664A (zh) * 2017-08-17 2018-02-02 上海交通大学 Image optimization method and apparatus after facial expression synthesis, storage medium and computer device
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN107895358A (zh) * 2017-12-25 2018-04-10 科大讯飞股份有限公司 Face image enhancement method and system
CN108288072A (zh) * 2018-01-26 2018-07-17 深圳市唯特视科技有限公司 Facial expression synthesis method based on generative adversarial networks
CN108596024A (zh) * 2018-03-13 2018-09-28 杭州电子科技大学 Portrait generation method based on face structure information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9552510B2 (en) * 2015-03-18 2017-01-24 Adobe Systems Incorporated Facial expression capture for character animation
US10643383B2 (en) * 2017-11-27 2020-05-05 Fotonation Limited Systems and methods for 3D facial modeling
US10896535B2 (en) * 2018-08-13 2021-01-19 Pinscreen, Inc. Real-time avatars using dynamic textures

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404091A (zh) * 2008-11-07 2009-04-08 重庆邮电大学 Three-dimensional face reconstruction method and system based on two-step shape modeling
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN107657664A (zh) * 2017-08-17 2018-02-02 上海交通大学 Image optimization method and apparatus after facial expression synthesis, storage medium and computer device
CN107895358A (zh) * 2017-12-25 2018-04-10 科大讯飞股份有限公司 Face image enhancement method and system
CN108288072A (zh) * 2018-01-26 2018-07-17 深圳市唯特视科技有限公司 Facial expression synthesis method based on generative adversarial networks
CN108596024A (zh) * 2018-03-13 2018-09-28 杭州电子科技大学 Portrait generation method based on face structure information

Non-Patent Citations (21)

* Cited by examiner, † Cited by third party
Title
ANDREW L MAAS; AWNI Y HANNUN; ANDREW Y NG: "Rectifier nonlinearities improve neural network acoustic models", PROC. ICML, vol. 30. 3, 2013
ASHISH SHRIVASTAVA; TOMAS PFISTER; ONCEL TUZEL; JOSH SUSSKIND; WENDA WANG; RUSS WEBB: "Learning from simulated and unsupervised images through adversarial training", THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), vol. 3. 6, 2017
DEBBIE S MA; JOSHUA CORRELL; BERND WITTENBRINK: "The Chicago face database: A free stimulus set of faces and norming data", BEHAVIOR RESEARCH METHODS, vol. 47, no. 4, 2015, pages 1122 - 1135, XP035970400, DOI: 10.3758/s13428-014-0532-5
FENGCHUN QIAO; NAIMING YAO; ZIRUI JIAO; ZHIHAO LI; HUI CHEN; HONGAN WANG: "Geometry-Contrastive Generative Adversarial Network for Facial Expression Synthesis", ARXIV PREPRINT ARXIV:1802.01822, 2018
HADAR AVERBUCH-ELOR; DANIEL COHEN-OR; JOHANNES KOPF; MICHAEL F COHEN: "Bringing portraits to life", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 36, no. 6, 2017, pages 196, XP055693839, DOI: 10.1145/3130800.3130818
JON GAUTHIER: "Conditional generative adversarial nets for convolutional face generation", CLASS PROJECT FOR STANFORD CS231N: CONVOLUTIONAL NEURAL NETWORKS FOR VISUAL RECOGNITION, WINTER SEMESTER 2014, vol. 5, 2014, pages 2
JUSTUS THIES; MICHAEL ZOLLHÖFER; MARC STAMMINGER; CHRISTIAN THEOBALT; MATTHIAS NIESSNER: "Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on", 2016, IEEE, article "Face2face: Real-time face capture and reenactment of rgb videos", pages: 2387 - 2395
KAIMING HE; XIANGYU ZHANG; SHAOQING REN; JIAN SUN: "Deep residual learning for image recognition", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2016, pages 770 - 778, XP055536240, DOI: 10.1109/CVPR.2016.90
LINGXIAO SONG; ZHIHE LU; RAN HE; ZHENAN SUN; TIENIU TAN: "Geometry Guided Adversarial Facial Expression Synthesis", ARXIV PREPRINT ARXIV:1712.03474, 2017
MAJA PANTIC; MICHEL VALSTAR; RON RADEMAKER; LUDO MAAT: "Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on", 2005, IEEE, article "Web-based database for facial expression analysis", pages: 5
MARCEL PIOTRASCHKE; VOLKER BLANZ: "Automated 3d face reconstruction from multiple images using quality measures", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2016, pages 3418 - 3427, XP033021525, DOI: 10.1109/CVPR.2016.372
MARTIN ARJOVSKY; SOUMITH CHINTALA; LÉON BOTTOU: "Wasserstein gan", ARXIV PREPRINT ARXIV:1701.07875, 2017
NIKI AIFANTI; CHRISTOS PAPACHRISTOU; ANASTASIOS DELOPOULOS: "Image analysis for multimedia interactive services (WIAMIS), 2010 11th international workshop on", 2010, IEEE, article "The MUG facial expression database", pages: 1 - 4
PABLO GARRIDO; LEVI VALGAERTS; OLE REHMSEN; THORSTEN THORMAHLEN; PATRICK PEREZ; CHRISTIAN THEOBALT: "Automatic face reenactment", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2014, pages 4217 - 4224, XP032649461, DOI: 10.1109/CVPR.2014.537
PHILLIP ISOLA; JUN-YAN ZHU; TINGHUI ZHOU; ALEXEI A EFROS: "Image-to-image translation with conditional adversarial networks", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2017
PIA BREUER; KWANG-IN KIM; WOLF KIENZLE; BERNHARD SCHOLKOPF; VOLKER BLANZ: "Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on", 2008, IEEE, article "Automatic 3D face reconstruction from single images or video", pages: 1 - 8
SATOSHI IIZUKA; EDGAR SIMO-SERRA; HIROSHI ISHIKAWA: "Globally and locally consistent image completion", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 36, no. 4, 2017, pages 107, XP058372881, DOI: 10.1145/3072959.3073659
See also references of EP3859681A4
TERO KARRAS; TIMO AILA; SAMULI LAINE; JAAKKO LEHTINEN: "Progressive growing of gans for improved quality, stability, and variation", ARXIV PREPRINT ARXIV:1710.10196, 2017
UMAR MOHAMMED; SIMON JD PRINCE; JAN KAUTZ: "Visiolization: generating novel facial images", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 28, no. 3, 2009, pages 57
VOLKER BLANZ; THOMAS VETTER: "A morphable model for the synthesis of 3D faces", PROCEEDINGS OF THE 26TH ANNUAL CONFERENCE ON COMPUTER GRAPHICS AND INTERACTIVE TECHNIQUES, 1999, ACM PRESS/ADDISON-WESLEY PUBLISHING CO., pages: 187 - 194

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111524216A (zh) * 2020-04-10 2020-08-11 北京百度网讯科技有限公司 Method and apparatus for generating three-dimensional face data
CN115272687A (zh) * 2022-07-11 2022-11-01 哈尔滨工业大学 Single-sample adaptive domain generator transfer method
CN117152311A (zh) * 2023-08-02 2023-12-01 山东财经大学 Three-dimensional expression animation editing method and system based on a dual-branch network

Also Published As

Publication number Publication date
US11544887B2 (en) 2023-01-03
EP3859681A1 (en) 2021-08-04
US20210217219A1 (en) 2021-07-15
EP3859681A4 (en) 2021-12-15

Similar Documents

Publication Publication Date Title
WO2020062120A1 (zh) A method for generating facial animation from a single image
Geng et al. Warp-guided gans for single-photo facial animation
CN109448083B (zh) 一种从单幅图像生成人脸动画的方法
Kartynnik et al. Real-time facial surface geometry from monocular video on mobile GPUs
Nagano et al. paGAN: real-time avatars using dynamic textures.
US10019826B2 (en) Real-time high-quality facial performance capture
US9792725B2 (en) Method for image and video virtual hairstyle modeling
US10565758B2 (en) Neural face editing with intrinsic image disentangling
US9245176B2 (en) Content retargeting using facial layers
US9691165B2 (en) Detailed spatio-temporal reconstruction of eyelids
US20170069124A1 (en) Avatar generation and animations
US11308657B1 (en) Methods and systems for image processing using a learning engine
Sharma et al. 3d face reconstruction in deep learning era: A survey
US20220222897A1 (en) Portrait editing and synthesis
Ye et al. Audio-driven talking face video generation with dynamic convolution kernels
US10062216B2 (en) Applying facial masks to faces in live video
Gong et al. Autotoon: Automatic geometric warping for face cartoon generation
CN115004236A (zh) 来自音频的照片级逼真说话面部
CN115170559A (zh) 基于多层级哈希编码的个性化人头神经辐射场基底表示与重建方法
Zhao et al. Sparse to dense motion transfer for face image animation
Elgharib et al. Egocentric videoconferencing
US20220237879A1 (en) Direct clothing modeling for a drivable full-body avatar
Geng et al. Towards photo-realistic facial expression manipulation
Paier et al. Example-based facial animation of virtual reality avatars using auto-regressive neural networks
Ekmen et al. From 2D to 3D real-time expression transfer for facial animation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18935888

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2018935888

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2018935888

Country of ref document: EP

Effective date: 20210429