CN110599573A - Method for realizing real-time human face interactive animation based on monocular camera - Google Patents
- Publication number: CN110599573A (application CN201910839412.7A)
- Authority: CN (China)
- Prior art keywords: animation, model, face, voice, skin
- Prior art date: 2019-09-03
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to three-dimensional character animation technology and discloses a method for realizing real-time human face interactive animation based on a monocular camera. The method can be summarized as follows: capture a face video image and voice input information, and extract facial expression animation parameters and voice emotion animation parameters; learn a training sequence consisting of skeleton motion and the corresponding skin deformation through an action state space model, establish a virtual character skeleton skinning model based on auxiliary bone controllers, and drive this model with the extracted facial expression animation parameters and voice emotion animation parameters to generate real-time interactive animation.
Description
Technical Field
The invention relates to three-dimensional character animation technology, and in particular to a method for realizing real-time human face interactive animation based on a monocular camera.
Background
In recent years, the continuous development of computer hardware and software (for example, the augmented reality application development kit ARKit released by Apple Inc., and ARCore 1.0 and its series of supporting tools released by Google Inc.) has brought multimedia technology into a period of rapid development. At the same time, as people's demands on the visual quality of human-computer interaction interfaces grow, face modeling and animation technology plays an increasingly important role in human-computer interaction. Three-dimensional facial expression animation has a very wide range of applications, such as game entertainment, film production, human-computer interaction and advertising, and therefore has important application value and theoretical significance.
Since Parke's pioneering work on computer-generated facial animation in 1972[1], more and more researchers around the world have recognized the research and application value of three-dimensional face modeling and animation technology and have made many important contributions. As shown in fig. 1, these works mainly address how to represent the shape changes of the face with an effective model, how to capture facial expressions accurately and rapidly, how to build a fine three-dimensional face reconstruction model in real time, how to construct a digital face avatar, and how to drive it to generate a realistic face model.
In 2013, Cao et al.[2] proposed a real-time face tracking and animation method based on three-dimensional shape regression. The method uses a monocular camera as the acquisition device for face images and consists of two stages: preprocessing and real-time operation. In the preprocessing stage, the monocular camera captures a set of user-specific poses and expressions comprising a series of facial expressions and head rotations, and a facial feature point labelling algorithm is then used to semi-automatically annotate feature points in the user's face images. Building on such feature-point-calibrated face images, Cao et al.[3] constructed in 2014 the three-dimensional facial expression database FaceWarehouse for visual computing applications. The database provides a bilinear face model with the two attributes of identity and expression, which is fitted to generate a user-specific expression fusion (blendshape) model. From this expression fusion model, a three-dimensional face shape vector consisting of the three-dimensional positions of the feature points is computed for each image of the user acquired by the camera. The regression algorithm adopts a two-level boosted regression on shape-indexed features, and all images together with their corresponding three-dimensional face shape vectors are used as input to train a user-specific three-dimensional shape regressor. In the real-time operation stage, the user-specific three-dimensional shape regressor estimates, from the current and previous frames, three-dimensional shape parameters and face motion parameters, including rigid head transformation parameters and non-rigid facial expression motion parameters; these parameters are then transferred and mapped to a virtual character to drive it to generate expression animation corresponding to the face motion.
However, the above method has a certain limitation: for each new user, a preprocessing process of about 45 minutes is required to generate the user-specific expression fusion model and three-dimensional face shape regressor. In 2014, Cao et al.[4] further proposed a real-time face tracking algorithm based on displaced dynamic expression regression. It is also based on two-level cascade regression, but requires no preprocessing for a new user, thereby achieving real-time facial expression tracking and capture for arbitrary users.
The focus of both the real-time face tracking and animation method based on three-dimensional shape regression proposed by Cao et al. in 2013 and the real-time face tracking algorithm based on displaced dynamic expression regression proposed in 2014[4] lies in how to track large-amplitude facial movements in video accurately, efficiently and robustly, such as large expressions like frowning, laughing and mouth opening, together with rigid motions like head rotation and translation. However, both ignore detail information on the face, such as the wrinkles that appear when raising the eyebrows and the secondary motion of the facial skin caused by movement; such details are important cues that help people understand expressions and make the face more expressive.
Reference documents:
[1] Parke F I. Computer generated animation of faces[C]//ACM Conference. ACM, 1972: 451-457.
[2] Cao C, Weng Y, Lin S, et al. 3D shape regression for real-time facial animation[J]. ACM Transactions on Graphics, 2013, 32(4): 1.
[3] Cao C, Weng Y, Zhou S, et al. FaceWarehouse: A 3D Facial Expression Database for Visual Computing[J]. IEEE Transactions on Visualization & Computer Graphics, 2014, 20(3): 413-425.
[4] Cao C, Hou Q, Zhou K. Displaced dynamic expression regression for real-time facial tracking and animation[J]. ACM Transactions on Graphics, 2014, 33(4): 1-10.
[5] Ekman P, Friesen W V. Facial Action Coding System: Manual[M]. 1978.
[6] Duffy N, Helmbold D. Boosting Methods for Regression[J]. Machine Learning, 2002, 47(2-3): 153-200.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a method for realizing real-time human face interactive animation based on a monocular camera, which generates animation parameters by fusing facial expression capture and speech emotion recognition technologies and synthesizes visually plausible dynamic skin deformation animation in real time with a skeleton-based technique, so that the generated real-time animation is richer, more natural, more realistic and more expressive of individual character.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for realizing real-time human face interactive animation based on a monocular camera comprises the following steps:
s1, capturing a face video image through a monocular camera to obtain a face image sequence; simultaneously capturing voice input information through a voice sensor;
s2, marking human face characteristic points in the human face image sequence, and extracting human face expression animation parameters;
s3, extracting voice features from the captured voice input information and extracting voice emotion animation parameters;
S4, learning a training sequence consisting of skeleton motion and the corresponding skin deformation through the action state space model, establishing a virtual character skeleton skinning model based on the auxiliary bone controller, driving the virtual character skeleton skinning model with the extracted facial expression animation parameters and voice emotion animation parameters, and generating real-time interactive animation.
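For illustration only, the four steps S1-S4 can be organized into a per-frame loop roughly as in the following Python sketch. Every function and class here is a stub standing in for the modules described in this disclosure; the parameter sizes and the concatenation of the two parameter sets into one action vector are assumptions of the sketch, not the patented implementation.

```python
import numpy as np

def extract_expression_parameters(frame: np.ndarray) -> np.ndarray:
    """S2: landmark tracking -> facial expression AU parameters E (stub)."""
    return np.zeros(14)

def extract_voice_emotion_parameters(audio: np.ndarray) -> np.ndarray:
    """S3: speech emotion recognition -> voice emotion AU parameters V (stub)."""
    return np.zeros(14)

class SkinnedAvatar:
    """S4: ASSM-driven skeleton skinning model with auxiliary bones (stub)."""
    def step(self, action_vector: np.ndarray) -> None:
        pass

def process_frame(frame: np.ndarray, audio: np.ndarray, avatar: SkinnedAvatar) -> None:
    E = extract_expression_parameters(frame)      # facial expression parameters
    V = extract_voice_emotion_parameters(audio)   # voice emotion parameters
    avatar.step(np.concatenate([E, V]))           # drive the skinned model (S4)

# One captured video frame and audio chunk per iteration (S1).
process_frame(np.zeros((480, 640, 3)), np.zeros(1600), SkinnedAvatar())
```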
As a further optimization, in step S2, a double-layer cascade regression model is used to mark the facial feature points, and the Candide-3 face model based on the Facial Action Coding System is used as the parameter carrier to extract the facial expression animation parameters.
As a further optimization, the double-layer cascade regression model adopts a two-layer regression structure: the first layer adopts an enhanced regression model formed by combining T weak regressors in a superimposed manner; the second layer, for each weak regressor of the first layer, cascades K regression models into a strong regressor.
As a further optimization, step S3 specifically includes:
s31, analyzing and extracting the speech emotion information features in the speech input information;
s32, performing emotion recognition on the extracted voice emotion characteristics to finish emotion judgment;
S33, mapping the voice emotion result to the AU-based Facial Action Coding System and extracting the corresponding AU parameters to obtain the voice emotion animation parameters.
As a further optimization, in step S4, the action state space model is composed of three key elements (S, A, {P}), wherein:
S represents the set of facial expression states of the virtual character in each frame;
A represents a set of actions; the parameters obtained through facial expression recognition and speech emotion recognition serve as a set of action vectors that drive the state change of the virtual character in the next frame;
P is the state transition probability, representing the probability that the virtual character in expression state s_t ∈ S at the current frame t transitions to other states after performing action a_t ∈ A.
As a further optimization, in step S4, the method for establishing the virtual character skeleton skinning model based on the auxiliary bone controller comprises:
a. taking the prepared skeleton skinning model of the virtual character without auxiliary bones as the original model;
b. optimizing the skinning weights of the skeleton skinning model;
c. progressively inserting auxiliary bones into the regions where the approximation error between the original model and the target model is largest;
d. solving the two sub-problems of skinning weight optimization and auxiliary bone transformation optimization with a block coordinate descent algorithm;
e. constructing the auxiliary bone controller, wherein the skinning transformation q based on the auxiliary bone controller is represented as the combination of a static component x and a dynamic component y, q = x + y; the static component x is computed from the primary skeleton pose of the original model, and the dynamic component y is controlled by the action state space model.
The invention has the beneficial effects that:
1. Facial expressions are the outward manifestation of human emotions, but in some special cases facial expressions cannot fully express a character's inner emotion. If the face is driven point-to-point only by capturing and tracking facial expression feature points as parameters, the generated facial animation is clearly not vivid enough. For example, the facial expressions of a character when smiling and when laughing are similar, yet the spoken words differ; by adding speech emotion recognition, the current change in the character's emotional state can be captured better from the voice. By combining facial expression capture with speech emotion recognition, the invention can greatly improve the richness, naturalness and realism of the virtual character's expression animation.
2. Because the motion of the bones and the muscles together drives the change of skin appearance, and in order to simulate skin motion better, the invention adopts a skeleton skinning model, automatically adds auxiliary bones through a bone-based skinning decomposition algorithm, and jointly drives the primary bones, which simulate head skeletal motion, and the auxiliary bones, which simulate muscle motion, to animate the virtual character.
Drawings
FIG. 1 illustrates the current state of the art in three-dimensional face animation;
FIG. 2 is a schematic diagram of the implementation of real-time human face interaction animation according to the present invention;
FIG. 3 is a schematic diagram of an enhanced regression structure;
FIG. 4 is a schematic diagram of a two-layer cascade regression structure;
FIG. 5 is a diagram illustrating the state transition process of the ASSM.
Detailed Description
The invention aims to provide a method for realizing real-time human face interactive animation based on a monocular camera, which generates animation parameters by fusing facial expression capture and speech emotion recognition technologies and synthesizes visually plausible dynamic skin deformation animation in real time with a skeleton-based technique, so that the generated real-time animation is richer, more natural, more realistic and more expressive of individual character. To this end, the scheme of the invention is mainly realized from the following aspects:
1. face motion capture:
face motion capture includes two parts: non-rigid capture of facial expressions and head rigid transformation capture. According to the unique muscle movement characteristics of the facial expressions, the facial five sense organs are coordinated as a unified whole to show each facial expression. The method uses the intermediate description method with invariance as the reliable feature representation of the facial expression recognition to make up the deficiency of the bottom layer feature in the facial expression recognition.
2. Speech emotion recognition: the current emotional state of the performer is captured from the voice input and, through steps such as speech feature extraction, dimensionality reduction and classification, voice emotion animation parameters corresponding to the performer's current emotional state are generated (an illustrative sketch of this stage follows the list below).
3. Target digital avatar expression: a bone-based dynamic avatar expression method is used, which learns a training sequence consisting of skeleton motion and the corresponding skin deformation sequence to obtain an optimal transfer of nonlinear complex deformations, including those of soft tissue. The user's expression semantics extracted from face motion capture drive the motion of the character's head bones and provide programmed control of the auxiliary bones, so as to simulate the dynamic deformation of the facial skin.
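A minimal sketch of the speech emotion recognition aspect (item 2 above), assuming one generic acoustic feature vector per utterance; PCA and an SVM are used as stand-ins for the dimensionality reduction and classification steps, which the invention does not fix to any particular algorithm.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))     # placeholder acoustic feature vectors
y_train = rng.integers(0, 4, size=200)   # placeholder emotion labels (4 classes)

reducer = PCA(n_components=16).fit(X_train)                            # dimensionality reduction
clf = SVC(probability=True).fit(reducer.transform(X_train), y_train)   # classification

def recognize_emotion(features: np.ndarray) -> int:
    """Return the predicted emotion class for one utterance's feature vector."""
    return int(clf.predict(reducer.transform(features.reshape(1, -1)))[0])

print(recognize_emotion(rng.normal(size=64)))
```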
In terms of specific implementation, the principle of the method for implementing the real-time interactive animation of the human face based on the monocular camera is shown in fig. 2, and the method comprises the following steps:
(1) capturing a face video image through a monocular camera to obtain a face image sequence; simultaneously capturing voice input information through a voice sensor;
(2) capturing and tracking a human face: marking human face characteristic points from the captured human face image, and extracting human face expression animation parameters;
the positioning of the face feature points is a key link in face recognition, face tracking, face animation and three-dimensional face modeling. Due to factors such as human face diversity and illumination, locating human face feature points in natural environments is still a difficult challenge. The specific definition of the human face characteristic points is as follows: for a face shape containing N face feature points, S ═ x1,y1,...,xN,yN]For an input face picture, the goal of face feature point positioning is to estimate a face feature point shape S, so that S is equal to the true shape of the face feature pointHas the smallest difference of S andthe minimized alignment difference between can be defined as L2-normal formThis equation is used to guide the training of a face feature point locator or to evaluate the performance of a locating algorithm for face feature points.
The invention adopts an algorithm framework based on a regression model for real-time and efficient face detection and tracking.
a) Enhanced Regression (Boosted Regression)
Enhanced regression combines T weak regressors (R^1, ..., R^t, ..., R^T) in a superimposed manner. For a given face sample I and an initial shape S^0, each regressor computes a shape increment from the sample features and updates the current shape in a cascaded manner:

S^t = S^(t-1) + R^t(I, S^(t-1)), t = 1, ..., T    (1)

R^t(I, S^(t-1)) denotes the shape increment computed by regressor R^t from the input sample image I and the previous shape S^(t-1); R^t is determined by the input sample image I and the previous shape S^(t-1), and shape-indexed features are used to learn R^t, as shown in fig. 3.

Given N training samples {(I_i, Ŝ_i)}, i = 1, ..., N, where Ŝ_i denotes the true shape of the i-th sample I_i, the regressors (R^1, ..., R^t, ..., R^T) are trained in sequence until the training error no longer decreases. Each R^t is obtained by minimizing the alignment error, i.e.:

R^t = argmin_R Σ_i || Ŝ_i − (S_i^(t-1) + R(I_i, S_i^(t-1))) ||_2

where S_i^(t-1) denotes the previous shape estimate of the i-th image, and the output of R^t is a shape increment.
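The cascaded update of Eq. (1) can be sketched as follows. The weak regressors below are placeholders that each return a fixed increment; a real implementation would learn them from shape-indexed features as described above.

```python
import numpy as np

class ConstantRegressor:
    """Placeholder weak regressor R^t returning a stored shape increment."""
    def __init__(self, increment: np.ndarray):
        self.increment = increment
    def __call__(self, image, shape: np.ndarray) -> np.ndarray:
        return self.increment

def run_cascade(image, S0: np.ndarray, regressors) -> np.ndarray:
    """Apply S^t = S^(t-1) + R^t(I, S^(t-1)) for t = 1, ..., T."""
    S = S0.copy()
    for R in regressors:
        S = S + R(image, S)
    return S

S0 = np.zeros(6)                                               # initial shape estimate
regs = [ConstantRegressor(np.full(6, 0.1)) for _ in range(5)]  # T = 5 weak regressors
print(run_cascade(image=None, S0=S0, regressors=regs))
```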
b) Double-layer Cascade Regression (Two-level Boosted Regression)
The enhanced regression algorithm regresses the entire shape, but the large appearance differences among input images and the coarse initial face shape make a single-layer weak regressor insufficient: a single regressor is too weak, training converges slowly, and test results are poor. To make training converge faster and more stably, the invention adopts a two-layer cascade structure, as shown in fig. 4.
The first layer adopts the enhanced regression model of Eq. (1). For each regressor R^t of the first layer, K regression models are used in turn, i.e., R^t = (r^1, ..., r^k, ..., r^K); here r is called a primitive regressor, and a strong regressor is formed by cascading K primitive regressors. The difference between the two layers is that the input S^(t-1) of each regressor R^t in the first layer is different, whereas the inputs of all regressors r^k in the second layer are the same; for example, the inputs of all second-layer regressors of R^t are S^(t-1).
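One minimal reading of this two-level structure is sketched below: the K primitive regressors inside one strong regressor all receive the same input shape S^(t-1), while successive strong regressors receive the updated shape. The primitive regressors are again placeholder constants.

```python
import numpy as np

def strong_regressor(primitives, image, S_prev):
    """R^t: accumulate the increments of K primitive regressors, all fed the same S_prev."""
    delta = np.zeros_like(S_prev)
    for r in primitives:          # k = 1, ..., K; every r^k sees the same S^(t-1)
        delta += r(image, S_prev)
    return delta

def two_level_cascade(image, S0, levels):
    S = S0.copy()
    for primitives in levels:     # t = 1, ..., T; each R^t sees the updated shape
        S = S + strong_regressor(primitives, image, S)
    return S

toy = lambda image, S: np.full_like(S, 0.05)                  # placeholder primitive regressor
print(two_level_cascade(None, np.zeros(4), [[toy] * 3] * 2))  # T = 2, K = 3
```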
For generating the facial expression animation parameters, the invention uses the AU-based Facial Action Coding System (FACS) proposed by Ekman et al.[5], which describes a total of 44 basic action units, each controlled by an underlying facial region or muscle group. Specifically, the Candide-3 face model, which is based on the Facial Action Coding System, can be used as the parameter carrier to extract the AU parameters E corresponding to the facial expression.
The Candide-3 face model is represented as follows:

g = R(ḡ + Sσ + Aα) + t

where ḡ denotes the basic shape of the model, S is the static deformation matrix, A is the dynamic deformation matrix, σ is the static deformation parameter, α is the dynamic deformation parameter, and R and t denote the rigid head rotation matrix and the head translation, respectively. g is the column vector of the model's vertex coordinates and represents a specific facial expression shape. The model g is thus determined by the four groups of parameters R, t, α and σ.
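The following toy evaluation of this parameterization uses randomly sized placeholder matrices; the real Candide-3 model has fixed numbers of vertices, shape units and action units.

```python
import numpy as np

def candide3(g_bar, S, sigma, A, alpha, R, t):
    """g_bar: (3V,) base shape; S: (3V, m) static units; A: (3V, n) action units;
    R: (3, 3) head rotation; t: (3,) head translation. Returns deformed vertices (V, 3)."""
    shape = (g_bar + S @ sigma + A @ alpha).reshape(-1, 3)
    return shape @ R.T + t

rng = np.random.default_rng(1)
V, m, n = 8, 4, 3   # toy sizes
g = candide3(rng.normal(size=3 * V), rng.normal(size=(3 * V, m)), rng.normal(size=m),
             rng.normal(size=(3 * V, n)), rng.normal(size=n), np.eye(3), np.zeros(3))
print(g.shape)      # (8, 3)
```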
(3) Extracting voice features from the captured voice input information and extracting voice emotion animation parameters;
The speech emotion information features in the voice input are analyzed and extracted; emotion recognition is performed on the extracted voice emotion features to complete the emotion judgment; the voice emotion result is then mapped to the AU-based Facial Action Coding System, and the corresponding AU parameters are extracted to obtain the voice emotion animation parameters V.
(4) Learning a training sequence consisting of skeleton motion and the corresponding skin deformation through the action state space model, establishing a virtual character skeleton skinning model based on the auxiliary bone controller, driving the model with the extracted facial expression animation parameters and voice emotion animation parameters, and generating real-time interactive animation.
(a) Action State Space Model (ASSM):
The action state space model consists of three key elements (S, A, {P}), where:
S: a set of states, the facial expression states of the virtual character (e.g., happy, sad, etc.);
A: a set of actions; the parameters obtained from facial expression recognition and speech emotion recognition serve as a set of action vectors that drive the state change of the virtual character in the next frame;
P: the state transition probability, i.e., the probability that the virtual character in expression state s_t ∈ S at the current frame t transitions to other states after performing action a_t ∈ A.
The dynamic process of the ASSM is as follows: the virtual character starts in state s_0; driven by the performer's action vector a_0 ∈ A, it transitions to the next-frame state s_1 according to the probability P, then performs action a_1, and so on, giving the process shown in fig. 5.
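The ASSM dynamics can be simulated with a toy example such as the following; the states, action labels and transition table are illustrative placeholders, not values taken from the invention.

```python
import numpy as np

states = ["neutral", "happy", "sad"]            # S (placeholder expression states)
# P[s][a]: categorical distribution over next states given state s and action a.
P = {
    "neutral": {"smile_params": [0.1, 0.8, 0.1], "frown_params": [0.2, 0.1, 0.7]},
    "happy":   {"smile_params": [0.1, 0.9, 0.0], "frown_params": [0.5, 0.2, 0.3]},
    "sad":     {"smile_params": [0.3, 0.5, 0.2], "frown_params": [0.1, 0.0, 0.9]},
}

rng = np.random.default_rng(2)
s = "neutral"                                   # s_0
for t, a in enumerate(["smile_params", "smile_params", "frown_params"]):
    s = str(rng.choice(states, p=P[s][a]))      # transition according to P
    print(f"frame {t + 1}: action={a} -> state={s}")
```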
(b) Auxiliary bone framework:
Auxiliary bone placement. Given the index set P of primary bones, a global transformation matrix G_p is computed for each primary bone p ∈ P. Let v̄_i and Ḡ_p denote, respectively, the position of the i-th vertex of the static skin and the matrix of primary bone p in the rest pose, and let S_p denote the skinning transformation matrix corresponding to primary bone p. The index set of the secondary bones, called auxiliary bones, is denoted by H, and the corresponding skinning formula is:

v_i = Σ_(p ∈ P) w_(i,p) S_p v̄_i + Σ_(h ∈ H) w_(i,h) S_h v̄_i

where v_i denotes the position of the deformed skin vertex, w_(i,·) are the skinning weights, and S_h denotes the skinning matrix corresponding to the h-th auxiliary bone. The first term corresponds to skin deformation driven by the primary bones, and the second term provides additional control over the deformation using the auxiliary bones. The number of auxiliary bones is specified by the designer to balance deformation quality against computational cost.
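A minimal sketch of this skinning formula, treating the skinning transforms as 4x4 homogeneous matrices; the weights and translations are made-up values.

```python
import numpy as np

def translation(d):
    """Build a 4x4 homogeneous translation matrix (toy stand-in for a skinning transform)."""
    S = np.eye(4)
    S[:3, 3] = d
    return S

def skin_vertex(v_rest, primary, auxiliary):
    """v_rest: (3,) rest position; primary/auxiliary: lists of (weight, 4x4 matrix) pairs."""
    v_h = np.append(v_rest, 1.0)                 # homogeneous coordinates
    out = np.zeros(4)
    for w, S in primary:                         # first term: primary bones
        out += w * (S @ v_h)
    for w, S in auxiliary:                       # second term: auxiliary bones
        out += w * (S @ v_h)
    return out[:3]

v = np.array([0.0, 1.0, 0.0])
primary = [(0.7, translation([0.1, 0.0, 0.0]))]
auxiliary = [(0.3, translation([0.0, 0.2, 0.0]))]   # helper bone adds detail motion
print(skin_vertex(v, primary, auxiliary))
```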
Skinning decomposition. The skinning decomposition is divided into two sub-problems. The first sub-problem estimates the optimal skinning weights and skinning matrices that best approximate the training data at each frame t ∈ T. The second sub-problem approximates these discrete transformations with an auxiliary bone control model driven by the original skeleton.
Given the training sequence of primary-bone skinning matrices and the corresponding vertex animation, the skinning decomposition is formulated as a constrained least-squares problem that minimizes the sum of squared shape differences between the original model and the target model over the whole training data set:

min Σ_(t ∈ T) Σ_(i ∈ V) || v_i^t − ( Σ_(p ∈ P) w_(i,p) S_p^t v̄_i + Σ_(h ∈ H) w_(i,h) S_h^t v̄_i ) ||^2

subject to constraints on the skinning weights, including that each vertex is influenced by at most k bones.

In the above formula, || · ||_n denotes the ℓ_n norm and V denotes the index set of the skin vertices. The constant k limits the number of bones that may influence a single skin vertex, adjusting the balance between computational cost and accuracy.
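The block-coordinate-descent idea behind this decomposition can be illustrated with a drastically simplified toy in which per-frame vertex displacements are factorized into weights times per-bone displacement signals; the constraints of the real problem (non-negativity, the influence limit k, rigid bone transforms) are omitted here.

```python
import numpy as np

rng = np.random.default_rng(3)
n_vertices, n_bones, n_frames = 20, 3, 15
W_true = rng.random((n_vertices, n_bones))
D_true = rng.normal(size=(n_bones, n_frames))
V_anim = W_true @ D_true                            # training displacements to approximate

W = rng.random((n_vertices, n_bones))               # initial weights
for _ in range(30):                                 # alternate the two blocks
    D = np.linalg.lstsq(W, V_anim, rcond=None)[0]           # fix W, solve transforms
    W = np.linalg.lstsq(D.T, V_anim.T, rcond=None)[0].T     # fix D, solve weights
print(float(np.linalg.norm(V_anim - W @ D)))        # approximation error (near zero)
```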
Auxiliary bone controller. Assuming the auxiliary bones are driven by an original skeleton that has only spherical joints, the pose of the auxiliary bones is uniquely determined by the rotations r_p ∈ SO(3) of all primary bones. The skeleton pose is represented as a column vector:

u := Δt_0 || Δr_0 || r_1 || r_2 || … || r_|P|    (9)

where u ∈ R^(3|P|+6), || denotes the concatenation operator on vector values, |P| is the number of primary bones, Δt_0 ∈ R^3 represents the translation change of the root node, and Δr_0 ∈ SO(3) represents the orientation change of the root node.
Each auxiliary bone is attached to a primary bone as its child bone. Let φ(h) denote the primary bone corresponding to the h-th auxiliary bone and S_h the skinning matrix corresponding to the h-th auxiliary bone; S_h is then determined by the local transformation L_h of the auxiliary bone together with the global transformation of its parent bone φ(h). The local transformation L_h consists of a translation component t_h and a rotation component r_h.
The model assumes that skin deformation is modeled as the combination of a static deformation and a dynamic deformation: the former is determined by the pose of the primary skeleton, while the latter depends on the skeleton motion and the change of skin deformation over the past time steps. Thus, the skinning transformation q of an auxiliary bone is represented as the combination of a static component x and a dynamic component y, q = x + y. The static transformation x is computed from the skeletal pose, and the dynamic transformation y is controlled by a state space model that takes into account the accumulated information of previous skeletal poses and auxiliary bone transformations.
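A minimal numerical sketch of the q = x + y split is given below; the static map and the state-space matrices are illustrative placeholders rather than the learned controller of the approach described above.

```python
import numpy as np

rng = np.random.default_rng(4)
dim_u, dim_q = 9, 6                          # toy pose-vector / transform-vector sizes
Wx = 0.1 * rng.normal(size=(dim_q, dim_u))   # static map: x = Wx @ u
A = 0.8 * np.eye(dim_q)                      # decay of the dynamic component
B = 0.05 * rng.normal(size=(dim_q, dim_u))   # pose-to-dynamics coupling

y = np.zeros(dim_q)
for t in range(5):
    u = rng.normal(size=dim_u)   # per-frame primary skeleton pose vector
    x = Wx @ u                   # static component from the current pose
    y = A @ y + B @ u            # dynamic component accumulating past poses
    q = x + y                    # auxiliary-bone skinning transform (vectorized)
    print(f"frame {t}: |q| = {np.linalg.norm(q):.3f}")
```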
Claims (6)
1. A method for realizing real-time human face interactive animation based on a monocular camera is characterized by comprising the following steps:
s1, capturing a face video image through a monocular camera to obtain a face image sequence; simultaneously capturing voice input information through a voice sensor;
s2, marking human face characteristic points in the human face image sequence, and extracting human face expression animation parameters;
s3, extracting voice features from the captured voice input information and extracting voice emotion animation parameters;
S4, learning a training sequence consisting of skeleton motion and the corresponding skin deformation through the action state space model, establishing a virtual character skeleton skinning model based on the auxiliary bone controller, driving the virtual character skeleton skinning model with the extracted facial expression animation parameters and voice emotion animation parameters, and generating real-time interactive animation.
2. The method for realizing real-time interactive animation of human faces based on a monocular camera as recited in claim 1,
in step S2, a double-layer cascade regression model is used to mark the facial feature points, and the Candide-3 face model based on the Facial Action Coding System is used as the parameter carrier to extract the facial expression animation parameters.
3. The method for realizing real-time interactive animation of human faces based on a monocular camera as recited in claim 2,
the double-layer cascade regression model adopts a two-layer regression structure, wherein the first layer adopts an enhanced regression model formed by combining T weak regressors in a superimposed manner, and the second layer, for each weak regressor of the first layer, cascades K regression models into a strong regressor.
4. The method for realizing real-time interactive animation of human faces based on a monocular camera as recited in claim 1,
step S3 specifically includes:
s31, analyzing and extracting the speech emotion information features in the speech input information;
s32, performing emotion recognition on the extracted voice emotion characteristics to finish emotion judgment;
S33, mapping the voice emotion result to the AU-based Facial Action Coding System and extracting the corresponding AU parameters to obtain the voice emotion animation parameters.
5. The method for realizing real-time interactive animation of human faces based on a monocular camera as recited in claim 1,
in step S4, the action state space model is composed of three key elements (S, A, {P}), wherein:
S represents the set of facial expression states of the virtual character in each frame;
A represents a set of actions; the parameters obtained through facial expression recognition and speech emotion recognition serve as a set of action vectors that drive the state change of the virtual character in the next frame;
P is the state transition probability, representing the probability that the virtual character in expression state s_t ∈ S at the current frame t transitions to other states after performing action a_t ∈ A.
6. The method for realizing real-time interactive animation of human faces based on a monocular camera as recited in claim 1,
in step S4, the method for establishing the virtual character skeleton skinning model based on the auxiliary bone controller comprises:
a. taking the prepared skeleton skinning model of the virtual character without auxiliary bones as the original model;
b. optimizing the skinning weights of the skeleton skinning model;
c. progressively inserting auxiliary bones into the regions where the approximation error between the original model and the target model is largest;
d. solving the two sub-problems of skinning weight optimization and auxiliary bone transformation optimization with a block coordinate descent algorithm;
e. constructing the auxiliary bone controller, wherein the skinning transformation q based on the auxiliary bone controller is represented as the combination of a static component x and a dynamic component y, q = x + y; the static component x is computed from the primary skeleton pose of the original model, and the dynamic component y is controlled by the action state space model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910839412.7A CN110599573B (en) | 2019-09-03 | 2019-09-03 | Method for realizing real-time human face interactive animation based on monocular camera |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110599573A true CN110599573A (en) | 2019-12-20 |
CN110599573B CN110599573B (en) | 2023-04-11 |
Family
ID=68857773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910839412.7A Active CN110599573B (en) | 2019-09-03 | 2019-09-03 | Method for realizing real-time human face interactive animation based on monocular camera |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110599573B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111813491A (en) * | 2020-08-19 | 2020-10-23 | 广州汽车集团股份有限公司 | Vehicle-mounted assistant anthropomorphic interaction method and device and automobile |
CN111968207A (en) * | 2020-09-25 | 2020-11-20 | 魔珐(上海)信息科技有限公司 | Animation generation method, device, system and storage medium |
CN112190921A (en) * | 2020-10-19 | 2021-01-08 | 珠海金山网络游戏科技有限公司 | Game interaction method and device |
CN112669424A (en) * | 2020-12-24 | 2021-04-16 | 科大讯飞股份有限公司 | Expression animation generation method, device, equipment and storage medium |
CN113050794A (en) * | 2021-03-24 | 2021-06-29 | 北京百度网讯科技有限公司 | Slider processing method and device for virtual image |
CN113269872A (en) * | 2021-06-01 | 2021-08-17 | 广东工业大学 | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization |
CN113554745A (en) * | 2021-07-15 | 2021-10-26 | 电子科技大学 | Three-dimensional face reconstruction method based on image |
TWI773458B (en) * | 2020-11-25 | 2022-08-01 | 大陸商北京市商湯科技開發有限公司 | Method, device, computer equipment and storage medium for reconstruction of human face |
CN115588224A (en) * | 2022-10-14 | 2023-01-10 | 中南民族大学 | Face key point prediction method, virtual digital person generation method and device |
CN115731330A (en) * | 2022-11-16 | 2023-03-03 | 北京百度网讯科技有限公司 | Target model generation method, animation generation method, device and electronic equipment |
CN117809002A (en) * | 2024-02-29 | 2024-04-02 | 成都理工大学 | Virtual reality synchronization method based on facial expression recognition and motion capture |
CN117876549A (en) * | 2024-02-02 | 2024-04-12 | 广州一千零一动漫有限公司 | Animation generation method and system based on three-dimensional character model and motion capture |
- 2019-09-03 CN CN201910839412.7A patent/CN110599573B/en active Active
Patent Citations (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1466104A (en) * | 2002-07-03 | 2004-01-07 | 中国科学院计算技术研究所 | Statistics and rule combination based phonetic driving human face carton method |
JP2007058846A (en) * | 2005-07-27 | 2007-03-08 | Advanced Telecommunication Research Institute International | Statistic probability model creation apparatus for lip sync animation creation, parameter series compound apparatus, lip sync animation creation system, and computer program |
US20090231347A1 (en) * | 2008-03-11 | 2009-09-17 | Masanori Omote | Method and Apparatus for Providing Natural Facial Animation |
CN102473320A (en) * | 2009-07-13 | 2012-05-23 | 微软公司 | Bringing a visual representation to life via learned input from the user |
CN103093490A (en) * | 2013-02-02 | 2013-05-08 | 浙江大学 | Real-time facial animation method based on single video camera |
CN103218841A (en) * | 2013-04-26 | 2013-07-24 | 中国科学技术大学 | Three-dimensional vocal organ animation method combining physiological model and data driving model |
CN103279970A (en) * | 2013-05-10 | 2013-09-04 | 中国科学技术大学 | Real-time human face animation driving method by voice |
CN103824089A (en) * | 2014-02-17 | 2014-05-28 | 北京旷视科技有限公司 | Cascade regression-based face 3D pose recognition method |
CN103942822A (en) * | 2014-04-11 | 2014-07-23 | 浙江大学 | Facial feature point tracking and facial animation method based on single video vidicon |
CN105139438A (en) * | 2014-09-19 | 2015-12-09 | 电子科技大学 | Video face cartoon animation generation method |
JP2015092347A (en) * | 2014-11-19 | 2015-05-14 | Necプラットフォームズ株式会社 | Emotion-expressing animation face display system, method and program |
US20190082211A1 (en) * | 2016-02-10 | 2019-03-14 | Nitin Vats | Producing realistic body movement using body Images |
US20190197755A1 (en) * | 2016-02-10 | 2019-06-27 | Nitin Vats | Producing realistic talking Face with Expression using Images text and voice |
CN105787448A (en) * | 2016-02-28 | 2016-07-20 | 南京信息工程大学 | Facial shape tracking method based on space-time cascade shape regression |
CN106447785A (en) * | 2016-09-30 | 2017-02-22 | 北京奇虎科技有限公司 | Method for driving virtual character and device thereof |
CN106653052A (en) * | 2016-12-29 | 2017-05-10 | Tcl集团股份有限公司 | Virtual human face animation generation method and device |
CN106919251A (en) * | 2017-01-09 | 2017-07-04 | 重庆邮电大学 | A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition |
CN107274464A (en) * | 2017-05-31 | 2017-10-20 | 珠海金山网络游戏科技有限公司 | A kind of methods, devices and systems of real-time, interactive 3D animations |
CN107886558A (en) * | 2017-11-13 | 2018-04-06 | 电子科技大学 | A kind of human face expression cartoon driving method based on RealSense |
CN109116981A (en) * | 2018-07-03 | 2019-01-01 | 北京理工大学 | A kind of mixed reality interactive system of passive touch feedback |
CN109493403A (en) * | 2018-11-13 | 2019-03-19 | 北京中科嘉宁科技有限公司 | A method of human face animation is realized based on moving cell Expression Mapping |
CN109635727A (en) * | 2018-12-11 | 2019-04-16 | 昆山优尼电能运动科技有限公司 | A kind of facial expression recognizing method and device |
CN109712627A (en) * | 2019-03-07 | 2019-05-03 | 深圳欧博思智能科技有限公司 | It is a kind of using speech trigger virtual actor's facial expression and the voice system of mouth shape cartoon |
CN110009716A (en) * | 2019-03-28 | 2019-07-12 | 网易(杭州)网络有限公司 | Generation method, device, electronic equipment and the storage medium of facial expression |
CN110070944A (en) * | 2019-05-17 | 2019-07-30 | 段新 | Training system is assessed based on virtual environment and the social function of virtual role |
Non-Patent Citations (5)
Title |
---|
CHEN CHEN et al.: "Real-time 3D facial expression control based on performance", 《2015 12TH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (FSKD)》 * |
MOZAMMEL CHOWDHURY et al.: "Fuzzy rule based approach for face and facial feature extraction in biometric authentication", 《2016 INTERNATIONAL CONFERENCE ON IMAGE AND VISION COMPUTING NEW ZEALAND (IVCNZ)》 * |
SUN Chen: "Performance-driven real-time facial expression animation synthesis", 《China Master's Theses Full-text Database, Information Science and Technology》 * |
LI Ke: "Research on facial expression animation", 《China Master's Theses Full-text Database, Information Science and Technology》 * |
FAN Yiwen et al.: "Speech-driven face animation supporting expression details", 《Journal of Computer-Aided Design & Computer Graphics》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111813491B (en) * | 2020-08-19 | 2020-12-18 | 广州汽车集团股份有限公司 | Vehicle-mounted assistant anthropomorphic interaction method and device and automobile |
CN111813491A (en) * | 2020-08-19 | 2020-10-23 | 广州汽车集团股份有限公司 | Vehicle-mounted assistant anthropomorphic interaction method and device and automobile |
CN111968207B (en) * | 2020-09-25 | 2021-10-29 | 魔珐(上海)信息科技有限公司 | Animation generation method, device, system and storage medium |
CN111968207A (en) * | 2020-09-25 | 2020-11-20 | 魔珐(上海)信息科技有限公司 | Animation generation method, device, system and storage medium |
US11893670B2 (en) | 2020-09-25 | 2024-02-06 | Mofa (Shanghai) Information Technology Co., Ltd. | Animation generation method, apparatus and system, and storage medium |
CN112190921A (en) * | 2020-10-19 | 2021-01-08 | 珠海金山网络游戏科技有限公司 | Game interaction method and device |
TWI773458B (en) * | 2020-11-25 | 2022-08-01 | 大陸商北京市商湯科技開發有限公司 | Method, device, computer equipment and storage medium for reconstruction of human face |
CN112669424A (en) * | 2020-12-24 | 2021-04-16 | 科大讯飞股份有限公司 | Expression animation generation method, device, equipment and storage medium |
CN112669424B (en) * | 2020-12-24 | 2024-05-31 | 科大讯飞股份有限公司 | Expression animation generation method, device, equipment and storage medium |
US11842457B2 (en) | 2021-03-24 | 2023-12-12 | Beijing Baidu Netcom Science Technology Co., Ltd. | Method for processing slider for virtual character, electronic device, and storage medium |
CN113050794A (en) * | 2021-03-24 | 2021-06-29 | 北京百度网讯科技有限公司 | Slider processing method and device for virtual image |
CN113269872A (en) * | 2021-06-01 | 2021-08-17 | 广东工业大学 | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization |
CN113554745A (en) * | 2021-07-15 | 2021-10-26 | 电子科技大学 | Three-dimensional face reconstruction method based on image |
CN113554745B (en) * | 2021-07-15 | 2023-04-07 | 电子科技大学 | Three-dimensional face reconstruction method based on image |
CN115588224A (en) * | 2022-10-14 | 2023-01-10 | 中南民族大学 | Face key point prediction method, virtual digital person generation method and device |
CN115731330A (en) * | 2022-11-16 | 2023-03-03 | 北京百度网讯科技有限公司 | Target model generation method, animation generation method, device and electronic equipment |
CN117876549A (en) * | 2024-02-02 | 2024-04-12 | 广州一千零一动漫有限公司 | Animation generation method and system based on three-dimensional character model and motion capture |
CN117809002A (en) * | 2024-02-29 | 2024-04-02 | 成都理工大学 | Virtual reality synchronization method based on facial expression recognition and motion capture |
CN117809002B (en) * | 2024-02-29 | 2024-05-14 | 成都理工大学 | Virtual reality synchronization method based on facial expression recognition and motion capture |
Also Published As
Publication number | Publication date |
---|---|
CN110599573B (en) | 2023-04-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |