CN110599573B - Method for realizing real-time human face interactive animation based on monocular camera


Info

Publication number
CN110599573B
CN110599573B (application CN201910839412.7A)
Authority
CN
China
Prior art keywords
animation
model
face
parameters
voice
Prior art date
Legal status
Active
Application number
CN201910839412.7A
Other languages
Chinese (zh)
Other versions
CN110599573A (en)
Inventor
谢宁
杨心如
申恒涛
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201910839412.7A
Publication of CN110599573A
Application granted
Publication of CN110599573B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention relates to three-dimensional character animation technology and discloses a method for realizing real-time interactive face animation based on a monocular camera. The method can be summarized as follows: capture a face video stream and voice input, and extract facial expression animation parameters and speech emotion animation parameters; learn a training sequence consisting of bone motion and the corresponding skin deformation with an action state space model, build a virtual character bone-skinning model based on auxiliary bone controllers, and drive that model with the extracted facial expression and speech emotion animation parameters to generate real-time interactive animation.

Description

Method for realizing real-time human face interactive animation based on monocular camera
Technical Field
The invention relates to three-dimensional character animation technology, and in particular to a method for realizing real-time interactive face animation based on a monocular camera.
Background
In recent years, with the continuing development of computer hardware and software (for example Apple's augmented reality development kit ARKit, and Google's ARCore 1.0 with its series of supporting tools), multimedia technology has entered a period of rapid growth. At the same time, as the demands on the visual quality of human-computer interfaces keep rising, face modeling and animation play an increasingly important role in human-computer interaction. Three-dimensional facial expression animation has a very wide range of applications, such as games and entertainment, film production, human-computer interaction and advertising, and therefore has important application value and theoretical significance.
Since Parke's pioneering work on computer-generated face animation in 1972 [1], more and more researchers around the world have recognized the research and practical value of three-dimensional face modeling and animation and have made many important contributions. As shown in Fig. 1, this work mainly addresses how to represent the shape changes of the face with an effective model, how to capture facial expressions accurately and quickly, how to build a detailed three-dimensional face reconstruction in real time, how to construct a digital face avatar, and how to drive it to generate a realistic face model.
In 2013, Cao et al. [2] proposed a real-time face tracking and animation method based on three-dimensional shape regression. The method uses a monocular camera to acquire face images and consists of two stages, preprocessing and real-time operation. In the preprocessing stage, the monocular camera captures user-specific posed expressions, including a series of facial expressions and head rotations, and a facial feature point labeling algorithm is then used to semi-automatically label feature points on the user's face images. Building on such labeled face images, Cao et al. [3] constructed FaceWarehouse in 2014, a three-dimensional facial expression database for visual computing applications. The database provides a bilinear face model covering the two attributes of identity and expression, which is used to fit and generate a user-specific expression blendshape model. From that blendshape model, a three-dimensional face shape vector consisting of the three-dimensional positions of the feature points is computed for each user image acquired by the camera. The regression algorithm adopts a two-level boosted regression on shape-indexed features, and all images together with their corresponding three-dimensional face shape vectors are used as input to train a user-specific three-dimensional shape regressor. In the real-time stage, the user-specific regressor estimates, from the current and previous frames, three-dimensional shape parameters and face motion parameters, including rigid head transformation parameters and non-rigid facial expression motion parameters; these parameters are then transferred and mapped to a virtual character, driving it to produce expression animation corresponding to the face motion.
However, the above method has certain limitations: for each new user, a preprocessing stage of about 45 minutes is required to generate the user-specific expression blendshape model and three-dimensional shape regressor. In 2014, Cao et al. [4] therefore proposed a real-time face tracking algorithm based on displaced dynamic expression regression. It is also a two-level cascaded regression algorithm, but it requires no preprocessing for a new user, achieving real-time facial expression tracking and capture for arbitrary users.
The real-time face tracking and animation method based on three-dimensional shape regression proposed by Cao et al. in 2013 [2] and the real-time face tracking algorithm based on displaced dynamic expression regression proposed by them in 2014 [4] both focus on how to track large-amplitude facial motion in video accurately, efficiently and robustly, such as large expressions like frowning, laughing and mouth opening, together with rigid motion such as head rotation and translation. However, both ignore detail information on the face, such as the wrinkle lines that appear when raising the eyebrows and the secondary motion of the facial skin caused by movement; these details are important cues that help people understand expressions and make the face more expressive.
Reference:
[1] Parke F I. Computer generated animation of faces[C]//ACM Conference. ACM, 1972: 451-457.
[2] Cao C, Weng Y, Lin S, et al. 3D shape regression for real-time facial animation[J]. ACM Transactions on Graphics, 2013, 32(4): 1.
[3] Cao C, Weng Y, Zhou S, et al. FaceWarehouse: A 3D Facial Expression Database for Visual Computing[J]. IEEE Transactions on Visualization & Computer Graphics, 2014, 20(3): 413-425.
[4] Cao C, Hou Q, Zhou K. Displaced dynamic expression regression for real-time facial tracking and animation[J]. ACM Transactions on Graphics, 2014, 33(4): 1-10.
[5] Ekman P, Friesen W V. Facial Action Coding System: Manual[M]. 1978.
[6] Duffy N, Helmbold D. Boosting Methods for Regression[J]. Machine Learning, 2002, 47(2-3): 153-200.
Disclosure of the Invention
The technical problem to be solved by the invention is to provide a method for realizing real-time interactive face animation based on a monocular camera, in which animation parameters are generated by fusing facial expression capture with speech emotion recognition, and visually convincing dynamic skin deformation animation is synthesized in real time with a skeleton-based technique, so that the generated real-time animation is richer, more natural, more realistic and more characteristic of the performer.
The technical solution adopted by the invention to solve this problem is as follows:
a method for realizing real-time human face interactive animation based on a monocular camera comprises the following steps:
s1, capturing a face video image through a monocular camera to obtain a face image sequence; simultaneously capturing voice input information through a voice sensor;
s2, marking face characteristic points in the face image sequence, and extracting facial expression animation parameters;
s3, extracting voice features from the captured voice input information and extracting voice emotion animation parameters;
and S4, learning a training sequence consisting of bone motion and the corresponding skin deformation through the action state space model, establishing a virtual character bone-skinning model based on the auxiliary bone controller, driving this model with the extracted facial expression animation parameters and speech emotion animation parameters, and generating the real-time interactive animation.
As a further optimization, in step S2, a double-layer cascade regression model is adopted to locate the face feature points, and a Candide-3 face model based on the Facial Action Coding System is used as the parameter carrier to extract the facial expression animation parameters.
As a further optimization, the double-layer cascade regression model adopts a two-level regression structure: the first level is a boosted regression model formed by stacking T weak regressors; the second level builds, for each weak regressor of the first level, a strong regressor by cascading K regression models.
As a further optimization, step S3 specifically includes:
s31, analyzing and extracting the speech emotion information features in the speech input information;
s32, performing emotion recognition on the extracted voice emotion characteristics to finish emotion judgment;
and S33, mapping the speech emotion result onto the AU-based Facial Action Coding System and extracting the corresponding AU parameters to obtain the speech emotion animation parameters.
As a further optimization, in step S4, the action state space model is composed of three key elements (S, A, {P}), wherein:
S represents the set of facial expression states of the virtual character in each frame;
A represents a set of actions; the parameters obtained through facial expression recognition and speech emotion recognition are used as an action vector to drive the change of state of the virtual character in the next frame;
P is the state transition probability, representing the probability distribution over the other states reached when the virtual character, in the expression state s_t ∈ S of the current frame t, performs action a_t ∈ A.
As a further optimization, in step S4, the method for establishing a virtual character bone-skinning model based on an auxiliary bone controller includes:
a. taking the previously rigged bone-skinning model of the virtual character, without auxiliary bones, as the original model;
b. optimizing the skinning weights of the bone-skinning model;
c. gradually inserting auxiliary bones into the regions where the approximation error between the original model and the target facial model is largest;
d. solving the two sub-problems of skinning weight optimization and auxiliary bone transformation optimization with a block coordinate descent algorithm;
e. constructing the auxiliary bone controller, in which the skin transformation q based on the auxiliary bone controller is represented as the sum of a static component x and a dynamic component y, q = x + y; wherein the static component x is computed from the main skeleton pose of the original model, and the dynamic component y is controlled by the action state space model.
The invention has the beneficial effects that:
1. Facial expressions are an outward expression of human emotion, but in some special cases they cannot fully convey a character's inner emotional state. If the face is driven point-to-point only by capturing and tracking facial expression feature points as parameters, the generated facial animation is clearly not vivid enough. For example, when a character smiles and when it laughs, the facial expressions look similar, yet different words are being spoken; adding speech emotion recognition therefore makes it possible to capture the character's current change of emotional state better from the speech side. By combining facial expression capture with speech emotion recognition, the invention can greatly improve the richness, naturalness and realism of the virtual character's expression animation.
2. Because the motion of bones and muscles jointly drives the changes of the skin surface, the invention adopts a bone-skinning model in order to simulate skin motion better: auxiliary bones are added automatically by a bone-based skinning decomposition algorithm, and the main bones, which simulate the motion of the head skeleton, and the auxiliary bones, which simulate muscle motion, jointly drive the virtual character's animation.
Drawings
FIG. 1 illustrates the current state of the art of three-dimensional face animation;
FIG. 2 is a schematic diagram of the implementation of real-time human face interaction animation according to the present invention;
FIG. 3 is a schematic diagram of an enhanced regression structure;
FIG. 4 is a schematic diagram of a two-layer cascade regression structure;
FIG. 5 is a diagram illustrating the state transition process of the ASSM.
Detailed Description
The invention aims to provide a method for realizing real-time interactive face animation based on a monocular camera, in which animation parameters are generated by fusing facial expression capture with speech emotion recognition, and visually convincing dynamic skin deformation animation is synthesized in real time with a skeleton-based technique, so that the generated real-time animation is richer, more natural, more realistic and more characteristic of the performer. To this end, the scheme of the invention is mainly realized in the following aspects:
1. In terms of face motion capture:
face motion capture includes two parts: non-rigid capture of facial expressions and head rigid transformation capture. According to the unique muscle movement characteristics of the facial expressions, the facial five sense organs are coordinated as a unified whole to show each facial expression. The method uses the intermediate description method with invariance as the reliable feature representation of the facial expression recognition to make up the deficiency of the bottom layer feature in the facial expression recognition.
2. Speech emotion recognition: the performer's current emotional state is captured from the speech input, and the speech emotion animation parameters corresponding to that state are generated through speech feature extraction, dimensionality reduction, classification and related steps.
3. Target digital avatar expression: a bone-based dynamic avatar expression method is used, which learns a training sequence consisting of bone motion and the corresponding skin deformation sequence to obtain the optimal transfer of nonlinear, complex deformation, including that of soft tissue. The user's expression semantics extracted by face motion capture drive the motion of the character's head bones and procedurally control the auxiliary bones, thereby simulating the dynamic deformation of the facial skin.
In terms of specific implementation, the principle of the method for realizing real-time interactive face animation based on a monocular camera is shown in Fig. 2, and the method comprises the following steps:
(1) Capturing a face video image through a monocular camera to obtain a face image sequence; simultaneously capturing voice input information through a voice sensor;
(2) Capturing and tracking a human face: marking human face characteristic points from the captured human face image, and extracting human face expression animation parameters;
the positioning of the face feature points is a key link in face recognition, face tracking, face animation and three-dimensional face modeling. Due to factors such as human face diversity and illumination, locating human face feature points in natural environments is still a difficult challenge. The specific definition of the human face characteristic points is as follows: for a face shape containing N face feature points S = [ x ] 1 ,y 1 ,...,x N ,y N ]For an input face picture, the aim of face feature point positioning is to estimate a face feature point shape S, so that S is equal to the real shape of the face feature point
Figure GDA0004037009640000051
Difference of (2)Minimum value, S and->
Figure GDA0004037009640000052
The minimized alignment difference between can be defined as L 2 -normal form
Figure GDA0004037009640000053
This equation is used to guide the training of the face feature point locator or to evaluate the performance of the face feature point location algorithm.
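As a small illustration only (not part of the patent text), the L_2 alignment error above can be computed as in the following sketch; the flat [x_1, y_1, ..., x_N, y_N] layout follows the definition above, while the landmark count and the numbers are made up.

```python
import numpy as np

def alignment_error(S_est: np.ndarray, S_true: np.ndarray) -> float:
    """L2 alignment error between an estimated and a ground-truth face shape.

    Both shapes are flat vectors [x_1, y_1, ..., x_N, y_N].
    """
    return float(np.linalg.norm(S_est - S_true))

# Hypothetical usage with N = 68 landmarks (a common choice, not specified in the patent).
N = 68
S_true = np.random.rand(2 * N)
S_est = S_true + 0.01 * np.random.randn(2 * N)
print(alignment_error(S_est, S_true))
```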
The invention adopts an algorithm framework based on regression models to perform real-time, efficient face detection and tracking.
a) Boosted Regression
Boosted regression combines T weak regressors (R_1, ..., R_t, ..., R_T) by stacking them. For a given face sample I and an initial shape S_0, each regressor computes a shape increment from the sample features and updates the current shape in a cascaded manner:

S_t = S_{t-1} + R_t(I, S_{t-1}),   t = 1, ..., T    (1)

R_t(I, S_{t-1}) denotes the shape increment computed by regressor R_t from the input sample image I and the previous shape S_{t-1}; R_t is determined by both I and S_{t-1}, and is learned using shape-indexed features, as shown in Fig. 3.

Given N training samples \{(I_i, \hat{S}_i)\}_{i=1}^{N}, where \hat{S}_i denotes the ground-truth shape of the i-th sample image I_i, the regressors (R_1, ..., R_t, ..., R_T) are trained in turn until the training error no longer improves. Each R_t is computed by minimizing the alignment error:

R_t = \arg\min_{R} \sum_{i=1}^{N} \left\| \hat{S}_i - \left( S_i^{t-1} + R(I_i, S_i^{t-1}) \right) \right\|

where S_i^{t-1} denotes the previous shape estimate of the i-th image and the output of R_t is a shape increment.
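The following sketch (an illustration only, not the patent's implementation) shows how a trained cascade of weak regressors would be applied at test time according to equation (1); the regressors are assumed to be callables that map an image and a current shape to a shape increment.

```python
import numpy as np
from typing import Callable, Sequence

# A weak regressor maps (image, current shape) -> shape increment.
WeakRegressor = Callable[[np.ndarray, np.ndarray], np.ndarray]

def run_cascade(image: np.ndarray,
                S0: np.ndarray,
                regressors: Sequence[WeakRegressor]) -> np.ndarray:
    """Apply equation (1): S_t = S_{t-1} + R_t(I, S_{t-1}) for t = 1..T."""
    S = S0.copy()
    for R_t in regressors:
        S = S + R_t(image, S)   # each stage refines the current shape estimate
    return S
```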
b) Two-level Boosted Regression
The boosted regression above regresses the entire shape, but the large appearance variation of the input images and the coarse initial face shape make a single level of weak regressors insufficient: a single regressor is too weak, converges slowly during training, and gives poor results at test time. To make training converge faster and more stably, the invention adopts a two-level cascade structure, as shown in Fig. 4.
The first level uses the boosted regression model described above. Each first-level regressor R_t is in turn built from K regression models, R_t = (r_1, ..., r_k, ..., r_K); here r is called a primitive regressor, and the K primitive regressors are cascaded into one strong regressor. The difference between the two levels is that the input S_{t-1} is different for each first-level regressor R_t, whereas every second-level regressor r_k receives the same input; for example, all primitive regressors inside R_t take S_{t-1} as their input.
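A minimal structural sketch of this two-level cascade follows (illustrative only; how the K primitive regressors combine their outputs inside one stage is an assumption of the sketch, here their increments are simply accumulated while the stage input shape is held fixed).

```python
import numpy as np
from typing import Callable, Sequence

PrimitiveRegressor = Callable[[np.ndarray, np.ndarray], np.ndarray]

def run_two_level_cascade(image: np.ndarray,
                          S0: np.ndarray,
                          stages: Sequence[Sequence[PrimitiveRegressor]]) -> np.ndarray:
    """Two-level cascade: T stages, each a strong regressor made of K primitive regressors.

    All K primitive regressors of stage t see the same stage input S_{t-1};
    their increments are accumulated to form the stage output S_t.
    """
    S = S0.copy()
    for primitive_regressors in stages:          # first level: T stages
        stage_input = S.copy()                   # fixed input for the whole stage
        increment = np.zeros_like(S)
        for r_k in primitive_regressors:         # second level: K primitive regressors
            increment += r_k(image, stage_input)
        S = stage_input + increment
    return S
```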
When generating the facial expression animation parameters, the invention uses the AU-based Facial Action Coding System (FACS) proposed by Ekman et al. [5], which describes 44 basic action units, each controlled by an underlying facial part or muscle group. Specifically, the Candide-3 face model, which is built on the Facial Action Coding System, is used as the parameter carrier to extract the AU parameters E corresponding to the facial expression.
The Candide-3 face model is expressed as:

g = R(\bar{g} + S\sigma + A\alpha) + t

where \bar{g} denotes the base shape of the model, S is the static deformation matrix, A is the dynamic deformation matrix, \sigma is the static deformation parameter, \alpha is the dynamic deformation parameter, and R and t denote the rigid head rotation matrix and the head translation, respectively. g is the column vector of the model's vertex coordinates and represents a specific facial expression shape. The model g is therefore determined by the four parameters R, t, \alpha and \sigma.
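A small sketch of evaluating this Candide-3 parameterization with NumPy is given below; the matrix dimensions are assumptions of the sketch and the Candide-3 mesh data itself is not reproduced here.

```python
import numpy as np

def candide3_vertices(g_bar: np.ndarray,   # (3N,) base shape, vertices stacked as x, y, z
                      S: np.ndarray,       # (3N, m_s) static deformation (shape) units
                      A: np.ndarray,       # (3N, m_a) dynamic deformation (action) units
                      sigma: np.ndarray,   # (m_s,) static deformation parameters
                      alpha: np.ndarray,   # (m_a,) dynamic deformation parameters (AU intensities)
                      R: np.ndarray,       # (3, 3) rigid head rotation matrix
                      t: np.ndarray) -> np.ndarray:  # (3,) head translation
    """Evaluate g = R(g_bar + S*sigma + A*alpha) + t for every vertex."""
    deformed = (g_bar + S @ sigma + A @ alpha).reshape(-1, 3)  # apply shape and action units
    return (deformed @ R.T + t).reshape(-1)                    # rigid transform, flatten back
```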
(3) Extracting voice features from the captured voice input information and extracting the speech emotion animation parameters;
Analyzing and extracting the speech emotion features from the speech input information; performing emotion recognition on the extracted features to complete the emotion judgment; and mapping the speech emotion result onto the AU-based Facial Action Coding System to extract the corresponding AU parameters and obtain the speech emotion animation parameter V.
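The sketch below illustrates this step; the concrete speech features, the classifier and the emotion-to-AU table are not specified by the patent, so all of them are placeholder assumptions.

```python
import numpy as np

# Hypothetical emotion -> AU-parameter table; the patent does not give a concrete mapping,
# so these AU indices and intensities are placeholders for illustration only.
EMOTION_TO_AU = {
    "happy": {12: 0.8, 6: 0.5},   # lip corner puller, cheek raiser
    "sad":   {15: 0.7, 1: 0.4},   # lip corner depressor, inner brow raiser
    "angry": {4: 0.8, 7: 0.5},    # brow lowerer, lid tightener
}

def speech_emotion_animation_params(features: np.ndarray, classifier) -> dict:
    """Sketch of step S3: speech features -> emotion label -> AU parameters V.

    `features` is an already extracted, dimensionality-reduced speech feature vector;
    `classifier` is any fitted classifier with a scikit-learn style predict().
    """
    emotion = classifier.predict(features.reshape(1, -1))[0]  # emotion judgment (S32)
    return EMOTION_TO_AU.get(emotion, {})                     # map result to AU parameters (S33)
```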
(4) Learning a training sequence consisting of bone motion and corresponding skin deformation through an action state space model, establishing a virtual character bone skinning model based on an auxiliary bone controller, driving the virtual character bone skinning model through the extracted facial expression animation parameters and voice emotion animation parameters, and generating real-time interactive animation.
(a) Action State Space Model (ASSM):
the action state space model consists of three key elements (S, a, { P }), where:
s: representing a state set, facial expression states of the virtual character (such as happy, sad, etc.);
a: representing a group of action sets, taking parameters obtained through facial expression recognition and speech emotion recognition as a group of action vectors, and driving the change state of the next frame of virtual character;
p is state transition probability which represents the expression state s of the virtual character in the current frame t t E.g. S, by performing action a t e.A and then to other states.
The dynamic process of ASSM is as follows: virtual character in state s 0 Motion vector a of the performer 0 E is driven by A, and the state is transferred to the next frame state s according to the probability P 1 Then perform action a 1 8230, and so on we can get the process shown in fig. 5.
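The sketch below plays out one such transition step; the expression states and the transition model are hypothetical stand-ins, since the patent does not specify how P is obtained.

```python
import numpy as np

STATES = ["neutral", "happy", "sad"]   # hypothetical facial expression states in S

def assm_step(state: str, action: np.ndarray, P) -> str:
    """One ASSM transition: in state s_t, perform action a_t and sample s_{t+1} ~ P(. | s_t, a_t).

    `P(state, action)` is assumed to return a probability vector over STATES.
    """
    probs = P(state, action)
    return str(np.random.choice(STATES, p=probs))

# Placeholder transition model: mostly stay in the current state, ignore the action.
def dummy_P(state: str, action: np.ndarray) -> np.ndarray:
    probs = np.full(len(STATES), 0.1)
    probs[STATES.index(state)] = 0.8
    return probs / probs.sum()

print(assm_step("neutral", np.zeros(4), dummy_P))
```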
(b) Auxiliary bone framework:
Auxiliary bone placement. Let P be the index set of the main bones, used to compute the global transformation matrix G_p of each main bone p ∈ P. Let \bar{v}_i and \bar{G}_p denote the position of the i-th vertex of the static skin and the main bone matrix in the rest pose, so that G_p \bar{G}_p^{-1} is the skin transformation matrix corresponding to the main bone. The index set of the secondary bones, called auxiliary bones, is denoted by H, and the corresponding skinning formula is

v_i = \sum_{p \in P} w_{i,p} G_p \bar{G}_p^{-1} \bar{v}_i + \sum_{h \in H} w_{i,h} S_h \bar{v}_i

where v_i denotes the position of the deformed skin vertex, w_{i,p} and w_{i,h} are the skinning weights, and S_h denotes the skin matrix corresponding to the h-th auxiliary bone. The first term of this equation corresponds to the skin deformation driven by the main bones, and the second term provides additional control over the deformation through the auxiliary bones. The number of auxiliary bones is given by the designer to balance deformation quality against computational cost.
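A compact sketch of evaluating this skinning formula with NumPy follows; homogeneous 4x4 transforms and dense weight matrices are assumptions of the sketch.

```python
import numpy as np

def skin_vertices(v_bar: np.ndarray,      # (N, 3) rest-pose vertex positions
                  W_main: np.ndarray,     # (N, |P|) skinning weights for main bones
                  G: np.ndarray,          # (|P|, 4, 4) posed main-bone global transforms
                  G_bar: np.ndarray,      # (|P|, 4, 4) rest-pose main-bone global transforms
                  W_aux: np.ndarray,      # (N, |H|) skinning weights for auxiliary bones
                  S: np.ndarray) -> np.ndarray:  # (|H|, 4, 4) auxiliary-bone skin transforms
    """v_i = sum_p w_ip G_p G_bar_p^-1 v_bar_i + sum_h w_ih S_h v_bar_i."""
    v_h = np.concatenate([v_bar, np.ones((len(v_bar), 1))], axis=1)      # homogeneous (N, 4)
    main_T = G @ np.linalg.inv(G_bar)                                    # (|P|, 4, 4)
    main_part = np.einsum('ip,pab,ib->ia', W_main, main_T, v_h)[:, :3]   # main-bone term
    aux_part = np.einsum('ih,hab,ib->ia', W_aux, S, v_h)[:, :3]          # auxiliary-bone term
    return main_part + aux_part
```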
Skinning decomposition. The skinning decomposition is split into two sub-problems. The first sub-problem estimates the optimal skinning weights w_{i,h} and skin matrices S_h^t that best approximate the training data at every frame t ∈ T. The second sub-problem approximates, with the auxiliary bone control model, the discrete transformations S_h^t obtained for the original skeleton.

Given the main skeleton skin matrices G_p^t and the corresponding vertex animation v_i^t, the skinning decomposition is formulated as a constrained least-squares problem that minimizes, over the whole training data set, the sum of squared shape differences between the original model and the target model:

\min_{w,\, S} \sum_{t \in T} \sum_{i \in V} \left\| v_i^t - \left( \sum_{p \in P} w_{i,p} G_p^t \bar{G}_p^{-1} \bar{v}_i + \sum_{h \in H} w_{i,h} S_h^t \bar{v}_i \right) \right\|_2^2

subject to

w_{i,j} \ge 0, \qquad \sum_{j} w_{i,j} = 1, \qquad \| w_i \|_0 \le k

In the above formulation, \| \cdot \|_n denotes the l_n norm and V denotes the index set of the skin vertices. The constant k limits the number of bones that can influence each skin vertex, balancing computational cost against accuracy.
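The block coordinate descent used for the two sub-problems can be sketched structurally as follows; the inner solvers are passed in as callables because the patent does not spell out their closed forms, so this is only an illustration of the alternating scheme.

```python
import numpy as np
from typing import Callable, Tuple

def skinning_decomposition(v_bar: np.ndarray,           # (N, 3) rest-pose vertices
                           targets: np.ndarray,         # (F, N, 3) per-frame target vertices
                           W0: np.ndarray,              # (N, B) initial skinning weights
                           S0: np.ndarray,              # (F, B, 4, 4) initial bone transforms
                           solve_transforms: Callable,  # (W, v_bar, targets) -> S
                           solve_weights: Callable,     # (S, v_bar, targets) -> W
                           reconstruct: Callable,       # (W, S, v_bar) -> (F, N, 3)
                           iters: int = 10) -> Tuple[np.ndarray, np.ndarray]:
    """Block coordinate descent: alternately fix the weights and solve for the transforms,
    then fix the transforms and solve for the weights, until the squared error stops improving."""
    W, S = W0, S0
    prev_err = np.inf
    for _ in range(iters):
        S = solve_transforms(W, v_bar, targets)   # block 1: bone transform update
        W = solve_weights(S, v_bar, targets)      # block 2: skinning weight update
        err = float(np.sum((reconstruct(W, S, v_bar) - targets) ** 2))
        if prev_err - err < 1e-8:                 # no further improvement
            break
        prev_err = err
    return W, S
```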
Auxiliary bone controller. Assuming the auxiliary bones are driven by an original skeleton that has only spherical joints, the pose of an auxiliary bone is uniquely determined by the rotations r_p ∈ SO(3) of all rotational components of the main skeleton. The pose is written as a column vector:

u := \Delta t_0 \| \Delta r_0 \| r_1 \| r_2 \| \cdots \| r_{|P|}    (9)

where u ∈ R^{3|P|+6}, \| denotes the concatenation operator on vector values, |P| is the number of main bones, \Delta t_0 ∈ R^3 denotes the translational change of the root node, and \Delta r_0 ∈ SO(3) denotes the change of orientation of the root node.

Each auxiliary bone is attached to a main bone as its child bone. Let \Phi(h) denote the main bone corresponding to the h-th auxiliary bone and S_h the skin matrix of the h-th auxiliary bone; S_h is then given by the local transformation L_h composed with the global transformation of \Phi(h). The local transformation L_h consists of a translational component t_h and a rotational component r_h.

The model assumes that the skin deformation is modeled as the combination of a static deformation and a dynamic deformation: the former is determined by the pose of the main skeleton, while the latter depends on the skeleton motion and the change of the skin deformation over past time steps. Accordingly, the skin transformation q of an auxiliary bone is represented as the sum of a static component x and a dynamic component y, q = x + y. The static transformation x is computed from the skeletal pose, and the dynamic transformation y is controlled using a state space model that takes into account accumulated information from previous skeletal poses and auxiliary bone transformations.
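A minimal sketch of the static-plus-dynamic decomposition q = x + y is given below; the linear maps and the simple state-space update are placeholder choices, since the patent only fixes the overall structure.

```python
import numpy as np

class AuxiliaryBoneController:
    """Skin transformation of one auxiliary bone: q = x + y.

    x is a static function of the current skeleton pose vector u (equation (9));
    y follows a simple linear state-space update driven by u and by its own past values.
    The matrices below are illustrative parameters, not values given by the patent.
    """

    def __init__(self, dim_u: int, dim_q: int, rng=np.random.default_rng(0)):
        self.Wx = 0.1 * rng.standard_normal((dim_q, dim_u))     # static map: u -> x
        self.A = 0.8 * np.eye(dim_q)                            # dynamic state feedback
        self.B = 0.1 * rng.standard_normal((dim_q, dim_u))      # dynamic input map
        self.y = np.zeros(dim_q)                                # accumulated dynamic component

    def step(self, u: np.ndarray) -> np.ndarray:
        x = self.Wx @ u                          # static component from the skeleton pose
        self.y = self.A @ self.y + self.B @ u    # dynamic component with memory of past steps
        return x + self.y                        # q = x + y

controller = AuxiliaryBoneController(dim_u=12, dim_q=6)
q = controller.step(np.zeros(12))
```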

Claims (5)

1. A method for realizing real-time human face interactive animation based on a monocular camera is characterized by comprising the following steps:
s1, capturing a face video image through a monocular camera to obtain a face image sequence; simultaneously capturing voice input information through a voice sensor;
s2, marking face characteristic points in the face image sequence, and extracting facial expression animation parameters;
s3, extracting voice features from the captured voice input information and extracting voice emotion animation parameters;
s4, learning a training sequence consisting of skeleton motion and corresponding skin deformation through the action state space model, establishing a virtual character skeleton skin model based on an auxiliary bone controller, driving the virtual character skeleton skin model through the extracted facial expression animation parameters and voice emotion animation parameters, and generating real-time interactive animation;
in step S4, the action state space model is composed of three key elements (S, A, {P}), wherein:
S represents the set of facial expression states of the virtual character in each frame;
A represents a set of actions; the parameters obtained through facial expression recognition and speech emotion recognition are used as an action vector to drive the change of state of the virtual character in the next frame;
P is the state transition probability, representing the probability distribution over the other states reached when the virtual character, in the expression state s_t ∈ S of the current frame t, performs action a_t ∈ A.
2. The method for realizing the real-time interactive animation of the human face based on the monocular camera as recited in claim 1,
in step S2, a double-layer cascade regression model is adopted to locate the face feature points, and a Candide-3 face model based on the Facial Action Coding System is used as the parameter carrier to extract the facial expression animation parameters.
3. The method for realizing real-time interactive animation of human faces based on a monocular camera as recited in claim 2,
the double-layer cascade regression model adopts a two-level regression structure: the first level is a boosted regression model formed by stacking T weak regressors; the second level builds, for each weak regressor of the first level, a strong regressor by cascading K regression models.
4. The method for realizing real-time interactive animation of human faces based on a monocular camera as recited in claim 1,
step S3 specifically includes:
s31, analyzing and extracting the speech emotion information features in the speech input information;
s32, performing emotion recognition on the extracted voice emotion characteristics to finish emotion judgment;
and S33, mapping the speech emotion result onto the AU-based Facial Action Coding System and extracting the corresponding AU parameters to obtain the speech emotion animation parameters.
5. The method for realizing real-time interactive animation of human faces based on a monocular camera as recited in claim 1,
in step S4, the method for establishing a virtual character skeleton skin model based on an auxiliary bone controller includes:
a. taking the previously rigged bone-skinning model of the virtual character, without auxiliary bones, as the original model;
b. optimizing the skinning weights of the bone-skinning model;
c. gradually inserting auxiliary bones into the regions where the approximation error between the original model and the target facial model is largest;
d. solving the two sub-problems of skinning weight optimization and auxiliary bone transformation optimization with a block coordinate descent algorithm;
e. constructing the auxiliary bone controller, in which the skin transformation q based on the auxiliary bone controller is represented as the sum of a static component x and a dynamic component y, q = x + y; wherein the static component x is computed from the main skeleton pose of the original model, and the dynamic component y is controlled by the action state space model.
CN201910839412.7A 2019-09-03 2019-09-03 Method for realizing real-time human face interactive animation based on monocular camera Active CN110599573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910839412.7A CN110599573B (en) 2019-09-03 2019-09-03 Method for realizing real-time human face interactive animation based on monocular camera


Publications (2)

Publication Number Publication Date
CN110599573A CN110599573A (en) 2019-12-20
CN110599573B true CN110599573B (en) 2023-04-11

Family

ID=68857773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910839412.7A Active CN110599573B (en) 2019-09-03 2019-09-03 Method for realizing real-time human face interactive animation based on monocular camera

Country Status (1)

Country Link
CN (1) CN110599573B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813491B (en) * 2020-08-19 2020-12-18 广州汽车集团股份有限公司 Vehicle-mounted assistant anthropomorphic interaction method and device and automobile
CN111968207B (en) * 2020-09-25 2021-10-29 魔珐(上海)信息科技有限公司 Animation generation method, device, system and storage medium
CN112190921A (en) * 2020-10-19 2021-01-08 珠海金山网络游戏科技有限公司 Game interaction method and device
CN112419454B (en) * 2020-11-25 2023-11-28 北京市商汤科技开发有限公司 Face reconstruction method, device, computer equipment and storage medium
CN112669424A (en) * 2020-12-24 2021-04-16 科大讯飞股份有限公司 Expression animation generation method, device, equipment and storage medium
CN113050794A (en) 2021-03-24 2021-06-29 北京百度网讯科技有限公司 Slider processing method and device for virtual image
CN113269872A (en) * 2021-06-01 2021-08-17 广东工业大学 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN113554745B (en) * 2021-07-15 2023-04-07 电子科技大学 Three-dimensional face reconstruction method based on image
CN115588224B (en) * 2022-10-14 2023-07-21 中南民族大学 Virtual digital person generation method and device based on face key point prediction
CN115731330A (en) * 2022-11-16 2023-03-03 北京百度网讯科技有限公司 Target model generation method, animation generation method, device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102473320A (en) * 2009-07-13 2012-05-23 微软公司 Bringing a visual representation to life via learned input from the user
CN106919251A (en) * 2017-01-09 2017-07-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1320497C (en) * 2002-07-03 2007-06-06 中国科学院计算技术研究所 Statistics and rule combination based phonetic driving human face carton method
JP4631078B2 (en) * 2005-07-27 2011-02-16 株式会社国際電気通信基礎技術研究所 Statistical probability model creation device, parameter sequence synthesis device, lip sync animation creation system, and computer program for creating lip sync animation
US8743125B2 (en) * 2008-03-11 2014-06-03 Sony Computer Entertainment Inc. Method and apparatus for providing natural facial animation
CN103093490B (en) * 2013-02-02 2015-08-26 浙江大学 Based on the real-time face animation method of single video camera
CN103218841B (en) * 2013-04-26 2016-01-27 中国科学技术大学 In conjunction with the three-dimensional vocal organs animation method of physiological models and data-driven model
CN103279970B (en) * 2013-05-10 2016-12-28 中国科学技术大学 A kind of method of real-time voice-driven human face animation
CN103824089B (en) * 2014-02-17 2017-05-03 北京旷视科技有限公司 Cascade regression-based face 3D pose recognition method
CN103942822B (en) * 2014-04-11 2017-02-01 浙江大学 Facial feature point tracking and facial animation method based on single video vidicon
CN105139438B (en) * 2014-09-19 2018-01-12 电子科技大学 video human face cartoon generation method
JP2015092347A (en) * 2014-11-19 2015-05-14 Necプラットフォームズ株式会社 Emotion-expressing animation face display system, method and program
US11736756B2 (en) * 2016-02-10 2023-08-22 Nitin Vats Producing realistic body movement using body images
WO2017137947A1 (en) * 2016-02-10 2017-08-17 Vats Nitin Producing realistic talking face with expression using images text and voice
CN105787448A (en) * 2016-02-28 2016-07-20 南京信息工程大学 Facial shape tracking method based on space-time cascade shape regression
CN106447785A (en) * 2016-09-30 2017-02-22 北京奇虎科技有限公司 Method for driving virtual character and device thereof
CN106653052B (en) * 2016-12-29 2020-10-16 Tcl科技集团股份有限公司 Virtual human face animation generation method and device
CN107274464A (en) * 2017-05-31 2017-10-20 珠海金山网络游戏科技有限公司 A kind of methods, devices and systems of real-time, interactive 3D animations
CN107886558A (en) * 2017-11-13 2018-04-06 电子科技大学 A kind of human face expression cartoon driving method based on RealSense
CN109116981A (en) * 2018-07-03 2019-01-01 北京理工大学 A kind of mixed reality interactive system of passive touch feedback
CN109493403A (en) * 2018-11-13 2019-03-19 北京中科嘉宁科技有限公司 A method of human face animation is realized based on moving cell Expression Mapping
CN109635727A (en) * 2018-12-11 2019-04-16 昆山优尼电能运动科技有限公司 A kind of facial expression recognizing method and device
CN109712627A (en) * 2019-03-07 2019-05-03 深圳欧博思智能科技有限公司 It is a kind of using speech trigger virtual actor's facial expression and the voice system of mouth shape cartoon
CN110009716B (en) * 2019-03-28 2023-09-26 网易(杭州)网络有限公司 Facial expression generating method and device, electronic equipment and storage medium
CN110070944B (en) * 2019-05-17 2023-12-08 段新 Social function assessment training system based on virtual environment and virtual roles


Also Published As

Publication number Publication date
CN110599573A (en) 2019-12-20

Similar Documents

Publication Publication Date Title
CN110599573B (en) Method for realizing real-time human face interactive animation based on monocular camera
Guo et al. Ad-nerf: Audio driven neural radiance fields for talking head synthesis
Magnenat-Thalmann et al. Handbook of virtual humans
Ersotelos et al. Building highly realistic facial modeling and animation: a survey
CN108288072A (en) A kind of facial expression synthetic method based on generation confrontation network
CN107274464A (en) A kind of methods, devices and systems of real-time, interactive 3D animations
Li et al. A survey of computer facial animation techniques
CN114170353A (en) Multi-condition control dance generation method and system based on neural network
CN108908353B (en) Robot expression simulation method and device based on smooth constraint reverse mechanical model
Krishna SignPose: Sign language animation through 3D pose lifting
CN114967937B (en) Virtual human motion generation method and system
CN110853131A (en) Virtual video data generation method for behavior recognition
Kobayashi et al. Motion Capture Dataset for Practical Use of AI-based Motion Editing and Stylization
CN115914660A (en) Method for controlling actions and facial expressions of digital people in meta universe and live broadcast
Victor et al. Pose Metrics: a New Paradigm for Character Motion Edition
Xia et al. Technology based on interactive theatre performance production and performance platform
Guo et al. Scene Construction and Application of Panoramic Virtual Simulation in Interactive Dance Teaching Based on Artificial Intelligence Technology
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
Zhang et al. Implementation of Animation Character Action Design and Data Mining Technology Based on CAD Data
Tian et al. Augmented Reality Animation Image Information Extraction and Modeling Based on Generative Adversarial Network
de Aguiar et al. Representing and manipulating mesh-based character animations
Jia et al. A Novel Training Quantitative Evaluation Method Based on Virtual Reality
Johnson A Survey of Computer Graphics Facial Animation Methods: Comparing Traditional Approaches to Machine Learning Methods
Venkatrayappa et al. Survey of 3D Human Body Pose and Shape Estimation Methods for Contemporary Dance Applications
Gao The Application of Virtual Technology Based on Posture Recognition in Art Design Teaching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant