CN113255457A - Animation character facial expression generation method and system based on facial expression recognition - Google Patents

Animation character facial expression generation method and system based on facial expression recognition Download PDF

Info

Publication number
CN113255457A
CN113255457A CN202110470655.5A CN202110470655A
Authority
CN
China
Prior art keywords
animation
facial
character
face
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110470655.5A
Other languages
Chinese (zh)
Inventor
潘烨
张睿思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110470655.5A priority Critical patent/CN113255457A/en
Publication of CN113255457A publication Critical patent/CN113255457A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/02Non-photorealistic rendering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of animation production and discloses an animation character facial expression generation method based on facial expression recognition, which comprises the following steps: S1, recognizing the expressions of the human face and the animation character in a face data set and an animation data set through an emotion recognition network, and matching face pictures with animation data pictures; S2, learning, through deep learning, the mapping relation between face pictures with the same expression and character skeleton parameters to obtain an animation training network; S3, for each input frame of the video, using the animation training network to output a skeleton parameter result; and S5, performing three-dimensional reconstruction on the input picture to obtain the motion parameters of the character, and optimizing the skeleton parameters in combination with the geometric information of the face picture. Correspondingly, the invention also discloses an animation character facial expression generation system based on facial expression recognition. By controlling geometric features of the face such as the mouth and eyes more finely, the invention improves the audience's perception of the character's emotion changes.

Description

Animation character facial expression generation method and system based on facial expression recognition
Technical Field
The invention belongs to the technical field of animation production, and particularly relates to an animation character facial expression generation method and system based on facial expression recognition.
Background
In facial motion capture, traditional approaches such as ARKit use a camera to extract facial geometric information and map it onto a 3D model; the parameter information of the 3D model is obtained by learning the mapping relation between the two-dimensional video and the three-dimensional model parameters. In addition, commercial software such as Faceware obtains the 3D model parameters by reconstructing the two-dimensional input picture. These methods can effectively extract the geometric information of the face, but it remains difficult for viewers to perceive changes in the character's expression. Models such as ExprGen and DeepExpr attempt to improve on this problem, but while optimizing the appearance information they may lose important face geometry information, making it difficult to control the details of the animated character's face. These problems make it difficult to accurately express the character's emotional information while accurately conveying changes in the facial information. When viewers watch a video, changes in the animated character's expression and geometric details have a very important influence on the viewing experience.
Expression recognition and analysis are widely applied in fields such as human-computer interaction and computer graphics. In 1978, Ekman suggested that the facial muscle state of a particular expression could be characterized using the Facial Action Coding System (FACS). In general, the face can be divided into an upper face and a lower face, with only a weak association between the two: upper-face expressions involve the eyes, eyebrows, and cheeks, while lower-face expressions involve the lips, the nasal root, and the region between them. FACS defines 46 basic action units, which form more than seven thousand combinations capable of characterizing most observed expressions.
In animation, FACS is widely used for emotional perception and manipulation of cartoon characters. FACSGen controls the three-dimensional facial expression of a cartoon character by controlling action units, but superimposing units at the micro level makes it hard for users to perceive the character's emotion at the holistic level. HapFACS therefore improves on this, allowing animators to control the character expression both at the action-unit level and at the level of overall mood. However, micro- and macro-level controls also limit the generalization of character expressions, and the same expression is difficult to migrate from one character to another. The spatiotemporal facial expression animation editing method published by Wanxian Mei is used for post-editing of facial animation to meet animators' special application requirements: a Laplacian-based facial expression synthesis technique propagates the displacement of the edited feature points to the other vertices of the face model in the spatial domain, while a Gaussian function propagates the user's edits to adjacent frames of the animation sequence in the time domain, ensuring a smooth transition of the facial expression animation while keeping the geometric details of the face. Blue Sky Studios proposed a method that uses differential-subspace reconstruction to automatically generate character skeletons: by learning labels on the differential coordinates and then reconstructing the subspace meshes, deformation error can be effectively reduced and generalization improved.
For face tracking, conventional methods detect each window of an input picture using a multi-layer perceptron or ensemble learning, find the parts containing a face, and combine them. With the development of deep learning, deep convolutional networks and their many variants have greatly improved face detection. Models such as Fast R-CNN and YOLO can efficiently and accurately detect multiple faces in a picture. Sun X et al. combined the R-CNN framework with feature cascading, multi-scale training, model pre-training, and proper calibration of key parameters, achieving 83% accuracy on the FDDB benchmark. Researchers at the Chinese Academy of Sciences such as Wu S introduced a hierarchical attention mechanism into face recognition, extracting local face features with a Gaussian kernel model and modeling the relationships among features with an LSTM to achieve hierarchical perception of face features. The model achieved 96.42%, 94.84%, and 74.60% accuracy on the FDDB, WIDER FACE, and UFDD data sets, respectively.
By extracting facial features, facial emotion information can be effectively obtained and used as feature vectors for subsequent recognition and expression migration. Researchers at Imperial College London such as Zafeiriou S characterize facial expressions using a sparse signal processing method derived from the l1 optimization problem and classify the feature vectors with an SVM algorithm. By applying grid-based preprocessing to the face, the algorithm achieves better results than processing the raw picture directly, reaching 92.4% accuracy on the CK data set.
With the development of deep learning technology, deep neural networks are used for emotion perception from face images. Unlike traditional feature extraction, in deep learning the two processes of face detection and feature extraction are performed together. Emotion recognition with deep learning can be divided into three steps: preprocessing, deep feature learning, and deep feature classification. Preprocessing refers to operations such as face extraction, rotation correction, and data augmentation on the input image; the picture features are then extracted in an end-to-end learning manner. Yang H, a researcher at Binghamton University, New York, and colleagues learn facial expression information with a De-expression Residue Learning (DeRL) method. They first use a generative model to produce the neutral face picture corresponding to an expressive face; although the expression information is ultimately filtered out, the emotion information remains stored in the intermediate layers of the generative model, and facial expressions are classified by learning this residual information. The algorithm achieved 97.30%, 88.00%, 73.23%, and 84.17% accuracy on the CK+, Oulu-CASIA, MMI, and BU-3DFE data sets, respectively.
Traditional animation capture methods use a depth camera or a 3D scanner to directly extract face information and map it onto the animated character, but because the equipment is expensive and the setup is complex, they are difficult to use widely. For example, Weise T of the Swiss Federal Institute of Technology combines face geometric information with pre-stored face depth information and optimizes a probabilistic problem to obtain a blendshape parameter sequence, effectively improving the speed and stability of expression migration. In practice, however, the limited resolution makes it difficult to capture subtle geometric and motion changes of the human face, and changes in facial expression are also hard to capture. Researchers at Tsinghua University such as Bouaziz S improved the 3D parameter optimization algorithm to obtain a low-dimensional face parameter representation, effectively speeding up the extraction of blendshape parameters: an RGB-D camera first captures a face depth picture, which is mapped onto a 3D face model; the parameters are then reduced with PCA and geometrically transformed before being mapped onto the animation model. Using a 2D picture alone as the expression input can effectively reduce the equipment requirement and makes wide use of animation expression migration possible. Researchers at Zhejiang University such as Cao C proposed a calibration-free method for real-time expression migration: the algorithm first regresses the extracted two-dimensional facial feature points with a DDE (Displaced Dynamic Expression) model, then optimizes the face parameters by taking the camera error into account and maps the parameters onto the cartoon avatar.
Disclosure of Invention
The invention provides an animation character facial expression generation method and system based on facial expression recognition, aiming to solve the problem in existing animation production that face motion capture must both reproduce the geometric information of the face and accurately convey the facial expression.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a facial expression generation method of an animation character based on facial expression recognition comprises the following steps:
S1, recognizing the expressions of the human face and the animation character in a face data set and an animation data set through an emotion recognition network, and matching face pictures with animation data pictures;
S2, learning, through deep learning, the mapping relation between face pictures with the same expression and character skeleton parameters to obtain an animation training network;
S3, for each input frame of the video, using the animation training network to output a skeleton parameter result;
and S5, performing three-dimensional reconstruction on the input picture to obtain the motion parameters of the character, and optimizing the skeleton parameters in combination with the geometric information of the face picture.
Preferably, in step S1: front-face pictures with seven classes of labels (the six basic emotions and neutral) are selected from the face data set; each character in the animation data set comprises skeleton parameters for controlling the character's expression and seven classes of labels corresponding to the six basic emotions and neutral.
Preferably, step S1 includes the steps of:
S11, first, labeling the feature points of the 3D animation character, extracting the facial feature vector of the animation picture according to the selected face marker points, and rendering the 3D character to obtain a two-dimensional picture;
S12, then, for each picture in the face data set, searching the 3D animation data set for the most similar corresponding sample.
Preferably, step S12 includes the steps of:
S121, first, searching the whole animation data set for the several pictures with the most similar expression distance;
and S122, among these pictures, using the geometric distance to find the picture whose feature points are most similar to the face, and outputting it as the result.
Preferably, the method further comprises the following step: S4, interpolating the output skeleton parameters using the relation between the previous and next key frames.
Preferably, step S4 includes:
S41, first, classifying the emotion of the input face using the emotion recognition network of step S1 to obtain the parameter set of the corresponding expression in the animation data set, which is used for the parameter set search;
and S42, among all expression parameters, searching the two parameter combinations closest in L2 distance to the optimization result as results, and interpolating between key frames.
Preferably, in step S2: the face and animation data pictures matched in step S1 are used as training data to generate the skeleton parameters of the animation character.
Preferably, in step S5: the face geometric information includes any one or more of: the height between the left and right eyebrows, the left and right eye heights, the nose width, the left and right nose heights, the mouth width, the mouth height, and the lip height.
An animation character facial expression generation system based on facial expression recognition comprises:
the data preprocessing module is used for recognizing the expressions of the human faces and the animation characters in the face data set and the animation data set through a deep convolutional network, and matching face pictures with animation data pictures;
the offline training module is used for obtaining, through deep learning, the mapping relation between face pictures with the same expression and character skeleton parameters;
the online generation module first inputs a face key frame to the offline training module to obtain character skeleton parameters; then interpolates the obtained character skeleton parameters using the relation between the previous and next key frames; then performs three-dimensional reconstruction on the input face key frame to obtain the motion parameters of the character; and finally optimizes the skeleton parameters in combination with the geometric information of the face picture.
Compared with the prior art, the invention has the following beneficial effects:
whereas traditional methods consider only the geometric characteristics of the character, the invention introduces the character's emotion changes into three-dimensional animation modeling and optimizes the character with specific geometric details, improving the human-to-animation capture effect;
real-time automatic control of the animated character is realized;
traditional animation production requires a depth camera or similar equipment to extract face information, whereas reconstructing from an input two-dimensional face reduces the production cost;
interpolating between two key frames makes the character's facial expression transition smoothly and effectively, improving the user experience.
Drawings
FIG. 1 is a flow chart of an animation character facial expression generation method based on facial expression recognition according to the present invention;
FIG. 2 is a diagram of the optimization results of the animated character at different stages according to the present invention;
FIG. 3 is a diagram of interpolation effects for animated characters according to the present invention;
FIG. 4 is a graph of the same face migration results of the present invention;
FIG. 5 is a comparison of the effect of the present invention with a prior-art method based on expression information only.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Embodiment 1
An animation character facial expression generation method based on facial expression recognition comprises the following steps:
S1, recognizing the expressions of the human face and the animation character in a face data set and an animation data set through an emotion recognition network, and matching face pictures with animation data pictures;
the face data set used in this embodiment is from CK +, dispa, KDEF, and MMI. Wherein each data set selects a front face picture with seven types of labels of six degrees emotion and neutrality. And performing operations such as rotation, scaling and the like on the image pairs under each label in the data set to perform data enhancement, so that the number of the images under each label is equal, and finally approximately 10000 images are obtained. In the embodiment, the illustration and the experimental result display picture are both from the data set.
The 3D animation data set used in this embodiment comes from the FERG-3D-DB data set and includes four characters: Mery, Bonnie, Ray, and Malcolm. Each character comprises skeleton parameter values for controlling the character's expression and seven classes of labels corresponding to the six basic emotions and neutral, giving about 40000 samples.
The purpose of this stage is to train a neural network for classifying 2D face pictures and animation pictures. The face data set is first trained with the neural network shown in Table 1 and is classified into the seven categories of anger, disgust, fear, happiness, sadness, neutral, and surprise. For the 2D animation data set, the parameters of the network layers before the POOL layer are kept fixed, and the FC layers of the trained network are fine-tuned on the animation data set, which is likewise classified into anger, disgust, fear, happiness, sadness, neutral, and surprise.
TABLE 1 Emotion recognition network
(The layer-by-layer architecture of Table 1 is provided as an image in the original publication.)
This embodiment uses the PyTorch framework for end-to-end training of the network. The data set is split into training, validation, and test sets at a ratio of 8:1:1. An SGD (stochastic gradient descent) optimizer is used with momentum 0.9, weight decay 0.0005, and an initial learning rate of 0.01; the learning rate is reduced to 1/10 every 10 epochs. Training runs for 60 epochs on the face data set and 50 epochs on the animation data set, with a batch size of 50.
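The training recipe above can be written down in PyTorch roughly as follows. This is a sketch only: the real layer structure of Table 1 is published as an image, so the small EmotionNet below, the layer names, and the face_loader/animation_loader data loaders are placeholders, while the optimizer settings mirror the values stated in the text.

```python
import torch
from torch import nn, optim

class EmotionNet(nn.Module):
    """Stand-in for the Table 1 network: conv backbone, POOL layer, FC head named "fc"."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, num_classes))

    def forward(self, x):
        return self.fc(self.pool(self.features(x)).flatten(1))

model = EmotionNet()
criterion = nn.CrossEntropyLoss()

def train(loader, epochs, lr):
    params = [p for p in model.parameters() if p.requires_grad]
    opt = optim.SGD(params, lr=lr, momentum=0.9, weight_decay=0.0005)
    sched = optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)  # lr / 10 every 10 epochs
    for _ in range(epochs):
        for images, labels in loader:          # batches of 50 pictures
            opt.zero_grad()
            criterion(model(images), labels).backward()
            opt.step()
        sched.step()

# Stage 1: 60 epochs on the face data set at lr 0.01.
train(face_loader, epochs=60, lr=0.01)

# Stage 2: keep the layers before POOL fixed and fine-tune only the FC head
# on the animation data set for 50 epochs.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")
train(animation_loader, epochs=50, lr=0.01)
```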
For the 3D network, the animation picture corresponding to each face data set picture is searched for and used as the reference value for training the 3D animation network and the character migration network.
First, the feature points of the 3D animation character are marked in 3D modeling software, the facial feature vector of the animation picture is extracted according to the selected face geometric features, and the 3D character is rendered to obtain a two-dimensional picture. Then, for each picture in the face data set, the most similar corresponding sample in the 3D animation data set is searched for. The specific procedure is as follows: first, the whole animation data set is searched for the 30 pictures with the most similar expression distance; then, among these 30 pictures, the picture whose feature points are geometrically most similar to the face is found using the geometric distance and output as the result.
For the expression distance, this embodiment uses the Jensen-Shannon (JS) divergence as the measure of the expression distance between two pictures. After the face picture and the animation picture are input into the CNN of Table 1, the 512-dimensional vectors output at the FC2 layer are used as expression feature vectors, denoted H and C. The distance is given by formula (1), where M = (H + C) / 2 and D(H‖M) and D(C‖M) are Kullback-Leibler divergences:

JS(H‖C) = D(H‖M) / 2 + D(C‖M) / 2    (1)
For the geometric distance, the selected face geometric features are used as the geometric feature vector and normalized; the animation geometric feature vector c closest to the normalized face geometric feature vector h is searched for and output as the result.
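A sketch of this two-stage matching (expression distance, then geometric distance) is given below. The function and variable names are illustrative, and it assumes the FC2 expression vectors have already been normalized into probability distributions, which the patent does not spell out.

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Kullback-Leibler divergence D(p || q) between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def js(h: np.ndarray, c: np.ndarray) -> float:
    """Jensen-Shannon divergence of formula (1) between two 512-d expression vectors."""
    m = 0.5 * (h + c)
    return 0.5 * kl(h, m) + 0.5 * kl(c, m)

def match(face_expr, face_geom, anim_exprs, anim_geoms, k=30):
    """Return the index of the best-matching animation picture.

    face_expr: (512,) expression vector of the face picture.
    face_geom: (d,)   geometric feature vector of the face picture.
    anim_exprs: (N, 512), anim_geoms: (N, d) for the animation data set.
    """
    # Stage 1: the k animation pictures with the smallest expression distance.
    expr_dist = np.array([js(face_expr, c) for c in anim_exprs])
    candidates = np.argsort(expr_dist)[:k]
    # Stage 2: among those, the picture with the closest normalized geometric features.
    unit = lambda v: v / (np.linalg.norm(v) + 1e-12)
    geom_dist = [np.linalg.norm(unit(face_geom) - unit(anim_geoms[i])) for i in candidates]
    return int(candidates[int(np.argmin(geom_dist))])
```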
S2, obtaining an animation training network through deep learning of the mapping relation between the facial pictures with the same expression and the character skeleton parameters;
the purpose of this process is to train a neural network for 3D animated character parameter generation. The present embodiment uses the above-mentioned matched face-3D parameters as data for training. In the 3D animation training network shown in table 2, a human face picture is used as an input, and a cross entropy loss function of a network FC3 layer output result and a reference value is used as an objective function to optimize. The specific formula is shown in (2), wherein y is the output result of the 3D animation training network, and y' is the reference value of the face-3D parameter matching.
H(y,y′)=-∑iy′ilog(softmax(yi)) (2)
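Formula (2) is the usual softmax cross entropy; the snippet below is just a numerical rendering of it, with an arbitrary output dimension and with y' treated as a per-sample reference distribution (both choices are assumptions for illustration).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
y = torch.randn(50, 16)                          # FC3 outputs for a batch of 50 (output size illustrative)
y_ref = torch.rand(50, 16)
y_ref = y_ref / y_ref.sum(dim=1, keepdim=True)   # reference values y', normalized per sample

# Formula (2): H(y, y') = -sum_i y'_i * log(softmax(y)_i), averaged over the batch.
loss = -(y_ref * F.log_softmax(y, dim=1)).sum(dim=1).mean()
print(float(loss))
```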
TABLE 2 3D animation training network
(The layer-by-layer architecture of Table 2 is provided as an image in the original publication.)
This embodiment uses the PyTorch framework for end-to-end training of the network. An SGD optimizer is used with momentum 0.9, weight decay 0.0005, and an initial learning rate of 0.001; the learning rate is reduced to 1/10 every 10 epochs. Training runs for 50 epochs with a batch size of 50.
S3, aiming at the input of each frame of the video, using a network animation training network to output a skeleton parameter result;
S4, interpolating the output skeleton parameters using the relation between the previous and next key frames;
because the time interval exists in the reading of the facial expression of the human face, the interpolation algorithm can be utilized to interpolate the change of the character expression, so that the expression is excessively smooth. Because the reading time interval is short, and the facial expression generally hardly changes greatly, the embodiment obtains two skeleton parameters with the shortest emotional distance and geometric distance between two input key frames as candidate parameters for interpolation by filtering the 3D data set by two layers: firstly, classifying the emotion of the input face by using the emotion recognition network to obtain a parameter set of the corresponding expression in the 3D data set for searching geometric parameters. Of all expression parameters, two parameter combinations closest to the optimization result L2 are searched as a result, and interpolation is performed between key frames. The interpolation results are shown in fig. 3: by interpolating between the two key frames, the character expression transition problem between the front frame and the back frame can be smoothed, so that the expression is more natural.
S5, performing three-dimensional reconstruction on the input picture to obtain the motion parameters of the character, and optimizing the skeleton parameters in combination with the geometric information of the face picture.
The overall characteristics of the animated character, such as its expression, can be captured effectively through the neural network design. However, it is hard to guarantee that the neural network fully learns the detailed parameters of the animated character, such as how wide the eyes are open or how wide the nose is. This embodiment therefore fuses the geometric features with the overall features of the character to generate the final skeleton parameters.
For the face and animation data sets, the following facial landmark measurements are extracted as the face geometric features: the height between the left and right eyebrows (height from the highest to the lowest eyebrow feature point); the left and right eye heights (height from the highest to the lowest eye feature point); the nose width (width from the rightmost to the leftmost nose feature point); the left and right nose heights (height from the left-most/right-most feature point to the bottom of the nose); the mouth width (width from the rightmost to the leftmost mouth feature point); the mouth height (height from the highest to the lowest mouth feature point); and the lip height (height from the upper lip to the bottom of the lips). Each picture is scaled to 256 x 256 pixels for input to the neural network.
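As an illustration, these measurements could be read off a standard 68-point landmark layout as in the sketch below; the patent does not specify which landmark detector or indices are used, so the iBUG-68 indexing and the grouping of points are assumptions.

```python
import numpy as np

def face_geometry(lm: np.ndarray) -> np.ndarray:
    """Compute the geometric feature vector from 68 landmarks (iBUG ordering assumed).

    lm: array of shape (68, 2) holding (x, y) landmark coordinates.
    """
    span_y = lambda idx: lm[idx, 1].max() - lm[idx, 1].min()   # vertical extent of a point group
    span_x = lambda idx: lm[idx, 0].max() - lm[idx, 0].min()   # horizontal extent of a point group
    feats = np.array([
        span_y(range(17, 27)),   # eyebrow height (highest to lowest eyebrow point)
        span_y(range(36, 42)),   # left eye height
        span_y(range(42, 48)),   # right eye height
        span_x(range(31, 36)),   # nose width
        span_y(range(27, 36)),   # nose height (bridge to nose bottom)
        span_x(range(48, 60)),   # mouth width (outer lip contour)
        span_y(range(48, 60)),   # mouth height
        span_y(range(60, 68)),   # lip height (inner lip contour)
    ], dtype=np.float32)
    return feats / (np.linalg.norm(feats) + 1e-12)   # normalized geometric feature vector
```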
For face parameter optimization, the extracted facial feature vector is mapped onto the parameters that control the corresponding parts of the character's face.
For rotation parameter optimization, 3D reconstruction is carried out using information such as depth extracted from the two-dimensional image with OpenCV, and the xyz coordinate values of the head rotation are obtained and output as the result. Specifically, the face rotation matrix is obtained by solving the Perspective-n-Point (PnP) problem using the facial feature point coordinates (both sides of the eyes, both sides of the nose, and both sides of the mouth) that correspond to the character's facial feature points, together with the camera intrinsic parameters; the rotation matrix is then converted into the corresponding xyz coordinate values.
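A sketch of this step with OpenCV is shown below. The pinhole intrinsics (focal length equal to the image width, principal point at the center) and the Euler-angle convention are assumptions, since the text only states that PnP is solved with the camera intrinsics and the result converted to xyz values.

```python
import cv2
import numpy as np

def head_rotation_xyz(pts_2d: np.ndarray, pts_3d: np.ndarray, img_w: int, img_h: int):
    """Solve PnP and return the head rotation as xyz (Euler) angles in degrees.

    pts_2d: (N, 2) image coordinates of the feature points (eye, nose and mouth sides).
    pts_3d: (N, 3) corresponding 3D coordinates on the character's face model.
    """
    focal = float(img_w)                                   # assumed pinhole intrinsics
    camera_matrix = np.array([[focal, 0, img_w / 2],
                              [0, focal, img_h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))                         # assume no lens distortion

    ok, rvec, _tvec = cv2.solvePnP(pts_3d.astype(np.float64), pts_2d.astype(np.float64),
                                   camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP did not converge")

    R, _ = cv2.Rodrigues(rvec)                             # rotation vector -> rotation matrix
    sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)
    x = np.degrees(np.arctan2(R[2, 1], R[2, 2]))           # rotation about x
    y = np.degrees(np.arctan2(-R[2, 0], sy))               # rotation about y
    z = np.degrees(np.arctan2(R[1, 0], R[0, 0]))           # rotation about z
    return x, y, z
```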
The character images after face parameter and rotation parameter optimization are shown in FIG. 2. From left to right: the input face picture, the result of generating the character's overall skeleton parameters, the result of optimizing the character's skeleton parameters, and the result of adding the motion parameters.
Embodiment 2
With reference to FIGS. 1 to 5, an animation character facial expression generation system based on facial expression recognition comprises:
the data preprocessing module is used for recognizing the expressions of the human faces and the animation characters in the face data set and the animation data set through a deep convolutional network, and matching face pictures with animation data pictures;
the offline training module is used for obtaining, through deep learning, the mapping relation between face pictures with the same expression and character skeleton parameters;
the online generation module first inputs a face key frame to the offline training module to obtain character skeleton parameters; then interpolates the obtained character skeleton parameters using the relation between the previous and next key frames; then performs three-dimensional reconstruction on the input face key frame to obtain the motion parameters of the character; and finally optimizes the skeleton parameters in combination with the geometric information of the face picture.
It should be noted that, in this embodiment, any module or any function implemented by a module may be added to achieve the object of the first embodiment of the present invention; this is not described in detail here.
When training the facial expression recognition network and the character emotion recognition network, 80% of the pictures are used as the training set, 10% as the validation set, and 10% as the test set. The emotion recognition accuracy for the different characters is shown in Table 4. The test results demonstrate that the neural network designed by the invention can effectively recognize the expressions of both human faces and cartoon characters.
TABLE 4 Emotion recognition accuracy
(The per-character accuracy values of Table 4 are provided as an image in the original publication.)
The algorithm provided by the invention can effectively transfer facial expressions to different animation characters. It can also migrate different faces to the same character, ensuring the robustness of the effect. FIG. 4 shows the result of migrating a picture sequence of facial emotion changes to Mery. FIG. 5 shows the results of migrating the emotion changes of different faces to Mery. The results show that, for a single face, the character migration system preserves the accuracy of the expression transformation before and after the key frame; for different faces, the system can accurately recognize the different expressions, ensuring robustness to different inputs.
The invention compares the proposed algorithm with the animation effects of character migration based only on emotion and based only on geometric information; the final results are shown in FIG. 5: the first row is the input face picture, the second row is the output of the proposed system, and the third row is the output of the expression-based method. Compared with a character migration method based only on emotion, the algorithm of the invention controls the mouth and eyes more finely and improves the audience's perception of the character's emotion changes.
The invention provides a real-time animation generation algorithm combining facial expressions and geometric features. Transferring facial expressions effectively improves the audience's perception of the animated character, while control of the geometric features enables manipulation of the character's details. Real-time control and automatic generation of the character are achieved through the interpolation optimization algorithm.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions, and alterations can be made to these embodiments without departing from the principle and spirit of the invention, and such variants still fall within the scope of protection of the invention.

Claims (9)

1. An animation character facial expression generation method based on facial expression recognition, characterized by comprising the following steps:
S1, recognizing the expressions of the human face and the animation character in a face data set and an animation data set through an emotion recognition network, and matching face pictures with animation data pictures;
S2, learning, through deep learning, the mapping relation between face pictures with the same expression and character skeleton parameters to obtain an animation training network;
S3, for each input frame of the video, using the animation training network to output a skeleton parameter result;
and S5, performing three-dimensional reconstruction on the input picture to obtain the motion parameters of the character, and optimizing the skeleton parameters in combination with the geometric information of the face picture.
2. The animation character facial expression generation method based on facial expression recognition according to claim 1, wherein in step S1: front-face pictures with seven classes of labels (the six basic emotions and neutral) are selected from the face data set; each character in the animation data set comprises skeleton parameters for controlling the character's expression and seven classes of labels corresponding to the six basic emotions and neutral.
3. The animation character facial expression generation method based on facial expression recognition according to claim 2, wherein step S1 comprises the following steps:
S11, first, labeling the feature points of the 3D animation character, extracting the facial feature vector of the animation picture according to the selected face marker points, and rendering the 3D character to obtain a two-dimensional picture;
S12, then, for each picture in the face data set, searching the 3D animation data set for the most similar corresponding sample.
4. The animation character facial expression generation method based on facial expression recognition according to claim 3, wherein step S12 comprises the following steps:
S121, first, searching the whole animation data set for the several pictures with the most similar expression distance;
and S122, among these pictures, using the geometric distance to find the picture whose feature points are most similar to the face, and outputting it as the result.
5. The animation character facial expression generation method based on facial expression recognition, characterized by further comprising the following step: S4, interpolating the output skeleton parameters using the relation between the previous and next key frames.
6. The animation character facial expression generation method based on facial expression recognition according to claim 5, wherein step S4 comprises:
S41, first, classifying the emotion of the input face using the emotion recognition network of step S1 to obtain the parameter set of the corresponding expression in the animation data set, which is used for the parameter set search;
and S42, among all expression parameters, searching the two parameter combinations closest in L2 distance to the optimization result as results, and interpolating between key frames.
7. The animation character facial expression generation method based on facial expression recognition according to claim 1, wherein in step S2: the face and animation data pictures matched in step S1 are used as training data to generate the skeleton parameters of the animation character.
8. The animation character facial expression generation method based on facial expression recognition according to claim 1, wherein in step S5: the face geometric information includes any one or more of: the height between the left and right eyebrows, the left and right eye heights, the nose width, the left and right nose heights, the mouth width, the mouth height, and the lip height.
9. An animation character facial expression generation system based on facial expression recognition is characterized by comprising:
the data preprocessing module is used for recognizing the expressions of the human faces and the animation characters in the face data set and the animation data set through a deep convolutional network, and matching face pictures with animation data pictures;
the offline training module is used for obtaining, through deep learning, the mapping relation between face pictures with the same expression and character skeleton parameters;
the online generation module first inputs a face key frame to the offline training module to obtain character skeleton parameters; then interpolates the obtained character skeleton parameters using the relation between the previous and next key frames; then performs three-dimensional reconstruction on the input face key frame to obtain the motion parameters of the character; and finally optimizes the skeleton parameters in combination with the geometric information of the face picture.
CN202110470655.5A 2021-04-28 2021-04-28 Animation character facial expression generation method and system based on facial expression recognition Pending CN113255457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110470655.5A CN113255457A (en) 2021-04-28 2021-04-28 Animation character facial expression generation method and system based on facial expression recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110470655.5A CN113255457A (en) 2021-04-28 2021-04-28 Animation character facial expression generation method and system based on facial expression recognition

Publications (1)

Publication Number Publication Date
CN113255457A true CN113255457A (en) 2021-08-13

Family

ID=77222532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110470655.5A Pending CN113255457A (en) 2021-04-28 2021-04-28 Animation character facial expression generation method and system based on facial expression recognition

Country Status (1)

Country Link
CN (1) CN113255457A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049678A (en) * 2022-01-11 2022-02-15 之江实验室 Facial motion capturing method and system based on deep learning
CN114494542A (en) * 2022-01-24 2022-05-13 广州喳喳科技有限公司 Character driving animation method and system based on convolutional neural network
CN114529640A (en) * 2022-02-17 2022-05-24 北京字跳网络技术有限公司 Moving picture generation method and device, computer equipment and storage medium
CN114898020A (en) * 2022-05-26 2022-08-12 唯物(杭州)科技有限公司 3D character real-time face driving method and device, electronic equipment and storage medium
USD969216S1 (en) * 2021-08-25 2022-11-08 Rebecca Hadley Educational poster
CN115797569A (en) * 2023-01-31 2023-03-14 盾钰(上海)互联网科技有限公司 Dynamic generation method and system for high-precision twin facial expression and action subdivision
CN115953515A (en) * 2023-03-14 2023-04-11 深圳崇德动漫股份有限公司 Animation image generation method, device, equipment and medium based on real person data
CN116485964A (en) * 2023-06-21 2023-07-25 海马云(天津)信息技术有限公司 Expression processing method, device and storage medium of digital virtual object

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473801A (en) * 2013-09-27 2013-12-25 中国科学院自动化研究所 Facial expression editing method based on single camera and motion capturing data
CN105528805A (en) * 2015-12-25 2016-04-27 苏州丽多数字科技有限公司 Virtual face animation synthesis method
CN106600667A (en) * 2016-12-12 2017-04-26 南京大学 Method for driving face animation with video based on convolution neural network
CN107154069A (en) * 2017-05-11 2017-09-12 上海微漫网络科技有限公司 A kind of data processing method and system based on virtual role
CN108876879A (en) * 2017-05-12 2018-11-23 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that human face animation is realized
CN109978975A (en) * 2019-03-12 2019-07-05 深圳市商汤科技有限公司 A kind of moving method and device, computer equipment of movement
CN112541445A (en) * 2020-12-16 2021-03-23 中国联合网络通信集团有限公司 Facial expression migration method and device, electronic equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473801A (en) * 2013-09-27 2013-12-25 中国科学院自动化研究所 Facial expression editing method based on single camera and motion capturing data
CN105528805A (en) * 2015-12-25 2016-04-27 苏州丽多数字科技有限公司 Virtual face animation synthesis method
CN106600667A (en) * 2016-12-12 2017-04-26 南京大学 Method for driving face animation with video based on convolution neural network
CN107154069A (en) * 2017-05-11 2017-09-12 上海微漫网络科技有限公司 A kind of data processing method and system based on virtual role
CN108876879A (en) * 2017-05-12 2018-11-23 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that human face animation is realized
CN109978975A (en) * 2019-03-12 2019-07-05 深圳市商汤科技有限公司 A kind of moving method and device, computer equipment of movement
CN112541445A (en) * 2020-12-16 2021-03-23 中国联合网络通信集团有限公司 Facial expression migration method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEEPALI ANEJA 等: ""Learning to Generate 3D Stylized Character Expressions from Humans"", 《IEEE》 *
DEEPALI ANEJA 等: ""Modeling Stylized Character Expressions via Deep Learning"", 《SPRINGER》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
USD969216S1 (en) * 2021-08-25 2022-11-08 Rebecca Hadley Educational poster
CN114049678A (en) * 2022-01-11 2022-02-15 之江实验室 Facial motion capturing method and system based on deep learning
CN114049678B (en) * 2022-01-11 2022-04-12 之江实验室 Facial motion capturing method and system based on deep learning
CN114494542A (en) * 2022-01-24 2022-05-13 广州喳喳科技有限公司 Character driving animation method and system based on convolutional neural network
CN114529640A (en) * 2022-02-17 2022-05-24 北京字跳网络技术有限公司 Moving picture generation method and device, computer equipment and storage medium
CN114529640B (en) * 2022-02-17 2024-01-26 北京字跳网络技术有限公司 Moving picture generation method, moving picture generation device, computer equipment and storage medium
CN114898020A (en) * 2022-05-26 2022-08-12 唯物(杭州)科技有限公司 3D character real-time face driving method and device, electronic equipment and storage medium
CN115797569A (en) * 2023-01-31 2023-03-14 盾钰(上海)互联网科技有限公司 Dynamic generation method and system for high-precision twin facial expression and action subdivision
CN115797569B (en) * 2023-01-31 2023-05-02 盾钰(上海)互联网科技有限公司 Dynamic generation method and system for high-precision degree twin facial expression action subdivision
CN115953515A (en) * 2023-03-14 2023-04-11 深圳崇德动漫股份有限公司 Animation image generation method, device, equipment and medium based on real person data
CN116485964A (en) * 2023-06-21 2023-07-25 海马云(天津)信息技术有限公司 Expression processing method, device and storage medium of digital virtual object
CN116485964B (en) * 2023-06-21 2023-10-13 海马云(天津)信息技术有限公司 Expression processing method, device and storage medium of digital virtual object

Similar Documents

Publication Publication Date Title
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN111275518B (en) Video virtual fitting method and device based on mixed optical flow
CN112887698B (en) High-quality face voice driving method based on nerve radiation field
Chuang et al. Mood swings: expressive speech animation
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Li et al. Learning symmetry consistent deep cnns for face completion
De Castro et al. Automatic translation of sign language with multi-stream 3D CNN and generation of artificial depth maps
CN111967533B (en) Sketch image translation method based on scene recognition
Sinha et al. Identity-preserving realistic talking face generation
Yang et al. Controllable sketch-to-image translation for robust face synthesis
Xia et al. Controllable continuous gaze redirection
Gu et al. CariMe: unpaired caricature generation with multiple exaggerations
Kwolek et al. Recognition of JSL fingerspelling using deep convolutional neural networks
Hong et al. Dagan++: Depth-aware generative adversarial network for talking head video generation
CN113076918A (en) Video-based facial expression cloning method
CN117333604A (en) Character face replay method based on semantic perception nerve radiation field
Liu et al. 4D facial analysis: A survey of datasets, algorithms and applications
Ekmen et al. From 2D to 3D real-time expression transfer for facial animation
Zhao et al. Purifying naturalistic images through a real-time style transfer semantics network
US11734888B2 (en) Real-time 3D facial animation from binocular video
Wang et al. Expression-aware neural radiance fields for high-fidelity talking portrait synthesis
Borovikov et al. Applied monocular reconstruction of parametric faces with domain engineering
Sun et al. Generation of virtual digital human for customer service industry
Mu Pose Estimation‐Assisted Dance Tracking System Based on Convolutional Neural Network
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210813