CN108717732B - Expression tracking method based on MobileNet model - Google Patents
- Publication number
- CN108717732B (application CN201810486472.0A)
- Authority
- CN
- China
- Prior art keywords
- model
- layer
- face
- neural network
- coordinates
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/20—Finite element generation, e.g. wire-frame surface description, tesselation
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention belongs to the field of expression tracking technology, and in particular relates to an expression tracking method based on a MobileNet model. The method mainly comprises the following steps: generating a training data set through preprocessing, where the preprocessing gives the face in each picture of the data set three-dimensional feature coordinates; constructing a neural network MobileNet model from a standard convolution layer, 12 separable convolution layers, 1 mean-pooling layer, a fully connected layer and Softmax, the 12 separable convolution layers being 6 depthwise convolutions and 6 pointwise convolutions; training the constructed MobileNet model with the obtained training data set; obtaining the coordinates of the three-dimensional feature points of the face in an input image with the trained MobileNet model; and performing mesh reconstruction on the extracted three-dimensional feature point coordinates to generate deformation coefficients that drive the 3D face model, thereby realizing expression tracking. The method balances model size and running speed, so it is suitable for mobile devices and has high practicability.
Description
Technical Field
The invention belongs to the field of expression tracking technology, and in particular relates to an expression tracking method based on a MobileNet model.
Background
With the improvement of hardware, facial expression tracking technology has gradually been applied in fields such as film production, VR social networking, and game development. Films such as Star Wars and Avatar make full use of tracking technology in producing the expressions and movements of their characters, achieving facial expressiveness and movement close to that of real people. In addition, according to the research of the American psychologist Albert Mehrabian, human emotion in social communication is conveyed 7% by words, 38% by tone of voice, and 55% by facial expression. Given the ubiquity of the Internet today, research on facial expression tracking is of very practical significance for improving the effectiveness of communication and the enjoyment of leisure and entertainment while respecting user privacy.
Locating facial feature points is a key link in the expression tracking process; whether the feature points are extracted accurately directly affects the fidelity of the subsequent expression-mapping stage. The main procedure is as follows: after a face is input from a device or a source path, the positions of the facial features and the facial contour are located, the coordinate values of these landmark points are extracted, and the coordinates are used to build a triangular mesh during expression mapping. The accuracy of an algorithm is evaluated by computing the difference between the extracted feature points and the ground-truth feature point coordinates of the face; feature-localization algorithms are mainly compared in terms of accuracy, speed, and robustness.
The traditional cascaded regression method first constructs an initial face shape and then gradually approximates the true face shape through a series of trained weak regressors. However, once the initial shape deviates too far from the actual face shape, the subsequent regression optimization also suffers a large error. Some researchers are working on improving the quality of the initial face, with some success, but errors introduced by the initialization cannot be completely avoided. In addition, many researchers now perform feature extraction by training a neural network model; this approach mainly depends on the preprocessing of the training data and on the design and optimization of the network. Many network structures have achieved good results in speed and accuracy, but there is still room for improvement.
Disclosure of Invention
In view of these problems, the invention extracts facial feature points by building and training a MobileNet deep learning model, and realizes the tracking and transfer of human facial expressions to an animated model's expressions by generating deformation coefficients from the feature points. With the spread of networks and the emergence of various intelligent applications, plain text and voice can hardly satisfy users' demands for engaging daily social contact and game entertainment, while personal safety and privacy leakage remain concerns. Expression tracking addresses this problem, mainly by tracking a user's expressions and transferring them to the face of a virtual model.
The technical scheme of the invention is as follows:
An expression tracking method based on a MobileNet model, characterized by comprising the following steps:
S1, generating a training data set through preprocessing, where the preprocessing gives the face in each picture of the data set three-dimensional feature coordinates;
S2, constructing a neural network MobileNet model from a standard convolution layer, 12 separable convolution layers, 1 mean-pooling layer, a fully connected layer and Softmax, the 12 separable convolution layers being 6 depthwise convolutions and 6 pointwise convolutions;
training the constructed MobileNet model with the training data set obtained in step S1;
S3, obtaining the coordinates of the three-dimensional feature points of the face in the input image with the trained MobileNet model;
S4, performing mesh reconstruction on the extracted three-dimensional feature point coordinates to generate deformation coefficients, and driving the 3D face model to realize expression tracking.
Further, the three-dimensional feature coordinates in step S1 are a plurality of three-dimensional feature coordinates covering the facial features and the outer contour.
Further, the specific method for training the constructed neural network MobileNet model in step S2 is as follows:
There are 68 three-dimensional feature coordinates in total, the first (standard convolution) layer of the MobileNet model contains 64 convolution kernels, and the height and width of a training-set picture are h and w respectively; then:
after the first standard convolution layer, the input picture is convolved with stride 2 into a feature map of size (h/2) × (w/2) × 64;
from the second layer, the 12 separable layers are iterated in sequence with stride 1 or 2, processing the feature map into a size of (h/32) × (w/32) × 1024;
the mean-pooling layer normalizes the feature map to a size of 1 × 1 × 1024 with step size m;
finally, the fully connected layer maps the features to 3 × 68 three-dimensional coordinate values, completing the feature extraction of the training-set picture.
The advantage of the invention is its use of the lightweight network MobileNet, which balances model size and running speed, so the method is suitable for mobile devices and has strong practicability.
Drawings
FIG. 1 is a diagram of the depthwise separable convolution;
FIG. 2 is a graph of the training results;
FIG. 3 illustrates the test result on a single picture;
FIG. 4 illustrates the expression-mapping results.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.
In the invention, the user's facial feature points are extracted first. The original MobileNet model is modified to be trained on, and to output, three-dimensional feature point data comprising 68 feature points of the facial features and the outer contour; the data set contains face images of different ages and of both Chinese and foreign subjects. The trained model effectively extracts the three-dimensional features of a face; triangular mesh reconstruction is performed on the extracted feature information to generate deformation coefficients, and the expression of the animated model changes as the deformation coefficients change.
Preparing and processing the data set is the basis of model training. For example, 100 videos containing human faces are downloaded, lightly processed, and the three-dimensional feature coordinates of the faces in the videos are annotated. Each video is then cut into single frames and the three-dimensional coordinates are split per frame, finally yielding tens of thousands of pictures and their corresponding labels. The videos should cover people of different age groups from different countries in different scenes, with sufficiently rich facial expressions, so that the trained model is more robust.
To meet the requirements of simple image-acquisition equipment and fast feature extraction, the model is built on the lightweight deep neural network MobileNet. Conventional convolutional neural networks work well for image processing and object detection, and deeper network structures can be trained to higher accuracy; however, such networks are hard to speed up, and their excessive size prevents them from being embedded in mobile devices. Compared with a traditional convolutional neural network, the MobileNet model builds a lightweight network on top of depthwise separable convolutions, balancing running speed and model size while preserving accuracy.
The depthwise separable convolution structure is shown in Fig. 1: it decomposes a standard convolution into a depthwise convolution that performs the convolution filtering and a 1 × 1 pointwise convolution that combines the outputs of the depthwise convolution. Let M be the number of input channels, N the number of output channels, D_K the kernel size, and D_F the spatial width and height of the square input feature map. The computational cost of the standard convolution is D_K · D_K · M · N · D_F · D_F, while the cost of the depthwise separable convolution is D_K · D_K · M · D_F · D_F + M · N · D_F · D_F. Comparing the two gives:

(D_K · D_K · M · D_F · D_F + M · N · D_F · D_F) / (D_K · D_K · M · N · D_F · D_F) = 1/N + 1/D_K²
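This cost comparison can be checked numerically. A minimal Python sketch, where the kernel size D_K and the values for M, N, and D_F are illustrative assumptions, not taken from the patent:

```python
# Multiply-accumulate counts for a standard convolution versus a
# depthwise separable convolution, following the formulas above.
def standard_conv_cost(dk, m, n, df):
    # Dk*Dk*M*N*Df*Df
    return dk * dk * m * n * df * df

def separable_conv_cost(dk, m, n, df):
    # depthwise Dk*Dk*M*Df*Df plus pointwise M*N*Df*Df
    return dk * dk * m * df * df + m * n * df * df

dk, m, n, df = 3, 3, 64, 112  # assumed example values
ratio = separable_conv_cost(dk, m, n, df) / standard_conv_cost(dk, m, n, df)
# The closed form of the ratio is 1/N + 1/Dk^2
print(ratio, 1 / n + 1 / dk ** 2)
```

With 3 × 3 kernels and 64 output channels, the separable form costs roughly one eighth of the standard convolution, which is the source of the speed and size savings claimed here.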
therefore, the model effectively reduces the complexity of the calculation time and the size of the model. In the invention, the MobileNet model is applied to human face feature point extraction for the first time, the structure treats deep convolution and point convolution as two independent modules, each convolution operation is followed by a Batchnorm and a ReLU, and downsampling is processed in the deep convolution and the first layer of standard convolution. The model structure is improved as the subsequent work needs triangular mesh reconstruction of the extracted feature points, and the improved structure is shown in table 1, namely a standard convolution layer, 12 separated convolution layers (6 deep convolutions +6 point convolutions), 1 mean pooling layer, a full connection layer and Softmax. Assuming that a picture I with a height h and a width w is input, the first layer is a standard convolution containing 64 convolution kernels, and the picture I is convolved with a step size of 2 to a characteristic size of (h/2) × (w/2) × 64. The second layer starts with a combination of split convolution and point convolution, and iterates through the 12 layers in sequence with a step size of 1 or 2, to step the feature map to the exact feature size of (h/32) × (w/32) × 1024. And then, a mean pooling layer normalizes the feature graph into a size of 1 × 1 × 1024 by step length m, and finally classifies the features into 3 × 68 three-dimensional coordinate points through a full connection layer to realize feature extraction of the picture I.
Table 1. MobileNet network architecture
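The layer-by-layer feature sizes described above can be traced with a short script for a 224 × 224 input. The stride schedule and intermediate channel counts of the six separable blocks below are assumptions; only the end points (112 × 112 × 64, then (h/32) × (w/32) × 1024, then 3 × 68 outputs) come from the description.

```python
# Trace of feature-map sizes through the modified MobileNet for a
# 224x224 input picture.
h = w = 224

# First layer: standard 3x3 convolution, 64 kernels, stride 2
h, w, c = h // 2, w // 2, 64          # 112 x 112 x 64

# Six separable blocks: depthwise conv sets the stride, the 1x1
# pointwise conv sets the channel count. Strides/channels assumed.
blocks = [(2, 128), (1, 256), (2, 256), (1, 512), (2, 512), (2, 1024)]
for stride, channels in blocks:
    h, w = h // stride, w // stride
    c = channels
print(h, w, c)                        # ends at 224/32 = 7, so 7 x 7 x 1024

# Mean pooling collapses 7x7x1024 to 1x1x1024; the fully connected
# layer then emits 3 * 68 = 204 coordinate values
fc_out = 3 * 68
```

Any schedule with four stride-2 separable blocks reproduces the h/32 reduction; the one shown is just one consistent choice.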
Taking input pictures uniformly resized to 224 × 224 as an example, the method specifically comprises:
the method comprises the following steps: assuming that the data set comprises N face images and corresponding label files (the (x, y, z) coordinates of 68 feature points), the face images and the corresponding label files are divided into a training set and a verification set according to the proportion of 7:3, and a test set picture is prepared independently, and M pictures obtained by dividing a complete video into frames and corresponding labels are generally used as a test set. The invention uses twenty thousand pictures for training, the video time length used for testing is about 20 seconds, and the video time length is divided into about 600 pictures according to frames.
Step two: the model is trained under a Pythrch framework, which provides a Tensor supporting a CPU and a GPU, and can greatly accelerate the calculation. To reduce the delay of picture reading and processing, the resize of the input picture (height h, width w) is unified to 224 × 224, and the corresponding x, y coordinates are scaled down according to the same proportion, and the z coordinate is unchanged, as follows:
h_r = 224/h,  w_r = 224/w      (1)
new_x = x × h_r,  new_y = y × w_r      (2)
where h_r is the compression ratio of the picture height, w_r the compression ratio of the picture width, new_x the compressed x coordinate, and new_y the compressed y coordinate. Because of the large data size, the training samples are divided into batches of size 128, the number of epochs is set to 20, and the model is saved every 5 epochs; in addition, because the separable convolutions have few parameters, the weight-decay value is set to a small value, 1e-4. To evaluate the error between the model's output coordinates and the true coordinates, PyTorch's SmoothL1Loss is used as the loss function of the model, as follows:
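Using the definitions of h_r and w_r above, the label rescaling can be sketched as follows; the function name is illustrative, not from the patent:

```python
# Rescale one (x, y, z) label when a picture of height h and width w
# is resized to 224x224: x scales with the height ratio, y with the
# width ratio, and z is unchanged.
def rescale_landmark(x, y, z, h, w, target=224):
    h_r = target / h   # compression ratio of the picture height
    w_r = target / w   # compression ratio of the picture width
    return x * h_r, y * w_r, z

new_x, new_y, new_z = rescale_landmark(100.0, 200.0, 5.0, 448, 896)
print(new_x, new_y, new_z)   # 50.0 50.0 5.0
```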
the functional error is a squared loss at (-1,1), otherwise the L1 loss. Where the subscript i refers to the ith element of x. The data set was iteratively trained on the NVIDIA GeForce GPU, with the results of each training (20epoch) shown in fig. 2.
After training, because a model is saved every 5 epochs, several models are finally produced. The generated models are used to process the validation set with val_size set to 4, and the parameters are tuned continuously to obtain the optimal model, with weight_decay = 1e-4, lr = 1e-3, and lr_decay = 0.95.
Step three: after the optimal model is obtained, the model is predicted by using the test data, and the batch _ size is set to 1. Assuming that a single picture m is input for testing, as shown in fig. 3, the test result is an array structure. In the test process, resize is carried out on the picture m, so that the final output result is correspondingly reduced in equal proportioni/h_r,Y=YiAnd amplifying in a/w _ r mode for subsequent grid reconstruction.
Step four: through the steps, the trained convolutional neural network MobileNet model can be used for obtaining the three-dimensional characteristic point coordinates of the human face in the video, and the three-dimensional characteristic point coordinates are recorded as: s ═ X1,Y1,Z1,X2,...,Y68,Z68)T∈R3n. The expression mapping from the characteristic points to the animation model is realized by adopting a two-layer feedforward neural network containing SigMID activation and linear output, and the network is a triangular mesh constructed by a set SThe distance between the vertices within serves as input data. Triangular meshThe construction process comprises the following steps: first, a structure is constructed that contains all scatter points S ═ S1,s2,...snAnd n is more than or equal to 1 and less than or equal to 68, and placing the super triangles into a triangle linked list k. Then, the point s is inserted1Finding out the triangle set T containing s1 in the circumcircle in the triangle linked list, where T is { T { (T) }1,t2,...tnAnd deleting the common edge of the set T, connecting the point s1 with all the vertices of the triangle in the set T, and completing the insertion of the points s1 to k. And finally, circularly inserting scattered points in the set S and optimizing the triangle to construct a grid. And (3) calculating Euclidean distances between vertexes in the grid, inputting the Euclidean distances into the model to obtain a deformation coefficient which can be identified by the animation model, and putting the generated deformation coefficient into a Unity3D project to realize expression mapping, as shown in FIG. 4.
In summary, the key point of the invention is a method that extracts the coordinates of the three-dimensional feature points of a face with the lightweight convolutional neural network MobileNet and converts the feature points into deformation coefficients recognizable by an animated model, thereby realizing expression mapping.
Extraction of the three-dimensional coordinates of facial feature points: the invention exploits the efficiency of the lightweight convolutional neural network MobileNet, modifying it for facial feature point extraction. Twenty thousand face images and corresponding label files are produced to train the network, and the optimal network for feature point extraction is obtained by continuously tuning the parameters. Because the image and its labels are compressed proportionally when fed into the network, the extracted data are rescaled, finally yielding the 3D coordinates of the facial feature points from the MobileNet model.
Expression mapping from the face to the animated model with the 3D feature point coordinates: after the 3D coordinates of the facial feature points are obtained, a triangular mesh is reconstructed from them, the Euclidean distances between the vertices are computed, and the 178 distance values are used as the input of the mapping model for data conversion; the resulting deformation coefficients recognizable by the animated model are finally fed into Unity3D to realize expression mapping.
Considering that the current trend in deep learning research of improving accuracy with ever deeper and more complex networks leads to larger models and slower running speeds, the invention uses the lightweight network MobileNet, which balances model size and running speed, making the method suitable for mobile devices and highly practical. With the spread of mobile electronic products, the time users spend in the online world, and their demands, gradually increase. To protect personal privacy while keeping leisure and entertainment enjoyable, expression-transfer technology has gradually developed. It manifests mainly as follows: in online social contact or office work, a user can communicate with unfamiliar friends through expression transfer and see the other party's expression change without exposing his or her own appearance, which is more efficient than plain text and voice communication; in games, a player's expression can control the expression of a game character, improving the player's immersion, and various engaging games can be designed with the technology; in the production of films with spectacular scenes or of animation, a film can be recorded by having actors control the expressions and limbs of a character model, which protects the actors' safety and saves a great deal of time and money. The expression-transfer technology therefore has great research significance.
The invention uses the lightweight network MobileNet to extract facial features, processes the feature points to obtain deformation coefficients recognizable by an animated model, and finally realizes expression mapping in Unity3D. The above analysis shows that the invention can be applied in many fields and has strong commercial value.
Claims (1)
1. An expression tracking method based on a MobileNet model, characterized by comprising the following steps:
S1, generating a training data set through preprocessing, where the preprocessing gives the face in each picture of the data set a plurality of three-dimensional feature coordinates covering the facial features and the outer contour;
S2, constructing a neural network MobileNet model from a standard convolution layer, 12 separable convolution layers, 1 mean-pooling layer, a fully connected layer and Softmax, the 12 separable convolution layers being 6 depthwise convolutions and 6 pointwise convolutions;
training the constructed MobileNet model with the training data set obtained in step S1; the specific method is as follows:
there are 68 three-dimensional feature coordinates in total, the first (standard convolution) layer of the MobileNet model contains 64 convolution kernels, and the height and width of a training-set picture are h and w respectively;
after the first standard convolution layer, the input picture is convolved with stride 2 into a feature map of size (h/2) × (w/2) × 64;
from the second layer, the 12 separable layers are iterated in sequence with stride 1 or 2, processing the feature map into a size of (h/32) × (w/32) × 1024;
the mean-pooling layer normalizes the feature map to a size of 1 × 1 × 1024 with step size m;
finally, the fully connected layer maps the features to 3 × 68 three-dimensional coordinate values, completing the feature extraction of the training-set picture;
S3, obtaining the coordinates of the three-dimensional feature points of the face in the input image with the trained MobileNet model;
S4, performing mesh reconstruction on the extracted three-dimensional feature point coordinates to generate deformation coefficients, and driving the 3D face model to realize expression tracking.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810486472.0A CN108717732B (en) | 2018-05-21 | 2018-05-21 | Expression tracking method based on MobileNet model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108717732A CN108717732A (en) | 2018-10-30 |
CN108717732B true CN108717732B (en) | 2022-05-17 |
Family
ID=63900143
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810486472.0A Active CN108717732B (en) | 2018-05-21 | 2018-05-21 | Expression tracking method based on MobileNet model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108717732B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109711258A (en) * | 2018-11-27 | 2019-05-03 | 哈尔滨工业大学(深圳) | Lightweight face critical point detection method, system and storage medium based on convolutional network |
CN109753996B (en) * | 2018-12-17 | 2022-05-10 | 西北工业大学 | Hyperspectral image classification method based on three-dimensional lightweight depth network |
CN111332305A (en) * | 2018-12-18 | 2020-06-26 | 朱向雷 | Active early warning type traffic road perception auxiliary driving early warning system |
CN110009015A (en) * | 2019-03-25 | 2019-07-12 | 西北工业大学 | EO-1 hyperion small sample classification method based on lightweight network and semi-supervised clustering |
CN110415323B (en) * | 2019-07-30 | 2023-05-26 | 成都数字天空科技有限公司 | Fusion deformation coefficient obtaining method, fusion deformation coefficient obtaining device and storage medium |
CN110782396B (en) * | 2019-11-25 | 2023-03-28 | 武汉大学 | Light-weight image super-resolution reconstruction network and reconstruction method |
CN111191620B (en) * | 2020-01-03 | 2022-03-22 | 西安电子科技大学 | Method for constructing human-object interaction detection data set |
CN113554734A (en) * | 2021-07-19 | 2021-10-26 | 深圳东辉盛扬科技有限公司 | Animation model generation method and device based on neural network |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106778657A (en) * | 2016-12-28 | 2017-05-31 | 南京邮电大学 | Neonatal pain expression classification method based on convolutional neural networks |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10902243B2 (en) * | 2016-10-25 | 2021-01-26 | Deep North, Inc. | Vision based target tracking that distinguishes facial feature targets |
CN106919903B (en) * | 2017-01-19 | 2019-12-17 | 中国科学院软件研究所 | robust continuous emotion tracking method based on deep learning |
CN107273933A (en) * | 2017-06-27 | 2017-10-20 | 北京飞搜科技有限公司 | The construction method of picture charge pattern grader a kind of and apply its face tracking methods |
Non-Patent Citations (1)
Title |
---|
Innovative Breakthroughs of Artificial Intelligence Deep Neural Network Algorithms in the Security Field; Wu Canyi; China Security; 2017-11-01 (No. 11); pp. 67-71 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||