CN112200024B - Two-dimensional facial expression recognition method through three-dimensional deformable model learning


Info

Publication number: CN112200024B (application CN202011018179.5A)
Authority: CN (China)
Prior art keywords: dimensional; expression; point cloud; loss; decoder
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112200024A
Inventors: 时迎琰, 邹乔莎, 史传进
Current and original assignee: Fudan University
Application filed by Fudan University; priority to CN202011018179.5A


Classifications

    • G06V40/174 — Facial expression recognition
    • G06V40/172 — Human faces: classification, e.g. identification
    • G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 — Classification techniques
    • G06N3/045 — Neural networks: combinations of networks
    • G06N3/08 — Neural networks: learning methods


Abstract

The invention belongs to the technical field of image processing and relates to a two-dimensional facial expression recognition method based on three-dimensional deformable model learning. The method reconstructs the three-dimensional expression point cloud corresponding to a two-dimensional facial expression, using an encoder-decoder network structure combined with an expression classifier, an expression classification loss, and a three-dimensional point cloud regression loss. Under end-to-end joint training, the resulting expression parameters are both discriminative, improving the accuracy of two-dimensional facial expression classification, and generative, allowing the corresponding three-dimensional facial expression point cloud to be recovered. The invention thus realizes two-dimensional facial expression classification and three-dimensional facial expression point cloud generation simultaneously and has wide application prospects.

Description

Two-dimensional facial expression recognition method through three-dimensional deformable model learning
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a two-dimensional facial expression recognition method through three-dimensional deformable model learning.
Background
Facial expression recognition plays an important role in emotional artificial intelligence and is widely applied in digital entertainment, driver monitoring, diagnosis of autism in children, and other human-computer interaction systems. Reliable intelligent facial expression recognition technology is therefore receiving increasing attention from both academia and industry.
The three-dimensional deformable model is a statistical parameterized three-dimensional face point cloud model in which the expression parameters describe different facial actions. To classify two-dimensional facial expressions using the expression parameters of a three-dimensional deformable model, ExpNet [1] uses ResNet-101 [2] as a regressor to directly regress the expression parameters from a two-dimensional face picture, and then classifies the obtained parameters with a k-nearest-neighbor algorithm to produce the final expression category.
ExpNet thus separates the extraction stage of the facial expression features from the classification stage, so it cannot achieve the classification effect of end-to-end training. ExpNet therefore still leaves room for optimization.
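As an illustration of this prior-art pipeline, the second, separate stage (classifying already-regressed expression parameters by k-nearest neighbours) can be sketched as follows. The toy data, the 2-dimensional parameters, and k=3 are illustrative assumptions, not values from ExpNet, which regresses 29-dimensional parameter vectors:

```python
import numpy as np

def knn_classify(query_params, train_params, train_labels, k=3):
    """Assign the majority label among the k nearest training parameter vectors."""
    dists = np.linalg.norm(train_params - query_params, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(train_labels[nearest]).argmax()

# Toy regressed expression parameters (2-D here for illustration)
train_params = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
train_labels = np.array([0, 0, 1, 1])  # two expression classes

print(knn_classify(np.array([0.0, 0.05]), train_params, train_labels))  # -> 0
```

Because this classifier has no trainable parameters, nothing about the regression stage can be adjusted to improve classification, which is exactly the limitation the invention addresses.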
To address these problems, the invention provides a two-dimensional facial expression recognition method through three-dimensional deformable model learning, which realizes two-dimensional facial expression classification and three-dimensional facial expression point cloud generation simultaneously and has wide application prospects.
References:
[1] Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gérard G. Medioni. 2018. ExpNet: Landmark-Free, Deep, 3D Facial Expressions. In 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi'an, China, May 15-19, 2018. IEEE Computer Society, 122–129. https://doi.org/10.1109/FG.2018.00027
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
[3] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. 2009. A 3D Face Model for Pose and Illumination Invariant Face Recognition. In Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2009, 2-4 September 2009, Genova, Italy, Stefano Tubaro and Jean-Luc Dugelay (Eds.). IEEE Computer Society, 296–301. https://doi.org/10.1109/AVSS.2009.58
[4] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2014. FaceWarehouse: A 3D Facial Expression Database for Visual Computing. IEEE Trans. Vis. Comput. Graph. 20, 3 (2014), 413–425. https://doi.org/10.1109/TVCG.2013.249
[5] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. 2017. SphereFace: Deep Hypersphere Embedding for Face Recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 6738–6746. https://doi.org/10.1109/CVPR.2017.713
[6] Patrick Lucey, Jeffrey F. Cohn, Takeo Kanade, Jason M. Saragih, Zara Ambadar, and Iain A. Matthews. 2010. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2010, San Francisco, CA, USA, 13-18 June, 2010. IEEE Computer Society, 94–101. https://doi.org/10.1109/CVPRW.2010.5543262
[7] Guoying Zhao, Xiaohua Huang, Matti Taini, Stan Z. Li, and Matti Pietikäinen. 2011. Facial expression recognition from near-infrared videos. Image Vis. Comput. 29, 9 (2011), 607–619. https://doi.org/10.1016/j.imavis.2011.07.002
Disclosure of the Invention
In order to overcome the defects of the prior art, the invention provides a two-dimensional facial expression recognition method based on three-dimensional deformable model learning.
The invention provides a two-dimensional facial expression recognition method based on three-dimensional deformable model learning, comprising the following steps: reconstructing the three-dimensional facial expression point cloud corresponding to a two-dimensional facial expression, adopting an encoder-decoder network structure, and combining an expression classifier, an expression classification loss, and a three-dimensional point cloud regression loss to obtain the final expression parameters under end-to-end joint training. The method enhances discriminability, improving the accuracy of two-dimensional facial expression classification, while retaining generative capability, so that the corresponding three-dimensional facial expression point cloud can be recovered;
the two-dimensional facial expression is a facial image to be subjected to expression classification;
the three-dimensional facial expression point cloud is the part of the three-dimensional face point cloud that remains after the neutral-face point cloud is subtracted, i.e. the point cloud corresponding to the facial action;
the expression parameters are an encoding of the facial expression, from which the corresponding three-dimensional facial expression point cloud can be recovered by the decoder;
the encoder is used for encoding the input two-dimensional facial expression and outputting expression parameters;
the decoder is used for decoding the input expression parameters and outputting three-dimensional facial expression point clouds;
the expression classifier adopts a SoftMax classification layer and outputs the prediction probability of each expression;
the expression classification loss adopts cross-entropy loss and is applied to the expression parameters through the expression classifier;
the three-dimensional point cloud regression loss adopts mean-square-error (MSE) loss and is applied to the output of the decoder.
The method provided by the invention adopts end-to-end joint training, as follows:
First, a pre-training stage: the encoder is trained with two-dimensional face pictures and the expression parameters of the corresponding three-dimensional deformable model, and the decoder is trained with the expression parameters of the three-dimensional deformable model and the corresponding three-dimensional facial expression point clouds, so that the encoder and the decoder acquire their preliminary functions.
Then, a joint training stage: the pre-trained encoder and decoder are trained jointly; that is, the expression parameters output by the encoder are fed into the decoder, generating the three-dimensional facial expression point cloud corresponding to the face picture end to end. During joint training, the expression classification loss is applied through the expression classifier to the expression parameters output by the encoder, making the spatial distribution of the expression parameters more discriminative, increasing the inter-class distance, and improving the accuracy of expression classification. Meanwhile, the mean-square-error loss is applied to the three-dimensional facial expression point cloud output by the decoder, so that the final expression parameters can still generate the corresponding three-dimensional facial expression point cloud.
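The two-stage scheme above can be sketched end to end with toy linear stand-ins for the encoder, decoder, and classifier. All dimensions, weight matrices, and the feature-vector input are illustrative assumptions; the actual method uses a SphereFace encoder on face images, 29-dimensional expression parameters, and a 159645-dimensional point cloud:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shrunken dimensions for illustration only
D_PARAM, D_CLOUD, N_CLASS = 4, 9, 3

W_enc = rng.normal(size=(16, D_PARAM))       # "encoder": image features -> expression params
W_dec = rng.normal(size=(D_PARAM, D_CLOUD))  # "decoder": params -> 3D expression point cloud
W_cls = rng.normal(size=(D_PARAM, N_CLASS))  # SoftMax classification layer weights
b_cls = np.zeros(N_CLASS)

def softmax(z):
    z = z - z.max()  # numerical stability; probabilities unchanged
    e = np.exp(z)
    return e / e.sum()

def forward(image_feat):
    p = image_feat @ W_enc          # p = E(I): expression parameters
    S = p @ W_dec                   # S = D(p): reconstructed point cloud
    P = softmax(p @ W_cls + b_cls)  # predicted class probabilities
    return p, S, P

def joint_loss(P, label, S, T, w_cls=1.0, w_reg=1.0):
    ce = -np.log(P[label])       # cross-entropy against a one-hot target
    mse = np.mean((S - T) ** 2)  # point cloud regression loss
    return w_cls * ce + w_reg * mse

I = rng.normal(size=16)       # stand-in for an encoded face image
T = rng.normal(size=D_CLOUD)  # stand-in ground-truth expression point cloud
p, S, P = forward(I)
loss = joint_loss(P, label=1, S=S, T=T)
```

Both losses back-propagate through the shared expression parameters p, which is what couples discriminability (classification) with generativity (reconstruction) in the joint training.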
The technical effect of the invention is that expression category prediction for an input face image is realized by the encoder and the expression classifier, while the three-dimensional facial expression point cloud is generated by the encoder and the decoder. Realizing both functions simultaneously gives the method wide application prospects.
Drawings
Fig. 1 is a simplified schematic diagram of the framework of the present invention.
Detailed Description
The present invention will be described more fully hereinafter with reference to the accompanying drawings, which illustrate preferred embodiments of the invention; the invention is not to be considered as limited to the embodiments set forth herein.
Fig. 1 shows a simplified schematic diagram of the two-dimensional facial expression recognition method through three-dimensional deformable model learning according to the present invention. The three-dimensional deformable model combines the Basel Face Model [3] and FaceWarehouse [4]; the expression parameters are therefore 29-dimensional vectors and the three-dimensional point cloud has 53215 vertices. The encoder is SphereFace [5] with a 29-dimensional fully connected layer added on top to output the 29-dimensional expression parameters. The decoder is a two-layer multilayer perceptron whose middle layer has dimension 1000 and whose output layer has dimension 159645. The expression classifier adopts a SoftMax classification layer; the expression classification loss adopts cross-entropy loss and is applied to the expression parameters through the expression classifier; the three-dimensional point cloud regression loss adopts MSE loss and is applied to the output of the decoder.
Specifically, let I be the input two-dimensional face picture, E the encoder, D the decoder, p = [p_1, p_2, …, p_29] the 29-dimensional expression parameter vector, and S = [x_1, y_1, z_1, x_2, y_2, z_2, …, x_53215, y_53215, z_53215] the 159645-dimensional point cloud coordinate vector; then p = E(I) and S = D(p) = D(E(I)). The SoftMax classification layer of the expression classifier computes P_j = exp(w_j·p + b_j) / Σ_l exp(w_l·p + b_l), where P_j is the predicted probability of the j-th expression class and exp(·) is the natural exponential function. The cross-entropy loss is H(P, Q) = −Σ_j Q_j log(P_j), where P is the probability distribution predicted by the expression classifier and Q is the true class distribution. The MSE loss is L(S, T) = Σ_i (S_i − T_i)² / 159645, where S_i is the i-th dimension of the decoder output and T_i is the i-th dimension of the ground-truth point cloud.
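These quantities translate directly into code. The following is a plain NumPy restatement of the SoftMax layer, the cross-entropy loss, and the MSE loss, with a small toy dimension standing in for the 29-dimensional parameters and the 159645-dimensional point cloud:

```python
import numpy as np

def softmax_probs(p, W, b):
    """P_j = exp(w_j·p + b_j) / Σ_l exp(w_l·p + b_l)."""
    z = W @ p + b
    z = z - z.max()  # stability shift; leaves the probabilities unchanged
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(P, Q):
    """H(P, Q) = -Σ_j Q_j log(P_j); Q is the true class distribution."""
    return -np.sum(Q * np.log(P))

def mse_loss(S, T):
    """L(S, T) = Σ_i (S_i - T_i)² / dim(S); dim is 159645 in the patent."""
    return np.mean((S - T) ** 2)

p = np.array([0.5, -1.0, 2.0])  # toy 3-D expression parameter vector
P = softmax_probs(p, np.zeros((4, 3)), np.zeros(4))  # zero weights -> uniform over 4 classes
```

With zero weights and biases the layer outputs the uniform distribution over the 4 classes, so the cross-entropy against any one-hot target is log 4.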
In the pre-training stage, the encoder is trained by using the two-dimensional face picture and the expression parameters of the corresponding three-dimensional deformable model, and the decoder is trained by using the expression parameters of the three-dimensional deformable model and the corresponding three-dimensional face expression point cloud.
And in the joint training stage, expression classification loss and three-dimensional point cloud regression loss are applied simultaneously, and the pre-trained encoder and decoder are subjected to end-to-end joint training.
Specifically, in the joint training stage, two loss constraints, the expression classification loss and the three-dimensional point cloud regression loss, are applied simultaneously to the encoder-decoder neural network. To make the spatial distribution of the expression parameters more separable, the weight of the three-dimensional point cloud regression loss is reduced at the initial stage of joint training while the expression classification loss keeps weight 1. Once the expression classification loss saturates, the weight of the three-dimensional point cloud regression loss is increased so that both losses have weight 1, ensuring that the generated three-dimensional facial expression is natural and realistic.
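The schedule described above can be sketched as a simple two-phase rule. The reduced weight of 0.1 is an illustrative assumption: the description states only that the regression weight starts low and that both weights end at 1:

```python
def loss_weights(classification_loss_saturated):
    """Return (w_cls, w_reg) for the current phase of joint training."""
    if not classification_loss_saturated:
        return 1.0, 0.1  # early phase: regression loss down-weighted (0.1 assumed)
    return 1.0, 1.0      # after saturation: both losses at weight 1

def total_loss(ce, mse, classification_loss_saturated):
    """Weighted sum of the cross-entropy and point cloud regression losses."""
    w_cls, w_reg = loss_weights(classification_loss_saturated)
    return w_cls * ce + w_reg * mse
```

Down-weighting the regression term early lets the classification loss reshape the parameter space first; restoring it later pulls the parameters back toward faithful point cloud reconstruction.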
Finally, expression category prediction of the input face image can be realized through an encoder and an expression classifier, and a three-dimensional face expression point cloud can be generated through the encoder and a decoder.
The method of the invention was evaluated on two common facial expression datasets, the Extended Cohn-Kanade (CK+) database [6] and the Oulu-CASIA database [7]. Under subject-independent 10-fold cross validation, with the mean per-class classification accuracy as the evaluation metric, the expression classification accuracies are 77.2% and 72.8%, respectively. A reimplementation of the ExpNet [1] feature extraction and classification pipeline using ResNet-34 [2] and a support vector machine achieves 63.3% and 56.4%, clearly behind the proposed method. The method of the invention, which learns the three-dimensional deformable model with an end-to-end encoder-decoder structure, thus effectively improves the classification of two-dimensional facial expression pictures.
The foregoing describes embodiments of the invention by way of specific examples; other advantages and benefits of the invention will be apparent to those skilled in the art from this disclosure. The invention may be practiced in other and different embodiments, and its details may be modified in various respects, all without departing from the spirit and scope of the invention.

Claims (4)

1. A two-dimensional facial expression recognition method based on three-dimensional deformable model learning is characterized in that a three-dimensional deformable model learning technology is used for reconstructing three-dimensional facial expression point clouds corresponding to two-dimensional facial expressions, and expression parameters are finally obtained under end-to-end joint training by adopting a network structure of an encoder-decoder and combining an expression classifier, expression classification loss and three-dimensional point cloud regression loss; wherein:
the two-dimensional facial expression is a facial image to be subjected to expression classification;
the three-dimensional facial expression point cloud is the part of the three-dimensional face point cloud that remains after the neutral-face point cloud is subtracted, i.e. the point cloud corresponding to the facial action;
the expression parameters are codes of facial expressions, and corresponding three-dimensional facial expression point clouds can be recovered through a decoder;
the encoder is used for encoding the input two-dimensional facial expression and outputting expression parameters;
the decoder is used for decoding the input expression parameters and outputting three-dimensional facial expression point clouds;
the expression classifier adopts a SoftMax classification layer and outputs the prediction probability of each expression;
the expression classification loss is applied to the expression parameters through an expression classifier by adopting cross entropy loss;
the three-dimensional point cloud regression loss adopts mean-square-error (MSE) loss and is applied to the output of the decoder.
2. The two-dimensional facial expression recognition method based on three-dimensional deformable model learning according to claim 1, characterized in that the method employs end-to-end joint training, specifically as follows:
the first is a pre-training phase: respectively training an encoder by using the two-dimensional face picture and the expression parameters of the corresponding three-dimensional deformable model, and training a decoder by using the expression parameters of the three-dimensional deformable model and the corresponding three-dimensional face expression point cloud; thus, the encoder and the decoder have preliminary functions;
then a joint training phase: jointly training the pre-trained encoder and decoder, namely inputting expression parameters output by the encoder into the decoder, and generating three-dimensional facial expression point clouds corresponding to the facial pictures end to end; in the combined training process, expression classification loss is applied to expression parameters output by the encoder through the expression classifier, so that the distribution of the expression parameters in the space is more discriminative, the inter-class distance is increased, and the accuracy of expression classification is improved; and meanwhile, the mean square error loss is applied to the three-dimensional facial expression point cloud output by the decoder, so that the finally obtained expression parameters can still generate the corresponding three-dimensional facial expression point cloud.
3. The two-dimensional facial expression recognition method based on three-dimensional deformable model learning according to claim 2, characterized in that the three-dimensional deformable model adopts the Basel Face Model and FaceWarehouse, so that the expression parameters are 29-dimensional vectors and the three-dimensional point cloud has 53215 vertices; the encoder adopts SphereFace with a 29-dimensional fully connected layer added on top to output the 29-dimensional expression parameters; and the decoder adopts a two-layer multilayer perceptron whose middle layer has dimension 1000 and whose output layer has dimension 159645.
4. The two-dimensional facial expression recognition method based on three-dimensional deformable model learning according to claim 2, characterized in that in the joint training stage, two loss constraints, the expression classification loss and the three-dimensional point cloud regression loss, are applied simultaneously to the neural network of the encoder-decoder structure; to make the spatial distribution of the expression parameters more separable, the weight of the three-dimensional point cloud regression loss is reduced at the initial stage of joint training while the expression classification loss keeps weight 1; and when the expression classification loss saturates, the weight of the three-dimensional point cloud regression loss is increased so that both losses have weight 1, ensuring that the generated three-dimensional facial expression is natural and realistic.
CN202011018179.5A 2020-09-24 2020-09-24 Two-dimensional facial expression recognition method through three-dimensional deformable model learning Active CN112200024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011018179.5A CN112200024B (en) 2020-09-24 2020-09-24 Two-dimensional facial expression recognition method through three-dimensional deformable model learning


Publications (2)

Publication Number Publication Date
CN112200024A (en) 2021-01-08
CN112200024B (en) 2022-10-11

Family

ID=74006606


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011963B (en) * 2023-10-07 2023-12-08 四川金投科技股份有限公司 Intelligent lock and intelligent door control system based on electronic key

Citations (3)

Publication number Priority date Publication date Assignee Title
CN108171770A (en) * 2018-01-18 2018-06-15 中科视拓(北京)科技有限公司 A kind of human face expression edit methods based on production confrontation network
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
CN111445582A (en) * 2019-01-16 2020-07-24 南京大学 Single-image human face three-dimensional reconstruction method based on illumination prior

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
KR20170000748A (en) * 2015-06-24 2017-01-03 삼성전자주식회사 Method and apparatus for face recognition
FR3051951B1 (en) * 2016-05-27 2018-06-15 Mimi Hearing Technologies GmbH METHOD FOR PRODUCING A DEFORMABLE MODEL IN THREE DIMENSIONS OF AN ELEMENT, AND SYSTEM THEREOF
US10380413B2 (en) * 2017-07-13 2019-08-13 Robert Bosch Gmbh System and method for pose-invariant face alignment
KR102260214B1 (en) * 2019-08-21 2021-06-03 엘지전자 주식회사 Method of analysis sleeping and intelligent device performing sleep analysis based on artificial intelligence


Non-Patent Citations (4)

Title
"Fully Automated Facial Expression Recognition Using 3D Morphable Model and Mesh-Local Binary Pattern"; Hela Bejaoui et al.; International Conference on Advanced Concepts for Intelligent Vision Systems; 2017-11-30 *
"MeshGAN: Non-linear 3D Morphable Models of Faces"; Shiyang Cheng et al.; arXiv; 2019-03-25 *
"A Survey of 3D Facial Expression Acquisition and Reconstruction Techniques" (in Chinese); Wang Shan et al.; Journal of System Simulation; 2018-07-30 *
"3D Deformable Face Reconstruction and Real-Time Expression-Driven System" (in Chinese); Wu Di; China Masters' Theses Full-text Database, Information Science & Technology; 2020-03-15 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant