CN112070058A

CN112070058A - Face and face composite emotional expression recognition method and system

Info

Publication number: CN112070058A
Application number: CN202010985959.0A
Authority: CN
Inventors: 陈海波; 罗志鹏; 张治广
Original assignee: Shenyan Technology Beijing Co ltd
Current assignee: Shenyan Technology Beijing Co ltd
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2020-12-11

Abstract

The invention discloses a face and face composite emotional expression recognition method, which comprises the following steps: carrying out face detection on the image, and extracting key points of facial features; calculating distance measurement between the key points to obtain a geometric representation vector of the face; constructing a double-branch face detection network, and obtaining a first feature vector from a face image through a first branch network structure; obtaining a second feature vector by the obtained geometric representation vector of the face through a second branch network structure; connecting the first feature vector with the second feature vector to obtain a third feature vector, and obtaining the expression category confidence of the current face image; and constructing a multi-classification loss function of the face detection network to perform optimization solution, and predicting the expression category. The method has higher identification precision for the composite emotional expression category expressed by the face in the high-resolution image, and the proposed model has stronger robustness and has good identification effect for classifying the micro-expressions of the face.

Description

Face and face composite emotional expression recognition method and system

Technical Field

The invention belongs to the technical field of image processing and computer vision, and particularly relates to a method and a system for recognizing face and face composite emotional expressions.

Background

In recent years, with the continuous update of intelligent devices, the continuous change of algorithms such as machine learning and deep learning, the development of face recognition technology is more and more mature, and the face recognition technology is widely applied to various large application platforms and daily life. Meanwhile, as an important branch of the field of face Recognition, Facial Expression Recognition (FER) is also regarded by more researchers. Facial expression recognition has gained wide attention in many fields, such as human-computer interaction, driver fatigue monitoring, intelligent robots, smart medicine, and the like. There are at least 21 facial expressions in human, and in addition to 6 common happiness, surprise, sadness, anger, disgust and fear, 15 distinguishable compound expressions such as surprise + surprise, anger + anger, etc., it is of course possible to further refine the facial expression categories according to different criteria.

Generally, the facial expression recognition algorithm mainly includes four steps: obtaining a face image, detecting a face, extracting face features and classifying the face features. Generally speaking, the facial expression recognition algorithm can be divided into a traditional research method and a deep learning-based research method. In the traditional research method, the face feature extraction and classification are often divided into two independent parts. Firstly, a mathematical method is adopted, a computer technology is used for processing a facial expression image, expression characteristics are extracted, and then a classifier is used for classifying the facial expression characteristics, so that the category of the expression is determined. The traditional feature extraction algorithm mainly comprises a principal component analysis method, a linear discriminant analysis method, an independent component analysis method and the like. A comparison method [1] (from stove, Tang Jing Hai, Lijing Wen, and the like.) support vector identification analysis and application in facial expression recognition [ J ]. electronic newspaper 2008,36(4): 725-. The conventional feature classification algorithm can be mainly classified into a classification method based on distance measurement and a classification method based on a bayesian network. The former mainly accomplishes the classification task by computing distance measures between data. The typical algorithm mainly comprises a nearest neighbor method and an SVM algorithm. The nearest neighbor method is used for classifying by comparing the distance between the sample to be predicted and the predicted sample, and whether the sample to be predicted and a certain predicted sample belong to the same class is determined by the distance. The SVM algorithm optimizes the objective function by finding a hyperplane that maximizes the distance between samples of different classes. The classification method based on the Bayesian network infers the unknown expression probability by analyzing the known expression information. The facial expression recognition method based on deep learning generally integrates the processes of facial feature extraction and classification into a network. The deep learning network has better feature extraction capability on the image, and the extracted features have rich semantic information, so that the complicated process of manually extracting the features is avoided. The facial expression recognition network based on deep learning usually extracts features of a facial image through a plurality of convolutional neural network layers, and then accesses a full connection layer to realize nonlinear classification. The number of the final neurons is determined by the type of the facial expression, and finally the probability value of each type is obtained through a softmax function. A comparison method [2] (Huiyuan Yang, Umur Ciftci, Lijun Yin. facial Expression registration by De-Expression Recognition Residue learning. in Proceedings of the IEEE Conference on Computer Vision and Pattern Registration (CVPR),2018, pp.2168-2177.) proposes a residual Expression Recognition algorithm based on cGAN (conditional GAN) and Expression element filtering. The method comprises the steps of filtering neutral elements of a face image through a cGAN network, and processing residual expression elements by using MLP (multi-level hierarchical processing), so that high-precision recognition of face expressions is realized. However, the algorithm only realizes the recognition of the basic seven types of expressions, the similarity between the types is higher for the compound expressions, the quality of the algorithm cannot be verified, and meanwhile, the GAN network is sampled, so that the training difficulty of the model is increased.

Disclosure of Invention

1. Objects of the invention

The method and the system for recognizing the facial expression of the human face are provided aiming at the problem that the facial expression of the human face is difficult to recognize under the condition of compound emotion in a high-resolution image.

2. The technical scheme adopted by the invention

A face and face composite emotional expression recognition method comprises the following steps:

s01: carrying out face detection on the image, and extracting key points of facial features;

s02: calculating distance measurement between the key points to obtain a geometric representation vector of the face;

s03: constructing a double-branch face detection network, wherein the double-branch face detection network comprises a first branch network structure and a second branch network structure, and a face image is subjected to the first branch network structure to obtain a first feature vector; obtaining a second feature vector by the obtained geometric representation vector of the face through a second branch network structure; the first feature vector and the second feature vector have the same size, and the first feature vector and the second feature vector are connected to obtain a third feature vector, so that the expression category confidence of the current face image is obtained;

s04: and constructing a multi-classification loss function of the face detection network to perform optimization solution, and predicting the expression category.

In a preferred technical solution, before the step S02, the method further includes marking the extracted key points, and performing a preprocessing operation on the image.

In a preferred embodiment, the method for detecting a human face in step S01 includes:

s11: taking an image containing a human face as a positive sample, taking an image not containing the human face as a negative sample, respectively extracting directional gradient histogram features from a certain number of positive and negative samples, and obtaining a directional gradient histogram feature descriptor;

s12: training the positive and negative samples by using a support vector machine algorithm to obtain a trained model for realizing secondary classification;

s13: carrying out hard-to-separate sample mining on the trained model, wherein the hard-to-separate sample mining comprises the steps of scaling negative sample data in a training set, matching with a template, and carrying out searching matching through a template sliding window; and if the false detection occurs, intercepting the false detection face area and adding the false detection face area into the negative sample data.

In a preferred technical scheme, the preprocessing operation comprises a first layer of regression training and a second layer of regression training;

the first layer of regression training comprises the following steps:

representing the data organization form in the first layer regression training as

Wherein, I_πiIs an image of a human face in the training data set,

for predicted keypoint locations, Δ S, of the t-th layer in the first layer regression_i ^(t)Is the difference between the predicted value and the true value of the t-th layer, and the iterative formula is as follows:

wherein I represents the input of each layer in the iterative process;

continuously iterating according to the iteration mode, and generating gamma when the regression cascade layer number of the first layer is set to be K layers₁,γ₂,…,γ_kRegressors, namely regression models obtained through training;

the second layer of regression training comprises the error delta S of each first layer after regression is completed_i ^(t)As input to each second-level regression, each regressor gamma is determined by a gradient lifting tree algorithm_t。

In a preferred technical solution, the obtaining of the geometric representation vector of the face in step S02 includes:

s21: calculating the distance between each feature key point and the feature key point at the nose:

l′⁽ⁱ⁾＝l⁽ⁱ⁾-l⁽³⁰⁾

wherein, l is a key point vector value, i is a characteristic key point number, and l⁽³⁰⁾The characteristic key points at the nose;

s22: then use the average key point face lm⁽ⁱ⁾Instead of the original face image, the formula is as follows:

wherein N is the number of samples of each face image, and j is a sampling number;

s23: obtaining a geometric representation vector of the human face:

lr⁽ⁱ⁾＝l′⁽ⁱ⁾-lm⁽ⁱ⁾。

in a preferred embodiment, in step S03, the first branch network structure is designed based on an AlexNet network structure, the last two full connection layers of the AlexNet structure are removed from the first branch network structure, other structures remain unchanged, and batch normalization operation is added after each convolution layer to obtain a first feature vector with a size of 256 dimensions.

In a preferred embodiment, in step S03, the second branch network structure is formed by a full connection layer without a bias term, the geometric representation vector obtains a 256-dimensional second feature vector through the second branch network structure, and sends the obtained third feature vector to the last full connection layer to obtain a feature vector F with an output size of 512 dimensions.

In a preferred technical solution, the face detection network multi-classification loss function constructed in step S04 is composed of two parts, where the first part of the loss function predicts the probability that an expression belongs to each category using a softmax function, and the formula is as follows:

p represents the probability of predicting a sample of class x as j, y is an indicator variable, where z is_i、Z_kThe prediction results of the ith and the K-th classes are shown, and K shows the number of the classes of the expression;

calculating the uncertainty between the predicted output value and the true tag value using a cross entropy loss function, the formula is as follows:

wherein C represents the number of predicted expression categories;

and the second part of loss function optimizes the distribution of the characteristics among different classes by using the triple loss function, and the formula is as follows:

l_tri＝[α+d_p-d_n]₊

wherein d is_pIs the characteristic distance of the positive sample pair, d_nIs the characteristic distance of the negative sample pair, alpha is the minimum separation between the two distances, [ z ]]₊Represents the function max (z, 0);

and adding the two loss functions to obtain the overall network loss function.

The invention also discloses a face and face composite emotional expression recognition system, which comprises:

the face detection extraction module is used for carrying out face detection on the image and extracting key points of facial features;

the face geometric representation module is used for calculating distance measurement between the key points to obtain a geometric representation vector of the face;

the double-branch face prediction module is used for constructing a double-branch face detection network, comprises a first branch network structure and a second branch network structure, and obtains a first feature vector from a face image through the first branch network structure; obtaining a second feature vector by the obtained geometric representation vector of the face through a second branch network structure; the first feature vector and the second feature vector have the same size, and the first feature vector and the second feature vector are connected to obtain a third feature vector, so that the expression category confidence of the current face image is obtained;

and the category prediction module is used for constructing a face detection network multi-classification loss function to carry out optimization solution and predicting the expression category.

In a preferred technical solution, the system further comprises an image preprocessing module, configured to mark the extracted key points and perform preprocessing operation on the image.

3. Advantageous effects adopted by the present invention

(1) A robust network structure is designed for realizing the function of recognizing facial expressions, the feature information of key points of the human face is used as one input of the network, the spatial geometric information of the image is used for assisting in recognition, and meanwhile, the other branch network extracts rich image texture information. Through a large number of case tests, the method has higher identification precision for the composite emotional expression categories expressed by the human face in the high-resolution image, and the proposed model has stronger robustness and has good identification effect on the classification of the human face micro-expression.

(2) The method adopts a Dlib face detection algorithm to carry out face detection on the image, extracts face feature key points and uses the key points as the basis of a subsequent identification process; marking the extracted key points by using a Face Alignment algorithm, and reducing the size of the image by using an image cutting algorithm so as to carry out preprocessing operation on the high-resolution image; and calculating distance measurement between the key points, using the average key point human face to replace the original human face image, calculating the space geometric characteristic information of the human face image, and assisting the whole recognition process. A double-branch face detection network is designed, and a face imaging branch is designed based on an AlexNet network and is mainly used for extracting rich texture feature information of a face image; the human face characteristic point branch is composed of a full connection layer and is identified by the aid of human face key point characteristic information; a Cross Entropy Loss function (Cross entry Loss) and a triple Loss (triple Loss) are adopted to design a multi-classification Loss function of the face detection network, so that positive samples of the same type of samples are closer, and negative samples of different types of samples are farther.

Drawings

FIG. 1 is a flow chart of a method for recognizing facial compound emotional expressions according to the present invention;

FIG. 2 is a schematic diagram of a face alignment algorithm in the present embodiment;

fig. 3 is a schematic diagram of a network structure in the present embodiment;

FIG. 4 is a diagram of the architecture of the facial compound emotional expression recognition system of the present invention.

Detailed Description

The technical solutions in the examples of the present invention are clearly and completely described below with reference to the drawings in the examples of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without inventive step, are within the scope of the present invention.

The present invention will be described in further detail with reference to the accompanying drawings.

Example 1

As shown in fig. 1, a method for recognizing compound emotional expressions of human faces and faces includes the following steps:

s03: constructing a double-branch face detection network, wherein the double-branch face detection network comprises a first branch network structure and a second branch network structure, a first feature vector is obtained by a face image through the first branch network structure, and texture features of the face image are extracted; obtaining a second feature vector by the obtained geometric representation vector of the face through a second branch network structure; the first feature vector and the second feature vector have the same size, and the first feature vector and the second feature vector are connected to obtain a third feature vector, so that the expression category confidence of the current face image is obtained;

In a preferred embodiment, after the step S01, the step S02 further includes marking the extracted key points, and performing a pre-processing operation on the image.

In a preferred embodiment, the face detection method in step S01 includes the following steps:

s11: taking an image containing a human face as a positive sample, taking an image not containing the human face as a negative sample, respectively extracting directional gradient histogram (Hog) features from a certain number of positive and negative samples, and acquiring a directional gradient histogram (Hog) feature descriptor; in particular, the amount of negative sample data is much larger than the amount of positive sample data, so that more data amount can be obtained by randomly clipping the negative sample.

S12: training positive and negative samples by using a Support Vector Machine (SVM) algorithm to obtain a trained model for realizing two-classification;

And obtaining the finally trained model through the steps. The classifier is used for detecting the face pictures, the pictures with different sizes are subjected to sliding scanning, the Hog characteristics of the pictures are sequentially extracted, and finally the pictures are classified by the classifier. And if the classification result is the face, calibrating the face. And if the same face is calibrated for multiple times after one round of sliding scanning, and the redundant face is removed by adopting NMS operation.

In a preferred embodiment, as shown in fig. 2, the step of marking the extracted key points and performing a preprocessing operation on the image includes:

a mathematical model was built using two-layer regression. The iterative formula of the first-layer regression is as follows:

wherein, S is a shape vector and stores the position information of all key feature points of the face. Where I denotes the input, γ, for each layer in the iterative process_tIs one layerAnd the input quantity of the regressor is the current shape variable and the training image corresponding to the shape variable, and the output quantity of the regressor is the position updating quantity of all the shape variables on the training image. Therefore, in the cascade regressor in the first layer, the positions of all key feature points in the training image are updated once every time the cascade regressor passes through the first-level cascade regressor, so that a more correct position is achieved. Gamma ray_tThe inner part is also the first regression, i.e. the second layer regression. The target of the second-order regression is the difference between the current predicted value and the true value.

The first layer of regression training process is described below. First, a training data set is represented as (I)₁,S₁)，(I₂,S₂)，…，(I_n,S_n) Wherein, I_iRepresenting the ith image, S_iIndicating the location of the corresponding feature keypoints in the image. The data organization form in the first layer regression training can be expressed as

Wherein, I_πiIs an image of a human face in the training data set,

for predicted keypoint locations, Δ S, of the t-th layer in the first layer regression_i ^(t)Is the difference between the predicted value and the true value of the t-th layer.

The iterative formula is shown in formula (1), Δ S_i ^(t)The iteration formula is specifically as follows:

the iteration is continuously carried out according to the iteration mode shown above, and when the regression cascade layer number of the first layer is set to be K layers, gamma is generated₁,γ₂,…,γ_kAnd the regressors are regression models obtained by training.

The second layer of regression training process determines each gamma_tHow to train the training is achieved, the method is realized by adopting a Gradient Boosting Tree Algorithm (Gradient Boosting Tree Algorithm). The error Delta S after each first layer regression is completed_i ^(t)As input to each second-level regression, each regressor gamma is determined by a gradient lifting tree algorithm_t。

Through the above steps, a plurality of feature key points are detected from each face image, and the number of the feature key points may be preset, and in this embodiment, the number is 68.

In a preferred embodiment, the method for obtaining the geometric representation vector of the human face in step S02 includes the following steps:

l′⁽ⁱ⁾＝l⁽ⁱ⁾-l⁽³⁰⁾

wherein l is a key point vector value, i is a characteristic key point number, i is more than or equal to 0 and less than or equal to 68, and l⁽³⁰⁾The characteristic key points at the nose;

wherein, N is the number of samples of each face image, and in this embodiment, 250 is adopted, and j is a sample number;

s23: obtaining a geometric representation vector of the human face:

lr⁽ⁱ⁾＝l′⁽ⁱ⁾-lm⁽ⁱ⁾。

in a preferred embodiment, as shown in FIG. 3, the network structure at this stage consists of two branches B₁，B₂And (4) forming. Wherein, B₁The branch is an imaging branch which is designed based on an AlexNet network structure. The network structure of the original AlexNet consists of five convolutional layers (Conv)₁，Conv₂，Conv₃，Conv₄，Conv₅) And three full connection layers (FC)_{1_1}，FC_{1_2}，FC_{1_3}) The imaging branch in the invention removes the last two full connection layers (FC) of the AlexNet structure_{1_2}，FC_{1_3}) Other structures remain unchanged and are on each convolutional layer (Conv)₁，Conv₂，Conv₃，Conv₄，Conv₅) The batch normalization (batch normalization) operation was added later. Inputting the original face Image into imaging branch to obtain 256-dimensional characteristic vector V₁And extracting the texture features of the face image. The main function of the Imaging branch is to capture richer face image semantic information as much as possible.

In the home network B₂The branch structure is a landworks branch, and the input of the branch is a geometric representation vector of the facial expression, namely the geometric representation obtained in the last step. Landmarks branches are formed by a fully-connected layer (FC) without bias terms_{2_1}) And (4) forming. After the geometric representation variable passes through the branch structure, the output characteristic vector V with the size of 256 dimensions is obtained₂. Finally, the output characteristic vectors (V) with the same size obtained by the Imaging branch and the Landmarks branch are divided₁，V₂) Concatenated (concatenated) together to form a new vector V₃And forming the newly formed vector V₃Into the last full connection layer (FC)_final) And obtaining a feature vector F with the output size of 512 dimensions. Via a full connection layer FC_finalAnd then, obtaining the expression category confidence of the current face image.

In this embodiment, a multi-classification loss function is designed in the network. Expressions in the face image are divided into 50 category labels. The number of category labels may be predetermined. The specific labeling method may adopt an existing labeling method, and this embodiment is not described in detail.

The face detection network multi-classification loss function is composed of two parts, wherein the first part of the loss function predicts the probability that an expression belongs to each class by using a softmax function, and the formula is as follows:

wherein C represents the number of predicted expression categories;

l_tri＝[α+d_p-d_n]₊

and adding the two loss functions to obtain the overall network loss function.

For each mini-batch, we set a batch size to P K, in this case we take P to 32 and K to 2. The method carries out data enhancement operation, carries out horizontal turning operation on each image and the corresponding key point thereof, trains a model by adopting a Stochastic Gradient Descent (SGD), and saves the temporary parameters of the model as a checkpoint file after each epoch is trained.

In order to verify the effectiveness of the method, the experimental example compares the effect of a common micro expression data set CASME2 and the like with that of the existing identification method, and the result shows that the method has higher identification precision for the composite emotional expression type expressed by the face in the high-resolution image, and the proposed model has stronger robustness and has good identification effect for the classification of the micro expression of the face.

In another embodiment, a facial complex emotional expression recognition system is provided, which corresponds to the facial complex emotional expression recognition method in the above embodiments one to one, as shown in fig. 4, the facial complex emotional expression recognition system includes a face detection and extraction module 10, an image preprocessing module 20, a face geometric representation module 30, a dual-branch face prediction module 40, and a category prediction module 50. The functional modules are explained in detail as follows:

the face detection extraction module 10 is used for carrying out face detection on the image and extracting key points of facial features;

the face geometric representation module 30 calculates the distance measurement between the key points to obtain a geometric representation vector of the face;

the double-branch face prediction module 40 is used for constructing a double-branch face detection network, and comprises a first branch network structure and a second branch network structure, wherein the face image obtains a first feature vector through the first branch network structure, and the texture feature of the face image is extracted; obtaining a second feature vector by the obtained geometric representation vector of the face through a second branch network structure; the first feature vector and the second feature vector have the same size, and the first feature vector and the second feature vector are connected to obtain a third feature vector, so that the expression category confidence of the current face image is obtained;

and the category prediction module 50 is used for constructing a face detection network multi-classification loss function to carry out optimization solution and predicting the expression category.

And the image preprocessing module 20 is configured to mark the extracted key points and perform preprocessing operation on the image.

The specific implementation method of each module may refer to the face-face composite emotional expression recognition method in the above embodiment, and details are not repeated in this embodiment.

The face-face composite emotional expression recognition system provided by the embodiment of the invention is applied to the environments of the client and the server, and the client and the server communicate through a network to solve the problem that face expression information in an image cannot be accurately acquired. The client is also called a user side, and refers to a program corresponding to the server and providing local services for the client. The client may be installed on, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented by an independent server or a server cluster composed of a plurality of servers.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A face and face composite emotional expression recognition method is characterized by comprising the following steps:

2. The method for recognizing composite emotional expression of human faces and faces according to claim 1, wherein the step S02 is preceded by labeling the extracted key points and performing a preprocessing operation on the images.

3. The method for recognizing composite emotional expression of human face according to claim 1, wherein the method for detecting human face in step S01 comprises:

4. The method for recognizing composite emotional expression of human faces and faces according to claim 2, wherein the preprocessing operation comprises a first layer of regression training and a second layer of regression training;

the first layer of regression training comprises the following steps:

Wherein, I_πiIs an image of a human face in the training data set,

wherein, I represents the input of each layer in the iteration process;

the second layer of regression training comprises the error delta S of each first layer after regression is completed_i ^(t)As input to each second-level regression, each level of regressor gamma is determined by a gradient lifting tree algorithm_t。

5. The method for recognizing composite emotional expression of human face according to claim 1, wherein the step S02 of obtaining geometric expression vector of human face includes:

l′⁽ⁱ⁾＝l⁽ⁱ⁾-l⁽³⁰⁾

s23: obtaining a geometric representation vector of the human face:

lr⁽ⁱ⁾＝l′⁽ⁱ⁾-lm⁽ⁱ⁾。

6. the method for recognizing composite emotional expression of human faces and faces according to claim 1, wherein in step S03, the first branch network structure is designed based on an AlexNet network structure, the last two fully connected layers of the AlexNet structure are removed from the first branch network structure, other structures remain unchanged, and batch normalization operation is added after each convolution layer to obtain a first feature vector with 256 dimensions.

7. The method for recognizing composite emotional expression of human faces and faces according to claim 1 or 6, wherein in step S03, the second branch network structure is composed of a full-connected layer without bias terms, the geometric representation vector obtains a 256-dimensional second feature vector through the second branch network structure, and the obtained third feature vector is sent to the last full-connected layer to obtain a feature vector F with an output size of 512-dimensional.

8. The method for recognizing composite emotional expression of human face according to claim 1, wherein the human face detection network multi-classification loss function constructed in step S04 is composed of two parts, the first part loss function predicts probability of the expression belonging to each category by using softmax function, and the formula is as follows:

p represents the probability of predicting a sample of class x as j, y is an indicator variable, where z is_iZk represents the prediction results of the ith and K classes, and K represents the number of the classes of the expression;

wherein C represents the number of predicted expression categories;

l_tri＝[α+d_p-d_n]₊

and adding the two loss functions to obtain the overall network loss function.

9. A facial composite emotional expression recognition system, comprising:

10. The system of claim 9, further comprising an image preprocessing module for labeling the extracted key points and preprocessing the image.