CN113869221B - Expression recognition method based on multistage deep neural network - Google Patents
Expression recognition method based on multistage deep neural network
- Publication number
- CN113869221B CN113869221B CN202111148260.XA CN202111148260A CN113869221B CN 113869221 B CN113869221 B CN 113869221B CN 202111148260 A CN202111148260 A CN 202111148260A CN 113869221 B CN113869221 B CN 113869221B
- Authority
- CN
- China
- Prior art keywords
- expression
- layer
- convolution
- expression recognition
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to the technical field of expression recognition, and particularly discloses an expression recognition method based on a multistage deep neural network. First, the four expression labels with higher recognition complexity (angry, nausea, fear and contempt) are relabelled as "other", and a first expression recognition network model is trained to distinguish happy, surprised, sad and other. Next, the dataset labeled as angry, nausea, fear and contempt is fed into the feature extraction network model, which outputs feature vectors for these data. These feature vectors are then processed through a standardized flow model. Finally, the processed feature vectors are fed into a multi-layer perceptron (MLP). Through continuous learning and training, the MLP can successfully identify the four expressions of angry, nausea, fear and contempt. Therefore, the seven basic expressions can be identified with higher precision by the trained multistage neural network model.
Description
Technical Field
The invention relates to the technical field of expression recognition, in particular to an expression recognition method based on a multistage deep neural network.
Background
Expressions carry very important information in our daily communication, often reinforcing what is being communicated. Psychologist A. Mehrabian notes in An Approach to Environmental Psychology that in daily human communication the information conveyed by language accounts for only 7% of the total, while the information conveyed by facial expression accounts for 55%. Meanwhile, with the development of machine learning in recent years, face recognition technology has received a great deal of attention; facial expression recognition in particular has gained wide attention in fields such as security, robot manufacturing, automation, automatic driving and human-machine interaction. Human beings have at least 21 expressions, seven of which are basic: happy, surprised, sad, angry, nausea, fear and contempt. They are all composed of basic expression units, i.e. one or more actions and states of the muscles of various parts of the face. However, the accuracy of current expression recognition is not high, and for expressions whose features partially overlap the recognition effect is especially poor.
Disclosure of Invention
The invention provides an expression recognition method based on a multistage deep neural network, which solves the technical problems that: how to improve the recognition effect of the expression with the overlapped partial characteristics.
In order to solve the technical problems, the invention provides an expression recognition method based on a multistage deep neural network, which comprises the following steps:
S1, preprocessing a training data set containing seven expression labels; wherein,
the seven expression labels are happy, surprised, sad, angry, nausea, fear and contempt, respectively;
The preprocessing is to change the labels of the picture data of the 4 expression labels with higher recognition complexity, namely angry, nausea, fear and contempt, into an "other" label, leave the labels of the remaining picture data unchanged, and cut all the picture data to a size of B×B;
S2, sending the data of the four expression labels happy, surprised, sad and other obtained after the preprocessing in step S1 into a first expression recognition network model for training, so as to fix the weight data of the first expression recognition network model;
S3, cutting data of 4 expression labels with higher recognition complexity, namely, angry, nausea, fear and contempt, into a B multiplied by B size, and then sending the data into a feature extraction network model to obtain corresponding feature data;
S4, sending the characteristic data obtained in the step S3 into a standardized flow model for processing, so that the data is subjected to Gaussian distribution;
S5, sending the data obtained in the step S4 into a multi-layer perceptron for training, and storing the trained parameters;
S6, testing the trained expression recognition model consisting of the first expression recognition network model, the feature extraction network model, the standardized flow model and the multi-layer perceptron;
S7, using the expression recognition model passing the test for recognizing the unknown expression image, wherein the recognition process comprises the following steps:
Cutting the unknown expression image to B×B and sending it into the first expression recognition network model passing the test; if it is judged to be one of the expressions other than "other", directly outputting the recognition result; otherwise sending it into a second expression recognition network model consisting of the feature extraction network model, the standardized flow model and the multi-layer perceptron, and outputting the classification result with the maximum probability.
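The two-stage routing described above can be sketched as follows; `stage1` and `stage2` are hypothetical stand-ins for the trained first and second expression recognition network models, and all names are assumptions for illustration, not the patent's code.

```python
# Hypothetical sketch of the two-stage cascade in step S7.
def recognize(image, stage1, stage2):
    """stage1 returns one of {"happy", "surprise", "sad", "other"};
    stage2 returns a probability per hard class (angry/nausea/fear/contempt)."""
    label = stage1(image)
    if label != "other":
        return label                      # easy expression: decided at stage 1
    probs = stage2(image)                 # ResNet(2) -> DNF -> MLP
    return max(probs, key=probs.get)      # classification with maximum probability
```

Only images the first model routes to "other" ever pay the cost of the second, heavier pipeline.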
Specifically, the first expression recognition network model is built based on the ResNet-18 network model, and comprises a first convolution module and a fully connected module;
the feature extraction network model is built based on the ResNet-18 network model, including a second convolution module.
This has the advantage that the extracted expression features are more spatially aggregated, since the standardized flow model (DNF) requires a certain degree of aggregation in the data.
Specifically, the first convolution module includes:
a first block: a convolution layer consisting of 64 7 x 7 convolution kernels, with a step size of 2;
and a second block: is composed of a 3 x 3 maximum pooling layer and a convolution layer composed of two 64 3 x 3 convolution kernels;
third block: a convolution layer consisting of two layers of 128 3 x3 convolution kernels;
fourth block: a convolution layer consisting of two layers of 256 3 x3 convolution kernels;
fifth block: a convolution layer consisting of two layers of 512 3 x3 convolution kernels;
the fully connected module comprises:
An averaging pooling layer, a full connectivity layer, and a Softmax layer;
A Dropout strategy was added before the fully connected layer and 50% of neurons were randomly inactivated.
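As a check on the five-block layout above, the spatial sizes can be traced with the standard convolution output-size formula; the padding values below are the usual ResNet-18 choices and are an assumption, since the text does not list them.

```python
import math

def conv_out(size, kernel, stride, pad):
    # floor((n + 2p - k) / s) + 1: output size of a conv or pooling layer
    return math.floor((size + 2 * pad - kernel) / stride) + 1

size = conv_out(224, 7, 2, 3)       # block 1: 64 7x7 convs, stride 2 -> 112
size = conv_out(size, 3, 2, 1)      # block 2: 3x3 max pooling, stride 2 -> 56
for _ in range(3):                  # blocks 3-5 each halve the spatial size
    size = conv_out(size, 3, 2, 1)  # 28 -> 14 -> 7
# global average pooling over the final 7x7 maps leaves a 512-d vector
```

The 224×224 input thus ends as a 7×7×512 tensor before average pooling, matching the usual ResNet-18 progression.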
Specifically, the first expression recognition network model uses cross entropy as its loss function, with the formula:

L = -Σ_{i=1}^{N} y^(i) log ŷ^(i)  (1)

In formula (1), N represents the number of categories, y^(i) indicates whether the output category is the same as the label (1 if the same, otherwise 0), and ŷ^(i) represents the predicted value.
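Formula (1) can be checked numerically; a minimal sketch with a one-hot label and illustrative softmax probabilities (the numbers are assumptions):

```python
import math

def cross_entropy(y_true, y_pred):
    """Formula (1): L = -sum_i y(i) * log(yhat(i)) over the N categories.
    y_true is one-hot (1 where the output category matches the label),
    y_pred holds the Softmax-layer probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t)

loss = cross_entropy([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])  # -log(0.7)
```

The loss only depends on the probability assigned to the true class, so it decreases as that probability approaches 1.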
Specifically, the second convolution module includes:
a first block: a convolution layer consisting of 64 7 x 7 convolution kernels, with a step size of 2;
and a second block: is composed of a 3 x 3 maximum pooling layer and a convolution layer composed of two 64 3 x 3 convolution kernels;
third block: a convolution layer consisting of two layers of 128 3 x3 convolution kernels;
fourth block: a convolution layer consisting of two layers of 256 3 x3 convolution kernels;
fifth block: a convolution layer consisting of two layers of 512 3 x3 convolution kernels;
plus the final average pooling layer.
Specifically, the training objective function of the standardized flow model is:

Θ* = argmax_Θ Σ_i [ log N(Z_i; μ_{y(x_i)}, Σ) + log |det(∂f^{-1}(x_i)/∂x_i)| ]  (2)

In formula (2), Θ = {{μ_y}, Σ, θ} represents all parameters, where y represents a class, μ_y the mean of class y, Σ the covariance, and θ the parameters of the flow network; y(x_i) represents the class of the i-th sample, and Z_i = f^{-1}(x_i) represents the sample after DNF processing, which conforms to a Gaussian distribution; N(·; μ_y, Σ) represents the distribution of each class y. After training, the standardized flow model creates a normalized space for Z in which each class conforms to a Gaussian distribution. det represents the determinant of the Jacobian matrix, and x_i = f(Z_i), where f is a composition of T inverse autoregressive transforms, expressed as:

f = f_T ∘ f_{T-1} ∘ … ∘ f_0  (3)

wherein each f_t is a structured neural network, and 0 ≤ t ≤ T.
Specifically, the standardized flow model is composed of 10 masked autoregressive flow (MAF) blocks. Each block is implemented by a three-layer fully connected neural network and realizes an inverse autoregressive transformation z_j^(i) = u_j^(i) · exp(α_j^(i)) + μ_j^(i), where z_j^(i) is the j-th output of the i-th masked autoregressive flow block, μ_j^(i) = f_μ(u_1^(i), …, u_{j-1}^(i)) and α_j^(i) = f_α(u_1^(i), …, u_{j-1}^(i)); {f_μ, f_α} are unconstrained functions, and exp denotes the exponential function with base e. In the experiments, the model was trained with the Adam optimizer, with batch size set to 300 and learning rate set to 0.003.
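A single masked autoregressive transform of this kind can be sketched as below; the toy `f_mu`/`f_alpha` callables merely stand in for the three-layer fully connected networks and are assumptions for illustration.

```python
import numpy as np

def maf_block(u, f_mu, f_alpha):
    """One masked autoregressive flow step:
    z_j = u_j * exp(alpha_j) + mu_j, where mu_j and alpha_j are computed
    only from u_1..u_{j-1} (the autoregressive mask)."""
    z = np.empty_like(u, dtype=float)
    for j in range(len(u)):
        prefix = u[:j]                                   # earlier dims only
        z[j] = u[j] * np.exp(f_alpha(prefix)) + f_mu(prefix)
    return z

# Toy stand-ins for the fully connected networks (assumptions):
z = maf_block(np.array([1.0, 2.0, 3.0]),
              f_mu=lambda p: float(p.sum()),
              f_alpha=lambda p: 0.0)
```

Because each output depends only on earlier inputs, the Jacobian is triangular and its determinant in formula (2) is just the product of the exp(α_j) terms.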
Specifically, the multi-layer perceptron comprises an input layer, a hidden layer and an output layer; the input layer has 5 nodes, the hidden layer has 6 nodes, the output layer has 4 nodes, and the connection between the nodes of adjacent layers has weight.
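A forward pass through the 5-6-4 perceptron can be sketched as follows; the random weights and the ReLU/softmax choices are illustrative assumptions, not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
# Shapes follow the 5-6-4 layout in the text.
W1, b1 = rng.standard_normal((6, 5)), np.zeros(6)   # input(5) -> hidden(6)
W2, b2 = rng.standard_normal((4, 6)), np.zeros(4)   # hidden(6) -> output(4)

def mlp(x):
    h = np.maximum(W1 @ x + b1, 0.0)   # weighted edges + ReLU hidden layer
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()                 # probabilities over the 4 expressions

probs = mlp(rng.standard_normal(5))
```

Training would adjust W1, W2, b1, b2 so the output distribution concentrates on the correct one of the four hard expressions.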
Preferably, B×B = 224×224.
Specifically, during testing, a camera first collects a photo of a human face; after the 68 facial key points are identified, the photo containing the 68 key points is cut to 224×224 and sent into the expression recognition model.
According to the expression recognition method based on the multistage deep neural network, the labels of the picture data of the 4 expression labels with higher recognition complexity, namely angry, nausea, fear and contempt, are uniformly modified into the "other" label, and the data of the four expression labels happy, surprise, sadness and other are then sent into the first expression recognition network model for training; through continuous learning and training, the first expression recognition network model can successfully recognize the 4 classifications other, happy, surprise and sadness. Then, the dataset consisting of the 4 expression labels angry, nausea, fear and contempt is sent into the feature extraction network model, which outputs feature vectors for these data. These feature vectors are then processed through the standardized flow model to enlarge the inter-class spacing and reduce the intra-class spacing, which facilitates later training and recognition. Finally, the processed feature vectors are fed into a multi-layer perceptron (MLP). Through continuous learning and training, the MLP can successfully identify the 4 expressions angry, nausea, fear and contempt. Through this complete flow, the seven basic expressions can be identified with higher precision by the trained multistage neural network model (the first expression recognition network model, the feature extraction network model, the standardized flow model and the multi-layer perceptron).
Compared with a single-stage neural network, the multistage network lets different modules distinguish expressions of different complexity, so complex expressions are better resolved and all expressions are recognized with higher precision.
In a specific experiment, a dataset containing only Asian faces was used. On this dataset, VGG16 was first used for training and testing, and the final accuracy reached only 38.72%. Then a single-stage ResNet was used, reaching 42.58%. With the method of the invention, the accuracy reaches 86.32%.
Drawings
Fig. 1 is a step flowchart of an expression recognition method based on a multi-level deep neural network according to an embodiment of the present invention;
FIG. 2 is a block diagram of a multi-stage deep neural network according to an embodiment of the present invention;
Fig. 3 is a frame structure diagram of a first expression recognition network model ResNet (1) provided in an embodiment of the present invention;
FIG. 4 is a block diagram of a feature extraction network model ResNet (2) provided by an embodiment of the invention;
FIG. 5 is a block diagram of a standardized flow model (DNF model) provided by an embodiment of the present invention;
FIG. 6 is a block diagram of a multi-layer perceptron (MLP) provided by an embodiment of the invention;
FIG. 7 is a block diagram of a mask autoregressive flow block provided by an embodiment of the invention.
Detailed Description
The following examples are given for the purpose of illustration only and are not to be construed as limiting the invention; the drawings are for reference and description only and do not limit the scope of the invention, since many variations are possible without departing from its spirit and scope.
The expression recognition method based on the multistage deep neural network provided by the embodiment of the invention, as shown in figures 1 and 2, comprises the following steps:
s1, preprocessing a training data set containing seven-big expression labels.
The seven expression labels are Happy ("Happy"), Surprised ("Surprise"), Sad ("Sad"), Angry ("Angry"), Nausea ("Disgust"), Fear ("Fear") and Contempt ("Contempt"), respectively.
Here, preprocessing refers to changing the labels of the picture data of the 4 expression labels with higher recognition complexity, namely angry, nausea, fear and contempt, to "Other", leaving the labels of the remaining picture data unchanged, and cutting all the picture data to a B×B size. That is, preprocessing modifies the labels of the 4 expressions angry, nausea, fear and contempt into the other label ("Other"), yielding data with the 4 labels other, happy, surprise and sadness. This example uniformly clips the data to a 224×224 size, i.e. B=224. Of course, the size of B can be determined according to actual requirements.
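The relabelling half of step S1 can be sketched as below; the `(label, image)` record format is an assumption, while the label names follow the text.

```python
# Sketch of the S1 relabelling: the four harder expressions collapse into
# "Other"; the three easy ones keep their labels.
HARD = {"Angry", "Disgust", "Fear", "Contempt"}

def relabel(dataset):
    """dataset is a list of (label, image) pairs (format assumed)."""
    return [("Other" if label in HARD else label, img)
            for label, img in dataset]
```

Cropping to 224×224 would follow as a separate image-processing step on each `img`.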
S2, sending the data of the happy, surprised, sad and other four expression labels obtained after the pretreatment in the step S1 into a first expression recognition network model for training so as to fix weight data of the first expression recognition network model.
As shown in fig. 2, the first expression recognition network model is built based on the ResNet-18 network model, abbreviated as ResNet (1) in this example; it comprises a first convolution module and a fully connected module, and its input is an image of size B×B = 224×224.
As shown in fig. 3, the first convolution module includes:
a first block: a convolution layer consisting of 64 7 x 7 convolution kernels, with a step size of 2;
And a second block: is composed of a3 x 3 max pooling layer (Maxpool) and a convolution layer composed of two layers of 64 3 x 3 convolution kernels;
third block: a convolution layer consisting of two layers of 128 3 x3 convolution kernels;
fourth block: a convolution layer consisting of two layers of 256 3 x3 convolution kernels;
fifth block: a convolution layer consisting of two layers of 512 3 x3 convolution kernels;
the fully connected module comprises:
an averaging pooling layer (Avg Pooling), a fully-connected layer (FC) and a Softmax layer;
A Dropout strategy was added before the fully connected layer and 50% of neurons were randomly inactivated.
S3, cutting the data of the 4 expression labels with higher recognition complexity, namely angry, nausea, fear and contempt, into a B×B size, and then sending the data into the feature extraction network model to obtain corresponding feature data.
Here, as shown in fig. 2, the feature extraction network model is also built based on the ResNet-18 network model, referred to as the ResNet (2) model in this example, but it includes only the second convolution module; its input is a picture of size B×B = 224×224. The purpose is to make the extracted expression features more spatially aggregated, since a certain degree of aggregation in the data is required when the standardized flow model is applied afterwards. As shown in fig. 4, the second convolution module includes:
a first block: a convolution layer consisting of 64 7 x 7 convolution kernels, with a step size of 2;
and a second block: is composed of a 3 x 3 maximum pooling layer and a convolution layer composed of two 64 3 x 3 convolution kernels;
third block: a convolution layer consisting of two layers of 128 3 x3 convolution kernels;
fourth block: a convolution layer consisting of two layers of 256 3 x3 convolution kernels;
fifth block: a convolution layer consisting of two layers of 512 3 x3 convolution kernels;
plus the final average pooling layer.
And S4, sending the characteristic data obtained in the step S3 into a standardized flow model (DNF model) for processing, so that the data is subjected to Gaussian distribution.
The input to the DNF model is the feature vectors from the ResNet (2) model, and the output is a normalized feature-vector space. The DNF model transforms the original feature data with smaller inter-class spacing into data with larger inter-class spacing and makes it subject to a Gaussian distribution, as shown in fig. 5. The training objective function of the DNF model is:

Θ* = argmax_Θ Σ_i [ log N(Z_i; μ_{y(x_i)}, Σ) + log |det(∂f^{-1}(x_i)/∂x_i)| ]  (2)

In formula (2), Θ = {{μ_y}, Σ, θ} represents all parameters, where y represents a class, μ_y the mean of class y, Σ the covariance, and θ the parameters of the flow network; y(x_i) represents the class of the i-th sample, and Z_i = f^{-1}(x_i) represents the sample after DNF processing, which conforms to a Gaussian distribution; N(·; μ_y, Σ) represents the distribution of each class y. After training, the standardized flow model creates a normalized space for Z in which each class conforms to a Gaussian distribution. det represents the determinant of the Jacobian matrix, and x_i = f(Z_i), where f is a composition of T inverse autoregressive transforms, expressed as:

f = f_T ∘ f_{T-1} ∘ … ∘ f_0  (3)

wherein each f_t is a structured neural network, and 0 ≤ t ≤ T.
The standardized flow model consists of 10 masked autoregressive flow (MAF) blocks, each implemented by a three-layer fully connected neural network realizing an inverse autoregressive transformation z_j^(i) = u_j^(i) · exp(α_j^(i)) + μ_j^(i), where z_j^(i) is the j-th output of the i-th masked autoregressive flow block, μ_j^(i) = f_μ(u_1^(i), …, u_{j-1}^(i)) and α_j^(i) = f_α(u_1^(i), …, u_{j-1}^(i)); {f_μ, f_α} are unconstrained functions, and exp denotes the exponential function with base e. In the experiments, the model was trained with the Adam optimizer, with batch size set to 300 and learning rate set to 0.003. The structure of the MAF block is shown in fig. 7.
S5, sending the data obtained in the step S4 into a multi-layer perceptron (MLP) for training, and storing the trained parameters.
The data input to the MLP are the vectors after DNF processing. In this vector space the inter-class spacing is larger and the intra-class spacing is smaller. The MLP model comprises an input layer, a hidden layer and an output layer: the input layer has 5 nodes, the hidden layer has 6 nodes, and the output layer has 4 nodes. The connections between nodes of adjacent layers are weighted; through training, these edges are assigned the correct weights. The network structure of the MLP is shown in fig. 6.
S6, testing the trained expression recognition model (multistage deep neural network) composed of the first expression recognition network model, the feature extraction network model, the standardized flow model and the multi-layer perceptron.
During testing, a camera first collects a photo of a human face; after the 68 facial key points are identified, the photo containing the 68 key points is cut to 224×224 and sent into the expression recognition model.
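The landmark-centered crop can be sketched as below; landmark detection itself (e.g. a 68-point facial landmark predictor) is assumed to have already run, and the `(row, col)` landmark format is an assumption.

```python
import numpy as np

def crop_around_landmarks(image, landmarks, size=224):
    """Center a size x size crop on the mean of the 68 landmark points,
    clamping so the crop stays inside the image bounds."""
    cy, cx = np.mean(landmarks, axis=0).astype(int)  # landmark centroid
    h, w = image.shape[:2]
    y = min(max(cy - size // 2, 0), max(h - size, 0))
    x = min(max(cx - size // 2, 0), max(w - size, 0))
    return image[y:y + size, x:x + size]
```

The clamping matters near image borders, where a naive centered crop would otherwise run out of pixels.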
S7, using the expression recognition model passing the test for recognizing the unknown expression image, wherein the recognition process comprises the following steps:
The unknown expression image is cut and sent into the first expression recognition network model passing the test; if it is judged to be an expression other than "other", the recognition result is output directly; otherwise it is sent into the second expression recognition network model consisting of the feature extraction network model, the standardized flow model and the multi-layer perceptron, and the classification result with the maximum probability is output.
According to the expression recognition method based on the multistage deep neural network, the labels of the 4 expression labels with higher recognition complexity ("angry", "nausea", "fear" and "contempt") are uniformly modified into "other", and all the data are sent into the first expression recognition network model ResNet (1) for training; through continuous learning and training, ResNet (1) can successfully recognize "other" and the remaining 3 classifications ("happy", "surprise", "sad"). Next, the dataset consisting of the 4 labels ("angry", "nausea", "fear" and "contempt") is fed into the feature extraction network model ResNet (2), which outputs feature vectors for these data. Then, these feature vectors are processed by the standardized flow model (DNF model) to enlarge the inter-class spacing and reduce the intra-class spacing, facilitating later training and recognition. Finally, the processed feature vectors are fed into the multi-layer perceptron (MLP). Through continuous learning and training, the MLP can successfully identify the 4 expressions "angry", "nausea", "fear" and "contempt". Through this complete flow, the seven basic expressions can be identified with higher precision by the trained multistage neural network model (first expression recognition network model, feature extraction network model, standardized flow model and multi-layer perceptron).
Compared with a single-stage neural network, the multistage network lets different modules distinguish expressions of different complexity, so complex expressions are better resolved and all expressions are recognized with higher precision.
In a specific experiment, a dataset containing only Asian faces was employed. The dataset contains 40005 Asian face pictures, 5715 for each basic expression. Of these, 16002 pictures are used for training, 12001 for verification and 12002 for testing. During training, the batch size was set to 64, the learning rate to 0.01, and the number of iterations to 50. For testing, the accuracy is computed as:

Accuracy = T / (T + F)

where T refers to the total number of correctly judged expressions and F refers to the total number of incorrectly judged expressions. For this dataset, VGG16 was first used for training and testing, and the final accuracy reached only 38.72%. Then a single-stage ResNet was used, reaching 42.58%. With the method of the invention, the accuracy reaches 86.32%.
The above examples are preferred embodiments of the present invention, but the embodiments are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.
Claims (10)
1. The expression recognition method based on the multistage deep neural network is characterized by comprising the following steps of:
S1, preprocessing a training data set containing seven expression labels;
wherein the seven big expression labels are happy, surprised, sad, angry, nausea, fear and contempt respectively;
The preprocessing is to change the labels of the picture data of the 4 expression labels with higher recognition complexity, namely angry, nausea, fear and contempt, into an other label, leave the labels of the remaining picture data unchanged, and cut all the picture data to a size of B×B;
S2, sending the data of the four expression labels happy, surprised, sad and other obtained after the preprocessing in step S1 into a first expression recognition network model for training, so as to fix the weight data of the first expression recognition network model;
S3, cutting data of 4 expression labels with higher recognition complexity, namely, angry, nausea, fear and contempt, into a B multiplied by B size, and then sending the data into a feature extraction network model to obtain corresponding feature data;
S4, sending the characteristic data obtained in the step S3 into a standardized flow model for processing, so that the data is subjected to Gaussian distribution;
S5, sending the data obtained in the step S4 into a multi-layer perceptron for training, and storing the trained parameters;
S6, testing the trained expression recognition model consisting of the first expression recognition network model, the feature extraction network model, the standardized flow model and the multi-layer perceptron;
S7, using the expression recognition model passing the test for recognizing the unknown expression image, wherein the recognition process comprises the following steps:
Cutting the unknown expression image to B×B and sending it into the first expression recognition network model passing the test; if it is judged to be one of the expressions other than "other", directly outputting the recognition result; otherwise sending it into a second expression recognition network model consisting of the feature extraction network model, the standardized flow model and the multi-layer perceptron, and outputting the classification result with the maximum probability.
2. The expression recognition method based on the multistage deep neural network according to claim 1, wherein:
The first expression recognition network model is built on the ResNet network model and comprises a first convolution module and a fully connected module;
the feature extraction network model is built on the ResNet-18 network model and comprises a second convolution module.
3. The expression recognition method based on the multistage deep neural network according to claim 2, wherein the first convolution module comprises:
a first block: a convolution layer consisting of 64 7×7 convolution kernels, with a stride of 2;
a second block: a 3×3 max pooling layer followed by a convolution layer consisting of two layers of 64 3×3 convolution kernels;
a third block: a convolution layer consisting of two layers of 128 3×3 convolution kernels;
a fourth block: a convolution layer consisting of two layers of 256 3×3 convolution kernels;
a fifth block: a convolution layer consisting of two layers of 512 3×3 convolution kernels;
the fully connected module comprises:
an average pooling layer, a fully connected layer and a Softmax layer;
a Dropout strategy is applied before the fully connected layer, randomly deactivating 50% of the neurons.
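Under the standard ResNet-18 strides and padding (the claim states only the stride of the first convolution; the remaining values below are assumptions), the spatial size of a 224×224 input can be traced through the five blocks with the usual convolution output-size formula:

```python
def trace_sizes(size=224):
    """Spatial size after each size-changing operation of the five blocks.

    Assumed (kernel, stride, padding) follow the standard ResNet-18 design:
    the 7x7 conv, the 3x3 max pool, then the first 3x3 conv of blocks 3-5
    (the stride-1 convolutions do not change the spatial size).
    """
    sizes = [size]
    for k, st, p in [(7, 2, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]:
        size = (size + 2 * p - k) // st + 1
        sizes.append(size)
    return sizes
```

The final 7×7 map is what the average pooling layer reduces before the fully connected layer.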
4. The expression recognition method based on the multistage deep neural network according to claim 3, wherein the first expression recognition network model uses cross entropy as its loss function, and the formula is:

Loss = -Σ_{i=1}^{N} y^(i) log(ŷ^(i))   (1)

In formula (1), N represents the number of categories, y^(i) equals 1 if the output category is the same as the label and 0 otherwise, and ŷ^(i) represents the predicted value.
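A minimal NumPy sketch of formula (1), with the label given as a one-hot vector:

```python
import numpy as np

def cross_entropy(y_onehot, y_pred, eps=1e-12):
    """Formula (1): loss = -sum_{i=1}^{N} y(i) * log(yhat(i)).

    eps guards against log(0); y_onehot holds 1 for the true class, else 0.
    """
    y_onehot = np.asarray(y_onehot, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(-np.sum(y_onehot * np.log(y_pred + eps)))

# True class is index 1; the loss reduces to -log of its predicted probability.
loss = cross_entropy([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])
```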
5. The expression recognition method based on the multistage deep neural network according to claim 2, wherein the second convolution module comprises:
a first block: a convolution layer consisting of 64 7×7 convolution kernels, with a stride of 2;
a second block: a 3×3 max pooling layer followed by a convolution layer consisting of two layers of 64 3×3 convolution kernels;
a third block: a convolution layer consisting of two layers of 128 3×3 convolution kernels;
a fourth block: a convolution layer consisting of two layers of 256 3×3 convolution kernels;
a fifth block: a convolution layer consisting of two layers of 512 3×3 convolution kernels;
and a final average pooling layer.
6. The expression recognition method based on the multistage deep neural network according to claim 1, wherein the training objective function of the normalizing flow model is:

Θ* = argmax_Θ Σ_i [ log N(Z_i; μ_{y(x_i)}, Σ) + log |det(∂f⁻¹(x_i)/∂x_i)| ]   (2)

In formula (2), Θ = {{μ_y}, Σ, θ} represents all the parameters, where y denotes a class, μ_y denotes the mean of class y, Σ denotes the covariance, and θ denotes the parameters of the model; y(x_i) denotes the class of the i-th sample, Z_i denotes a sample conforming to a Gaussian distribution after DNF processing, and N(μ_y, Σ) represents the distribution of class y. After training, the normalizing flow model creates a normalized space for Z in which each class conforms to a Gaussian distribution; det denotes the determinant of a square matrix, and x_i = f(Z_i), where f is a composition of T inverse autoregressive transformations, expressed as:
f = f_T · f_{T-1} · … · f_0   (3)
wherein each f_t is a structured neural network, and 0 ≤ t ≤ T.
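The role of the log-determinant term in such a change-of-variables objective can be checked on a toy one-dimensional affine flow (all values below are illustrative, not from the patent): the density computed through the flow must agree with the closed-form density of the transformed Gaussian.

```python
import numpy as np

def gauss_logpdf(z, mu, var):
    """Log density of a 1-D Gaussian N(mu, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

# Toy flow x = f(z) = a*z + b, with latent z ~ N(mu, var).
a, b = 2.0, 1.0
mu, var = 0.0, 1.0
x = 3.0

# Change of variables: log p(x) = log N(f^{-1}(x); mu, var) + log|d f^{-1}/dx|
z = (x - b) / a
logp_flow = gauss_logpdf(z, mu, var) + np.log(abs(1.0 / a))

# Closed form: x ~ N(a*mu + b, a^2 * var)
logp_direct = gauss_logpdf(x, a * mu + b, a * a * var)
```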
7. The expression recognition method based on the multistage deep neural network according to claim 6, wherein: the normalizing flow model consists of 10 masked autoregressive flow blocks, each of which is implemented by a three-layer fully connected neural network and performs an inverse autoregressive transformation z_j^(i) = u_j^(i) · exp(α_j^(i)) + μ_j^(i), where z_j^(i) refers to the j-th output of the i-th masked autoregressive flow block, μ_j^(i) = f_μ(u_1^(i), …, u_{j-1}^(i)), α_j^(i) = f_α(u_1^(i), …, u_{j-1}^(i)), {f_μ, f_α} are unconstrained functions, and exp denotes the exponential function with the natural constant e as its base.
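A sketch of one such affine inverse autoregressive block, with simple scalar functions standing in for the three-layer networks f_μ and f_α, shows why it is sequentially invertible: each μ_j and α_j depends only on the already-recovered inputs u_{1:j-1}.

```python
import numpy as np

def maf_forward(u, f_mu, f_alpha):
    """One block: z_j = u_j * exp(alpha_j) + mu_j, conditioned on u_{1:j-1}."""
    z = np.empty_like(u)
    for j in range(len(u)):
        mu, alpha = f_mu(u[:j]), f_alpha(u[:j])
        z[j] = u[j] * np.exp(alpha) + mu
    return z

def maf_inverse(z, f_mu, f_alpha):
    """Sequential inversion: u_j = (z_j - mu_j) * exp(-alpha_j)."""
    u = np.empty_like(z)
    for j in range(len(z)):
        mu, alpha = f_mu(u[:j]), f_alpha(u[:j])  # prefixes already recovered
        u[j] = (z[j] - mu) * np.exp(-alpha)
    return u

# Toy unconstrained functions standing in for the fully connected networks
f_mu = lambda prefix: 0.1 * np.sum(prefix)
f_alpha = lambda prefix: 0.05 * np.sum(prefix)

u = np.array([0.5, -1.0, 0.3, 2.0, -0.7])
z = maf_forward(u, f_mu, f_alpha)
```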
8. The expression recognition method based on the multistage deep neural network according to claim 1, wherein: the multi-layer perceptron comprises an input layer, a hidden layer and an output layer; the input layer has 5 nodes, the hidden layer has 6 nodes, the output layer has 4 nodes, and the connections between nodes of adjacent layers carry weights.
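A minimal forward pass of such a 5-6-4 perceptron; the weights below are random placeholders for the trained parameters, and the ReLU hidden activation and Softmax output are assumptions, as the claim does not state the activations:

```python
import numpy as np

rng = np.random.default_rng(42)

# Weight shapes fixed by the claimed 5-6-4 topology (values are placeholders)
W1, b1 = rng.normal(size=(6, 5)), np.zeros(6)
W2, b2 = rng.normal(size=(4, 6)), np.zeros(4)

def softmax(v):
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()

def mlp_forward(x):
    h = np.maximum(W1 @ x + b1, 0.0)  # hidden layer, ReLU assumed
    return softmax(W2 @ h + b2)       # 4-way class probabilities

p = mlp_forward(rng.normal(size=5))
```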
9. The expression recognition method based on the multistage deep neural network according to any one of claims 1 to 8, wherein: b×b=224×224.
10. The expression recognition method based on the multistage deep neural network according to claim 9, wherein: during testing, a camera first collects a photograph of a human face; after the 68 key points of the face are identified, the photograph, cropped to 224×224 and containing the 68 key points, is sent into the expression recognition model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111148260.XA CN113869221B (en) | 2021-09-29 | 2021-09-29 | Expression recognition method based on multistage deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113869221A CN113869221A (en) | 2021-12-31 |
CN113869221B true CN113869221B (en) | 2024-05-24 |
Family
ID=78992218
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111148260.XA Active CN113869221B (en) | 2021-09-29 | 2021-09-29 | Expression recognition method based on multistage deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113869221B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110705379A (en) * | 2019-09-12 | 2020-01-17 | 广州大学 | Expression recognition method of convolutional neural network based on multi-label learning |
CN111639544A (en) * | 2020-05-07 | 2020-09-08 | 齐齐哈尔大学 | Expression recognition method based on multi-branch cross-connection convolutional neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109684911B (en) * | 2018-10-30 | 2021-05-11 | 百度在线网络技术(北京)有限公司 | Expression recognition method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Facial Expression Recognition Based on Multi-Feature Fusion Convolutional Neural Network; Wang Jianxia; Chen Huiping; Li Jiaze; Zhang Xiaoming; Journal of Hebei University of Science and Technology; 2020-01-03 (Issue 06); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Coşkun et al. | Face recognition based on convolutional neural network | |
Chen | Deep learning with nonparametric clustering | |
CN111652066A (en) | Medical behavior identification method based on multi-self-attention mechanism deep learning | |
Sun et al. | Facial expression recognition based on a hybrid model combining deep and shallow features | |
CN112732916B (en) | BERT-based multi-feature fusion fuzzy text classification system | |
Prakash et al. | Face recognition with convolutional neural network and transfer learning | |
CN113407660B (en) | Unstructured text event extraction method | |
CN112199536A (en) | Cross-modality-based rapid multi-label image classification method and system | |
CN107491729B (en) | Handwritten digit recognition method based on cosine similarity activated convolutional neural network | |
CN112733866A (en) | Network construction method for improving text description correctness of controllable image | |
Elleuch et al. | Towards unsupervised learning for Arabic handwritten recognition using deep architectures | |
KR100729273B1 (en) | A method of face recognition using pca and back-propagation algorithms | |
CN114818703B (en) | Multi-intention recognition method and system based on BERT language model and TextCNN model | |
Qiao et al. | A face recognition system based on convolution neural network | |
Pandey et al. | Face Recognition Using Machine Learning | |
CN110490028A (en) | Recognition of face network training method, equipment and storage medium based on deep learning | |
Deeb et al. | Human facial emotion recognition using improved black hole based extreme learning machine | |
Dan et al. | Pf-vit: Parallel and fast vision transformer for offline handwritten chinese character recognition | |
CN114170659A (en) | Facial emotion recognition method based on attention mechanism | |
CN113869221B (en) | Expression recognition method based on multistage deep neural network | |
CN116775880A (en) | Multi-label text classification method and system based on label semantics and transfer learning | |
CN113159071B (en) | Cross-modal image-text association anomaly detection method | |
Zeghina et al. | Face Recognition Based on Harris Detector and Convolutional Neural Networks | |
Luqin | A survey of facial expression recognition based on convolutional neural network | |
Khaliluzzaman et al. | Automatic facial expression recognition using shallow convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 2023-09-14
Address after: No. 20, East Road, University City, Shapingba District, Chongqing
Applicant after: Chongqing Daipu Technology Co., Ltd.
Address before: No. A022, Floor 8, No. 142 Yunhan Avenue, Shuitu Street, Liangjiang New Area, Yubei District, Chongqing 400799
Applicant before: Jiuziyuan (Chongqing) Intelligent Technology Co., Ltd.
GR01 | Patent grant |