CN108304823B - Expression recognition method based on double-convolution CNN and long short-term memory network - Google Patents

Expression recognition method based on double-convolution CNN and long short-term memory network

Info

Publication number
CN108304823B
CN108304823B (application number CN201810156983.6A)
Authority
CN
China
Prior art keywords
layer
convolution
expression
double
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810156983.6A
Other languages
Chinese (zh)
Other versions
CN108304823A (en)
Inventor
唐贤伦
伍亚明
虞继敏
万辉
谢涛
魏畅
昌泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201810156983.6A
Publication of CN108304823A
Application granted
Publication of CN108304823B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 — Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 — Facial expression recognition
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 — Arrangements for image or video recognition or understanding
    • G06V10/20 — Image preprocessing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention claims an expression recognition method based on a double-convolution CNN and a long short-term memory network. First, the acquired expression picture is preprocessed by mean removal, filtering, normalization, and similar operations. The preprocessed picture is then fed into the double convolution layers and pooling layers to extract its features, which are further processed by a fully connected layer and a long short-term memory network (LSTM). Finally, a support vector machine (SVM) classifies the LSTM expression features and outputs the classification result. The invention fully exploits the spatio-temporal characteristics of facial expressions, extracts expression features that are subtle or easily confused, and effectively improves the expression recognition rate.

Description

Expression recognition method based on double-convolution CNN and long short-term memory network
Technical Field
The invention belongs to the technical field of image recognition and deep learning, and particularly relates to a method for recognizing human facial expressions.
Background
Expression recognition means extracting the required expression information from an existing picture or video. With the gradual development of expression-library labeling and recognition technologies, expression recognition has steadily improved. The key step is expression feature extraction: expression features are extracted from an existing labeled expression library and then classified. An accurate and stable facial expression recognition technology has great application prospects in daily life and industry. Machine learning, a branch of artificial intelligence, has progressed from shallow learning to deep learning; deep learning divides into supervised and unsupervised learning, with deep belief networks as representative unsupervised models and convolutional neural networks as representative supervised models. In recent years, convolutional neural networks have advanced on many fronts, with breakthroughs in speech recognition, face recognition, general object recognition, motion analysis, natural language processing, and even electroencephalography. Facial expression recognition mainly comprises four steps: acquiring the original image, preprocessing the image, extracting expression features, and classifying the features. Prior-art facial expression recognition is easily affected by factors such as image background, illumination, and subtle expressions.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing an expression recognition method based on a double-convolution CNN and a long short-term memory network that markedly improves the facial expression recognition rate and its stability. The technical scheme of the invention is as follows:
The expression recognition method based on the double-convolution CNN and the long short-term memory network comprises a training stage and a recognition stage. The training stage comprises the following steps:
S1, extracting the expressions of all subjects from a facial expression database to establish a gallery, and classifying and integrating the images; S2, preprocessing the facial expression images, including horizontal translation, vertical translation, and flipping; S3, dividing the preprocessed images into training samples and test samples, and converting the training set and the test set into LMDB format; S4, training the convolutional neural network and the long short-term memory network with the training set;
The recognition stage comprises the following steps: S5, selecting a facial expression image to be recognized; S6, preprocessing the image to be recognized; and S7, recognizing the facial expression picture with the convolutional neural network and the LSTM trained in step S4, and outputting the recognition result.
Further, step S1 classifies the facial expressions into seven categories: happy, surprised, angry, sad, disgust, fear, and neutral.
Further, in step S2, a plurality of expression images are collected, and each original image is converted into a 256 × 256 grayscale image through color transformation, denoising, and dimensionality-reduction operations.
Further, in step S4, the spatial features of the expression picture are extracted with a convolutional neural network, whose training specifically comprises:
establishing a double-convolution CNN network model whose layers are, in order: 2 convolution layers with 3 × 3 kernels, a max pooling layer, 2 more convolution layers with 3 × 3 kernels, 1 max pooling layer, 1 convolution layer with a 5 × 5 kernel, and 1 max pooling layer; the preprocessed expression picture is input into the double-convolution CNN to extract its spatial information.
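For illustration, a minimal PyTorch sketch of this layer ordering follows (the channel widths, padding, and grayscale input size are assumptions not stated at this point in the patent):

    import torch
    import torch.nn as nn

    class DoubleConvCNN(nn.Module):
        """Feature extractor following the stated layer order:
        2x conv3 -> max pool, 2x conv3 -> max pool, 1x conv5 -> max pool.
        Channel widths (32/64/128) are illustrative assumptions."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
                nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
                nn.Conv2d(64, 128, kernel_size=5, stride=1, padding=2), nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )

        def forward(self, x):          # x: (batch, 1, 256, 256) grayscale input
            return self.features(x)    # spatial feature maps for the LSTM stage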
Further, the initial learning rate of the double-convolution CNN network model is set to 0.001 with a step strategy: the learning rate is reduced by 0.0001 every 100 training iterations, and the maximum number of training iterations is 2500.
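A small sketch of this schedule, reading "reduced by 0.0001 every 100 iterations" as a subtractive step (an interpretation; Caffe's built-in "step" policy is multiplicative):

    def learning_rate(iteration, base_lr=0.001, drop=0.0001, step=100):
        """Step schedule from the patent: start at 0.001 and subtract
        0.0001 after every 100 training iterations."""
        return max(base_lr - drop * (iteration // step), 0.0)

    # e.g. iterations 0-99 -> 0.001, 100-199 -> 0.0009, ...; training stops at 2500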
Further, step S4 trains the LSTM network to extract the temporal information of the expression pictures. The computation proceeds as follows:
(1) at the current time t, the candidate cell value c̃_t is computed as:
c̃_t = tanh(W_xc·x_t + W_hc·h_(t-1) + b_c)
where x_t is the input data at time t (the output of the double-convolution CNN), h_(t-1) is the output of the hidden layer at time t-1, W_xc and W_hc are the corresponding weights, and b_c is a bias;
(2) the input gate determines how much new information is added to the cell state:
i_t = σ(W_xi·x_t + W_hi·h_(t-1) + b_i)
where W_xi and W_hi are the corresponding weights and b_i is a bias;
(3) the forget gate determines which information is removed from the cell:
f_t = σ(W_xf·x_t + W_hf·h_(t-1) + b_f)
where W_xf and W_hf are the corresponding weights and b_f is a bias;
(4) the cell state at the current time t is:
c_t = f_t ⊙ c_(t-1) + i_t ⊙ c̃_t
where ⊙ denotes element-wise multiplication;
(5) the output gate determines which part of the cell state is output:
o_t = σ(W_xo·x_t + W_ho·h_(t-1) + b_o)
where W_xo and W_ho are the corresponding weights and b_o is a bias;
(6) the output of the LSTM unit is:
h_t = o_t ⊙ tanh(c_t).
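A minimal NumPy sketch of one LSTM time step implementing equations (1)–(6) above (weight shapes and initialization are left to the caller):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM time step following the patent's equations.
        W: dict of weight matrices W["xc"], W["hc"], ...; b: dict of biases."""
        c_tilde = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])   # (1) candidate cell
        i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])       # (2) input gate
        f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])       # (3) forget gate
        c_t = f_t * c_prev + i_t * c_tilde                             # (4) cell state; * is elementwise
        o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])       # (5) output gate
        h_t = o_t * np.tanh(c_t)                                       # (6) LSTM output
        return h_t, c_t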
Further, step S7 recognizes the preprocessed pictures with the trained double convolutional neural network and long short-term memory network, and outputs the recognition result as probabilities sorted from high to low.
The invention has the following advantages and beneficial effects:
the innovation point of the invention is that step S7 combines a double convolution neural network and an LSTM network, the double convolution neural network more accurately extracts the spatial characteristics of the image, and the LSTM network extracts the temporal characteristics, thereby overcoming the influence of unobvious background, illumination, expression and the like.
Drawings
Fig. 1 is a flowchart of the expression recognition method based on a double-convolution CNN and a long short-term memory network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
a recognition method for facial expressions of human faces comprises the following steps:
a training stage: s1, extracting the expressions of all characters from the facial expression database to establish a library, classifying and integrating the images, and dividing the facial expressions into seven categories of happiness, surprise, anger, sadness, disgust, fear, neutrality and the like; s2, preprocessing the facial expression images, including horizontal translation, vertical translation, turning and other operations; s3, dividing the preprocessed image into a training sample and a test sample, and converting the training set and the test set into LMDB formats respectively; and S4, training the convolutional neural network and the long-short term memory network by using the training set.
And (3) identification: s5, selecting a facial expression image to be recognized; s6, preprocessing the image; and S7, recognizing the facial expression picture through the trained convolutional neural network and the LSTM, and outputting a recognition result.
The present invention is described in further detail below with reference to examples, but embodiments of the invention are not limited thereto. The method combines face recognition with facial expression recognition as described above; a dedicated recognition procedure is customized for the subjects, a degree of stability and accuracy can be guaranteed, and the method is suitable for many occasions and terminal devices, including personal computers, notebook computers, and smartphones. Specifically, the method comprises the following steps:
step 1, extracting facial expressions of 10 women from a facial expression library JAFFE to establish a gallery, wherein images are exemplified by an online standard expression library, and classifying and integrating the gallery to divide the human expressions into seven categories of happiness, surprise, anger, sadness, disgust, fear and neutrality.
Step 2: preprocess the expression pictures into 256 × 256 pixel images, converting color pictures into grayscale images with gray values in [0, 255].
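As an illustrative sketch of this step using OpenCV (the library choice is an assumption; the patent does not name one):

    import cv2

    def preprocess(path):
        """Load an expression picture, convert it to grayscale
        (values in [0, 255]), and resize it to 256 x 256, as in step 2."""
        img = cv2.imread(path, cv2.IMREAD_COLOR)        # handles color input
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # color -> grayscale
        return cv2.resize(gray, (256, 256))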
Step 3: divide the processed pictures into training samples and test samples, the test samples accounting for one seventh of the whole gallery, and convert the training set and the test set into LMDB format.
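A sketch of the LMDB conversion using the `lmdb` Python package with raw-byte serialization (an assumption; Caffe's own Datum protobuf format would differ):

    import lmdb
    import numpy as np

    def write_lmdb(db_path, images, labels):
        """Store preprocessed 256x256 grayscale images and labels in LMDB.
        Keys are zero-padded indices; values are one label byte + raw pixels."""
        env = lmdb.open(db_path, map_size=1 << 30)   # 1 GB upper bound
        with env.begin(write=True) as txn:
            for idx, (img, label) in enumerate(zip(images, labels)):
                key = f"{idx:08d}".encode()
                value = bytes([label]) + np.asarray(img, dtype=np.uint8).tobytes()
                txn.put(key, value)
        env.close()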
Step 4: train the double convolutional neural network and the long short-term memory network (LSTM) with the training set. To better fit the facial expression recognition task, the initial learning rate is set to 0.001 with a step strategy: the learning rate is reduced by 0.0001 every 100 training iterations, and the maximum number of iterations is 2500. The structures of the double convolutional neural network and the LSTM are as follows. Data layer: the batch size is 5 for both the training set and the test set. File names and labels are extracted from the processed facial expression dataset; the labeled dataset is the input of the data layer, which is the bottom layer of the whole network.
(1) Convolutional layer 1 is the top layer of the data layer, with kernel size 3 and stride 1, computing the convolution
x_j^l = f( Σ_(i∈M_j) x_i^(l-1) * k_ij^l + b_j^l )
where M_j is the set of input feature maps, k_ij^l are the convolution kernels, and b_j^l is the bias. For convolutional layer 1, ReLU is used as the activation function to introduce nonlinearity: θ(x) = max(0, x).
(2) Convolutional layer 2 is the top layer of convolutional layer 1, with kernel size 3 and stride 1, computing the same convolution as above. A local response normalization layer Norm1 is added on top of convolutional layer 2, computing
b^i_(x,y) = a^i_(x,y) / ( k + α · Σ_(j=max(0, i−n/2))^(min(N−1, i+n/2)) (a^j_(x,y))² )^β
where a^i_(x,y) is the activity of kernel i at position (x, y), N is the number of kernels, and k, n, α, β are hyperparameters. The local response normalization layer helps improve network training performance. ReLU is again used as the activation function to introduce nonlinearity.
(3) Pooling layer 1 is the top layer of local response normalization layer 1, with pooling window size 2, max pooling, and a pooling stride of 1:
y_(m,n) = max_((i,j)∈R_(m,n)) x_(i,j)
where R_(m,n) is the pooling window. Setting the pooling stride to 1 consumes more training time and hardware resources but increases training accuracy.
(4) Convolutional layer 3 is the top layer of pooling layer 1, with kernel size 3 and convolution stride 1. ReLU is adopted as the activation function of convolutional layer 3 to introduce nonlinearity.
(5) Convolutional layer 4 is the top layer of convolutional layer 3, with kernel size 3 and convolution stride 1. A local response normalization layer Norm2 is added on top of convolutional layer 4, and ReLU is used as the activation function to introduce nonlinearity.
(6) Pooling layer 2 is the top layer of local response normalization layer 2, with pooling window size 2, max pooling, and a pooling stride of 1.
(7) Convolutional layer 5 is the top layer of pooling layer 2, with kernel size 5 and convolution stride 1. A local response normalization layer Norm3 is added on top of convolutional layer 5, and ReLU is used as the activation function to introduce nonlinearity.
(8) Pooling layer 3 is the top layer of local response normalization layer 3, with pooling window size 3, max pooling, and a pooling stride of 1.
(9) The fully connected layer is the top layer of pooling layer 3 and outputs 4096 units, computing h_(w,b)(x) = θ(wᵀx + b). To prevent overfitting caused by the small amount of data, a Dropout layer is added on top of the fully connected layer, temporarily dropping each neural unit with probability 0.5. ReLU is used as the activation function of the fully connected layer.
(10) On top of the fully connected layer, 5 stacked LSTM hidden layers are added, each containing 128 neurons (this parameter was confirmed by literature and experiments). ReLU is used as the activation function of the LSTM layers.
(11) The SVM layer is the top layer above the LSTM layers and classifies the input into seven classes: happiness, surprise, anger, sadness, disgust, fear, and neutral. An Accuracy layer and a Loss layer are added on top of the SVM layer. The Accuracy layer outputs the accuracy; the Loss layer outputs the loss, whose trend indicates the current training state: when both the training loss and the test loss are decreasing, the network is still learning.
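To make steps (9)–(11) concrete, here is a minimal PyTorch/scikit-learn sketch of the head (the sequence length fed to the LSTM, the feature dimension, and the SVM kernel are assumptions; note that nn.LSTM uses tanh internally, whereas the patent applies ReLU to the LSTM layers):

    import torch
    import torch.nn as nn
    from sklearn.svm import SVC

    class ExpressionHead(nn.Module):
        """Fully connected layer (4096 units, Dropout 0.5) followed by
        5 stacked LSTM layers of 128 neurons, per steps (9)-(10).
        nn.LSTM's internal tanh stands in for the patent's ReLU."""
        def __init__(self, in_features):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(in_features, 4096), nn.ReLU(), nn.Dropout(p=0.5))
            self.lstm = nn.LSTM(input_size=4096, hidden_size=128,
                                num_layers=5, batch_first=True)

        def forward(self, x):              # x: (batch, seq_len, in_features)
            h, _ = self.lstm(self.fc(x))
            return h[:, -1, :]             # last-step features for the SVM

    # Step (11): an SVM classifies the 128-dim LSTM features into seven classes.
    def train_svm(features, labels):       # features: (N, 128) array, labels 0..6
        svm = SVC(kernel="rbf", probability=True)  # kernel choice is an assumption
        return svm.fit(features, labels)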
Step 5: recognize the calibrated facial expression images with the trained double convolutional neural network and LSTM, and output the probabilities of the seven expression classes corresponding to the image, sorted from high to low. Specifically: the user uploads any picture, color or black-and-white. It is preprocessed into a standard 256 × 256 pixel image; a color picture is converted into a grayscale image with gray values in [0, 255], which facilitates the subsequent recognition. The preprocessed picture is then recognized by the trained double convolutional neural network and LSTM, and the recognition result is output as probabilities from high to low.
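A short sketch of this ranked output, assuming integer labels 0–6 in the class order of step 1 and the `svm` trained above with probability estimates enabled:

    CLASSES = ["happiness", "surprise", "anger", "sadness",
               "disgust", "fear", "neutral"]   # label order is an assumption

    def recognize(svm, features):
        """Print and return the seven expression probabilities, high to low."""
        probs = svm.predict_proba(features.reshape(1, -1))[0]  # columns follow svm.classes_
        ranking = sorted(zip(CLASSES, probs), key=lambda p: p[1], reverse=True)
        for name, p in ranking:
            print(f"{name}: {p:.3f}")
        return ranking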
Table 1 (recognition results of two randomly selected pictures) is provided as an image in the original publication.
The experimental results of training and testing on the JAFFE expression library are shown in the confusion matrix below.

              Happy   Surprised  Angry   Sad     Disgust  Fear    Neutral
    Happy     0.987   0.002      0.001   0.001   0.002    0.001   0.003
    Surprised 0.002   0.975      0.005   0.006   0.003    0.008   0.002
    Angry     0.001   0.007      0.979   0.002   0.023    0.007   0.004
    Sad       0.002   0.004      0.002   0.984   0.012    0.012   0.001
    Disgust   0.001   0.005      0.008   0.004   0.954    0.008   0.006
    Fear      0.001   0.006      0.003   0.002   0.003    0.963   0.007
    Neutral   0.006   0.001      0.002   0.001   0.003    0.001   0.977
Compared with existing expression recognition techniques, the invention offers the following advantages: better discriminability for complex images, a clear recognition benefit for certain micro-expressions, and better discriminability for pictures that are otherwise recognized at a low rate. The invention is suitable for a range of terminal devices, such as EEG-controlled wheelchairs, smartphones, and notebook computers.
The above examples are to be construed as merely illustrative and not limiting of the present disclosure in any way. After reading this description, a person skilled in the art can make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the invention defined by the claims.

Claims (5)

1. An expression recognition method based on a double-convolution CNN and a long short-term memory network, characterized by comprising a training stage and a recognition stage, wherein the training stage comprises the following steps:
S1, extracting the expressions of all subjects from a facial expression database to establish a gallery, and classifying and integrating the images; S2, preprocessing the facial expression images, including horizontal translation, vertical translation, and flipping; S3, dividing the preprocessed images into training samples and test samples, and converting the training set and the test set into LMDB format; S4, training the convolutional neural network and the long short-term memory network with the training set;
the recognition stage comprises the following steps: S5, selecting a facial expression image to be recognized; S6, preprocessing the image to be recognized; S7, recognizing the facial expression picture with the convolutional neural network and the LSTM trained in step S4, and outputting the recognition result;
adding 5 stacked LSTM hidden layers on top of the fully connected layer, each containing 128 neurons, and adopting the ReLU function as the activation function of the LSTM layers to introduce nonlinearity;
taking the SVM layer as the top layer above the LSTM layers, and adding an Accuracy layer and a Loss layer on top of the SVM layer, wherein the Accuracy layer outputs the accuracy and the Loss layer outputs the loss; the current network training state can be judged from the trend of the loss, and when both the training loss and the test loss are decreasing, the network is still learning;
in step S4, the convolutional neural network is used to extract the spatial features of the expression picture, and its training specifically comprises:
establishing a double-convolution CNN network model whose layers are, in order: 2 convolution layers with 3 × 3 kernels, a max pooling layer, 2 more convolution layers with 3 × 3 kernels, 1 max pooling layer, 1 convolution layer with a 5 × 5 kernel, and 1 max pooling layer; the preprocessed expression picture is input into the double-convolution CNN to extract its spatial information;
step S4 trains the LSTM network and extracts the temporal information of the expression picture, computed as follows:
(1) at the current time t, the candidate cell value c̃_t is computed as:
c̃_t = tanh(W_xc·x_t + W_hc·h_(t-1) + b_c)
where x_t is the input data at time t (the output of the double-convolution CNN), h_(t-1) is the output of the hidden layer at time t-1, W_xc and W_hc are the corresponding weights, and b_c is a bias;
(2) the input gate determines how much new information is added to the cell state:
i_t = σ(W_xi·x_t + W_hi·h_(t-1) + b_i)
where W_xi and W_hi are the corresponding weights and b_i is a bias;
(3) the forget gate determines which information is removed from the cell:
f_t = σ(W_xf·x_t + W_hf·h_(t-1) + b_f)
where W_xf and W_hf are the corresponding weights and b_f is a bias;
(4) the cell state at the current time t is:
c_t = f_t ⊙ c_(t-1) + i_t ⊙ c̃_t
where ⊙ denotes element-wise multiplication;
(5) the output gate determines which part of the cell state is output:
o_t = σ(W_xo·x_t + W_ho·h_(t-1) + b_o)
where W_xo and W_ho are the corresponding weights and b_o is a bias;
(6) the output of the LSTM unit is:
h_t = o_t ⊙ tanh(c_t).
2. The expression recognition method based on a double-convolution CNN and a long short-term memory network according to claim 1, wherein step S1 classifies the facial expressions into seven categories: happiness, surprise, anger, sadness, disgust, fear, and neutral.
3. The expression recognition method based on a double-convolution CNN and a long short-term memory network according to claim 1, wherein step S2 collects a plurality of expression images and converts each original image into a 256 × 256 grayscale image through color transformation, denoising, and dimensionality-reduction operations.
4. The expression recognition method based on a double-convolution CNN and a long short-term memory network according to claim 1, wherein the initial learning rate of the double-convolution CNN network model is set to 0.001 with a step strategy: the learning rate is reduced by 0.0001 every 100 training iterations, and the maximum number of training iterations is 2500.
5. The expression recognition method based on a double-convolution CNN and a long short-term memory network according to claim 1, wherein step S7 recognizes the preprocessed pictures with the trained double convolutional neural network and long short-term memory network, and outputs the recognition result as probabilities sorted from high to low.
CN201810156983.6A 2018-02-24 2018-02-24 Expression recognition method based on double-convolution CNN and long-and-short-term memory network Active CN108304823B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810156983.6A CN108304823B (en) 2018-02-24 2018-02-24 Expression recognition method based on double-convolution CNN and long-and-short-term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810156983.6A CN108304823B (en) 2018-02-24 2018-02-24 Expression recognition method based on double-convolution CNN and long-and-short-term memory network

Publications (2)

Publication Number Publication Date
CN108304823A CN108304823A (en) 2018-07-20
CN108304823B true CN108304823B (en) 2022-03-22

Family

ID=62848893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810156983.6A Active CN108304823B (en) 2018-02-24 2018-02-24 Expression recognition method based on double-convolution CNN and long-and-short-term memory network

Country Status (1)

Country Link
CN (1) CN108304823B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165608A (en) * 2018-08-30 2019-01-08 深圳壹账通智能科技有限公司 The micro- expression recognition method of multi-angle of view, device, storage medium and computer equipment
CN109766765A (en) * 2018-12-18 2019-05-17 深圳壹账通智能科技有限公司 Audio data method for pushing, device, computer equipment and storage medium
CN109871444A (en) * 2019-01-16 2019-06-11 北京邮电大学 A kind of file classification method and system
CN110175505A (en) * 2019-04-08 2019-08-27 北京网众共创科技有限公司 Determination method, apparatus, storage medium and the electronic device of micro- expression type
CN110287801B (en) * 2019-05-29 2021-10-15 中国电子科技集团公司电子科学研究院 Micro-expression recognition method
CN110794093B (en) * 2019-11-11 2021-12-03 东北大学 Precision compensation method for discharged caustic alkali concentration measuring device in evaporation process
CN111523461A (en) * 2020-04-22 2020-08-11 南京工程学院 Expression recognition system and method based on enhanced CNN and cross-layer LSTM
CN111553333B (en) * 2020-07-10 2020-10-16 支付宝(杭州)信息技术有限公司 Face image recognition model training method, recognition method, device and electronic equipment
CN112580527A (en) * 2020-12-22 2021-03-30 之江实验室 Facial expression recognition method based on convolution long-term and short-term memory network
CN112560810B (en) * 2021-02-19 2021-07-02 中国科学院自动化研究所 Micro-expression recognition method based on multi-scale space-time characteristic neural network
CN113591789B (en) * 2021-08-16 2024-02-27 西南石油大学 Expression recognition method based on progressive grading
CN113812933A (en) * 2021-09-18 2021-12-21 重庆大学 Acute myocardial infarction real-time early warning system based on wearable equipment
CN114612972A (en) * 2022-03-07 2022-06-10 北京拙河科技有限公司 Face recognition method and system of light field camera

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106980811A (en) * 2016-10-21 2017-07-25 商汤集团有限公司 Facial expression recognizing method and expression recognition device
CN107392109A (en) * 2017-06-27 2017-11-24 南京邮电大学 A kind of neonatal pain expression recognition method based on deep neural network
CN107491726A (en) * 2017-07-04 2017-12-19 重庆邮电大学 A kind of real-time expression recognition method based on multi-channel parallel convolutional neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980811A (en) * 2016-10-21 2017-07-25 商汤集团有限公司 Facial expression recognizing method and expression recognition device
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN107392109A (en) * 2017-06-27 2017-11-24 南京邮电大学 A kind of neonatal pain expression recognition method based on deep neural network
CN107491726A (en) * 2017-07-04 2017-12-19 重庆邮电大学 A kind of real-time expression recognition method based on multi-channel parallel convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于深度神经网络的微表情识别";唐爽;《电子技术与软件工程》;20170215;第2-4节 *

Also Published As

Publication number Publication date
CN108304823A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108304823B (en) Expression recognition method based on double-convolution CNN and long-and-short-term memory network
Hossain et al. An emotion recognition system for mobile applications
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN107679526B (en) Human face micro-expression recognition method
CN111160533B (en) Neural network acceleration method based on cross-resolution knowledge distillation
Tian et al. Ear recognition based on deep convolutional network
CN108830237B (en) Facial expression recognition method
CN111860046B (en) Facial expression recognition method for improving MobileNet model
Elleuch et al. Towards unsupervised learning for Arabic handwritten recognition using deep architectures
CN111222457A (en) Detection method for identifying video authenticity based on depth separable convolution
Rao et al. Exploring deep learning techniques for Kannada handwritten character recognition: a boon for digitization
CN110717423A (en) Training method and device for emotion recognition model of facial expression of old people
CN113255557A (en) Video crowd emotion analysis method and system based on deep learning
CN117197904A (en) Training method of human face living body detection model, human face living body detection method and human face living body detection device
CN114639150A (en) Emotion recognition method and device, computer equipment and storage medium
Rahim et al. Dynamic hand gesture based sign word recognition using convolutional neural network with feature fusion
Yao et al. Deep capsule network for recognition and separation of fully overlapping handwritten digits
Hamdi et al. Comparative study between machine and deep learning methods for age, gender and ethnicity identification
Farhan et al. A new model for pattern recognition
Mawaddah et al. Handwriting recognition of hiragana characters using convolutional neural network
Nandhini et al. Sign language recognition using convolutional neural network
Avula et al. CNN based recognition of emotion and speech from gestures and facial expressions
Gnjatović et al. Putting humans back in the loop: a study in human-machine cooperative learning
CN115937937A (en) Facial expression recognition method based on improved residual error neural network
CN114187632A (en) Facial expression recognition method and device based on graph convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant