CN111160327A - Expression recognition method based on lightweight convolutional neural network - Google Patents

Expression recognition method based on lightweight convolutional neural network Download PDF

Info

Publication number
CN111160327A
Authority
CN
China
Prior art keywords
neural network
convolutional neural
lightweight
expression
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010252867.1A
Other languages
Chinese (zh)
Other versions
CN111160327B (en)
Inventor
赵光哲
张雷
杨瀚霆
朱娜
邵帅
田军伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture filed Critical Beijing University of Civil Engineering and Architecture
Priority to CN202010252867.1A priority Critical patent/CN111160327B/en
Publication of CN111160327A publication Critical patent/CN111160327A/en
Application granted granted Critical
Publication of CN111160327B publication Critical patent/CN111160327B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06T5/80
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Abstract

The invention relates to the field of artificial intelligence, and in particular provides an expression recognition method based on a lightweight convolutional neural network, comprising the following steps. S1: building and training a lightweight convolutional network model, where the number of convolutional layers ranges from 36 to 58, the number of grouped-convolution groups ranges from 2 to 4, and the compression factor of the compression layers ranges from 0.3 to 0.5. S2: building a face corrector. S3: detecting and correcting an input image with the face corrector to obtain a preprocessed image. S4: classifying the facial expressions in the preprocessed image with the lightweight convolutional neural network model. The invention addresses the low recognition accuracy and low recognition speed of the prior art, offering high real-time performance while maintaining accuracy.

Description

Expression recognition method based on lightweight convolutional neural network
Technical Field
The invention relates to the field of computer vision, in particular to an expression recognition method based on a lightweight convolutional neural network.
Background
Emotion is a cognitive experience produced by humans under intense psychological activity and is an important element guiding communication in social environments. Emotions arise from many sources, including mood, character and motivation. Facial expressions, as a unique signaling system, convey a person's psychological state and are one of the effective means of analyzing emotion. Expression recognition mainly comprises four stages: face localization, face correction, feature extraction and expression classification. Feature extraction and expression classification are the key stages and the core difficulties of expression recognition. Conventional methods extract facial information with manually designed features: geometric features based on geometric attributes of the image, and appearance features based on its grayscale information. These methods achieve high recognition accuracy on specific data distributions, but handle large pose variations poorly and generalize badly to other data sets. In recent years, data-driven methods have attracted attention. Convolutional neural network models, for example, learn features directly from data through weight sharing and downsampling, and are robust to variations in pose, occlusion and illumination. To obtain higher accuracy, however, researchers keep deepening the models, and the resulting parameter counts are excessive, which hinders both training and practical deployment.
Disclosure of Invention
In order to solve the technical problems of low recognition accuracy and low recognition speed in the prior art, the invention provides an expression recognition method based on a lightweight convolutional neural network, which uses a computation-reduction parameter N to determine the parameters of the lightweight convolutional network model and comprises the following steps:
S1: building and training a lightweight convolutional network model, and acquiring input image information with the model; the number of convolutional layers of the lightweight convolutional network model ranges from 36 to 58, and the compression factor of the compression layers ranges from 0.3 to 0.5;
S2: building a face corrector;
S3: detecting and correcting the input image information with the face corrector to obtain a preprocessed image;
S4: classifying the facial expressions in the preprocessed image with the lightweight convolutional neural network model;
the building and training of the lightweight convolutional network model comprises the following steps:
S1.1: building the network model, with the output of each convolutional layer passed into the subsequent convolutional layers as additional input, where the number of initial grouped-convolution groups is 2 to 4 and each dense block contains no fewer than 12 convolutional layers;
S1.2: determining the structural parameters of the lightweight convolutional neural network: the growth rate k, the convolution filter length H_k, the convolution filter width W_k and the number of convolutional layers l.
Determining these structural parameters comprises: calculating the computation-reduction parameter N of the lightweight convolutional neural network model (the exact expression for N is rendered as an image in the source, as a function of k, H_k, W_k and l), and taking the growth rate k, filter length H_k, filter width W_k and layer count l that minimize N as the parameters of the lightweight convolutional network model, where k is the growth rate of the structural parameters, H_k is the length of the convolution filter, W_k is the width of the convolution filter, and l is the number of convolutional layers.
Preferably, the S1 includes: training the lightweight convolutional neural network model on the FERPLUS expression recognition database.
Preferably, the face corrector is built using HOG features and an SVM algorithm.
Preferably, the S3 includes: detecting at least four reference points of the face in the input image through regression trees, matching the at least four reference points with the face corrector, and segmenting the input image according to the at least four reference points to obtain the preprocessed image.
Preferably, the step of training the lightweight convolutional neural network comprises:
acquiring training samples comprising at least 1000 first expression images;
flipping, rotating, cropping, scaling and deforming each first expression image with a data augmentation method to obtain at least 10 corresponding second expression images;
randomly masking at least one picture block in each second expression image to obtain a third expression image with a blank area;
training the lightweight convolutional neural network model with the third expression images.
Preferably, the step of building the face corrector with HOG features and an SVM algorithm comprises:
acquiring training samples comprising at least 1000 standard face images;
computing the gradient value and gradient direction of the HOG features of the standard face images as:

G_x(x, y) = I(x+1, y) − I(x−1, y)
G_y(x, y) = I(x, y+1) − I(x, y−1)
G(x, y) = √(G_x(x, y)² + G_y(x, y)²)
θ(x, y) = arctan(G_y(x, y) / G_x(x, y))

where x is the abscissa and y the ordinate of a pixel, each ranging from 0 to 255; G_x(x, y) and G_y(x, y) are the gradient values of the image I in the horizontal and vertical directions at point (x, y); G(x, y) is the gradient value of the pixel; and θ(x, y) is the gradient direction of the pixel, limited to between 0 and 180 degrees;
building a face SVM model according to the support vector machine principle;
training the face SVM model with the gradient values and gradient directions of the HOG features of the standard face images to obtain a training result;
forming a face detector from the training result.
Preferably, the step of training the lightweight convolutional neural network model on the FERPLUS expression recognition database includes calculating the updated gradient value:

v_t = α·v_(t−1) − β·g_t,  θ ← θ + v_t

where v_t is the current gradient update direction, v_(t−1) is the update direction of the previous gradient step, g_t is the current gradient calculated from the second derivative of the gradient, α and β are the decay weights, and θ is the updated gradient value.
Preferably, an expression classifier for at least one class, constructed with the lightweight convolutional neural network model, obtains the predicted probability of each expression from the features fed into the Softmax layer, calculated as:

P(y = i | x_i; W) = exp(W_i^T · x_i) / Σ_j exp(W_j^T · x_i),  i, j = 1, …, n

where y_i is the label of the i-th expression class, x_i is the input feature of the i-th class, W generically denotes all the weights of the dense network, P is the vector composed of the predicted probabilities of all expressions, the superscript T denotes transposition, and i, j and n are integer variables.
According to the technical scheme of the invention, on the basis of face localization and face correction, a lightweight convolution scheme is realized by using the computation-reduction parameter N, so that the computation load is reduced while the accuracy of the dense convolutional network is preserved, giving the advantages of high accuracy and low computation. The invention combines facial feature extraction and expression classification in a single lightweight convolutional neural network model to realize facial expression recognition, achieves recognition of facial expressions with a single camera and image processing in a laboratory environment, offers high real-time performance while maintaining accuracy, and effectively analyzes facial expression information.
Drawings
Fig. 1 is a flowchart of the expression recognition method based on a lightweight convolutional neural network according to an embodiment of the present invention.
Fig. 2 is a detection schematic of the face detector according to an embodiment of the present invention.
Fig. 3 is a correction schematic of the face corrector according to an embodiment of the present invention.
Figs. 4a to 4e show the accuracy results of the lightweight convolutional neural network recognition method according to the first embodiment of the present invention on the validation data set.
Fig. 5a compares the model parameter counts of the lightweight convolutional neural network recognition method according to the first embodiment of the present invention with other models.
Fig. 5b compares the model computation amounts of the lightweight convolutional neural network recognition method according to the first embodiment of the present invention with other models.
Fig. 6a is a learning curve of the lightweight convolutional neural network recognition method on the data set FER2013 according to the first embodiment of the present invention.
Fig. 6b is a learning curve of the lightweight convolutional neural network recognition method on the data set FERPLUS according to the first embodiment of the present invention.
Fig. 6c is a learning curve of the lightweight convolutional neural network recognition method on the data set FERFIN according to the first embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of the present invention.
Example one
The embodiment provides an expression recognition method based on a lightweight convolutional neural network.
As shown in fig. 1, the expression recognition method based on a lightweight convolutional neural network provided in this embodiment comprises four parts: single-frame image input, face detection, face correction and expression recognition. Starting from the original input image, the facial expression class is predicted after the two image-processing stages.
The light weight in this embodiment refers to a convolution computation scheme that is more efficient and requires less computation than standard convolution, reducing the computational complexity of the convolutional network and improving computational efficiency. The dense number refers to the number of densely connected convolutional layers; within a certain range, the more convolutional layers, the higher the model accuracy. This embodiment lightens the computation scheme of the convolutional network without changing its dense connectivity, obtaining a lightweight neural network structure by optimizing the convolutional network parameters and improving operating efficiency while maintaining the accuracy that the dense connectivity provides.
First, recognition accuracy and computation time are the two criteria for detecting and localizing faces in a human-computer interaction environment. Given the real-time requirement of the expression recognition system, features and learning algorithms with higher computation speed must be selected, on the premise of a certain accuracy, to optimize the parameters of the lightweight convolution model.
Since dense connections make the model's computation heavy and its training slow, the lightweight convolutional neural network used in this embodiment optimizes the convolutional layers. The network model is built with the output of each convolutional layer passed into the subsequent convolutional layers as additional input, where the number of initial grouped-convolution groups is 2 to 4 and each dense block contains no fewer than 12 convolutional layers.
The computation-reduction parameter N is calculated from the growth rate k, the convolution filter length H_k, the convolution filter width W_k and the number of convolutional layers l (the exact formula is rendered as an image in the source), and the values of k, H_k, W_k and l that minimize N are taken as the lightweight convolution model parameters.
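As a concrete illustration of this structure, the following is a minimal PyTorch sketch of one dense block whose layers are grouped depthwise-separable convolutions; the growth rate of 12 and the 12 layers per dense block follow the embodiment, while the class names, the stem channel count and the group count of 2 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupedSeparableLayer(nn.Module):
    """One lightweight layer: a depthwise convolution followed by a grouped
    1x1 (pointwise) convolution, instead of one full standard convolution."""
    def __init__(self, in_ch, growth_rate=12, kernel=3, groups=2):
        super().__init__()
        self.op = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # depthwise: one kernel x kernel filter per input channel
            nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2,
                      groups=in_ch, bias=False),
            # grouped pointwise: mixes channels within each of `groups` groups
            nn.Conv2d(in_ch, growth_rate, 1, groups=groups, bias=False),
        )

    def forward(self, x):
        return self.op(x)

class DenseBlock(nn.Module):
    """Each layer's output is concatenated into the input of all subsequent
    layers (dense connectivity); 12 layers per block as in the embodiment."""
    def __init__(self, in_ch, num_layers=12, growth_rate=12):
        super().__init__()
        self.layers = nn.ModuleList(
            GroupedSeparableLayer(in_ch + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # dense connectivity
        return x

# e.g. a 48x48 face mapped to a 24-channel stem first:
# DenseBlock(in_ch=24)(torch.randn(1, 24, 48, 48))  # -> (1, 24 + 12*12, 48, 48)
```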
This embodiment forms a face detector based on HOG features and the SVM algorithm to detect the face position in a single-frame image. Specifically, training samples comprising 3000 face images from the LFW database are acquired; the HOG features of the face images are computed according to the histogram-of-oriented-gradients generation method; a face detection SVM model is trained with the extracted HOG features; and the face detector is formed from the training result. The HOG gradient is calculated as:

G_x(x, y) = I(x+1, y) − I(x−1, y)
G_y(x, y) = I(x, y+1) − I(x, y−1)
G(x, y) = √(G_x(x, y)² + G_y(x, y)²)
θ(x, y) = arctan(G_y(x, y) / G_x(x, y))

where x is the abscissa and y the ordinate of a pixel, each ranging from 0 to 255; G_x(x, y) and G_y(x, y) are the gradient values of the image I in the horizontal and vertical directions at point (x, y); G(x, y) is the gradient value of the pixel; and θ(x, y) is the gradient direction of the pixel, limited to between 0 and 180 degrees.
As shown in fig. 2, after the original image is input, the HOG feature representation of the image is first computed, the trained standard-face HOG features are then compared against it, and finally the position of the face in the original image is located and output.
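A minimal sketch of such a detector with off-the-shelf libraries follows. The patent specifies only the positive LFW face images, so the negative (non-face) patches, the window size and the stride here are illustrative assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(gray):
    # 9 orientation bins over 0-180 degrees, the usual HOG configuration
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_face_detector(face_patches, non_face_patches):
    """face_patches / non_face_patches: lists of equally sized grayscale arrays."""
    X = np.array([hog_descriptor(p) for p in face_patches + non_face_patches])
    y = np.array([1] * len(face_patches) + [0] * len(non_face_patches))
    return LinearSVC(C=1.0).fit(X, y)

def detect_faces(clf, gray, win=64, stride=16):
    """Slide a window over the image; keep windows the SVM scores as face."""
    boxes = []
    h, w = gray.shape
    for top in range(0, h - win + 1, stride):
        for left in range(0, w - win + 1, stride):
            patch = gray[top:top + win, left:left + win]
            if clf.decision_function([hog_descriptor(patch)])[0] > 0:
                boxes.append((left, top, win, win))
    return boxes
```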
This embodiment uses an ensemble of regression trees to detect reference points in the face image block and thereby correct the face in the single-frame image. Specifically, training samples comprising 2000 training face images and 330 test face images are acquired; the samples are trained with a regression-tree ensemble using shape-invariant feature splits; and the face corrector is constructed from the training result. Fig. 3 is a correction schematic of the face corrector of the first embodiment. As shown in fig. 3, after a face image block is input, the 68 feature points of the face are first computed, then compared with the 68 feature points of a standard face, and the face image block is finally corrected.
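The regression-tree, 68-landmark approach described here matches what dlib ships as its pre-trained shape predictor, so a working correction stage can be sketched with that library; using dlib and its model file is an assumption of this sketch, not part of the patent.

```python
import dlib

detector = dlib.get_frontal_face_detector()
# Pre-trained 68-point regression-tree landmark model (Kazemi-Sullivan style)
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def correct_face(img):
    """Detect a face, locate its 68 reference points, and return an aligned crop."""
    faces = detector(img, 1)
    if not faces:
        return None
    shape = predictor(img, faces[0])                # 68 feature points
    return dlib.get_face_chip(img, shape, size=48)  # rotate/scale to canonical pose
```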
This embodiment uses the lightweight convolutional neural network to perform feature extraction and prediction on the corrected face and obtain the expression classification. After the lightweight convolutional network model is built, the model parameters are optimized through extremum calculation; the facial expression recognition database FERFIN is acquired; the expression data are preprocessed, e.g. with data augmentation; the model is trained on the preprocessed expression image data set; and the training result is taken as the final expression classification model.
Considering that applications in real environments require high real-time performance and that an oversized neural network architecture increases computation, the number of convolutional layers of the lightweight convolutional network model is set to the range 36-58 and the compression factor of the compression layers to the range 0.3-0.5, and the growth rate k, filter length H_k, filter width W_k and layer count l that minimize the computation-reduction parameter N are adopted as the model parameters, which helps reduce the number of model parameters while learning more characteristic features.
Between the dense blocks are transition layers that compress parameters and adjust the computation variables. After 3 dense blocks, the feature tensor computed by the model is fed into a fully connected layer whose combined kernel function maps the features extracted from the image into a 1×7 vector, where the value at each position represents the confidence of the corresponding expression category. The vector composed of the predicted probabilities of all expressions is calculated as:

P(y = i | x_i; W) = exp(W_i^T · x_i) / Σ_j exp(W_j^T · x_i),  i, j = 1, …, n

where y_i is the label of the i-th expression class, x_i is the input feature of the i-th class, W generically denotes all the weights of the dense network, P is the vector composed of the predicted probabilities of all expressions, the superscript T denotes transposition, and i, j and n are integer variables.
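The mapping from pooled features to the 1×7 probability vector is ordinary softmax regression, which can be sketched in a few lines of numpy; the feature dimension of 128 is an illustrative assumption.

```python
import numpy as np

def softmax_probs(features, W):
    """Map a pooled feature vector to the 1x7 expression-probability vector.

    features: (d,) features from the last dense block
    W: (7, d) weights of the fully connected layer
    """
    logits = W @ features    # one confidence score per expression class
    logits -= logits.max()   # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()   # P, the predicted probability vector

rng = np.random.default_rng(0)
p = softmax_probs(rng.normal(size=128), rng.normal(size=(7, 128)))
print(p.argmax(), p.sum())   # predicted class index; probabilities sum to 1
```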
This embodiment trains the expression classifier with the lightweight convolutional neural network. A deep convolutional neural network model needs a large amount of training data to reach high accuracy, so the FERFIN data set is adopted as the training data set. FERFIN was refined from the FER2013 data set and contains 12858 "neutral" images, 9354 "happy" images, 4462 "surprised" images, 4351 "sad" images, 3082 "angry" images, 575 "disgusted" images and 816 "afraid" images, for a total of 35498 grayscale facial expression images of 48 by 48 pixels.
Considering the pose, illumination and occlusion variations present in the expression recognition task, this embodiment preprocesses the FERFIN database in two steps. Using data augmentation, each original picture is flipped, rotated, cropped, scaled and deformed to obtain twelve new pictures; blocks of 16 by 16 pixels are then randomly erased from the new pictures to obtain pictures with blank areas.
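A minimal torchvision sketch of this two-step preprocessing follows; the 48×48 input size, the twelve variants and the 16×16 erased block follow the text, while the specific transform parameters (rotation angle, crop scale, shear) are illustrative assumptions.

```python
from torchvision import transforms

# Step 1: flip/rotate/crop/scale/deform; step 2: erase one 16x16 block.
# 16*16 = 256 of 48*48 = 2304 pixels, so the erased area ratio is 256/2304.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(48, scale=(0.8, 1.0)),
    transforms.RandomAffine(degrees=0, shear=10),  # mild deformation
    transforms.ToTensor(),
    transforms.RandomErasing(p=1.0, scale=(256 / 2304, 256 / 2304),
                             ratio=(1.0, 1.0)),    # square 16x16 blank area
])

def make_variants(pil_face):
    """Twelve augmented variants per original 48x48 grayscale face (PIL image)."""
    return [augment(pil_face) for _ in range(12)]
```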
To accelerate the convergence of the lightweight convolutional network model, this embodiment obtains the current gradient with the following momentum method instead of conventional gradient descent:

v_t = α·v_(t−1) − β·g_t,  θ ← θ + v_t

where v_t and v_(t−1) are the update directions of the current and previous gradient steps, g_t is the current gradient calculated from the second derivative of the gradient, α and β are the two decay weights, and θ is the updated gradient value.
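A sketch of one such momentum step in numpy, under the classical-momentum reading of the formula above (the concrete decay values are illustrative):

```python
import numpy as np

def momentum_step(theta, v_prev, grad, alpha=0.9, beta=0.01):
    """One momentum update: blend the previous update direction with the
    current gradient, so steps accelerate along consistently downhill
    directions and converge faster than plain gradient descent."""
    v = alpha * v_prev - beta * grad  # current update direction v_t
    return theta + v, v               # updated parameters and new direction

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = momentum_step(theta, v, 2 * theta)
print(np.round(theta, 4))  # approaches the minimum at the origin
```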
As shown in figs. 4a, 4b, 4c, 4d and 4e, the PGC-DenseNet model used to optimize DenseNet comprises 3 dense blocks with 12 convolutional layers each. Before each input image enters a dense block, all convolutional layers are optimized into grouped depthwise-separable convolutions, and the parameters are optimized with the computation-reduction parameter N. Comparing the parameter-optimized model with other popular lightweight networks shows that the method surpasses the other models in accuracy: it converges earlier, reaches 80% accuracy faster, and attains higher final accuracy.
As shown in fig. 5a, the lightweight models compared on parameter count are, from left to right, PGC-DenseNet, SqueezeNet, ShuffleNet1, ShuffleNet2, MobileNet1, MobileNet2 and MobileNet3; as shown in fig. 5b, the same models are compared in the same order on computation amount. As figs. 5a and 5b show, PGC-DenseNet has the smallest parameter count among the models while its computation remains of the same order of magnitude: it contains only about 250,000 parameters, up to 6 times fewer than the other lightweight models.
Figs. 6a, 6b and 6c are the learning curves of the PGC-DenseNet model on the data sets FER2013, FERPLUS and FERFIN respectively; in each figure the smoother, continuously rising curve lying above is the training set, and the curve that fluctuates more and tends to converge is the test set. As figs. 6a, 6b and 6c clearly show, the learning curves of the model on the training and validation sets exhibit sufficient robustness against overfitting, with the training and validation curves fitting closely within 150 epochs.
The invention provides an expression recognition method based on a lightweight convolutional neural network, comprising the steps of: preprocessing facial expression data, training the lightweight convolutional neural network model to obtain an expression classification model, training the face corrector, detecting the face in a single-frame image, correcting and recording the face in the single-frame image, and recognizing the expression in the single-frame image to obtain the expression classification. The invention uses the computation-reduction parameter N to optimize the model parameters and adopts a lightweight convolution scheme, so that computation is reduced while the accuracy of the dense convolutional network is preserved, combining high accuracy with low computation; online expression recognition is realized with a single camera and a network transmission scheme in a laboratory environment, with high real-time performance under guaranteed accuracy.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. An expression recognition method based on a lightweight convolutional neural network, characterized in that a computation-reduction parameter N is used to determine the parameters of the lightweight convolutional network model, comprising the steps of:
S1: building and training a lightweight convolutional network model, and acquiring input image information with the model; the number of convolutional layers of the lightweight convolutional network model ranges from 36 to 58, and the compression factor of the compression layers ranges from 0.3 to 0.5;
S2: building a face corrector;
S3: detecting and correcting the input image information with the face corrector to obtain a preprocessed image;
S4: classifying the facial expressions in the preprocessed image with the lightweight convolutional neural network model;
the building and training of the lightweight convolutional network model comprises the following steps:
S1.1: building the network model, with the output of each convolutional layer passed into the subsequent convolutional layers as additional input, where the number of initial grouped-convolution groups is 2 to 4 and each dense block contains no fewer than 12 convolutional layers;
S1.2: determining the structural parameters of the lightweight convolutional neural network: the growth rate k, the convolution filter length H_k, the convolution filter width W_k and the number of convolutional layers l;
wherein determining these structural parameters comprises: calculating the computation-reduction parameter N of the lightweight convolutional neural network model (the exact expression for N is rendered as an image in the source, as a function of k, H_k, W_k and l), and taking the growth rate k, filter length H_k, filter width W_k and layer count l that minimize N as the parameters of the lightweight convolutional network model, where k is the growth rate of the structural parameters, H_k is the length of the convolution filter, W_k is the width of the convolution filter, and l is the number of convolutional layers.
2. The expression recognition method based on a lightweight convolutional neural network of claim 1, wherein the S1 includes: training the lightweight convolutional neural network model on the FERPLUS expression recognition database.
3. The expression recognition method based on a lightweight convolutional neural network of claim 1, wherein the face corrector is built using HOG features and an SVM algorithm.
4. The expression recognition method based on a lightweight convolutional neural network of claim 1, wherein the S3 includes: detecting at least four reference points of the face in the input image through regression trees, matching the at least four reference points with the face corrector, and segmenting the input image according to the at least four reference points to obtain the preprocessed image.
5. The expression recognition method based on a lightweight convolutional neural network of claim 2, wherein the step of training the lightweight convolutional neural network comprises:
acquiring training samples comprising at least 1000 first expression images;
flipping, rotating, cropping, scaling and deforming each first expression image with a data augmentation method to obtain at least 10 corresponding second expression images;
randomly masking at least one picture block in each second expression image to obtain a third expression image with a blank area;
training the lightweight convolutional neural network model with the third expression images.
6. The expression recognition method based on a lightweight convolutional neural network of claim 3, wherein the step of building the face corrector with HOG features and an SVM algorithm comprises:
acquiring training samples comprising at least 1000 standard face images;
computing the gradient value and gradient direction of the HOG features of the standard face images as:

G_x(x, y) = I(x+1, y) − I(x−1, y)
G_y(x, y) = I(x, y+1) − I(x, y−1)
G(x, y) = √(G_x(x, y)² + G_y(x, y)²)
θ(x, y) = arctan(G_y(x, y) / G_x(x, y))

where x is the abscissa and y the ordinate of a pixel, each ranging from 0 to 255; G_x(x, y) and G_y(x, y) are the gradient values of the image I in the horizontal and vertical directions at point (x, y); G(x, y) is the gradient value of the pixel; and θ(x, y) is the gradient direction of the pixel, limited to between 0 and 180 degrees;
building a face SVM model according to the support vector machine principle;
training the face SVM model with the gradient values and gradient directions of the HOG features of the standard face images to obtain a training result;
forming a face detector from the training result.
7. The expression recognition method based on a lightweight convolutional neural network of claim 1, wherein the step of training the lightweight convolutional neural network model on the FERPLUS expression recognition database includes calculating the updated gradient value:

v_t = α·v_(t−1) − β·g_t,  θ ← θ + v_t

where v_t is the current gradient update direction, v_(t−1) is the update direction of the previous gradient step, g_t is the current gradient calculated from the second derivative of the gradient, α and β are the decay weights, and θ is the updated gradient value.
8. The expression recognition method based on a lightweight convolutional neural network of claim 1, wherein an expression classifier for at least one class, constructed with the lightweight convolutional neural network model, obtains the predicted probability of each expression from the features fed into the Softmax layer, calculated as:

P(y = i | x_i; W) = exp(W_i^T · x_i) / Σ_j exp(W_j^T · x_i),  i, j = 1, …, n

where y_i is the label of the i-th expression class, x_i is the input feature of the i-th class, W generically denotes all the weights of the dense network, P is the vector composed of the predicted probabilities of all expressions, the superscript T denotes transposition, and i, j and n are integer variables.
CN202010252867.1A 2020-04-02 2020-04-02 Expression recognition method based on lightweight convolutional neural network Active CN111160327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010252867.1A CN111160327B (en) 2020-04-02 2020-04-02 Expression recognition method based on lightweight convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010252867.1A CN111160327B (en) 2020-04-02 2020-04-02 Expression recognition method based on lightweight convolutional neural network

Publications (2)

Publication Number Publication Date
CN111160327A true CN111160327A (en) 2020-05-15
CN111160327B CN111160327B (en) 2020-06-30

Family

ID=70567689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010252867.1A Active CN111160327B (en) 2020-04-02 2020-04-02 Expression recognition method based on lightweight convolutional neural network

Country Status (1)

Country Link
CN (1) CN111160327B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642477A (en) * 2021-08-17 2021-11-12 苏州大学 Character recognition method, device and equipment and readable storage medium
CN116958703A (en) * 2023-08-02 2023-10-27 德智鸿(上海)机器人有限责任公司 Identification method and device based on acetabulum fracture

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109753922A (en) * 2018-12-29 2019-05-14 北京建筑大学 Anthropomorphic robot expression recognition method based on dense convolutional neural networks
CN109829923A (en) * 2018-12-24 2019-05-31 五邑大学 A kind of antenna for base station based on deep neural network has a down dip angle measuring system and method
CN110853630A (en) * 2019-10-30 2020-02-28 华南师范大学 Lightweight speech recognition method facing edge calculation
WO2020051776A1 (en) * 2018-09-11 2020-03-19 Intel Corporation Method and system of deep supervision object detection for reducing resource usage

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020051776A1 (en) * 2018-09-11 2020-03-19 Intel Corporation Method and system of deep supervision object detection for reducing resource usage
CN109829923A (en) * 2018-12-24 2019-05-31 五邑大学 A kind of antenna for base station based on deep neural network has a down dip angle measuring system and method
CN109753922A (en) * 2018-12-29 2019-05-14 北京建筑大学 Anthropomorphic robot expression recognition method based on dense convolutional neural networks
CN110853630A (en) * 2019-10-30 2020-02-28 华南师范大学 Lightweight speech recognition method facing edge calculation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙若钒 et al.: "VansNet lightweight convolutional neural network", Journal of Guizhou University (Natural Science Edition) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642477A (en) * 2021-08-17 2021-11-12 苏州大学 Character recognition method, device and equipment and readable storage medium
CN116958703A (en) * 2023-08-02 2023-10-27 德智鸿(上海)机器人有限责任公司 Identification method and device based on acetabulum fracture

Also Published As

Publication number Publication date
CN111160327B (en) 2020-06-30

Similar Documents

Publication Publication Date Title
CN110569795B (en) Image identification method and device and related equipment
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN109815826B (en) Method and device for generating face attribute model
CN109492529A (en) A kind of Multi resolution feature extraction and the facial expression recognizing method of global characteristics fusion
CN112801040B (en) Lightweight unconstrained facial expression recognition method and system embedded with high-order information
CN113158862A (en) Lightweight real-time face detection method based on multiple tasks
CN113420794B (en) Binaryzation Faster R-CNN citrus disease and pest identification method based on deep learning
CN111160327B (en) Expression recognition method based on lightweight convolutional neural network
CN113705290A (en) Image processing method, image processing device, computer equipment and storage medium
CN111476178A (en) Micro-expression recognition method based on 2D-3D CNN
Xu et al. Face expression recognition based on convolutional neural network
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN111401116B (en) Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
CN115393944A (en) Micro-expression identification method based on multi-dimensional feature fusion
CN112836748A (en) Casting identification character recognition method based on CRNN-CTC
CN111126364A (en) Expression recognition method based on packet convolutional neural network
CN113343773B (en) Facial expression recognition system based on shallow convolutional neural network
CN113052132A (en) Video emotion recognition method based on face key point track feature map
TWI722383B (en) Pre feature extraction method applied on deep learning
CN113469116A (en) Face expression recognition method combining LBP (local binary pattern) features and lightweight neural network
Shijin et al. Research on classroom expression recognition based on deep circular convolution self-encoding network
CN112819133A (en) Construction method of deep hybrid neural network emotion recognition model
Nayak et al. Facial Expression Recognition based on Feature Enhancement and Improved Alexnet
Duan An object recognition method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant