CN113989890A - Face expression recognition method based on multi-channel fusion and lightweight neural network - Google Patents
Face expression recognition method based on multi-channel fusion and lightweight neural network
- Publication number
- CN113989890A (application number CN202111273460.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- convolution
- neural network
- lightweight neural
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a facial expression recognition method based on multi-channel fusion and a lightweight neural network, addressing two problems of traditional facial expression recognition learning methods: the feature extraction process is complex, and deeper high-level semantic features and depth features cannot be obtained from the original image. Experiments show that the proposed model can effectively extract facial expression features and classify expressions, with good accuracy and robustness.
Description
Technical Field
The invention belongs to the technical field of facial expression recognition, and particularly relates to a facial expression recognition method based on multi-channel fusion and a lightweight neural network.
Background
In recent years, machine learning has developed rapidly in the field of artificial intelligence, and more and more researchers are concerned with how to make computers understand human emotion better and thereby change the relationship between humans and computers. Mehrabian's research showed that in person-to-person communication, facial expressions convey a very large proportion of the information, up to 55%, while only 7% depends on the content of what is spoken. It follows that facial expressions play a crucial role in human-to-human communication. Expression recognition is an interdisciplinary subject spanning physiology, neurology, computer science and other fields, and has great potential application value in psychology, intelligent robots, online education, intelligent monitoring and the like.
An expression is a form of non-verbal communication; it mostly refers to states formed by the facial muscles, eye muscles, mouth muscles and the five sense organs, such as smiling or anger, and can promptly reflect a person's emotional changes and psychological state. In 1971, the psychologists Ekman and Friesen divided facial expressions into 6 basic expressions, namely happiness, sadness, surprise, fear, anger and disgust, systematically established a facial expression image library, and described in detail the facial movement characteristics corresponding to each expression. Currently, facial expression recognition methods mostly fall into two categories. The first is based on traditional feature extraction algorithms, including extraction algorithms based on geometric features, texture features and the like: a suitable feature extraction algorithm is designed according to the feature requirements and combined with different classifiers, and the facial expressions are then classified and identified by algorithms such as sparse representation classification and hidden Markov models. However, traditional feature extraction algorithms can only extract shallow features according to artificially designed descriptors rather than autonomous learning, and human factors easily introduce errors that affect the accuracy of expression classification. The second category is feature extraction based on deep learning, comprising three models: the convolutional neural network, the deep belief network and the restricted Boltzmann machine. Deep neural networks have strong autonomous learning ability and can extract deeper features that are also more amenable to visualization, after which the classification result is output through a classifier. Jain et al. proposed a single deep convolutional neural network model containing deep residual blocks and found that combining an FCN with residual blocks could greatly improve the overall result. Ko introduced a hybrid deep learning approach that combines a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM). Ali et al. proposed a facial emotion recognition method that uses graph mining to reduce the extracted features, in which the gSpan frequent-subgraph mining algorithm finds the frequent substructures of each emotion in a graph database. Yong Li et al. proposed a Convolutional Neural Network with an attention mechanism (ACNN) that can effectively perceive the occluded regions of faces and improve recognition accuracy for both non-occluded and occluded faces. Tang combined a CNN with an SVM to recognize facial expressions, achieving a better recognition effect on the Fer2013 data set. Fengyuan et al. combined SIFT and CNN feature fusion to recognize facial expressions, improving the accuracy of model recognition.
Disclosure of Invention
In order to solve the problems of difficult feature extraction and incomplete feature extraction in the traditional method, the invention aims to provide a facial expression recognition method based on multi-channel fusion and a lightweight neural network, so that more complete image features are further extracted, and the accuracy and robustness of facial expression recognition are improved.
In order to achieve the purpose, the invention adopts the following technical scheme: the facial expression recognition method based on multi-channel fusion and a lightweight neural network comprises the following steps:
s1, acquiring image data through an expression database or a camera, and performing face region detection on the facial expression database images by using a cascade classifier based on Haar features to acquire a face image;
s2, extracting local texture features of the face region by adopting a local binary pattern, and detecting the edge of the face region based on a Canny edge detection algorithm;
s3, constructing and initializing a lightweight neural network;
s4, performing channel fusion on the obtained face image, the LBP texture feature image and the edge detection Canny image, performing data normalization and data enhancement on the fused image, and inputting the image into the constructed lightweight neural network for training and recognition.
Further, in step S4, when training the constructed lightweight neural network, the model is trained in batches with the data-enhanced training set: data are generated batch by batch, back propagation is performed, the weights in the model are updated, and the process is repeated until the desired number of epochs is reached.
Further, in step S1, after the face image is obtained, image normalization processing is performed to adjust the feature value scales of different dimensions to be within a similar range.
Further, in step S2, the process of extracting the local texture features of the face region by using the local binary pattern is as follows:

given a pixel (x_c, y_c), with P sampling points on a circle of radius R, the obtained LBP value can be expressed in decimal as:

LBP_{P,R}(x_c, y_c) = Σ_{p=0}^{P−1} s(i_p − i_c) · 2^p  (1)

where p indexes the P sampling points in the circular region, i_c represents the gray value of the central pixel of the circular neighborhood, and i_p represents the gray value of the p-th of the P surrounding pixels; the function s(x) is defined as:

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0  (2)

The original LBP value, treated as a binary code, is subjected to cyclic shift operations, and the smallest of all results is taken:

LBP^{ri}_{P,R} = min{ ROR(LBP_{P,R}, i) | i = 0, 1, …, P−1 }  (3)

where ROR(x, i) performs a cyclic right bit shift of the P-bit number x by i positions.
Further, the process of detecting the edge of the face region based on the Canny edge detection algorithm in step S2 is as follows:

s21, a Gaussian smoothing filter is used for convolution noise reduction, i.e., the original data are convolved with the Gaussian filter to smooth the image; expressed mathematically, the two-dimensional Gaussian function is:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))  (4)

where (x, y) are the coordinates of a pixel of the original image h(x, y) and σ is the standard deviation of the Gaussian function; convolving the Gaussian function with the original image h(x, y) yields H(x, y):

H(x, y) = G(x, y) × h(x, y)  (5)

s22, after noise filtering, the gradient magnitude and direction of H(x, y) are calculated to estimate the edge strength and direction at each point; the gradient is computed with first-order finite differences, and the matrices of first-order partial derivatives in the x and y directions are P(i, j) and Q(i, j):

P(i, j) = [H(i, j+1) − H(i, j) + H(i+1, j+1) − H(i+1, j)] / 2
Q(i, j) = [H(i, j) − H(i+1, j) + H(i, j+1) − H(i+1, j+1)] / 2  (6)

The gradient magnitude M(i, j) and gradient direction θ(i, j) are calculated as:

M(i, j) = √(P(i, j)² + Q(i, j)²),  θ(i, j) = arctan(Q(i, j) / P(i, j))  (7)

The gradient angle θ ranges over radians −π to π and is approximated to four directions representing the horizontal, vertical and two diagonal directions (0°, 45°, 90° and 135°). Non-maximum suppression is applied to the gradient magnitude along the gradient direction to find the local maxima of pixel points: at each point, the central pixel of the neighborhood is compared with the two pixels along its gradient direction; if the central pixel is the maximum it is kept as a local maximum, otherwise it is set to 0. In this way non-maxima are suppressed, the points with locally maximal gradient are retained, and a refined edge is obtained.
Further, in step S4, the image obtained after channel fusion is a three-channel image of 48 × 48 pixels.
Further, in step S4, when the image input into the lightweight neural network is recognized:

Firstly, convolution operations are performed sequentially through two 2D convolution layers with 3×3 kernels, 32 and 64 kernels respectively, and stride 1. The output then passes sequentially through residual module one, residual module two and residual module one, with 128, 256 and 512 kernels respectively; the depth separable convolutions use 3×3 kernels, the max pooling uses a 3×3 window, and the 2D convolution layers use 1×1 kernels. The output is then fed into two depth separable convolution layers with 3×3 kernels, 1024 and 512 kernels respectively, and stride 1. Finally, the output is fed into a global average pooling layer and a Softmax classifier; all 2D convolution and depth separable convolution operations pass through a Batch Normalization layer and a ReLU6 activation layer to accelerate network convergence and increase the capability of extracting nonlinear features.
Compared with the prior art, the invention has the following beneficial effects: aiming at the problems that in traditional facial expression recognition learning methods the feature extraction process is complex and deeper high-level semantic features and depth features cannot be obtained from the original image, the invention provides a facial expression recognition method based on multi-channel fusion and a lightweight neural network. Experiments show that the proposed model can effectively extract facial expression features and classify expressions, with good accuracy and robustness.
Drawings
FIG. 1 is a frame diagram of a facial expression recognition method based on multi-channel fusion and a lightweight neural network according to the present invention;
FIG. 2 is a schematic diagram of face regions detected for different facial expressions, respectively;
FIG. 3 is a schematic diagram of LBP extraction of texture features;
FIG. 4 is a rotation invariant face expression LBP feature map extracted using local binary patterns;
fig. 5 is an edge detection diagram of a face region extracted based on the Canny edge detection algorithm;
FIG. 6 is a schematic diagram of a neural network;
FIG. 7 is a cross-channel correlation and spatial correlation decoupling schematic;
FIG. 8 is a network framework diagram of residual module one;
FIG. 9 is a network framework diagram of residual module two;
FIG. 10 is a schematic diagram of an improved Xmeeting network model;
FIG. 11 is a sample of a portion image of a Fer2013 dataset;
FIG. 12 is a sample of a CK + dataset image;
FIG. 13 is a confusion matrix on the Fer2013 dataset;
FIG. 14 is a confusion matrix on a CK + data set.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts belong to the protection scope of the present invention.
1. The principle of the invention is as follows: first, image data are acquired through an expression database or a camera, and a cascade classifier based on Haar features performs face region detection on the images in the facial expression database, determining whether faces exist in the images and detecting their positions. Second, after the facial expression image is obtained, LBP extracts the local texture features of the face region, combined with the Canny edge detection algorithm, and the detected face image, the LBP texture feature image and the Canny edge image are channel-fused into a three-channel form. Finally, the fused image undergoes data normalization and data enhancement and is input into the constructed lightweight neural network for training and recognition. After the network model is built, the training is configured; the model is then trained in batches with the data-enhanced training set: data are generated batch by batch, back propagation is executed, the weights in the model are updated, and the process is repeated until the expected number of epochs is reached. The basic flow of facial expression recognition is shown in fig. 1.
2. Image pre-processing
The expression databases used in this application are the Fer2013 database and the CK+ database. However, the Fer2013 and CK+ images are original images and often contain information irrelevant to expressions, such as gestures, sunglasses and hats; to improve image recognition accuracy, this redundant information is removed by image preprocessing. The preprocessing adopted here consists of face detection and image normalization.
2.1 face detection
This application adopts a cascade classifier based on Haar features to detect the face region. Haar-like features are evaluated via the integral image and N excellent feature values (i.e., optimal weak classifiers) are screened out; the N optimal weak classifiers are then passed to AdaBoost for training, which combines them into a strong classifier distinguishing faces from non-faces; finally, several strong classifiers are cascaded together to improve accuracy. The detector has high inter-class variability, captures local intensity differences, and is computationally efficient.
Different Haar features can be combined in various ways to generate more complex cascade features, and the Haar feature value reflects the contrast and gradient changes of the image. The cascade classifier is equivalent to a decision tree whose level-by-level judgment is more accurate; the face region is searched by multi-scale scaling and traversal with a sliding window. The model is small and lightweight, so detection is fast even on resource-limited devices. Fig. 2 shows the face regions detected for different facial expressions.
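As a concrete illustration, the sketch below applies such a Haar cascade detector with OpenCV's pretrained frontal-face model; the cascade file, input path and detection parameters are illustrative assumptions, since the patent does not disclose its exact cascade configuration.

```python
import cv2

# Pretrained frontal-face Haar cascade shipped with OpenCV (an assumed
# stand-in for the patent's AdaBoost-trained cascade)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("expression.jpg")                   # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Multi-scale sliding-window search over the image, as described above;
# scaleFactor and minNeighbors are illustrative values
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))  # 48x48, as used later
```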
2.2 image normalization
In deep learning, image normalization not only accelerates the convergence of the training network but also maintains affine invariance. Its main objective is to provide invariance to fluctuations of average pixel intensity and contrast: illumination intensity, image reflectivity and the like change the pixel intensity and contrast of different areas of a picture, and normalization weakens these fluctuations through scaling, making the brighter parts darker and the darker parts lighter. In the preprocessing step, normalization adjusts the feature value scales of different dimensions into a similar range: each pixel value of the image is divided by 255 and thus normalized into the range 0–1.
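A minimal sketch of this normalization step, assuming the image is held as a NumPy array:

```python
import numpy as np

def normalize(img):
    # Divide every pixel value by 255 to map intensities into the 0-1 range
    return img.astype(np.float32) / 255.0
```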
2.3 LBP feature extraction
LBP is an operator used to describe the local texture features of an image; the extracted features are local texture features reflecting the relationship of each pixel to its surrounding pixels. The original LBP operator has a fixed neighborhood radius and is not rotation-invariant; to meet texture requirements of different sizes and frequencies, this application adopts an LBP operator with rotation invariance. It places P evenly spaced sampling points on a circle of radius R centered on the pixel to be labeled; the gray values of the P sampling points are computed by bilinear interpolation, and the neighborhood is denoted by the symbol (P, R). Rotating the image yields different LBP values, and the minimum value is taken as the LBP value after rotation-invariant processing. The principle of LBP texture feature extraction is shown in fig. 3, where black and white represent pixels weaker and stronger than the central pixel, respectively.
Expressed mathematically, given a pixel (x_c, y_c), with P sampling points on a circle of radius R, the obtained LBP value can be expressed in decimal as:

LBP_{P,R}(x_c, y_c) = Σ_{p=0}^{P−1} s(i_p − i_c) · 2^p  (1)

where p indexes the P sampling points in the circular region, i_c represents the gray value of the central pixel of the circular neighborhood, and i_p represents the gray value of the p-th of the P surrounding pixels; the function s(x) is defined as:

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0  (2)

The original LBP value, treated as a binary code, is subjected to cyclic shift operations, and the smallest of all results is taken:

LBP^{ri}_{P,R} = min{ ROR(LBP_{P,R}, i) | i = 0, 1, …, P−1 }  (3)

where ROR(x, i) performs a cyclic right bit shift of the P-bit number x by i positions. The facial expression LBP feature map extracted by the present application is shown in fig. 4.
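A possible implementation of this rotation-invariant LBP channel with scikit-image is sketched below; the library call is an assumed stand-in for the patent's own operator, and P=8, R=3 follows the best setting reported later in Table 1.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_channel(gray_face, P=8, R=3):
    # method="ror" cyclically rotates each binary code and keeps the
    # minimum value, i.e. the rotation-invariant LBP of Eq. (3)
    lbp = local_binary_pattern(gray_face, P, R, method="ror")
    return (lbp / (2 ** P - 1)).astype(np.float32)  # scale codes to [0, 1]
```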
2.4 Canny edge detection Algorithm
The Canny edge detection operator is one of the most classical and advanced algorithms in image edge detection; it is a multi-stage edge detection algorithm developed by John F. Canny. The Canny algorithm comprises four basic steps:

step 1, Gaussian filtering the input image to remove noise;

step 2, calculating the gradient magnitude and direction at each pixel;

step 3, applying non-maximum suppression to the gradient magnitude along the gradient direction;

step 4, processing and connecting edges with double thresholds.
In the invention, before Canny edge detection, noise is first removed from the original image, because noise appears where the gray value changes violently and is easily identified as a false edge. Filtering removes this noise: a Gaussian smoothing filter is used for convolution noise reduction, i.e., the original data are convolved with the Gaussian filter to smooth the image. Expressed mathematically, the two-dimensional Gaussian function is:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))  (4)

where (x, y) are the coordinates of a pixel of the original image h(x, y) and σ is the standard deviation of the Gaussian function; convolving the Gaussian function with the original image h(x, y) yields H(x, y):

H(x, y) = G(x, y) × h(x, y)  (5)

After noise filtering, the gradient magnitude and direction of H(x, y) are calculated to estimate the edge strength and direction at each point; the gradient is computed with first-order finite differences, and the matrices of first-order partial derivatives in the x and y directions are P(i, j) and Q(i, j):

P(i, j) = [H(i, j+1) − H(i, j) + H(i+1, j+1) − H(i+1, j)] / 2
Q(i, j) = [H(i, j) − H(i+1, j) + H(i, j+1) − H(i+1, j+1)] / 2  (6)

The gradient magnitude M(i, j) and gradient direction θ(i, j) are calculated by formula (7):

M(i, j) = √(P(i, j)² + Q(i, j)²),  θ(i, j) = arctan(Q(i, j) / P(i, j))  (7)

The gradient angle θ ranges over radians −π to π and is approximated to four directions representing the horizontal, vertical and two diagonal directions (0°, 45°, 90° and 135°). Non-maximum suppression is applied to the gradient magnitude along the gradient direction to find the local maxima of pixel points: at each point, the central pixel of the neighborhood is compared with the two pixels along its gradient direction; if the central pixel is the maximum it is kept as a local maximum, otherwise it is set to 0. In this way non-maxima are suppressed, the points with locally maximal gradient are retained, and a refined edge is obtained. The Canny edge map extracted by the present application for facial expression edge detection is shown in fig. 5.
2.5 feature fusion network
The fusion of features at different scales is an important means of improving detection and segmentation performance; by the order of fusion and prediction it is divided into early fusion and late fusion. By image representation level, image fusion can be divided into three levels: pixel-level fusion, feature-level fusion and decision-level fusion. After experimentally comparing the effects of several fusion modes, a method combining feature-level and pixel-level fusion that fuses early, before the predictor is trained, was finally selected, as shown in fig. 1. Let the original image pixel feature vector be v1 ∈ R^n, the LBP-extracted texture feature vector v2 ∈ R^m, and the Canny edge feature vector v3 ∈ R^k; splicing them gives the fused feature vector (a code sketch follows the list of advantages below):

v = [v1, v2, v3] ∈ R^(n+m+k)  (8)
the image fusion method has the following advantages:
(1) the image is enhanced, improving its resolution and definition; (2) relevant image features are enhanced; (3) related information complements each other while noise and redundancy are removed; (4) the target detection and recognition capability is improved; (5) complete three-dimensional reconstruction data are obtained.
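A minimal sketch of this early, channel-level fusion of Eq. (8), assuming the face, LBP and Canny maps have already been resized to 48×48:

```python
import numpy as np

def fuse_channels(gray_face, lbp_map, canny_map):
    # Stack the three single-channel maps of Eq. (8) into one
    # three-channel network input of shape (48, 48, 3)
    return np.dstack([gray_face, lbp_map, canny_map]).astype(np.float32)
```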
3. Building neural network model
3.1 lightweight convolutional neural network:

In deep learning, large networks often suffer from storage problems caused by huge numbers of weight parameters and from slow task processing caused by heavy computation. This application therefore addresses these problems with a lightweight model. The design idea of the lightweight model is a more efficient network convolution mode that reduces network parameters without losing network performance; it has the characteristics of local connection, weight sharing and hierarchical representation, as shown in fig. 6.
In image processing, convolution is the process of a convolution kernel sliding over an image and computing the sum of products of corresponding elements. The convolution layer is the most important part of a convolutional network; its main function is to extract image features by enhancing original signal features and reducing noise. The convolution operation is computed as:

x_j^L = g( Σ_i x_i^{L−1} * θ_ij + b_j )  (9)

where x_j^L is the j-th feature map unit of layer L, x_i^{L−1} is the i-th input of layer L−1, θ_ij represents the convolution kernel, b_j is the bias unit, and g(x) is the activation function.
Pooling, also called down-sampling, has no parameters to learn; its main function is to shrink the feature map and reduce dimensionality. A common method is to take the maximum or the average of a local area. This application uses max pooling on the spatial data.
Typical networks use a fully connected layer, but fully connected layers have many parameters and overfit easily. This application therefore replaces the fully connected layer with global average pooling, which not only reduces the number of parameters and prevents overfitting, but also sums over the spatial information and is therefore stable to spatial transformations of the input.
3.2 depth separable convolution:

In order to reduce the number of network parameters and improve model performance, the idea of depth separable convolution can be used to handle channel correlation and spatial correlation separately. The core idea of depth separable convolution is to convolve different input channels with different convolution kernels, decomposing the standard convolution operation into two steps: depthwise convolution and pointwise convolution. Here the depth separable convolution first performs the pointwise convolution and then the depthwise convolution; to ensure the data are not corrupted, no ReLU nonlinearity is inserted between the two convolutions, as shown in fig. 7. Compared with standard convolution, the adopted depth separable convolution achieves the same effect as a standard convolution layer at smaller space and time cost, greatly reducing computation while preserving the accuracy of the neural network.
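A hedged Keras sketch of one separable-convolution block as used in this application follows. Note that Keras's SeparableConv2D applies the depthwise step before the pointwise step, whereas the text above describes the reversed, Xception-style ordering; the block below is therefore an approximation, with Batch Normalization and ReLU6 attached as described in section 3.5.

```python
from tensorflow.keras import layers

def sep_conv_block(x, filters):
    # 3x3 depthwise convolution followed by 1x1 pointwise convolution,
    # then Batch Normalization and the ReLU6 activation of Eq. (10)
    x = layers.SeparableConv2D(filters, (3, 3), padding="same",
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)
```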
3.3 activation function:

In deep neural networks, linear models lack expressive power and cannot solve linearly inseparable problems. Introducing an activation function into the hidden layers adds a nonlinear factor, improving the expressive power and nonlinear modeling capability of the neural network and solving problems a linear model cannot. This application uses ReLU6, introduced in MobileNetV2, as the nonlinear activation function. ReLU6 limits the maximum output of the ReLU function to 6; it is more robust and retains good numerical resolution at low precision. The ReLU6 function is defined as follows:
ReLU6=min{max(0,x),6} (10)
where x is the output feature of the previous network layer.
The ReLU6 function has the following advantages:
(1) it alleviates the vanishing gradient and exploding gradient problems during backpropagation in deep neural networks;
(2) the calculation amount is small, and the calculation speed is high;
(3) the convergence speed of gradient descent is accelerated;
(4) and the sparse expression capability of the neural network is improved.
3.4 residual error network
This aims to solve the problems that deepening the network lowers learning efficiency and no longer improves accuracy effectively. This application adopts the residual connection mechanism of ResNet50 to construct the network model and replaces its convolution layers with the depth separable convolution proposed above. In a residual network, the data output of an earlier layer skips several layers and is introduced directly into the input of a later layer; this makes it possible to train deeper neural networks and mitigates the performance degradation and vanishing gradients caused by increased depth. Introducing residuals not only markedly speeds up network convergence but also achieves higher accuracy in deep neural networks. The network structures of residual module one and residual module two are shown in fig. 8 and fig. 9, respectively.
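Since figs. 8 and 9 are not reproduced here, the sketch below reconstructs residual module one from the textual description alone (two separable convolutions on the main branch, 3×3 max pooling, and a 1×1 convolution shortcut); the strides and the internal differences of residual module two are assumptions. It reuses sep_conv_block from the previous sketch.

```python
from tensorflow.keras import layers

def residual_module_one(x, filters):
    # Shortcut branch: strided 1x1 convolution so shapes match for addition
    shortcut = layers.Conv2D(filters, (1, 1), strides=2, padding="same",
                             use_bias=False)(x)
    shortcut = layers.BatchNormalization()(shortcut)
    # Main branch: two separable convolutions, then 3x3 max pooling
    y = sep_conv_block(x, filters)
    y = sep_conv_block(y, filters)
    y = layers.MaxPooling2D((3, 3), strides=2, padding="same")(y)
    return layers.Add()([y, shortcut])
```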
3.5 network architecture
This application provides a facial expression recognition method based on multi-channel fusion and a lightweight neural network: a 48×48-pixel three-channel image produced by multi-channel fusion is input into the constructed lightweight neural network, which is designed on an Xception-based structure and improves facial expression recognition accuracy mainly through layer-by-layer processing of the channel-fused three-channel features. The proposed network structure is shown in fig. 10. Firstly, convolution operations are performed sequentially through two 2D convolution layers with 3×3 kernels, 32 and 64 kernels respectively, and stride 1. The output then passes sequentially through residual module one, residual module two and residual module one, with 128, 256 and 512 kernels respectively; the depth separable convolutions use 3×3 kernels, the max pooling uses a 3×3 window, and the 2D convolution layers use 1×1 kernels. The output is then fed into two depth separable convolution layers with 3×3 kernels, 1024 and 512 kernels respectively, and stride 1. Finally, the output is fed into a global average pooling layer and a Softmax classifier; all 2D convolution and depth separable convolution operations pass through a Batch Normalization layer and a ReLU6 activation layer to accelerate network convergence and increase the capability of extracting nonlinear features.
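The whole pipeline described above can be assembled as in the sketch below, building on the two previous sketches. This is an approximation under stated assumptions: residual module two is stood in for by the same structure as module one, since its internal differences are only shown in fig. 9.

```python
from tensorflow.keras import layers, Model

def build_model(num_classes=7):
    inputs = layers.Input(shape=(48, 48, 3))        # channel-fused input
    x = inputs
    for n in (32, 64):                              # two 3x3 2D convolutions
        x = layers.Conv2D(n, (3, 3), strides=1, padding="same",
                          use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU(max_value=6.0)(x)
    for n in (128, 256, 512):                       # residual modules I, II, I
        x = residual_module_one(x, n)               # module II approximated
    x = sep_conv_block(x, 1024)                     # two separable convolutions
    x = sep_conv_block(x, 512)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)
```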
4. Results and analysis of the experiments
4.1 Experimental platform and data set:
the experimental software platform is a Python3.7 version under Linux, and adopts a Keras framework with TensorFlow as a rear end. The hardware platform is Dell Poweredge R940xa, and the GPU is 16GB NVIDIA Tesla T4. We optimize the training process using a batch size of 64, an initial learning rate of 0.001, and an Adam optimizer.
Facial expression recognition experiments are performed on the Fer2013 and CK+ data sets respectively. The data are divided randomly and objectively by cross-validation to reduce human factors: 80% of the expression images form the training set and 20% the test set, with the random seed set to 2019 to control the random state.
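The stated split and training configuration could be reproduced as below. The 80/20 split, seed 2019, batch size 64, learning rate 0.001 and Adam optimizer come from the text; fused_images, one_hot_labels, the augmentation settings and the epoch count are illustrative assumptions, and scikit-learn is used for the split as a convenience.

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 80/20 random split with the stated seed controlling the random state
x_train, x_test, y_train, y_test = train_test_split(
    fused_images, one_hot_labels, test_size=0.2, random_state=2019)

model = build_model()
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Batch-wise training with on-the-fly data enhancement; back propagation
# updates the weights each batch until the desired number of epochs
augmenter = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)
model.fit(augmenter.flow(x_train, y_train, batch_size=64),
          validation_data=(x_test, y_test), epochs=100)   # epochs assumed
```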
1) Fer2013 dataset
Fer2013 is the data set of the Kaggle facial expression analysis competition; it was created with the Google image search API, and OpenCV face detection was used to collect the face region in each image. Because the Fer2013 data are relatively complete and closer to real-life scenes, Fer2013 is mainly selected for training and testing the model. The data set contains 35887 pictures covering 7 expressions: 4953 angry, 547 disgust, 5121 fear, 8989 happy, 6077 sad, 4002 surprise and 6198 neutral. Fig. 11 shows part of the Fer2013 expression data set, which includes facial expressions of different ages, genders and skin colors with different degrees of occlusion.
2) CK + data set
The CK+ data set comprises 593 expression sequences from 123 subjects; each picture sequence shows the transition from a neutral facial expression to the peak expression, covering the 6 basic expressions plus contempt and neutral. Because the Fer2013 data set contains no contempt expression, 7 expressions are finally selected: angry, neutral, disgust, fear, happy, sad and surprise. Fig. 12 shows examples from the CK+ expression data set, which includes facial expressions of different ages, genders and skin colors with different degrees of occlusion.
4.2 analysis of the results:

The experimental results were obtained after repeatedly optimizing the training model; figs. 13 and 14 show the confusion matrices on the Fer2013 and CK+ data sets. The Fer2013 confusion matrix shows that happy and surprise have the highest recognition rates while fear and sad are relatively low, because Fer2013 contains interference such as non-face pictures and various occlusions and is closer to real-life scenes, which increases the difficulty of feature extraction. The CK+ confusion matrix shows good recognition results for happy, angry and surprise, while the recognition of fear and sad is not good enough; a possible reason is that the fear and sad expression features are somewhat similar, interfering with the distinction of different expressions and lowering the expression recognition accuracy.
4.2.1 Domain selection
Selecting an appropriate neighborhood has a significant impact on the final performance of LBP-based techniques; it involves the number of sampling points, the neighborhood radius, the distribution of the sampling points and the neighborhood shape. To find the values of P and R that yield the highest accuracy in the rotation-invariant LBP mode, this application performed comparison experiments with different values of P and R; the results are shown in Table 1.
TABLE 1 Experiment accuracy on the Fer2013 and CK+ data sets

| Input image type | Fer2013 | CK+ |
| --- | --- | --- |
| Raw data set image | 66.42% | 92.35% |
| LBP(8,1) | 68.51% | 94.56% |
| LBP(8,3) | 70.54% | 97.62% |
| LBP(8,5) | 68.98% | 96.83% |
| LBP(24,1) | 65.62% | 86.61% |
Column 1 of Table 1 lists the input image types; columns 2 and 3 give the accuracy of identifying images on the Fer2013 and CK+ data sets with the present network model. The experimental data show that the rotation-invariant LBP(8, 3) feature map extracted from the data set images, input after channel fusion into the constructed lightweight neural network, achieves the highest recognition rate.
4.2.2 comparative testing of different methods
To verify the effectiveness of this method for facial expression recognition, the experiment is compared with the Xception, CNN and FER-Net expression recognition algorithms on the Fer2013 data set, and with the Xception, InceptionV4 and LBP algorithms on the CK+ data set; the comparison results are shown in Tables 2 and 3. The recognition accuracy of the model is 97.62% on the CK+ data set; human recognition accuracy on the Fer2013 data set is 65% ± 5%, and the model reaches 70.54% on Fer2013, achieving human-level recognition. Compared with traditional facial expression recognition algorithms, deep learning achieves higher accuracy, and compared with deep learning alone, combining deep learning with traditional methods achieves higher accuracy still; the recognition rate of this method exceeds the other methods on both data sets, verifying that the model has feasibility and generalization capability in expression recognition.
Table 2 identification results of different methods on Fer2013 data set
TABLE 3 recognition results of different methods on CK + data sets
In summary, aiming at the problems that in traditional facial expression recognition learning methods the feature extraction process is complex and deeper high-level semantic features and depth features cannot be obtained from the original image, this application proposes a facial expression recognition method based on multi-channel fusion and a lightweight neural network. Experiments show that the proposed model can effectively extract facial expression features and classify expressions, with high accuracy and robustness. Although the proposed model achieves a good recognition effect, the generalization capability of the network model still needs strengthening to further improve the facial expression recognition rate.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. The facial expression recognition method based on multi-channel fusion and a lightweight neural network, characterized by comprising the following steps:
s1, acquiring image data through an expression database or a camera, and performing face region detection on the facial expression database images by using a cascade classifier based on Haar features to acquire a face image;
s2, extracting local texture features of the face region by adopting a local binary pattern, and detecting the edge of the face region based on a Canny edge detection algorithm;
s3, constructing and initializing a lightweight neural network;
s4, performing channel fusion on the obtained face image, the LBP texture feature image and the edge detection Canny image, performing data normalization and data enhancement on the fused image, and inputting the image into the constructed lightweight neural network for training and recognition.
2. The method for recognizing facial expressions based on multi-channel fusion and a lightweight neural network as claimed in claim 1, wherein in step S4, when training the constructed lightweight neural network, the model is trained in batches with the data-enhanced training set: data are generated batch by batch, back propagation is performed, the weights in the model are updated, and the process is repeated until the desired number of epochs is reached.
3. The method for recognizing facial expressions based on multi-channel fusion and lightweight neural network as claimed in claim 1, wherein in step S1, after the face image is obtained, image normalization is performed to adjust the feature value scales of different dimensions to be within a similar range.
4. The method for recognizing facial expressions based on multi-channel fusion and a lightweight neural network as claimed in claim 1, wherein in step S2, the process of extracting the local texture features of the face region using the local binary pattern is as follows:

given a pixel (x_c, y_c), with P sampling points on a circle of radius R, the obtained LBP value can be expressed in decimal as:

LBP_{P,R}(x_c, y_c) = Σ_{p=0}^{P−1} s(i_p − i_c) · 2^p  (1)

where p indexes the P sampling points in the circular region, i_c represents the gray value of the central pixel of the circular neighborhood, and i_p represents the gray value of the p-th of the P surrounding pixels; the function s(x) is defined as:

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0  (2)

The original LBP value, treated as a binary code, is subjected to cyclic shift operations, and the smallest of all results is taken:

LBP^{ri}_{P,R} = min{ ROR(LBP_{P,R}, i) | i = 0, 1, …, P−1 }  (3)

where ROR(x, i) performs a cyclic right bit shift of the P-bit number x by i positions.
5. The method for recognizing facial expressions based on multi-channel fusion and a lightweight neural network according to claim 1, wherein the process of detecting the edge of the face region based on the Canny edge detection algorithm in step S2 is as follows:

s21, a Gaussian smoothing filter is used for convolution noise reduction, i.e., the original data are convolved with the Gaussian filter to smooth the image; expressed mathematically, the two-dimensional Gaussian function is:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))  (4)

where (x, y) are the coordinates of a pixel of the original image h(x, y) and σ is the standard deviation of the Gaussian function; convolving the Gaussian function with the original image h(x, y) yields H(x, y):

H(x, y) = G(x, y) × h(x, y)  (5)

s22, after noise filtering, the gradient magnitude and direction of H(x, y) are calculated to estimate the edge strength and direction at each point; the gradient is computed with first-order finite differences, and the matrices of first-order partial derivatives in the x and y directions are P(i, j) and Q(i, j):

P(i, j) = [H(i, j+1) − H(i, j) + H(i+1, j+1) − H(i+1, j)] / 2
Q(i, j) = [H(i, j) − H(i+1, j) + H(i, j+1) − H(i+1, j+1)] / 2  (6)

The gradient magnitude M(i, j) and gradient direction θ(i, j) are calculated as:

M(i, j) = √(P(i, j)² + Q(i, j)²),  θ(i, j) = arctan(Q(i, j) / P(i, j))  (7)

The gradient angle θ ranges over radians −π to π and is approximated to four directions representing the horizontal, vertical and two diagonal directions (0°, 45°, 90° and 135°). Non-maximum suppression is applied to the gradient magnitude along the gradient direction to find the local maxima of pixel points: at each point, the central pixel of the neighborhood is compared with the two pixels along its gradient direction; if the central pixel is the maximum it is kept as a local maximum, otherwise it is set to 0. In this way non-maxima are suppressed, the points with locally maximal gradient are retained, and a refined edge is obtained.
6. The method for recognizing facial expressions based on multi-channel fusion and lightweight neural network as claimed in claim 1, wherein in step S4, the image obtained after channel fusion is a three-channel image with 48 x 48 pixels.
7. The method for recognizing facial expressions based on multi-channel fusion and a lightweight neural network as claimed in claim 6, wherein in step S4, when the image input into the lightweight neural network is recognized:

Firstly, convolution operations are performed sequentially through two 2D convolution layers with 3×3 kernels, 32 and 64 kernels respectively, and stride 1. The output then passes sequentially through residual module one, residual module two and residual module one, with 128, 256 and 512 kernels respectively; the depth separable convolutions use 3×3 kernels, the max pooling uses a 3×3 window, and the 2D convolution layers use 1×1 kernels. The output is then fed into two depth separable convolution layers with 3×3 kernels, 1024 and 512 kernels respectively, and stride 1. Finally, the output is fed into a global average pooling layer and a Softmax classifier; all 2D convolution and depth separable convolution operations pass through a Batch Normalization layer and a ReLU6 activation layer to accelerate network convergence and increase the capability of extracting nonlinear features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111273460.8A CN113989890A (en) | 2021-10-29 | 2021-10-29 | Face expression recognition method based on multi-channel fusion and lightweight neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111273460.8A CN113989890A (en) | 2021-10-29 | 2021-10-29 | Face expression recognition method based on multi-channel fusion and lightweight neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113989890A true CN113989890A (en) | 2022-01-28 |
Family
ID=79744529
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111273460.8A Withdrawn CN113989890A (en) | 2021-10-29 | 2021-10-29 | Face expression recognition method based on multi-channel fusion and lightweight neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113989890A (en) |
-
2021
- 2021-10-29 CN CN202111273460.8A patent/CN113989890A/en not_active Withdrawn
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114663452A (en) * | 2022-02-28 | 2022-06-24 | 南京工业大学 | Airport visibility classification method based on MobileNet-V2 neural network |
CN114639149A (en) * | 2022-03-18 | 2022-06-17 | 杭州慧田科技有限公司 | Sick bed terminal with emotion recognition function |
CN114699080A (en) * | 2022-04-28 | 2022-07-05 | 电子科技大学 | Driver mental stress degree identification method based on fusion characteristics |
CN114699080B (en) * | 2022-04-28 | 2023-04-25 | 电子科技大学 | Driver mental stress degree identification method based on fusion characteristics |
CN114998958A (en) * | 2022-05-11 | 2022-09-02 | 华南理工大学 | Face recognition method based on lightweight convolutional neural network |
CN114998958B (en) * | 2022-05-11 | 2024-04-16 | 华南理工大学 | Face recognition method based on lightweight convolutional neural network |
CN115019363A (en) * | 2022-05-19 | 2022-09-06 | 重庆邮电大学 | Lightweight facial expression recognition method based on mid-Xconvergence network |
CN115348709A (en) * | 2022-10-18 | 2022-11-15 | 良业科技集团股份有限公司 | Smart cloud service lighting display method and system suitable for text travel |
CN116403270B (en) * | 2023-06-07 | 2023-09-05 | 南昌航空大学 | Facial expression recognition method and system based on multi-feature fusion |
CN116403270A (en) * | 2023-06-07 | 2023-07-07 | 南昌航空大学 | Facial expression recognition method and system based on multi-feature fusion |
CN116958783A (en) * | 2023-07-24 | 2023-10-27 | 中国矿业大学 | Light-weight image recognition method based on depth residual two-dimensional random configuration network |
CN116958783B (en) * | 2023-07-24 | 2024-02-27 | 中国矿业大学 | Light-weight image recognition method based on depth residual two-dimensional random configuration network |
CN116863323A (en) * | 2023-09-04 | 2023-10-10 | 济宁鑫惠生水产养殖专业合作社 | Visual detection method and system for pollution of water source for fishery culture |
CN116863323B (en) * | 2023-09-04 | 2023-11-24 | 济宁鑫惠生水产养殖专业合作社 | Visual detection method and system for pollution of water source for fishery culture |
CN118230395A (en) * | 2024-05-13 | 2024-06-21 | 广东电网有限责任公司 | Human face recognition method and device based on INSIGHTFACE and LIS file management |
CN118379776A (en) * | 2024-06-21 | 2024-07-23 | 浙江核新同花顺网络信息股份有限公司 | Face attribute identification method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113989890A (en) | Face expression recognition method based on multi-channel fusion and lightweight neural network | |
Pitaloka et al. | Enhancing CNN with preprocessing stage in automatic emotion recognition | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
Zhang et al. | Multimodal learning for facial expression recognition | |
CN105139004B (en) | Facial expression recognizing method based on video sequence | |
CN112288011B (en) | Image matching method based on self-attention deep neural network | |
CN108921019B (en) | Gait recognition method based on GEI and TripletLoss-DenseNet | |
CN114758383A (en) | Expression recognition method based on attention modulation context spatial information | |
US20210264144A1 (en) | Human pose analysis system and method | |
Tian et al. | Ear recognition based on deep convolutional network | |
CN108960059A (en) | A kind of video actions recognition methods and device | |
CN110222718B (en) | Image processing method and device | |
Kas et al. | New framework for person-independent facial expression recognition combining textural and shape analysis through new feature extraction approach | |
CN111353385A (en) | Pedestrian re-identification method and device based on mask alignment and attention mechanism | |
Prabhu et al. | Facial Expression Recognition Using Enhanced Convolution Neural Network with Attention Mechanism. | |
CN114049531A (en) | Pedestrian re-identification method based on weak supervision human body collaborative segmentation | |
Mohammed et al. | Deep convolution neural network for facial expression recognition | |
CN111860056B (en) | Blink-based living body detection method, blink-based living body detection device, readable storage medium and blink-based living body detection equipment | |
Jabbooree et al. | A novel facial expression recognition algorithm using geometry β–skeleton in fusion based on deep CNN | |
CN116884067B (en) | Micro-expression recognition method based on improved implicit semantic data enhancement | |
Raj et al. | Object detection in live streaming video using deep learning approach | |
Chun-man et al. | Face expression recognition based on improved MobileNeXt | |
Vepuri | Improving facial emotion recognition with image processing and deep learning | |
CN117523626A (en) | Pseudo RGB-D face recognition method | |
Patil et al. | Gender recognition and age approximation using deep learning techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20220128 |
WW01 | Invention patent application withdrawn after publication |