CN113989890A - Face expression recognition method based on multi-channel fusion and lightweight neural network - Google Patents
Face expression recognition method based on multi-channel fusion and lightweight neural network
- Publication number
- CN113989890A (application number CN202111273460.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- convolution
- neural network
- lightweight neural
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a facial expression recognition method based on multi-channel fusion and a lightweight neural network, addressing two problems of traditional facial expression recognition learning methods: the feature extraction process is complex, and deeper high-level semantic features and depth features cannot be obtained from the original image. Experiments show that the proposed model can effectively extract facial expression features and classify expressions, with good accuracy and robustness.
Description
Technical Field
The invention belongs to the technical field of facial expression recognition, and particularly relates to a facial expression recognition method based on multi-channel fusion and a lightweight neural network.
Background
In recent years, machine learning has developed rapidly in the field of artificial intelligence, and more and more researchers are concerned with how to make computers understand human emotion better and thereby change the relationship between humans and computers. Mehrabian's research showed that in person-to-person communication, facial expressions convey a very large proportion of the information, up to 55%, while only 7% depends on the content of what is spoken. It follows that facial expressions play a crucial role in human-to-human communication. Expression recognition is an interdisciplinary subject spanning physiology, neurology, computer science and other fields, and has great potential application value in psychology, intelligent robots, online education, intelligent monitoring and the like.
An expression is a form of non-verbal communication; it mostly refers to states formed by the facial muscles, eye muscles, mouth muscles and the five sense organs, such as smiling or anger, and can promptly reflect a person's emotional changes and psychological state. In 1971, the psychologists Ekman and Friesen divided facial expressions into 6 basic expressions, namely happiness, sadness, surprise, fear, anger and disgust, systematically established a facial expression image library, and described in detail the facial movement characteristics corresponding to each expression. Currently, facial expression recognition methods mostly fall into two categories. The first is based on traditional feature extraction algorithms, including extraction algorithms based on geometric features, texture features and the like: a suitable feature extraction algorithm is designed according to the feature requirements and combined with different classifiers, and the facial expressions are then classified and identified by algorithms such as sparse representation classification and hidden Markov models. However, traditional feature extraction algorithms can only extract shallow features according to artificially designed descriptors rather than autonomous learning, and human factors easily introduce errors that affect the accuracy of expression classification. The second category is feature extraction based on deep learning, comprising three models: the convolutional neural network, the deep belief network and the restricted Boltzmann machine. Deep neural networks have strong autonomous learning ability and can extract deeper features that are also more amenable to visualization, after which the classification result is output through a classifier. Jain et al. proposed a single deep convolutional neural network model containing deep residual blocks and found that combining an FCN with residual blocks could greatly improve the overall result. Ko introduced a hybrid deep learning approach that combines a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM). Ali et al. proposed a facial emotion recognition method that uses graph mining to reduce the extracted features, in which the gSpan frequent-subgraph mining algorithm finds the frequent substructures of each emotion in a graph database. Yong Li et al. proposed a Convolutional Neural Network with an attention mechanism (ACNN) that can effectively perceive the occluded regions of faces and improve recognition accuracy for both non-occluded and occluded faces. Tang combined a CNN with an SVM to recognize facial expressions, achieving a better recognition effect on the Fer2013 data set. Fengyuan et al. combined SIFT and CNN feature fusion to recognize facial expressions, improving the accuracy of model recognition.
Disclosure of Invention
In order to solve the problems of difficult feature extraction and incomplete feature extraction in the traditional method, the invention aims to provide a facial expression recognition method based on multi-channel fusion and a lightweight neural network, so that more complete image features are further extracted, and the accuracy and robustness of facial expression recognition are improved.
In order to achieve the purpose, the invention adopts the following technical scheme: the facial expression recognition method based on multi-channel fusion and a lightweight neural network comprises the following steps:
s1, acquiring image data through an expression database or a camera, and performing face region detection on the facial expression database images by using a cascade classifier based on Haar features to acquire a face image;
s2, extracting local texture features of the face region by adopting a local binary pattern, and detecting the edge of the face region based on a Canny edge detection algorithm;
s3, constructing and initializing a lightweight neural network;
s4, performing channel fusion on the obtained face image, the LBP texture feature image and the edge detection Canny image, performing data normalization and data enhancement on the fused image, and inputting the image into the constructed lightweight neural network for training and recognition.
Further, in step S4, when training the constructed lightweight neural network, the model is trained in batches with the data-enhanced training set: data are generated batch by batch, back propagation is performed, the weights in the model are updated, and the process is repeated until the desired number of epochs is reached.
Further, in step S1, after the face image is obtained, image normalization processing is performed to adjust the feature value scales of different dimensions to be within a similar range.
Further, in step S2, the process of extracting the local texture features of the face region by using the local binary pattern is as follows:

given a pixel (x_c, y_c), with P sampling points on a circle of radius R, the obtained LBP value can be expressed in decimal as:

LBP_{P,R}(x_c, y_c) = Σ_{p=0}^{P−1} s(i_p − i_c) · 2^p  (1)

where p indexes the P sampling points in the circular region, i_c represents the gray value of the central pixel of the circular neighborhood, and i_p represents the gray value of the p-th of the P surrounding pixels; the function s(x) is defined as:

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0  (2)

The original LBP value, treated as a binary code, is subjected to cyclic shift operations, and the smallest of all results is taken:

LBP^{ri}_{P,R} = min{ ROR(LBP_{P,R}, i) | i = 0, 1, …, P−1 }  (3)

where ROR(x, i) performs a cyclic right bit shift of the P-bit number x by i positions.
Further, the process of detecting the edge of the face region based on the Canny edge detection algorithm in step S2 is as follows:

s21, a Gaussian smoothing filter is used for convolution noise reduction, i.e., the original data are convolved with the Gaussian filter to smooth the image; expressed mathematically, the two-dimensional Gaussian function is:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))  (4)

where (x, y) are the coordinates of a pixel of the original image h(x, y) and σ is the standard deviation of the Gaussian function; convolving the Gaussian function with the original image h(x, y) yields H(x, y):

H(x, y) = G(x, y) × h(x, y)  (5)

s22, after noise filtering, the gradient magnitude and direction of H(x, y) are calculated to estimate the edge strength and direction at each point; the gradient is computed with first-order finite differences, and the matrices of first-order partial derivatives in the x and y directions are P(i, j) and Q(i, j):

P(i, j) = [H(i, j+1) − H(i, j) + H(i+1, j+1) − H(i+1, j)] / 2
Q(i, j) = [H(i, j) − H(i+1, j) + H(i, j+1) − H(i+1, j+1)] / 2  (6)

The gradient magnitude M(i, j) and gradient direction θ(i, j) are calculated as:

M(i, j) = √(P(i, j)² + Q(i, j)²),  θ(i, j) = arctan(Q(i, j) / P(i, j))  (7)

The gradient angle θ ranges over radians −π to π and is approximated to four directions representing the horizontal, vertical and two diagonal directions (0°, 45°, 90° and 135°). Non-maximum suppression is applied to the gradient magnitude along the gradient direction to find the local maxima of pixel points: at each point, the central pixel of the neighborhood is compared with the two pixels along its gradient direction; if the central pixel is the maximum it is kept as a local maximum, otherwise it is set to 0. In this way non-maxima are suppressed, the points with locally maximal gradient are retained, and a refined edge is obtained.
Further, in step S4, the image obtained after channel fusion is a three-channel image of 48 × 48 pixels.
Further, in step S4, when the image input into the lightweight neural network is recognized:

Firstly, convolution operations are performed sequentially through two 2D convolution layers with 3×3 kernels, 32 and 64 kernels respectively, and stride 1. The output then passes sequentially through residual module one, residual module two and residual module one, with 128, 256 and 512 kernels respectively; the depth separable convolutions use 3×3 kernels, the max pooling uses a 3×3 window, and the 2D convolution layers use 1×1 kernels. The output is then fed into two depth separable convolution layers with 3×3 kernels, 1024 and 512 kernels respectively, and stride 1. Finally, the output is fed into a global average pooling layer and a Softmax classifier; all 2D convolution and depth separable convolution operations pass through a Batch Normalization layer and a ReLU6 activation layer to accelerate network convergence and increase the capability of extracting nonlinear features.
Compared with the prior art, the invention has the following beneficial effects: aiming at the problems that in traditional facial expression recognition learning methods the feature extraction process is complex and deeper high-level semantic features and depth features cannot be obtained from the original image, the invention provides a facial expression recognition method based on multi-channel fusion and a lightweight neural network. Experiments show that the proposed model can effectively extract facial expression features and classify expressions, with good accuracy and robustness.
Drawings
FIG. 1 is a frame diagram of a facial expression recognition method based on multi-channel fusion and a lightweight neural network according to the present invention;
FIG. 2 is a schematic diagram of face regions detected for different facial expressions, respectively;
FIG. 3 is a schematic diagram of LBP extraction of texture features;
FIG. 4 is a rotation invariant face expression LBP feature map extracted using local binary patterns;
fig. 5 is an edge detection diagram of a face region extracted based on the Canny edge detection algorithm;
FIG. 6 is a schematic diagram of a neural network;
FIG. 7 is a cross-channel correlation and spatial correlation decoupling schematic;
FIG. 8 is a network framework diagram of residual module one;
FIG. 9 is a network framework diagram of residual module two;
FIG. 10 is a schematic diagram of an improved Xmeeting network model;
FIG. 11 is a sample of a portion image of a Fer2013 dataset;
FIG. 12 is a sample of a CK + dataset image;
FIG. 13 is a confusion matrix on the Fer2013 dataset;
FIG. 14 is a confusion matrix on a CK + data set.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts belong to the protection scope of the present invention.
1. The principle of the invention is as follows: first, image data are acquired through an expression database or a camera, and a cascade classifier based on Haar features performs face region detection on the images in the facial expression database, determining whether faces exist in the images and detecting their positions. Second, after the facial expression image is obtained, LBP extracts the local texture features of the face region, combined with the Canny edge detection algorithm, and the detected face image, the LBP texture feature image and the Canny edge image are channel-fused into a three-channel form. Finally, the fused image undergoes data normalization and data enhancement and is input into the constructed lightweight neural network for training and recognition. After the network model is built, the training is configured; the model is then trained in batches with the data-enhanced training set: data are generated batch by batch, back propagation is executed, the weights in the model are updated, and the process is repeated until the expected number of epochs is reached. The basic flow of facial expression recognition is shown in fig. 1.
2. Image pre-processing
The expression databases used in this application are the Fer2013 database and the CK+ database. However, the Fer2013 and CK+ images are original images and often contain information irrelevant to expressions, such as gestures, sunglasses and hats; to improve image recognition accuracy, this redundant information is removed by image preprocessing. The preprocessing adopted here consists of face detection and image normalization.
2.1 face detection
This application adopts a cascade classifier based on Haar features to detect the face region. Haar-like features are evaluated via the integral image and N excellent feature values (i.e., optimal weak classifiers) are screened out; the N optimal weak classifiers are then passed to AdaBoost for training, which combines them into a strong classifier distinguishing faces from non-faces; finally, several strong classifiers are cascaded together to improve accuracy. The detector has high inter-class variability, captures local intensity differences, and is computationally efficient.
Different Haar features can be combined in various ways to generate more complex cascade features, and the Haar feature value reflects the contrast and gradient changes of the image. The cascade classifier is equivalent to a decision tree whose level-by-level judgment is more accurate; the face region is searched by multi-scale scaling and traversal with a sliding window. The model is small and lightweight, so detection is fast even on resource-limited devices. Fig. 2 shows the face regions detected for different facial expressions.
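As a concrete illustration, the sketch below applies such a Haar cascade detector with OpenCV's pretrained frontal-face model; the cascade file, input path and detection parameters are illustrative assumptions, since the patent does not disclose its exact cascade configuration.

```python
import cv2

# Pretrained frontal-face Haar cascade shipped with OpenCV (an assumed
# stand-in for the patent's AdaBoost-trained cascade)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("expression.jpg")                   # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Multi-scale sliding-window search over the image, as described above;
# scaleFactor and minNeighbors are illustrative values
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))  # 48x48, as used later
```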
2.2 image normalization
In deep learning, image normalization not only accelerates the convergence of the training network but also maintains affine invariance. Its main objective is to provide invariance to fluctuations of average pixel intensity and contrast: illumination intensity, image reflectivity and the like change the pixel intensity and contrast of different areas of a picture, and normalization weakens these fluctuations through scaling, making the brighter parts darker and the darker parts lighter. In the preprocessing step, normalization adjusts the feature value scales of different dimensions into a similar range: each pixel value of the image is divided by 255 and thus normalized into the range 0–1.
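A minimal sketch of this normalization step, assuming the image is held as a NumPy array:

```python
import numpy as np

def normalize(img):
    # Divide every pixel value by 255 to map intensities into the 0-1 range
    return img.astype(np.float32) / 255.0
```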
2.3 LBP feature extraction
LBP is an operator used to describe the local texture features of an image; the extracted features are local texture features reflecting the relationship of each pixel to its surrounding pixels. The original LBP operator has a fixed neighborhood radius and is not rotation-invariant; to meet texture requirements of different sizes and frequencies, this application adopts an LBP operator with rotation invariance. It places P evenly spaced sampling points on a circle of radius R centered on the pixel to be labeled; the gray values of the P sampling points are computed by bilinear interpolation, and the neighborhood is denoted by the symbol (P, R). Rotating the image yields different LBP values, and the minimum value is taken as the LBP value after rotation-invariant processing. The principle of LBP texture feature extraction is shown in fig. 3, where black and white represent pixels weaker and stronger than the central pixel, respectively.
Expressed mathematically, given a pixel (x_c, y_c), with P sampling points on a circle of radius R, the obtained LBP value can be expressed in decimal as:

LBP_{P,R}(x_c, y_c) = Σ_{p=0}^{P−1} s(i_p − i_c) · 2^p  (1)

where p indexes the P sampling points in the circular region, i_c represents the gray value of the central pixel of the circular neighborhood, and i_p represents the gray value of the p-th of the P surrounding pixels; the function s(x) is defined as:

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0  (2)

The original LBP value, treated as a binary code, is subjected to cyclic shift operations, and the smallest of all results is taken:

LBP^{ri}_{P,R} = min{ ROR(LBP_{P,R}, i) | i = 0, 1, …, P−1 }  (3)

where ROR(x, i) performs a cyclic right bit shift of the P-bit number x by i positions. The facial expression LBP feature map extracted by the present application is shown in fig. 4.
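A possible implementation of this rotation-invariant LBP channel with scikit-image is sketched below; the library call is an assumed stand-in for the patent's own operator, and P=8, R=3 follows the best setting reported later in Table 1.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_channel(gray_face, P=8, R=3):
    # method="ror" cyclically rotates each binary code and keeps the
    # minimum value, i.e. the rotation-invariant LBP of Eq. (3)
    lbp = local_binary_pattern(gray_face, P, R, method="ror")
    return (lbp / (2 ** P - 1)).astype(np.float32)  # scale codes to [0, 1]
```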
2.4 Canny edge detection Algorithm
The Canny edge detection operator is one of the most classical and advanced algorithms in image edge detection; it is a multi-stage edge detection algorithm developed by John F. Canny. The Canny algorithm comprises four basic steps:

step 1, Gaussian filtering the input image to remove noise;

step 2, calculating the gradient magnitude and direction at each pixel;

step 3, applying non-maximum suppression to the gradient magnitude along the gradient direction;

step 4, processing and connecting edges with double thresholds.
In the invention, before Canny edge detection, noise is first removed from the original image, because noise appears where the gray value changes violently and is easily identified as a false edge. Filtering removes this noise: a Gaussian smoothing filter is used for convolution noise reduction, i.e., the original data are convolved with the Gaussian filter to smooth the image. Expressed mathematically, the two-dimensional Gaussian function is:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))  (4)

where (x, y) are the coordinates of a pixel of the original image h(x, y) and σ is the standard deviation of the Gaussian function; convolving the Gaussian function with the original image h(x, y) yields H(x, y):

H(x, y) = G(x, y) × h(x, y)  (5)

After noise filtering, the gradient magnitude and direction of H(x, y) are calculated to estimate the edge strength and direction at each point; the gradient is computed with first-order finite differences, and the matrices of first-order partial derivatives in the x and y directions are P(i, j) and Q(i, j):

P(i, j) = [H(i, j+1) − H(i, j) + H(i+1, j+1) − H(i+1, j)] / 2
Q(i, j) = [H(i, j) − H(i+1, j) + H(i, j+1) − H(i+1, j+1)] / 2  (6)

The gradient magnitude M(i, j) and gradient direction θ(i, j) are calculated by formula (7):

M(i, j) = √(P(i, j)² + Q(i, j)²),  θ(i, j) = arctan(Q(i, j) / P(i, j))  (7)

The gradient angle θ ranges over radians −π to π and is approximated to four directions representing the horizontal, vertical and two diagonal directions (0°, 45°, 90° and 135°). Non-maximum suppression is applied to the gradient magnitude along the gradient direction to find the local maxima of pixel points: at each point, the central pixel of the neighborhood is compared with the two pixels along its gradient direction; if the central pixel is the maximum it is kept as a local maximum, otherwise it is set to 0. In this way non-maxima are suppressed, the points with locally maximal gradient are retained, and a refined edge is obtained. The Canny edge map extracted by the present application for facial expression edge detection is shown in fig. 5.
2.5 feature fusion network
The fusion of features at different scales is an important means of improving detection and segmentation performance; by the order of fusion and prediction it is divided into early fusion and late fusion. By image representation level, image fusion can be divided into three levels: pixel-level fusion, feature-level fusion and decision-level fusion. After experimentally comparing the effects of several fusion modes, a method combining feature-level and pixel-level fusion that fuses early, before the predictor is trained, was finally selected, as shown in fig. 1. Let the original image pixel feature vector be v1 ∈ R^n, the LBP-extracted texture feature vector v2 ∈ R^m, and the Canny edge feature vector v3 ∈ R^k; splicing them gives the fused feature vector (a code sketch follows the list of advantages below):

v = [v1, v2, v3] ∈ R^(n+m+k)  (8)
the image fusion method has the following advantages:
(1) the image is enhanced, improving its resolution and definition; (2) relevant image features are enhanced; (3) related information complements each other while noise and redundancy are removed; (4) the target detection and recognition capability is improved; (5) complete three-dimensional reconstruction data are obtained.
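A minimal sketch of this early, channel-level fusion of Eq. (8), assuming the face, LBP and Canny maps have already been resized to 48×48:

```python
import numpy as np

def fuse_channels(gray_face, lbp_map, canny_map):
    # Stack the three single-channel maps of Eq. (8) into one
    # three-channel network input of shape (48, 48, 3)
    return np.dstack([gray_face, lbp_map, canny_map]).astype(np.float32)
```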
3. Building neural network model
3.1 lightweight convolutional neural network:

In deep learning, large networks often suffer from storage problems caused by huge numbers of weight parameters and from slow task processing caused by heavy computation. This application therefore addresses these problems with a lightweight model. The design idea of the lightweight model is a more efficient network convolution mode that reduces network parameters without losing network performance; it has the characteristics of local connection, weight sharing and hierarchical representation, as shown in fig. 6.
In image processing, convolution is the process of a convolution kernel sliding over an image and computing the sum of products of corresponding elements. The convolution layer is the most important part of a convolutional network; its main function is to extract image features by enhancing original signal features and reducing noise. The convolution operation is computed as:

x_j^L = g( Σ_i x_i^{L−1} * θ_ij + b_j )  (9)

where x_j^L is the j-th feature map unit of layer L, x_i^{L−1} is the i-th input of layer L−1, θ_ij represents the convolution kernel, b_j is the bias unit, and g(x) is the activation function.
Pooling, also called down-sampling, has no parameters to learn; its main function is to shrink the feature map and reduce dimensionality. A common method is to take the maximum or the average of a local area. This application uses max pooling on the spatial data.
Typical networks use a fully connected layer, but fully connected layers have many parameters and overfit easily. This application therefore replaces the fully connected layer with global average pooling, which not only reduces the number of parameters and prevents overfitting, but also sums over the spatial information and is therefore stable to spatial transformations of the input.
3.2 depth separable convolution:

In order to reduce the number of network parameters and improve model performance, the idea of depth separable convolution can be used to handle channel correlation and spatial correlation separately. The core idea of depth separable convolution is to convolve different input channels with different convolution kernels, decomposing the standard convolution operation into two steps: depthwise convolution and pointwise convolution. Here the depth separable convolution first performs the pointwise convolution and then the depthwise convolution; to ensure the data are not corrupted, no ReLU nonlinearity is inserted between the two convolutions, as shown in fig. 7. Compared with standard convolution, the adopted depth separable convolution achieves the same effect as a standard convolution layer at smaller space and time cost, greatly reducing computation while preserving the accuracy of the neural network.
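A hedged Keras sketch of one separable-convolution block as used in this application follows. Note that Keras's SeparableConv2D applies the depthwise step before the pointwise step, whereas the text above describes the reversed, Xception-style ordering; the block below is therefore an approximation, with Batch Normalization and ReLU6 attached as described in section 3.5.

```python
from tensorflow.keras import layers

def sep_conv_block(x, filters):
    # 3x3 depthwise convolution followed by 1x1 pointwise convolution,
    # then Batch Normalization and the ReLU6 activation of Eq. (10)
    x = layers.SeparableConv2D(filters, (3, 3), padding="same",
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)
```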
3.3 activation function:

In deep neural networks, linear models lack expressive power and cannot solve linearly inseparable problems. Introducing an activation function into the hidden layers adds a nonlinear factor, improving the expressive power and nonlinear modeling capability of the neural network and solving problems a linear model cannot. This application uses ReLU6, introduced in MobileNetV2, as the nonlinear activation function. ReLU6 limits the maximum output of the ReLU function to 6; it is more robust and retains good numerical resolution at low precision. The ReLU6 function is defined as follows:
ReLU6=min{max(0,x),6} (10)
where x is the output feature of the previous network layer.
The ReLU6 function has the following advantages:
(1) it alleviates the vanishing gradient and exploding gradient problems during backpropagation in deep neural networks;
(2) the calculation amount is small, and the calculation speed is high;
(3) the convergence speed of gradient descent is accelerated;
(4) and the sparse expression capability of the neural network is improved.
3.4 residual error network
This aims to solve the problems that deepening the network lowers learning efficiency and no longer improves accuracy effectively. This application adopts the residual connection mechanism of ResNet50 to construct the network model and replaces its convolution layers with the depth separable convolution proposed above. In a residual network, the data output of an earlier layer skips several layers and is introduced directly into the input of a later layer; this makes it possible to train deeper neural networks and mitigates the performance degradation and vanishing gradients caused by increased depth. Introducing residuals not only markedly speeds up network convergence but also achieves higher accuracy in deep neural networks. The network structures of residual module one and residual module two are shown in fig. 8 and fig. 9, respectively.
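Since figs. 8 and 9 are not reproduced here, the sketch below reconstructs residual module one from the textual description alone (two separable convolutions on the main branch, 3×3 max pooling, and a 1×1 convolution shortcut); the strides and the internal differences of residual module two are assumptions. It reuses sep_conv_block from the previous sketch.

```python
from tensorflow.keras import layers

def residual_module_one(x, filters):
    # Shortcut branch: strided 1x1 convolution so shapes match for addition
    shortcut = layers.Conv2D(filters, (1, 1), strides=2, padding="same",
                             use_bias=False)(x)
    shortcut = layers.BatchNormalization()(shortcut)
    # Main branch: two separable convolutions, then 3x3 max pooling
    y = sep_conv_block(x, filters)
    y = sep_conv_block(y, filters)
    y = layers.MaxPooling2D((3, 3), strides=2, padding="same")(y)
    return layers.Add()([y, shortcut])
```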
3.5 network architecture
This application provides a facial expression recognition method based on multi-channel fusion and a lightweight neural network: a 48×48-pixel three-channel image produced by multi-channel fusion is input into the constructed lightweight neural network, which is designed on an Xception-based structure and improves facial expression recognition accuracy mainly through layer-by-layer processing of the channel-fused three-channel features. The proposed network structure is shown in fig. 10. Firstly, convolution operations are performed sequentially through two 2D convolution layers with 3×3 kernels, 32 and 64 kernels respectively, and stride 1. The output then passes sequentially through residual module one, residual module two and residual module one, with 128, 256 and 512 kernels respectively; the depth separable convolutions use 3×3 kernels, the max pooling uses a 3×3 window, and the 2D convolution layers use 1×1 kernels. The output is then fed into two depth separable convolution layers with 3×3 kernels, 1024 and 512 kernels respectively, and stride 1. Finally, the output is fed into a global average pooling layer and a Softmax classifier; all 2D convolution and depth separable convolution operations pass through a Batch Normalization layer and a ReLU6 activation layer to accelerate network convergence and increase the capability of extracting nonlinear features.
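The whole pipeline described above can be assembled as in the sketch below, building on the two previous sketches. This is an approximation under stated assumptions: residual module two is stood in for by the same structure as module one, since its internal differences are only shown in fig. 9.

```python
from tensorflow.keras import layers, Model

def build_model(num_classes=7):
    inputs = layers.Input(shape=(48, 48, 3))        # channel-fused input
    x = inputs
    for n in (32, 64):                              # two 3x3 2D convolutions
        x = layers.Conv2D(n, (3, 3), strides=1, padding="same",
                          use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU(max_value=6.0)(x)
    for n in (128, 256, 512):                       # residual modules I, II, I
        x = residual_module_one(x, n)               # module II approximated
    x = sep_conv_block(x, 1024)                     # two separable convolutions
    x = sep_conv_block(x, 512)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)
```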
4. Results and analysis of the experiments
4.1 Experimental platform and data set:
the experimental software platform is a Python3.7 version under Linux, and adopts a Keras framework with TensorFlow as a rear end. The hardware platform is Dell Poweredge R940xa, and the GPU is 16GB NVIDIA Tesla T4. We optimize the training process using a batch size of 64, an initial learning rate of 0.001, and an Adam optimizer.
Facial expression recognition experiments are performed on the Fer2013 and CK+ data sets respectively. The data are divided randomly and objectively by cross-validation to reduce human factors: 80% of the expression images form the training set and 20% the test set, with the random seed set to 2019 to control the random state.
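The stated split and training configuration could be reproduced as below. The 80/20 split, seed 2019, batch size 64, learning rate 0.001 and Adam optimizer come from the text; fused_images, one_hot_labels, the augmentation settings and the epoch count are illustrative assumptions, and scikit-learn is used for the split as a convenience.

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 80/20 random split with the stated seed controlling the random state
x_train, x_test, y_train, y_test = train_test_split(
    fused_images, one_hot_labels, test_size=0.2, random_state=2019)

model = build_model()
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Batch-wise training with on-the-fly data enhancement; back propagation
# updates the weights each batch until the desired number of epochs
augmenter = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)
model.fit(augmenter.flow(x_train, y_train, batch_size=64),
          validation_data=(x_test, y_test), epochs=100)   # epochs assumed
```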
1) Fer2013 dataset
Fer2013 is the data set of the Kaggle facial expression analysis competition; it was created with the Google image search API, and OpenCV face detection was used to collect the face region in each image. Because the Fer2013 data are relatively complete and closer to real-life scenes, Fer2013 is mainly selected for training and testing the model. The data set contains 35887 pictures covering 7 expressions: 4953 angry, 547 disgust, 5121 fear, 8989 happy, 6077 sad, 4002 surprise and 6198 neutral. Fig. 11 shows part of the Fer2013 expression data set, which includes facial expressions of different ages, genders and skin colors with different degrees of occlusion.
2) CK + data set
The CK+ data set comprises 593 expression sequences from 123 subjects; each picture sequence shows the transition from a neutral facial expression to the peak expression, covering the 6 basic expressions plus contempt and neutral. Because the Fer2013 data set contains no contempt expression, 7 expressions are finally selected: angry, neutral, disgust, fear, happy, sad and surprise. Fig. 12 shows examples from the CK+ expression data set, which includes facial expressions of different ages, genders and skin colors with different degrees of occlusion.
4.2 analysis of the results:

The experimental results were obtained after repeatedly optimizing the training model; figs. 13 and 14 show the confusion matrices on the Fer2013 and CK+ data sets. The Fer2013 confusion matrix shows that happy and surprise have the highest recognition rates while fear and sad are relatively low, because Fer2013 contains interference such as non-face pictures and various occlusions and is closer to real-life scenes, which increases the difficulty of feature extraction. The CK+ confusion matrix shows good recognition results for happy, angry and surprise, while the recognition of fear and sad is not good enough; a possible reason is that the fear and sad expression features are somewhat similar, interfering with the distinction of different expressions and lowering the expression recognition accuracy.
4.2.1 Domain selection
Selecting an appropriate neighborhood has a significant impact on the final performance of LBP-based techniques; it involves the number of sampling points, the neighborhood radius, the distribution of the sampling points and the neighborhood shape. To find the values of P and R that yield the highest accuracy in the rotation-invariant LBP mode, this application performed comparison experiments with different values of P and R; the results are shown in Table 1.
TABLE 1 Experiment accuracy on the Fer2013 and CK+ data sets

| Input image type | Fer2013 | CK+ |
| --- | --- | --- |
| Raw data set image | 66.42% | 92.35% |
| LBP(8,1) | 68.51% | 94.56% |
| LBP(8,3) | 70.54% | 97.62% |
| LBP(8,5) | 68.98% | 96.83% |
| LBP(24,1) | 65.62% | 86.61% |
Column 1 of Table 1 lists the input image types; columns 2 and 3 give the accuracy of identifying images on the Fer2013 and CK+ data sets with the present network model. The experimental data show that the rotation-invariant LBP(8, 3) feature map extracted from the data set images, input after channel fusion into the constructed lightweight neural network, achieves the highest recognition rate.
4.2.2 comparative testing of different methods
To verify the effectiveness of this method for facial expression recognition, the experiment is compared with the Xception, CNN and FER-Net expression recognition algorithms on the Fer2013 data set, and with the Xception, InceptionV4 and LBP algorithms on the CK+ data set; the comparison results are shown in Tables 2 and 3. The recognition accuracy of the model is 97.62% on the CK+ data set; human recognition accuracy on the Fer2013 data set is 65% ± 5%, and the model reaches 70.54% on Fer2013, achieving human-level recognition. Compared with traditional facial expression recognition algorithms, deep learning achieves higher accuracy, and compared with deep learning alone, combining deep learning with traditional methods achieves higher accuracy still; the recognition rate of this method exceeds the other methods on both data sets, verifying that the model has feasibility and generalization capability in expression recognition.
Table 2 identification results of different methods on Fer2013 data set
TABLE 3 recognition results of different methods on CK + data sets
In summary, aiming at the problems that in traditional facial expression recognition learning methods the feature extraction process is complex and deeper high-level semantic features and depth features cannot be obtained from the original image, this application proposes a facial expression recognition method based on multi-channel fusion and a lightweight neural network. Experiments show that the proposed model can effectively extract facial expression features and classify expressions, with high accuracy and robustness. Although the proposed model achieves a good recognition effect, the generalization capability of the network model still needs strengthening to further improve the facial expression recognition rate.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. The facial expression recognition method based on multi-channel fusion and a lightweight neural network, characterized by comprising the following steps:
s1, acquiring image data through an expression database or a camera, and performing face region detection on the facial expression database images by using a cascade classifier based on Haar features to acquire a face image;
s2, extracting local texture features of the face region by adopting a local binary pattern, and detecting the edge of the face region based on a Canny edge detection algorithm;
s3, constructing and initializing a lightweight neural network;
s4, performing channel fusion on the obtained face image, the LBP texture feature image and the edge detection Canny image, performing data normalization and data enhancement on the fused image, and inputting the image into the constructed lightweight neural network for training and recognition.
2. The method for recognizing facial expressions based on multi-channel fusion and a lightweight neural network as claimed in claim 1, wherein in step S4, when training the constructed lightweight neural network, the model is trained in batches with the data-enhanced training set: data are generated batch by batch, back propagation is performed, the weights in the model are updated, and the process is repeated until the desired number of epochs is reached.
3. The method for recognizing facial expressions based on multi-channel fusion and lightweight neural network as claimed in claim 1, wherein in step S1, after the face image is obtained, image normalization is performed to adjust the feature value scales of different dimensions to be within a similar range.
4. The method for recognizing facial expressions based on multi-channel fusion and a lightweight neural network as claimed in claim 1, wherein in step S2, the process of extracting the local texture features of the face region using the local binary pattern is as follows:

given a pixel (x_c, y_c), with P sampling points on a circle of radius R, the obtained LBP value can be expressed in decimal as:

LBP_{P,R}(x_c, y_c) = Σ_{p=0}^{P−1} s(i_p − i_c) · 2^p  (1)

where p indexes the P sampling points in the circular region, i_c represents the gray value of the central pixel of the circular neighborhood, and i_p represents the gray value of the p-th of the P surrounding pixels; the function s(x) is defined as:

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0  (2)

The original LBP value, treated as a binary code, is subjected to cyclic shift operations, and the smallest of all results is taken:

LBP^{ri}_{P,R} = min{ ROR(LBP_{P,R}, i) | i = 0, 1, …, P−1 }  (3)

where ROR(x, i) performs a cyclic right bit shift of the P-bit number x by i positions.
5. The method for recognizing facial expressions based on multi-channel fusion and a lightweight neural network according to claim 1, wherein the process of detecting the edge of the face region based on the Canny edge detection algorithm in step S2 is as follows:

s21, a Gaussian smoothing filter is used for convolution noise reduction, i.e., the original data are convolved with the Gaussian filter to smooth the image; expressed mathematically, the two-dimensional Gaussian function is:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))  (4)

where (x, y) are the coordinates of a pixel of the original image h(x, y) and σ is the standard deviation of the Gaussian function; convolving the Gaussian function with the original image h(x, y) yields H(x, y):

H(x, y) = G(x, y) × h(x, y)  (5)

s22, after noise filtering, the gradient magnitude and direction of H(x, y) are calculated to estimate the edge strength and direction at each point; the gradient is computed with first-order finite differences, and the matrices of first-order partial derivatives in the x and y directions are P(i, j) and Q(i, j):

P(i, j) = [H(i, j+1) − H(i, j) + H(i+1, j+1) − H(i+1, j)] / 2
Q(i, j) = [H(i, j) − H(i+1, j) + H(i, j+1) − H(i+1, j+1)] / 2  (6)

The gradient magnitude M(i, j) and gradient direction θ(i, j) are calculated as:

M(i, j) = √(P(i, j)² + Q(i, j)²),  θ(i, j) = arctan(Q(i, j) / P(i, j))  (7)

The gradient angle θ ranges over radians −π to π and is approximated to four directions representing the horizontal, vertical and two diagonal directions (0°, 45°, 90° and 135°). Non-maximum suppression is applied to the gradient magnitude along the gradient direction to find the local maxima of pixel points: at each point, the central pixel of the neighborhood is compared with the two pixels along its gradient direction; if the central pixel is the maximum it is kept as a local maximum, otherwise it is set to 0. In this way non-maxima are suppressed, the points with locally maximal gradient are retained, and a refined edge is obtained.
6. The method for recognizing facial expressions based on multi-channel fusion and lightweight neural network as claimed in claim 1, wherein in step S4, the image obtained after channel fusion is a three-channel image with 48 x 48 pixels.
7. The method for recognizing facial expressions based on multi-channel fusion and a lightweight neural network as claimed in claim 6, wherein in step S4, when the image input into the lightweight neural network is recognized:

Firstly, convolution operations are performed sequentially through two 2D convolution layers with 3×3 kernels, 32 and 64 kernels respectively, and stride 1. The output then passes sequentially through residual module one, residual module two and residual module one, with 128, 256 and 512 kernels respectively; the depth separable convolutions use 3×3 kernels, the max pooling uses a 3×3 window, and the 2D convolution layers use 1×1 kernels. The output is then fed into two depth separable convolution layers with 3×3 kernels, 1024 and 512 kernels respectively, and stride 1. Finally, the output is fed into a global average pooling layer and a Softmax classifier; all 2D convolution and depth separable convolution operations pass through a Batch Normalization layer and a ReLU6 activation layer to accelerate network convergence and increase the capability of extracting nonlinear features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111273460.8A CN113989890A (en) | 2021-10-29 | 2021-10-29 | Face expression recognition method based on multi-channel fusion and lightweight neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111273460.8A CN113989890A (en) | 2021-10-29 | 2021-10-29 | Face expression recognition method based on multi-channel fusion and lightweight neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113989890A true CN113989890A (en) | 2022-01-28 |
Family
ID=79744529
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111273460.8A Withdrawn CN113989890A (en) | 2021-10-29 | 2021-10-29 | Face expression recognition method based on multi-channel fusion and lightweight neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113989890A (en) |
-
2021
- 2021-10-29 CN CN202111273460.8A patent/CN113989890A/en not_active Withdrawn
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114663452A (en) * | 2022-02-28 | 2022-06-24 | 南京工业大学 | Airport visibility classification method based on MobileNet-V2 neural network |
CN114639149A (en) * | 2022-03-18 | 2022-06-17 | 杭州慧田科技有限公司 | Sick bed terminal with emotion recognition function |
CN114699080A (en) * | 2022-04-28 | 2022-07-05 | 电子科技大学 | Driver mental stress degree identification method based on fusion characteristics |
CN114699080B (en) * | 2022-04-28 | 2023-04-25 | 电子科技大学 | Driver mental stress degree identification method based on fusion characteristics |
CN114998958A (en) * | 2022-05-11 | 2022-09-02 | 华南理工大学 | Face recognition method based on lightweight convolutional neural network |
CN114998958B (en) * | 2022-05-11 | 2024-04-16 | 华南理工大学 | Face recognition method based on lightweight convolutional neural network |
CN115019363A (en) * | 2022-05-19 | 2022-09-06 | 重庆邮电大学 | Lightweight facial expression recognition method based on mid-Xconvergence network |
CN115348709A (en) * | 2022-10-18 | 2022-11-15 | 良业科技集团股份有限公司 | Smart cloud service lighting display method and system suitable for text travel |
CN116403270B (en) * | 2023-06-07 | 2023-09-05 | 南昌航空大学 | Facial expression recognition method and system based on multi-feature fusion |
CN116403270A (en) * | 2023-06-07 | 2023-07-07 | 南昌航空大学 | Facial expression recognition method and system based on multi-feature fusion |
CN116958783A (en) * | 2023-07-24 | 2023-10-27 | 中国矿业大学 | Light-weight image recognition method based on depth residual two-dimensional random configuration network |
CN116958783B (en) * | 2023-07-24 | 2024-02-27 | 中国矿业大学 | Light-weight image recognition method based on depth residual two-dimensional random configuration network |
CN116863323A (en) * | 2023-09-04 | 2023-10-10 | 济宁鑫惠生水产养殖专业合作社 | Visual detection method and system for pollution of water source for fishery culture |
CN116863323B (en) * | 2023-09-04 | 2023-11-24 | 济宁鑫惠生水产养殖专业合作社 | Visual detection method and system for pollution of water source for fishery culture |
CN118230395A (en) * | 2024-05-13 | 2024-06-21 | 广东电网有限责任公司 | Human face recognition method and device based on INSIGHTFACE and LIS file management |
CN118379776A (en) * | 2024-06-21 | 2024-07-23 | 浙江核新同花顺网络信息股份有限公司 | Face attribute identification method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113989890A (en) | Face expression recognition method based on multi-channel fusion and lightweight neural network | |
Pitaloka et al. | Enhancing CNN with preprocessing stage in automatic emotion recognition | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
Zhang et al. | Multimodal learning for facial expression recognition | |
CN105139004B (en) | Facial expression recognizing method based on video sequence | |
CN112288011B (en) | Image matching method based on self-attention deep neural network | |
CN108921019B (en) | Gait recognition method based on GEI and TripletLoss-DenseNet | |
CN114758383A (en) | Expression recognition method based on attention modulation context spatial information | |
US20210264144A1 (en) | Human pose analysis system and method | |
Tian et al. | Ear recognition based on deep convolutional network | |
CN108960059A (en) | A kind of video actions recognition methods and device | |
CN110222718B (en) | Image processing method and device | |
Kas et al. | New framework for person-independent facial expression recognition combining textural and shape analysis through new feature extraction approach | |
CN111353385A (en) | Pedestrian re-identification method and device based on mask alignment and attention mechanism | |
Prabhu et al. | Facial Expression Recognition Using Enhanced Convolution Neural Network with Attention Mechanism. | |
CN114049531A (en) | Pedestrian re-identification method based on weak supervision human body collaborative segmentation | |
Mohammed et al. | Deep convolution neural network for facial expression recognition | |
CN111860056B (en) | Blink-based living body detection method, blink-based living body detection device, readable storage medium and blink-based living body detection equipment | |
Jabbooree et al. | A novel facial expression recognition algorithm using geometry β–skeleton in fusion based on deep CNN | |
CN116884067B (en) | Micro-expression recognition method based on improved implicit semantic data enhancement | |
Raj et al. | Object detection in live streaming video using deep learning approach | |
Chun-man et al. | Face expression recognition based on improved MobileNeXt | |
Vepuri | Improving facial emotion recognition with image processing and deep learning | |
CN117523626A (en) | Pseudo RGB-D face recognition method | |
Patil et al. | Gender recognition and age approximation using deep learning techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20220128 |
WW01 | Invention patent application withdrawn after publication |