CN114298233A - Expression recognition method based on efficient attention network and teacher-student iterative transfer learning - Google Patents


Info

Publication number
CN114298233A
CN114298233A (application CN202111655846.5A)
Authority
CN
China
Prior art keywords
network
student
teacher
expression
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111655846.5A
Other languages
Chinese (zh)
Inventor
孔英会
张帅桐
张珂
戚银城
车辚辚
赵振兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN202111655846.5A
Publication of CN114298233A
Legal status: Pending

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

An expression recognition method based on an efficient attention network and teacher-student iterative transfer learning: a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained on an expression data set. The trained network then serves as a teacher network, another efficient attention network serves as a student network, and the student network is trained with the softened prediction values output by the teacher network. The model parameters learned by the trained and tested student network are transferred to the teacher network, and iterative transfer training is repeated until the recognition accuracy of the student network no longer improves; the student network is finally used to recognize facial expressions. While keeping the model's parameter count and computational cost low, the method strengthens the fitting capability of the lightweight network, refines soft labels and feature information through teacher-student iterative transfer learning, substantially improves recognition accuracy, and can meet the deployment requirements of expression recognition on resource-constrained edge devices.

Description

Expression recognition method based on efficient attention network and teacher-student iterative transfer learning
Technical Field
The invention relates to an expression recognition method based on a high-efficiency attention network and teacher-student iterative transfer learning, and belongs to the technical field of information processing.
Background
With the advent of the artificial intelligence era, intelligent devices have penetrated many aspects of human life, and human-computer interaction technology is particularly important as the bridge between people and devices. Facial expressions, a non-verbal human signal shared across race and culture, carry rich information about mental activity. Automatic facial expression recognition therefore has great application and research value in fields such as criminal investigation and interrogation, fatigue-driving detection, and patient emotion monitoring.
In 1978, the psychologist Paul Ekman defined human expressions as seven basic categories: anger, disgust, fear, happiness, sadness, surprise, and neutral. Traditional expression recognition methods rely on hand-crafted features (local binary patterns, histograms of oriented gradients, principal component analysis, etc.); they execute efficiently but cannot fully adapt to face data from diverse scenes. In recent years, deep learning has shown great advantages in end-to-end learning and high-accuracy recognition for image classification, and more and more researchers model facial expressions with deep learning methods to achieve automatic expression recognition.
Liu et al. (Liu K, Zhang M, Pan Z. Facial expression recognition with CNN ensemble [C]// 2016 International Conference on Cyberworlds (CW). IEEE, 2016: 163-166) train several convolutional neural networks with different structures and ensemble their expression recognition results, achieving higher recognition accuracy.
Cai et al. (Cai J, Meng Z, Khan A S, et al. Probabilistic attribute tree in convolutional neural networks for facial expression recognition [J]. arXiv preprint arXiv:1812.07067, 2018) propose learning features through a hierarchical tree structure: final features are learned within the tree, and the features of different tree nodes are combined by probabilistic weighting. This improves facial expression recognition accuracy, but the model design is complex and the computational load is high.
Fan et al. (Fan Y, Li V, Lam J C K. Facial expression recognition with deeply-supervised attention network [J]. IEEE Transactions on Affective Computing, 2020) construct a deeply supervised attention network on a complex VGG/ResNet backbone and design a two-stage training scheme that integrates the relationships between attributes such as race, gender, and age and facial expressions, finally combining multi-scale information for prediction to reach first-class recognition accuracy. However, the method retains a classical, complex network architecture, so the model parameters and computation remain large.
Another work (a real-time expression recognition framework for complex environments based on face segmentation [J]. Computer Engineering and Applications, 2020, 56(12): 134-140) adds the idea of face segmentation to the image preprocessing step, designs a recognition framework cascading a segmentation network and a classification network, and trims model parameters and computation by carefully tuning the hyper-parameters of the convolution modules. However, the overly lightweight model structure reduces the network's fitting capability: the framework's real-time inference is guaranteed, but recognition accuracy is not.
In summary, high-accuracy deep learning models are bulky and difficult to deploy directly on resource-constrained edge devices (such as mobile and embedded terminals), requiring a high-performance server for centralized data processing. Lightweight networks meet edge-side deployment requirements but are difficult to train and achieve low recognition accuracy. Moreover, as everything becomes ever more interconnected and facial expressions change from moment to moment, recognizing the expression in every image frame of massive data streams in real time incurs high transmission costs and privacy-leakage risks. It is therefore important to find a real-time facial expression recognition method that meets edge-side deployment requirements.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention aims to provide an expression recognition method based on an efficient attention network and teacher-student iterative transfer learning, so as to meet the deployment requirements of expression recognition on resource-constrained edge devices.
The problems of the invention are solved by the following technical scheme:
An expression recognition method based on an efficient attention network and teacher-student iterative transfer learning: a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained on a preprocessed, data-augmented expression data set; the trained efficient attention network then serves as a teacher network, another efficient attention network serves as a student network, and the student network is trained with the softened prediction values output by the teacher network; the model parameters learned by the trained and tested student network are transferred to the teacher network, and iterative transfer training is repeated until the recognition accuracy of the student network no longer improves; finally, the student network is used to recognize facial expressions.
In the above expression recognition method, the expression data set is preprocessed as follows: the images are scaled to a fixed size to unify their resolution, the pixel values are normalized, and, if an original image is grayscale, it is copied into three identical channels to form a three-channel tensor.
In the above expression recognition method, the expression data set is augmented as follows: the image tensor is window-sampled with 90% of the original area at the upper-left, upper-right, center, lower-left, and lower-right positions and horizontally flipped, yielding the data-augmented expression image data set.
In the above expression recognition method, the lightweight expression recognition model based on the efficient attention network is constructed as follows: first, a local channel attention mechanism is introduced into the MobileNetV2 basic convolution module to construct an efficient attention inverted residual block, and these blocks are stacked to form the main body of the efficient attention network; a two-dimensional convolution layer at the head of the network then converts image features from the spatial domain to the channel domain; finally, a two-dimensional convolution layer at the tail replaces the fully connected layer for classification, forming a lightweight fully convolutional network.
In the above expression recognition method, the MobileNetV2 basic convolution module first expands the image channel dimension with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped (depthwise) convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of all channels point by point.
In the above expression recognition method, the local channel attention mechanism first integrates the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector carrying the spatial information, then learns each channel's weight from the information of adjacent channels with a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation function, and finally multiplies the weights with the original tensor t to obtain the rescaled features, where H is the height of the feature map, W its width, and C its number of channels.
In the above expression recognition method, the efficient attention network is trained on the preprocessed, data-augmented expression data set as follows: on the data-augmented expression image data set, a stochastic gradient descent optimizer trains and optimizes the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{y_i^{\theta_{teacher}}}}{\sum_{j=1}^{n} e^{y_j^{\theta_{teacher}}}}$$

where $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the trained teacher network parameter set $\theta_{teacher}$, $label_i$ is the label value of the $i$th expression of the image, and $n$ is the number of expression categories.
In the above expression recognition method, the student network is trained with the softened prediction values output by the teacher network as follows: the softened teacher network output serves as soft labels expressing the similarity relations between expressions, and the KL divergence loss against the soft labels and the Softmax loss against the data-set labels jointly train and optimize the student network model parameters:

$$l = \alpha \cdot T^2 \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where

$$l_{KL} = \sum_{i=1}^{n} \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \cdot \log\!\left( \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \bigg/ \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}} \right)$$

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

and $y_i$ and $z_i$ denote the predicted output values of the teacher and student networks respectively, while $\alpha$ and $T$ are the soft-label proportion and distillation-temperature hyper-parameters.
In the above expression recognition method, the repeated iterative transfer training proceeds as follows:
a. input the data-augmented test-set image data into the optimized student network model and record the recognition accuracy;
b. transfer the model parameters learned by the student network to the teacher network, so that the teacher network parameter set $\theta_{teacher}$ equals the student network parameter set $\theta_{student}$;
c. fix the teacher network parameter set $\theta_{teacher}$ and adjust the learning rate to re-optimize the student network parameter set $\theta_{student}$;
d. repeat steps a to c several times, iteratively training the teacher and student network models and migrating parameters, until the recognition accuracy of the student network no longer improves.
In the above expression recognition method, the student network and the teacher network have the same structure.
Advantageous effects
Compared with existing expression recognition methods, the invention has the following advantages:
1. A lightweight local attention mechanism is introduced into the inverted residual convolution block to construct an efficient attention inverted residual block, and the efficient attention network is built on this improved structure. While keeping the model's parameter count and computation low, a small number of extra parameters strengthen the network's fitting capability and markedly improve recognition accuracy;
2. Based on the knowledge distillation framework, the teacher model assists the training of the student network, supplementing the face data with soft-label similarity relations between expressions and further reducing the training difficulty of the lightweight network. In addition, because the teacher and student share the same network structure, feature differences between the two networks are avoided; iterative training refines and optimizes the soft-label information, parameter migration strengthens the iterative feature transfer between the teacher and student networks, and recognition accuracy improves markedly without introducing extra network parameters or computation;
3. While keeping the model's parameter count and computation low, the method strengthens the lightweight network's fitting capability, refines soft labels and feature information through teacher-student iterative transfer learning, greatly improves recognition accuracy, and can meet the deployment requirements of expression recognition on resource-constrained edge devices.
Drawings
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a frame diagram of a teacher-student iterative migration method;
FIG. 2 is a schematic diagram of a data enhancement process;
FIG. 3 is a diagram of an efficient attention network architecture;
FIG. 4 is a training flow diagram of the teacher network learning phase;
FIG. 5 is a training flow diagram of the teacher and student iterative transfer learning phase;
FIG. 6 is a confusion matrix on FER2013 test set according to the present invention;
FIG. 7 is a confusion matrix on the RAF-DB test set of the present invention.
The symbols used herein are as follows: H, W, and C denote the height, width, and number of channels of the feature map; t denotes the input image tensor; $\theta_{teacher}$ and $\theta_{student}$ denote the teacher and student network parameter sets; $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the trained teacher network parameter set $\theta_{teacher}$; $label_i$ denotes the label value of the $i$th expression of the image; $y_i$ and $z_i$ denote the predicted output values of the teacher and student networks respectively; and $\alpha$ and $T$ are the soft-label proportion and distillation-temperature hyper-parameters.
Detailed Description
In view of the deficiencies of the prior art, the invention provides an expression recognition method based on an efficient attention network and teacher-student iterative transfer learning. The method adopts depthwise separable convolutions in a lightweight structural design to ensure real-time performance, introduces a local attention mechanism that strengthens the network's fitting capability with only a small number of extra parameters, and finally applies teacher-student iterative transfer learning to markedly improve the lightweight network's recognition capability without introducing additional parameters, thereby meeting the real-time deployment requirements of resource-constrained edge devices while achieving high recognition accuracy.
The method comprises the following steps:
S1: acquire an expression data set; preprocess it, apply data augmentation, and scale the images to a fixed size;
S2: construct a lightweight expression recognition model based on the efficient attention network, and train the efficient attention network on the preprocessed, data-augmented expression data set;
S3: input the test-set image data into the trained network to obtain recognition results, take the trained network as the teacher network, and use its softened prediction outputs to assist the training of another efficient attention network serving as the student network;
S4: test the trained student network's recognition accuracy, transfer the model parameters learned by the student network to the teacher network, and repeat multiple rounds of iterative transfer training until the student network's recognition accuracy no longer improves;
S5: recognize facial expressions with the student network.
Step S1 specifically includes:
Step S11: acquire the expression image data set, unify the image resolution, and normalize the pixel values; if an original image is grayscale, copy it into three identical channels to form a three-channel tensor;
Step S12: window-sample the image tensor with 90% of the original area at the upper-left, upper-right, center, lower-left, and lower-right positions and flip horizontally, obtaining the data-augmented expression image data set.
Step S2 specifically includes:
Step S21: introduce a local channel attention mechanism into the MobileNetV2 basic convolution module to construct an efficient attention inverted residual block, stack these blocks to form the main body of the efficient attention network, convert image features from the spatial domain to the channel domain with a two-dimensional convolution layer at the network head, and finally replace the fully connected layer with a two-dimensional convolution layer at the tail for classification, forming a lightweight fully convolutional network;
the MobileNetV2 basic convolution module first expands the image channel dimension with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped (depthwise) convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of all channels point by point;
the local channel attention mechanism uses a global pooling layer to integrate the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector carrying the spatial information, then learns each channel's weight from the information of adjacent channels with a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation function, and finally multiplies the weights with the original tensor t to obtain the rescaled features;
Step S22: on the augmented expression image data set, train and optimize the output of the efficient attention network with a stochastic gradient descent optimizer according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{y_i^{\theta_{teacher}}}}{\sum_{j=1}^{n} e^{y_j^{\theta_{teacher}}}}$$

where $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the trained teacher network parameter set $\theta_{teacher}$, $label_i$ is the label value of the $i$th expression of the image, and $n$ is the number of expression categories.
Step S3 specifically includes:
Step S31: input the augmented test-set image data into the optimized efficient attention model and record the recognition accuracy;
Step S32: construct the teacher-student knowledge distillation framework, take the trained efficient attention model as the teacher network, and use the softened teacher network output as soft labels expressing the similarity relations between expressions;
Step S33: select a new efficient attention network as the student network, and use the KL divergence loss against the soft labels together with the Softmax loss against the data-set labels to train and optimize the model parameters:

$$l = \alpha \cdot T^2 \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where $l_{KL}$ and $l_{Softmax}$ are:

$$l_{KL} = \sum_{i=1}^{n} \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \cdot \log\!\left( \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \bigg/ \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}} \right)$$

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

with $y_i$ and $z_i$ denoting the predicted output values of the teacher and student networks respectively, $\alpha$ and $T$ the soft-label proportion and distillation-temperature hyper-parameters, and $n$ the number of expression categories.
Step S4 specifically includes:
Step S41: input the augmented test-set image data into the optimized student model and record the recognition accuracy;
Step S42: transfer the model parameters learned by the student network to the teacher network, so that $\theta_{teacher}$ equals $\theta_{student}$;
Step S43: fix the teacher network parameter set $\theta_{teacher}$ and adjust the learning rate to re-optimize the student network parameter set $\theta_{student}$;
Step S44: repeat steps S41 to S43 several times, iteratively training the teacher-student models and migrating parameters, until the recognition accuracy of the student network no longer improves.
One specific example is given below:
step S1:
an expression image dataset is acquired. FER2013 and RAF-DB data sets are mainly used in the training. The FER2013 data set is a 48 × 48 grayscale image and comprises 28,708 training sets, 3589 public testing sets and 3589 private testing sets of facial pictures. The RAF-DB data set is a 100 x 100 color image and comprises 12271 training sets and 3068 test sets of facial pictures. Both data sets contained seven expression categories (anger, disgust, fear, happiness, sadness and surprise), trained and tested respectively. For unified data input format, it is necessary to scale the image of RAF-DB to 48 × 48 size and copy the image of FER2013 to 3 channels.
To avoid overfitting, the data need to be augmented. As shown in FIG. 2, the invention normalizes the pixel values of the image tensor and window-samples it with 90% of the original area at the upper-left, upper-right, center, lower-left, and lower-right positions, plus horizontal flipping, obtaining the data-augmented expression image data set; a minimal code sketch of these steps follows.
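The sketch below assumes PyTorch/torchvision; the function names are illustrative, and since the text does not specify whether "90% area" refers to the crop's area or its side length, the side-length interpretation is an assumption here.

```python
# Hedged sketch of the preprocessing and five-window augmentation above.
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def preprocess(img: Image.Image) -> torch.Tensor:
    img = TF.resize(img, [48, 48])        # unify resolution to 48 x 48
    t = TF.to_tensor(img)                 # normalize pixel values to [0, 1]
    if t.shape[0] == 1:                   # grayscale: copy into 3 channels
        t = t.repeat(3, 1, 1)
    return t

def augment_views(t: torch.Tensor, frac: float = 0.9):
    """Windows at the four corners and the center plus their horizontal
    flips -- the ten views described in the text."""
    _, h, w = t.shape
    # torchvision's ten_crop returns exactly these five crops and flips
    return TF.ten_crop(t, [int(h * frac), int(w * frac)])
```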
Step S2 specifically includes:
As shown in FIG. 3, a local channel attention mechanism is introduced into the MobileNetV2 basic convolution module (inverted residual block, IRB) to construct an Efficient Attention Inverted Residual Block (EAIRB); these blocks are stacked to form the main body of the Efficient Attention Network (EAN). A two-dimensional convolution layer at the network head converts image features from the spatial domain to the channel domain, and a two-dimensional convolution layer at the tail replaces the fully connected layer for classification, forming a lightweight fully convolutional network. The final EAN contains only 43,840 parameters, occupies 4.07 MB of runtime memory, and a single inference with the onnx-cpu runtime takes about 1.985 ms; the layer configuration is given in Table 1.
Table 1. Efficient attention network layer configuration (reproduced as an image in the original document)
The MobileNetV2 basic convolution module (IRB) first expands the image channel dimension with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped (depthwise) convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of all channels point by point. The local channel attention mechanism (blue part in FIG. 3) first uses a global pooling layer to integrate the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector carrying the spatial information, then learns each channel's weight from the information of adjacent channels with a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation, multiplies the weights with the original tensor t to obtain the rescaled features, and finally adds the result to the original features through a shortcut connection (Shortcut), as sketched below.
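A minimal PyTorch sketch of the EAIRB just described; the framework choice, the expansion ratio, and the exact placement of the shortcut (the text is ambiguous between a shortcut inside the attention step and one around the whole block) are illustrative assumptions, not the patent's Table 1 configuration.

```python
import torch
import torch.nn as nn

class LocalChannelAttention(nn.Module):
    """Global pooling -> 1-D conv (k=3) over channels -> Sigmoid -> rescale."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        w = self.pool(t)                          # (N, C, 1, 1): 1x1xC vector
        w = w.squeeze(-1).transpose(1, 2)         # (N, 1, C): channels as sequence
        w = self.conv(w)                          # mix adjacent-channel information
        w = self.gate(w).transpose(1, 2).unsqueeze(-1)  # back to (N, C, 1, 1)
        return t * w                              # rescale the original tensor

class EAIRB(nn.Module):
    """Inverted residual block (expand 1x1 -> depthwise 3x3 -> project 1x1)
    with local channel attention and a shortcut when shapes match."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1, expand: int = 6):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),            # raise channel dim
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride, 1,
                      groups=c_mid, bias=False),              # depthwise spatial conv
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),           # pointwise projection
            nn.BatchNorm2d(c_out),
            LocalChannelAttention(),
        )
        self.use_shortcut = stride == 1 and c_in == c_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.block(x)
        return x + y if self.use_shortcut else y
```

Stacking such blocks between a head convolution and a tail 1 × 1 classification convolution would yield the fully convolutional EAN described above.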
The teacher network training process is shown in FIG. 4. On the augmented expression image data set, a stochastic gradient descent optimizer trains and optimizes the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{y_i^{\theta_{teacher}}}}{\sum_{j=1}^{n} e^{y_j^{\theta_{teacher}}}}$$

where $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the model's trained parameter set $\theta_{teacher}$, $label_i$ is the label value of the $i$th expression of the image, and $n$ is the number of expression categories. The initial learning rate is 0.01, the batch size is 64, and the learning rate is reduced to 0.9 times its value whenever the loss does not decrease for two epochs.
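A hedged sketch of this training stage, assuming a PyTorch model and data loader exist; the Softmax loss above is exactly the standard cross-entropy on raw network outputs, and the plateau scheduler mirrors the 0.9× decay after two epochs without improvement.

```python
import torch
import torch.nn as nn

def train_teacher(model, loader, epochs: int = 100, device: str = "cpu"):
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()                 # the l_Softmax above
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.9, patience=2)
    for _ in range(epochs):
        total = 0.0
        for images, labels in loader:                 # batch size 64 assumed
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        scheduler.step(total)    # decay lr by 0.9x on a two-epoch plateau
    return model
```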
Step S3 specifically includes:
The augmented test-set image data are input into the optimized efficient attention model and the recognition accuracy is recorded. The trained efficient attention model serves as the teacher network, and the softened teacher network output is used as soft labels expressing the similarity relations between expressions. Another efficient attention network is selected as the student network, forming the teacher-student knowledge distillation structure shown in FIG. 5. The KL divergence loss against the teacher's soft labels and the Softmax loss against the data-set labels jointly train and optimize the student network:

$$l = \alpha \cdot T^2 \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where $l_{KL}$ and $l_{Softmax}$ are:

$$l_{KL} = \sum_{i=1}^{n} \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \cdot \log\!\left( \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \bigg/ \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}} \right)$$

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

with $y_i$ and $z_i$ denoting the predicted output values of the teacher and student networks respectively, $\alpha$ and $T$ the soft-label proportion and distillation-temperature hyper-parameters, and $n$ the number of expression categories. The initial learning rate of the joint training is 0.01, the batch size is 64, and the learning rate is reduced to 0.9 times its value whenever the loss does not decrease for two epochs.
Step S4 specifically includes:
The augmented test-set image data are input into the trained student network model and the recognition accuracy is recorded. At the start of each iteration round, the learned student network parameter set $\theta_{student}$ is migrated to the teacher network; the teacher network parameter set $\theta_{teacher}$ is then fixed, and the student network parameter set $\theta_{student}$ is re-optimized with the learning rate adjusted to 0.1. The hyper-parameters $\alpha$ and $T$ regulate the contribution of the soft-label component during iteration; the resulting improvements are shown in Table 2, and experiments show the best results on the FER2013 and RAF-DB data sets with $\alpha = 0.5$ and $T = 5$.
Table 2. Comparison of iterative accuracy improvements under different hyper-parameters (reproduced as an image in the original document)
Step S44: Steps S41 to S43 are repeated several times, iteratively training the teacher-student models and migrating parameters, until the recognition accuracy of the student network no longer improves; a sketch of this loop is given below.
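A hedged sketch of the teacher-student iterative transfer loop: copy the student's weights into the same-architecture teacher, freeze the teacher, re-train the student against the refreshed soft labels, and stop when test accuracy no longer rises. `train_student` and `evaluate` are caller-supplied helpers (assumptions, not from the patent); the former would optimize the distillation loss sketched earlier.

```python
import copy

def iterative_transfer(teacher, student, train_student, evaluate,
                       max_rounds: int = 10):
    best_acc = evaluate(student)                       # assumed helper
    best_state = copy.deepcopy(student.state_dict())
    for _ in range(max_rounds):
        teacher.load_state_dict(student.state_dict())  # migrate parameters
        for p in teacher.parameters():                 # fix teacher parameters
            p.requires_grad = False
        train_student(student, teacher)                # re-optimize the student
        acc = evaluate(student)
        if acc <= best_acc:                            # accuracy stopped rising
            break
        best_acc = acc
        best_state = copy.deepcopy(student.state_dict())
    student.load_state_dict(best_state)
    return student
```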
The student network model from the iteration with the highest accuracy can predict independently on resource-constrained edge devices. Its recognition confusion matrices on the FER2013 and RAF-DB data sets are shown in FIGS. 6 and 7; recognition accuracies above 80% are reached even for the highly similar "happy" and "surprised" expressions. The overall accuracies of the model are 70.63% and 85.30% respectively: considerable recognition accuracy is achieved while preserving real-time deployability on constrained devices, striking a balance between model complexity and accuracy.
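For edge-side deployment, the text reports inference through the onnx-cpu runtime (~1.985 ms per inference). A hedged sketch of that export-and-run path follows; the file name and the 1 × 3 × 48 × 48 input shape (per the 48 × 48 three-channel preprocessing) are illustrative assumptions.

```python
import numpy as np
import torch
import onnxruntime as ort

def export_and_run(student, path: str = "ean_student.onnx"):
    student.eval()
    dummy = torch.randn(1, 3, 48, 48)            # one 48x48 three-channel face
    torch.onnx.export(student, dummy, path,
                      input_names=["face"], output_names=["logits"])
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    face = np.random.rand(1, 3, 48, 48).astype(np.float32)
    out = session.run(None, {"face": face})
    return out[0]                                # expression logits
```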

Claims (10)

1. An expression recognition method based on an efficient attention network and teacher-student iterative transfer learning, characterized in that: a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained on a preprocessed, data-augmented expression data set; the trained efficient attention network then serves as a teacher network, another efficient attention network serves as a student network, and the student network is trained with the softened prediction values output by the teacher network; the model parameters learned by the trained and tested student network are transferred to the teacher network, and iterative transfer training is repeated until the recognition accuracy of the student network no longer improves; finally, the student network is used to recognize facial expressions.
2. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 1, characterized in that the expression data set is preprocessed as follows:
the images are scaled to a fixed size to unify their resolution, the pixel values are normalized, and, if an original image is grayscale, it is copied into three identical channels to form a three-channel tensor.
3. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 1 or 2, characterized in that the expression data set is augmented as follows:
the image tensor is window-sampled with 90% of the original area at the upper-left, upper-right, center, lower-left, and lower-right positions and horizontally flipped, yielding the data-augmented expression image data set.
4. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 3, characterized in that the lightweight expression recognition model based on the efficient attention network is constructed as follows:
first, a local channel attention mechanism is introduced into the MobileNetV2 basic convolution module to construct an efficient attention inverted residual block, and these blocks are stacked to form the main body of the efficient attention network; a two-dimensional convolution layer at the head of the network then converts image features from the spatial domain to the channel domain; finally, a two-dimensional convolution layer at the tail replaces the fully connected layer for classification, forming a lightweight fully convolutional network.
5. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 4, characterized in that the MobileNetV2 basic convolution module first expands the image channel dimension with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped (depthwise) convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of all channels point by point.
6. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 5, characterized in that the local channel attention mechanism uses a global pooling layer to integrate the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector carrying the spatial information, learns each channel's weight from the information of adjacent channels with a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation function, and multiplies the weights with the original tensor t to obtain the rescaled features, where H is the height of the feature map, W its width, and C its number of channels.
7. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 6, characterized in that the efficient attention network is trained on the preprocessed, data-augmented expression data set as follows:
on the data-augmented expression image data set, a stochastic gradient descent optimizer trains and optimizes the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{y_i^{\theta_{teacher}}}}{\sum_{j=1}^{n} e^{y_j^{\theta_{teacher}}}}$$

where $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the trained teacher network parameter set $\theta_{teacher}$, $label_i$ is the label value of the $i$th expression of the image, and $n$ is the number of expression categories.
8. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 7, characterized in that the student network is trained with the softened prediction values output by the teacher network as follows:
the softened teacher network output serves as soft labels expressing the similarity relations between expressions, and the KL divergence loss against the soft labels and the Softmax loss against the data-set labels jointly train and optimize the student network model parameters:

$$l = \alpha \cdot T^2 \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where:

$$l_{KL} = \sum_{i=1}^{n} \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \cdot \log\!\left( \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \bigg/ \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}} \right)$$

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

with $y_i$ and $z_i$ denoting the predicted output values of the teacher and student networks respectively, and $\alpha$ and $T$ the soft-label proportion and distillation-temperature hyper-parameters.
9. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 8, characterized in that the repeated iterative transfer training proceeds as follows:
a. input the data-augmented test-set image data into the optimized student network model and record the recognition accuracy;
b. transfer the model parameters learned by the student network to the teacher network, so that the teacher network parameter set $\theta_{teacher}$ equals the student network parameter set $\theta_{student}$;
c. fix the teacher network parameter set $\theta_{teacher}$ and adjust the learning rate to re-optimize the student network parameter set $\theta_{student}$;
d. repeat steps a to c several times, iteratively training the teacher and student network models and migrating parameters, until the recognition accuracy of the student network no longer improves.
10. The method of claim 9, wherein the student network is the same as the teacher network in structure.
CN202111655846.5A 2021-12-30 2021-12-30 Expression recognition method based on efficient attention network and teacher-student iterative transfer learning Pending CN114298233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111655846.5A CN114298233A (en) 2021-12-30 2021-12-30 Expression recognition method based on efficient attention network and teacher-student iterative transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111655846.5A CN114298233A (en) 2021-12-30 2021-12-30 Expression recognition method based on efficient attention network and teacher-student iterative transfer learning

Publications (1)

Publication Number Publication Date
CN114298233A (en) 2022-04-08

Family

ID=80973289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111655846.5A Pending CN114298233A (en) 2021-12-30 2021-12-30 Expression recognition method based on efficient attention network and teacher-student iterative transfer learning

Country Status (1)

Country Link
CN (1) CN114298233A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645507A (en) * 2023-05-18 2023-08-25 丽水瑞联医疗科技有限公司 Placenta image processing method and system based on semantic segmentation
CN117829683A (en) * 2024-03-04 2024-04-05 国网山东省电力公司信息通信公司 Electric power Internet of things data quality analysis method and system based on graph comparison learning


Similar Documents

Publication Publication Date Title
Zhang et al. Lightweight deep network for traffic sign classification
CN110334705B (en) Language identification method of scene text image combining global and local information
WO2021190451A1 (en) Method and apparatus for training image processing model
Lei et al. Shallow convolutional neural network for image classification
Islalm et al. Recognition bangla sign language using convolutional neural network
CN109829541A (en) Deep neural network incremental training method and system based on learning automaton
CN109993100B (en) Method for realizing facial expression recognition based on deep feature clustering
CN110516095A (en) Weakly supervised depth Hash social activity image search method and system based on semanteme migration
CN114298233A (en) Expression recognition method based on efficient attention network and teacher-student iterative transfer learning
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
CN115331285A (en) Dynamic expression recognition method and system based on multi-scale feature knowledge distillation
Wu et al. Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation
Ye et al. A joint-training two-stage method for remote sensing image captioning
CN114612761A (en) Network architecture searching method for image recognition
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
CN113920363B (en) Cultural relic classification method based on lightweight deep learning network
CN115062727A (en) Graph node classification method and system based on multi-order hypergraph convolutional network
CN114463340A (en) Edge information guided agile remote sensing image semantic segmentation method
Aakanksha et al. A systematic and bibliometric review on face recognition: Convolutional neural network
Liu et al. Learning a similarity metric discriminatively with application to ancient character recognition
CN116881416A (en) Instance-level cross-modal retrieval method for relational reasoning and cross-modal independent matching network
Hao et al. Architecture self-attention mechanism: Nonlinear optimization for neural architecture search
CN116311455A (en) Expression recognition method based on improved Mobile-former
CN115965819A (en) Lightweight pest identification method based on Transformer structure
CN115809314A (en) Multitask NL2SQL method based on double-layer multi-gated expert Mixed Model (MMOE)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination