CN114298233A - Expression recognition method based on efficient attention network and teacher-student iterative transfer learning - Google Patents


Info

Publication number
CN114298233A
CN114298233A (application CN202111655846.5A)
Authority
CN
China
Prior art keywords
network
student
teacher
expression
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111655846.5A
Other languages
Chinese (zh)
Inventor
孔英会
张帅桐
张珂
戚银城
车辚辚
赵振兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN202111655846.5A
Publication of CN114298233A
Legal status: Pending

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

An expression recognition method based on an efficient attention network and teacher-student iterative transfer learning: a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained on an expression data set. The trained network then serves as a teacher network, another efficient attention network serves as a student network, and the student network is trained with the softened prediction values output by the teacher network. The model parameters learned by the trained and tested student network are transferred to the teacher network, and iterative transfer training is repeated until the recognition accuracy of the student network no longer improves; the student network is finally used to recognize facial expressions. While keeping the model's parameter count and computational cost low, the method strengthens the fitting capability of the lightweight network, refines soft labels and feature information through teacher-student iterative transfer learning, substantially improves recognition accuracy, and can meet the deployment requirements of expression recognition on resource-constrained edge devices.

Description

Expression recognition method based on efficient attention network and teacher-student iterative transfer learning
Technical Field
The invention relates to an expression recognition method based on a high-efficiency attention network and teacher-student iterative transfer learning, and belongs to the technical field of information processing.
Background
With the advent of the artificial intelligence era, intelligent devices have penetrated many aspects of human life, and human-computer interaction technology is particularly important as the bridge between people and devices. Facial expressions, a non-verbal human signal shared across race and culture, carry rich information about mental activity. Automatic facial expression recognition therefore has great application and research value in fields such as criminal investigation and interrogation, fatigue-driving detection, and patient emotion monitoring.
In 1978, the psychologist Paul Ekman defined human expressions as seven basic categories: anger, disgust, fear, happiness, sadness, surprise, and neutral. Traditional expression recognition methods rely on hand-crafted features (local binary patterns, histograms of oriented gradients, principal component analysis, etc.); they execute efficiently but cannot fully adapt to face data from diverse scenes. In recent years, deep learning has shown great advantages in end-to-end learning and high-accuracy recognition for image classification, and more and more researchers model facial expressions with deep learning methods to achieve automatic expression recognition.
Liu et al. (Liu K, Zhang M, Pan Z. Facial expression recognition with CNN ensemble [C]// 2016 International Conference on Cyberworlds (CW). IEEE, 2016: 163-166) train several convolutional neural networks with different structures and ensemble their expression recognition results, achieving higher recognition accuracy.
Cai et al. (Cai J, Meng Z, Khan A S, et al. Probabilistic attribute tree in convolutional neural networks for facial expression recognition [J]. arXiv preprint arXiv:1812.07067, 2018) propose learning features through a hierarchical tree structure: final features are learned within the tree, and the features of different tree nodes are combined by probabilistic weighting. This improves facial expression recognition accuracy, but the model design is complex and the computational load is high.
Fan et al. (Fan Y, Li V, Lam J C K. Facial expression recognition with deeply-supervised attention network [J]. IEEE Transactions on Affective Computing, 2020) construct a deeply supervised attention network on a complex VGG/ResNet backbone and design a two-stage training scheme that integrates the relationships between attributes such as race, gender, and age and facial expressions, finally combining multi-scale information for prediction to reach first-class recognition accuracy. However, the method retains a classical, complex network architecture, so the model parameters and computation remain large.
Another work (a real-time expression recognition framework for complex environments based on face segmentation [J]. Computer Engineering and Applications, 2020, 56(12): 134-140) adds the idea of face segmentation to the image preprocessing step, designs a recognition framework cascading a segmentation network and a classification network, and trims model parameters and computation by carefully tuning the hyper-parameters of the convolution modules. However, the overly lightweight model structure reduces the network's fitting capability: the framework's real-time inference is guaranteed, but recognition accuracy is not.
In summary, high-accuracy deep learning models are bulky and difficult to deploy directly on resource-constrained edge devices (such as mobile and embedded terminals), requiring a high-performance server for centralized data processing. Lightweight networks meet edge-side deployment requirements but are difficult to train and achieve low recognition accuracy. Moreover, as everything becomes ever more interconnected and facial expressions change from moment to moment, recognizing the expression in every image frame of massive data streams in real time incurs high transmission costs and privacy-leakage risks. It is therefore important to find a real-time facial expression recognition method that meets edge-side deployment requirements.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention aims to provide an expression recognition method based on an efficient attention network and teacher-student iterative transfer learning, so as to meet the deployment requirements of expression recognition on resource-constrained edge devices.
The problems of the invention are solved by the following technical scheme:
An expression recognition method based on an efficient attention network and teacher-student iterative transfer learning: a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained on a preprocessed, data-augmented expression data set; the trained efficient attention network then serves as a teacher network, another efficient attention network serves as a student network, and the student network is trained with the softened prediction values output by the teacher network; the model parameters learned by the trained and tested student network are transferred to the teacher network, and iterative transfer training is repeated until the recognition accuracy of the student network no longer improves; finally, the student network is used to recognize facial expressions.
In the above expression recognition method, the expression data set is preprocessed as follows: the images are scaled to a fixed size to unify their resolution, the pixel values are normalized, and, if an original image is grayscale, it is copied into three identical channels to form a three-channel tensor.
In the above expression recognition method, the expression data set is augmented as follows: the image tensor is window-sampled with 90% of the original area at the upper-left, upper-right, center, lower-left, and lower-right positions and horizontally flipped, yielding the data-augmented expression image data set.
In the above expression recognition method, the lightweight expression recognition model based on the efficient attention network is constructed as follows: first, a local channel attention mechanism is introduced into the MobileNetV2 basic convolution module to construct an efficient attention inverted residual block, and these blocks are stacked to form the main body of the efficient attention network; a two-dimensional convolution layer at the head of the network then converts image features from the spatial domain to the channel domain; finally, a two-dimensional convolution layer at the tail replaces the fully connected layer for classification, forming a lightweight fully convolutional network.
In the above expression recognition method, the MobileNetV2 basic convolution module first expands the image channel dimension with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped (depthwise) convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of all channels point by point.
In the above expression recognition method, the local channel attention mechanism first integrates the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector carrying the spatial information, then learns each channel's weight from the information of adjacent channels with a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation function, and finally multiplies the weights with the original tensor t to obtain the rescaled features, where H is the height of the feature map, W its width, and C its number of channels.
In the above expression recognition method, the efficient attention network is trained on the preprocessed, data-augmented expression data set as follows: on the data-augmented expression image data set, a stochastic gradient descent optimizer trains and optimizes the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{y_i^{\theta_{teacher}}}}{\sum_{j=1}^{n} e^{y_j^{\theta_{teacher}}}}$$

where $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the trained teacher network parameter set $\theta_{teacher}$, $label_i$ is the label value of the $i$th expression of the image, and $n$ is the number of expression categories.
In the above expression recognition method, the student network is trained with the softened prediction values output by the teacher network as follows: the softened teacher network output serves as soft labels expressing the similarity relations between expressions, and the KL divergence loss against the soft labels and the Softmax loss against the data-set labels jointly train and optimize the student network model parameters:

$$l = \alpha \cdot T^2 \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where

$$l_{KL} = \sum_{i=1}^{n} \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \cdot \log\!\left( \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \bigg/ \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}} \right)$$

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

and $y_i$ and $z_i$ denote the predicted output values of the teacher and student networks respectively, while $\alpha$ and $T$ are the soft-label proportion and distillation-temperature hyper-parameters.
In the above expression recognition method, the repeated iterative transfer training proceeds as follows:
a. input the data-augmented test-set image data into the optimized student network model and record the recognition accuracy;
b. transfer the model parameters learned by the student network to the teacher network, so that the teacher network parameter set $\theta_{teacher}$ equals the student network parameter set $\theta_{student}$;
c. fix the teacher network parameter set $\theta_{teacher}$ and adjust the learning rate to re-optimize the student network parameter set $\theta_{student}$;
d. repeat steps a to c several times, iteratively training the teacher and student network models and migrating parameters, until the recognition accuracy of the student network no longer improves.
In the above expression recognition method, the student network and the teacher network have the same structure.
Advantageous effects
Compared with existing expression recognition methods, the invention has the following advantages:
1. A lightweight local attention mechanism is introduced into the inverted residual convolution block to construct an efficient attention inverted residual block, and the efficient attention network is built on this improved structure. While keeping the model's parameter count and computation low, a small number of extra parameters strengthen the network's fitting capability and markedly improve recognition accuracy;
2. Based on the knowledge distillation framework, the teacher model assists the training of the student network, supplementing the face data with soft-label similarity relations between expressions and further reducing the training difficulty of the lightweight network. In addition, because the teacher and student share the same network structure, feature differences between the two networks are avoided; iterative training refines and optimizes the soft-label information, parameter migration strengthens the iterative feature transfer between the teacher and student networks, and recognition accuracy improves markedly without introducing extra network parameters or computation;
3. While keeping the model's parameter count and computation low, the method strengthens the lightweight network's fitting capability, refines soft labels and feature information through teacher-student iterative transfer learning, greatly improves recognition accuracy, and can meet the deployment requirements of expression recognition on resource-constrained edge devices.
Drawings
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a frame diagram of a teacher-student iterative migration method;
FIG. 2 is a schematic diagram of a data enhancement process;
FIG. 3 is a diagram of an efficient attention network architecture;
FIG. 4 is a training flow diagram of the teacher network learning phase;
FIG. 5 is a training flow diagram of the teacher and student iterative transfer learning phase;
FIG. 6 is a confusion matrix on FER2013 test set according to the present invention;
FIG. 7 is a confusion matrix on the RAF-DB test set of the present invention.
The symbols used herein are as follows: H, W, and C denote the height, width, and number of channels of the feature map; t denotes the input image tensor; $\theta_{teacher}$ and $\theta_{student}$ denote the teacher and student network parameter sets; $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the trained teacher network parameter set $\theta_{teacher}$; $label_i$ denotes the label value of the $i$th expression of the image; $y_i$ and $z_i$ denote the predicted output values of the teacher and student networks respectively; and $\alpha$ and $T$ are the soft-label proportion and distillation-temperature hyper-parameters.
Detailed Description
In view of the deficiencies of the prior art, the invention provides an expression recognition method based on an efficient attention network and teacher-student iterative transfer learning. The method adopts depthwise separable convolutions in a lightweight structural design to ensure real-time performance, introduces a local attention mechanism that strengthens the network's fitting capability with only a small number of extra parameters, and finally applies teacher-student iterative transfer learning to markedly improve the lightweight network's recognition capability without introducing additional parameters, thereby meeting the real-time deployment requirements of resource-constrained edge devices while achieving high recognition accuracy.
The method comprises the following steps:
S1: acquire an expression data set; preprocess it, apply data augmentation, and scale the images to a fixed size;
S2: construct a lightweight expression recognition model based on the efficient attention network, and train the efficient attention network on the preprocessed, data-augmented expression data set;
S3: input the test-set image data into the trained network to obtain recognition results, take the trained network as the teacher network, and use its softened prediction outputs to assist the training of another efficient attention network serving as the student network;
S4: test the trained student network's recognition accuracy, transfer the model parameters learned by the student network to the teacher network, and repeat multiple rounds of iterative transfer training until the student network's recognition accuracy no longer improves;
S5: recognize facial expressions with the student network.
Step S1 specifically includes:
Step S11: acquire the expression image data set, unify the image resolution, and normalize the pixel values; if an original image is grayscale, copy it into three identical channels to form a three-channel tensor;
Step S12: window-sample the image tensor with 90% of the original area at the upper-left, upper-right, center, lower-left, and lower-right positions and flip horizontally, obtaining the data-augmented expression image data set.
Step S2 specifically includes:
Step S21: introduce a local channel attention mechanism into the MobileNetV2 basic convolution module to construct an efficient attention inverted residual block, stack these blocks to form the main body of the efficient attention network, convert image features from the spatial domain to the channel domain with a two-dimensional convolution layer at the network head, and finally replace the fully connected layer with a two-dimensional convolution layer at the tail for classification, forming a lightweight fully convolutional network;
the MobileNetV2 basic convolution module first expands the image channel dimension with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped (depthwise) convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of all channels point by point;
the local channel attention mechanism uses a global pooling layer to integrate the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector carrying the spatial information, then learns each channel's weight from the information of adjacent channels with a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation function, and finally multiplies the weights with the original tensor t to obtain the rescaled features;
Step S22: on the augmented expression image data set, train and optimize the output of the efficient attention network with a stochastic gradient descent optimizer according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{y_i^{\theta_{teacher}}}}{\sum_{j=1}^{n} e^{y_j^{\theta_{teacher}}}}$$

where $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the trained teacher network parameter set $\theta_{teacher}$, $label_i$ is the label value of the $i$th expression of the image, and $n$ is the number of expression categories.
Step S3 specifically includes:
Step S31: input the augmented test-set image data into the optimized efficient attention model and record the recognition accuracy;
Step S32: construct the teacher-student knowledge distillation framework, take the trained efficient attention model as the teacher network, and use the softened teacher network output as soft labels expressing the similarity relations between expressions;
Step S33: select a new efficient attention network as the student network, and use the KL divergence loss against the soft labels together with the Softmax loss against the data-set labels to train and optimize the model parameters:

$$l = \alpha \cdot T^2 \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where $l_{KL}$ and $l_{Softmax}$ are:

$$l_{KL} = \sum_{i=1}^{n} \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \cdot \log\!\left( \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \bigg/ \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}} \right)$$

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

with $y_i$ and $z_i$ denoting the predicted output values of the teacher and student networks respectively, $\alpha$ and $T$ the soft-label proportion and distillation-temperature hyper-parameters, and $n$ the number of expression categories.
Step S4 specifically includes:
Step S41: input the augmented test-set image data into the optimized student model and record the recognition accuracy;
Step S42: transfer the model parameters learned by the student network to the teacher network, so that $\theta_{teacher}$ equals $\theta_{student}$;
Step S43: fix the teacher network parameter set $\theta_{teacher}$ and adjust the learning rate to re-optimize the student network parameter set $\theta_{student}$;
Step S44: repeat steps S41 to S43 several times, iteratively training the teacher-student models and migrating parameters, until the recognition accuracy of the student network no longer improves.
One specific example is given below:
step S1:
an expression image dataset is acquired. FER2013 and RAF-DB data sets are mainly used in the training. The FER2013 data set is a 48 × 48 grayscale image and comprises 28,708 training sets, 3589 public testing sets and 3589 private testing sets of facial pictures. The RAF-DB data set is a 100 x 100 color image and comprises 12271 training sets and 3068 test sets of facial pictures. Both data sets contained seven expression categories (anger, disgust, fear, happiness, sadness and surprise), trained and tested respectively. For unified data input format, it is necessary to scale the image of RAF-DB to 48 × 48 size and copy the image of FER2013 to 3 channels.
To avoid overfitting, the data need to be augmented. As shown in FIG. 2, the invention normalizes the pixel values of the image tensor and window-samples it with 90% of the original area at the upper-left, upper-right, center, lower-left, and lower-right positions, plus horizontal flipping, obtaining the data-augmented expression image data set; a minimal code sketch of these steps follows.
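The sketch below assumes PyTorch/torchvision; the function names are illustrative, and since the text does not specify whether "90% area" refers to the crop's area or its side length, the side-length interpretation is an assumption here.

```python
# Hedged sketch of the preprocessing and five-window augmentation above.
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def preprocess(img: Image.Image) -> torch.Tensor:
    img = TF.resize(img, [48, 48])        # unify resolution to 48 x 48
    t = TF.to_tensor(img)                 # normalize pixel values to [0, 1]
    if t.shape[0] == 1:                   # grayscale: copy into 3 channels
        t = t.repeat(3, 1, 1)
    return t

def augment_views(t: torch.Tensor, frac: float = 0.9):
    """Windows at the four corners and the center plus their horizontal
    flips -- the ten views described in the text."""
    _, h, w = t.shape
    # torchvision's ten_crop returns exactly these five crops and flips
    return TF.ten_crop(t, [int(h * frac), int(w * frac)])
```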
Step S2 specifically includes:
As shown in FIG. 3, a local channel attention mechanism is introduced into the MobileNetV2 basic convolution module (inverted residual block, IRB) to construct an Efficient Attention Inverted Residual Block (EAIRB); these blocks are stacked to form the main body of the Efficient Attention Network (EAN). A two-dimensional convolution layer at the network head converts image features from the spatial domain to the channel domain, and a two-dimensional convolution layer at the tail replaces the fully connected layer for classification, forming a lightweight fully convolutional network. The final EAN contains only 43,840 parameters, occupies 4.07 MB of runtime memory, and a single inference with the onnx-cpu runtime takes about 1.985 ms; the layer configuration is given in Table 1.
Table 1. Efficient attention network layer configuration (reproduced as an image in the original document)
The MobileNetV2 basic convolution module (IRB) first expands the image channel dimension with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped (depthwise) convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of all channels point by point. The local channel attention mechanism (blue part in FIG. 3) first uses a global pooling layer to integrate the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector carrying the spatial information, then learns each channel's weight from the information of adjacent channels with a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation, multiplies the weights with the original tensor t to obtain the rescaled features, and finally adds the result to the original features through a shortcut connection (Shortcut), as sketched below.
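A minimal PyTorch sketch of the EAIRB just described; the framework choice, the expansion ratio, and the exact placement of the shortcut (the text is ambiguous between a shortcut inside the attention step and one around the whole block) are illustrative assumptions, not the patent's Table 1 configuration.

```python
import torch
import torch.nn as nn

class LocalChannelAttention(nn.Module):
    """Global pooling -> 1-D conv (k=3) over channels -> Sigmoid -> rescale."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        w = self.pool(t)                          # (N, C, 1, 1): 1x1xC vector
        w = w.squeeze(-1).transpose(1, 2)         # (N, 1, C): channels as sequence
        w = self.conv(w)                          # mix adjacent-channel information
        w = self.gate(w).transpose(1, 2).unsqueeze(-1)  # back to (N, C, 1, 1)
        return t * w                              # rescale the original tensor

class EAIRB(nn.Module):
    """Inverted residual block (expand 1x1 -> depthwise 3x3 -> project 1x1)
    with local channel attention and a shortcut when shapes match."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1, expand: int = 6):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),            # raise channel dim
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride, 1,
                      groups=c_mid, bias=False),              # depthwise spatial conv
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),           # pointwise projection
            nn.BatchNorm2d(c_out),
            LocalChannelAttention(),
        )
        self.use_shortcut = stride == 1 and c_in == c_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.block(x)
        return x + y if self.use_shortcut else y
```

Stacking such blocks between a head convolution and a tail 1 × 1 classification convolution would yield the fully convolutional EAN described above.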
The teacher network training process is shown in FIG. 4. On the augmented expression image data set, a stochastic gradient descent optimizer trains and optimizes the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{y_i^{\theta_{teacher}}}}{\sum_{j=1}^{n} e^{y_j^{\theta_{teacher}}}}$$

where $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the model's trained parameter set $\theta_{teacher}$, $label_i$ is the label value of the $i$th expression of the image, and $n$ is the number of expression categories. The initial learning rate is 0.01, the batch size is 64, and the learning rate is reduced to 0.9 times its value whenever the loss does not decrease for two epochs.
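A hedged sketch of this training stage, assuming a PyTorch model and data loader exist; the Softmax loss above is exactly the standard cross-entropy on raw network outputs, and the plateau scheduler mirrors the 0.9× decay after two epochs without improvement.

```python
import torch
import torch.nn as nn

def train_teacher(model, loader, epochs: int = 100, device: str = "cpu"):
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()                 # the l_Softmax above
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.9, patience=2)
    for _ in range(epochs):
        total = 0.0
        for images, labels in loader:                 # batch size 64 assumed
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        scheduler.step(total)    # decay lr by 0.9x on a two-epoch plateau
    return model
```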
Step S3 specifically includes:
The augmented test-set image data are input into the optimized efficient attention model and the recognition accuracy is recorded. The trained efficient attention model serves as the teacher network, and the softened teacher network output is used as soft labels expressing the similarity relations between expressions. Another efficient attention network is selected as the student network, forming the teacher-student knowledge distillation structure shown in FIG. 5. The KL divergence loss against the teacher's soft labels and the Softmax loss against the data-set labels jointly train and optimize the student network:

$$l = \alpha \cdot T^2 \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where $l_{KL}$ and $l_{Softmax}$ are:

$$l_{KL} = \sum_{i=1}^{n} \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \cdot \log\!\left( \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \bigg/ \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}} \right)$$

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

with $y_i$ and $z_i$ denoting the predicted output values of the teacher and student networks respectively, $\alpha$ and $T$ the soft-label proportion and distillation-temperature hyper-parameters, and $n$ the number of expression categories. The initial learning rate of the joint training is 0.01, the batch size is 64, and the learning rate is reduced to 0.9 times its value whenever the loss does not decrease for two epochs.
Step S4 specifically includes:
The augmented test-set image data are input into the trained student network model and the recognition accuracy is recorded. At the start of each iteration round, the learned student network parameter set $\theta_{student}$ is migrated to the teacher network; the teacher network parameter set $\theta_{teacher}$ is then fixed, and the student network parameter set $\theta_{student}$ is re-optimized with the learning rate adjusted to 0.1. The hyper-parameters $\alpha$ and $T$ regulate the contribution of the soft-label component during iteration; the resulting improvements are shown in Table 2, and experiments show the best results on the FER2013 and RAF-DB data sets with $\alpha = 0.5$ and $T = 5$.
Table 2. Comparison of iterative accuracy improvements under different hyper-parameters (reproduced as an image in the original document)
Step S44: Steps S41 to S43 are repeated several times, iteratively training the teacher-student models and migrating parameters, until the recognition accuracy of the student network no longer improves; a sketch of this loop is given below.
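A hedged sketch of the teacher-student iterative transfer loop: copy the student's weights into the same-architecture teacher, freeze the teacher, re-train the student against the refreshed soft labels, and stop when test accuracy no longer rises. `train_student` and `evaluate` are caller-supplied helpers (assumptions, not from the patent); the former would optimize the distillation loss sketched earlier.

```python
import copy

def iterative_transfer(teacher, student, train_student, evaluate,
                       max_rounds: int = 10):
    best_acc = evaluate(student)                       # assumed helper
    best_state = copy.deepcopy(student.state_dict())
    for _ in range(max_rounds):
        teacher.load_state_dict(student.state_dict())  # migrate parameters
        for p in teacher.parameters():                 # fix teacher parameters
            p.requires_grad = False
        train_student(student, teacher)                # re-optimize the student
        acc = evaluate(student)
        if acc <= best_acc:                            # accuracy stopped rising
            break
        best_acc = acc
        best_state = copy.deepcopy(student.state_dict())
    student.load_state_dict(best_state)
    return student
```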
The student network model from the iteration with the highest accuracy can predict independently on resource-constrained edge devices. Its recognition confusion matrices on the FER2013 and RAF-DB data sets are shown in FIGS. 6 and 7; recognition accuracies above 80% are reached even for the highly similar "happy" and "surprised" expressions. The overall accuracies of the model are 70.63% and 85.30% respectively: considerable recognition accuracy is achieved while preserving real-time deployability on constrained devices, striking a balance between model complexity and accuracy.
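For edge-side deployment, the text reports inference through the onnx-cpu runtime (~1.985 ms per inference). A hedged sketch of that export-and-run path follows; the file name and the 1 × 3 × 48 × 48 input shape (per the 48 × 48 three-channel preprocessing) are illustrative assumptions.

```python
import numpy as np
import torch
import onnxruntime as ort

def export_and_run(student, path: str = "ean_student.onnx"):
    student.eval()
    dummy = torch.randn(1, 3, 48, 48)            # one 48x48 three-channel face
    torch.onnx.export(student, dummy, path,
                      input_names=["face"], output_names=["logits"])
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    face = np.random.rand(1, 3, 48, 48).astype(np.float32)
    out = session.run(None, {"face": face})
    return out[0]                                # expression logits
```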

Claims (10)

1. An expression recognition method based on an efficient attention network and teacher-student iterative transfer learning, characterized in that: a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained on a preprocessed, data-augmented expression data set; the trained efficient attention network then serves as a teacher network, another efficient attention network serves as a student network, and the student network is trained with the softened prediction values output by the teacher network; the model parameters learned by the trained and tested student network are transferred to the teacher network, and iterative transfer training is repeated until the recognition accuracy of the student network no longer improves; finally, the student network is used to recognize facial expressions.
2. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 1, characterized in that the expression data set is preprocessed as follows:
the images are scaled to a fixed size to unify their resolution, the pixel values are normalized, and, if an original image is grayscale, it is copied into three identical channels to form a three-channel tensor.
3. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 1 or 2, characterized in that the expression data set is augmented as follows:
the image tensor is window-sampled with 90% of the original area at the upper-left, upper-right, center, lower-left, and lower-right positions and horizontally flipped, yielding the data-augmented expression image data set.
4. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 3, characterized in that the lightweight expression recognition model based on the efficient attention network is constructed as follows:
first, a local channel attention mechanism is introduced into the MobileNetV2 basic convolution module to construct an efficient attention inverted residual block, and these blocks are stacked to form the main body of the efficient attention network; a two-dimensional convolution layer at the head of the network then converts image features from the spatial domain to the channel domain; finally, a two-dimensional convolution layer at the tail replaces the fully connected layer for classification, forming a lightweight fully convolutional network.
5. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 4, characterized in that the MobileNetV2 basic convolution module first expands the image channel dimension with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped (depthwise) convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of all channels point by point.
6. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 5, characterized in that the local channel attention mechanism uses a global pooling layer to integrate the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector carrying the spatial information, learns each channel's weight from the information of adjacent channels with a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation function, and multiplies the weights with the original tensor t to obtain the rescaled features, where H is the height of the feature map, W its width, and C its number of channels.
7. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 6, characterized in that the efficient attention network is trained on the preprocessed, data-augmented expression data set as follows:
on the data-augmented expression image data set, a stochastic gradient descent optimizer trains and optimizes the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{y_i^{\theta_{teacher}}}}{\sum_{j=1}^{n} e^{y_j^{\theta_{teacher}}}}$$

where $y_i^{\theta_{teacher}}$ and $y_j^{\theta_{teacher}}$ denote the predicted output values for the $i$th and $j$th expressions of the input image tensor under the trained teacher network parameter set $\theta_{teacher}$, $label_i$ is the label value of the $i$th expression of the image, and $n$ is the number of expression categories.
8. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 7, characterized in that the student network is trained with the softened prediction values output by the teacher network as follows:
the softened teacher network output serves as soft labels expressing the similarity relations between expressions, and the KL divergence loss against the soft labels and the Softmax loss against the data-set labels jointly train and optimize the student network model parameters:

$$l = \alpha \cdot T^2 \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where:

$$l_{KL} = \sum_{i=1}^{n} \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \cdot \log\!\left( \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}} \bigg/ \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}} \right)$$

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

with $y_i$ and $z_i$ denoting the predicted output values of the teacher and student networks respectively, and $\alpha$ and $T$ the soft-label proportion and distillation-temperature hyper-parameters.
9. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 8, characterized in that the repeated iterative transfer training proceeds as follows:
a. input the data-augmented test-set image data into the optimized student network model and record the recognition accuracy;
b. transfer the model parameters learned by the student network to the teacher network, so that the teacher network parameter set $\theta_{teacher}$ equals the student network parameter set $\theta_{student}$;
c. fix the teacher network parameter set $\theta_{teacher}$ and adjust the learning rate to re-optimize the student network parameter set $\theta_{student}$;
d. repeat steps a to c several times, iteratively training the teacher and student network models and migrating parameters, until the recognition accuracy of the student network no longer improves.
10. The method of claim 9, wherein the student network is the same as the teacher network in structure.
CN202111655846.5A 2021-12-30 2021-12-30 Expression recognition method based on efficient attention network and teacher-student iterative transfer learning Pending CN114298233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111655846.5A CN114298233A (en) 2021-12-30 2021-12-30 Expression recognition method based on efficient attention network and teacher-student iterative transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111655846.5A CN114298233A (en) 2021-12-30 2021-12-30 Expression recognition method based on efficient attention network and teacher-student iterative transfer learning

Publications (1)

Publication Number Publication Date
CN114298233A (en) 2022-04-08

Family

ID=80973289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111655846.5A Pending CN114298233A (en) 2021-12-30 2021-12-30 Expression recognition method based on efficient attention network and teacher-student iterative transfer learning

Country Status (1)

Country Link
CN (1) CN114298233A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645507A (en) * 2023-05-18 2023-08-25 丽水瑞联医疗科技有限公司 Placenta image processing method and system based on semantic segmentation
CN117829683A (en) * 2024-03-04 2024-04-05 国网山东省电力公司信息通信公司 Electric power Internet of things data quality analysis method and system based on graph comparison learning


Similar Documents

Publication Publication Date Title
Zhang et al. Lightweight deep network for traffic sign classification
CN110334705B (en) Language identification method of scene text image combining global and local information
WO2021190451A1 (en) Method and apparatus for training image processing model
Lei et al. Shallow convolutional neural network for image classification
Islalm et al. Recognition bangla sign language using convolutional neural network
CN109829541A (en) Deep neural network incremental training method and system based on learning automaton
CN109993100B (en) Method for realizing facial expression recognition based on deep feature clustering
CN110516095A (en) Weakly supervised depth Hash social activity image search method and system based on semanteme migration
CN114298233A (en) Expression recognition method based on efficient attention network and teacher-student iterative transfer learning
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
CN115331285A (en) Dynamic expression recognition method and system based on multi-scale feature knowledge distillation
Wu et al. Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation
Ye et al. A joint-training two-stage method for remote sensing image captioning
CN114612761A (en) Network architecture searching method for image recognition
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
CN113920363B (en) Cultural relic classification method based on lightweight deep learning network
CN115062727A (en) Graph node classification method and system based on multi-order hypergraph convolutional network
CN114463340A (en) Edge information guided agile remote sensing image semantic segmentation method
Aakanksha et al. A systematic and bibliometric review on face recognition: Convolutional neural network
Liu et al. Learning a similarity metric discriminatively with application to ancient character recognition
CN116881416A (en) Instance-level cross-modal retrieval method for relational reasoning and cross-modal independent matching network
Hao et al. Architecture self-attention mechanism: Nonlinear optimization for neural architecture search
CN116311455A (en) Expression recognition method based on improved Mobile-former
CN115965819A (en) Lightweight pest identification method based on Transformer structure
CN115809314A (en) Multitask NL2SQL method based on double-layer multi-gated expert Mixed Model (MMOE)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination