CN114298233A - Expression recognition method based on efficient attention network and teacher-student iterative transfer learning - Google Patents
- Publication number
- CN114298233A (application CN202111655846.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- student
- teacher
- expression
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Processing (AREA)
- Image Analysis (AREA)
Abstract
An expression recognition method based on an efficient attention network and teacher-student iterative transfer learning: a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained on an expression data set; the trained network then serves as the teacher network, another efficient attention network serves as the student network, and the student network is trained with the softened prediction values output by the teacher network; the model parameters learned by the trained and tested student network are migrated to the teacher network, and this iterative transfer training is repeated until the recognition accuracy of the student network no longer rises, after which the student network is used to recognize facial expressions. While keeping the number of model parameters and the amount of computation low, the method strengthens the fitting capability of the lightweight network, optimizes soft labels and feature information through teacher-student iterative transfer learning, greatly improves recognition accuracy, and can meet the deployment requirements of expression recognition on resource-constrained edge devices.
Description
Technical Field
The invention relates to an expression recognition method based on a high-efficiency attention network and teacher-student iterative transfer learning, and belongs to the technical field of information processing.
Background
With the advent of the artificial intelligence era, intelligent devices have penetrated every aspect of human life, and human-computer interaction technology is particularly important as the bridge for communication between people and devices. Facial expressions, a non-verbal human signal that crosses race and culture, carry rich information about mental activity. Automatic facial expression recognition therefore has great application and research value in fields such as criminal investigation and interrogation, fatigue-driving detection, and patient emotion monitoring.
In 1978 the famous psychologist Paul Ekman defined human expressions as seven basic categories: anger, disgust, fear, happiness, sadness, surprise and neutrality. Traditional expression recognition methods rely on hand-crafted features (local binary patterns, histograms of oriented gradients, principal component analysis, etc.); they execute efficiently but cannot fully adapt to face data from diverse scenes. In recent years, deep learning has shown great advantages in end-to-end learning and high-accuracy recognition in the field of image classification, and more and more researchers model facial expressions with deep learning methods to achieve automatic expression recognition.
Liu et al. (Liu K, Zhang M, Pan Z. Facial expression recognition with CNN ensemble [C]//2016 International Conference on Cyberworlds (CW). IEEE, 2016: 163-166) train several convolutional neural networks with different structures and finally ensemble the expression recognition results, achieving higher recognition accuracy.
Cai et al. (Cai J, Meng Z, Khan A S, et al. Probabilistic attribute tree in convolutional neural networks for facial expression recognition [J]. arXiv preprint arXiv:1812.07067, 2018) propose learning features through a hierarchical tree structure: final features are learned within the tree, and the features of different tree nodes are combined by probability-map weighting, improving the accuracy of facial expression recognition; however, the model design is complex and the computation load is high.
Fan et al. (Fan Y, Lam J C K, Li V O K. Facial expression recognition with deeply-supervised attention network [J]. IEEE Transactions on Affective Computing, 2020) construct a deeply supervised attention network on a complex VGG/ResNet backbone, design a two-stage training scheme to integrate the relationships between race, gender, age, etc. and facial expressions, and finally combine multi-scale information for prediction, reaching first-class recognition accuracy. However, the method keeps a classical, complex network architecture, and the model parameters and computation remain large.
Another work (a real-time expression recognition framework for complex environments based on face segmentation [J]. Computer Engineering and Applications, 2020, 56(12): 134-140) adds the idea of face segmentation to the image preprocessing step, designs a recognition framework that cascades a segmentation network with a classification network, and slims the model parameters and computation by carefully tuning the hyper-parameters of the convolution modules. However, the overly light model structure reduces the network's fitting capability: the framework's real-time inference is guaranteed, but recognition accuracy is sacrificed.
In summary, high-accuracy deep learning models are bulky and difficult to deploy directly on resource-constrained edge devices (such as mobile and embedded terminals), so data must be processed centrally on high-performance servers. Lightweight networks, by contrast, satisfy edge-side deployment requirements but are difficult to train and deliver low recognition accuracy. Moreover, as the interconnection of everything deepens and facial expressions change from moment to moment, the expression in every frame of massive image streams must be recognized in real time, incurring high transmission costs and privacy-leakage risks. It is therefore important to find a real-time facial expression recognition method that satisfies edge-side deployment requirements.
Disclosure of Invention
The invention aims to provide an expression recognition method based on an efficient attention network and teacher-student iterative transfer learning aiming at the defects of the prior art so as to meet the deployment requirement of expression recognition on edge side resource-limited equipment.
The problems of the invention are solved by the following technical scheme:
an expression recognition method based on an efficient attention network and teacher-student iterative transfer learning is characterized in that a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained by utilizing a preprocessed and data-enhanced expression data set; then, the trained high-efficiency attention network is used as a teacher network;
according to the expression recognition method based on the efficient attention network and the teacher-student iterative transfer learning, the expression data set preprocessing method comprises the following steps:
the image is scaled to a fixed size to unify the image resolution, the pixel values are normalized, and, if the original image is grayscale, it is replicated into three copies to form a three-channel tensor.
The expression identification method based on the efficient attention network and the teacher-student iterative transfer learning comprises the following steps of:
window sampling covering 90% of the area, together with horizontal flipping, is performed on the image tensor at the top-left, top-right, center, bottom-left and bottom-right positions to obtain the data-enhanced expression image data set.
According to the expression recognition method based on the efficient attention network and teacher-student iterative transfer learning, the lightweight expression recognition model based on the efficient attention network is constructed as follows:
firstly, a local channel attention mechanism is introduced into the MobileNetV2 basic convolution module to construct an efficient attention inverted residual block; these blocks are stacked to form the main body of the efficient attention network; a two-dimensional convolution layer at the head of the network then converts image features from the spatial domain to the channel domain; and finally a two-dimensional convolution layer at the tail replaces the fully connected layer for classification, forming a lightweight fully convolutional network.
According to the expression recognition method based on the efficient attention network and teacher-student iterative transfer learning, the MobileNet V2 basic convolution module firstly increases the dimension of an image channel domain through two-dimensional convolution with a convolution kernel size of 1, then collects spatial features channel by channel through grouped convolution with a convolution kernel size of 3, and finally reduces the dimension through two-dimensional convolution with a convolution kernel size of 1 and integrates feature information of each channel point by point.
According to the expression recognition method based on the efficient attention network and teacher-student iterative transfer learning, the local channel attention mechanism first integrates the spatial information of an input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector, then learns the weight required by each channel from the information of its neighbouring channels using a one-dimensional convolution with kernel size 3 followed by a Sigmoid activation function, and finally multiplies the weights with the original tensor t to obtain the scaled new features, where H denotes the height of the feature map, W its width, and C its number of channels.
The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning trains the efficient attention network with the preprocessed and data-enhanced expression data set as follows:
On the data-enhanced expression image data set, a stochastic gradient descent optimizer is used to train and optimize the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$, computed as

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{\hat{y}_i}}{\sum_{j=1}^{n} e^{\hat{y}_j}}$$

where $\hat{y}_i$ and $\hat{y}_j$ respectively denote the model's predicted output values for the i-th and j-th expressions of the input image tensor under the teacher network parameter set $\theta_{teacher}$ obtained by training, $label_i$ denotes the label value corresponding to the i-th expression of the image, and $n$ is the number of expression categories.
The expression recognition method based on the efficient attention network and the teacher-student iterative transfer learning comprises the following steps of training a student network by using a softening prediction value output by the teacher network:
the softened output of the teacher network is used as a soft label to represent the similarity relations between expressions, and the KL divergence loss against the soft label together with the Softmax loss against the data-set label jointly train and optimize the parameters of the student network model, specifically:
$$l = \alpha \cdot T^{2} \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where $l_{KL}$ is the KL divergence between the temperature-softened teacher and student distributions:

$$l_{KL} = \sum_{i=1}^{n} p_i \log\frac{p_i}{q_i}, \qquad p_i = \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}}, \qquad q_i = \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}}$$

where $y_i$ and $z_i$ respectively denote the predicted output values of the teacher network and the student network, $\alpha$ and $T$ are the soft-label proportion and distillation-temperature hyper-parameters, and $n$ is the number of expression categories.
The expression recognition method based on the efficient attention network and the teacher-student iterative transfer learning comprises the following specific steps of:
a. inputting the image data of the test set after data enhancement into an optimized student network model, and counting the identification accuracy;
b. transferring the model parameters learned by the student network to the teacher network, so that the teacher network parameter set $\theta_{teacher}$ equals the student network parameter set $\theta_{student}$;
c. fixing the teacher network parameter set $\theta_{teacher}$ and adjusting the learning rate to re-optimize the student network parameter set $\theta_{student}$;
d. repeating steps a to c several times, iteratively training the teacher and student network models and migrating parameters, until the recognition accuracy of the student network no longer rises.
According to the expression recognition method based on the efficient attention network and the teacher-student iterative transfer learning, the student network and the teacher network have the same structure.
Advantageous effects
Compared with the existing expression recognition method, the method has the advantages that:
1. The invention introduces a lightweight local attention mechanism into the inverted residual convolution block to construct an efficient attention inverted residual block, and builds the efficient attention network from this improved structure. While keeping the number of model parameters and the amount of computation essentially unchanged, a small number of extra parameters strengthen the network's fitting capability and markedly improve the model's recognition accuracy;
2. A teacher model assists student network training within a knowledge distillation framework, supplementing the face data with soft-label similarity relations between expressions and further reducing the training difficulty of the lightweight network. In addition, because identical network structures are trained, feature differences between the teacher and student networks are avoided; iterative training refines and optimizes the teacher-student soft-label information, and parameter migration strengthens iterative feature transfer between the teacher and student networks, markedly improving the model's recognition accuracy without introducing extra network parameters or computation.
3. While keeping the number of model parameters and the amount of computation low, the method strengthens the fitting capability of the lightweight network and optimizes soft labels and feature information through teacher-student iterative transfer learning, greatly improving recognition accuracy and meeting the deployment requirements of expression recognition on resource-constrained edge devices.
Drawings
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a frame diagram of a teacher-student iterative migration method;
FIG. 2 is a schematic diagram of a data enhancement process;
FIG. 3 is a diagram of an efficient attention network architecture;
FIG. 4 is a training flow diagram of the teacher network learning phase;
FIG. 5 is a training flow diagram of the teacher and student iterative transfer learning phase;
FIG. 6 is a confusion matrix on FER2013 test set according to the present invention;
FIG. 7 is a confusion matrix on the RAF-DB test set of the present invention.
The symbols used herein are as follows: H, W and C respectively denote the height, width and number of channels of the feature map; t denotes the input image tensor; $\theta_{teacher}$ and $\theta_{student}$ respectively denote the teacher and student network parameter sets; $\hat{y}_i$ and $\hat{y}_j$ respectively denote the model's predicted output values for the i-th and j-th expressions of the input image tensor under the teacher network parameter set $\theta_{teacher}$ obtained by training; $label_i$ denotes the label value corresponding to the i-th expression of the image; $y_i$ and $z_i$ respectively denote the predicted output values of the teacher network and the student network; and $\alpha$ and $T$ are the soft-label proportion and distillation-temperature hyper-parameters, respectively.
Detailed Description
Aiming at the defects of the prior art, the invention provides an expression recognition method based on an efficient attention network and teacher-student iterative transfer learning. The method adopts depthwise separable convolutions in a lightweight structural design to guarantee the real-time performance of the model; a local attention mechanism is introduced with a small number of parameters to strengthen the network's fitting capability; and finally, the teacher-student iterative transfer learning method markedly improves the recognition capability of the lightweight network without introducing extra parameters, so the method meets the real-time deployment requirements of resource-constrained edge devices while achieving high recognition accuracy.
The method comprises the following steps:
s1: acquiring an expression data set; preprocessing and enhancing data of the data set, and zooming the data set to a fixed size;
s2: constructing a lightweight expression recognition model based on the high-efficiency attention network, and training the high-efficiency attention network according to the preprocessed expression data set after data enhancement;
s3: inputting the test-set image data into the trained network and counting the recognition results; the trained network serves as the teacher network, and its softened prediction output assists the training of another efficient attention network, which serves as the student network;
s4: testing the recognition accuracy of the trained student network, migrating the model parameters learned by the student network to the teacher network, and repeating multiple rounds of iterative transfer training until the recognition accuracy of the student network no longer increases.
S5: and recognizing the facial expressions by utilizing a student network.
Step S1 specifically includes:
step S11: acquiring an expression image data set, unifying image resolution and normalizing pixel values, and if an original image is a gray image, copying the original image into three parts to form a three-channel tensor;
step S12: performing window sampling covering 90% of the area, together with horizontal flipping, on the image tensor at the top-left, top-right, center, bottom-left and bottom-right positions to obtain the data-enhanced expression image data set.
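Steps S11-S12 can be sketched as follows in PyTorch. This is an illustrative sketch, not the patent's code: the function names are invented, and the 90% window is approximated here by cropping 90% of each side length (the text specifies 90% of the area, so the exact crop size is an assumption).

```python
import torch

def preprocess(img: torch.Tensor, size: int = 48) -> torch.Tensor:
    """Step S11: normalize pixel values and replicate grayscale to 3 channels."""
    img = img.float() / 255.0
    if img.shape[0] == 1:                       # grayscale -> three identical channels
        img = img.repeat(3, 1, 1)
    return torch.nn.functional.interpolate(     # unify resolution
        img.unsqueeze(0), size=(size, size), mode="bilinear", align_corners=False
    ).squeeze(0)

def five_crop_flip(img: torch.Tensor, ratio: float = 0.9) -> list:
    """Step S12: sample windows at the four corners and the center, then add
    a horizontally flipped copy of each -- 10 tensors per input image."""
    _, h, w = img.shape
    ch, cw = int(h * ratio), int(w * ratio)     # assumed: ratio applied per side
    tops = [0, 0, (h - ch) // 2, h - ch, h - ch]
    lefts = [0, w - cw, (w - cw) // 2, 0, w - cw]
    crops = [img[:, t:t + ch, l:l + cw] for t, l in zip(tops, lefts)]
    return crops + [torch.flip(c, dims=[2]) for c in crops]
```

`torchvision.transforms.TenCrop` offers the same five-crop-plus-flip scheme if a fixed crop size is preferred over a ratio.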
Step S2 specifically includes:
step S21: introducing a local channel attention mechanism based on a MobileNet V2 basic convolution module to construct an efficient attention inverse residual block, further stacking to form a main body of an efficient attention network, converting image characteristics from a space domain to a channel domain by combining a two-dimensional convolution layer at the head of the network, and finally classifying by replacing a full connection layer with a two-dimensional convolution layer at the tail to form a lightweight full convolution network;
the MobileNet V2 basic convolution module firstly increases the dimension of an image channel domain through a two-dimensional convolution with the convolution kernel size of 1, then uses a grouped convolution with the convolution kernel size of 3 to collect space characteristics channel by channel, and finally reduces the dimension through a two-dimensional convolution with the convolution kernel size of 1 and integrates the characteristic information of each channel point by point;
the local channel attention mechanism uses a global pooling layer to integrate the spatial information of the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector, then learns the weight required by each channel from the information of its neighbouring channels using a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation function, and finally multiplies the weights with the original tensor t to obtain the scaled new features;
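A minimal PyTorch sketch of the efficient attention inverted residual block of step S21 follows. The class names, the expansion factor, and the exact position of the attention module inside the block are illustrative assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class LocalChannelAttention(nn.Module):
    """Pool H x W x C spatial information into 1 x 1 x C, learn per-channel
    weights from 3 neighbouring channels with a 1-D conv + Sigmoid, and
    rescale the original tensor t by those weights."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)

    def forward(self, t):                        # t: (N, C, H, W)
        s = t.mean(dim=(2, 3))                   # global pooling -> (N, C)
        w = torch.sigmoid(self.conv(s.unsqueeze(1))).squeeze(1)
        return t * w.unsqueeze(-1).unsqueeze(-1)

class EAIRB(nn.Module):
    """Efficient attention inverted residual block (sketch): 1x1 conv expands
    the channel domain, 3x3 depthwise (grouped) conv gathers spatial features
    channel by channel, attention rescales, 1x1 conv projects back down."""
    def __init__(self, c_in, c_out, expand=6, stride=1):
        super().__init__()
        c_mid = c_in * expand
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1,
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            LocalChannelAttention(),
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        self.shortcut = stride == 1 and c_in == c_out

    def forward(self, x):
        y = self.body(x)
        return x + y if self.shortcut else y     # shortcut addition when shapes match
```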
step S22: on the data-enhanced expression image data set, a stochastic gradient descent optimizer is used to train and optimize the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$, computed as

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{\hat{y}_i}}{\sum_{j=1}^{n} e^{\hat{y}_j}}$$

where $\hat{y}_i$ and $\hat{y}_j$ respectively denote the model's predicted output values for the i-th and j-th expressions of the input image tensor under the teacher network parameter set $\theta_{teacher}$ obtained by training, $label_i$ denotes the label value corresponding to the i-th expression of the image, and $n$ is the number of expression categories.
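Step S22 amounts to an ordinary cross-entropy training loop under stochastic gradient descent. The sketch below is illustrative: the function name and the momentum value are assumptions not given in the text.

```python
import torch
import torch.nn as nn

def train_teacher(model, loader, epochs=1, lr=0.01):
    """Optimize the network with SGD under the Softmax (cross-entropy) loss:
    l_softmax = -sum_i label_i * log softmax(y)_i."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, label in loader:
            opt.zero_grad()
            loss = ce(model(x), label)   # label holds expression-class indices
            loss.backward()
            opt.step()
    return model
```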
Step S3 specifically includes:
step S31: inputting the enhanced image data of the test set into the optimized high-efficiency attention model, and counting the identification accuracy;
step S32: and constructing a teacher-student knowledge distillation framework, taking the trained high-efficiency attention model as a teacher network, and using the output of the softened teacher network as a soft label to express the similarity relation between expressions.
Step S33: selecting a new high-efficiency attention network as a student network, and using KL divergence loss of a soft label and Softmax loss of a data set label to jointly train parameters of an optimization model, wherein the parameters are specifically expressed as follows:
$$l = \alpha \cdot T^{2} \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where $l_{KL}$ is the KL divergence between the temperature-softened teacher and student distributions:

$$l_{KL} = \sum_{i=1}^{n} p_i \log\frac{p_i}{q_i}, \qquad p_i = \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}}, \qquad q_i = \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}}$$

where $y_i$ and $z_i$ respectively denote the predicted output values of the teacher network and the student network, $\alpha$ and $T$ are respectively the soft-label proportion and distillation-temperature hyper-parameters, and $n$ is the number of expression categories.
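The combined distillation objective can be sketched in PyTorch as follows. This is an illustrative implementation of the standard knowledge-distillation formulation; the function name is invented.

```python
import torch
import torch.nn.functional as F

def distillation_loss(z_student, y_teacher, label, alpha=0.5, T=5.0):
    """l = alpha * T^2 * l_KL + (1 - alpha) * l_Softmax: KL divergence between
    temperature-softened teacher and student distributions, plus ordinary
    cross-entropy against the hard data-set label."""
    p_teacher = F.softmax(y_teacher / T, dim=1)          # softened soft label
    log_q_student = F.log_softmax(z_student / T, dim=1)  # softened student log-probs
    l_kl = F.kl_div(log_q_student, p_teacher, reduction="batchmean")
    l_softmax = F.cross_entropy(z_student, label)
    return alpha * T * T * l_kl + (1 - alpha) * l_softmax
```

The `T * T` factor keeps the gradient magnitude of the softened term comparable to the hard-label term, as in standard knowledge distillation.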
Further, the step S4 is specifically:
step S41: inputting the enhanced image data of the test set into the optimized student model, and counting the identification accuracy;
step S42: transferring the model parameters learned by the student network to the teacher network, so that $\theta_{teacher} = \theta_{student}$;
Step S43: fixed teacher network parameter set thetateacherAdjusting learning rate to re-optimize θ for student networkstudent;
Step S44: and repeating the steps S41-S43 for a plurality of times to carry out iterative training of the teacher-student model and parameter migration until the identification accuracy of the student network does not rise any more.
One specific example is given below:
step S1:
an expression image dataset is acquired. FER2013 and RAF-DB data sets are mainly used in the training. The FER2013 data set is a 48 × 48 grayscale image and comprises 28,708 training sets, 3589 public testing sets and 3589 private testing sets of facial pictures. The RAF-DB data set is a 100 x 100 color image and comprises 12271 training sets and 3068 test sets of facial pictures. Both data sets contained seven expression categories (anger, disgust, fear, happiness, sadness and surprise), trained and tested respectively. For unified data input format, it is necessary to scale the image of RAF-DB to 48 × 48 size and copy the image of FER2013 to 3 channels.
To avoid overfitting, the data are expanded: as shown in fig. 2, the invention normalizes the pixel values of the image tensor and performs window sampling covering 90% of the area, together with horizontal flipping, at the top-left, top-right, center, bottom-left and bottom-right positions to obtain the data-enhanced expression image data set.
Step S2 specifically includes:
as shown in fig. 3, a local channel attention mechanism is introduced into the MobileNetV2 basic convolution module (IRB) to construct an Efficient Attention Inverted Residual Block (EAIRB); these blocks are stacked to form the main body of the Efficient Attention Network (EAN); a two-dimensional convolution layer at the network head converts image features from the spatial domain to the channel domain; and a tail two-dimensional convolution layer replaces the fully connected layer for classification, forming a lightweight fully convolutional network. The final EAN contains only 43,840 parameters, occupies 4.07 MB of runtime memory, and has a single-inference latency of about 1.985 ms under ONNX Runtime on CPU; the layer-by-layer configuration is given in Table 1 below;
TABLE 1 high efficiency attention network hierarchy
The MobileNetV2 basic convolution module (IRB) first raises the dimension of the image channel domain with a two-dimensional convolution of kernel size 1, then gathers spatial features channel by channel with a grouped convolution of kernel size 3, and finally reduces the dimension with a two-dimensional convolution of kernel size 1 while integrating the feature information of every channel point by point. The local channel attention mechanism (blue part in fig. 3) first integrates the spatial information of the input H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector with a global pooling layer, then learns the weight required by each channel from the information of its neighbouring channels using a one-dimensional convolution of kernel size 3 followed by a Sigmoid activation function, multiplies the weights with the original tensor t to obtain the scaled new features, and adds them to the original features through a shortcut connection (Shortcut).
The teacher network training process is shown in FIG. 4. On the data-enhanced expression image data set, a stochastic gradient descent optimizer is used to train and optimize the output of the efficient attention network according to the Softmax loss function $l_{Softmax}$:

$$l_{Softmax} = -\sum_{i=1}^{n} label_i \cdot \log \frac{e^{\hat{y}_i}}{\sum_{j=1}^{n} e^{\hat{y}_j}}$$

where $\hat{y}_i$ and $\hat{y}_j$ respectively denote the model's predicted output values for the i-th and j-th expressions of the input image tensor under the parameter set $\theta_{teacher}$ obtained in training, $label_i$ denotes the label value corresponding to the i-th expression of the image, and $n$ is the number of expression categories. The initial learning rate is set to 0.01 and the batch size to 64; the learning rate is multiplied by 0.9 whenever the loss has not decreased within two epochs.
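The described schedule matches PyTorch's `ReduceLROnPlateau`; a minimal illustration follows. Reading "does not decrease within two periods" as `patience=2` is an assumption, and the model here is a stand-in.

```python
import torch

# Initial lr 0.01; multiply lr by 0.9 when the tracked loss has not improved
# for two epochs (patience=2 is one reasonable reading of "two periods").
model = torch.nn.Linear(48 * 48 * 3, 7)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode="min", factor=0.9, patience=2)

for loss in [1.0, 0.9, 0.9, 0.9, 0.9]:   # illustrative per-epoch loss values
    sched.step(loss)                     # pass the epoch's loss to the scheduler
```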
Step S3 specifically includes:
and inputting the enhanced image data of the test set into the optimized high-efficiency attention model, counting the identification accuracy, taking the trained high-efficiency attention model as a teacher network, and using the output (z) of the softened teacher network as a soft label to represent the similarity relation between expressions. Another high-efficiency attention network is selected as a student network to form a teacher-student knowledge distillation structure, as shown in fig. 5. The KL divergence loss of the teacher network soft label and the Softmax loss of the data set label are trained and optimized together in the student network, and the method is specifically represented as follows:
$$l = \alpha \cdot T^{2} \cdot l_{KL} + (1-\alpha) \cdot l_{Softmax}$$

where $l_{KL}$ is the KL divergence between the temperature-softened teacher and student distributions:

$$l_{KL} = \sum_{i=1}^{n} p_i \log\frac{p_i}{q_i}, \qquad p_i = \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}}, \qquad q_i = \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}}$$

where $y_i$ and $z_i$ respectively denote the predicted output values of the teacher network and the student network, $\alpha$ and $T$ are respectively the soft-label proportion and distillation-temperature hyper-parameters, and $n$ is the number of expression categories. The initial learning rate of the joint training is 0.01 and the batch size is 64; the learning rate is multiplied by 0.9 whenever the loss has not decreased within two epochs.
Step S4 specifically includes:
and inputting the enhanced image data of the test set into the trained student network model, and counting the identification accuracy. Student network parameter set theta for learning student network at the beginning of iterative trainingstudentMigrating to the teacher network, and fixing the teacher network parameter set thetateacherRe-optimizing student network parameter set theta by adjusting learning rate to 0.1student. The effect of the soft label component is regulated and controlled by alpha and T in the iterative process, the specific promotion effect is shown in Table 2, and experiments show that the effect is optimal when the alpha and the T are respectively selected to be 0.5 and 5 on FER2013 and RAF-DB data sets.
TABLE 2 comparison of the iterative precision enhancement effects under different hyper-parameters
Step S44: and repeating the steps S41-S43 for a plurality of times to carry out iterative training of the teacher-student model and parameter migration until the identification accuracy of the student network does not rise any more.
The student network model with the highest iteration accuracy can predict independently on resource-constrained edge devices. Its recognition confusion matrices on the FER2013 and RAF-DB data sets are shown in figs. 6 and 7, with over 80% recognition accuracy on the relatively similar "happy" and "surprised" expressions. The overall accuracies of the model are 70.63% and 85.30% respectively: considerable recognition accuracy is achieved while guaranteeing real-time deployability on constrained devices, balancing model complexity against accuracy.
Claims (10)
1. An expression recognition method based on an efficient attention network and teacher-student iterative transfer learning is characterized in that a lightweight expression recognition model based on the efficient attention network is constructed, and the efficient attention network is trained by utilizing a preprocessed and data-enhanced expression data set; then, taking the trained high-efficiency attention network as a teacher network, taking the other high-efficiency attention network as a student network, and training the student network by using a softening predicted value output by the teacher network; and transferring the model parameters of the student network learning which is trained and tested to a teacher network, repeating iterative transfer training until the recognition accuracy of the student network does not rise any more, and finally recognizing the facial expression by using the student network.
2. The expression recognition method based on the efficient attention network and the teacher-student iterative transfer learning of claim 1, wherein the expression data set is preprocessed as follows:
the image is scaled to a fixed size to unify the image resolution, the pixel values are normalized, and, if the original image is grayscale, it is replicated into three copies to form a three-channel tensor.
3. The expression recognition method based on the efficient attention network and the teacher-student iterative transfer learning according to claim 1 or 2, wherein the expression data set is subjected to data enhancement by the following method:
performing window sampling covering 90% of the image area at the top-left, top-right, center, bottom-left, and bottom-right positions of the image tensor, together with horizontal flipping, to obtain a data-enhanced expression image dataset.
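A sketch of the five-crop-plus-flip augmentation of claim 3. Whether "90%" refers to the window's area or its side length is ambiguous in the claim; here a 0.9 side-length scale is assumed for simplicity.

```python
import numpy as np

def five_crop_flip(img, scale=0.9):
    """Sample windows at the four corners and the centre of the image,
    then add the horizontally flipped copy of each crop (claim 3).
    scale=0.9 (side-length ratio) is an assumption, not from the patent."""
    h, w = img.shape[:2]
    ch, cw = int(h * scale), int(w * scale)
    anchors = [(0, 0), (0, w - cw),                    # top-left, top-right
               ((h - ch) // 2, (w - cw) // 2),         # centre
               (h - ch, 0), (h - ch, w - cw)]          # bottom-left, bottom-right
    crops = [img[y:y + ch, x:x + cw] for y, x in anchors]
    crops += [c[:, ::-1] for c in crops]               # horizontal flips
    return crops                                       # 10 samples per image
```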
4. The expression recognition method based on the high-efficiency attention network and the teacher-student iterative transfer learning of claim 3, wherein the construction method of the lightweight expression recognition model based on the high-efficiency attention network comprises the following steps:
firstly, introducing a local channel attention mechanism into the MobileNetV2 basic convolution module to construct an efficient attention inverted residual block, and stacking such blocks to form the main body of the efficient attention network; then, using a two-dimensional convolution layer at the head of the network to convert image features from the spatial domain to the channel domain; and finally, replacing the fully connected classification layer with a two-dimensional convolution layer at the tail, forming a lightweight fully convolutional network.
5. The expression recognition method based on the efficient attention network and teacher-student iterative transfer learning of claim 4, wherein the MobileNetV2 basic convolution module first raises the dimension of the image channel domain through a two-dimensional convolution with a kernel size of 1, then collects spatial features channel by channel using a depthwise (grouped) convolution with a kernel size of 3, and finally reduces the dimension through a two-dimensional convolution with a kernel size of 1 while integrating the feature information of each channel point by point.
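A minimal numpy sketch of the expand-depthwise-project structure of claim 5. Weights are passed in by the caller; biases, BatchNorm, activation, stride, and the residual shortcut are omitted for brevity, so this illustrates only the three convolution stages.

```python
import numpy as np

def inverted_residual(x, w_expand, w_dw, w_project):
    """MobileNetV2 basic module (claim 5): a 1x1 conv raises the channel
    dimension, a 3x3 depthwise conv gathers spatial features channel by
    channel, and a 1x1 conv reduces the dimension point by point."""
    h, w, c = x.shape
    x = x @ w_expand                              # 1x1 conv = per-pixel matmul, raises C
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))     # 'same' padding for the 3x3 kernel
    out = np.zeros_like(x)
    for dy in range(3):                           # depthwise: each channel has its
        for dx in range(3):                       # own 3x3 kernel, no channel mixing
            out += pad[dy:dy + h, dx:dx + w] * w_dw[dy, dx]
    return out @ w_project                        # 1x1 conv lowers C, mixes channels
```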
6. The method as claimed in claim 5, wherein in the local channel attention mechanism, a global pooling layer integrates the spatial information of an H × W × C tensor t channel by channel into a 1 × 1 × C one-dimensional feature vector; a one-dimensional convolution with a kernel size of 3, combined with a Sigmoid activation function, learns the weight value required by each channel from the information of its adjacent channels; and the weight values are multiplied with the original tensor t to obtain a new, rescaled feature, where H represents the height of the feature map, W its width, and C its number of channels.
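The local channel attention of claim 6 follows an ECA-style pattern, sketched below in numpy. The uniform 1-D kernel is an illustrative assumption; in the network these kernel weights would be learned by training.

```python
import numpy as np

def local_channel_attention(t, k=3):
    """ECA-style local channel attention (claim 6): global average pooling
    squeezes the H x W x C tensor t into a 1 x 1 x C vector, a 1-D conv of
    kernel size k derives each channel's weight from its neighbours, a
    Sigmoid maps weights to (0, 1), and the weights rescale the original t."""
    v = t.mean(axis=(0, 1))                            # global pooling -> (C,)
    kernel = np.full(k, 1.0 / k)                       # assumed weights (learned in practice)
    padded = np.pad(v, (k // 2, k // 2), mode="edge")  # keep length C after conv
    conv = np.convolve(padded, kernel, mode="valid")   # cross-channel 1-D conv
    weights = 1.0 / (1.0 + np.exp(-conv))              # Sigmoid
    return t * weights                                 # broadcast scale over channels
```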
7. The expression recognition method based on the high-efficiency attention network and the teacher-student iterative transfer learning of claim 6, wherein the specific method for training the high-efficiency attention network by using the preprocessed and data-enhanced expression data set comprises the following steps:
on the data-enhanced expression image dataset, a stochastic gradient descent optimizer is used to train and optimize the output of the efficient attention network according to the Softmax loss function l_Softmax, calculated as follows:

l_Softmax = −∑_{i=1}^{n} label_i · log( exp(y_i) / ∑_{j=1}^{n} exp(y_j) )

where y_i and y_j respectively represent the predicted output values for the i-th and j-th expressions of the input image tensor under the teacher network parameter set θ_teacher obtained by training the model, label_i represents the label value corresponding to the i-th expression of the image, and n represents the number of expression categories.
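The Softmax (cross-entropy) loss of claim 7 can be computed as follows; the max-shift for numerical stability is a standard implementation detail, not stated in the patent.

```python
import numpy as np

def softmax_loss(logits, label):
    """Softmax loss of claim 7: label is a one-hot vector over the n
    expression classes, logits are the network's raw output values."""
    z = logits - logits.max()                 # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()           # Softmax probabilities
    return -float((label * np.log(p)).sum())  # cross-entropy with the label
```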
8. The method for recognizing the expressions based on the high-efficiency attention network and the teacher-student iterative transfer learning of claim 7, wherein the method for training the student network by using the softening predicted value output by the teacher network comprises the following steps:
the output of the teacher network is softened and used as a soft label representing the similarity relations between expressions, and the KL-divergence loss against the soft label together with the Softmax loss against the dataset label jointly train and optimize the parameters of the student network model, specifically expressed as:
l = α · T² · l_KL + (1 − α) · l_Softmax

in the formula, α is the coefficient weighting the two losses, T is the softening temperature, l_KL is the KL-divergence loss between the softened outputs of the teacher and student networks, and l_Softmax is the Softmax loss against the dataset label.
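A sketch of the combined distillation loss of claim 8. The temperature T and weight α values below are illustrative assumptions; the patent does not fix them.

```python
import numpy as np

def _softmax(z, T=1.0):
    """Temperature-scaled Softmax; T > 1 softens the distribution."""
    z = z / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.9):
    """Claim 8: l = alpha * T^2 * l_KL + (1 - alpha) * l_Softmax.
    T=4.0 and alpha=0.9 are assumed defaults, not from the patent."""
    p_t = _softmax(teacher_logits, T)              # softened teacher soft label
    p_s = _softmax(student_logits, T)              # softened student prediction
    l_kl = float((p_t * np.log(p_t / p_s)).sum())  # KL divergence to the soft label
    p = _softmax(student_logits)                   # hard-label branch
    l_softmax = -float((label * np.log(p)).sum())  # Softmax loss on dataset label
    return alpha * T**2 * l_kl + (1 - alpha) * l_softmax
```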
9. The expression recognition method based on the efficient attention network and the teacher-student iterative transfer learning of claim 8, wherein the specific method of the repeated iterative transfer training is as follows:
a. inputting the data-enhanced test-set image data into the optimized student network model and counting the recognition accuracy;
b. transferring the model parameters learned by the student network to the teacher network, so that the teacher network parameter set θ_teacher equals the student network parameter set θ_student;
c. fixing the teacher network parameter set θ_teacher, adjusting the learning rate, and re-optimizing the student network parameter set θ_student;
d. repeating steps a to c several times, iteratively training and migrating parameters between the teacher and student network models until the recognition accuracy of the student network no longer rises.
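The control flow of steps a–d in claim 9 can be sketched as a loop over placeholder callables; `train_student` and `evaluate` stand in for the actual training and test-set evaluation code and are assumptions of this sketch.

```python
def iterative_transfer(train_student, evaluate):
    """Claim 9, steps a-d: repeatedly promote the trained student to the
    teacher role and retrain, stopping once test accuracy no longer rises.
    train_student(teacher_params) returns new student parameters;
    evaluate(params) returns test-set recognition accuracy."""
    teacher = None                         # first round: plain supervised training
    best_acc, best_params = -1.0, None
    while True:
        student = train_student(teacher)   # c. re-optimise the student network
        acc = evaluate(student)            # a. count recognition accuracy
        if acc <= best_acc:                # d. stop when accuracy stalls
            return best_params, best_acc
        best_acc, best_params = acc, student
        teacher = student                  # b. theta_teacher <- theta_student
```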
10. The method of claim 9, wherein the student network has the same structure as the teacher network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111655846.5A CN114298233A (en) | 2021-12-30 | 2021-12-30 | Expression recognition method based on efficient attention network and teacher-student iterative transfer learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114298233A true CN114298233A (en) | 2022-04-08 |
Family
ID=80973289
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111655846.5A Pending CN114298233A (en) | 2021-12-30 | 2021-12-30 | Expression recognition method based on efficient attention network and teacher-student iterative transfer learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114298233A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116645507A (en) * | 2023-05-18 | 2023-08-25 | 丽水瑞联医疗科技有限公司 | Placenta image processing method and system based on semantic segmentation |
CN117829683A (en) * | 2024-03-04 | 2024-04-05 | 国网山东省电力公司信息通信公司 | Electric power Internet of things data quality analysis method and system based on graph comparison learning |
- 2021-12-30 CN CN202111655846.5A patent/CN114298233A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Lightweight deep network for traffic sign classification | |
CN110334705B (en) | Language identification method of scene text image combining global and local information | |
WO2021190451A1 (en) | Method and apparatus for training image processing model | |
Lei et al. | Shallow convolutional neural network for image classification | |
Islalm et al. | Recognition bangla sign language using convolutional neural network | |
CN109829541A (en) | Deep neural network incremental training method and system based on learning automaton | |
CN109993100B (en) | Method for realizing facial expression recognition based on deep feature clustering | |
CN110516095A (en) | Weakly supervised depth Hash social activity image search method and system based on semanteme migration | |
CN114298233A (en) | Expression recognition method based on efficient attention network and teacher-student iterative transfer learning | |
CN110110724A (en) | The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type | |
CN115331285A (en) | Dynamic expression recognition method and system based on multi-scale feature knowledge distillation | |
Wu et al. | Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation | |
Ye et al. | A joint-training two-stage method for remote sensing image captioning | |
CN114612761A (en) | Network architecture searching method for image recognition | |
CN116258990A (en) | Cross-modal affinity-based small sample reference video target segmentation method | |
CN113920363B (en) | Cultural relic classification method based on lightweight deep learning network | |
CN115062727A (en) | Graph node classification method and system based on multi-order hypergraph convolutional network | |
CN114463340A (en) | Edge information guided agile remote sensing image semantic segmentation method | |
Aakanksha et al. | A systematic and bibliometric review on face recognition: Convolutional neural network | |
Liu et al. | Learning a similarity metric discriminatively with application to ancient character recognition | |
CN116881416A (en) | Instance-level cross-modal retrieval method for relational reasoning and cross-modal independent matching network | |
Hao et al. | Architecture self-attention mechanism: Nonlinear optimization for neural architecture search | |
CN116311455A (en) | Expression recognition method based on improved Mobile-former | |
CN115965819A (en) | Lightweight pest identification method based on Transformer structure | |
CN115809314A (en) | Multitask NL2SQL method based on double-layer multi-gated expert Mixed Model (MMOE) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||