CN114170657A - Facial emotion recognition method integrating attention mechanism and high-order feature representation - Google Patents

Facial emotion recognition method integrating attention mechanism and high-order feature representation

Info

Publication number
CN114170657A
Authority
CN
China
Prior art keywords
output
network
image
attention
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111439715.3A
Other languages
Chinese (zh)
Inventor
孙强
梁乐
梅路洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202111439715.3A priority Critical patent/CN114170657A/en
Publication of CN114170657A publication Critical patent/CN114170657A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial emotion recognition method that integrates an attention mechanism with high-order feature representation. Target images are first collected and divided into a training sample set and a test sample set. The original emotion annotation value of each sample image in the training set is then read, and each sample image is sent into a multi-task cascaded convolutional neural network to obtain preprocessed output images; the preprocessed images are then input into a residual attention network to obtain attention output feature maps. Finally, the output feature maps are sent into a channel-based global second-order pooling network and a spatial-position-based global second-order pooling network, respectively, to obtain emotion-related output features, and emotional state values are obtained through a regressor. The invention addresses the problems that existing attention-based facial emotion recognition methods struggle to model long-distance dependencies among emotional features and have insufficient nonlinear network characterization capability.

Description

Facial emotion recognition method integrating attention mechanism and high-order feature representation
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a facial emotion recognition method integrating an attention mechanism and high-order feature representation.
Background
With the continuous development of society and the growing demand for fast, effective automatic emotion recognition in many areas, biometric recognition technology has advanced rapidly over the last decade. In everyday life, people must recognize the emotions of others and respond with appropriate behavior in order to communicate and interact normally.
Aaron Sloman raised the study of emotion in artificial intelligence as early as 1981. In 1985, Marvin Minsky, one of the founders of artificial intelligence, raised the question of computers and emotion. Picard formally proposed the concept of affective computing in 1995 and, in Affective Computing (1997), defined it as computing that is related to, arises from, or is capable of influencing emotion.
As one of the important branches of the recognition field, emotion recognition has been a major research topic for many scholars in recent years. For static face images, conventional recognition methods usually have researchers extract emotional features by hand (for example, the 68 facial landmarks provided by Dlib) and then feed these features into a pre-designed classifier (SVM, decision tree, random forest, etc.) to obtain the final emotion prediction. However, manually selecting appropriate emotional features in practice requires extensive experience and often a great deal of time and effort. Moreover, different features contribute to the final result to different degrees, and traditional methods do not distinguish these contributions well.
In recent years, with the continuous growth of computing power, deep learning has re-emerged. Owing to its strong feature learning ability and high performance, it has gradually replaced traditional machine learning and become the mainstream approach in the recognition field. Most current deep-learning-based emotion analysis methods incorporate an attention mechanism to screen effective features, but they lack effective modeling of the long-distance dependencies among emotional semantic features. Moreover, conventional approaches train separate models for different tasks, failing to exploit the similarity between related tasks, and the nonlinear characterization capability of the deep network remains insufficient.
Disclosure of Invention
The invention aims to provide a facial emotion recognition method that integrates an attention mechanism with high-order feature representation, solving the problems that existing attention-based facial emotion recognition methods struggle to model long-distance dependencies between emotional features and have insufficient nonlinear network characterization capability.
The invention adopts the technical scheme that a facial emotion recognition method integrating an attention mechanism and high-order feature representation is implemented according to the following steps:
Step 1, collect target images and divide them into a training sample set x_train and a test sample set x_test;
Step 2, read the original emotion annotation value of each sample image in the training sample set x_train, send each sample image in x_train into the multi-task cascaded convolutional neural network MTCNN, and complete face alignment based on face detection and 5 key feature points to obtain the output images x_input = [x_1, x_2, ..., x_n], where x_n denotes the n-th output image and n denotes the total number of output images, i.e. the total number of images in the training sample set x_train;
Step 3, input the preprocessed images x_input into a residual attention network; the trunk branch extracts features M_i,c(x) of different receptive fields, and the mask branch learns attention weights T_i,c(x); finally, the trunk-branch output and the mask-branch output are combined through a dot-product operation to obtain the attention output feature map H_i,c(x);
Step 4, send the attention output feature map H_i,c(x) obtained in step 3 into a channel-based global second-order pooling (GSoP) network and a spatial-position-based global second-order pooling network, respectively; the channel-based network outputs the dependency Z_trans between feature maps, and the spatial-position-based network outputs the dependency Z_non-local between spatial positions in the feature map;
Step 5, fuse the dependency Z_trans between feature maps and the dependency Z_non-local between spatial positions within the feature map to obtain the output feature Z_fusion;
Step 6, send the output feature Z_fusion obtained in step 5 into a two-stage multi-task learning network, and obtain the emotional state values Arousal and Valence using a linear regressor.
The present invention is also characterized in that,
in the step 1, the method comprises the following steps of,
for training sample set xtrainThe sample image is a tensor x with dimensions of n x h x wtrain=[(h1,w1),(h2,w2),...,(hn,wn)]Wherein n represents the total number of samples of the training sample set, h and w represent the length and width of each sample image respectively, and the original emotion marking value of the sample is a vector y with dimension of n multiplied by 2a,v=[(a1,v1),(a2,v2),...,(an,vn)]Wherein (a)n,vn) Respectively representing a training sample set xtrainArousal and Valence labels for the nth sample image;
for test sample set xtestThe sample image is a tensor x with dimensions of m x h x Wtest=[(h1,w1),(h2,w2),...,(hm,wm)]Where m represents the total number of samples of the test sample set, and h and w represent the length and width, respectively, of each sample image; the original emotion annotation value of the sample is a m multiplied by 2 dimensional vector ya,v=[(a1,v1),(a2,v2),...,(am,vm)]Wherein (a)m,vm) Respectively representing a set of test samples xtestArousal and Valence labels of the mth sample picture.
The step 2 is as follows:
Step 2.1, read each sample image (h_l, w_l) in the training sample set x_train, where l = 1, 2, ..., n and n denotes the total number of images in x_train; each input sample image is then scaled at different ratios to generate a series of detection boxes of different sizes, which are used to construct an image pyramid [x_1, x_2, ..., x_k], where x_k denotes the k-th image in the image pyramid and k denotes the total number of images in the pyramid, so as to adapt to face detection at different scales; the detection process consists of three network structures: the Proposal Network, the Refine Network, and the Output Network;
Step 2.2, send the image pyramid [x_1, x_2, ..., x_k] obtained in step 2.1 into the first-layer network, the Proposal Network, which performs feature extraction and face-box calibration; whether a region is a face is then judged through three convolution layers, a face classifier, bounding-box regression, and a facial key-point locator, and a set of images that may contain faces, [x_1, x_2, ..., x_g], is finally output, where x_g denotes the g-th image and g denotes the total number of images that may contain faces; the generated candidate boxes are filtered by the non-maximum suppression NMS algorithm;
Step 2.3, send the output of step 2.2 into the second-layer network, the Refine Network; the candidate boxes remaining after screening are further located with respect to key feature points and face regions through a key-feature-point locator and bounding-box regression, and optimized with the non-maximum suppression algorithm, yielding the high-precision screening and face-region refinement output [x_1, x_2, ..., x_j], where x_j denotes the j-th image in the output and j denotes the total number of output images;
Step 2.4, send the output obtained in step 2.3 into the third-layer network, the Output Network; the image features [x_1, x_2, ..., x_j] pass through four convolution layers and a fully connected layer and, with simultaneous face-region bounding-box regression and facial key-point localization, the final MTCNN output images x_input = [x_1, x_2, ..., x_n] are obtained, where x_n denotes the n-th output image and n denotes the total number of output images.
The step 3 is as follows:
Step 3.1, initialize the parameters of the entire network architecture, i.e. the weights and biases, covering all convolution layers, pooling layers, and fully connected layers in the network;
Step 3.2, send the preprocessed images into the trunk branch and the mask branch, respectively; the trunk branch outputs the features M_i,c(x) of different receptive fields, and the mask branch outputs the learned attention weights T_i,c(x); the specific process is as follows:
Trunk branch: the image features x_input pass through two convolution layers with 3 × 3 kernels, and the output results are normalized to obtain the features M_i,c(x) of different receptive fields;
let the convolution-layer output of the l-th layer be z_l; the final output o_l is obtained through the normalization operation BN and the activation function ReLU, calculated as:
o_l = ReLU(BN(z_l)) = ReLU(BN(W·o_(l-1) + b))   (1)
where W and b denote the weights and bias, respectively, l = 1, 2, z_l denotes the convolution-layer output of the l-th layer, and o_(l-1) denotes the final output of the (l-1)-th layer;
BN normalizes the result of the convolution layer, calculated as:
x_BN = (x_i − μ_l) / σ_l   (2)
in equation (2), x_i denotes any individual sample image feature of the image features x_input, x_BN is the output feature after normalization, σ_l is the standard-deviation image of the l-th layer image features, and μ_l is the mean image of the l-th layer image features;
σ_l and μ_l in equation (2) are defined as:
σ_l = sqrt( (1/k) Σ_k (x_k − μ_l)² )   (3)
μ_l = (1/k) Σ_k x_k   (4)
in equations (3) and (4), x_k denotes a sample image feature of x_input, k denotes the number of samples in each mini-batch, and k ≥ 1;
Mask branch: the image features x_input pass through two pooling operations and two upsampling operations, and the output is mapped into the range 0 to 1 by a sigmoid function, giving the learned attention weights T_i,c(x);
Step 3.3, take the dot product of the trunk-branch output M_i,c(x) and the mask-branch output T_i,c(x), introducing a residual mechanism in the process to obtain the residual attention output, calculated as:
H_i,c(x) = (1 + M_i,c(x)) * T_i,c(x)   (5)
the step 4 is as follows:
Step 4.1, first feed the residual attention feature H_i,c(x) obtained in step 3 into the channel-based global second-order pooling network to learn the dependency Z_trans between feature maps; let the output feature map of the residual attention network have size h' × w' × c'; the feature map input to the GSoP network first undergoes channel dimensionality reduction through a 1 × 1 convolution, yielding a feature map of size h' × w' × c, where h', w', c', and c are, respectively, the height, width, number of input channels, and number of channels after dimensionality reduction of the feature map;
a channel-wise covariance matrix of size c × c is then obtained through the second-order pooling operation; the i-th row of the covariance matrix represents the correlation, or dependency, between the i-th feature channel of the residual attention network output feature map and all channels;
the c × c two-dimensional covariance matrix is reshaped into a three-dimensional tensor of size 1 × c × c, and row-by-row convolution is performed on the reshaped covariance matrix, i.e. each row of the covariance matrix is treated as a group in a grouped convolution, with output size 1 × c × 4c; a 1 × 1 convolution is then applied, with output size 1 × 1 × c', followed by a sigmoid activation layer, giving a weight vector of size 1 × c';
each feature channel of the feature map input to the GSoP network is multiplied by the element at the corresponding position of the 1 × c' weight vector, i.e. each feature channel output by the residual attention network is given a different degree of attention;
Step 4.2, then feed the residual attention feature H_i,c(x) obtained in step 3 into the spatial-position-based GSoP network model to learn the dependency Z_non-local between spatial positions in the feature map; let the output feature map of the residual attention network have size h' × w' × c'; the feature map input to the GSoP network first undergoes channel dimensionality reduction through a 1 × 1 convolution, yielding a feature map of size h' × w' × c, where h', w', c', and c are, respectively, the height, width, number of input channels, and number of channels after dimensionality reduction of the feature map;
the channel-reduced feature map is downsampled, reducing its size to h × w × c;
a position-wise covariance matrix of size hw × hw is then obtained through the second-order pooling operation; the i-th row of the covariance matrix represents the correlation, or dependency, between the i-th spatial position and all spatial positions in the residual attention network output feature map;
the hw × hw two-dimensional covariance matrix is reshaped into a three-dimensional tensor of size 1 × hw × hw, and row-by-row convolution is performed with each row of the reshaped covariance matrix as a group, the grouped convolution giving an output of size 1 × hw × 4hw; a 1 × 1 convolution and a sigmoid function are then applied, giving an output of size 1 × 1 × hw, which is reshaped into a new weight matrix of size h × w × 1;
the h × w × 1 weight matrix is reshaped into a weight matrix of size h' × w' × 1 through upsampling;
the feature map input to the GSoP network is multiplied by the corresponding spatial-position weights in the weight matrix, emphasizing or suppressing the spatial-position features in the residual attention network output feature map.
The step 6 is as follows:
Step 6.1, adopt hard parameter sharing in the multi-task learning; in the first stage of multi-task learning, a shared module first extracts general low-level features from the output feature Z_fusion obtained in step 5, and two branches are then used, respectively, to learn the classification representation Z_class and the dimensional representation Z_dim of the image; the two learned output features, classification and dimensional, are then concatenated into the feature Z_mtl-1;
Step 6.2, in the second stage of multi-task learning, apply linear regression to the feature Z_mtl-1 to obtain the output values Arousal and Valence;
the loss function adopted by the linear regressor is Tukey's biweight loss, defined as:
Loss = (c²/6) · [1 − (1 − (r_i/c)²)³]  if |r_i| ≤ c,  and  Loss = c²/6  otherwise   (6)
in equation (6), Loss denotes the loss value, c = 4.685 is a hyperparameter, r_i = y_i − ŷ_i is the residual, y_i denotes the real label value, and ŷ_i denotes the predicted label value.
Compared with the prior art, the beneficial effect of the invention is that, by fusing an attention mechanism with high-order feature representation, it effectively solves the problems that existing attention-based facial emotion recognition methods struggle to model long-distance dependencies between emotional features and have insufficient nonlinear network characterization capability.
Drawings
FIG. 1 is an overall network model architecture of a facial emotion recognition method integrating attention mechanism and high-order feature representation according to the invention.
Fig. 2 is a residual attention network model structure.
Fig. 3 is a channel-based GSoP network model structure.
Fig. 4 is a GSoP network model structure based on spatial location.
FIG. 5 is an emotional state value prediction module that integrates depth characterization.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a facial emotion recognition method integrating an attention mechanism and high-order feature representation, which is implemented, with reference to Figs. 1 to 5, according to the following steps:
step 1, collecting a target image, and dividing the target image into a training sample set xtrainAnd test sample set xtest
In step 1,
for the training sample set x_train, the sample images form an n × h × w tensor x_train = [(h_1,w_1), (h_2,w_2), ..., (h_n,w_n)], where n denotes the total number of samples in the training set and h and w denote the height and width of each sample image, respectively; the original emotion annotation values of the samples form an n × 2 vector y_a,v = [(a_1,v_1), (a_2,v_2), ..., (a_n,v_n)], where (a_n,v_n) denote the Arousal and Valence labels of the n-th sample image in the training set x_train;
for the test sample set x_test, the sample images form an m × h × w tensor x_test = [(h_1,w_1), (h_2,w_2), ..., (h_m,w_m)], where m denotes the total number of samples in the test set and h and w denote the height and width of each sample image, respectively; the original emotion annotation values of the samples form an m × 2 vector y_a,v = [(a_1,v_1), (a_2,v_2), ..., (a_m,v_m)], where (a_m,v_m) denote the Arousal and Valence labels of the m-th sample image in the test set x_test.
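By way of illustration only (not forming part of the claimed method), the data layout described in step 1 can be sketched as a small PyTorch dataset wrapper; the class name and assumed tensor shapes are hypothetical.

```python
import torch
from torch.utils.data import Dataset

class AffectDataset(Dataset):
    """Hypothetical wrapper for the layout of step 1: images of shape (h, w)
    and two-dimensional (Arousal, Valence) annotation vectors."""

    def __init__(self, images, labels):
        # images: array-like of shape (n, h, w); labels: array-like of shape (n, 2)
        self.images = torch.as_tensor(images, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.float32)

    def __len__(self):
        return self.images.shape[0]

    def __getitem__(self, idx):
        # add a channel dimension so downstream CNNs receive (1, h, w)
        return self.images[idx].unsqueeze(0), self.labels[idx]
```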
Step 2, read the original emotion annotation value of each sample image in the training sample set x_train, send each sample image in x_train into the multi-task cascaded convolutional neural network MTCNN, and complete face alignment based on face detection and 5 key feature points to obtain the output images x_input = [x_1, x_2, ..., x_n], where x_n denotes the n-th output image and n denotes the total number of output images, i.e. the total number of images in the training sample set x_train;
the step 2 is as follows:
Step 2.1, read each sample image (h_l, w_l) in the training sample set x_train, where l = 1, 2, ..., n and n denotes the total number of images in x_train; each input sample image is then scaled at different ratios to generate a series of detection boxes of different sizes, which are used to construct an image pyramid [x_1, x_2, ..., x_k], where x_k denotes the k-th image in the image pyramid and k denotes the total number of images in the pyramid, so as to adapt to face detection at different scales; the detection process consists of three network structures: the Proposal Network, the Refine Network, and the Output Network;
Step 2.2, send the image pyramid [x_1, x_2, ..., x_k] obtained in step 2.1 into the first-layer network, the Proposal Network, which performs feature extraction and face-box calibration; whether a region is a face is then judged through three convolution layers, a face classifier, bounding-box regression, and a facial key-point locator, and a set of images that may contain faces, [x_1, x_2, ..., x_g], is finally output, where x_g denotes the g-th image and g denotes the total number of images that may contain faces; the generated candidate boxes are filtered by the Non-Maximum Suppression (NMS) algorithm;
Step 2.3, send the output of step 2.2 into the second-layer network, the Refine Network; compared with the Proposal Network, this network has one additional fully connected layer, so its feature screening is stricter and most of the poor candidate images can be filtered out; the candidate boxes remaining after screening are then further located with respect to key feature points and face regions through a key-feature-point locator and bounding-box regression, and optimized with the non-maximum suppression algorithm, yielding the high-precision screening and face-region refinement output [x_1, x_2, ..., x_j], where x_j denotes the j-th image in the output and j denotes the total number of output images;
Step 2.4, send the output obtained in step 2.3 into the third-layer network, the Output Network; the image features [x_1, x_2, ..., x_j] pass through four convolution layers and a fully connected layer and, with simultaneous face-region bounding-box regression and facial key-point localization, the final MTCNN output images x_input = [x_1, x_2, ..., x_n] are obtained, where x_n denotes the n-th output image and n denotes the total number of output images.
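As an illustrative sketch (not the implementation of the invention), the MTCNN preprocessing of step 2 can be approximated with the third-party facenet-pytorch package, whose MTCNN class provides a comparable P-Net/R-Net/O-Net cascade with NMS filtering and landmark-based alignment; the package choice and the 224 × 224 crop size are assumptions.

```python
from PIL import Image
from facenet_pytorch import MTCNN  # assumed third-party package, not the patented code

# image_size=224 is an assumed crop size; MTCNN performs detection, NMS filtering
# and 5-landmark alignment internally, broadly matching steps 2.1-2.4.
mtcnn = MTCNN(image_size=224, margin=0, post_process=True)

def preprocess(image_paths):
    """Detect, align and crop one face per image; images with no detected face
    are skipped (a simplification of the described pipeline)."""
    faces = []
    for path in image_paths:
        face = mtcnn(Image.open(path).convert("RGB"))  # tensor (3, 224, 224) or None
        if face is not None:
            faces.append(face)
    return faces
```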
Step 3, input the preprocessed images x_input into a residual attention network; the trunk branch extracts features M_i,c(x) of different receptive fields, and the mask branch learns attention weights T_i,c(x); finally, the trunk-branch output and the mask-branch output are combined through a dot-product operation to obtain the attention output feature map H_i,c(x); the network structure is shown in Fig. 2, and the network architecture parameters are shown in Table 1 below.
Table 1. Residual attention network model parameters
[The contents of Table 1 are provided only as an image in the original publication and are not reproduced here.]
The step 3 is as follows:
Step 3.1, initialize the parameters of the entire network architecture, i.e. the weights and biases, covering all convolution layers, pooling layers, and fully connected layers in the network;
Step 3.2, send the preprocessed images into the trunk branch and the mask branch, respectively; the trunk branch outputs the features M_i,c(x) of different receptive fields, and the mask branch outputs the learned attention weights T_i,c(x); the specific process is as follows:
Trunk branch: the image features x_input pass through two convolution layers with 3 × 3 kernels, and the output results are normalized to obtain the features M_i,c(x) of different receptive fields;
let the convolution-layer output of the l-th layer be z_l; the final output o_l is obtained through the normalization operation BN and the activation function ReLU, calculated as:
o_l = ReLU(BN(z_l)) = ReLU(BN(W·o_(l-1) + b))   (1)
where W and b denote the weights and bias, respectively, l = 1, 2, z_l denotes the convolution-layer output of the l-th layer, and o_(l-1) denotes the final output of the (l-1)-th layer;
BN normalizes the result of the convolution layer, calculated as:
x_BN = (x_i − μ_l) / σ_l   (2)
in equation (2), x_i denotes any individual sample image feature of the image features x_input, x_BN is the output feature after normalization, σ_l is the standard-deviation image of the l-th layer image features, and μ_l is the mean image of the l-th layer image features;
σ_l and μ_l in equation (2) are defined as:
σ_l = sqrt( (1/k) Σ_k (x_k − μ_l)² )   (3)
μ_l = (1/k) Σ_k x_k   (4)
in equations (3) and (4), x_k denotes a sample image feature of x_input, k denotes the number of samples in each mini-batch, and k ≥ 1;
Mask branch: the image features x_input pass through two pooling operations and two upsampling operations, and the output is mapped into the range 0 to 1 by a sigmoid function, giving the learned attention weights T_i,c(x);
Step 3.3, take the dot product of the trunk-branch output M_i,c(x) and the mask-branch output T_i,c(x), introducing a residual mechanism in the process to obtain the residual attention output, calculated as:
H_i,c(x) = (1 + M_i,c(x)) * T_i,c(x)   (5)
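A minimal sketch of equations (1)-(5) in PyTorch is given below; the channel width, the use of max pooling in the mask branch, and bilinear upsampling are assumptions not fixed by the description, and the input spatial size is assumed to be divisible by 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttentionBlock(nn.Module):
    """Sketch of equations (1)-(5): a trunk branch of two 3x3 conv + BN + ReLU
    layers and a mask branch with two poolings, two upsamplings and a sigmoid."""

    def __init__(self, channels=64):
        super().__init__()

        def conv_bn_relu(c):
            return nn.Sequential(
                nn.Conv2d(c, c, kernel_size=3, padding=1),  # eq. (1): W*o + b
                nn.BatchNorm2d(c),                          # eq. (2)-(4): BN
                nn.ReLU(inplace=True))

        self.trunk = nn.Sequential(conv_bn_relu(channels), conv_bn_relu(channels))
        self.mask = nn.Sequential(conv_bn_relu(channels), conv_bn_relu(channels))

    def forward(self, x):                  # x: (B, C, H, W), H and W divisible by 4
        m = self.trunk(x)                  # M_i,c(x): trunk features
        t = F.max_pool2d(x, 2)             # first pooling
        t = self.mask(t)
        t = F.max_pool2d(t, 2)             # second pooling
        t = F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        t = F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        t = torch.sigmoid(t)               # T_i,c(x): attention weights in [0, 1]
        return (1 + m) * t                 # H_i,c(x) = (1 + M_i,c(x)) * T_i,c(x), eq. (5)
```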
step 4, outputting the attention output characteristic diagram H obtained in the step 3i,c(x) Respectively sending the data into a global second-order pooling network GSoP based on a channel and a global second-order pooling network based on a space position, and outputting a dependency relationship Z between characteristic graphs by the global second-order pooling network based on the channeltransSpatial position-based dependency Z between spatial positions in a global second-order pooling network output profilenon-local(ii) a The network architecture parameters are shown in table 2 below.
Table 2. GSoP network model parameters
[The contents of Table 2 are provided only as an image in the original publication and are not reproduced here.]
The step 4 is as follows:
Step 4.1, first feed the residual attention feature H_i,c(x) obtained in step 3 into the channel-based global second-order pooling network to learn the dependency Z_trans between feature maps; the network structure is shown in Fig. 3.
As shown in Fig. 3, the output feature map of the residual attention network has size h' × w' × c'; the feature map input to the GSoP network first undergoes channel dimensionality reduction through a 1 × 1 convolution, yielding a feature map of size h' × w' × c, where h', w', c', and c are, respectively, the height, width, number of input channels, and number of channels after dimensionality reduction of the feature map;
a channel-wise covariance matrix of size c × c is then obtained through the second-order pooling operation; the i-th row of the covariance matrix represents the correlation, or dependency, between the i-th feature channel of the residual attention network output feature map and all channels;
the c × c two-dimensional covariance matrix is reshaped into a three-dimensional tensor of size 1 × c × c, and row-by-row convolution is performed on the reshaped covariance matrix, i.e. each row of the covariance matrix is treated as a group in a grouped convolution, with output size 1 × c × 4c; a 1 × 1 convolution is then applied, with output size 1 × 1 × c', followed by a sigmoid activation layer, giving a weight vector of size 1 × c';
each feature channel of the feature map input to the GSoP network is multiplied by the element at the corresponding position of the 1 × c' weight vector, i.e. each feature channel output by the residual attention network is given a different degree of attention.
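A rough PyTorch sketch of such a channel-based GSoP block follows; the reduced channel count and the grouped-convolution layout are assumptions, so this is an approximation rather than the patented configuration.

```python
import torch
import torch.nn as nn

class ChannelGSoP(nn.Module):
    """Sketch of step 4.1: 1x1 reduction, channel covariance, row-wise grouped
    convolution, 1x1 expansion, sigmoid, and channel re-weighting."""

    def __init__(self, in_channels, reduced=128):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)      # c' -> c
        # row-by-row grouped convolution over the c x c covariance, 4 outputs per row
        self.row_conv = nn.Conv2d(reduced, 4 * reduced,
                                  kernel_size=(reduced, 1), groups=reduced)
        self.expand = nn.Conv2d(4 * reduced, in_channels, kernel_size=1)  # back to c'

    def forward(self, x):                          # x: (B, c', h', w')
        b = x.shape[0]
        z = self.reduce(x)                         # (B, c, h', w')
        c = z.shape[1]
        z = z.flatten(2)                           # (B, c, h'*w')
        z = z - z.mean(dim=2, keepdim=True)        # center each channel
        cov = torch.bmm(z, z.transpose(1, 2)) / z.shape[2]   # channel covariance (B, c, c)
        y = self.row_conv(cov.view(b, c, c, 1))    # (B, 4c, 1, 1)
        weights = torch.sigmoid(self.expand(y))    # (B, c', 1, 1) channel attention
        return x * weights                         # re-weight each input channel
```

For example, ChannelGSoP(in_channels=256) applied to a tensor of shape (B, 256, 14, 14) returns a tensor of the same shape with re-weighted channels.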
Step 4.2, then feed the residual attention feature H_i,c(x) obtained in step 3 into the spatial-position-based GSoP network model to learn the dependency Z_non-local between spatial positions in the feature map; the network structure is shown in Fig. 4.
As shown in Fig. 4, the output feature map of the residual attention network has size h' × w' × c'; the feature map input to the GSoP network first undergoes channel dimensionality reduction through a 1 × 1 convolution, yielding a feature map of size h' × w' × c, where h', w', c', and c are, respectively, the height, width, number of input channels, and number of channels after dimensionality reduction of the feature map;
the channel-reduced feature map is downsampled, reducing its size to h × w × c;
a position-wise covariance matrix of size hw × hw is then obtained through the second-order pooling operation; the i-th row of the covariance matrix represents the correlation, or dependency, between the i-th spatial position and all spatial positions in the residual attention network output feature map;
the hw × hw two-dimensional covariance matrix is reshaped into a three-dimensional tensor of size 1 × hw × hw, which facilitates the subsequent row-by-row convolution; row-by-row convolution is performed with each row of the reshaped covariance matrix as a group, the grouped convolution giving an output of size 1 × hw × 4hw; a 1 × 1 convolution and a sigmoid function are then applied, giving an output of size 1 × 1 × hw, which is reshaped into a new weight matrix of size h × w × 1;
the h × w × 1 weight matrix is reshaped into a weight matrix of size h' × w' × 1 through upsampling;
the feature map input to the GSoP network is multiplied by the corresponding spatial-position weights in the weight matrix, emphasizing or suppressing the spatial-position features in the residual attention network output feature map.
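Similarly, the spatial-position-based GSoP block of step 4.2 can be sketched as follows; the reduced channel count and the down-sampled spatial size (size × size, standing in for h × w) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGSoP(nn.Module):
    """Sketch of step 4.2: 1x1 reduction, spatial down-sampling, position-wise
    covariance, row-wise grouped convolution, sigmoid, upsampling, re-weighting."""

    def __init__(self, in_channels, reduced=128, size=8):
        super().__init__()
        self.size = size
        hw = size * size
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)  # c' -> c
        # row-by-row grouped convolution over the hw x hw covariance
        self.row_conv = nn.Conv2d(hw, 4 * hw, kernel_size=(hw, 1), groups=hw)
        self.expand = nn.Conv2d(4 * hw, hw, kernel_size=1)

    def forward(self, x):                              # x: (B, c', h', w')
        b, _, h, w = x.shape
        hw = self.size * self.size
        z = self.reduce(x)                             # (B, c, h', w')
        z = F.adaptive_avg_pool2d(z, self.size)        # down-sample to (B, c, h, w)
        z = z.flatten(2).transpose(1, 2)               # (B, hw, c): one row per position
        z = z - z.mean(dim=2, keepdim=True)            # center each position
        cov = torch.bmm(z, z.transpose(1, 2)) / z.shape[2]   # position covariance (B, hw, hw)
        y = self.row_conv(cov.view(b, hw, hw, 1))      # (B, 4hw, 1, 1)
        weights = torch.sigmoid(self.expand(y))        # (B, hw, 1, 1) position attention
        weights = weights.view(b, 1, self.size, self.size)
        weights = F.interpolate(weights, size=(h, w),  # upsample to h' x w'
                                mode="bilinear", align_corners=False)
        return x * weights                             # emphasize / suppress positions
```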
Step 5, fuse the dependency Z_trans between feature maps and the dependency Z_non-local between spatial positions within the feature map to obtain the output feature Z_fusion;
Step 6, send the output feature Z_fusion obtained in step 5 into a two-stage multi-task learning network, and obtain the emotional state values Arousal and Valence using a linear regressor.
The step 6 is as follows:
the structure of the multitask learning diagram is shown in fig. 5.
Step 6.1, adopt hard parameter sharing in the multi-task learning; in the first stage of multi-task learning, a shared module first extracts general low-level features from the output feature Z_fusion obtained in step 5, and two branches are then used, respectively, to learn the classification representation Z_class and the dimensional representation Z_dim of the image; the two learned output features, classification and dimensional, are then concatenated into the feature Z_mtl-1;
Step 6.2, in the second stage of multi-task learning, apply linear regression to the feature Z_mtl-1 to obtain the output values Arousal and Valence;
the loss function adopted by the linear regressor is Tukey's biweight loss, defined as:
Loss = (c²/6) · [1 − (1 − (r_i/c)²)³]  if |r_i| ≤ c,  and  Loss = c²/6  otherwise   (6)
in equation (6), Loss denotes the loss value, c = 4.685 is a hyperparameter, r_i = y_i − ŷ_i is the residual, y_i denotes the real label value, and ŷ_i denotes the predicted label value.
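Equation (6) can be sketched directly as a loss function; averaging element-wise over the (Arousal, Valence) outputs is an assumption, and any residual rescaling that the original training procedure may apply is omitted.

```python
import torch

def tukey_biweight_loss(y_pred, y_true, c=4.685):
    """Sketch of equation (6); y_pred and y_true hold (Arousal, Valence) values."""
    r = y_true - y_pred                               # residuals y_i - y_hat_i
    inlier = (c ** 2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    outlier = torch.full_like(r, c ** 2 / 6.0)        # constant penalty for |r| > c
    return torch.where(r.abs() <= c, inlier, outlier).mean()
```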
Examples
The experiments of the invention are conducted on the AffectNet dataset. The Root Mean Square Error (RMSE) and the Concordance Correlation Coefficient (CCC) are computed from the label values predicted by the model and the original label values, and the results are then compared with existing methods to evaluate and analyze the performance of the invention.
The results of the experiment are shown in table 3:
Table 3. Performance comparison of different network models
Method      RMSE (Arousal)   RMSE (Valence)   CCC (Arousal)   CCC (Valence)
SVR         0.513            0.384            0.182           0.372
CNN         0.410            0.370            0.340           0.600
Proposed    0.366            0.317            0.556           0.603
As can be seen from Table 3, under the Root Mean Square Error (RMSE) metric, the method of the invention obtains 0.366 for Arousal, lower than the 0.513 and 0.410 obtained by the conventional SVR and CNN methods, respectively, and 0.317 for Valence, a reduction compared with the 0.384 and 0.370 obtained by the same two conventional methods. Under the Concordance Correlation Coefficient (CCC) metric, the proposed method obtains 0.556 for Arousal, an improvement over the 0.182 and 0.340 obtained by the above conventional methods, and 0.603 for Valence, an improvement over the corresponding 0.372 and 0.600.
This analysis shows that the facial emotion recognition method combining an attention mechanism with high-order feature representation outperforms the traditional methods, verifying the effectiveness of modeling the dependencies between long-distance features and the improvement in nonlinear characterization capability.
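For reference, the two reported metrics can be computed as sketched below (an illustrative NumPy sketch, not the evaluation code used for Table 3).

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root Mean Square Error between predicted and original label values."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def ccc(y_pred, y_true):
    """Concordance Correlation Coefficient between predictions and labels."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    mu_p, mu_t = y_pred.mean(), y_true.mean()
    var_p, var_t = y_pred.var(), y_true.var()
    cov = np.mean((y_pred - mu_p) * (y_true - mu_t))
    return float(2 * cov / (var_p + var_t + (mu_p - mu_t) ** 2))
```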

Claims (6)

1. A facial emotion recognition method integrating an attention mechanism and high-order feature representation, characterized by comprising the following steps:
Step 1, collect target images and divide them into a training sample set x_train and a test sample set x_test;
Step 2, read the original emotion annotation value of each sample image in the training sample set x_train, send each sample image in x_train into the multi-task cascaded convolutional neural network MTCNN, and complete face alignment based on face detection and 5 key feature points to obtain the output images x_input = [x_1, x_2, ..., x_n], where x_n denotes the n-th output image and n denotes the total number of output images, i.e. the total number of images in the training sample set x_train;
Step 3, input the preprocessed images x_input into a residual attention network; the trunk branch extracts features M_i,c(x) of different receptive fields, and the mask branch learns attention weights T_i,c(x); finally, the trunk-branch output and the mask-branch output are combined through a dot-product operation to obtain the attention output feature map H_i,c(x);
Step 4, send the attention output feature map H_i,c(x) obtained in step 3 into a channel-based global second-order pooling (GSoP) network and a spatial-position-based global second-order pooling network, respectively; the channel-based network outputs the dependency Z_trans between feature maps, and the spatial-position-based network outputs the dependency Z_non-local between spatial positions in the feature map;
Step 5, fuse the dependency Z_trans between feature maps and the dependency Z_non-local between spatial positions within the feature map to obtain the output feature Z_fusion;
Step 6, send the output feature Z_fusion obtained in step 5 into a two-stage multi-task learning network, and obtain the emotional state values Arousal and Valence using a linear regressor.
2. The facial emotion recognition method integrating an attention mechanism and high-order feature representation according to claim 1, wherein in step 1,
for the training sample set x_train, the sample images form an n × h × w tensor x_train = [(h_1,w_1), (h_2,w_2), ..., (h_n,w_n)], where n denotes the total number of samples in the training set and h and w denote the height and width of each sample image, respectively; the original emotion annotation values of the samples form an n × 2 vector y_a,v = [(a_1,v_1), (a_2,v_2), ..., (a_n,v_n)], where (a_n,v_n) denote the Arousal and Valence labels of the n-th sample image in the training set x_train;
for the test sample set x_test, the sample images form an m × h × w tensor x_test = [(h_1,w_1), (h_2,w_2), ..., (h_m,w_m)], where m denotes the total number of samples in the test set and h and w denote the height and width of each sample image, respectively; the original emotion annotation values of the samples form an m × 2 vector y_a,v = [(a_1,v_1), (a_2,v_2), ..., (a_m,v_m)], where (a_m,v_m) denote the Arousal and Valence labels of the m-th sample image in the test set x_test.
3. The facial emotion recognition method integrating an attention mechanism and high-order feature representation according to claim 2, wherein step 2 is specifically as follows:
Step 2.1, read each sample image (h_l, w_l) in the training sample set x_train, where l = 1, 2, ..., n and n denotes the total number of images in x_train; each input sample image is then scaled at different ratios to generate a series of detection boxes of different sizes, which are used to construct an image pyramid [x_1, x_2, ..., x_k], where x_k denotes the k-th image in the image pyramid and k denotes the total number of images in the pyramid, so as to adapt to face detection at different scales; the detection process consists of three network structures: the Proposal Network, the Refine Network, and the Output Network;
Step 2.2, send the image pyramid [x_1, x_2, ..., x_k] obtained in step 2.1 into the first-layer network, the Proposal Network, which performs feature extraction and face-box calibration; whether a region is a face is then judged through three convolution layers, a face classifier, bounding-box regression, and a facial key-point locator, and a set of images that may contain faces, [x_1, x_2, ..., x_g], is finally output, where x_g denotes the g-th image and g denotes the total number of images that may contain faces; the generated candidate boxes are filtered by the non-maximum suppression NMS algorithm;
Step 2.3, send the output of step 2.2 into the second-layer network, the Refine Network; the candidate boxes remaining after screening are further located with respect to key feature points and face regions through a key-feature-point locator and bounding-box regression, and optimized with the non-maximum suppression algorithm, yielding the high-precision screening and face-region refinement output [x_1, x_2, ..., x_j], where x_j denotes the j-th image in the output and j denotes the total number of output images;
Step 2.4, send the output obtained in step 2.3 into the third-layer network, the Output Network; the image features [x_1, x_2, ..., x_j] pass through four convolution layers and a fully connected layer and, with simultaneous face-region bounding-box regression and facial key-point localization, the final MTCNN output images x_input = [x_1, x_2, ..., x_n] are obtained, where x_n denotes the n-th output image and n denotes the total number of output images.
4. The facial emotion recognition method integrating an attention mechanism and high-order feature representation according to claim 3, wherein step 3 is specifically as follows:
Step 3.1, initialize the parameters of the entire network architecture, i.e. the weights and biases, covering all convolution layers, pooling layers, and fully connected layers in the network;
Step 3.2, send the preprocessed images into the trunk branch and the mask branch, respectively; the trunk branch outputs the features M_i,c(x) of different receptive fields, and the mask branch outputs the learned attention weights T_i,c(x); the specific process is as follows:
Trunk branch: the image features x_input pass through two convolution layers with 3 × 3 kernels, and the output results are normalized to obtain the features M_i,c(x) of different receptive fields;
let the convolution-layer output of the l-th layer be z_l; the final output o_l is obtained through the normalization operation BN and the activation function ReLU, calculated as:
o_l = ReLU(BN(z_l)) = ReLU(BN(W·o_(l-1) + b))   (1)
where W and b denote the weights and bias, respectively, l = 1, 2, z_l denotes the convolution-layer output of the l-th layer, and o_(l-1) denotes the final output of the (l-1)-th layer;
BN normalizes the result of the convolution layer, calculated as:
x_BN = (x_i − μ_l) / σ_l   (2)
in equation (2), x_i denotes any individual sample image feature of the image features x_input, x_BN is the output feature after normalization, σ_l is the standard-deviation image of the l-th layer image features, and μ_l is the mean image of the l-th layer image features;
σ_l and μ_l in equation (2) are defined as:
σ_l = sqrt( (1/k) Σ_k (x_k − μ_l)² )   (3)
μ_l = (1/k) Σ_k x_k   (4)
in equations (3) and (4), x_k denotes a sample image feature of x_input, k denotes the number of samples in each mini-batch, and k ≥ 1;
Mask branch: the image features x_input pass through two pooling operations and two upsampling operations, and the output is mapped into the range 0 to 1 by a sigmoid function, giving the learned attention weights T_i,c(x);
Step 3.3, take the dot product of the trunk-branch output M_i,c(x) and the mask-branch output T_i,c(x), introducing a residual mechanism in the process to obtain the residual attention output, calculated as:
H_i,c(x) = (1 + M_i,c(x)) * T_i,c(x)   (5).
5. the method for facial emotion recognition based on attention mechanism and high-order feature representation fusion as claimed in claim 4, wherein the step 4 is specifically as follows:
step 4.1, firstly, the residual attention characteristic H obtained in the step 3i,c(x) Feeding into a channel-based global second-order pooling network to learn the dependency Z between feature mapstransLet the output characteristic diagram size of the residual attention network be h '× w' × c ', firstly perform channel dimensionality reduction on the characteristic diagram of the input GSoP network through a 1 × 1 convolution, and then obtain the characteristic diagram with the size of h' × w '× c', wherein h ', w', c are respectively the height, width and input of the characteristic diagramThe number of channels entering and the number of channels after dimensionality reduction;
obtaining a covariance matrix with the channel-by-channel size of c x c through second-order pooling operation, wherein the ith row of the covariance matrix represents the correlation or the dependency relationship between the ith characteristic channel of the residual attention network output characteristic diagram and all channels;
reconstructing the cxc two-dimensional covariance matrix into a three-dimensional tensor with the size of 1 xcxcxcxc, and performing line-by-line convolution on the reconstructed covariance matrix, namely performing grouping convolution on each line of the covariance matrix as a group, wherein the output size is 1 xc 4 c; then, performing 1 × 1 convolution, wherein the output size is 1 × 1 × c ', and then, performing sigmoid activation function layer to obtain a weight vector with the size of 1 × c';
multiplying each characteristic channel in the characteristic diagram of the input GSoP network by a corresponding position element in a weight vector of 1 × c', namely giving different attention degrees to each characteristic channel output by the residual attention network;
step 4.2, then the residual attention characteristic H obtained in the step 3i,c(x) Feeding into GSoP network model based on spatial position to learn dependence Z between spatial positions in characteristic diagramnon-localSetting the size of an output characteristic diagram of the residual attention network as h '× w' × c ', firstly, performing channel dimensionality reduction on the characteristic diagram of the input GSoP network through a 1 × 1 convolution, and then obtaining the characteristic diagram with the size of h' × w '× c', wherein h ', w', c are respectively the height, width, input channel number and channel number after dimensionality reduction of the characteristic diagram;
down-sampling the feature map subjected to channel dimension reduction, wherein the size of the feature map is reduced to h multiplied by w multiplied by c;
obtaining a covariance matrix with the position-by-position size of hw multiplied by hw' through second-order pooling operation, wherein the ith row of the covariance matrix represents the correlation or the dependency relationship between the ith spatial position and all spatial positions in the output characteristic diagram of the residual attention network;
reconstructing the two-dimensional covariance matrix of hw '× hw' into a three-dimensional tensor with the size of 1 × hw '× hw', performing row-by-row convolution by taking each row of the reconstructed covariance matrix as a group, and performing packet convolution to obtain the output size of 1 × hw '× 4 hw'; then, a 1 × 1 convolution and sigmoid function are carried out, the output size is 1 × hw × 4hw, and a new weight matrix with h × w × 1 output is obtained through reconstruction;
reconstructing a weight matrix with the size of h multiplied by w multiplied by 1 into a weight matrix of h 'multipliedby w' × 1 through upsampling;
and multiplying the input feature map of the GSoP by the corresponding spatial position feature in the weight matrix, and emphasizing or suppressing the spatial position feature in the output feature map of the residual attention network.
6. The facial emotion recognition method integrating an attention mechanism and high-order feature representation according to claim 5, wherein step 6 is as follows:
Step 6.1, adopt hard parameter sharing in the multi-task learning; in the first stage of multi-task learning, a shared module first extracts general low-level features from the output feature Z_fusion obtained in step 5, and two branches are then used, respectively, to learn the classification representation Z_class and the dimensional representation Z_dim of the image; the two learned output features, classification and dimensional, are then concatenated into the feature Z_mtl-1;
Step 6.2, in the second stage of multi-task learning, apply linear regression to the feature Z_mtl-1 to obtain the output values Arousal and Valence;
the loss function adopted by the linear regressor is Tukey's biweight loss, defined as:
Loss = (c²/6) · [1 − (1 − (r_i/c)²)³]  if |r_i| ≤ c,  and  Loss = c²/6  otherwise   (6)
in equation (6), Loss denotes the loss value, c = 4.685 is a hyperparameter, r_i = y_i − ŷ_i is the residual, y_i denotes the real label value, and ŷ_i denotes the predicted label value.
CN202111439715.3A 2021-11-30 2021-11-30 Facial emotion recognition method integrating attention mechanism and high-order feature representation Pending CN114170657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111439715.3A CN114170657A (en) 2021-11-30 2021-11-30 Facial emotion recognition method integrating attention mechanism and high-order feature representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111439715.3A CN114170657A (en) 2021-11-30 2021-11-30 Facial emotion recognition method integrating attention mechanism and high-order feature representation

Publications (1)

Publication Number Publication Date
CN114170657A true CN114170657A (en) 2022-03-11

Family

ID=80481645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111439715.3A Pending CN114170657A (en) 2021-11-30 2021-11-30 Facial emotion recognition method integrating attention mechanism and high-order feature representation

Country Status (1)

Country Link
CN (1) CN114170657A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115276784A (en) * 2022-07-26 2022-11-01 西安电子科技大学 Deep learning-based orbital angular momentum modal identification method
CN115276784B (en) * 2022-07-26 2024-01-23 西安电子科技大学 Deep learning-based orbital angular momentum modal identification method
CN117593593A (en) * 2024-01-18 2024-02-23 湖北工业大学 Image emotion classification method for multi-scale semantic fusion under emotion gain
CN117593593B (en) * 2024-01-18 2024-04-09 湖北工业大学 Image emotion classification method for multi-scale semantic fusion under emotion gain

Similar Documents

Publication Publication Date Title
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN111091045B (en) Sign language identification method based on space-time attention mechanism
CN108615010B (en) Facial expression recognition method based on parallel convolution neural network feature map fusion
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
US20190228268A1 (en) Method and system for cell image segmentation using multi-stage convolutional neural networks
CN107636691A (en) Method and apparatus for identifying the text in image
CN111696101A (en) Light-weight solanaceae disease identification method based on SE-Inception
CN106909938B (en) Visual angle independence behavior identification method based on deep learning network
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN112818764A (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN112070768A (en) Anchor-Free based real-time instance segmentation method
CN109977394A (en) Text model training method, text analyzing method, apparatus, equipment and medium
CN112766283A (en) Two-phase flow pattern identification method based on multi-scale convolution network
CN111008570B (en) Video understanding method based on compression-excitation pseudo-three-dimensional network
CN112668486A (en) Method, device and carrier for identifying facial expressions of pre-activated residual depth separable convolutional network
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN109508640A (en) A kind of crowd's sentiment analysis method, apparatus and storage medium
Tereikovskyi et al. The method of semantic image segmentation using neural networks
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
CN114170659A (en) Facial emotion recognition method based on attention mechanism
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN110210380A (en) The analysis method of personality is generated based on Expression Recognition and psychology test
CN112560668A (en) Human behavior identification method based on scene prior knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination