CN115393944A - Micro-expression identification method based on multi-dimensional feature fusion - Google Patents

Micro-expression identification method based on multi-dimensional feature fusion

Info

Publication number
CN115393944A
CN115393944A
Authority
CN
China
Prior art keywords
micro
layer
feature fusion
features
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211123830.4A
Other languages
Chinese (zh)
Inventor
张家波
徐光辉
甘海洋
黄钟玉
高洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211123830.4A priority Critical patent/CN115393944A/en
Publication of CN115393944A publication Critical patent/CN115393944A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174: Facial expression recognition
    • G06V 40/175: Static expression
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a micro-expression recognition method based on multi-dimensional feature fusion, which comprises four steps: image preprocessing, extraction of optical-flow feature components, construction of a multi-dimensional feature fusion network, and classification of micro-expressions using the trained model. To compensate for the facial detail information lost by the model, a feature fusion module is constructed; the shallow features it extracts are fused with the abstract features extracted by the dual-stream convolutional network and used jointly for model classification. For the extracted high-dimensional fused features, a channel attention module assigns different weights to the channels, so that the model focuses more on the channels with high contribution, further improving the accuracy of micro-expression recognition.

Description

Micro-expression identification method based on multi-dimensional feature fusion
Technical Field
The invention relates to the technical field of expression recognition, and in particular to a micro-expression recognition method based on multi-dimensional feature fusion.
Background
Micro-expressions are special expressions of very short duration and very low intensity that usually occur involuntarily when people attempt to mask their true feelings. Because a micro-expression typically stays on the face for only 1/25 to 1/3 of a second, most people find it difficult to notice. This particular kind of expression is thought to be related to the human self-defense mechanism and to reflect a person's real inner state. Accurately identifying micro-expressions helps people make appropriate judgments and decisions, so micro-expression recognition is extremely important. Micro-expression recognition is essentially an image classification problem. In recent years, with the development of deep learning, researchers have begun to analyze and recognize micro-expressions with computers; the advantage is that once an accurate and stable model is trained, large-scale micro-expression recognition tasks can be processed automatically, efficiently and at low cost, saving manpower and material resources.
Document 1 [Gan Y.S., Liong S.T., Yau W.C., et al. OFF-ApexNet on micro-expression recognition system [J]. Signal Processing: Image Communication, 2019, 74: 129-139.] extracts the optical-flow feature components between the onset frame and the apex frame and feeds them into a convolutional neural network for recognition. On this basis, Document 2 [Jin Qiushi, Xu Huangchao, Liu Kunhong, et al. GA-ApexNet: Genetic algorithm in apex frame network for micro-expression recognition system [C]// Proceedings of the Journal of Physics: Conference Series. Suzhou: IOP Press, 2020, 1544(1): 012149.] introduces a genetic algorithm (GA) to reduce the dimensionality of the features learned by the model, retaining only the features beneficial to classification, which further improves the accuracy of the tested model. Document 3 [Nie Xuan, Takalkar M.A., Duan Mengyang, et al. GEME: Dual-stream multi-task gender-based micro-expression recognition [J]. Neurocomputing, 2021, 427.] and Document 4 [Zhou Ling, Mao Qirong, Xue Luoyang. Dual-inception network for cross-database micro-expression recognition [C]// Proceedings of the 2019 IEEE International Conference on Automatic Face and Gesture Recognition. Lille.] take different approaches. Document 3 introduces a multi-task learning method, assisting model classification with a network branch that detects gender. Document 4 follows the same overall idea as Document 1, extracting the optical-flow feature components of the onset frame and apex frame as input, but uses Inception blocks when constructing the model, i.e., convolution kernels of several sizes applied in parallel at the same layer.
Existing models pay insufficient attention to facial detail information, and their micro-expression recognition performance still needs to be improved.
Disclosure of Invention
The invention discloses a micro-expression recognition method based on multi-dimensional feature fusion, aiming to solve the problem that existing models pay insufficient attention to facial detail information. To compensate for the facial detail information lost by the model, a feature fusion module is constructed; the shallow features it extracts are fused with the abstract features extracted by the dual-stream convolutional network and used jointly for model classification. For the extracted high-dimensional fused features, a channel attention module assigns different weights to the channels, so that the model focuses more on the channels with high contribution, further improving the accuracy of micro-expression recognition.
In view of this, the technical scheme adopted by the invention is as follows: a micro-expression identification method based on multi-dimensional feature fusion, characterized by comprising the following steps:
Step one, image preprocessing, including image graying, face key point detection, face alignment and cropping, and scale normalization.
Step two, extracting the optical-flow horizontal component u and vertical component v between the onset frame and the apex frame using the iterative Lucas-Kanade algorithm.
Step three, constructing a multi-dimensional feature fusion network, specifically: the optical-flow horizontal component u and vertical component v are taken as the inputs of a dual-stream convolutional neural network; the backbone network adopts a symmetric structure, with convolutional layers in the first, third and fourth layers to extract texture information and edge features from the optical-flow features, and max pooling layers in the second and fifth layers to down-sample the input feature dimensions; the output of the first convolutional layer in the backbone is fused by a feature fusion module and then fused again with the fifth-layer max-pooling outputs of the two branches to obtain an output containing both facial detail information and abstract features; after the multi-dimensional features are fused, a channel attention module is introduced to assign different importance to different channels, highlighting the features useful for model classification and discrimination; two fully connected layers are then introduced, and the features output from the fully connected layers are passed to the output layer and classified by the softmax function.
Step four, classifying the micro-expressions using the model obtained by training.
The invention has the following beneficial technical effects:
according to the invention, the horizontal component and the vertical component of the optical flow are used as the input of the model, the multi-dimensional feature fusion module is constructed, the extracted shallow feature of the facial details is fused with the abstract feature extracted by the double-flow convolution network, the channel attention module is introduced, and the recognition effect is good. Specifically, the method comprises the following steps:
(1) And (3) obtaining optical flow characteristic components by using a fast and robust iterative Lucas-Kanade (iLK) algorithm, wherein the optical flow characteristics can better highlight the slight movement of the face.
(2) According to the model of the double-current convolutional neural network, the horizontal component of the optical flow and the vertical component of the optical flow are learned and fused respectively, and the built network model is high in identification precision.
(3) The shallow layer features extracted by the feature fusion module are fused with the abstract features extracted by the double-current convolution network and are jointly used for model classification, so that the features are richer.
(4) And a channel attention mechanism is introduced at a proper position, different channels are endowed with different importance, and the characteristics useful for model classification and discrimination are highlighted.
Drawings
FIG. 1 is a multi-dimensional feature fusion model;
FIG. 2 is a feature fusion Module FFM;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
A micro-expression recognition method based on multi-dimensional feature fusion comprises four steps of image preprocessing, optical flow feature component extraction, multi-dimensional feature fusion network construction and micro-expression classification by using a model obtained through training.
Step one, image preprocessing
Compared with common expressions, micro-expressions have three remarkable characteristics: short duration, low intensity and local motion. Given these characteristics, image preprocessing is particularly important for micro-expression recognition; it specifically includes image graying, face key point detection, face alignment and cropping, and scale normalization.
Because the number of micro-expression samples is small, using color images in the experiments would cause problems such as difficulty in training; the original color images are therefore converted to grayscale using formula (1).
Gray(x,y)=0.299×R(x,y)+0.587×G(x,y)+0.114×B(x,y) (1)
where Gray(x, y) is the grayscale value of the pixel, x and y are the spatial position indexes of the target pixel, and R, G and B are the values of the three channels of the RGB image.
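As an illustrative sketch (not part of the claimed method), formula (1) can be applied to an RGB image array with NumPy as follows; the function name is chosen only for illustration:

```python
import numpy as np

def to_gray(rgb):
    """Apply formula (1) to an H x W x 3 RGB array and return the grayscale image."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    return 0.299 * r + 0.587 * g + 0.114 * b
```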
The original images in the micro-expression data sets contain a lot of background noise, including laboratory walls and the participants' headphones; only the subject's facial region helps the model recognize micro-expressions, so the invention aligns and crops the facial region.
Before the facial region is cropped, the invention detects the facial key points in the micro-expression data set with a method based on cascaded shape regression. Cascaded shape regression allows convenient data augmentation by adjusting the initialized shape, and has the advantages of high computational efficiency and strong generality.
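The patent does not name a specific landmark detector; one widely used cascaded-regression implementation is dlib's 68-point shape predictor (an ensemble-of-regression-trees model), sketched below under that assumption. The model file path is illustrative and must be downloaded separately:

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Pre-trained cascaded-regression (ERT) landmark model; the path is an assumption.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(gray):
    """Return a (68, 2) array of facial key points for the first face in a uint8 grayscale image."""
    faces = detector(gray, 1)           # upsample once to catch small faces
    shape = predictor(gray, faces[0])   # assumes at least one face was found
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
```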
From the located facial key points, the distance d between the two eyes and the angle θ between the line connecting the left-eye and right-eye coordinates and the horizontal direction are calculated using formulas (2) and (3).
d = √((x2 − x1)² + (y2 − y1)²) (2)

θ = arctan((y2 − y1) / (x2 − x1)) (3)

In the formulas, (x1, y1) are the coordinates of the inner canthus of the left eye, (x2, y2) are the coordinates of the inner canthus of the right eye, d is the distance between the two eyes, and θ is the angle between the line connecting the two eyes and the horizontal direction.
The midpoint (x0, y0) of the line between the two eyes is then computed from the eye coordinates using formula (4):

(x0, y0) = ((x1 + x2) / 2, (y1 + y2) / 2) (4)
And taking the center of the eye distance as a rotation center, and calculating the positions of the key points of the human face after rotation according to a formula (5):
x′ = (x − x0) × cosθ + (y − y0) × sinθ + x0
y′ = −(x − x0) × sinθ + (y − y0) × cosθ + y0 (5)
where x and y are the spatial position indexes of the target pixel before adjustment, and x′ and y′ are the spatial position indexes after adjustment. After face alignment, a face cropping operation is performed, typically cropping the participant's face region to a rectangle.
Finally, scale normalization is carried out. In the image preprocessing stage, all micro-expression images are uniformly resized to 112 × 112 using formula (6).
g′(x, y) = g(x × w / w′, y × h / h′) (6)
where g′ is the pixel value after image adjustment, g is the pixel value before adjustment, w and h are the width and height before adjustment, and w′ and h′ are the width and height after adjustment.
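A hedged sketch combining formulas (2)-(6): the eye angle and midpoint are computed from the detected eye corners, the image is rotated about that midpoint with OpenCV, and the result is resized to 112 × 112. The cropping step is omitted and the helper name is illustrative:

```python
import cv2
import numpy as np

def align_and_resize(gray, left_eye, right_eye, size=112):
    """Rotate the face so the eye line is horizontal (formulas (2)-(5)), then resize (formula (6)).

    left_eye, right_eye: (x, y) inner-canthus coordinates of the left and right eye.
    """
    x1, y1 = left_eye
    x2, y2 = right_eye
    theta = np.degrees(np.arctan2(y2 - y1, x2 - x1))    # formula (3), in degrees
    center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)         # formula (4)
    rot = cv2.getRotationMatrix2D(center, theta, 1.0)   # rotation about the eye midpoint, formula (5)
    aligned = cv2.warpAffine(gray, rot, gray.shape[::-1])
    # A rectangular crop around the facial landmarks would normally be taken here.
    return cv2.resize(aligned, (size, size))            # scale normalization, formula (6)
```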
Step two, extracting characteristic components of the optical flow
Optical flow can extract representative motion features. After the image preprocessing of step one, the invention extracts the optical-flow horizontal component u and vertical component v between the onset frame and the apex frame using the iterative Lucas-Kanade (iLK) algorithm. The onset frame is the moment the micro-expression appears; the apex frame is the moment the micro-expression reaches its maximum amplitude and is the frame with the richest facial information.
The main idea of the optical flow method is to compute, from two adjacent frames, how the same pixel moves over time, thereby obtaining its motion between the two frames; optical flow is therefore often used in image processing to describe moving objects. The iterative Lucas-Kanade (iLK) solver is applied at each level of an image pyramid and is a fast and robust alternative to the TV-L1 algorithm.
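scikit-image provides an iterative Lucas-Kanade solver, optical_flow_ilk, which can illustrate this step; taking the onset frame as the reference image is an assumption:

```python
from skimage.registration import optical_flow_ilk

def extract_flow(onset, apex):
    """Return the horizontal (u) and vertical (v) optical-flow components between two frames."""
    # optical_flow_ilk returns displacements along axis 0 (rows) then axis 1 (columns).
    v, u = optical_flow_ilk(onset, apex)
    return u, v
```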
Step three, constructing a multi-dimensional feature fusion model
As shown in FIG. 1, the multi-dimensional feature fusion model takes the optical-flow horizontal component u and vertical component v as the inputs of a dual-stream convolutional neural network. The backbone adopts a symmetric structure: convolutional layers are used in the first, third and fourth layers to extract texture information and edge features from the optical-flow features; all convolution kernels in the model are 5 × 5 with a stride of 1, the numbers of kernels in the convolutional layers are 64, 128 and 128 respectively, and the ReLU function is used as the activation function. The second and fifth layers of the backbone use max pooling to down-sample the input feature dimensions and halve the parameters; the pooling window is 2 × 2 with the stride set to 1.
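A hedged Keras sketch of one branch of this backbone (conv / pool / conv / conv / pool, 5 × 5 kernels, 64/128/128 filters, ReLU). The 'same' padding and a pooling stride of 2, so that each pooling actually halves the spatial size, are assumptions not fixed by the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_branch(name):
    """One branch of the dual-stream backbone, following the layer order described above."""
    return tf.keras.Sequential([
        layers.Conv2D(64, 5, strides=1, padding="same", activation="relu"),   # layer 1: conv
        layers.MaxPooling2D(pool_size=2),                                     # layer 2: max pooling
        layers.Conv2D(128, 5, strides=1, padding="same", activation="relu"),  # layer 3: conv
        layers.Conv2D(128, 5, strides=1, padding="same", activation="relu"),  # layer 4: conv
        layers.MaxPooling2D(pool_size=2),                                     # layer 5: max pooling
    ], name=name)
```

Two such branches, one for the u component and one for the v component, would be instantiated to form the dual-stream structure.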
The output of the first convolutional layer in the backbone is fused by the Feature Fusion Module (FFM) and then fused again with the fifth-layer max-pooling output features of the two branches to obtain an output containing both facial detail information and abstract features, which is used for the final classification decision of the model.
After the multi-dimensional feature fusion, the output feature dimension is relatively high, and feature maps that contribute little to classification interfere with the model's judgment and affect the final classification accuracy. The invention introduces a channel attention module (SENet) to solve this problem, assigning different importance to different channels and highlighting the features useful for model classification and discrimination. Two fully connected layers are then introduced to reduce the number of parameters in the model and prevent overfitting; each layer has 256 nodes. The features output from the fully connected layers are passed to the output layer and classified by the softmax function. The detailed parameters of the model are shown in Table 1.
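A minimal sketch of an SENet-style channel attention block and the 256-node classification head in Keras; the reduction ratio and the flattening before the dense layers are assumptions:

```python
from tensorflow.keras import layers

def se_block(x, ratio=8):
    """Squeeze-and-excitation: learn a weight per channel and rescale the feature map."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                      # squeeze: per-channel statistics
    s = layers.Dense(channels // ratio, activation="relu")(s)   # excitation bottleneck
    s = layers.Dense(channels, activation="sigmoid")(s)         # per-channel weights in (0, 1)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                            # reweight channels by importance

def classification_head(x, num_classes):
    """Two 256-node fully connected layers followed by a softmax output layer."""
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    return layers.Dense(num_classes, activation="softmax")(x)
```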
Specifically, the probability of being classified as the i-th expression is:
p(l = i | x) = exp(x_i) / Σ_{n=1}^{L} exp(x_n) (7)

where p, x, l, L and y are respectively the predicted probability output by the model, the input image, the predicted emotion label, the total number of emotion categories and the true label; x_i and x_n denote the output values of the i-th and n-th output nodes.
The invention adopts the Adam optimizer to update and optimize the model parameters, and the loss function is the classical cross-entropy loss (CELoss), calculated as:

CELoss = −Σ_i y_i × log(p_i) (8)

where y_i is the label, y_i ∈ {0, 1}, and p_i is the predicted value output by the model.
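For illustration, the cross-entropy loss of formula (8) and the Adam optimizer correspond to the following Keras objects; the numeric example only shows how CELoss behaves for a one-hot label:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()        # CELoss of formula (8)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)   # learning rate from the training setup below

y_true = tf.constant([[0.0, 1.0, 0.0]])    # one-hot true label y_i
y_pred = tf.constant([[0.1, 0.8, 0.1]])    # model output p_i
print(float(loss_fn(y_true, y_pred)))       # -log(0.8), approximately 0.223
```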
TABLE 1 Model parameter table (presented as an image in the original document)
The FFM module is described in detail below.
The detailed structure of the FFM designed by the invention is shown in figure 2.
In FIG. 2, Conv denotes a convolutional layer and M denotes a max pooling layer. Suppose the input feature is χ_m, i.e. the output of the m-th layer of the network, containing features of c channels; the input feature can then be expressed as

χ_m = [χ_m^1, χ_m^2, …, χ_m^c]
Before fusion, a one-dimensional convolution operation is carried out on the two features from the shallow layer to unify their dimensions; the transformed feature can be expressed as

χ_m′^c = w ⊗ χ_m^c + b (9)

where χ_m and χ_m′ are the input and output features, c is the channel index of the input and output features, w is the convolution kernel weight, b is the bias term, and χ_m^c denotes the input feature of a given channel.
After the one-dimensional convolution operation, the two feature maps are concatenated along the channel dimension, i.e.

χ_fuse = [χ_m′, χ_n′] (10)

where [·, ·] denotes concatenation along the channel dimension and χ_n′ denotes the output of the lower branch of the FFM after its convolutional layer (see FIG. 2).
The fused high-dimensional features are then activated with the ReLU function and fed into a max pooling layer to reduce the feature dimensions and extract more representative features.
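A hedged Keras sketch of the FFM as read from FIG. 2 and formulas (9)-(10), interpreting the one-dimensional convolution as a 1 × 1 convolution; the unified channel count is an assumption:

```python
from tensorflow.keras import layers

def ffm(x_upper, x_lower, channels=64):
    """Feature fusion module sketch: unify channels, concatenate, activate, pool.

    x_upper, x_lower: shallow feature maps from the first convolutional layer of each branch.
    """
    a = layers.Conv2D(channels, 1)(x_upper)          # formula (9): unify channel dimensions
    b = layers.Conv2D(channels, 1)(x_lower)
    fused = layers.Concatenate(axis=-1)([a, b])      # formula (10): channel-wise concatenation
    fused = layers.Activation("relu")(fused)         # ReLU activation
    return layers.MaxPooling2D(pool_size=2)(fused)   # down-sample to keep representative features
```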
Step four, using the trained model to classify the micro-expression
The Leave-One-Subject-Out (LOSO) protocol is adopted on the CASME II data set: if the data set contains samples from K participants, one participant's image data is used as the test set and the data of the remaining K−1 participants is used as the training set, so that each participant's samples serve as the test set exactly once.
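A sketch of the LOSO splitting loop; the array-based data layout and the subject-ID vector are assumptions:

```python
import numpy as np

def loso_splits(features, labels, subject_ids):
    """Yield (train_x, train_y, test_x, test_y): each subject is the test set exactly once."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test = subject_ids == subject
        yield features[~test], labels[~test], features[test], labels[test]
```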
The deep learning frameworks used are TensorFlow 2.4.0 and Keras 2.4.3, with a learning rate of 0.0001, a batch size of 32 and 100 training epochs.
The unweighted F1-score (UF1) and unweighted average recall (UAR) of the test results are calculated using formulas (11)-(14):

UF1 = (1/C) × Σ_{i=1}^{C} F1_i (11)

wherein:

F1_i = 2 × TP_i / (2 × TP_i + FP_i + FN_i) (12)

UAR = (1/C) × Σ_{i=1}^{C} Acc_i (13)

wherein:

Acc_i = TP_i / (TP_i + FN_i) (14)

where C is the total number of emotion categories, and TP_i, TN_i, FP_i and FN_i denote the numbers of true positives, true negatives, false positives and false negatives of the i-th category respectively.
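UF1 and UAR of formulas (11)-(14) can be computed from a confusion matrix as in the following sketch (assuming every class occurs in the test data):

```python
import numpy as np

def uf1_uar(confusion):
    """Unweighted F1 (UF1) and unweighted average recall (UAR) from a C x C confusion matrix.

    confusion[i, j] = number of samples of true class i predicted as class j.
    """
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    f1 = 2 * tp / (2 * tp + fp + fn)    # formula (12), per class
    recall = tp / (tp + fn)             # formula (14), per class
    return f1.mean(), recall.mean()     # formulas (11) and (13)
```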
Compared with current methods under the leave-one-subject-out (LOSO) protocol on the CASME II micro-expression data set, the proposed method achieves good recognition performance; the UF1 and UAR comparison is shown in the following table.
Method UF1 UAR
LBP-TOP [1] 0.7026 0.7429
Bi-WOOF [2] 0.7805 0.8026
OFF-ApexNet [3] 0.8764 0.8681
GA-ApexNet [4] 0.8634 0.9232
MERFGR [5] 0.871 0.8798
MJFN [6] 0.9151 0.8871
MERASTC [7] 0.854 0.862
MERSiamC3D [8] 0.8818 0.8763
This example 0.9406 0.9502
The references referred to in the table above are as follows:
[1] Zhao Guoying, Pietikainen M. Dynamic texture recognition using local binary patterns with an application to facial expressions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 915-928.
[2] Liong S.T., See J., Wong K.S., et al. Less is more: Micro-expression recognition from video using apex frame [J]. Signal Processing: Image Communication, 2018, 62: 82-92.
[3] Gan Y.S., Liong S.T., Yau W.C., et al. OFF-ApexNet on micro-expression recognition system [J]. Signal Processing: Image Communication, 2019, 74: 129-139.
[4] Jin Qiushi, Xu Huangchao, Liu Kunhong, et al. GA-ApexNet: Genetic algorithm in apex frame network for micro-expression recognition system [C]// Proceedings of the Journal of Physics: Conference Series. Suzhou: IOP Press, 2020, 1544(1): 012149.
[5] Lei Ling, Chen Tong, Li Shigang, et al. Micro-expression recognition based on facial graph representation learning and facial action unit fusion [OL]. [2021-6-25]. https://openaccess.thecvf.com/content/CVPR2021W/AUVi/papers/Lei_Micro-Expression_Recognition_Based_on_Facial_Graph_Representation_Learning_and_Facial_CVPRW_2021_paper.pdf.
[6] Li Xinyu, Wei Guangshun, Wang Jie, et al. Multi-scale joint feature network for micro-expression recognition [J]. Computational Visual Media, 2021, 7(3): 407-417.
[7] Gupta P. MERASTC: Micro-expression recognition using effective feature encodings and 2D convolutional neural network [J]. IEEE Transactions on Affective Computing, doi: 10.1109/TAFFC.2021.3061967, 2021.
[8] Zhao Sirui, Tao Hanqing, Zhang Yangsong, et al. A two-stage 3D CNN based learning method for spontaneous micro-expression recognition [J]. Neurocomputing, 2021, 448: 276-289.

Claims (7)

1. A micro-expression identification method based on multi-dimensional feature fusion, characterized by comprising the following steps:
step one, image preprocessing, including image graying, face key point detection, face alignment and cropping, and scale normalization;
step two, extracting the optical-flow horizontal component u and vertical component v between the onset frame and the apex frame using the iterative Lucas-Kanade algorithm;
step three, constructing a multi-dimensional feature fusion network;
and step four, carrying out micro-expression classification by using the model obtained by training.
2. The micro-expression recognition method based on multi-dimensional feature fusion as claimed in claim 1, wherein: in step one, the facial key point detection adopts a cascaded shape regression-based method to detect the facial key points in the micro-expression data set.
3. The micro-expression recognition method based on multi-dimensional feature fusion as claimed in claim 2, wherein the face alignment and cropping in step one includes:
calculating the distance d between the two eyes and the included angle theta between the central coordinate connecting line of the left eye and the right eye and the horizontal direction by using a formula (1) and a formula (2);
d = √((x2 − x1)² + (y2 − y1)²) (1)

θ = arctan((y2 − y1) / (x2 − x1)) (2)

where (x1, y1) are the coordinates of the inner canthus of the left eye, (x2, y2) are the coordinates of the inner canthus of the right eye, d is the distance between the two eyes, and θ is the angle between the line connecting the two eyes and the horizontal direction;
then calculating, from the coordinates of the two eyes, the midpoint (x0, y0) of the line between the two eyes using formula (3):

(x0, y0) = ((x1 + x2) / 2, (y1 + y2) / 2) (3)
And taking the center of the eye distance as a rotation center, and calculating the positions of the key points of the human face after rotation according to a formula (4):
x′ = (x − x0) × cosθ + (y − y0) × sinθ + x0
y′ = −(x − x0) × sinθ + (y − y0) × cosθ + y0 (4)
where x and y are the spatial position indexes of the target pixel before adjustment, and x′ and y′ are the spatial position indexes after adjustment; after face alignment, a face cropping operation is performed.
4. The micro-expression recognition method based on multi-dimensional feature fusion according to claim 1, characterized in that: in step two, the onset frame is the moment when the micro-expression appears, and the apex frame is the moment when the micro-expression amplitude is maximum.
5. The micro-expression recognition method based on multi-dimensional feature fusion according to claim 1, characterized in that: step three, the multidimensional feature fusion network specifically comprises:
taking the optical-flow horizontal component u and vertical component v as the inputs of a dual-stream convolutional neural network, wherein the backbone network adopts a symmetric structure, convolutional layers are used in the first, third and fourth layers to extract texture information and edge features from the optical-flow features, and max pooling layers are used in the second and fifth layers to down-sample the input feature dimensions; the output of the first convolutional layer in the backbone network is fused by a feature fusion module and then fused again with the fifth-layer max-pooling output features of the two branches to obtain an output containing both facial detail information and abstract features; after the multi-dimensional features are fused, a channel attention module is introduced to assign different importance to different channels, highlighting the features useful for model classification and discrimination; two fully connected layers are then introduced, and the features output from the fully connected layers are passed to the output layer and classified by the softmax function.
6. The micro-expression recognition method based on multi-dimensional feature fusion as claimed in claim 5, wherein: the feature fusion module specifically comprises:
letting the input feature be χ_m, i.e. the m-th layer output of the network, containing features of c channels, the input feature can be expressed as

χ_m = [χ_m^1, χ_m^2, …, χ_m^c];
before fusion, a one-dimensional convolution operation is carried out on the two features from the shallow layer so that their dimensions are unified, and the transformed feature is expressed as

χ_m′^c = w ⊗ χ_m^c + b

where χ_m and χ_m′ are the input and output features, c is the channel index of the input and output features, w is the convolution kernel weight, b is the bias term, and χ_m^c denotes the input feature of a given channel;
after the one-dimensional convolution operation, the two feature maps are concatenated along the channel dimension, i.e.

χ_fuse = [χ_m′, χ_n′]

where [·, ·] denotes concatenation along the channel dimension and χ_n′ denotes the output of the lower branch of the FFM module after its convolutional layer;
the fused high-dimensional features are then activated using the ReLU function and input to the max pooling layer.
7. The micro-expression recognition method based on multi-dimensional feature fusion according to any one of claims 1-6, characterized in that: the micro-expressions are classified, and the probability of a micro-expression being classified as the i-th expression is:
p(l = i | x) = exp(x_i) / Σ_{n=1}^{L} exp(x_n)

where p, x, l, L and y are respectively the predicted probability output by the model, the input image, the predicted emotion label, the total number of emotion categories and the true label, and x_i, x_n denote the output values of the i-th and n-th output nodes.
CN202211123830.4A 2022-09-15 2022-09-15 Micro-expression identification method based on multi-dimensional feature fusion Pending CN115393944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211123830.4A CN115393944A (en) 2022-09-15 2022-09-15 Micro-expression identification method based on multi-dimensional feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211123830.4A CN115393944A (en) 2022-09-15 2022-09-15 Micro-expression identification method based on multi-dimensional feature fusion

Publications (1)

Publication Number Publication Date
CN115393944A true CN115393944A (en) 2022-11-25

Family

ID=84126963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211123830.4A Pending CN115393944A (en) 2022-09-15 2022-09-15 Micro-expression identification method based on multi-dimensional feature fusion

Country Status (1)

Country Link
CN (1) CN115393944A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189311A (en) * 2023-04-27 2023-05-30 成都愚创科技有限公司 Protective clothing wears standardized flow monitoring system
CN116824280A (en) * 2023-08-30 2023-09-29 安徽爱学堂教育科技有限公司 Psychological early warning method based on micro-expression change
CN116824280B (en) * 2023-08-30 2023-11-24 安徽爱学堂教育科技有限公司 Psychological early warning method based on micro-expression change

Similar Documents

Publication Publication Date Title
Yuan et al. Fingerprint liveness detection using an improved CNN with image scale equalization
CN115393944A (en) Micro-expression identification method based on multi-dimensional feature fusion
Kumar et al. Automatic live facial expression detection using genetic algorithm with haar wavelet features and SVM
CN112766158A (en) Multi-task cascading type face shielding expression recognition method
Setiowati et al. A review of optimization method in face recognition: Comparison deep learning and non-deep learning methods
CN107918772B (en) Target tracking method based on compressed sensing theory and gcForest
CN111126307B (en) Small sample face recognition method combining sparse representation neural network
CN112434732A (en) Deep learning classification method based on feature screening
Liu et al. Micro-expression recognition using advanced genetic algorithm
CN111476178A (en) Micro-expression recognition method based on 2D-3D CNN
CN112613479B (en) Expression recognition method based on light-weight streaming network and attention mechanism
CN114360071A (en) Method for realizing off-line handwritten signature verification based on artificial intelligence
Fueten et al. An artificial neural net assisted approach to editing edges in petrographic images collected with the rotating polarizer stage
Shanthi et al. Algorithms for face recognition drones
Lee et al. Face and facial expressions recognition system for blind people using ResNet50 architecture and CNN
Sakthimohan et al. Detection and Recognition of Face Using Deep Learning
Ma et al. Noise-against skeleton extraction framework and application on hand gesture recognition
CN111160327B (en) Expression recognition method based on lightweight convolutional neural network
Tunc et al. Age group and gender classification using convolutional neural networks with a fuzzy logic-based filter method for noise reduction
Karilingappa et al. Human emotion detection and classification using modified Viola-Jones and convolution neural network
Yu et al. Research on face recognition method based on deep learning
Abayomi-Alli et al. Facial image quality assessment using an ensemble of pre-trained deep learning models (EFQnet)
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture
CN114049675A (en) Facial expression recognition method based on light-weight two-channel neural network
CN113887509A (en) Rapid multi-modal video face recognition method based on image set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination