CN111950373A - Method for recognizing micro-expressions through transfer learning based on optical flow input - Google Patents

Method for recognizing micro-expressions through transfer learning based on optical flow input

Info

Publication number
CN111950373A
CN111950373A
Authority
CN
China
Prior art keywords
expression
micro
optical flow
layer
transfer learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010666988.0A
Other languages
Chinese (zh)
Other versions
CN111950373B (en)
Inventor
张立言
李星燃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202010666988.0A priority Critical patent/CN111950373B/en
Publication of CN111950373A publication Critical patent/CN111950373A/en
Application granted granted Critical
Publication of CN111950373B publication Critical patent/CN111950373B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for recognizing micro-expressions through transfer learning based on optical flow input, which comprises the following steps. Step 1: downloading a micro-expression data set, and aligning and normalizing it. Step 2: calculating an optical flow estimate for each micro-expression video in the data set processed in step 1 to obtain an optical flow sequence. Step 3: performing transfer learning from macro-expressions to micro-expressions using a facial-expression-based CNN model, taking the optical flow sequence obtained in step 2 as input and outputting spatio-temporal features, and training the network to finally realize micro-expression recognition. The method solves the overfitting problem caused by the small scale of micro-expression data; at the same time, the optical flow provides higher-level features than the raw data, further improving the performance of the model.

Description

Method for recognizing micro-expressions through transfer learning based on optical flow input
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a micro-expression recognition method realized by deep learning.
Background
Micro-expression recognition is a challenging task in the field of computer vision, because micro-expressions are suppressed facial expressions of short duration. In recent years, owing to potential applications in many fields such as clinical diagnosis, criminal investigation, and security systems, micro-expression recognition has attracted increasing attention from experts in different disciplines. A micro-expression is a special facial expression, defined as a rapid facial action that is not controlled by conscious effort; it can represent genuine emotion and can reveal the true emotion that a person is trying to hide. Facial expressions can be used to deceive others, but micro-expressions cannot.
Although micro-expressions are visually similar to ordinary facial expressions, they are brief and suppressed, which makes micro-expression recognition more challenging than facial expression recognition. In a psychological experiment in which subjects were trained on seven categories of micro-expressions using the Micro Expression Training Tool (METT), the average micro-expression recognition rate was only about 50%. The development of deep learning has made it possible to learn hierarchical features from large numbers of labeled images using Convolutional Neural Networks (CNNs). However, deep learning requires a large amount of data, and because micro-expression data are scarce, training a CNN model directly on micro-expression data is not feasible: due to the small scale of the data, such models overfit.
Disclosure of Invention
The invention aims to provide a method for micro-expression recognition through transfer learning based on optical flow input, so as to solve the overfitting problem caused by the small scale of the data when micro-expression recognition is performed directly on a micro-expression database.
In order to achieve the above object, the invention adopts the following technical solution:
a method for micro-expression recognition based on optical flow input transfer learning, comprising the steps of:
step 1: downloading a micro-expression data set, and aligning and normalizing it;
step 2: calculating an optical flow estimate for each micro-expression video in the data set processed in step 1 to obtain an optical flow sequence;
step 3: performing transfer learning from macro-expressions to micro-expressions using a facial-expression-based CNN model, taking the optical flow sequence obtained in step 2 as input and outputting spatio-temporal features, and training the network to finally realize micro-expression recognition.
In step 1, during alignment, 68 facial landmarks are detected in the first frame of each micro-expression video in the data set using an active shape model; the first frame of each video is then normalized against a reference template, and the subsequent frames of each video are registered to the first frame by a local weighted mean transformation. Normalization includes spatial-domain normalization, which crops the face region of all images to 96 × 112 pixels, and temporal-domain normalization, which uses linear interpolation to obtain a sufficient number of frames.
In step 2, in a segment of micro-expression video, let the intensity of the pixel at point (x, y) at time t be I(x, y, t); after a time interval Δt, i.e., at the next frame, the pixel has moved by (Δx, Δy) and its intensity is I(x + Δx, y + Δy, t + Δt). Based on the invariance of brightness over this short interval, we obtain

I(x, y, t) = I(x + \Delta x,\ y + \Delta y,\ t + \Delta t)    (1)

where Δx = uΔt and Δy = vΔt, and u(x, y) and v(x, y) are the horizontal and vertical components of the optical flow field to be estimated. Assuming that the pixel values in the micro-expression video are continuous functions of their position and time, a Taylor-series expansion of the right-hand side of equation (1) gives:

I(x + \Delta x,\ y + \Delta y,\ t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x} \Delta x + \frac{\partial I}{\partial y} \Delta y + \frac{\partial I}{\partial t} \Delta t + \varepsilon    (2)

where ε collects the terms of second order and above in Δt. Letting Δt tend to zero, dividing both sides of equation (2) by Δt and substituting equation (1) yields the optical flow equation:

\frac{\partial I}{\partial x} \frac{\Delta x}{\Delta t} + \frac{\partial I}{\partial y} \frac{\Delta y}{\Delta t} + \frac{\partial I}{\partial t} = 0    (3)

that is,

I_x u + I_y v + I_t = 0    (4)
in the step 3, a network for transfer learning from the macro expression to the micro expression is designed to realize the micro expression recognition function:
the source label space for transfer learning is:
y_S = {Neutral, Angry, Contempt, Disgust, Fear, Happy, Sad, Surprise}
the target label space is:
y_T = {Positive, Negative, Surprise, Others}
wherein Positive = {Happy}, Negative = {Angry, Contempt, Disgust, Fear, Sad}, Surprise = {Surprise}, and unclear facial movements belong to Others;
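By way of illustration only (not part of the claimed method), this macro-to-micro label correspondence can be written as a simple Python mapping; the assignment of Neutral, which the sets above leave unlisted, is an assumption here:

    # Sketch: mapping from the source (macro-expression) label space to the
    # target (micro-expression) label space defined above.
    SOURCE_TO_TARGET = {
        "Happy":    "Positive",
        "Angry":    "Negative",
        "Contempt": "Negative",
        "Disgust":  "Negative",
        "Fear":     "Negative",
        "Sad":      "Negative",
        "Surprise": "Surprise",
        "Neutral":  "Others",  # assumption: unlisted movements fall into Others
    }

    def to_target_label(source_label: str) -> str:
        # Unknown or ambiguous facial movements also fall back to Others.
        return SOURCE_TO_TARGET.get(source_label, "Others")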
the overall network structure is as follows:
input->conv_1->max-pool_1->conv_2->max-pool_2->conv_3->max-pool_3->fc_1->fc_2->lstm_1->lstm_2->lstm_3->fc_3->spatial_temporal feature
wherein input is the optical flow sequence obtained in step 2; conv_i (i = 1, 2, 3) denotes the i-th convolutional layer, and batch normalization is applied after the convolution operation of every convolutional layer except conv_1; max-pool_i (i = 1, 2, 3) denotes the i-th max-pooling layer; fc_i (i = 1, 2, 3) denotes the i-th fully connected layer, and the spatial feature representation is extracted from the fc_2 layer; lstm_i (i = 1, 2, 3) denotes the i-th LSTM layer; spatial_temporal feature denotes the spatio-temporal feature vector finally obtained through transfer learning; the output of each convolutional and fully connected layer is constrained by a ReLU nonlinear layer used as the activation function; after the first and second fully connected layers, a dropout layer mitigates overfitting of the feature vectors;
the objective function for learning the spatial feature representation is as follows:

L_1 = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}

wherein y_{i,k} denotes the ground truth of the i-th sample, equal to 1 if k is the correct class and 0 otherwise, and \hat{y}_{i,k} denotes the predicted probability of expression category k computed at the fully connected layer; the objective term L_1 makes samples of different expression classes separable in the feature space;

L_2 = \sum_{c} \sum_{p} \sum_{i} \max\left(0,\ \left\| f_{c,p,i} - m_c \right\|_2^2 - \delta_c \right)

wherein f_{c,p,i} denotes the spatial feature representation vector, extracted from the last layer, of the i-th training sample of class c in expression state p, and m_c denotes the mean feature vector of the class-c training samples, with the margin

\delta_c = \frac{1}{2} \min_{j \neq c} \left\| m_c - m_j \right\|_2

being half of the minimum distance between m_c and m_j when j ≠ c; the objective term L_2 reduces the influence of intra-class variation within the same expression class caused by factors such as the appearance of the subject;
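A minimal PyTorch sketch of this two-term objective follows; the combination weight lam, the margin form of L_2, and the use of per-batch estimates of the class means m_c are assumptions made for illustration:

    import torch
    import torch.nn.functional as F

    def spatial_feature_loss(logits, features, labels, lam=0.1):
        # logits:   (N, K) class scores from the last fully connected layer
        # features: (N, D) spatial features taken from the fc_2 layer
        # labels:   (N,)   ground-truth class indices
        # L1: softmax cross-entropy; separates different expression classes.
        l1 = F.cross_entropy(logits, labels)

        # Per-class mean feature vectors m_c, estimated over the batch.
        classes = labels.unique()
        means = torch.stack([features[labels == c].mean(dim=0) for c in classes])

        l2 = features.new_zeros(())
        for idx, c in enumerate(classes):
            others = torch.cat([means[:idx], means[idx + 1:]])
            if others.numel() == 0:
                continue
            # delta_c: half the minimum distance between m_c and the other means.
            delta = 0.5 * torch.cdist(means[idx:idx + 1], others).min()
            # Pull class-c features toward m_c, up to the margin delta_c.
            dists = (features[labels == c] - means[idx]).pow(2).sum(dim=1)
            l2 = l2 + torch.clamp(dists - delta, min=0).sum()

        return l1 + lam * l2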
the operations of the LSTM layers that learn the temporal feature representation are as follows:

g_{in,t}^{(l)} = \mathrm{sigm}\left( W_{in}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{in}^{(l)} \right)

g_{f,t}^{(l)} = \mathrm{sigm}\left( W_{f}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{f}^{(l)} \right)

g_{o,t}^{(l)} = \mathrm{sigm}\left( W_{o}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{o}^{(l)} \right)

\mathrm{cell}_{t}^{(l)} = g_{f,t}^{(l)} \odot \mathrm{cell}_{t-1}^{(l)} + g_{in,t}^{(l)} \odot \tanh\left( W_{\mathrm{cell}}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{\mathrm{cell}}^{(l)} \right)

h_{t}^{(l)} = g_{o,t}^{(l)} \odot \tanh\left( \mathrm{cell}_{t}^{(l)} \right)

wherein W_{*}^{(l)} and b_{*}^{(l)} denote the weights and biases of the l-th LSTM layer, and the subscripts in, f, o and cell denote the input gate, the forget gate, the output gate and the memory cell, respectively; g_{in,t}^{(l)} is the input gate, which determines how much of the network input at the current time t is stored in the cell state; g_{f,t}^{(l)} is the forget gate, which determines how much of the cell state of the previous time step is retained at the current time t; g_{o,t}^{(l)} is the output gate, which determines how much of the cell state is output to the current output value of the LSTM; \mathrm{cell}_{t}^{(l)} is the cell state at the current time t; h_{t}^{(l)} is the output of the l-th LSTM layer given the t-th input.
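For reference, the gate equations above correspond to the following minimal NumPy sketch of one time step of a single LSTM layer; the dictionary-of-weights layout is an illustrative assumption:

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # x_t:    h_t^(l-1), the input from the layer below at time t
        # h_prev: h_{t-1}^(l), the previous output of this layer
        # c_prev: cell_{t-1}^(l), the previous cell state
        # W, b:   dicts keyed by "in", "f", "o", "cell"; W[k] has shape
        #         (hidden, hidden + input) and b[k] has shape (hidden,)
        z = np.concatenate([h_prev, x_t])        # [h_{t-1}^(l), h_t^(l-1)]
        g_in = sigm(W["in"] @ z + b["in"])       # input gate
        g_f = sigm(W["f"] @ z + b["f"])          # forget gate
        g_o = sigm(W["o"] @ z + b["o"])          # output gate
        c_t = g_f * c_prev + g_in * np.tanh(W["cell"] @ z + b["cell"])
        h_t = g_o * np.tanh(c_t)                 # layer output h_t^(l)
        return h_t, c_t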
Advantageous effects: unlike the prior art, the present invention is innovative in two respects. On the one hand, the network input is not a micro-expression data sequence that has merely been aligned and normalized; instead, optical flow estimation is performed on the sequences in the original micro-expression data set, which makes the subtle motions in a micro-expression video more salient and yields high-level features. On the other hand, a convolutional neural network can learn the spatial image features of representative expression-state frames, but the sample size of micro-expression data is very small and direct training easily overfits; the method therefore performs transfer learning with a facial-expression-based CNN model: a pre-trained ImageNet_CNN is trained on the facial expression data set CK+, the trained network contains expression information shared with micro-expressions, and this information is transferred to the LSTM network to learn spatio-temporal features. The overfitting problem caused by the small scale of the data when performing micro-expression recognition directly on a micro-expression database is thereby solved.
Detailed Description
The present invention is explained further below.
A method for micro-expression recognition based on optical flow input transfer learning, comprising the steps of:
step 1, downloading a micro expression data set, and aligning and normalizing the micro expression data set; and acquiring CASMEII as a data set of the training model. Detecting 68 facial markers in a first frame of each micro expression video in the micro expression data set by using an Active Shape Model (ASM) during alignment, normalizing the first frame of each micro expression video according to a comparison template, and comparing subsequent frames in each micro expression video with the first frame through Local Weighted Mean (LWM) transformation; normalization includes spatial domain normalization, which cuts all images into 96 × 112 pixels in the face region, and temporal domain normalization, which uses a linear interpolation method to obtain a sufficient number of frames.
Step 2: calculating an optical flow estimate for each micro-expression video in the data set processed in step 1 to obtain an optical flow sequence. To ensure that the proposed architecture can obtain high-level features, the data input to the network are not the raw data but optical flow; optical flow has higher-level features than the raw data and has proven effective in micro-expression recognition.
In a segment of micro-expression video, let the intensity of the pixel at point (x, y) at time t be I(x, y, t); after a time interval Δt, i.e., at the next frame, the pixel has moved by (Δx, Δy) and its intensity is I(x + Δx, y + Δy, t + Δt). Based on the invariance of brightness over this short interval, we obtain

I(x, y, t) = I(x + \Delta x,\ y + \Delta y,\ t + \Delta t)    (1)

where Δx = uΔt and Δy = vΔt, and u(x, y) and v(x, y) are the horizontal and vertical components of the optical flow field to be estimated. Assuming that the pixel values in the micro-expression video are continuous functions of their position and time, a Taylor-series expansion of the right-hand side of equation (1) gives:

I(x + \Delta x,\ y + \Delta y,\ t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x} \Delta x + \frac{\partial I}{\partial y} \Delta y + \frac{\partial I}{\partial t} \Delta t + \varepsilon    (2)

where ε collects the terms of second order and above in Δt. Letting Δt tend to zero, dividing both sides of equation (2) by Δt and substituting equation (1) yields the optical flow equation:

\frac{\partial I}{\partial x} \frac{\Delta x}{\Delta t} + \frac{\partial I}{\partial y} \frac{\Delta y}{\Delta t} + \frac{\partial I}{\partial t} = 0    (3)

that is,

I_x u + I_y v + I_t = 0    (4)
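The text does not name a specific optical flow estimator; as one hedged possibility, the dense flow between consecutive normalized frames can be computed with OpenCV's Farneback method, where the algorithm choice and all parameter values below are illustrative assumptions:

    import cv2
    import numpy as np

    def optical_flow_sequence(frames):
        # frames: (T, H, W) uint8 grayscale micro-expression clip.
        # Returns (T-1, H, W, 2): per-pixel (u, v) between consecutive frames.
        flows = []
        for prev, curr in zip(frames[:-1], frames[1:]):
            flow = cv2.calcOpticalFlowFarneback(
                prev, curr, None,
                pyr_scale=0.5, levels=3, winsize=15,
                iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
            flows.append(flow)
        return np.stack(flows)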
and step 3: and (3) performing transfer learning from macro expression to micro expression by using a CNN model based on facial expression, inputting the optical flow sequence obtained in the step (2), outputting the optical flow sequence as a time-space characteristic, and training a network to finally realize a micro expression recognition function.
Macro-expressions and micro-expressions share common properties: they are similar when expressing emotion and therefore have similar texture information. As dynamic facial movements, they follow the same temporal pattern of onset, apex, and offset phases. These similar texture and temporal patterns make it possible to transfer learning from expressions to micro-expressions.
The goal of transfer learning is to transfer knowledge between related source and target domains.
A network for transfer learning from macro-expressions to micro-expressions is designed to realize the micro-expression recognition function:
the source label space for transfer learning is:
y_S = {Neutral, Angry, Contempt, Disgust, Fear, Happy, Sad, Surprise}
the target label space is:
y_T = {Positive, Negative, Surprise, Others}
wherein Positive = {Happy}, Negative = {Angry, Contempt, Disgust, Fear, Sad}, Surprise = {Surprise}, and unclear facial movements belong to Others. The source and target tasks are thus related, which further improves the performance of the transfer learning.
A deep CNN requires a large number of annotated samples, and the sample size of micro-expression data is very small compared with existing facial expression data; a deep CNN cannot guarantee good performance on data of such small scale. Therefore, a CNN pre-trained on ImageNet (ImageNet_CNN) is trained further on the facial expression data set CK+; the trained network contains expression information shared with micro-expressions, and this information is transferred to the LSTM network to learn spatio-temporal features.
The overall network structure is as follows:
input->conv_1->max-pool_1->conv_2->max-pool_2->conv_3->max-pool_3->fc_1->fc_2->lstm_1->lstm_2->lstm_3->fc_3->spatial_temporal feature
wherein input is the optical flow sequence obtained in step 2; conv_i (i = 1, 2, 3) denotes the i-th convolutional layer, and batch normalization is applied after the convolution operation of every convolutional layer except conv_1; max-pool_i (i = 1, 2, 3) denotes the i-th max-pooling layer; fc_i (i = 1, 2, 3) denotes the i-th fully connected layer, and the spatial feature representation is extracted from the fc_2 layer; lstm_i (i = 1, 2, 3) denotes the i-th LSTM layer; spatial_temporal feature denotes the spatio-temporal feature vector finally obtained through transfer learning; the output of each convolutional and fully connected layer is constrained by a ReLU nonlinear layer used as the activation function; after the first and second fully connected layers, a dropout layer mitigates overfitting of the feature vectors;
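A PyTorch sketch of this structure is shown below. The text does not specify channel counts, kernel sizes, the LSTM hidden size, the dropout rate, or the input layout, so the values here (including flow frames of size 2 × 112 × 96, with two channels for the horizontal and vertical flow components) are assumptions:

    import torch
    import torch.nn as nn

    class FlowCNNLSTM(nn.Module):
        # input -> 3 x (conv, max-pool) -> fc_1 -> fc_2 -> 3 x LSTM -> fc_3

        def __init__(self, in_ch=2, num_feat=256, hidden=256, num_classes=4):
            super().__init__()

            def block(cin, cout, bn):
                layers = [nn.Conv2d(cin, cout, kernel_size=3, padding=1)]
                if bn:  # batch normalization on every convolution except conv_1
                    layers.append(nn.BatchNorm2d(cout))
                layers += [nn.ReLU(inplace=True), nn.MaxPool2d(2)]
                return nn.Sequential(*layers)

            self.conv = nn.Sequential(
                block(in_ch, 32, bn=False),  # conv_1 + max-pool_1
                block(32, 64, bn=True),      # conv_2 + max-pool_2
                block(64, 128, bn=True),     # conv_3 + max-pool_3
            )
            self.fc1 = nn.Sequential(nn.Linear(128 * 14 * 12, 512),
                                     nn.ReLU(inplace=True), nn.Dropout(0.5))
            self.fc2 = nn.Sequential(nn.Linear(512, num_feat),
                                     nn.ReLU(inplace=True), nn.Dropout(0.5))
            self.lstm = nn.LSTM(num_feat, hidden, num_layers=3, batch_first=True)
            self.fc3 = nn.Linear(hidden, num_classes)  # fc_3 -> 4 target classes

        def forward(self, x):               # x: (B, T, 2, 112, 96) flow sequence
            b, t = x.shape[:2]
            f = self.conv(x.flatten(0, 1))  # per-frame convolutional features
            f = self.fc2(self.fc1(f.flatten(1)))   # spatial features from fc_2
            out, _ = self.lstm(f.view(b, t, -1))   # temporal modelling
            return self.fc3(out[:, -1])     # spatio-temporal representation

In the transfer-learning setting described above, the convolutional and fully connected weights would first be trained on macro-expression data (CK+, starting from the ImageNet pre-trained CNN) and then loaded, for example with model.load_state_dict(torch.load("ck_pretrained.pt"), strict=False), before the LSTM layers are trained on the micro-expression optical flow sequences; the checkpoint path here is hypothetical.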
the objective function for learning the spatial feature representation is as follows:

L_1 = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}

wherein y_{i,k} denotes the ground truth of the i-th sample, equal to 1 if k is the correct class and 0 otherwise, and \hat{y}_{i,k} denotes the predicted probability of expression category k computed at the fully connected layer; the objective term L_1 makes samples of different expression classes separable in the feature space;

L_2 = \sum_{c} \sum_{p} \sum_{i} \max\left(0,\ \left\| f_{c,p,i} - m_c \right\|_2^2 - \delta_c \right)

wherein f_{c,p,i} denotes the spatial feature representation vector, extracted from the last layer, of the i-th training sample of class c in expression state p, and m_c denotes the mean feature vector of the class-c training samples, with the margin

\delta_c = \frac{1}{2} \min_{j \neq c} \left\| m_c - m_j \right\|_2

being half of the minimum distance between m_c and m_j when j ≠ c; the objective term L_2 reduces the influence of intra-class variation within the same expression class caused by factors such as the appearance of the subject;
the operations of the LSTM layers that learn the temporal feature representation are as follows:

g_{in,t}^{(l)} = \mathrm{sigm}\left( W_{in}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{in}^{(l)} \right)

g_{f,t}^{(l)} = \mathrm{sigm}\left( W_{f}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{f}^{(l)} \right)

g_{o,t}^{(l)} = \mathrm{sigm}\left( W_{o}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{o}^{(l)} \right)

\mathrm{cell}_{t}^{(l)} = g_{f,t}^{(l)} \odot \mathrm{cell}_{t-1}^{(l)} + g_{in,t}^{(l)} \odot \tanh\left( W_{\mathrm{cell}}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{\mathrm{cell}}^{(l)} \right)

h_{t}^{(l)} = g_{o,t}^{(l)} \odot \tanh\left( \mathrm{cell}_{t}^{(l)} \right)

wherein W_{*}^{(l)} and b_{*}^{(l)} denote the weights and biases of the l-th LSTM layer, and the subscripts in, f, o and cell denote the input gate, the forget gate, the output gate and the memory cell, respectively; g_{in,t}^{(l)} is the input gate, which determines how much of the network input at the current time t is stored in the cell state; g_{f,t}^{(l)} is the forget gate, which determines how much of the cell state of the previous time step is retained at the current time t; g_{o,t}^{(l)} is the output gate, which determines how much of the cell state is output to the current output value of the LSTM; \mathrm{cell}_{t}^{(l)} is the cell state at the current time t; h_{t}^{(l)} is the output of the l-th LSTM layer given the t-th input.
The method performs optical flow estimation on the video sequences in the original micro-expression data set, making the subtle motions in the videos more salient and yielding high-level features. In terms of network structure, the invention uses transfer learning from macro-expressions to micro-expressions: a deep CNN pre-trained on ImageNet is further pre-trained with large-sample facial expression data, the trained network contains expression information shared with micro-expressions, and this information is transferred to the LSTM network for micro-expression recognition, which accelerates network training. This solves the problem that a convolutional neural network requires a large number of annotated samples and therefore overfits on data of such small scale, failing to guarantee good performance.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (5)

1. A method for micro-expression recognition based on optical flow input transfer learning is characterized in that: the method comprises the following steps:
step 1: downloading a micro-expression data set, and aligning and normalizing it;
step 2: calculating an optical flow estimate for each micro-expression video in the data set processed in step 1 to obtain an optical flow sequence;
step 3: performing transfer learning from macro-expressions to micro-expressions using a facial-expression-based CNN model, taking the optical flow sequence obtained in step 2 as input and outputting spatio-temporal features, and training the network to finally realize micro-expression recognition.
2. The method for micro-expression recognition based on optical flow input transfer learning of claim 1, wherein: in step 1, during alignment, 68 facial landmarks are detected in the first frame of each micro-expression video in the data set using an active shape model; the first frame of each video is then normalized against a reference template, and the subsequent frames of each video are registered to the first frame by a local weighted mean transformation; normalization includes spatial-domain normalization, which crops the face region of all images to 96 × 112 pixels, and temporal-domain normalization, which uses linear interpolation to obtain a sufficient number of frames.
3. The method for micro-expression recognition based on optical flow input transfer learning according to claim 1 or 2, wherein: in step 1, the micro-expression data set is CASME II.
4. The method for micro-expression recognition based on optical flow input transfer learning of claim 1, wherein: in step 2, in a segment of micro-expression video, the intensity of the pixel at point (x, y) at time t is I(x, y, t); after a time interval Δt, i.e., at the next frame, the pixel has moved by (Δx, Δy) and its intensity is I(x + Δx, y + Δy, t + Δt); based on the invariance of brightness over this short interval, the following is obtained:

I(x, y, t) = I(x + \Delta x,\ y + \Delta y,\ t + \Delta t)    (1)

where Δx = uΔt and Δy = vΔt, and u(x, y) and v(x, y) are the horizontal and vertical components of the optical flow field to be estimated; assuming that the pixel values in the micro-expression video are continuous functions of their position and time, a Taylor-series expansion of the right-hand side of equation (1) gives:

I(x + \Delta x,\ y + \Delta y,\ t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x} \Delta x + \frac{\partial I}{\partial y} \Delta y + \frac{\partial I}{\partial t} \Delta t + \varepsilon    (2)

where ε collects the terms of second order and above in Δt; letting Δt tend to zero, dividing both sides of equation (2) by Δt and substituting equation (1) yields the optical flow equation:

\frac{\partial I}{\partial x} \frac{\Delta x}{\Delta t} + \frac{\partial I}{\partial y} \frac{\Delta y}{\Delta t} + \frac{\partial I}{\partial t} = 0    (3)

that is,

I_x u + I_y v + I_t = 0    (4)
5. The method for micro-expression recognition based on optical flow input transfer learning of claim 1, wherein: in step 3, a network for transfer learning from macro-expressions to micro-expressions is designed to realize the micro-expression recognition function:
the source label space for transfer learning is:
y_S = {Neutral, Angry, Contempt, Disgust, Fear, Happy, Sad, Surprise}
the target label space is:
y_T = {Positive, Negative, Surprise, Others}
wherein Positive = {Happy}, Negative = {Angry, Contempt, Disgust, Fear, Sad}, Surprise = {Surprise}, and unclear facial movements belong to Others;
the overall network structure is as follows:
input->conv_1->max-pool_1->conv_2->max-pool_2->conv_3->max-pool_3->fc_1->fc_2->lstm_1->lstm_2->lstm_3->fc_3->spatial_temporal feature
wherein input is the optical flow sequence obtained in step 2; conv_i (i = 1, 2, 3) denotes the i-th convolutional layer, and batch normalization is applied after the convolution operation of every convolutional layer except conv_1; max-pool_i (i = 1, 2, 3) denotes the i-th max-pooling layer; fc_i (i = 1, 2, 3) denotes the i-th fully connected layer, and the spatial feature representation is extracted from the fc_2 layer; lstm_i (i = 1, 2, 3) denotes the i-th LSTM layer; spatial_temporal feature denotes the spatio-temporal feature vector finally obtained through transfer learning; the output of each convolutional and fully connected layer is constrained by a ReLU nonlinear layer used as the activation function; after the first and second fully connected layers, a dropout layer mitigates overfitting of the feature vectors;
the objective function for learning the spatial feature representation is as follows:

L_1 = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}

wherein y_{i,k} denotes the ground truth of the i-th sample, equal to 1 if k is the correct class and 0 otherwise, and \hat{y}_{i,k} denotes the predicted probability of expression category k computed at the fully connected layer; the objective term L_1 makes samples of different expression classes separable in the feature space;

L_2 = \sum_{c} \sum_{p} \sum_{i} \max\left(0,\ \left\| f_{c,p,i} - m_c \right\|_2^2 - \delta_c \right)

wherein f_{c,p,i} denotes the spatial feature representation vector, extracted from the last layer, of the i-th training sample of class c in expression state p, and m_c denotes the mean feature vector of the class-c training samples, with the margin

\delta_c = \frac{1}{2} \min_{j \neq c} \left\| m_c - m_j \right\|_2

being half of the minimum distance between m_c and m_j when j ≠ c; the objective term L_2 reduces the influence of intra-class variation within the same expression class caused by factors such as the appearance of the subject;
the operations of the LSTM layers that learn the temporal feature representation are as follows:

g_{in,t}^{(l)} = \mathrm{sigm}\left( W_{in}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{in}^{(l)} \right)

g_{f,t}^{(l)} = \mathrm{sigm}\left( W_{f}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{f}^{(l)} \right)

g_{o,t}^{(l)} = \mathrm{sigm}\left( W_{o}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{o}^{(l)} \right)

\mathrm{cell}_{t}^{(l)} = g_{f,t}^{(l)} \odot \mathrm{cell}_{t-1}^{(l)} + g_{in,t}^{(l)} \odot \tanh\left( W_{\mathrm{cell}}^{(l)} \left[ h_{t-1}^{(l)}, h_t^{(l-1)} \right] + b_{\mathrm{cell}}^{(l)} \right)

h_{t}^{(l)} = g_{o,t}^{(l)} \odot \tanh\left( \mathrm{cell}_{t}^{(l)} \right)

wherein W_{*}^{(l)} and b_{*}^{(l)} denote the weights and biases of the l-th LSTM layer, and the subscripts in, f, o and cell denote the input gate, the forget gate, the output gate and the memory cell, respectively; g_{in,t}^{(l)} is the input gate, which determines how much of the network input at the current time t is stored in the cell state; g_{f,t}^{(l)} is the forget gate, which determines how much of the cell state of the previous time step is retained at the current time t; g_{o,t}^{(l)} is the output gate, which determines how much of the cell state is output to the current output value of the LSTM; \mathrm{cell}_{t}^{(l)} is the cell state at the current time t; h_{t}^{(l)} is the output of the l-th LSTM layer given the t-th input.
CN202010666988.0A 2020-07-13 2020-07-13 Method for micro expression recognition based on transfer learning of optical flow input Active CN111950373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010666988.0A CN111950373B (en) 2020-07-13 2020-07-13 Method for micro expression recognition based on transfer learning of optical flow input

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010666988.0A CN111950373B (en) 2020-07-13 2020-07-13 Method for micro expression recognition based on transfer learning of optical flow input

Publications (2)

Publication Number Publication Date
CN111950373A true CN111950373A (en) 2020-11-17
CN111950373B CN111950373B (en) 2024-04-16

Family

ID=73340475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010666988.0A Active CN111950373B (en) 2020-07-13 2020-07-13 Method for micro expression recognition based on transfer learning of optical flow input

Country Status (1)

Country Link
CN (1) CN111950373B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381036A (en) * 2020-11-26 2021-02-19 厦门大学 Micro expression and macro expression fragment identification method applied to criminal investigation
CN113870315A (en) * 2021-10-18 2021-12-31 南京硅基智能科技有限公司 Training method of action migration model and action migration method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194347A (en) * 2017-05-19 2017-09-22 深圳市唯特视科技有限公司 A kind of method that micro- expression detection is carried out based on Facial Action Coding System
CN108710829A (en) * 2018-04-19 2018-10-26 北京红云智胜科技有限公司 A method of the expression classification based on deep learning and the detection of micro- expression
CN110287805A (en) * 2019-05-31 2019-09-27 东南大学 Micro- expression recognition method and system based on three stream convolutional neural networks
CN110348271A (en) * 2018-04-04 2019-10-18 山东大学 A kind of micro- expression recognition method based on long memory network in short-term

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194347A (en) * 2017-05-19 2017-09-22 深圳市唯特视科技有限公司 A kind of method that micro- expression detection is carried out based on Facial Action Coding System
CN110348271A (en) * 2018-04-04 2019-10-18 山东大学 A kind of micro- expression recognition method based on long memory network in short-term
CN108710829A (en) * 2018-04-19 2018-10-26 北京红云智胜科技有限公司 A method of the expression classification based on deep learning and the detection of micro- expression
CN110287805A (en) * 2019-05-31 2019-09-27 东南大学 Micro- expression recognition method and system based on three stream convolutional neural networks

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381036A (en) * 2020-11-26 2021-02-19 厦门大学 Micro expression and macro expression fragment identification method applied to criminal investigation
CN113870315A (en) * 2021-10-18 2021-12-31 南京硅基智能科技有限公司 Training method of action migration model and action migration method
CN113870315B (en) * 2021-10-18 2023-08-25 南京硅基智能科技有限公司 Multi-algorithm integration-based action migration model training method and action migration method

Also Published As

Publication number Publication date
CN111950373B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
Wang et al. MESNet: A convolutional neural network for spotting multi-scale micro-expression intervals in long videos
CN113496217B (en) Method for identifying human face micro expression in video image sequence
Shan et al. Automatic facial expression recognition based on a deep convolutional-neural-network structure
CN107609460B (en) Human body behavior recognition method integrating space-time dual network flow and attention mechanism
Wang et al. Trajectory predictor by using recurrent neural networks in visual tracking
CN106919903B (en) robust continuous emotion tracking method based on deep learning
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN112307958A (en) Micro-expression identification method based on spatiotemporal appearance movement attention network
CN112560810B (en) Micro-expression recognition method based on multi-scale space-time characteristic neural network
CN110399821B (en) Customer satisfaction acquisition method based on facial expression recognition
CN107679526A (en) A kind of micro- expression recognition method of face
CN108509839A (en) One kind being based on the efficient gestures detection recognition methods of region convolutional neural networks
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
Borgalli et al. Deep learning for facial emotion recognition using custom CNN architecture
Xu et al. Face expression recognition based on convolutional neural network
CN112597873A (en) Dual-channel facial expression recognition method based on deep learning
CN111950373B (en) Method for micro expression recognition based on transfer learning of optical flow input
Podder et al. Time efficient real time facial expression recognition with CNN and transfer learning
Seyedarabi et al. Recognition of six basic facial expressions by feature-points tracking using RBF neural network and fuzzy inference system
Nie Research on facial expression recognition of robot based on CNN convolution neural network
CN113723287B (en) Micro-expression recognition method, device and medium based on bidirectional circulating neural network
Hassan et al. Enhanced dynamic sign language recognition using slowfast networks
Liu et al. An optimized Capsule-LSTM model for facial expression recognition with video sequences
Dembani et al. UNSUPERVISED FACIAL EXPRESSION DETECTION USING GENETIC ALGORITHM.
CN115359522A (en) Elderly health monitoring method and system based on expression emotion calculation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant