CN115331289A - Micro-expression recognition method based on video motion amplification and optical flow characteristics - Google Patents

Micro-expression recognition method based on video motion amplification and optical flow characteristics

Info

Publication number
CN115331289A
Authority
CN
China
Prior art keywords
optical flow
micro
channel
image frame
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210948759.7A
Other languages
Chinese (zh)
Inventor
赵明华
董爽爽
都双丽
胡静
李鹏
王琳
王理
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202210948759.7A priority Critical patent/CN115331289A/en
Publication of CN115331289A publication Critical patent/CN115331289A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a micro-expression recognition method based on video motion amplification and optical flow features, comprising the following steps: selecting a data set and classifying its samples by emotion; preprocessing all original image frame sequences of the selected data set to obtain single-channel grayscale image sequences as one part of the network model input; computing the optical flow features of all obtained image frame sequences with a deep-learning-based RAFT network structure and taking the visualized optical flow maps as the other part of the network model input; and stacking each single-channel grayscale image with its visualized RGB optical flow image into a four-channel image, inputting the four-channel images into a designed VGG16 network, extracting the spatial features of the micro-expression, and classifying them to obtain the final recognition accuracy. The method addresses the key difficulty of prior micro-expression recognition methods: because facial motion intensity is low and duration is short, extracting the subtle facial motion changes in video frames is hard.

Description

Micro-expression recognition method based on video motion amplification and optical flow characteristics
Technical Field
The invention belongs to the technical field of digital image processing and recognition, and particularly relates to a micro-expression recognition method based on video motion amplification and optical flow characteristics.
Background
In recent years, micro-expression recognition has shown important research value in fields such as criminal investigation, lie detection, and depression analysis. However, because micro-expressions have small motion amplitude and short duration, they are difficult to identify manually: even professionally trained psychology researchers identify micro-expressions with only about 47% accuracy. Relying on the human eye to identify micro-expressions requires professional training and a large amount of time, yet yields low accuracy, which seriously hinders the large-scale adoption of micro-expression recognition. With the rapid development of computer vision and deep learning, more and more researchers have applied machine learning algorithms to micro-expression recognition, overcoming many of the difficulties of manual recognition and markedly improving recognition accuracy. Nevertheless, given the characteristics of micro-expressions, how to extract the subtle facial motion changes in video frames remains a key problem in the field. Micro-expression recognition is therefore still in a stage of rapid development and is gradually becoming an important research topic in affective computing.
Disclosure of Invention
The invention aims to provide a micro-expression recognition method based on video motion amplification and optical flow features, addressing the key difficulty of prior micro-expression recognition methods: because facial motion intensity is low and duration is short, extracting the subtle facial motion changes in video frames is difficult.
The invention adopts the following technical scheme: the micro-expression recognition method based on video motion amplification and optical flow features is implemented according to the following steps:
step 1, selecting a data set and classifying its samples by emotion;
step 2, preprocessing all original image frame sequences of the selected data set to obtain single-channel grayscale image sequences as one part of the network model input;
step 3, adopting a deep-learning-based RAFT network structure to compute the optical flow features of all image frame sequences obtained in step 2 and taking the visualized optical flow maps as the other part of the network model input;
and step 4, stacking each single-channel grayscale image obtained in step 2 with the corresponding visualized RGB optical flow image obtained in step 3 into a four-channel image, inputting the four-channel images into the designed VGG16 network, extracting the spatial features of the micro-expression, and classifying them to obtain the final recognition accuracy.
The present invention is also characterized in that,
the step 2 is implemented according to the following steps:
step 2.1, amplifying fine facial muscle motion amplitudes in all original image frame sequences of the selected data set by adopting a learning-based video motion amplification method;
step 2.2, a model for detecting 68 key point information provided by the dlib library is used for realizing face alignment operation, a face area is obtained by cutting, and the resolution is uniformly adjusted to be 224 pixels multiplied by 224 pixels;
step 2.3, selecting a peak frame and 4 frames before and after the peak frame of each micro expression image sequence, and taking 9 frames of images as key frames in total to reduce the influence of redundant information in all image frame sequences obtained in the step 2.2 on identification;
and 2.4, carrying out graying treatment on all the image frame sequences obtained in the step 2.3 by utilizing a cv2.Imread () function to obtain a single-channel gray-scale image which is used as a part of network model input.
Step 2.1 is specifically implemented according to the following steps:
First, all adjacent frame pairs (X_{t-1}, X_t) in the original image frame sequences are passed through an encoder H_e(·) to obtain their respective shape features (M_{t-1}, M_t) and texture features (V_{t-1}, V_t);
Then, the shape features (M_{t-1}, M_t) of the two frames are fed into an amplifier to magnify the motion amplitude; the amplifier H_m(·) is expressed as:
H_m(M_{t-1}, M_t, α) = M_{t-1} + h(α × g(M_t − M_{t-1}))   (1)
In formula (1), g(·) is a 3 × 3 convolution followed by a ReLU activation function, and h(·) is a 3 × 3 convolution followed by a 3 × 3 residual block;
Finally, the decoder reconstructs the changed shape information together with the unchanged texture information to generate the magnified image frame sequence.
In step 3, a deep-learning-based RAFT network structure is adopted to compute the optical flow features of all image frame sequences obtained in step 2.3. The specific steps are as follows:
First, a feature encoder P_θ extracts per-pixel optical flow features from each adjacent frame pair (T_1, T_2) of the image frame sequences obtained in step 2.3 and outputs them at 1/8 resolution, with D = 256 channels in the output feature map; at the same time, a context encoder C_θ extracts features from T_1 only. The feature encoder P_θ and the context encoder C_θ together form the RAFT feature extraction stage, which only needs to be executed once;
Then, given the image features P_θ(T_1) and P_θ(T_2) obtained by feature extraction, a full correlation volume Q for computing visual similarity is obtained by taking the dot product of all pairs of feature vectors (P_θ(T_1), P_θ(T_2)); the correlation volume Q is expressed as:
Q_{ijkl} = Σ_h P_θ(T_1)_{ijh} · P_θ(T_2)_{klh}   (2)
Finally, the optical flow is iteratively updated with a recurrent update structure based on gated recurrent units to generate the final visualized optical flow map.
Step 4 is specifically implemented according to the following steps:
Step 4.1, designing a VGG16 network model in which 13 convolutional layers and 5 max-pooling layers extract features, with zero padding applied before each convolutional layer to pad the feature edges; the last 3 fully connected layers complete the classification task, and dropout is applied to the fully connected layers with the ratio set to 0.5, i.e., dropout = 0.5;
Step 4.2, training the VGG16 network model designed in step 4.1 with an initial learning rate of 10^-5, a decay of 10^-6, 100 epochs, and a batch size of 3. After all parameters are set, stacking each single-channel grayscale image obtained in step 2.4 with the corresponding visualized RGB optical flow image obtained in step 3 into a four-channel image sequence, inputting it into the designed VGG16 network model, extracting its spatial features, and performing emotion classification with softmax;
Step 4.3, first, randomly dividing the obtained four-channel image sequences into two parts, an 80% training set and a 20% test set;
then, training the model on the training set and testing its accuracy on the test set, computed as shown in formula (3), to verify the effectiveness of the model;
Accuracy = (number of correctly classified test samples / total number of test samples) × 100%   (3)
then, to reduce error, adopting 10 rounds of simple cross-validation: the samples are shuffled, the training and test sets are re-selected, and the model is trained and validated again; this is repeated 10 times to obtain the accuracies of 10 models, which are averaged to obtain the final accuracy of the model.
The invention has the beneficial effects that:
the method disclosed by the invention is used for processing the micro expression by combining LVMM and RAFT, and extracting the spatial domain characteristics of the micro expression by adopting a VGG16 network and classifying to obtain a micro expression recognition result. Meanwhile, in order to reduce the influence of redundant information in the sequence of the microexpressive image frames on identification, a key 9 frames of the microexpressive sequence are selected from a CASME II data set to be tested, and compared with other 7 mainstream methods. Experimental results show that the method obtains better performance, and the identification precision reaches 67.98%.
Drawings
FIG. 1 is a flow chart of an algorithm framework used in the micro-expression recognition method based on video motion amplification and optical flow characteristics according to the present invention;
FIG. 2 compares the effect of LVMM under different magnification factors α in the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
The invention provides a micro-expression recognition method based on video motion amplification and optical flow characteristics, which is implemented according to the following steps as shown in figure 1:
Step 1, selecting a data set and classifying its samples into 5 emotion categories (happiness, disgust, surprise, repression, and others). The publicly available spontaneous micro-expression data set CASME II, released by the team of Fu Xiaolan at the Institute of Psychology, Chinese Academy of Sciences, is used; the partition of the data set is shown in Table 1:
TABLE 1 Partition of the CASME II data set
Step 2, preprocessing all original image frame sequences of the selected data set. This is implemented according to the following steps:
Step 2.1, amplifying the subtle facial muscle motion in all original image frame sequences of the selected data set by adopting a learning-based video motion magnification method (LVMM) to enhance the visual features. LVMM mainly consists of three parts: an encoder H_e(·), an amplifier H_m(·), and a decoder H_d(·). During the magnification experiments with LVMM, first, all adjacent frame pairs (X_{t-1}, X_t) in the original image frame sequences are passed through the encoder H_e(·) to obtain their respective shape features (M_{t-1}, M_t) and texture features (V_{t-1}, V_t); the texture features are not motion-amplified and are mainly used to suppress the noise introduced by the subsequent intensity amplification. Then, the shape features (M_{t-1}, M_t) of the two frames are fed into the amplifier to magnify the motion amplitude, where the amplifier H_m(·) can be expressed as:
H_m(M_{t-1}, M_t, α) = M_{t-1} + h(α × g(M_t − M_{t-1}))   (1)
In formula (1), g(·) is a 3 × 3 convolution followed by a ReLU activation function, and h(·) is a 3 × 3 convolution followed by a 3 × 3 residual block; finally, the decoder reconstructs the changed shape information together with the unchanged texture information to generate the magnified image frame sequence.
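As a rough illustration of formula (1), the following PyTorch sketch implements the amplifier H_m(·) under stated assumptions: the channel count, the residual-block layout, and the encoder/decoder that would surround it are assumptions, not the trained LVMM model used here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A 3 x 3 residual block (assumed layout: two 3 x 3 convolutions with a skip)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))

class ShapeAmplifier(nn.Module):
    """Amplifier H_m of formula (1): M_{t-1} + h(alpha * g(M_t - M_{t-1}))."""
    def __init__(self, channels=32):
        super().__init__()
        # g(.): a 3 x 3 convolution followed by ReLU
        self.g = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # h(.): a 3 x 3 convolution followed by a 3 x 3 residual block
        self.h = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            ResidualBlock(channels),
        )

    def forward(self, m_prev, m_curr, alpha=15.0):
        return m_prev + self.h(alpha * self.g(m_curr - m_prev))

# Dummy shape features; the channel count and spatial size are assumptions.
amp = ShapeAmplifier(channels=32)
m_prev, m_curr = torch.randn(1, 32, 56, 56), torch.randn(1, 32, 56, 56)
print(amp(m_prev, m_curr, alpha=15.0).shape)  # torch.Size([1, 32, 56, 56])
```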
Through repeated experimental comparison, a magnification factor of α = 15 was finally selected. FIG. 2 shows the result of applying different magnification factors (α = 5, α = 10, α = 15, α = 20, α = 25) to one frame of the original image frame sequences; we found that at α = 15 the motion in the image frame is clearly magnified without degrading the image quality.
Step 2.2, to minimize the influence of the non-face regions in the amplified image frame sequences obtained in step 2.1 on micro-expression recognition, the 68-landmark detection model provided by the dlib library is used to perform face alignment, the face region is cropped, and its resolution is uniformly adjusted to 224 × 224 pixels so that the input spatial dimensions match the VGG16 network model;
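A rough sketch of the face detection and cropping in step 2.2, based on dlib's standard 68-landmark model, is shown below. It crops a landmark-based bounding box and resizes it to 224 × 224 pixels; the geometric alignment itself is omitted, and the model file name and the crop_face helper are illustrative assumptions rather than the authors' exact code.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# Standard dlib 68-landmark model file (an assumption about the exact file used).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_face(image_bgr, size=224):
    """Detect the face, locate its 68 landmarks, crop the landmark bounding box,
    and resize it to 224 x 224 pixels (a rough stand-in for step 2.2)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    xs = [shape.part(i).x for i in range(68)]
    ys = [shape.part(i).y for i in range(68)]
    x0, x1 = max(min(xs), 0), min(max(xs), image_bgr.shape[1])
    y0, y1 = max(min(ys), 0), min(max(ys), image_bgr.shape[0])
    return cv2.resize(image_bgr[y0:y1, x0:x1], (size, size))
```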
and 2.3, considering the change and subtlety of the facial motion in the micro-expression image sequence, the change between two continuous frames is hardly perceived. If all image frame sequences obtained in step 2.2 are directly input into the network model training, a large number of redundant features are included. Meanwhile, the shortest duration time of the micro expression is about 1/25 second, the frame rate of the samples in the CASME II data set is 200 frames/second, and the shortest micro expression duration frame sequence can be obtained by conversion and is 8 frames, so that the peak frame and the front and rear 4 frames of each micro expression image sequence are selected, and 9 frames of images are used as key frames, so that the influence of redundant information in all the image frame sequences obtained in the step 2.2 on identification is reduced;
and 2.4, carrying out graying treatment on all the image frame sequences obtained in the step 2.3 by utilizing a cv2.Imread () function to obtain a single-channel gray-scale image which is used as a part of network model input.
Step 3, optical flow captures representative motion features between adjacent image frames of a micro-expression, offers a higher signal-to-noise ratio, and provides rich, discriminative input features for the network. Therefore, the deep-learning-based RAFT network structure is used, for the first time in this setting, to compute the optical flow features of all image frame sequences obtained in step 2.3, and the visualized optical flow map is taken as the other part of the network model input. RAFT extracts optical flow in three steps. First, a feature encoder P_θ extracts per-pixel optical flow features from each adjacent frame pair (T_1, T_2) of the image frame sequences obtained in step 2.3 and outputs them at 1/8 resolution, with D = 256 channels in the output feature map; at the same time, a context encoder C_θ extracts features from T_1 only. The feature encoder P_θ and the context encoder C_θ together form the RAFT feature extraction stage, which only needs to be executed once. Then, given the image features P_θ(T_1) and P_θ(T_2) obtained by feature extraction, a full correlation volume Q for computing visual similarity is obtained by taking the dot product of all pairs of feature vectors (P_θ(T_1), P_θ(T_2)):
Q_{ijkl} = Σ_h P_θ(T_1)_{ijh} · P_θ(T_2)_{klh}   (2)
Finally, the optical flow is iteratively updated with a recurrent update structure based on gated recurrent units (GRU) to generate the final visualized optical flow map.
Step 4, stacking each single-channel grayscale image obtained in step 2.4 with the corresponding visualized RGB optical flow image obtained in step 3 into a four-channel image, inputting the four-channel images into the designed VGG16 network, extracting the spatial features of the micro-expression, and classifying them to obtain the final recognition accuracy. This is implemented according to the following steps:
Step 4.1, after the three preceding steps, all single-channel grayscale image sequences from step 2.4 and all visualized RGB optical flow image sequences from step 3 are available. To complete the classification and recognition of micro-expressions, a VGG16 network model is designed. The VGG16 network is simple and compact: 13 convolutional layers and 5 max-pooling layers extract features, and zero padding is applied before each convolutional layer so that the spatial size of the input does not shrink. The last 3 fully connected layers complete the classification task; to reduce overfitting, dropout is applied to the fully connected layers, randomly disabling neurons according to the set ratio, which improves the generalization ability of the network model and speeds up training. Following common practice, the ratio is set to 0.5, i.e., dropout = 0.5.
Step 4.2, the designed VGG16 network model is trained with an initial learning rate (lr) of 10^-5, a decay of 10^-6, 100 epochs, and a batch size of 3. After all parameters are set, each single-channel grayscale image obtained in step 2.4 is stacked with the corresponding visualized RGB optical flow image obtained in step 3 into a four-channel image sequence, which is input into the VGG16 network model; its spatial features are extracted, and emotion classification is performed with softmax.
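The four-channel VGG16 of steps 4.1 and 4.2 might be set up as follows in PyTorch; this is a sketch under stated assumptions (the reported decay is treated here as weight decay, and the classifier layout mirrors the standard VGG16 head), not the authors' exact network.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def build_four_channel_vgg16(num_classes=5, dropout=0.5):
    """VGG16 adapted to 4-channel input (grayscale frame + RGB flow map) and 5 classes."""
    model = vgg16(weights=None)
    # Replace the first convolution so the network accepts 4 input channels.
    model.features[0] = nn.Conv2d(4, 64, kernel_size=3, padding=1)
    # Rebuild the classifier head with dropout = 0.5 and 5 output classes.
    model.classifier = nn.Sequential(
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(True), nn.Dropout(dropout),
        nn.Linear(4096, 4096), nn.ReLU(True), nn.Dropout(dropout),
        nn.Linear(4096, num_classes),
    )
    return model

model = build_four_channel_vgg16()
# lr = 1e-5 as in step 4.2; "decay" is interpreted here as weight decay (an assumption).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-6)
criterion = nn.CrossEntropyLoss()  # applies softmax internally for classification

# One dummy training step with a batch of three 4-channel 224 x 224 inputs.
x = torch.randn(3, 4, 224, 224)
y = torch.randint(0, 5, (3,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```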
Step 4.3, cross-validating the model avoids, to some extent, errors caused by a particular data set split, so our experiments use 10 rounds of simple cross-validation to reduce this error. Specifically: first, the obtained four-channel image sequences are randomly divided into two parts (80% training set, 20% test set); then the model is trained on the training set and its accuracy is tested on the test set (computed as in formula (3)) to verify the effectiveness of the model; then the samples are shuffled, the training and test sets are re-selected, and the model is trained and validated again, for 10 rounds of simple cross-validation in total. As shown in Table 2, the final recognition result, averaged over the 10 experiments, is 67.98%.
Accuracy = (number of correctly classified test samples / total number of test samples) × 100%   (3)
TABLE 2 10 training results
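The 10 rounds of simple cross-validation in step 4.3 can be sketched as repeated random 80/20 splits; the train_and_eval callback is a placeholder assumption standing in for training and testing the four-channel VGG16.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def simple_cross_validation(features, labels, train_and_eval, rounds=10):
    """Repeat a random 80/20 split `rounds` times and average the accuracies.

    `train_and_eval` is a caller-supplied function (a placeholder assumption)
    that trains a fresh model and returns its test accuracy in percent."""
    accuracies = []
    for r in range(rounds):
        x_tr, x_te, y_tr, y_te = train_test_split(
            features, labels, test_size=0.2, shuffle=True, random_state=r)
        accuracies.append(train_and_eval(x_tr, y_tr, x_te, y_te))
    return float(np.mean(accuracies)), accuracies
```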
Experimental comparison and verification: the average of the 10 experimental results obtained in step 4.3 (Table 2) is compared with existing methods in Table 3, including the conventional methods LBP-TOP, STLBP-IP, and Bi-WOOF and the deep learning methods ELRCN-SE, CNN+LSTM, CNNCapsNet, and MSCNN. The results show that the recognition accuracy of the proposed method is 3.35% higher than that of the second-best method, CNNCapsNet, achieving a better micro-expression recognition result and addressing the key difficulty that, because facial motion intensity is low and duration is short, extracting the subtle facial motion changes in video frames is difficult.
TABLE 3 Performance comparison of the proposed method with existing methods on CASME II

Claims (5)

1. The micro-expression recognition method based on video motion amplification and optical flow features is characterized by comprising the following steps:
step 1, selecting a data set and classifying its samples by emotion;
step 2, preprocessing all original image frame sequences of the selected data set to obtain single-channel grayscale image sequences as one part of the network model input;
step 3, adopting a deep-learning-based RAFT network structure to compute the optical flow features of all image frame sequences obtained in step 2 and taking the visualized optical flow maps as the other part of the network model input;
and step 4, stacking each single-channel grayscale image obtained in step 2 with the corresponding visualized RGB optical flow image obtained in step 3 into a four-channel image, inputting the four-channel images into the designed VGG16 network, extracting the spatial features of the micro-expression, and classifying them to obtain the final recognition accuracy.
2. The micro-expression recognition method based on video motion amplification and optical flow features according to claim 1, characterized in that step 2 is implemented according to the following steps:
step 2.1, amplifying the subtle facial muscle motion in all original image frame sequences of the selected data set by adopting a learning-based video motion magnification method;
step 2.2, using the 68-landmark detection model provided by the dlib library to perform face alignment, cropping the face region, and uniformly adjusting its resolution to 224 × 224 pixels;
step 2.3, selecting, for each micro-expression image sequence, the apex frame and the 4 frames before and after it, 9 frames in total, as key frames, so as to reduce the influence of redundant information in the image frame sequences obtained in step 2.2 on recognition;
and step 2.4, converting all image frame sequences obtained in step 2.3 to grayscale using the cv2.imread() function to obtain single-channel grayscale images, which serve as one part of the network model input.
3. The micro-expression recognition method based on video motion amplification and optical flow features according to claim 2, characterized in that step 2.1 is implemented according to the following steps:
first, all adjacent frame pairs (X_{t-1}, X_t) in the original image frame sequences are passed through an encoder H_e(·) to obtain their respective shape features (M_{t-1}, M_t) and texture features (V_{t-1}, V_t);
then, the shape features (M_{t-1}, M_t) of the two frames are fed into an amplifier to magnify the motion amplitude, wherein the amplifier H_m(·) is expressed as:
H_m(M_{t-1}, M_t, α) = M_{t-1} + h(α × g(M_t − M_{t-1}))   (1)
in formula (1), g(·) is a 3 × 3 convolution followed by a ReLU activation function, and h(·) is a 3 × 3 convolution followed by a 3 × 3 residual block;
finally, the decoder reconstructs the changed shape information together with the unchanged texture information to generate the magnified image frame sequence.
4. The micro-expression recognition method based on video motion amplification and optical flow features according to claim 3, characterized in that in step 3, a deep-learning-based RAFT network structure is adopted to compute the optical flow features of all image frame sequences obtained in step 2.3, and the specific steps are as follows:
first, a feature encoder P_θ extracts per-pixel optical flow features from each adjacent frame pair (T_1, T_2) of the image frame sequences obtained in step 2.3 and outputs them at 1/8 resolution, with D = 256 channels in the output feature map; at the same time, a context encoder C_θ extracts features from T_1 only; the feature encoder P_θ and the context encoder C_θ together form the RAFT feature extraction stage, which only needs to be executed once;
then, given the image features P_θ(T_1) and P_θ(T_2) obtained by feature extraction, a full correlation volume Q for computing visual similarity is obtained by taking the dot product of all pairs of feature vectors (P_θ(T_1), P_θ(T_2)), the correlation volume Q being expressed as:
Q_{ijkl} = Σ_h P_θ(T_1)_{ijh} · P_θ(T_2)_{klh}   (2)
finally, the optical flow is iteratively updated with a recurrent update structure based on gated recurrent units to generate the final visualized optical flow map.
5. The micro-expression recognition method based on video motion amplification and optical flow features according to claim 4, characterized in that step 4 is implemented according to the following steps:
step 4.1, designing a VGG16 network model in which 13 convolutional layers and 5 max-pooling layers are responsible for extracting features, with zero padding applied before each convolutional layer to pad the feature edges; the last 3 fully connected layers are responsible for completing the classification task, and dropout is applied to the fully connected layers with the ratio set to 0.5, i.e., dropout = 0.5;
step 4.2, training the VGG16 network model designed in step 4.1 with an initial learning rate of 10^-5, a decay of 10^-6, 100 epochs, and a batch size of 3; after all parameters are set, stacking each single-channel grayscale image obtained in step 2.4 with the corresponding visualized RGB optical flow image obtained in step 3 into a four-channel image sequence, inputting it into the designed VGG16 network model, extracting its spatial features, and realizing emotion classification with softmax;
step 4.3, first, randomly dividing the obtained four-channel image sequences into two parts, an 80% training set and a 20% test set;
then, training the model on the training set and testing its accuracy on the test set, computed as shown in formula (3), to verify the effectiveness of the model;
Accuracy = (number of correctly classified test samples / total number of test samples) × 100%   (3)
then, adopting 10 rounds of simple cross-validation: shuffling the samples, re-selecting the training set and the test set, and continuing to train and validate the model; this is repeated 10 times to obtain the accuracies of 10 models, which are averaged to obtain the final accuracy of the model.
CN202210948759.7A 2022-08-09 2022-08-09 Micro-expression recognition method based on video motion amplification and optical flow characteristics Pending CN115331289A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210948759.7A CN115331289A (en) 2022-08-09 2022-08-09 Micro-expression recognition method based on video motion amplification and optical flow characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210948759.7A CN115331289A (en) 2022-08-09 2022-08-09 Micro-expression recognition method based on video motion amplification and optical flow characteristics

Publications (1)

Publication Number Publication Date
CN115331289A true CN115331289A (en) 2022-11-11

Family

ID=83922004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210948759.7A Pending CN115331289A (en) 2022-08-09 2022-08-09 Micro-expression recognition method based on video motion amplification and optical flow characteristics

Country Status (1)

Country Link
CN (1) CN115331289A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766035A (en) * 2020-12-01 2021-05-07 华南理工大学 Bus-oriented system and method for recognizing violent behavior of passenger on driver
CN112766035B (en) * 2020-12-01 2023-06-23 华南理工大学 System and method for identifying violence behaviors of passengers on drivers facing buses

Similar Documents

Publication Publication Date Title
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
WO2023185243A1 (en) Expression recognition method based on attention-modulated contextual spatial information
CN110287805B (en) Micro-expression identification method and system based on three-stream convolutional neural network
CN110399821B (en) Customer satisfaction acquisition method based on facial expression recognition
CN109753950B (en) Dynamic facial expression recognition method
CN113869229B (en) Deep learning expression recognition method based on priori attention mechanism guidance
CN113537008B (en) Micro expression recognition method based on self-adaptive motion amplification and convolutional neural network
CN108416780A (en) A kind of object detection and matching process based on twin-area-of-interest pond model
CN112446891A (en) Medical image segmentation method based on U-Net network brain glioma
CN112150476A (en) Coronary artery sequence vessel segmentation method based on space-time discriminant feature learning
CN110428364B (en) Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium
CN112580521B (en) Multi-feature true and false video detection method based on MAML (maximum likelihood markup language) element learning algorithm
CN114038037B (en) Expression label correction and identification method based on separable residual error attention network
CN116311483B (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
Ahmed et al. Improve of contrast-distorted image quality assessment based on convolutional neural networks.
CN112668486A (en) Method, device and carrier for identifying facial expressions of pre-activated residual depth separable convolutional network
CN115331289A (en) Micro-expression recognition method based on video motion amplification and optical flow characteristics
Zhi et al. Micro-expression recognition with supervised contrastive learning
Jaymon et al. Real time emotion detection using deep learning
CN111626197B (en) Recognition method based on human behavior recognition network model
Mozaffari et al. Irisnet: Deep learning for automatic and real-time tongue contour tracking in ultrasound video data using peripheral vision
Yao [Retracted] Application of Higher Education Management in Colleges and Universities by Deep Learning
CN113963427B (en) Method and system for rapid in-vivo detection
Wang et al. Curiosity-driven salient object detection with fragment attention
CN115346259A (en) Multi-granularity academic emotion recognition method combined with context information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination