CN111325131B - Micro-expression detection method based on adaptive transition frame removal deep network - Google Patents

Micro-expression detection method based on adaptive transition frame removal deep network

Info

Publication number
CN111325131B
CN111325131B
Authority
CN
China
Prior art keywords
frame
micro
network
expression
transition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010092959.8A
Other languages
Chinese (zh)
Other versions
CN111325131A (en)
Inventor
付晓峰
牛力
柳永翔
赵伟华
计忠平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010092959.8A priority Critical patent/CN111325131B/en
Publication of CN111325131A publication Critical patent/CN111325131A/en
Application granted granted Critical
Publication of CN111325131B publication Critical patent/CN111325131B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a micro-expression detection method based on a deep network with adaptive transition frame removal. The method comprises network construction, network training and micro-expression detection. In network training, the original video is first preprocessed; transition frames are then removed with an adaptive transition frame removal method; finally, the micro-expression frame and neutral frame samples, with transition frames removed, are input into the MesNet network for training. The MesNet constructed by the invention is essentially a binary classification network, and micro-expression frame detection does not depend on the temporal ordering of frames. MesNet can therefore not only detect micro-expression frames from the complete videos of a micro-expression database, but also detect micro-expression frames from any given set of frames, and can judge whether a given single frame is a micro-expression frame.

Description

Micro-expression detection method based on adaptive transition frame removal deep network
Technical Field
The invention belongs to the technical field of computer image processing, and relates to a micro-expression detection method based on a deep network with adaptive transition frame removal.
Background
Unlike conventional facial expressions, which last 0.5 s to 4 s, facial micro-expressions last only 1/25 s to 1/5 s and are instantaneous, involuntary reactions that reveal a person's true emotions. Micro-expression recognition has attracted increasing attention from researchers over the past decade because of its potential applications in emotion monitoring, lie detection, clinical diagnosis, business negotiation and other fields.
Micro-expressions are difficult to elicit, difficult to collect, small in sample size and hard for the human eye to recognize. Early micro-expression recognition was therefore carried out mainly by psychologists and other professionals; advances in computer hardware in recent years have made automatic micro-expression recognition with computer vision and machine learning methods possible.
Micro-expression recognition comprises two steps: micro-expression detection and micro-expression category discrimination. Micro-expression detection is a precondition for discriminating the micro-expression category: for a video containing micro-expressions, the frames over which the micro-expression is distributed must first be detected before the category of the micro-expression can be judged. Existing micro-expression detection methods commonly suffer from low detection accuracy or a narrow scope of application. The databases commonly used for micro-expression detection are CASME II, SMIC-E-HS and CAS(ME)^2; no prior micro-expression detection method has been verified on all three databases simultaneously.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a micro-expression detection method based on a deep network with adaptive transition frame removal, which offers high accuracy and a wide range of application in micro-expression detection.
The invention comprises network construction, network training, and micro-expression detection using the trained network.
The network construction specifically comprises the following steps:
step S1: selecting a pre-trained CNN model on an ImageNet database, and reserving a convolution layer and pre-training parameters.
Step S2: and adding a full connection layer after the CNN model.
Step S3: and adding an output layer and a logistic classifier after the full connection layer.
Specifically, the invention builds the micro-expression detection network on Inception-ResNet-V2 and names it MesNet (micro-expression spotting network).
Specifically, the fully connected layer contains 512 neurons.
Specifically, the MesNet network is a micro-expression frame versus neutral frame binary classification network, and its output layer contains one neuron.
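For concreteness, the following is a minimal sketch of this construction, assuming a TensorFlow/Keras implementation; the input size, the global average pooling and the ReLU activation on the fully connected layer are assumptions not specified in the patent.

```python
import tensorflow as tf

def build_mesnet(input_shape=(299, 299, 3)):
    # Pre-trained Inception-ResNet-V2 convolutional base with ImageNet weights;
    # include_top=False keeps only the convolutional layers and their pre-trained parameters.
    base = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")

    inputs = tf.keras.Input(shape=input_shape)
    features = base(inputs)                                        # Features = f(Input)
    fc = tf.keras.layers.Dense(512, activation="relu")(features)   # fully connected layer, 512 neurons
    output = tf.keras.layers.Dense(1, activation="sigmoid")(fc)    # 1 output neuron, logistic classifier
    return tf.keras.Model(inputs, output, name="MesNet")
```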
The network training specifically comprises the following steps:
Step S1: perform data preprocessing on the original videos of the training set.
Step S2: remove transition frames from the training set using the adaptive transition frame removal method.
Step S3: input the micro-expression frame and neutral frame samples, with transition frames removed, into the MesNet network for training.
Specifically, data preprocessing of the original video comprises face detection, face alignment and micro-expression region cropping.
The adaptive transition frame removal method specifically comprises the following steps:
Step S1: divide the training set into confident samples and transition frame samples to be removed.
Step S2: train the MesNet network on the confident samples to obtain a micro-expression frame versus neutral frame classification model.
Step S3: use the classification model to predict the transition frame samples to be removed, obtaining for each sample the probability that it belongs to the positive-sample micro-expression frame class.
Step S4: adaptively determine the threshold values for screening transition frames from the probability distribution of the transition frame samples to be removed, and thereby remove the transition frames.
Detecting micro-expressions with the trained network specifically comprises the following steps:
Step S1: perform data preprocessing on the original videos of the test set.
Step S2: input the preprocessed samples into the trained MesNet network to obtain the predicted label values. A label of 1 indicates a micro-expression frame and a label of 0 indicates a neutral frame.
Specifically, the input samples to be detected may be a single video segment or the multiple video segments of a complete test set.
Compared with the prior art, the method has the following beneficial effects:
the invention has high micro expression detection precision, and MesNet is arranged in CASME II, SMIC-E-HS and CAS (ME) 2 The database obtains the current optimal result. The invention has wide application range, has no limit on the length of the input video, is not only suitable for short videos of CASME II and SMIC-E-HS, but also suitable for CAS (ME) 2 Of databasesLong video.
Drawings
Fig. 1 shows a MesNet training flowchart.
Fig. 2 shows a CASME II database video clip example.
Fig. 3 shows the adaptive transition frame removal method.
Fig. 4 shows the probability distribution of transition frame samples to be removed.
Fig. 5 (a) shows a certain frame of image in a video.
Fig. 5 (b) shows an extracted rectangular frame of a face.
Fig. 6 shows a face alignment method.
Fig. 7 (a) shows an image after face alignment.
Fig. 7 (b) shows a cropped micro-expression region.
Fig. 8 (a) shows some video frames of the CASME II database.
Fig. 8 (b) is a Dlib face detection diagram corresponding to fig. 8 (a).
Fig. 8 (c) is a diagram obtained by preprocessing fig. 8 (a) by the preprocessing method of the present invention.
Detailed Description
The invention will now be described in detail with reference to the accompanying drawings. It should be noted that the embodiments described are only intended to facilitate an understanding of the invention and do not limit it in any way.
Fig. 1 illustrates the MesNet training procedure, using the video segment numbered 20_ep15_03f in the CASME II database as an example. As shown in equation (1), Input denotes a micro-expression frame or neutral frame sample fed into the MesNet network, and f(Input) denotes the extraction of shape and texture Features from the image using the pre-trained model:
Features = f(Input). (1)
To further extract micro-expression features, as shown in equation (2), the function f_1(Features, N) denotes connecting a fully connected layer containing N neurons after the pre-trained model, taking Features as its input:
FC = f_1(Features, N). (2)
Then, taking FC as input, the output layer Output is constructed as shown in equation (3). Because MesNet is a binary classification network, the output layer contains only 1 neuron:
Output = f_1(FC, 1). (3)
The MesNet network uses a logistic classifier, whose loss function is
J = -(1/m) Σ_{i=1}^{m} [ y^(i) log(ŷ^(i)) + (1 - y^(i)) log(1 - ŷ^(i)) ], (4)
where m denotes the number of samples involved in one iteration, and y^(i) denotes the true label of the i-th training sample, label 1 denoting a positive-sample micro-expression frame and 0 a negative-sample neutral frame. ŷ^(i) denotes the probability with which MesNet predicts that the i-th sample is a positive sample, and is computed as
ŷ^(i) = 1 / (1 + e^(-Output^(i))). (5)
The MesNet network is optimized with the Adam method, whose learning rate is adaptive.
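As a hedged sketch of how the logistic loss of equations (4)-(5) and the Adam optimizer could be combined in the Keras setting assumed above; the learning rate, batch size and epoch count are illustrative assumptions, not values from the patent:

```python
model = build_mesnet()

# Binary cross-entropy is exactly the logistic loss of equation (4);
# the sigmoid output of the last layer realises equation (5).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # assumed learning rate
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC()])

# x_train: preprocessed frames; y_train: 1 for micro-expression frames, 0 for neutral frames.
# model.fit(x_train, y_train, batch_size=32, epochs=10)      # assumed batch size / epochs
```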
Fig. 2 shows an example video clip from the CASME II database; it is the same video segment as in Fig. 1 and contains 1024 frames over about 5 seconds. According to the CASME II database description document, the 86th frame is the onset frame (Onset Frame), i.e. the frame at which the micro-expression starts; the 129th frame is the apex frame (Apex Frame), i.e. the micro-expression peak frame; and the 181st frame is the offset frame (Offset Frame), i.e. the last frame over which the micro-expression lasts.
In supervised learning, the quality of the labels attached to the training data has an important influence on the learning result. Owing to the way the micro-expression databases are produced, some frames near the onset and offset frames cannot be unambiguously classified as micro-expression or neutral frames, even under a 200 fps high-speed camera. Frames near frame 86 and frame 181 may therefore carry noisy labels which, if placed in the training set, would interfere with model training. The present embodiment accordingly defines frames with noisy labels near the onset and offset frames as transition frames and removes them from the training set.
Taking the video shown in Fig. 2 as an example, to remove transition frames the whole video is divided into four segments with the 86th, 129th and 181st frames as boundaries, and each segment is further divided into two parts, giving a total of 8 parts numbered in the figure. U_1 denotes the sample set of part 1 and L_1 denotes the number of samples in part 1; the remaining 7 parts are denoted analogously.
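A minimal sketch of this partition, assuming 1-based frame indices; how the boundary frames themselves are assigned to the parts is an assumption, since the patent only states that each segment is halved:

```python
def split_into_parts(n_frames, onset, apex, offset):
    """Split frames 1..n_frames into 4 segments at the onset/apex/offset frames,
    then halve each segment, giving the 8 parts U_1..U_8 numbered as in Fig. 2."""
    segments = [(1, onset), (onset, apex), (apex, offset), (offset, n_frames + 1)]
    parts = []
    for start, end in segments:
        mid = (start + end) // 2
        parts.append(list(range(start, mid)))  # first half of the segment
        parts.append(list(range(mid, end)))    # second half of the segment
    return parts

# Example with the video of Fig. 2 (1024 frames, onset 86, apex 129, offset 181):
# u1, u2, u3, u4, u5, u6, u7, u8 = split_into_parts(1024, 86, 129, 181)
```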
Fig. 3 shows a method for adaptively removing transition frames, which specifically comprises the following steps:
step S1: considering that the transition frame is a small number of samples with noise labels, the proportion of the transition frame does not exceed 50% of the total number of training set samples and the transition frame is close to the start frame or the end frame. Then, as shown in FIG. 2, initialize L 1 :L 2 =L 3 :L 4 =L 5 :L 6 =L 7 :L 8 =1:1. In U shape T Representing a transition frame sample set, in U T0 Representing a transition frame sample set to be removed, U T0 =U 2 ∪U 3 ∪U 6 ∪U 7
Step S2: u (U) 4 ∪U 5 U as a micro-expression frame sample 1 ∪U 8 As a neutral frame sample, the MesNet network was trained to obtain model C.
Step S3: predicting U using model C T0 Sample x in (a) (i) Probability P belonging to positive sample micro-expression frame i If P i Near 0, the samples are neutral frames, if P i Near 1, the sample is a micro-expression frame. Then the transition frame discrimination formula is U T ={x i |P1<P i <P2,x i ∈U T0 } (6)
Wherein P1, P2E (0, 1), the specific values of P1, P2 will be discussed below;
step S4: u (U) 2 、U 3 、U 6 、U 7 After removal of the transition frameThe sample sets are U respectively 2- 、U 3- 、U 6- 、U 7- . The set of micro-expression frame samples put into the training set is:
U ME =U 3 -UU 4 UU 5 UU 6 -, (7)
the neutral frame sample set is:
U N =U 1 UU 2 -UU 7 -UU 8 . (8)
Fig. 4 shows the probability distribution of the transition frame samples to be removed. The 24454 samples of U_T0 are input into model C for prediction, yielding 24454 corresponding probability values. To determine the optimal thresholds P1 and P2, the probability distribution shown in Fig. 4 is computed. There are 16616 samples in the interval [0.000, 0.050], for which model C judges the probability of being a micro-expression frame to be no higher than 0.05, i.e. the probability of being a neutral frame is no lower than 0.95. There are 5429 samples in the interval (0.950, 1.000], for which model C judges the probability of being a micro-expression frame to be no lower than 0.95. The closer a sample's predicted probability is to 0.5, the harder it is for model C to judge its class and the less reliable the prediction; such samples are transition frames. From the probability distribution, the number of samples in the interval [0.000, 0.050] is much greater than that in the next interval (0.050, 0.100], while the number in (0.050, 0.100] is not much greater than that in (0.100, 0.150]; therefore P1 is set to 0.05, and P2 is determined to be 0.95 in the same way. Using this adaptive transition frame removal method, 2409 transition frame samples are removed from the 48670 samples of the CASME II database training set, about 4.950% of the total number of training samples.
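A hedged sketch of this adaptive threshold selection, assuming 0.05-wide histogram bins and interpreting "much greater than the next interval" as a simple ratio test; the ratio value of 2.0 is an assumption, not a value given in the patent:

```python
import numpy as np

def adaptive_thresholds(probs, bin_width=0.05, ratio=2.0):
    """Determine P1 and P2 from the histogram of probabilities predicted by model C
    for the transition frame samples to be removed (U_T0).

    P1 is placed where the leftmost bins stop dominating their right neighbour;
    P2 is found symmetrically from the right. Samples with P1 < p < P2 are treated
    as transition frames and removed from the training set.
    """
    probs = np.asarray(probs)
    bins = np.arange(0.0, 1.0 + bin_width, bin_width)
    counts, _ = np.histogram(probs, bins=bins)

    # Walk from the left while each bin is much larger than the next one.
    p1_idx = 0
    while p1_idx < len(counts) - 1 and counts[p1_idx] > ratio * counts[p1_idx + 1]:
        p1_idx += 1
    p1 = bins[p1_idx]

    # Walk from the right symmetrically.
    p2_idx = len(counts)
    while p2_idx > 1 and counts[p2_idx - 1] > ratio * counts[p2_idx - 2]:
        p2_idx -= 1
    p2 = bins[p2_idx]

    keep_mask = ~((probs > p1) & (probs < p2))   # True for samples kept in the training set
    return p1, p2, keep_mask
```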
Fig. 5(a) shows a frame of the video numbered 15_ep03_02 in the CASME II database; the subject's head is tilted at a significant angle and there is much interfering information such as background, hair and earphones. Preprocessing of the original video is divided into three steps: face detection, face alignment and micro-expression region cropping. Fig. 5(b) shows the face bounding rectangle extracted with the Dlib frontal face detector. Next, 68 facial feature points are detected within the rectangle using a residual neural network facial landmark detection model.
Fig. 6 shows the face alignment. The two outer eye corners are feature points 36 and 45, and the face deflection angle can be calculated from the horizontal and vertical coordinates of these two points. Let the coordinates of key points 36 and 45 be (x1, y1) and (x2, y2), respectively. Then the
horizontal difference is:
dx = x2 - x1, (9)
the vertical difference is:
dy = y2 - y1, (10)
and the face deflection angle is:
angle = arctan(dy / dx). (11)
the affine matrix is calculated by angle to carry out affine transformation, so that an image with aligned faces as shown in fig. 7 (a) can be obtained.
As can be seen from Fig. 7(a), the face-aligned image still contains considerable noise, such as the glasses frame and the hair at the four corners of the image; other subjects may also show interference such as clothing collars and earphone cables (see Fig. 8(b)). The intra-class distance caused by such noise can exceed the already small inter-class distance between micro-expression frames and neutral frames. To minimize the intra-class distance, the image needs to be further cropped. Guided by the Facial Action Coding System (FACS) encoding of the relevant micro-expressions, the image is further cropped according to two principles: retain to the greatest extent the action units contained in the CASME II micro-expressions, and reduce noise interference to the greatest extent. The optimal cropping parameters were determined by trial and error, and the final result is shown in Fig. 7(b). All of the more than 320,000 frames in the CASME II, SMIC-E-HS and CAS(ME)^2 databases are preprocessed according to this flow.
Fig. 8(a) shows some original video frames from the CASME II database, Fig. 8(b) the corresponding Dlib face detections, and Fig. 8(c) the result of the preprocessing method of the present invention. Comparing Fig. 8(a) and Fig. 8(c) shows that the preprocessing method of the present invention accurately extracts the facial micro-expression region from the original video and effectively removes most of the noise interference affecting micro-expression detection.
After MesNet finishes training, the test set samples are input to obtain, for each sample, the probability ŷ^(i) that it belongs to the positive-sample micro-expression frame class. If ŷ^(i) ≥ 0.5, the sample is judged to be a micro-expression frame and the output label is 1; if ŷ^(i) < 0.5, it is judged to be a neutral frame and the output label is 0. Based on the true labels of the test set and the labels predicted by MesNet, an ROC curve can be plotted and the AUC value computed. The higher the AUC value, the better the model performance.
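A hedged sketch of this evaluation step, assuming the Keras model from the earlier sketches and scikit-learn for the ROC/AUC computation; x_test and y_test are assumed to be the preprocessed test frames and their ground-truth labels:

```python
from sklearn.metrics import roc_auc_score, roc_curve

probs = model.predict(x_test).ravel()        # probability of being a micro-expression frame
pred_labels = (probs >= 0.5).astype(int)     # 1: micro-expression frame, 0: neutral frame

auc = roc_auc_score(y_test, probs)           # area under the ROC curve
fpr, tpr, _ = roc_curve(y_test, probs)       # points for plotting the ROC curve
print(f"AUC = {auc:.3f}")
```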
Experimental results
To show that the method of the invention achieves a higher micro-expression detection AUC value, it is compared with other methods; the comparison results are shown in the table below. The references for the other methods in the table are as follows:
[1] DAVISON A K, LANSLEY C, NG C C, et al. Objective Micro-Facial Movement Detection Using FACS-Based Regions and Baseline Evaluation[C]//2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018: 642-649.
[2] QU F, WANG S J, YAN W J, et al. CAS(ME)^2: A Database for Spontaneous Macro-expression and Micro-expression Spotting and Recognition[J]. IEEE Transactions on Affective Computing, 2018, 9(4): 424-436.
[3] WANG S J, WU S, QIAN X, et al. A main directional maximal difference analysis for spotting facial movements from long-term videos[J]. Neurocomputing, 2017, 230: 382-389.
[4] LI X, HONG X, MOILANEN A, et al. Towards Reading Hidden Emotions: A Comparative Study of Spontaneous Micro-Expression Spotting and Recognition Methods[J]. IEEE Transactions on Affective Computing, 2018, 9(4): 563-577.
[5] DUQUE C A, ALATA O, EMONET R, et al. Micro-Expression Spotting Using the Riesz Pyramid[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018: 66-74.
(Table: comparison of micro-expression detection AUC values between MesNet and the methods of references [1]-[5] on the CASME II, SMIC-E-HS and CAS(ME)^2 databases.)
As can be seen from the table, the AUC values of MesNet on the CASME II, SMIC-E-HS and CAS(ME)^2 databases all surpass those of the existing methods. Besides higher accuracy, MesNet also has the advantage of a wide range of application compared with other methods. MesNet places no restriction on the length of the input video: it is suitable not only for the short videos of CASME II and SMIC-E-HS but also for the long videos of the CAS(ME)^2 database. In contrast, the methods proposed in references [1], [4] and [5] are verified only on the short videos of CASME II or SMIC-E-HS, and the methods of references [2] and [3] are verified only on the CAS(ME)^2 long video database.
To demonstrate the effectiveness of the adaptive transition frame removal method, a comparison experiment with and without adaptive transition frame removal is set up; the AUC comparison results are shown in the table below.
(Table: comparison of micro-expression detection AUC values on the three databases with and without the adaptive transition frame removal method.)
As can be seen from the table, adopting the adaptive transition frame removal method effectively improves the micro-expression detection AUC value on all three databases.
While the foregoing has described specific embodiments of the present invention in detail, it will be appreciated by those of ordinary skill in the art that variations and modifications may be made without departing from the scope of the invention as set forth in the appended claims.

Claims (6)

1. A micro-expression detection method based on a deep network with adaptive transition frame removal, comprising network construction, network training and micro-expression detection, characterized in that:
the network construction specifically comprises:
step S1: selecting a pre-trained CNN model on an ImageNet database, and reserving a convolution layer and pre-training parameters;
step S2: adding a full connection layer after the CNN model;
step S3: adding an output layer and a logistic classifier after the fully connected layer, the completed network being named the MesNet network;
the network training specifically comprises the following steps:
step S1: preprocessing data of an original video to remove noise interference affecting microexpressive detection;
step S2: removing transition frames from the training set using an adaptive transition frame removal method;
step S3: inputting the micro-expression frames and neutral frame samples without the transition frames into a MesNet network for training;
the method for adaptively removing the transition frame specifically comprises the following steps:
step S1: dividing the training set into a confidence sample and a transition frame sample to be removed;
step S2: training a MesNet network through a confidence sample to obtain a micro expression frame and a neutral frame classification model;
step S3: predicting transition frame samples to be removed by using a binary classification model to obtain the probability that each sample belongs to a positive sample micro expression frame;
step S4: adaptively determining a threshold value for screening the transition frames through a transition frame sample probability distribution diagram to be removed, so that the transition frames are removed;
the micro expression detection specifically comprises the following steps:
step S1: preprocessing data of the original video of the test set;
step S2: inputting the preprocessed samples into the trained MesNet network to obtain predicted label values.
2. The micro-expression detection method based on a deep network with adaptive transition frame removal according to claim 1, wherein:
and in the network construction stage, a pre-trained acceptance-ResNet-V2 model is used as a basis, a full-connection layer containing 512 neurons and an output layer containing 1 neuron are added, and a micro-expression frame and neutral frame classification network is constructed for detecting micro-expressions from videos.
3. The micro-expression detection method based on a deep network with adaptive transition frame removal according to claim 1, wherein:
the network training stage, the data preprocessing of the original video comprises face detection, face alignment and micro expression region cutting;
the human face detection is to extract a human face rectangular frame by using a Dlib forward human face detector, and detect 68 human face feature points in the rectangular frame by using a residual neural network human face feature point detection model;
the face alignment is that the face deflection angle is determined by calculating the horizontal difference and the vertical difference of the two outer eye angles, and affine transformation is performed by calculating an affine matrix by utilizing the face deflection angle, so that the face alignment is completed.
4. The micro-expression detection method based on a deep network with adaptive transition frame removal according to claim 1, wherein:
the transition frame is provided with a noise label, and the transition frame can be identified and removed by a self-adaptive transition frame removing method.
5. The micro-expression detection method based on a deep network with adaptive transition frame removal according to claim 1, wherein:
after the MesNet network finishes training, inputting the test set samples, and obtaining the probability that each sample belongs to a positive sample micro-expression frame; if the probability is more than or equal to 0.5, judging the frame as a micro-expression frame, and outputting a label as 1; if the probability is less than 0.5, judging the frame as a neutral frame, and outputting a label as 0; according to the real label of the test set and the MesNet network prediction label, making an ROC curve graph and calculating an AUC value; the higher the AUC value, the better the MesNet network performance.
6. The micro-expression detection method based on a deep network with adaptive transition frame removal according to claim 1, wherein:
in the micro-expression detection process, the MesNet network places no restriction on the length of the input video: it is suitable both for short videos of tens of frames and for long videos of thousands of frames.
CN202010092959.8A 2020-02-14 2020-02-14 Micro-expression detection method based on self-adaptive transition frame depth network removal Active CN111325131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010092959.8A CN111325131B (en) 2020-02-14 2020-02-14 Micro-expression detection method based on self-adaptive transition frame depth network removal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010092959.8A CN111325131B (en) 2020-02-14 2020-02-14 Micro-expression detection method based on self-adaptive transition frame depth network removal

Publications (2)

Publication Number Publication Date
CN111325131A CN111325131A (en) 2020-06-23
CN111325131B true CN111325131B (en) 2023-06-23

Family

ID=71171012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010092959.8A Active CN111325131B (en) 2020-02-14 2020-02-14 Micro-expression detection method based on self-adaptive transition frame depth network removal

Country Status (1)

Country Link
CN (1) CN111325131B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530648A (en) * 2013-10-14 2014-01-22 四川空港知觉科技有限公司 Face recognition method based on multi-frame images
CN106803909A (en) * 2017-02-21 2017-06-06 腾讯科技(深圳)有限公司 The generation method and terminal of a kind of video file
CN107679526A (en) * 2017-11-14 2018-02-09 北京科技大学 A kind of micro- expression recognition method of face

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8848068B2 (en) * 2012-05-08 2014-09-30 Oulun Yliopisto Automated recognition algorithm for detecting facial expressions
EP2960905A1 (en) * 2014-06-25 2015-12-30 Thomson Licensing Method and device of displaying a neutral facial expression in a paused video

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530648A (en) * 2013-10-14 2014-01-22 四川空港知觉科技有限公司 Face recognition method based on multi-frame images
CN106803909A (en) * 2017-02-21 2017-06-06 腾讯科技(深圳)有限公司 The generation method and terminal of a kind of video file
CN107679526A (en) * 2017-11-14 2018-02-09 北京科技大学 A kind of micro- expression recognition method of face

Also Published As

Publication number Publication date
CN111325131A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
Yang et al. Deep multimodal representation learning from temporal data
Littlewort et al. Dynamics of facial expression extracted automatically from video
US9530048B2 (en) Automated facial action coding system
CN111797683A (en) Video expression recognition method based on depth residual error attention network
CN109543526A (en) True and false facial paralysis identifying system based on depth difference opposite sex feature
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN109299690B (en) Method capable of improving video real-time face recognition precision
CN111666845B (en) Small sample deep learning multi-mode sign language recognition method based on key frame sampling
CN112560810A (en) Micro-expression recognition method based on multi-scale space-time characteristic neural network
Khatri et al. Facial expression recognition: A survey
CN111967354B (en) Depression tendency identification method based on multi-mode characteristics of limbs and micro-expressions
Zhao et al. Applying contrast-limited adaptive histogram equalization and integral projection for facial feature enhancement and detection
CN112949560A (en) Method for identifying continuous expression change of long video expression interval under two-channel feature fusion
Bartlett et al. Towards automatic recognition of spontaneous facial actions
CN111680660A (en) Human behavior detection method based on multi-source heterogeneous data stream
Suh et al. Adversarial deep feature extraction network for user independent human activity recognition
Lee et al. Face and facial expressions recognition system for blind people using ResNet50 architecture and CNN
Chang et al. Using gait information for gender recognition
CN111325131B (en) Micro-expression detection method based on self-adaptive transition frame depth network removal
Mao et al. Robust facial expression recognition based on RPCA and AdaBoost
CN106709442B (en) Face recognition method
Khan et al. Traditional features based automated system for human activities recognition
Lee et al. Recognition of facial emotion through face analysis based on quadratic bezier curves
Hema et al. Gait energy image projections based on gender detection using support vector machines
CN113408389A (en) Method for intelligently recognizing drowsiness action of driver

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant