CN115049957A - Micro-expression identification method and device based on contrast amplification network - Google Patents
Micro-expression identification method and device based on contrast amplification network
- Publication number
- CN115049957A (application number CN202210605395.2A)
- Authority: CN (China)
- Prior art keywords: micro, sample, expression, contrast, source
- Prior art date: 2022-05-31
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/30—Determination of transform parameters for the alignment of images, i.e. image registration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
- G06V40/176—Dynamic expression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
Abstract
The invention discloses a micro-expression identification method and device based on a contrast amplification network, wherein the method comprises the following steps: (1) acquiring a micro-expression database; (2) converting each micro-expression video into a micro-expression frame sequence and, after preprocessing, sampling frames from it as source samples; (3) for each source sample, calculating the distance between the source sample and the remaining frames in an embedding space, and mapping the distances into probabilities to obtain a distance probability distribution; (4) sampling a plurality of frames from the remaining video frames as negative samples according to the probability distribution; (5) constructing a contrast amplification network; (6) applying data enhancement to each source sample to form an anchor sample and a positive sample, and inputting the anchor sample, the positive sample and the corresponding negative samples as training samples into the contrast amplification network for training, the loss function being the sum of the intra-video contrast loss, the inter-class contrast loss and the cross-entropy loss; (7) preprocessing the micro-expression video to be identified, inputting it into the trained contrast amplification network, and identifying the micro-expression category. The invention achieves higher accuracy and is more convenient.
Description
Technical Field
The invention relates to image processing technology, and in particular to a micro-expression identification method and device based on a contrast amplification network.
Background
Micro-expressions are spontaneous, transient facial expressions that reveal the true emotional state a person is trying to hide. For this reason, recognizing micro-expressions is of great significance for affective computing and psychotherapy. However, micro-expressions are induced by subtle facial movements and last much shorter than macroscopic facial expressions, so automatic micro-expression recognition is a challenging task.
Micro-expressions are a dynamic process of facial muscle movement, and capturing this motion is essential for accurately identifying them. However, because the intensity of micro-expressions is low, extracting motion features is very difficult. To address this problem, many methods employ motion magnification techniques to amplify the micro-expression and make the facial motion more pronounced. Conventional and learning-based magnification techniques, such as Eulerian motion magnification (EMM), global Lagrangian motion magnification (GLMM) and learning-based motion magnification (LMM), have demonstrated that amplifying the micro-expression intensity can further enhance recognition performance. Despite this advantage, existing magnification methods are not yet fully adaptive. When different subjects show different expression states, the trends of facial muscle change are inconsistent, and a single amplification level cannot suit all micro-expression samples. For example, the same magnification factor may be insufficient for one micro-expression sample, so that the amplified expression still does not show its emotion category, while being excessive for another sample and thereby introducing noise. Therefore, the accuracy of prior-art amplification-based micro-expression recognition still needs to be improved.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a micro-expression identification method and device based on a contrast amplification network with higher accuracy.
The technical scheme is as follows: the micro-expression identification method based on the contrast amplifying network comprises the following steps:
(1) acquiring a micro-expression database, wherein the micro-expression database comprises a plurality of micro-expression videos and corresponding micro-expression category labels;
(2) converting each micro-expression video in the database into a micro-expression frame sequence, preprocessing the micro-expression frame sequence, and sampling a fixed number of frames from each micro-expression frame sequence as source samples;
(3) for each source sample, calculating the distance in the embedding space between the source sample and the remaining frames of the frame sequence it belongs to, and mapping the distances into probabilities to obtain the distance probability distribution between each source sample and the remaining frames of the sequence;
(4) for each source sample, sampling a plurality of frames from the rest video frames according to the calculated probability distribution as corresponding negative samples;
(5) constructing a contrast amplifying network, which comprises Resnet-18 without a full connection layer, a multi-layer perceptron connected behind the Resnet-18, and a softmax layer connected behind the multi-layer perceptron;
(6) performing data enhancement on each source sample in the micro-expression frame sequence to form a pair of samples, namely an anchor sample and a positive sample, inputting the anchor sample, the positive sample and a corresponding negative sample as training samples into a contrast amplification network for training, wherein a loss function during training is the sum of contrast loss in a video, contrast loss between classes and cross entropy loss, and optimizing the training through gradient descent;
(7) preprocessing the micro expression video to be identified according to the step (2), and inputting the preprocessed micro expression video into a trained contrast amplifying network to identify the micro expression category.
Further, the preprocessing in the step (2) includes face registration and face region cropping.
Further, the sampling a fixed number of frames from each micro-expression frame sequence as a source sample in step (2) specifically includes:
for a preprocessed micro-expression frame sequence, sampling N frames as source samples according to uniform distribution, wherein the form of uniform distribution is as follows:
where f(n) represents the probability density function used to sample the source samples, n represents the index of a sampled source sample, and N represents the number of sampled source samples.
Further, the step (3) specifically comprises:
(3-1) for each micro-expression frame sequence, extracting the feature vectors u of the source samples and of the remaining frames with a ResNet-18 network model pre-trained on the macro-expression database FER+:
u_i = g(x_i),  u_i^j = g(x_i^j),  i = 1, …, N,  j = 1, …, N_i
where g(·) is a neural network, N represents the number of source samples, x_i and x_i^j represent the i-th source sample and the j-th remaining frame associated with x_i, u_i and u_i^j denote their respective feature vectors, and N_i represents the number of remaining frames associated with x_i;
(3-2) for each micro-expression frame sequence, calculating, according to the following formula, the distance in the embedding space between the feature vectors of each source sample and those of the remaining frames of that sequence:
where d_i^j represents the distance between the i-th source sample and its j-th remaining frame in the embedding space;
(3-3) mapping, for each sequence of micro-expression frames, different distances into probabilities using a softmax function according to the following formula, thereby obtaining a distance probability distribution of each source sample from the remaining frames:
where p_i^j represents the probability obtained by mapping d_i^j, and P represents the distance probability distribution between the source sample and the remaining frames.
Further, the step (4) specifically comprises: for each source sample, sampling from the distance probability distribution, and taking the video frames corresponding to the sampled probabilities as negative samples.
Further, the loss function in step (6) is specifically:
where the total loss is the weighted sum of the three terms; λ_1, λ_2 and λ_3 are the weights of the three loss functions, namely the intra-video contrast loss, the inter-class contrast loss and the cross-entropy loss.
Further, the specific function of contrast loss in the video is as follows:
where I is the micro-expression database; the feature vectors in the loss are the outputs of the multi-layer perceptron after the positive sample, the anchor sample and the j-th negative sample of the i-th source sample of the k-th micro-expression video in I are fed into the contrast amplifying network, the positive and anchor samples being obtained by data enhancement; τ is a temperature coefficient, S is the number of negative samples, and N is the number of source samples.
Further, the specific function of the inter-class contrast loss is as follows:
where I is the micro-expression database; P(k) ≡ {p ∈ A(k) : y_p = y_k} denotes the set of samples in A(k) that have the same micro-expression class label as the current sample k; A(k) denotes the set formed by the positive sample of the current sample and the corresponding negative samples; v = LSTM(z_1, z_2, …, z_N), where LSTM(z_1, z_2, …, z_N) denotes integrating z_1, z_2, …, z_N into a single feature vector of the video sample with a long short-term memory network (LSTM); z_1, z_2, …, z_N denote the feature vectors output by the multi-layer perceptron after the 1st, 2nd, …, N-th source samples are fed into the contrast amplifying network; τ is a temperature coefficient; and v_k, v_p, v_a denote, respectively, the LSTM-integrated feature vector of the current video sample with subscript k, the feature vectors of the samples in the same batch that have the same label as v_k, and the feature vectors of the samples in the same batch other than v_k.
Further, the cross entropy loss specific function is as follows:
where C is the number of micro-expression categories, y_c is the label value for class c, and p_c is the predicted probability of belonging to class c.
The micro-expression recognition device based on the contrast amplification network comprises a processor and a computer program stored on a memory and executable on the processor, and the processor implements the above method when executing the program.
Beneficial effects: compared with the prior art, the invention has the following remarkable advantages. By adopting the idea of contrastive learning and constructing positive and negative samples, the network optimization process concentrates more on distinguishing the differences between positive and negative samples and enlarges their distance in the embedding space, which makes the intensity contrast and the category contrast more obvious and directly yields more discriminative features. As a result, the network perceives motion change more readily, the recognition accuracy is improved, fewer manually set hyper-parameters are needed, and the method is more convenient.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a micro expression recognition method based on a contrast amplifying network according to the present invention;
FIG. 2 is a schematic diagram of a contrast amplifying network designed by the present invention;
FIG. 3 is a schematic diagram of the negative sample generation process.
Detailed Description
The embodiment provides a micro-expression recognition method based on a contrast amplification network, as shown in FIG. 1, including:
(1) acquiring a micro-expression database, wherein the micro-expression database comprises a plurality of micro-expression videos and corresponding micro-expression category labels. To improve recognition accuracy, multiple common databases may be employed.
(2) Each micro-expression video in the database is converted into a micro-expression frame sequence, the micro-expression frame sequence is preprocessed, and a fixed number of frames are sampled from each micro-expression frame sequence to serve as source samples.
Wherein the preprocessing comprises face registration and face region cropping. The sampling of a fixed number of frames from each micro-expression frame sequence as source samples specifically comprises:
for a preprocessed micro-expression frame sequence, sampling N frames as source samples according to uniform distribution, wherein the form of uniform distribution is as follows:
where f(n) represents the probability density function used to sample the source samples, n represents the index of a sampled source sample, and N represents the number of sampled source samples.
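The uniform-distribution formula itself is not reproduced above; a plausible reconstruction, assuming each of the N source-sample indices is drawn with equal probability, is:

```latex
% Assumption: each source-sample index n is equally likely.
f(n) = \frac{1}{N}, \qquad n = 1, 2, \dots, N
```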
(3) For each source sample, calculating the distance in the embedding space between the source sample and the remaining frames of its frame sequence, and mapping the distances into probabilities to obtain the distance probability distribution between each source sample and the remaining frames of the sequence.
The method specifically comprises the following steps:
(3-1) extracting feature vectors u of the source samples and the residual frames by using a pre-trained neural network for each micro expression frame sequence:
u_i = g(x_i),  u_i^j = g(x_i^j),  i = 1, ..., N,  j = 1, ..., N_i
where g(·) is a neural network, which may be, for example, a ResNet-18 pre-trained on a macro-expression database, the pre-training using the FER+ dataset; N denotes the number of source samples; x_i and x_i^j represent the i-th source sample and the j-th remaining frame associated with x_i; u_i and u_i^j denote their respective feature vectors; and N_i represents the number of remaining frames associated with x_i;
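A minimal sketch of such a frame-level feature extractor g(·), assuming PyTorch/torchvision; ImageNet weights are used here only as a stand-in for the FER+ pre-training described above, and the 224×224 input size is an assumption:

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameEncoder(nn.Module):
    """g(.): maps a preprocessed face frame to an embedding vector u."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Identity()      # drop the classification head
        self.backbone = backbone

    @torch.no_grad()
    def forward(self, frames):           # frames: (B, 3, 224, 224) tensor, assumed
        return self.backbone(frames)     # (B, 512) feature vectors
```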
(3-2) for each micro-expression frame sequence, calculating, according to the following formula, the distance in the embedding space between the feature vectors of each source sample and those of the remaining frames of that sequence:
where d_i^j represents the distance between the i-th source sample and its j-th remaining frame in the embedding space;
(3-3) mapping the different distances into probabilities using a softmax function for each sequence of micro-expression frames according to the following equation, thereby obtaining a distance probability distribution of each source sample from the rest frames:
where p_i^j represents the probability obtained by mapping d_i^j, and P represents the distance probability distribution between the source sample and the remaining frames.
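The formulas referred to in (3-2) and (3-3) are not reproduced above. A plausible reconstruction, assuming a Euclidean distance in the embedding space and a softmax taken over the raw distances of each source sample, is:

```latex
% Assumptions: Euclidean embedding distance; softmax over distances, so frames
% farther from the source sample receive a higher sampling probability.
d_i^j = \left\| u_i - u_i^j \right\|_2, \qquad
p_i^j = \frac{\exp\!\left(d_i^j\right)}{\sum_{j'=1}^{N_i}\exp\!\left(d_i^{j'}\right)}, \qquad
P = \left\{ p_i^j \right\}_{j=1}^{N_i}
```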
Thus, a probability distribution over the source sample and all remaining video frames, with the distance as a discrete variable, is obtained: the farther a frame is from the source sample at the feature level, the greater its probability of being selected as a negative sample. Within the same micro-expression sequence, the differences between frames are in fact reflected only in the intensity information, and the greater the distance between frames, the more obvious the intensity difference. By computing the distances between the source sample and the remaining video frames, it is ensured that frames that are farther away, i.e. frames with larger intensity differences, are sampled as negative samples with higher probability.
(4) For each source sample, a number of frames are sampled from the remaining video frames as corresponding negative samples according to the calculated probability distribution, as shown in fig. 3.
Step (4) specifically comprises: for each source sample, sampling from the distance probability distribution, and taking the video frames corresponding to the sampled probabilities as negative samples.
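A minimal sketch of this negative-sample selection (steps (3)–(4)), assuming a Euclidean embedding distance and a softmax over raw distances (both are assumptions, since the exact formulas are not reproduced above):

```python
import numpy as np

def sample_negatives(u_src, u_rest, num_negatives, rng=None):
    """Sample negative frames for one source sample.

    u_src  : (D,)   embedding of the source sample
    u_rest : (M, D) embeddings of the remaining frames of the same video
    Frames farther from the source sample in the embedding space are
    sampled with higher probability.
    """
    rng = np.random.default_rng() if rng is None else rng
    dists = np.linalg.norm(u_rest - u_src, axis=1)   # assumed Euclidean distance d_i^j
    shifted = dists - dists.max()                    # numerically stable softmax
    probs = np.exp(shifted) / np.exp(shifted).sum()  # p_i^j
    idx = rng.choice(len(u_rest), size=num_negatives, replace=False, p=probs)
    return idx

# usage (illustrative): neg_idx = sample_negatives(u_i, u_rest_i, num_negatives=4)
```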
(5) A contrast amplifying network is constructed, as shown in FIG. 2, comprising a ResNet-18 without its fully connected layer, a multi-layer perceptron connected behind it, and a softmax layer connected behind the multi-layer perceptron.
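A minimal sketch of this architecture, assuming PyTorch; the hidden width and projection dimension of the multi-layer perceptron and the number of classes are illustrative choices not specified above:

```python
import torch.nn as nn
from torchvision import models

class ContrastAmplificationNet(nn.Module):
    """ResNet-18 without its FC layer, followed by an MLP (Proj) and a softmax classifier."""
    def __init__(self, num_classes=5, proj_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                    # ResNet-18 without the fully connected layer
        self.encoder = backbone                        # Enc(.)
        self.projector = nn.Sequential(                # Proj(.): multi-layer perceptron
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, proj_dim),
        )
        self.classifier = nn.Linear(proj_dim, num_classes)

    def forward(self, x):                              # x: (B, 3, H, W) face frames
        h = self.encoder(x)
        z = self.projector(h)                          # feature vector used by the contrast losses
        probs = self.classifier(z).softmax(dim=1)      # softmax layer for classification
        return z, probs
```

During training, the projector output z feeds the two contrast losses, while the softmax output feeds the cross-entropy loss.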
(6) Data enhancement is carried out on each source sample in the micro-expression frame sequence to form a pair of samples, namely an anchor sample and a positive sample, the anchor sample, the positive sample and the corresponding negative sample are used as training samples to be input into a contrast amplification network for training, a loss function in the training process is the sum of contrast loss in a video, contrast loss between classes and cross entropy loss, and training is optimized through gradient descent.
Wherein the loss function is specifically:
where the total loss is the weighted sum of the three terms; λ_1, λ_2 and λ_3 are the weights of the three loss functions, namely the intra-video contrast loss, the inter-class contrast loss and the cross-entropy loss.
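The total-loss formula is not reproduced above; from the description it is the weighted sum of the three terms, which can be written as (the symbols L_vid, L_cls and L_ce are introduced here only for readability):

```latex
\mathcal{L} = \lambda_1\,\mathcal{L}_{\mathrm{vid}}
            + \lambda_2\,\mathcal{L}_{\mathrm{cls}}
            + \lambda_3\,\mathcal{L}_{\mathrm{ce}}
```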
Each source sample obtained by sampling the micro-expression frame sequence is denoted x, and a pair of samples is obtained from it by data enhancement. In the actual calculation, the two enhanced samples serve as anchor and positive samples of each other: when one is used as the anchor sample, the other is the positive sample, and vice versa. The anchor sample and the positive sample should carry the same intensity information and may differ only in color, style, and so on, so Gaussian noise and random grayscale transformation are chosen for data enhancement. In the contrast amplification network, the ResNet-18 without the fully connected layer is denoted Enc(·) and the multi-layer perceptron is denoted Proj(·). The enhanced samples and all negative samples corresponding to the source sample are fed into the contrast amplification network, and the loss function computed at the feature level is the intra-video contrast loss, whose specific form is as follows:
where I is the micro-expression database; the feature vectors in the loss are the outputs of the multi-layer perceptron after the positive sample, the anchor sample and the j-th negative sample of the i-th source sample of the k-th micro-expression video in I are fed into the contrast amplification network, the positive and anchor samples being obtained by data enhancement; τ is a temperature coefficient, S is the number of negative samples, and N is the number of source samples.
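The exact intra-video contrast loss is not reproduced above. A plausible reconstruction, assuming the standard InfoNCE form with the anchor–positive pair contrasted against the S negative frames of the same video, is:

```latex
% Assumption: InfoNCE-style loss; z^{a}_{k,i}, z^{+}_{k,i}, z^{-j}_{k,i} are the MLP outputs
% for the anchor, positive and j-th negative of the i-th source sample of video k.
\mathcal{L}_{\mathrm{vid}} = -\sum_{k \in I}\sum_{i=1}^{N}
\log\frac{\exp\!\big(z^{a}_{k,i}\cdot z^{+}_{k,i}/\tau\big)}
{\exp\!\big(z^{a}_{k,i}\cdot z^{+}_{k,i}/\tau\big)
 + \sum_{j=1}^{S}\exp\!\big(z^{a}_{k,i}\cdot z^{-j}_{k,i}/\tau\big)}
```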
The specific function of the inter-class contrast loss is as follows:
where I is the micro-expression database; P(k) ≡ {p ∈ A(k) : y_p = y_k} denotes the set of samples in A(k) that have the same micro-expression class label as the current sample k; A(k) denotes the set formed by the positive sample of the current sample and the corresponding negative samples; v = LSTM(z_1, z_2, …, z_N), where LSTM(z_1, z_2, …, z_N) denotes integrating z_1, z_2, …, z_N into a single feature vector of the video sample with a long short-term memory network (LSTM); z_1, z_2, …, z_N denote the feature vectors output by the multi-layer perceptron after the 1st, 2nd, …, N-th source samples are fed into the contrast amplification network; τ is a temperature coefficient; and v_k, v_p, v_a denote, respectively, the LSTM-integrated feature vector of the current video sample with subscript k, the feature vectors of the samples in the same batch that have the same label as v_k, and the feature vectors of the samples in the same batch other than v_k.
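The exact inter-class contrast loss is likewise not reproduced above. Given the stated definitions of P(k), A(k), v_k, v_p and v_a, a plausible reconstruction following the supervised contrastive loss form is:

```latex
% Assumption: supervised-contrastive form over the LSTM-integrated video features.
\mathcal{L}_{\mathrm{cls}} = \sum_{k \in I}\frac{-1}{|P(k)|}
\sum_{p \in P(k)}\log
\frac{\exp\!\big(v_k \cdot v_p/\tau\big)}
     {\sum_{a \in A(k)}\exp\!\big(v_k \cdot v_a/\tau\big)}
```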
The inter-class loss alone does not let the network know the specific class of a sample, so a softmax function is needed to guide the classification; the specific function is as follows:
where C is the number of micro-expression categories, y_c is the label value for class c, and p_c is the predicted probability of belonging to class c.
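From the stated definitions of C, y_c and p_c, the cross-entropy term takes the standard form (a reconstruction, as the formula itself is not reproduced above):

```latex
\mathcal{L}_{\mathrm{ce}} = -\sum_{c=1}^{C} y_c \log p_c
```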
(7) Preprocessing the micro-expression video to be identified according to step (2), and inputting the preprocessed video into the trained contrast amplifying network to identify the micro-expression category.
The embodiment also provides a micro-expression recognition device based on a contrast amplification network, which comprises a processor and a computer program stored on a memory and executable on the processor; the processor implements the above method when executing the computer program.
In order to verify the effectiveness of the present invention, micro-expression recognition is performed on the CASME2 micro-expression database, the SAMM micro-expression database and the HS sub-database of the SMIC database, and the verification results and a comparison with other recent methods are shown in Table 1:
TABLE 1
Method | Year | CASME2 | SAMM | SMIC-HS
---|---|---|---|---
LBP-SIP | 2014 | 45.36 | 36.76 | 42.12
MagGA | 2018 | 63.30 | N/A | N/A
DSSN | 2019 | 70.78 | 57.35 | 63.41
TSCNN-I | 2020 | 74.05 | 63.53 | 72.74
LBPAccP u2 | 2021 | 69.03 | N/A | 76.59
AU-GCN | 2021 | 74.27 | 74.26 | N/A
Proposed method (the invention) | 2022 | 79.03 | 77.21 | 77.91
In Table 1, N/A indicates that there is no relevant record.
The expressions of the CASME2 database are processed as follows: categories with fewer than 10 samples are omitted to avoid a severe class-imbalance problem, and a five-class recognition task (happiness, repression, disgust, fear and surprise) is completed. The expressions of the SAMM database are processed in the same way: categories with fewer than 10 samples are omitted to avoid severe class imbalance, and a five-class recognition task (happiness, anger, disgust, fear and surprise) is completed. The SMIC database is divided into positive, negative and surprise, and the corresponding three-class recognition task is completed.
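A minimal sketch of this class filtering (the threshold of 10 comes from the text above; the data representation is an assumption):

```python
from collections import Counter

def filter_rare_classes(samples, min_count=10):
    """samples: list of (video, label) pairs; drop classes with fewer than min_count samples."""
    counts = Counter(label for _, label in samples)
    kept = {label for label, n in counts.items() if n >= min_count}
    return [(video, label) for video, label in samples if label in kept]
```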
Experimental results show that the micro-expression recognition method provided by the invention achieves a higher micro-expression recognition rate. Compared with traditional micro-expression amplification schemes, the method avoids the complexity of manually setting some of the hyper-parameters, adapts better to individual subjects, and is more convenient.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (10)
1. A micro expression recognition method based on a contrast amplifying network is characterized by comprising the following steps:
(1) acquiring a micro-expression database, wherein the micro-expression database comprises a plurality of micro-expression videos and corresponding micro-expression category labels;
(2) converting each micro-expression video in the database into a micro-expression frame sequence, preprocessing the micro-expression frame sequence, and sampling a fixed number of frames from each micro-expression frame sequence as source samples;
(3) for each source sample, calculating the distance in the embedding space between the source sample and the remaining frames of the frame sequence it belongs to, and mapping the distances into probabilities to obtain the distance probability distribution between each source sample and the remaining frames of the sequence;
(4) for each source sample, sampling a plurality of frames from the remaining video frames as corresponding negative samples according to the calculated probability distribution;
(5) constructing a contrast amplifying network, which comprises Resnet-18 without a full connection layer, a multi-layer perceptron connected behind the Resnet-18, and a softmax layer connected behind the multi-layer perceptron;
(6) performing data enhancement on each source sample in the micro-expression frame sequence to form a pair of samples, namely an anchor sample and a positive sample, inputting the anchor sample, the positive sample and a corresponding negative sample as training samples into a contrast amplification network for training, wherein a loss function during training is the sum of contrast loss in a video, contrast loss between classes and cross entropy loss, and optimizing the training through gradient descent;
(7) preprocessing the micro expression video to be identified according to the step (2), and inputting the preprocessed micro expression video into a trained contrast amplifying network to identify the micro expression category.
2. The micro expression recognition method based on the contrast amplifying network according to claim 1, wherein: the preprocessing in the step (2) comprises face registration and face region cropping.
3. The micro expression recognition method based on the contrast amplifying network according to claim 1, wherein: sampling a fixed number of frames from each micro expression frame sequence as source samples in the step (2) specifically comprises:
for a preprocessed micro-expression frame sequence, sampling N frames as source samples according to uniform distribution, wherein the form of uniform distribution is as follows:
where f(n) represents the probability density function used to sample the source samples, n represents the index of a sampled source sample, and N represents the number of sampled source samples.
4. The micro expression recognition method based on the contrast amplifying network as claimed in claim 1, wherein: the step (3) specifically comprises the following steps:
(3-1) extracting feature vectors u of the source sample and the residual frame by using a pre-trained neural network for each micro-expression frame sequence:
u_i = g(x_i),  u_i^j = g(x_i^j),  i = 1, …, N,  j = 1, …, N_i
where g(·) is a neural network, N represents the number of source samples, x_i and x_i^j represent the i-th source sample and the j-th remaining frame associated with x_i, u_i and u_i^j denote their respective feature vectors, and N_i represents the number of remaining frames associated with x_i;
(3-2) for each micro-expression frame sequence, calculating, according to the following formula, the distance in the embedding space between the feature vectors of each source sample and those of the remaining frames of that sequence:
where d_i^j represents the distance between the i-th source sample and its j-th remaining frame in the embedding space;
(3-3) mapping, for each sequence of micro-expression frames, different distances into probabilities using a softmax function according to the following formula, thereby obtaining a distance probability distribution of each source sample from the remaining frames:
5. The micro expression recognition method based on the contrast amplifying network according to claim 1, wherein: the step (4) specifically comprises the following steps:
for each source sample, sampling from the distance probability distribution, and taking the video frames corresponding to the sampled probabilities as negative samples.
6. The micro expression recognition method based on the contrast amplifying network according to claim 1, wherein: the loss function in the step (6) is specifically:
7. The micro expression recognition method based on the contrast amplifying network as claimed in claim 6, wherein: the specific function of contrast loss in the video is as follows:
where I is the micro-expression database; the feature vectors in the loss are the outputs of the multi-layer perceptron after the positive sample, the anchor sample and the j-th negative sample of the i-th source sample of the k-th micro-expression video in I are fed into the contrast amplifying network, the positive and anchor samples being obtained by data enhancement; τ is a temperature coefficient, S is the number of negative samples, and N is the number of source samples.
8. The micro expression recognition method based on the contrast amplifying network as claimed in claim 6, wherein: the inter-class contrast loss is specifically the following function:
where I is the micro-expression database; P(k) ≡ {p ∈ A(k) : y_p = y_k} denotes the set of samples in A(k) that have the same micro-expression class label as the current sample k; A(k) denotes the set formed by the positive sample of the current sample and the corresponding negative samples; v = LSTM(z_1, z_2, …, z_N), where LSTM(z_1, z_2, …, z_N) denotes integrating z_1, z_2, …, z_N into a single feature vector of the video sample with a long short-term memory network (LSTM); z_1, z_2, …, z_N denote the feature vectors output by the multi-layer perceptron after the 1st, 2nd, …, N-th source samples are fed into the contrast amplifying network; τ is a temperature coefficient; and v_k, v_p, v_a denote, respectively, the LSTM-integrated feature vector of the current video sample with subscript k, the feature vectors of the samples in the same batch that have the same label as v_k, and the feature vectors of the samples in the same batch other than v_k.
9. The micro expression recognition method based on the contrast amplifying network as claimed in claim 6, wherein: the cross entropy loss specific function is as follows:
where C is the number of micro-expression categories, y_c is the label value for class c, and p_c is the predicted probability of belonging to class c.
10. A micro expression recognition device based on a contrast amplifying network, characterized in that: it comprises a processor and a computer program stored on a memory and executable on the processor, wherein: the processor, when executing the program, implements the method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210605395.2A CN115049957B (en) | 2022-05-31 | 2022-05-31 | Micro-expression recognition method and device based on contrast amplification network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210605395.2A CN115049957B (en) | 2022-05-31 | 2022-05-31 | Micro-expression recognition method and device based on contrast amplification network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115049957A true CN115049957A (en) | 2022-09-13 |
CN115049957B CN115049957B (en) | 2024-06-07 |
Family
ID=83160189
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210605395.2A Active CN115049957B (en) | 2022-05-31 | 2022-05-31 | Micro-expression recognition method and device based on contrast amplification network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115049957B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180005272A1 (en) * | 2016-06-30 | 2018-01-04 | Paypal, Inc. | Image data detection for micro-expression analysis and targeted data services |
CN112200065A (en) * | 2020-10-09 | 2021-01-08 | 福州大学 | Micro-expression classification method based on action amplification and self-adaptive attention area selection |
CN113537008A (en) * | 2021-07-02 | 2021-10-22 | 江南大学 | Micro-expression identification method based on adaptive motion amplification and convolutional neural network |
CN114241573A (en) * | 2021-12-23 | 2022-03-25 | 华南师范大学 | Facial micro-expression recognition method and device, electronic equipment and storage medium |
-
2022
- 2022-05-31 CN CN202210605395.2A patent/CN115049957B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN115049957B (en) | 2024-06-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |