CN117392727A

CN117392727A - Facial micro-expression recognition method based on contrast learning and feature decoupling

Info

Publication number: CN117392727A
Application number: CN202311446975.2A
Authority: CN
Inventors: 于正洋; 陈晓娟; 曲畅; 李雪; 于皓宇; 张昭华
Original assignee: Changchun University of Science and Technology
Current assignee: Changchun University of Science and Technology
Priority date: 2023-11-02
Filing date: 2023-11-02
Publication date: 2024-01-12
Anticipated expiration: 2043-11-02
Also published as: CN117392727B

Abstract

The invention discloses a facial micro-expression recognition method based on contrast learning and feature decoupling, relates to the technical field of facial micro-expression recognition, and aims to solve the problems that an existing recognition method is inaccurate in micro-expression positioning, insufficient in sample number, incapable of effectively capturing fine changes and capable of recognizing identity information interference. The method comprises the steps of obtaining a starting frame, a peak frame and an offset frame, calculating difference features and double-end optical flow diagrams, expanding the difference features by utilizing triple loss, decoupling, inputting the optical flow diagrams into a self-attention micro-expression recognition network for training, calculating contrast loss, and finally splicing the difference features and the optical flow features to obtain recognition results. The method effectively captures the fine movements of the face, reduces the interference of identity information and enhances the degree of distinction between samples. The model is focused on the micro-expression action information by combining contrast learning and feature decoupling, and a more accurate and deep technical means is provided for the facial micro-expression recognition.

Description

Facial micro-expression recognition method based on contrast learning and feature decoupling

Technical Field

The invention relates to the technical field of facial microexpressive recognition, in particular to a facial microexpressive recognition method based on contrast learning and feature decoupling.

Background

Microexpressive recognition (Micro-Expression Recognition, MER) is the focus of research in the convergence of two major disciplines of psychology and computer vision. As a non-autonomous facial motion that occurs in a very short time, micro-expressions provide a valuable clue to us for insight into and analysis of the individual's intrinsic emotional state. The micro expression recognition shows great practical value in national security, early discovery of mental diseases and application of man-machine interaction. Nevertheless, achieving accurate recognition of micro-expressions has many technical challenges due to their inherent shortness, subtle differences and low intensity. The small scale of the existing micro-expression data sets and the difference between subjects all bring additional difficulties to develop efficient recognition algorithms. For this reason, deep learning methods gradually benefit from the ability of deep learning to automatically extract and recognize features on processing complex data in the field of microexpressive recognition. In general, microexpressive recognition has broad prospects from the viewpoint of the depth of academic research and the breadth of application. Further improving the accuracy of the micro-expression recognition technology and bringing substantial benefits to multiple fields

In the current micro-expression recognition technology field, although convolutional neural networks have been established as core solution strategies. But a great deal of related research appears to be focused on the initial frame of the microexpressions, the peak inter-frame optical flow data, and their inherent features. It is clear that the characteristic information contained from peak frames to offset frames is generally ignored, and in fact this part of the data is decisive in some specific situations. CN114550270a discloses a micro-expression recognition method based on a dual-attention mechanism, which focuses on a local area of a micro-expression by adopting a channel attention mechanism and a space attention mechanism, but the extracted light flow graph only considers a start frame and a peak frame, the extracted features are insufficient, and the light flow information from the peak frame to an offset frame is ignored. Further, in view of the high complexity of the micro-expressions, many current studies fail to adequately investigate the influence of the interference of the attribute of the individual identity information on the micro-expressions, thereby making accurate recognition difficult. CN113496217B discloses a micro-expression classification method based on an optical flow generating network, which trains the network to generate optical flow characteristics according to a start frame and a peak frame, but does not consider that face identity information contained in the start frame can interfere with micro-action characteristics of network learning micro-expressions. In addition, the characteristic data are sparse and the inherent characteristics are not outstanding due to the fact that the micro-expression shows weak transient and change characteristics. In this case, the adoption of a deep learning strategy can lead to the problem of overfitting.

Thus, in light of such technical challenges and limitations of existing approaches, it has become important to develop a new method that can accurately capture and identify micro-expressive details.

Disclosure of Invention

The invention provides a facial micro-expression recognition method based on contrast learning and feature decoupling, which aims to solve the problems that the existing recognition method is inaccurate in micro-expression positioning, insufficient in sample number, incapable of effectively capturing fine changes and capable of recognizing identity information interference.

A facial micro-expression recognition method based on contrast learning and feature decoupling is realized by the following steps:

step one, acquiring a microexpressive data set, and extracting a starting frame, a peak frame and an offset frame from the data set;

extracting optical flow characteristics between the initial frame and the peak frame to obtain an optical flow characteristic diagram between the initial frame and the peak frame; extracting optical flow characteristics between the peak frames and the offset frames to obtain an optical flow characteristic diagram between the peak frames and the offset frames;

step three, extracting static features of the face from the initial frame and the offset frame; extracting dynamic characteristics of the face from the peak frame; performing differentiation processing on the dynamic characteristics and the static characteristics in the peak frames to obtain two face difference characteristics subjected to decoupling processing;

step four, extracting optical flow branch characteristics of the two optical flow characteristic diagrams obtained in the step two through an optical flow characteristic branch comparison learning network to obtain optical flow branch characteristics;

and fifthly, performing feature stitching according to the optical flow branch features obtained in the step four and the facial difference features obtained in the step three to obtain a micro-expression recognition result.

The invention has the beneficial effects that: the method is based on a facial microexpressive recognition strategy of contrast learning and feature decoupling. The method focuses on decoupling any interference possibly generated by the identity information on the final recognition result, and ensures that the model only aims at accurately capturing and analyzing the micro-expression. In order to further improve the discrimination capability of the model, the method selects double-end optical flow information from a start frame to a peak frame and from the peak frame to an offset frame in the micro-expression sequence. Meanwhile, the performance of the model on micro-expression recognition can be enhanced more effectively by adopting a contrast learning mode. This strategy aims to fully solve the problems described in the background art and provides a more efficient and accurate identification method. It has the following advantages:

1. the method adopts a strategy for acquiring a sequence double-end light flow graph, aims at capturing the characteristics from a start frame to a peak frame and from the peak frame to an offset frame in detail, and provides a solid foundation for precise identification of the microexpressions. And the limited data set is expanded, the sequence double-end key information is fully utilized, the sample feature space is enriched, and the extracted features of the double-end optical flow diagram are spliced with the decoupled difference features. The strategy can effectively solve the problem that the facial fine motion characteristics cannot be captured due to the limitation of a network structure, and simultaneously can effectively relieve the problem of inaccurate recognition caused by limited micro-expression sample quantity and inaccurate micro-expression positioning, so that more comprehensive and deep feature description is brought to micro-expression recognition.

2. The method of the invention performs characteristic extraction and decoupling on the input sample, and strengthens the sensitivity of the network to the dynamic change of the pixels between frames by utilizing the characteristic difference between the peak frame and the initial frame and between the peak frame and the offset frame. Aims to reduce the influence of identity information to the maximum extent and further focuses on the details of the micro-expression actions. Compared with the traditional method, the strategy can capture and pay attention to key action information in the micro-expressions more accurately, and provides a more robust and accurate basis for subsequent micro-expression recognition analysis.

3. The method adopts the visual self-attention network structure and combines the contrast learning strategy to carry out the optical flow characteristic extraction network construction, and when the visual self-attention network structure processes an optical flow diagram, the visual self-attention network structure can capture the global dependency relationship in the optical flow diagram and ensure that each pixel or region is connected with other related pixels or regions. On the basis, the contrast learning strategy further enhances the discrimination among optical flow samples, so that different micro-expression features can be more obviously distinguished. The comprehensive application strategy not only strengthens the capturing capability of the network to the local details of the micro-expressions, but also ensures the deep understanding of global dynamic characteristics. Therefore, the method provides a technical method with application prospect for the research field of facial microexpressions identification.

Drawings

This section will briefly summarize the contents of the drawings referred to in the examples or prior art documents in order to ensure a thorough understanding of the examples or related prior art aspects of the present invention. It is expressly noted that the brief description of the drawings referred to herein is for reference only to certain embodiments of the present invention and should not be taken as unduly broad in meaning and application. Other possible illustrations and related technical details can be deduced from the figures presented, without additional innovative effort, for the expert of the art.

FIG. 1 is a schematic diagram of a facial microexpressive recognition method based on contrast learning and feature decoupling according to the present invention;

FIG. 2 is a diagram of a comparative learning network;

FIG. 3 is a block diagram of an encoder in a visual self-attention network;

fig. 4 is a block diagram of a convolution initiation module in an identification network.

Detailed Description

In order to ensure an accurate understanding and application of the present disclosure, technical details and operational flows of the present disclosure will be described in detail below with reference to the drawings provided in the embodiments. However, the illustrated embodiment is merely an example of a particular application and is not intended to be a complete or exclusive implementation of the present invention.

It is noted that, in the present invention, unless explicitly stated otherwise, each technical term or scientific word referred to in this application shall follow its standard meaning commonly accepted and understood by one of ordinary skill in the art to which this invention pertains.

The first embodiment describes the present embodiment with reference to fig. 1 and fig. 2, and a facial microexpressive recognition method based on contrast learning and feature decoupling is implemented by the following steps:

step 1, data preprocessing;

and acquiring a microexpressive data set, selecting a starting frame, a peak frame and an offset frame, and preprocessing the initial frame, the peak frame and the offset frame so as to facilitate subsequent feature extraction and analysis.

In this embodiment, for a data set (sample sequence) with a peak frame number not explicitly marked, a policy is adopted in which an intermediate frame in the sequence is regarded as a peak frame to ensure data integrity; for the data set with specific frame numbers already marked, the initial frame, the peak frame and the offset frame of each sequence are extracted directly according to the marked file (a table for marking the initial frame, the peak frame and the offset frame numbers: three frames corresponding to each sequence can be found in the data set according to the table).

Adopting a Dlib tool (Dlib is an open source code visual library based on C++, and covers face related modules and functions such as face recognition, face key point positioning and the like) to carry out a face detection task on an input image or video frame, so as to ensure that a face part in the image can be accurately positioned; and (3) carrying out positioning operation on the key points of the human face, so that each main characteristic region of the face can be accurately identified, cutting the human face part in the image or video frame based on the detection and positioning results, and further carrying out normalization processing to ensure that the human face part is suitable for subsequent characteristic extraction and analysis.

Step 2, extracting an optical flow diagram between the selected initial frame and the peak frame; this is also done for peak frames and offset frames, from which the corresponding dataflow graph is obtained.

In this embodiment, a TV-L1 algorithm (a method of calculating optical flow motion between images, estimating optical flow by minimizing data items and smoothing items, wherein an L1 norm is used to process motion discontinuity) is employed to extract optical flow, and acquire horizontal and vertical components in the optical flow.

And calculating optical strain according to the horizontal and vertical components, and stacking the horizontal and vertical components and the optical strain to obtain a final optical flow image, so as to provide richer data support for subsequent micro-expression analysis. Each sequence of micro-expressions can be converted into an optical flow-based image by the method described above. For the optical flow map between the peak frame and the offset frame, the calculation method is the same as the above steps.

Step 3, the feature decoupling network obtains static features of the face from the initial frame and the offset frame; for peak frames, facial dynamics are acquired. And carrying out differentiation processing on the dynamic characteristics in the peak frames and the first two static characteristics, and aiming at obtaining two facial difference characteristics subjected to decoupling processing. The method comprises the following steps:

and (3) respectively inputting the initial frame, the peak frame and the offset frame obtained in the step (1) into a residual error network with three branches to extract the characteristics. At the same time introducing a triplet loss function to amplify the initial frame and the peak frameAnd offsetting the characteristic differences between frames. To obtain purer micro-expressive motion features, the extracted peak frame features are used to subtract the initial frame features and the offset frame features to obtain two types of decoupled features F without facial identity information ₁ And F ₂ The operation is helpful to obtain two independent and decoupled new features, and provides a richer and more accurate feature basis for the accurate analysis and identification of the microexpressions.

And 4, inputting the optical flow image between the initial frame and the peak frame and between the peak frame and the offset frame into an optical flow characteristic branch comparison learning network to perform characteristic extraction, and obtaining optical flow branch characteristics.

In this embodiment, the specific steps for constructing the optical flow feature branch comparison learning network are as follows:

and taking the obtained optical flow diagram as a branch to perform feature extraction, and performing two types of data enhancement on the two types of optical flow diagrams. The enhanced data is sent to a convolution initiation module (Convolutional stem) which is responsible for converting the data into low-level features, which lays the foundation for subsequent encoder operations.

The low-level features are subjected to position coding operation, so that the features have spatial references, and the expression capacity of the features is enhanced. These encoded features are then input into the encoder of the visual self-attention network, resulting in new, higher-level feature representations.

To make these advanced features more adaptive to contrast learning, the output of the encoder is fed into a specially designed projection layer, allowing the features to be remapped in new space.

The processed features are further used to calculate a contrast loss function InfoNCE (Information Noise-Contrastive Estimation) loss that ensures that the distance of the different features in the feature space is reasonable, encouraging the network to extract more discriminative features. On the basis, training the whole network by using the calculated loss function to obtain the optical flow branch characteristic F ₃ . The feature extraction by adopting the contrast learning method can learn more discriminative and abstract features.

Step 5, willThe optical flow branch feature F ₃ And two types of decoupled features F ₁ And F ₂ And performing feature splicing on the features, and then inputting the features into the full-connection layer to obtain a recognition classification result.

Namely: the obtained difference characteristic F ₁ Difference characteristics F ₂ Optical flow feature F ₃ And the characteristic is spliced, the splicing mode is selected according to the dimension of the characteristic channel, and the specific information in the characteristic can be better reserved in the mode.

A second embodiment is described with reference to fig. 1 to fig. 4, where the second embodiment is an example of a facial microexpressive recognition method based on contrast learning and feature decoupling according to the first embodiment. The method comprises the following specific steps:

step 1, data acquisition and preprocessing;

step 11, acquiring a micro-expression data set, and extracting a start frame, a peak frame and an offset frame;

in this embodiment, for a data set with frame numbers not explicitly marked, the strategy of regarding the intermediate frame of the sequence as a peak frame, the first frame of the sequence as a start frame, and the last frame of the sequence as an offset frame ensures data integrity, namely: if the number of the sequence frames is eleven, the first frame is a start frame, the sixth frame is a peak frame, and the eleventh frame is an offset frame. For datasets that have been explicitly labeled with specific frame numbers, the start frame, peak frame, and offset frame for each sample are extracted directly from the label file.

In step 12, in order to ensure the data quality and the accuracy of the subsequent processing, dlib tools (Dlib is an open source code visual library based on c++, and covers modules and functions related to face recognition, face key point positioning and the like) are introduced, in addition, the Dlib tools can also ensure that face parts in images can be accurately positioned, and perform the positioning operation of the face key points, so that each main feature region of a face can be accurately identified, and preparation is made for feature extraction. Based on the detection and positioning results, the face part of the person in the image or video frame is cut, and further normalization processing is carried out, so that the consistency and reliability of the data in subsequent processing are ensured.

Step 2, extracting optical flow characteristics;

the optical flow is calculated by the start frame and the peak frame, and horizontal and vertical components in the optical flow are extracted. Calculating optical strain from the horizontal component and the vertical component, and stacking the horizontal component and the optical strain to obtain a final optical flow image OF ₁ And OF ₂ And a richer data support is provided for subsequent micro-expression analysis. For the optical flow map between the peak frame and the offset frame, the calculation method is the same as the above steps. The specific process is as follows:

step 21, the optical flow features are calculated using peak frames, start frames and offset frames. Taking the initial frame and the peak frame as an example, a TV-L1 algorithm (an optical flow estimation method combining total variation regularization and L1 data inconsistency measurement) is used to extract an optical flow, and a horizontal component u (x, y) and a vertical component v (x, y) in the optical flow are obtained, where the optical flow is expressed as follows:

u＝[u,v] ^T (1)

step 22, calculating an optical strain, which can approximate the intensity of the facial transformation, defined as:

wherein ε is the optical strain, u= [ u, v ]] ^T Representing optical flow, u represents optical flow gradient, then the optical strain can be represented in the form of the following matrix:

for a given optical strain, its diagonal element ε _xx And epsilon _yy Represents normal strain, but not the diagonal element epsilon _xy And epsilon _yx Representing shear strain. The optical strain can be calculated by the following formula:

where ε represents the optical strain OF the pixel, and finally each sequence can derive three optical flow-based representations { u, v, ε }, and stacking the three to generate an optical flow feature map OF ₁ The peak frame and the offset frame are calculated in the same way to obtain an optical flow characteristic image OF ₂ 。

Step 3, constructing a characteristic decoupling network;

firstly, inputting the initial frame, the peak frame and the offset frame obtained in the previous step into a residual error network with three branches to extract features, wherein the initial frame and the offset frame branches are responsible for extracting static face information, the vertex frame branches are specially used for extracting micro-expression motion information, and a triple loss function is introduced at the same time, and the loss function is used for amplifying feature differences among the initial frame, the peak frame and the offset frame. And finally, obtaining the difference characteristic which does not contain the facial identity information in a mode of subtracting the initial frame and the offset frame characteristic from the peak frame characteristic. The operation is helpful to obtain two independent and decoupled difference characteristics, and provides a richer and more accurate characteristic basis for accurate analysis and identification of the microexpressions. As shown in fig. 1, the specific process is as follows:

step 31, inputting a start frame, a peak frame and an offset frame into a residual error network with three branches to extract features, wherein the start frame and the offset frame branches are responsible for extracting static face information, the vertex frame branches are specially used for extracting micro-expression motion information, and two types of difference features which do not contain face identity information are obtained by subtracting the start frame and the offset frame features from the peak frame features.

And step 32, introducing a triplet loss function to perform a difference expansion operation in order to improve the characteristic difference between the peak frame and the initial frame and between the peak frame and the offset frame. The main objective of this loss function is to learn such that the distance between a selected anchor point and the positive sample is smaller than the distance between the anchor point and the negative sample. The triplet loss can be calculated by equation (5):

a represents an anchor sample, P represents a positive sample, and is similar or similar to an anchor point; n represents a negative sample, different from or far away from the anchor point; f (A), f (P), f (N) represent the vector space mapping the anchor point, positive sample, negative sample to low dimension, respectively, and f (A) -f (P) and f (A) -f (N) represent the vector differences between the anchor point and the positive and negative samples in this low dimension vector space, |·|| ₂ Is the L2 norm for calculating the euclidean distance of these two differences; alpha is a hyper-parameter representing margin (i.e. the difference in expected distance between positive and negative samples is a threshold value) ensuring that there is at least a distance of alpha between positive and negative samples, further explained is that the loss function ensures that the distance between samples of the same class is smaller than the distance between samples of different classes, while the hyper-parameter ensures a clear separation between the two distances.

And 33, expanding the characteristic difference among the initial frame, the peak frame and the offset frame by utilizing a triplet loss function, and then performing characteristic decoupling operation, so as to more accurately mine dynamic change information in the microexpressions, thereby performing characteristic analysis and comparison. To this end, two different difference feature outputs are defined.

Specifically, as shown in equation (6), the peak frame characteristic is used to subtract the initial frame characteristic to form a difference characteristic F ₁ This step aims at capturing initial dynamic change information of the microexpressions. Then, as shown in the formula (7), the same strategy is adopted to subtract the offset frame characteristic from the peak frame characteristic to generate a difference characteristic F ₂ To reflect the end phase dynamic information of the microexpressions.

F ₁ ＝F _{Peak frame} -F _{Start frame} (6)

F ₂ ＝F _{Peak frame} -F _{Offset frame} (7)

Step 4, constructing an optical flow characteristic branch comparison learning network;

in order to improve the micro-expression recognition and classification precision, the optical flow image extracted in the step 2 is taken as a feature extraction branch, and an optical flow feature image OF ₁ And an optical flow feature map OF ₂ As data input contrast learning network; concrete embodimentsThe process is as follows:

step 41, as shown in FIGS. 2 to 4, OF generating an optical flow characteristic map OF between the start frame and the peak frame ₁ Optical flow characteristic map OF between peak frame and offset frame ₂ As input to the contrast learning network for training.

Step 42, inputting the optical flow map into a network, and performing two times of data enhancement (augmentations) operation for each image in a small batch (mini-batch) data x containing N optical flow maps. Thus, as shown in equation (8), this batch of samples will produce 2N enhanced samples with x ^a ,x ^b And (3) representing.

The data enhancement is shown below:

x ^a ,x ^b ＝Augment(x) (8)

these enhanced samples are then partitioned by Convolutional stem module, as shown in fig. 4. This module is made up of a plurality of convolutions stacked together. A conventional visual self-attention encoder (Vision Transformer Encoder) is then used to obtain a characteristic representation of these samples and further input into the Projection layer (project). In particular, the core goal of Convolutional stem is to map a two-dimensional image into a one-dimensional feature sequence. In the ViT architecture (Vision Transformer), one common strategy is to employ a p×p convolution operation with a step size of p on the input image. This way a large number of p x p-sized non-overlapping patches can be efficiently extracted and further flattened into an input sequence that can be interpreted by the encoder. However, this method with large convolution kernel and large stride is significantly different from the small-size convolution, such as 3×3, that is often used in conventional neural networks. This strategy may result in disregarding detailed local information inside the patch such that a simple linear mapping based on such information becomes less accurate.

To address the above-mentioned problem, the present method employs Convolutional stem composed of four blocks, each of which performs a 3×3 convolution with a step size of 2. Since the input requirement of the visual self-attention network (visual self-attention encoder) is a 1D sequence, a 1 x 1 convolution is employed to accommodate its input specification. The use of multiple small convolutions more effectively identifies fine-grained features than large convolution kernels, thereby enhancing the performance of the model. Fig. 4 transitions as follows:

the converted output is fed to a visual self-attention encoder comprising 3 transducer modules in series, each transducer module being configured as shown in FIG. 3, then for x ^a ,x ^b Data h after passing through the convolution initiation module ^a And h ^b The method comprises the following steps:

h ^a '＝MultiHeadAttention(Norm(x ^a ))+x ^a (10)

h ^b '＝MultiHeadAttention(Norm(x ^b ))+x ^b (11)

wherein Norm represents normalization operation, multiHeadAttention represents multi-head self-attention mechanism, h ^a' ,h ^b' Represents x ^a ,x ^b Residual connection output after normalization layer and multi-head self-attention mechanism, multi-head self-attention in the formula can be defined as:

MultiHeadAttention(Q,K,V)＝Concat(head ₁ ,...,head _h )W _O (12)

each attention header may be defined as:

head _i ＝Attention(QW _Qi ,KW _Ki ,VW _Vi ) (13)

wherein head is _i Is the attention head. Q, K, V are Query, key, and Value. W (W) _Qi ,W _Ki ,W _Vi Is the weight of the linear mapping, and W _O Is used to combine the outputs of the different heads and Attention represents the Attention mechanism.

Will h ^a' ,h ^b' Again after passing through a normalization layer and MLP (multi-layer perceptron), expressed as:

y ^a ＝MLP(Norm(h ^a' ))+h ^a' (14)

y ^b ＝MLP(Norm(h ^b' ))+h ^b' (15)

and projects it into a low-dimensional subspace to obtain an output characteristic z ^a ,z ^b ：

Step 43, training the whole network by InfoNCE to compare the loss function; specifically, for the microexpressive anchor sample x and one positive sample thereof, and N-1 negative samples, it is desirable that the representation of the model be at the anchor sample x and the positive sample x ⁺ Is close to, but is significantly different from the representation of all negative samples.

The output of the features mapped to the low-dimensional subspace is used for calculating similarity, and cosine similarity calculation is selected to be used in the method:

then in z ^a InfoNCE loss L for anchor point _a The definition is as follows:

the Z is ^- Parameters represent a set of negative samples, Z ⁺ Representing a group of positive samples, s representing cosine similarity calculation, exp representing exponential operation, negatives being a set of negative samples, τ being a temperature parameter that can change the weight of the negative samples in loss;

in z ^b InfoNCE loss L for anchor point _b The definition is as follows:

the final InfoNCE loss is defined as:

L _InfoNCE ＝L _a +L _b (20)

obtaining final optical flow branch characteristic F through the loss function and encoder training ₃ 。

Step 5, feature splicing and recognition result classification;

the difference characteristic F obtained in the step 3 and the step 4 ₁ Difference characteristics F ₂ Optical flow branching feature F ₃ Performing characteristic splicing, wherein the splicing operation is expressed as:

F＝[F ₁ ；F ₂ ；F ₃ ] (21)

wherein F represents the characteristics after splicing, [;]representing a splice operation performed in the channel dimension. Splicing can better retain F ₁ ,F ₂ ,F ₃ Specific information of (a) is provided. And finally, obtaining a final micro-expression recognition classification result through a full-connection layer.

In the embodiments, various technical features may be constructed via a variety of combination modes. In view of the conciseness goals of the literature, all potential combining strategies have not been elaborated. However, all such combinations should be considered within the technical scope of the disclosure of this document, as long as the combination of the technical features is logically not unreasonable.

Furthermore, the embodiments shown herein are only intended to exemplify part of the possibilities of the invention and have been sought to be specific and detailed. However, this is not meant to be an absolute definition of the scope of the invention. Other variations and optimizations may be derived by those skilled in the art without departing from the basic concepts of the present invention. Such derivatization and optimization should be within the legal scope of the present invention. For this reason, the following claims should be studied to determine the true scope of this invention.

Claims

1. A facial micro-expression recognition method based on contrast learning and feature decoupling is characterized by comprising the following steps: the identification method is realized by the following steps:

2. The facial microexpressive recognition method based on contrast learning and feature decoupling according to claim 1, wherein the method comprises the following steps: in the first step, for a data set with a peak frame number which is not marked explicitly, adopting a strategy of regarding an intermediate frame in a sequence as a peak frame to ensure data integrity; for the data set with specific frame numbers already marked, the initial frame, the peak frame and the offset frame of each sequence are extracted directly according to the marking file.

3. The facial microexpressive recognition method based on contrast learning and feature decoupling according to claim 1, wherein the method comprises the following steps: in the first step, the method further comprises the steps of adopting a Dlib tool to detect the faces of the input images or video frames, ensuring that the faces in the images are accurately positioned, and carrying out normalization processing after clipping on the detected and positioned faces for realizing feature extraction.

4. The facial microexpressive recognition method based on contrast learning and feature decoupling according to claim 1, wherein the method comprises the following steps: in the second step, for the optical flow characteristics between the initial frame and the peak frame, the extraction method of the optical flow characteristics between the peak frame and the offset frame is the same; the method comprises the following steps:

and acquiring a horizontal component and a vertical component in the optical flow, calculating optical strain according to the horizontal component and the vertical component, and stacking the horizontal component and the optical strain to obtain a final optical flow diagram.

5. The facial microexpressive recognition method based on contrast learning and feature decoupling according to claim 1, wherein the method comprises the following steps: the specific process of the third step is as follows:

inputting the initial frame, peak frame and offset frame obtained in the first step into three branch residual error network correspondingly to extract features, introducing a triple loss function to expand feature differences among the initial frame, peak frame and offset frame, subtracting the initial frame features from the extracted peak frame features, subtracting the offset frame features from the peak frame features, and obtaining two types of decoupled difference features F ₁ And difference feature F ₂ 。

6. The facial microexpressive recognition method based on contrast learning and feature decoupling according to claim 5, wherein the facial microexpressive recognition method is characterized in that: the triplet loss function is expressed as:

wherein A is an anchor point, P is a positive sample, N is a negative sample, and f (A), f (P) and f (N) are vector spaces for mapping the anchor point, the positive sample and the negative sample to low dimensions respectively; alpha is a hyper-parameter representing the desired distance difference between the positive and negative samples.

7. The facial microexpressive recognition method based on contrast learning and feature decoupling according to claim 1, wherein the method comprises the following steps: the specific process of the fourth step is as follows: constructing an optical flow characteristic branch comparison learning network, wherein the optical flow characteristic branch comparison learning network comprises a data enhancement module, a convolution initiation module, a visual attention encoder and a projection layer;

inputting two optical flow feature graphs into the optical flow feature branch comparison learning network for training;

data enhancement is carried out on the two optical flow feature graphs, and the enhanced data are mapped and output to two features through a corresponding convolution initial module, a visual attention encoder and a projection layer;

calculating the similarity between the two features by adopting a contrast loss function, and finally training and outputting the branch feature F of the optical flow ₃ 。