CN116935465A - Micro-expression recognition method based on three-dimensional residual convolutional neural network and optical flow method


Info

Publication number: CN116935465A
Application number: CN202310808285.0A
Authority: CN (China)
Prior art keywords: micro, frame, optical flow, dimensional, layer
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN116935465B
Inventors: 李军 (Li Jun), 许静怡 (Xu Jingyi), 王有为 (Wang Youwei), 徐文涛 (Xu Wentao), 徐晓峰 (Xu Xiaofeng)
Current and original assignee: Nanjing University of Science and Technology
Application filed by Nanjing University of Science and Technology, with priority to CN202310808285.0A
Publication of CN116935465A; application granted and published as CN116935465B

Classifications

    • G06V40/174 Facial expression recognition
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V40/168 Feature extraction; face representation
    • G06V40/172 Classification, e.g. identification
    • Y02T10/40 Engine management systems (under Y02T, climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a micro-expression recognition method based on a three-dimensional residual convolutional neural network and the optical flow method, which comprises the following steps: preprocessing the original micro-expression video, where preprocessing comprises video framing, face alignment and cropping, and peak-frame localization to extract the key-frame sequence; converting the key-frame image sequence to grayscale and extracting optical-flow features from the grayscale image sequence to obtain a three-channel image sequence as the input of the network model; and improving a three-dimensional convolutional neural network with a residual module, obtaining recognition and classification results for micro-expression emotion through feature extraction and analysis. The application improves the recognition rate of micro-expressions in video clips and has good practical value.

Description

Micro-expression recognition method based on three-dimensional residual convolutional neural network and optical flow method
Technical Field
The application relates to the technical field of computer vision, in particular to a micro-expression recognition method based on a three-dimensional residual convolutional neural network and the optical flow method.
Background
Facial micro-expressions are brief, subtle facial movements that occur during emotional communication as the result of conscious or unconscious suppression. They usually appear when people try to hide their true feelings. Macro-expressions, by contrast, are ordinary emotional facial expressions that are easily perceived and interpreted by others in daily interaction. The main difference between macro- and micro-expressions lies in their intensity and duration: a macro-expression typically lasts 0.5 to 4 seconds, whereas a micro-expression lasts no more than 0.5 seconds. The short time span and fine-grained variation make micro-expressions more challenging to analyze. As cues for lie detection, micro-expressions have wide research and practical application in psychology, education, medical health, criminal investigation, and many other fields.
In the early days, the dominant micro-expression feature extraction methods were feature representations based on the Local Binary Pattern (LBP) and on optical flow. However, manual feature extraction is computationally heavy, time-consuming, and prone to producing redundant information. Compared with traditional features that must be designed by hand using prior knowledge, automatic learning with a neural network can both capture higher-level micro-expression semantic information and strengthen the generalization ability of the recognition model. As the technology developed, researchers introduced deep learning algorithms into micro-expression recognition, including convolutional neural networks, recurrent neural networks, and long short-term memory networks; these, however, generally attend only to spatial-domain features and ignore the temporal-domain information of the continuous motion itself, leaving the recognition performance to be improved.
Disclosure of Invention
The application provides a micro-expression recognition method based on a three-dimensional residual convolutional neural network and the optical flow method, which can solve the technical problem that the prior art ignores the temporal-domain information of continuous motion.
The application provides a micro-expression recognition method based on a three-dimensional residual convolutional neural network and the optical flow method, which comprises the following steps:
Step A: pre-process the original micro-expression video. Preprocessing includes video framing, face alignment and cropping, and peak-frame localization to extract the key-frame sequence;
Step B: convert the key-frame image sequence to grayscale, and extract optical-flow features from the grayscale image sequence to obtain a three-channel image sequence as the input of the network model;
Step C: improve the three-dimensional convolutional neural network with a residual module, and obtain recognition and classification results for micro-expression emotion through feature extraction and analysis.
Optionally, preprocessing the original micro-expression video includes:
locating the micro-expression peak frame with a frequency-domain method, and extracting the peak frame together with the 4 consecutive frames before and after it to form a 9-frame key-frame sequence of micro-expression images;
peak-frame localization is achieved as follows:
the video frame sequence is divided at a preset interval, the face region in each image frame is divided into 6×6 blocks, and, with a sliding window of length N, a three-dimensional fast Fourier transform (3D FFT) is applied to each frame interval in turn to compute the frequency values of the 36 blocks; denoting the blocks {b_{i1}, b_{i2}, …, b_{i36}}, the frequency value of the j-th block of the i-th interval is given by formula (1):

F_{ij}(x,y,z) = \sum_{h=0}^{L_b-1} \sum_{w=0}^{W_b-1} \sum_{l=0}^{N-1} b_{ij}(h,w,l)\, e^{-j 2\pi (xh/L_b + yw/W_b + zl/N)}    (1)

where (x, y, z) denotes the position in the frequency domain, L_b and W_b denote the height and width of the j-th block b_{ij} in the i-th interval, and j = {1, 2, …, 36};
after the frequency-domain signal is obtained, a high-pass filter is applied to remove the low-frequency signal; the high-pass filter H(x,y,z) is defined by formula (2), where D_0 is a threshold and D(x,y,z) is the distance from the frequency-domain origin:

H(x,y,z) = \begin{cases} 0, & D(x,y,z) \le D_0 \\ 1, & D(x,y,z) > D_0 \end{cases}    (2)

the frequency-domain signal of each video block is filtered according to formula (3):

\hat{F}_{ij}(x,y,z) = H(x,y,z) \cdot F_{ij}(x,y,z)    (3)

next, the frequency-domain amplitude of the 36 blocks of the i-th video interval is obtained according to formula (4):

A_i = \sum_{j=1}^{36} \sum_{x,y,z} |\hat{F}_{ij}(x,y,z)|    (4)

where A_i denotes the frequency amplitude of the i-th frame interval, i.e. the extent of rapid facial motion in the i-th interval;
once the frequency information of all video frame intervals is obtained, the peak interval with the largest frequency amplitude contains the highest-intensity frame of rapid facial motion, and the middle frame of that interval is taken as the micro-expression peak frame.
Optionally, converting the key-frame image sequence to grayscale and extracting optical-flow features from the grayscale image sequence to obtain a three-channel image sequence as the input of the network model includes:
extracting optical-flow features from the grayscale picture sequence: the horizontal optical-flow component u and the vertical optical-flow component v are obtained from formulas (5) and (6):

u = dx/dt    (5)
v = dy/dt    (6)

the optical strain ε is further extracted by differentiating the optical flow, as shown in formula (7):

\varepsilon = \frac{1}{2}\left[\nabla \vec{u} + (\nabla \vec{u})^T\right] = \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{yx} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \frac{\partial u}{\partial x} & \frac{1}{2}(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}) \\ \frac{1}{2}(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}) & \frac{\partial v}{\partial y} \end{bmatrix}    (7)

where the diagonal terms (ε_{xx}, ε_{yy}) are the normal strain components and (ε_{xy}, ε_{yx}) are the shear strain components;
then the optical-strain value of each pixel is computed from the root of the sum of squares of the normal and shear strain components, giving |ε|, as shown in formula (8):

|\varepsilon| = \sqrt{\varepsilon_{xx}^2 + \varepsilon_{xy}^2 + \varepsilon_{yx}^2 + \varepsilon_{yy}^2}    (8)

the horizontal optical-flow component u, the vertical optical-flow component v, and the optical strain |ε| are combined by channel concatenation into a new three-channel micro-expression image sequence.
Optionally, improving the three-dimensional convolutional neural network with a residual module and obtaining recognition and classification results for micro-expression emotion through feature extraction and analysis includes:
first, the three-dimensional convolutional neural network model is improved with a residual module to construct a three-dimensional residual convolutional neural network:
the 3D ResNet network comprises two 3D Conv modules, three 3D Res modules, 3 Dropout layers, 2 Dense layers, 1 Flatten layer, 1 batch-normalization layer, 1 ReLU activation function, and 1 Softmax layer;
each 3D Conv module comprises 1 three-dimensional convolution layer, 1 batch-normalization layer, 1 ReLU activation function, and 1 three-dimensional max-pooling layer;
for the input of the MaxPooling layer, the 3D Res module adds the original input x to the original ReLU-function output F(x);
in the three-dimensional residual convolution module, a shortcut connecting input and output is established by adding a direct edge across the nonlinear convolution layers;
then a Softmax classifier performs emotion classification of the micro-expressions, with cross entropy as the loss function;
as shown in formula (9):

L = -\sum_{i=1}^{n} y_i \log \hat{y}_i    (9)

where y denotes the true distribution, ŷ the network output distribution, and n the total number of categories;
the optimized micro-expression video is input into the improved three-dimensional convolutional neural network, and the recognition and classification results of micro-expression emotion are obtained through feature extraction and analysis.
Compared with the prior art, the application has the following notable advantages: (1) extracting optical-flow features of the key-frame sequence as the network-model input effectively removes redundant information, and the recognition effect is superior to micro-expression recognition based on appearance features; (2) using a three-dimensional convolutional neural network for micro-expression recognition allows micro-expression features in the temporal and spatial dimensions to be learned jointly, while the introduced residual module effectively mitigates network degradation and exploding gradients, providing a foundation for building deeper neural networks.
Drawings
FIG. 1 is a flow chart of the method of the application for recognizing facial micro-expressions in actual operation;
FIG. 2 is an overview of the overall structure of the facial micro-expression recognition process designed by the method of the application;
FIG. 3 is a diagram of the facial micro-expression feature extraction network designed by the method of the application;
FIG. 4 is a block diagram of the three-dimensional residual module in the feature extraction network of the method of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
An embodiment of the present application will be described first with reference to fig. 1.
Referring to fig. 1, the micro-expression recognition method based on the three-dimensional residual convolutional neural network and the optical flow method first performs video framing, face alignment, and cropping on the original micro-expression video, extracts the key-frame sequence, and extracts optical-flow features to obtain a three-channel optical-flow image sequence, which is then input into the constructed three-dimensional residual convolutional neural network model to recognize and classify the micro-expressions in the video. The method comprises the following steps:
Step A: preprocess the original micro-expression video. Preprocessing includes video framing, face alignment and cropping, and peak-frame localization to extract the key-frame sequence. Step A locates the micro-expression peak frame with a frequency-domain method and extracts the peak frame together with the 4 consecutive frames before and after it, forming a 9-frame key-frame sequence of micro-expression images.
Step A specifically comprises the following steps:
a1, selecting a micro expression data set. The application selects the micro expression video sequences of the SMIC and CASME II data sets for experiments. In order to alleviate the problem of emotion category imbalance between the employed datasets, each micro-expression video sample is re-labeled and mapped to three common expression tags, namely "Positive", "Negative" and "surrise", respectively. The emotion distribution of the dataset microexpressive sample is shown in table 1.
Table 1. Micro-expression sample distribution of the datasets

Emotion category   CASME II   SMIC   Total
Negative           88         70     158
Positive           32         51     83
Surprise           25         43     68
Total              145        164    309
A2: face alignment and cropping. First, the two center points of the eye regions are accurately located by an eye detector to determine the starting position from which the Active Shape Model (ASM) algorithm describes the face shape; the ASM then iteratively fits the face-shape position to determine 68 contour coordinate points of the face. The topmost, bottommost, leftmost, and rightmost coordinates of the face contour are used to crop the face region.
Based on the 68 face coordinate points, a local weighted mean (LWM) transformation is applied to each frame sequence i to align the cropped face regions. The transformed value of any coordinate (x, y) within an image frame is given by formula (10):

f(x,y) = \frac{\sum_i W\!\left(\sqrt{(x-x_i)^2 + (y-y_i)^2} / D_n\right) S_i(x,y)}{\sum_i W\!\left(\sqrt{(x-x_i)^2 + (y-y_i)^2} / D_n\right)}    (10)

where W is the weight function, D_n is the distance from the i-th control point (x_i, y_i) to its (n-1)-th nearest control point in the selected reference frame, and S_i(x, y) is the fitted polynomial with n parameters. With the LWM transformation, all images within a sequence can be aligned frame by frame.
After face alignment and cropping, the selected datasets are normalized to obtain video frame sequences with resolution 128×128×3 (3 being the RGB channels).
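As a concrete illustration of the cropping and normalization above, the following minimal sketch assumes the 68 contour points have already been fitted (by the ASM or any other landmark detector) and are supplied as a (68, 2) array of (x, y) coordinates; the function name and interface are illustrative, not from the patent:

import cv2
import numpy as np

def crop_and_normalize(frame, landmarks, size=128):
    # frame: H x W x 3 image; landmarks: (68, 2) array of (x, y) points.
    # The crop box spans the extreme left/right/top/bottom contour
    # coordinates (step A2), and the result is resized to size x size x 3.
    x_min, y_min = landmarks.min(axis=0).astype(int)
    x_max, y_max = landmarks.max(axis=0).astype(int)
    face = frame[max(y_min, 0):y_max, max(x_min, 0):x_max]
    return cv2.resize(face, (size, size))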
A3: locating the peak frame to extract the key-frame sequence. The CASME II micro-expression dataset provides the position of the peak frame, but the SMIC dataset is not annotated with peak-frame information, so a method based on the three-dimensional fast Fourier transform (3D-FFT) is adopted to locate the micro-expression peak frame.
The specific peak-frame localization procedure divides the video frame sequence at a preset interval and divides the face region in each image frame into 6×6 blocks; then, with a sliding window of length N, a three-dimensional fast Fourier transform (3D FFT) is applied to each frame interval in turn to compute the frequency values of the 36 blocks. Denoting the blocks {b_{i1}, b_{i2}, …, b_{i36}}, the frequency value of the j-th block of the i-th interval is given by formula (11):

F_{ij}(x,y,z) = \sum_{h=0}^{L_b-1} \sum_{w=0}^{W_b-1} \sum_{l=0}^{N-1} b_{ij}(h,w,l)\, e^{-j 2\pi (xh/L_b + yw/W_b + zl/N)}    (11)

where (x, y, z) denotes the position in the frequency domain, L_b and W_b denote the height and width of the j-th block b_{ij} in the i-th interval, and j = {1, 2, …, 36}.
After the frequency-domain signal is obtained, a high-pass filter removes the low-frequency signal to reduce the influence of unchanged pixels in the video frames. The high-pass filter H(x,y,z) is defined by formula (12), where D_0 is a threshold and D(x,y,z) is the distance from the frequency-domain origin:

H(x,y,z) = \begin{cases} 0, & D(x,y,z) \le D_0 \\ 1, & D(x,y,z) > D_0 \end{cases}    (12)

The frequency-domain signal of each video block is filtered according to formula (13):

\hat{F}_{ij}(x,y,z) = H(x,y,z) \cdot F_{ij}(x,y,z)    (13)

Next, the frequency-domain amplitude of the 36 blocks of the i-th video interval is obtained according to formula (14):

A_i = \sum_{j=1}^{36} \sum_{x,y,z} |\hat{F}_{ij}(x,y,z)|    (14)

where A_i denotes the frequency amplitude of the i-th frame interval, i.e. the extent of rapid facial motion in the i-th interval. In the same way, the frequency information of all video frame intervals is obtained; the peak interval with the largest frequency amplitude contains the highest-intensity frame of rapid facial motion, and the middle frame of that interval is taken as the micro-expression peak frame.
To remove redundant interference information, once the peak frame is located, the key-frame sequence of 9 micro-expression frames (the peak frame plus the 4 consecutive frames before and after it) is selected as the input to the subsequent steps.
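To make the procedure concrete, the following numpy sketch implements this 3D-FFT apex-frame localization under simplifying assumptions (an ideal high-pass radius d0, unit window stride, and a 9-frame window); the function name and default values are illustrative:

import numpy as np

def locate_apex_frame(gray_seq, window=9, grid=6, d0=2.0):
    # gray_seq: (T, H, W) grayscale face sequence. For each sliding interval
    # of `window` frames, the face is split into grid x grid blocks, each
    # block is transformed by a 3D FFT (formula (11)), frequencies within
    # radius d0 of the origin are suppressed (formulas (12)-(13)), and the
    # remaining magnitudes are summed to give A_i (formula (14)).
    T, H, W = gray_seq.shape
    bh, bw = H // grid, W // grid
    amplitudes = []
    for i in range(T - window + 1):
        clip, a_i = gray_seq[i:i + window], 0.0
        for r in range(grid):
            for c in range(grid):
                block = clip[:, r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                f = np.fft.fftshift(np.fft.fftn(block))
                coords = np.indices(f.shape)
                center = (np.array(f.shape) // 2).reshape(3, 1, 1, 1)
                dist = np.sqrt(((coords - center) ** 2).sum(axis=0))
                a_i += np.abs(f[dist > d0]).sum()  # high-pass magnitude sum
        amplitudes.append(a_i)
    # middle frame of the interval with the largest frequency amplitude
    return int(np.argmax(amplitudes)) + window // 2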
A4: dataset expansion. Since dataset quality directly affects deep-learning results, the dataset must be expanded when the sample size is small; the application adopts an affine-transformation strategy for data augmentation, sketched below. Specifically, the acquired face image frames are shifted 15 pixels left, right, up, and down, and flipped vertically, which enlarges the dataset to 4 times its original size.
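A minimal sketch of this augmentation, assuming each key-frame sequence is shifted as a whole (np.roll wraps pixels around the border; a zero-filling cv2.warpAffine would be the stricter affine alternative):

import numpy as np

def augment_sequence(seq, shift=15):
    # seq: (T, H, W, C) key-frame sequence. Returns the shifted and
    # flipped variants described in step A4.
    out = [np.roll(seq, -shift, axis=2),   # 15 px left
           np.roll(seq, shift, axis=2),    # 15 px right
           np.roll(seq, -shift, axis=1),   # 15 px up
           np.roll(seq, shift, axis=1)]    # 15 px down
    out.append(seq[:, ::-1, :, :])         # vertical flip
    return out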
A5: training/test split. The micro-expression dataset, after normalization, key-frame extraction, and data augmentation, is divided into a training set and a test set at a ratio of 8:2.
Step B: convert the micro-expression key-frame image sequence to grayscale, extract the optical-flow features (the horizontal optical-flow component, the vertical optical-flow component, and the optical strain) from the grayscale image sequence, and combine the extracted optical-flow features by channel concatenation into a three-channel image sequence used as the input of the network model.
B1: grayscale conversion. The micro-expression key-frame image sequence obtained in step A is converted to grayscale: the three-channel RGB picture sequence is reduced to a single-channel grayscale picture sequence.
B2: optical-flow feature extraction. The TV-L1 energy functional, which is robust to noise, is applied to the single-channel grayscale picture sequence obtained in step B1 to extract optical-flow features.
The optical flow method uses the temporal variation of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computes the object motion between adjacent frames. The instantaneous rate of change of gray level at a given coordinate of the two-dimensional image plane is defined as the optical-flow vector. Optical-flow estimation is based on the brightness-constancy equation, as shown in formula (15):

I(x, y, t) = I(x + dx, y + dy, t + dt)    (15)

where I(x, y, t) is the image intensity of the pixel at coordinate (x, y) at time t. A first-order expansion gives formula (16):

\nabla I \cdot (p, q)^T + I_t = 0    (16)

where \nabla I = (I_x, I_y) is the spatially varying gradient, I_t the time-varying gradient, and p and q the horizontal and vertical motion vectors. The horizontal optical-flow component u and the vertical optical-flow component v are obtained from formulas (17) and (18):

u = dx/dt    (17)
v = dy/dt    (18)

Optical strain approximates the intensity of facial deformation; it is extracted by differentiating the optical flow, as shown in formula (19):

\varepsilon = \frac{1}{2}\left[\nabla \vec{u} + (\nabla \vec{u})^T\right] = \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{yx} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \frac{\partial u}{\partial x} & \frac{1}{2}(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}) \\ \frac{1}{2}(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}) & \frac{\partial v}{\partial y} \end{bmatrix}    (19)

where the diagonal terms (ε_{xx}, ε_{yy}) are the normal strain components and (ε_{xy}, ε_{yx}) the shear strain components. The optical-strain value of each pixel is then computed from the root of the sum of squares of the normal and shear strain components, giving |ε|, as shown in formula (20):

|\varepsilon| = \sqrt{\varepsilon_{xx}^2 + \varepsilon_{xy}^2 + \varepsilon_{yx}^2 + \varepsilon_{yy}^2}    (20)
each micro-expression sample comprises 9 frames of pictures, and after the operation of extracting the optical flow characteristics is carried out, each micro-expression sample can obtain 8 frames of horizontal optical flow sequences, 8 frames of vertical optical flow sequences and 8 frames of optical strain sequences.
B3: channel concatenation to form the input. The horizontal optical-flow component u, the vertical optical-flow component v, and the optical strain |ε| obtained in step B2 are combined by channel concatenation into a new three-channel micro-expression image sequence, which serves as the input to the subsequent feature extraction network; each sample has size 8×128×128×3.
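Steps B1 to B3 can be sketched end to end as follows, assuming the opencv-contrib-python package is available (it provides the TV-L1 solver under cv2.optflow); the function name is illustrative:

import cv2
import numpy as np

def flow_strain_channels(gray_seq):
    # gray_seq: (9, H, W) uint8 grayscale key frames (step B1 output;
    # cv2.cvtColor with COLOR_BGR2GRAY produces each frame).
    # Returns an (8, H, W, 3) float32 array whose channels are the
    # horizontal flow u, vertical flow v, and optical strain |epsilon|.
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()
    frames = []
    for prev, nxt in zip(gray_seq[:-1], gray_seq[1:]):
        flow = tvl1.calc(prev, nxt, None)    # (H, W, 2) TV-L1 flow field
        u, v = flow[..., 0], flow[..., 1]
        du_dy, du_dx = np.gradient(u)        # spatial derivatives of u
        dv_dy, dv_dx = np.gradient(v)        # spatial derivatives of v
        e_xx, e_yy = du_dx, dv_dy            # normal strain components
        e_xy = 0.5 * (du_dy + dv_dx)         # shear strain, e_xy = e_yx
        strain = np.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)  # formula (20)
        frames.append(np.stack([u, v, strain], axis=-1))
    return np.stack(frames).astype(np.float32)  # 9 frames -> 8 flow frames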
Step C: improve the three-dimensional convolutional neural network with a residual module, and input the optimized dataset into the constructed three-dimensional residual convolutional network to obtain the recognition and classification results for micro-expression emotion.
C1: the three-dimensional convolutional neural network model is improved with a residual module to construct a three-dimensional residual convolutional neural network. As shown in fig. 3, the 3D ResNet network comprises two 3D Conv modules, three 3D Res modules, 3 Dropout layers, 2 Dense layers, 1 Flatten layer, 1 batch-normalization layer, 1 ReLU activation function, and 1 Softmax layer. Each 3D Conv module comprises 1 three-dimensional convolution layer, 1 batch-normalization layer, 1 ReLU activation function, and 1 three-dimensional max-pooling layer. As shown in fig. 4, compared with the 3D Conv module, the 3D Res module adds the original input x to the original ReLU-function output F(x) at the input of the MaxPooling layer.
In the three-dimensional residual convolution module, a direct edge is added across the nonlinear convolution layers to establish a shortcut connecting input and output, so that information is no longer transmitted only through the main path of the network layers; this effectively alleviates the difficulty of fitting identity mappings when the network becomes deeper.
C2: the two three-dimensional convolution modules extract shallow features, covering both the temporal and spatial domains, from the input image sequence obtained in step B; the convolution kernels are 3×3×3. Since the input image sequence has few frames, the pooling size of the max-pooling layer in both 3D Conv modules is set to 1×2×2 to preserve the temporal features. The convolution kernel of the first 3D Res module is 3×3×3, with 'same' padding and pooling size 1×2×2. The convolution kernels of the second and third 3D Res modules are 3×3×3, and the pooling size of their max-pooling layers is set to 2×2×2 to downsample the temporal features.
C3: as shown in fig. 4, in the three-dimensional residual convolution module the direct edge added across the nonlinear convolution layers establishes a shortcut connecting input and output, which effectively alleviates the difficulty of fitting identity mappings when the network becomes deeper.
C4: a batch-normalization (BN) layer is introduced to standardize the input of each intermediate layer of the network so that the output follows a normal distribution with mean 0 and variance 1, avoiding shifts in the variable distribution; the ReLU activation function is introduced to avoid gradient saturation for inputs x ≥ 0 and to improve the nonlinear fitting capacity; Dropout layers are introduced to reduce the number of intermediate features, with the drop rate set to 0.2.
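In concrete terms, a minimal PyTorch sketch of the C1 to C4 architecture could look as follows; since the concrete values of Table 2 are not reproduced in this text, the channel width, Dense size, and exact layer arrangement below are illustrative assumptions rather than the patent's settings:

import torch
import torch.nn as nn

class Res3D(nn.Module):
    # 3D Res module (FIG. 4): Conv3d + BN + ReLU with the identity
    # shortcut added before max pooling, i.e. pool(F(x) + x).
    def __init__(self, channels, pool):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)  # 'same' padding
        self.bn = nn.BatchNorm3d(channels)
        self.pool = nn.MaxPool3d(pool)

    def forward(self, x):
        return self.pool(torch.relu(self.bn(self.conv(x))) + x)

class Res3DNet(nn.Module):
    def __init__(self, n_classes=3, width=64):
        super().__init__()

        def conv_block(cin, cout):
            # 3D Conv module: conv + BN + ReLU + MaxPool(1, 2, 2)
            return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1),
                                 nn.BatchNorm3d(cout), nn.ReLU(),
                                 nn.MaxPool3d((1, 2, 2)))

        self.features = nn.Sequential(
            conv_block(3, width), conv_block(width, width),
            Res3D(width, (1, 2, 2)),   # first 3D Res module keeps time
            Res3D(width, (2, 2, 2)),   # second and third downsample time
            Res3D(width, (2, 2, 2)))
        # After pooling, an 8 x 128 x 128 input leaves 2 x 4 x 4 per channel.
        self.head = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.2),
            nn.Linear(width * 2 * 4 * 4, 256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, n_classes))  # Softmax is applied inside the loss

    def forward(self, x):               # x: (N, 3, 8, 128, 128)
        return self.head(self.features(x))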
C5: a Softmax classifier performs emotion classification of the micro-expressions, with cross entropy as the loss function, computed as shown in formula (21):

L = -\sum_{i=1}^{n} y_i \log \hat{y}_i    (21)

where y denotes the true distribution, ŷ the network output distribution, and n the total number of categories.
C6: the parameter settings of the 3D ResNet network model of the micro-expression feature extraction network are listed in Table 2.
Table 2. 3D ResNet network model parameter settings
C7: the adopted performance evaluation metrics are Accuracy (Acc), the Unweighted F1 score (UF1), and the Unweighted Average Recall (UAR). Acc is computed by formula (22), UF1 by formula (23), and UAR by formula (26):

Acc = N_{correct} / N_{total}    (22)

UF1 = \frac{1}{E} \sum_{\alpha=1}^{E} F1_\alpha    (23)

wherein

F1_\alpha = \frac{2 \cdot Precision_\alpha \cdot Recall_\alpha}{Precision_\alpha + Recall_\alpha}    (24)

Precision_\alpha = \frac{TP_\alpha}{TP_\alpha + FP_\alpha}, \quad Recall_\alpha = \frac{TP_\alpha}{TP_\alpha + FN_\alpha}    (25)

UAR = \frac{1}{E} \sum_{\alpha=1}^{E} Recall_\alpha    (26)

where E denotes the number of emotion categories, α indexes the category (with β indexing repeated experiments, over which the metrics are averaged), and Precision_α and Recall_α denote the precision and recall of the α-th category.
In summary, the method first processes the original micro-expression video with image-processing techniques combined with the optimization method, then improves the three-dimensional convolutional neural network with the characteristics of the residual module, constructing a three-dimensional residual convolutional network to extract micro-expression features. Passed through the improved feature extraction network, the optimized micro-expression video sequence effectively improves the recognition rate of micro-expressions in video clips; the method can be applied in psychology, education, medical health, criminal investigation, and many other fields, and has good practical value.
Compared with the prior art, the application has the following notable advantages: (1) extracting optical-flow features of the key-frame sequence as the network-model input effectively removes redundant information, and the recognition effect is superior to micro-expression recognition based on appearance features; (2) using a three-dimensional convolutional neural network for micro-expression recognition allows micro-expression features in the temporal and spatial dimensions to be learned jointly, while the introduced residual module effectively mitigates network degradation and exploding gradients, providing a foundation for building deeper neural networks.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the embodiments of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the embodiments, or in certain parts of the embodiments, of the present application.
The embodiments in this specification refer to one another for their identical or similar parts. In particular, since the service building apparatus and service loading apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief; for the relevant points, see the description in the method embodiments.
The embodiments of the present application described above do not limit the scope of the present application.

Claims (4)

1. A micro-expression recognition method based on a three-dimensional residual convolutional neural network and the optical flow method, characterized by comprising the following steps:
step A, pre-processing the original micro-expression video: preprocessing comprises video framing, face alignment and cropping, and peak-frame localization to extract the key-frame sequence;
step B, converting the key-frame image sequence to grayscale and extracting optical-flow features from the grayscale image sequence to obtain a three-channel image sequence as the input of the network model;
step C, improving the three-dimensional convolutional neural network with a residual module, and obtaining recognition and classification results for micro-expression emotion through feature extraction and analysis.
2. The micro-expression recognition method based on the three-dimensional residual convolutional neural network and the optical flow method according to claim 1, wherein preprocessing the original micro-expression video comprises:
locating the micro-expression peak frame with a frequency-domain method, and extracting the peak frame together with the 4 consecutive frames before and after it to form a 9-frame key-frame sequence of micro-expression images;
peak-frame localization is achieved as follows:
the video frame sequence is divided at a preset interval, the face region in each image frame is divided into 6×6 blocks, and, with a sliding window of length N, a three-dimensional fast Fourier transform (3D FFT) is applied to each frame interval in turn to compute the frequency values of the 36 blocks; denoting the blocks {b_{i1}, b_{i2}, …, b_{i36}}, the frequency value of the j-th block of the i-th interval is given by formula (1):

F_{ij}(x,y,z) = \sum_{h=0}^{L_b-1} \sum_{w=0}^{W_b-1} \sum_{l=0}^{N-1} b_{ij}(h,w,l)\, e^{-j 2\pi (xh/L_b + yw/W_b + zl/N)}    (1)

where (x, y, z) denotes the position in the frequency domain, L_b and W_b denote the height and width of the j-th block b_{ij} in the i-th interval, and j = {1, 2, …, 36};
after the frequency-domain signal is obtained, a high-pass filter removes the low-frequency signal; the high-pass filter H(x,y,z) is defined by formula (2), where D_0 is a threshold and D(x,y,z) is the distance from the frequency-domain origin:

H(x,y,z) = \begin{cases} 0, & D(x,y,z) \le D_0 \\ 1, & D(x,y,z) > D_0 \end{cases}    (2)

the frequency-domain signal of each video block is filtered according to formula (3):

\hat{F}_{ij}(x,y,z) = H(x,y,z) \cdot F_{ij}(x,y,z)    (3)

next, the frequency-domain amplitude of the 36 blocks of the i-th video interval is obtained according to formula (4):

A_i = \sum_{j=1}^{36} \sum_{x,y,z} |\hat{F}_{ij}(x,y,z)|    (4)

where A_i denotes the frequency amplitude of the i-th frame interval, i.e. the extent of rapid facial motion in the i-th interval;
once the frequency information of all video frame intervals is obtained, the peak interval with the largest frequency amplitude contains the highest-intensity frame of rapid facial motion, and the middle frame of that interval is taken as the micro-expression peak frame.
3. The micro-expression recognition method based on the three-dimensional residual convolutional neural network and the optical flow method according to claim 1, wherein converting the key-frame image sequence to grayscale and extracting optical-flow features from the grayscale image sequence to obtain a three-channel image sequence as the input of the network model comprises:
extracting optical-flow features from the grayscale picture sequence: the horizontal optical-flow component u and the vertical optical-flow component v are obtained from formulas (5) and (6):

u = dx/dt    (5)
v = dy/dt    (6)

the optical strain ε is further extracted by differentiating the optical flow, as shown in formula (7):

\varepsilon = \frac{1}{2}\left[\nabla \vec{u} + (\nabla \vec{u})^T\right] = \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{yx} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \frac{\partial u}{\partial x} & \frac{1}{2}(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}) \\ \frac{1}{2}(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}) & \frac{\partial v}{\partial y} \end{bmatrix}    (7)

where the diagonal terms (ε_{xx}, ε_{yy}) are the normal strain components and (ε_{xy}, ε_{yx}) are the shear strain components;
then the optical-strain value of each pixel is computed from the root of the sum of squares of the normal and shear strain components, giving |ε|, as shown in formula (8):

|\varepsilon| = \sqrt{\varepsilon_{xx}^2 + \varepsilon_{xy}^2 + \varepsilon_{yx}^2 + \varepsilon_{yy}^2}    (8)

the horizontal optical-flow component u, the vertical optical-flow component v, and the optical strain |ε| are combined by channel concatenation into a new three-channel micro-expression image sequence.
4. The micro-expression recognition method based on the three-dimensional residual convolutional neural network and the optical flow method according to claim 1, wherein improving the three-dimensional convolutional neural network with a residual module and obtaining recognition and classification results for micro-expression emotion through feature extraction and analysis comprises:
first, the three-dimensional convolutional neural network model is improved with a residual module to construct a three-dimensional residual convolutional neural network:
the 3D ResNet network comprises two 3D Conv modules, three 3D Res modules, 3 Dropout layers, 2 Dense layers, 1 Flatten layer, 1 batch-normalization layer, 1 ReLU activation function, and 1 Softmax layer;
each 3D Conv module comprises 1 three-dimensional convolution layer, 1 batch-normalization layer, 1 ReLU activation function, and 1 three-dimensional max-pooling layer;
for the input of the MaxPooling layer, the 3D Res module adds the original input x to the original ReLU-function output F(x);
in the three-dimensional residual convolution module, a shortcut connecting input and output is established by adding a direct edge across the nonlinear convolution layers;
then a Softmax classifier performs emotion classification of the micro-expressions, with cross entropy as the loss function;
as shown in formula (9):

L = -\sum_{i=1}^{n} y_i \log \hat{y}_i    (9)

where y denotes the true distribution, ŷ the network output distribution, and n the total number of categories;
the optimized micro-expression video is input into the improved three-dimensional convolutional neural network, and the recognition and classification results of micro-expression emotion are obtained through feature extraction and analysis.
CN202310808285.0A 2023-07-04 Micro-expression recognition method based on three-dimensional residual convolution neural network and optical flow method Active CN116935465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310808285.0A CN116935465B (en) 2023-07-04 Micro-expression recognition method based on three-dimensional residual convolution neural network and optical flow method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310808285.0A CN116935465B (en) 2023-07-04 Micro-expression recognition method based on three-dimensional residual convolution neural network and optical flow method

Publications (2)

Publication Number Publication Date
CN116935465A true CN116935465A (en) 2023-10-24
CN116935465B CN116935465B (en) 2024-07-09




Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130300900A1 (en) * 2012-05-08 2013-11-14 Tomas Pfister Automated Recognition Algorithm For Detecting Facial Expressions
CN105590106A (en) * 2016-01-21 2016-05-18 合肥君达高科信息技术有限公司 Novel face 3D expression and action identification system
CN109389045A (en) * 2018-09-10 2019-02-26 广州杰赛科技股份有限公司 Micro- expression recognition method and device based on mixing space-time convolution model
WO2020103700A1 (en) * 2018-11-21 2020-05-28 腾讯科技(深圳)有限公司 Image recognition method based on micro facial expressions, apparatus and related device
CN110852271A (en) * 2019-11-12 2020-02-28 哈尔滨工程大学 Micro-expression recognition method based on peak frame and deep forest
CN112101306A (en) * 2020-11-10 2020-12-18 成都市谛视科技有限公司 Fine facial expression capturing method and device based on RGB image
CN112883896A (en) * 2021-03-10 2021-06-01 山东大学 Micro-expression detection method based on BERT network
CN113221639A (en) * 2021-04-01 2021-08-06 山东大学 Micro-expression recognition method for representative AU (AU) region extraction based on multitask learning
CN115937936A (en) * 2022-11-29 2023-04-07 西安理工大学 Micro-expression recognition method based on optical flow characteristics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zeng Yiqi; Guan Shengxiao: "A facial expression recognition method based on an isolation loss function", Information Technology and Network Security, no. 06, 10 June 2018 (2018-06-10), pages 84-88 *
Wang Xin; Wang Yousheng: "A survey of facial expression recognition based on deep learning and traditional machine learning", Applied Science and Technology, no. 01, 26 October 2017 (2017-10-26), pages 69-76 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117456586A (en) * 2023-11-17 2024-01-26 江南大学 Micro expression recognition method, system, equipment and medium
CN117456586B (en) * 2023-11-17 2024-07-09 江南大学 Micro expression recognition method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN109389045B (en) Micro-expression identification method and device based on mixed space-time convolution model
CN111797683A (en) Video expression recognition method based on depth residual error attention network
CN104933414A (en) Living body face detection method based on WLD-TOP (Weber Local Descriptor-Three Orthogonal Planes)
CN112001241B (en) Micro-expression recognition method and system based on channel attention mechanism
CN111476178A (en) Micro-expression recognition method based on 2D-3D CNN
Zhao et al. Applying contrast-limited adaptive histogram equalization and integral projection for facial feature enhancement and detection
CN113537008A (en) Micro-expression identification method based on adaptive motion amplification and convolutional neural network
CN107194314B (en) Face recognition method fusing fuzzy 2DPCA and fuzzy 2DLDA
CN113591763B (en) Classification recognition method and device for face shapes, storage medium and computer equipment
KR20190128933A (en) Emotion recognition apparatus and method based on spatiotemporal attention
Sulistianingsih et al. Classification of batik image using grey level co-occurrence matrix feature extraction and correlation based feature selection
CN112766145B (en) Method and device for identifying dynamic facial expressions of artificial neural network
CN103235943A (en) Principal component analysis-based (PCA-based) three-dimensional (3D) face recognition system
Karamizadeh et al. Race classification using gaussian-based weight K-nn algorithm for face recognition
CN116935465B (en) Micro-expression recognition method based on three-dimensional residual convolution neural network and optical flow method
CN116935465A (en) Micro-expression recognition method based on three-dimensional residual convolution neural network and optical flow method
Zhang et al. No-reference image quality assessment using independent component analysis and convolutional neural network
Sang et al. MoNET: no-reference image quality assessment based on a multi-depth output network
Yang et al. Combining attention mechanism and dual-stream 3d convolutional neural network for micro-expression recognition
Al-Rawi et al. Feature Extraction of Human Facial Expressions Using Haar Wavelet and Neural network
CN118097360B (en) Image fusion method based on significant feature extraction and residual connection
Yin et al. Face Recognition System using Self-Organizing Feature Map and Appearance-Based Approach
Chihaoui et al. Implementation of skin color selection prior to Gabor filter and neural network to reduce execution time of face detection
Ahmed et al. Non-reference quality monitoring of digital images using gradient statistics and feedforward neural networks
CN109117867B (en) Multi-tested brain image prediction method based on gradient super-calibration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant