CN114639136A - Long video micro-expression detection method based on shallow network - Google Patents

Long video micro-expression detection method based on shallow network

Info

Publication number
CN114639136A
Authority
CN
China
Prior art keywords
micro
expression
optical flow
sequence
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210075626.3A
Other languages
Chinese (zh)
Other versions
CN114639136B (en)
Inventor
夏召强
郭旭鹏
梁桓
胡智星
张雨姗
冯晓毅
蒋晓悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210075626.3A priority Critical patent/CN114639136B/en
Publication of CN114639136A publication Critical patent/CN114639136A/en
Application granted granted Critical
Publication of CN114639136B publication Critical patent/CN114639136B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

To address the low accuracy and weak detection capability of existing micro-expression detection, the invention provides a long-video micro-expression detection method based on a shallow network. By combining the complementary characteristics of a shallow convolutional network and a Transformer encoder, the method achieves higher detection precision, higher speed and lower error.

Description

Long video micro-expression detection method based on shallow network
Technical field:
The invention relates to the field of computer vision, and in particular to a long video micro-expression detection method based on a shallow network.
Background art:
A micro-expression is a genuine feeling that a person reveals involuntarily, usually expressed as a slight, transient movement of the face. Micro-expression research plays an important role in lie detection, medicine, negotiation and other fields, but micro-expressions are difficult to detect because of their short duration and small movement amplitude. In recent years, facial expression analysis based on computer vision has attracted wide attention because it can automatically reveal human emotions. Micro-expression research is broadly divided into two categories: micro-expression detection and micro-expression recognition. Among these tasks, obtaining the temporal localization of expression frames from a long video sequence is one of the most challenging problems. Compared with the recognition task, micro-expression detection requires more research to further improve performance and to support subsequent recognition. Automatic analysis of micro-expression data with computer vision has therefore become one of the hot topics in affective computing.
From early hand-crafted macro-expression representation models to end-to-end learning methods based on deep models, the performance of micro-expression analysis has improved markedly. Existing macro-expression descriptors (such as LBP-TOP and MDMO) and general convolutional networks (such as VGGNet and ResNet) can describe expression changes fairly accurately, but because micro-expressions are short and weak, automatically extracting facial micro-expression information from a long video sequence remains a difficult point in automatic micro-expression analysis.
The document "Local Bilinear temporal Neural Network for spoting Macro-and Micro-expression in Long Video Sequences" Hang Pan and the like uses Bilinear temporal Neural Network (BCNN) to extract global and Local features of each frame picture in a Long Video, and classifies the features after the global and Local features are fused to obtain a final Micro-expression detection interval. However, the detection accuracy of the technology is still low, correlation information between frames is less, and robustness is poor when facial expressions change.
Objects of the invention:
aiming at the problems of low detection accuracy and poor detection capability of micro-expression in the current long video, the invention provides a long video micro-expression detection method based on a shallow network.
Summary of the invention:
The invention mainly studies a long-video micro-expression detection algorithm based on a shallow convolutional neural network and a Transformer encoder. A shallow convolutional neural network (MeNet) extracts the visual features of each image from the preprocessed video sequence, a Transformer encoder extracts dynamic features from the resulting sequence of visual features, and micro-expressions are finally detected with a sliding window. The invention mainly comprises three steps: data preprocessing and feature extraction, feature analysis model construction, and network model training and micro-expression detection.
Step 1: data preprocessing and feature extraction
A video database containing both macro- and micro-expressions also contains rich irrelevant content, such as background and headphone noise, which can strongly affect micro-expression detection, so the video sequences of the database need to be preprocessed. Preprocessing the face image sequence effectively reduces the influence of factors such as face size and position. After preprocessing, optical-flow motion features are extracted from the processed face sequences and used as the input of the subsequent feature analysis model.
1) Extracting faces from the video sequence
The spatial position of the face is extracted from the micro-expression long video, and the whole video sequence is cropped with a fixed position box to obtain a fixed face region, in the following three steps: locate the face in the first frame to obtain a spatial position box; further refine the position box and denoise (image downsampling); crop the remaining images of the video sequence with the position box.
During video preprocessing, a face detection algorithm is first used to obtain the face position coordinates (x_0, y_0) and (x_1, y_1), which denote the top-left and bottom-right corners of the face position box, respectively. The detected box still contains noise such as headphones and hair and cannot be fed directly into the feature analysis model, so it is cropped further. From the initial face coordinates (x_0, y_0), (x_1, y_1), the cropped coordinates (x'_0, y'_0), (x'_1, y'_1) are obtained as follows:
(x'_0, y'_0) = (x_0 + a, y_0 - b)  (1)
(x'_1, y'_1) = (x_1 - a, y_1 + b)  (2)
where a is the horizontal cropping distance and b is the vertical cropping distance.
The video sequence is then cropped uniformly: the first frame of each micro-expression long video is taken as the reference frame, the face image in the reference frame is taken as the model face, the cropping matrix of the model face is obtained, and the remaining images of the video sequence are cropped with this matrix to complete face extraction.
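As an illustration of the cropping step, the following Python sketch applies equations (1) and (2) to the box detected on the reference frame and then crops every frame of the sequence with the same box; the function and variable names are illustrative assumptions, not part of the original disclosure.

```python
def crop_face_sequence(frames, box, a, b):
    """Crop all frames with the box found on the first (reference) frame,
    adjusted by the horizontal distance a and vertical distance b as in
    equations (1) and (2).  frames: iterable of H x W x 3 NumPy images."""
    x0, y0, x1, y1 = box                      # top-left and bottom-right corners
    nx0, ny0 = x0 + a, y0 - b                 # (x'_0, y'_0) = (x_0 + a, y_0 - b)
    nx1, ny1 = x1 - a, y1 + b                 # (x'_1, y'_1) = (x_1 - a, y_1 + b)
    cropped = []
    for frame in frames:
        h, w = frame.shape[:2]
        cy0, cy1 = max(ny0, 0), min(ny1, h)   # clamp to the image borders
        cx0, cx1 = max(nx0, 0), min(nx1, w)
        cropped.append(frame[cy0:cy1, cx0:cx1])
    return cropped
```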
To ensure that the video sequences fed into the feature analysis model have the same length, the micro-expression long video sequences are temporally normalized, so that sequences of different lengths become the same length.
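One simple way to bring sequences of different lengths to a common length is uniform resampling of the frame index; the patent does not fix a particular normalization method, so the following is only an assumed sketch.

```python
import numpy as np

def normalize_sequence_length(frames, target_len):
    """Resample a face-image sequence (T, H, W, C) to target_len frames by
    uniformly spaced frame indices (an assumed stand-in for the temporal
    normalization step)."""
    frames = np.asarray(frames)
    idx = np.linspace(0, len(frames) - 1, target_len)
    return frames[np.round(idx).astype(int)]
```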
2) Optical flow motion feature extraction
Optical-flow motion features are extracted from the preprocessed face image sequence by an optical-flow method. An optical-flow map is a two-dimensional vector field that reflects the trend of change of the gray value at each point of the image. The algorithm uses the temporal pixel changes in the video sequence to find the correspondence between the current frame and a previous frame, and computes the spatial motion information between adjacent frames.
In the preprocessed face image sequence, assume that the gray value at pixel (x, y) at time t is I(x, y, t) and that after Δt the pixel moves to (x + Δx, y + Δy). Conservation of gray value gives:
I(x, y, t) = I(x + Δx, y + Δy, t + Δt)  (3)
Expanding the right-hand side with the Taylor formula gives:
I(x + Δx, y + Δy, t + Δt) = I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt + β  (4)
where β denotes the second- and higher-order terms; neglecting these higher-order terms yields:
I_x u + I_y v + I_t = 0  (5)
wherein:
u = dx/dt,  v = dy/dt,  I_x = ∂I/∂x,  I_y = ∂I/∂y,  I_t = ∂I/∂t  (6)
I_x, I_y and I_t denote the partial derivatives of the pixel gray value along the x, y and t directions of the image, respectively, and (u, v) is the velocity vector of the optical flow in the horizontal and vertical directions. (u, v) forms the first two dimensions of the optical-flow image; the result of further processing (u, v) with equation (7) is used as the third dimension of the optical-flow features, so that the feature vector (optical-flow map) can be organized like an RGB image:
[Formula (7): third optical-flow channel computed from (u, v)]
before extracting the optical flow features, the video sequence needs to be time-normalized, so when extracting the optical flow, the k value between the current frame and the previous frame, that is, the inter-frame optical flow interval needs to be adjusted.
Step 2: feature analysis model construction
The invention provides a micro-expression detection model that combines a shallow convolutional neural network with a Transformer encoder. The shallow convolutional neural network (CNN model) extracts visual features from each single-frame optical-flow image, the extracted visual features are fed into a two-layer Transformer encoder that models the relationship between the frames of the sequence, and finally two different fully connected layers implement two subtasks: recognition and localization, where recognition is constructed as an auxiliary task that supports the main task.
1) CNN model
Because samples for the micro-expression spotting task are limited, a shallow structure is adopted to complete the localization task. This sub-network consists of two blocks: a multi-stream block and an embedding block. The multi-stream block contains three streams with different parameters, representing different feature extraction methods; the convolution filters of the three streams perform dilated convolution with different filter sizes. The embedding block introduces a soft attention mechanism, performs several densely connected convolution operations and, through parameter sharing, one embedding block is treated as several convolution layers. In each convolution block, batch normalization and a rectified linear unit (ReLU) apply a non-linear operation to the depth features; the features of the convolution layers of the downstream modules are concatenated, an adaptive average pooling layer is applied, and the pooled output is flattened into a feature vector.
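A minimal PyTorch sketch of the multi-stream block is given below; the channel counts, kernel sizes and dilation rates are assumptions, since the source only states that the three streams use convolution filters of different sizes with dilated convolution.

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """Three parallel dilated-convolution streams over the 3-channel flow map,
    concatenated, pooled and flattened into a feature vector (illustrative)."""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        def stream(kernel, dilation):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel,
                          padding=dilation * (kernel // 2), dilation=dilation),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
        self.streams = nn.ModuleList([stream(3, 1), stream(3, 2), stream(5, 1)])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                        # x: (B, 3, H, W) optical-flow map
        feats = [s(x) for s in self.streams]     # three streams, same spatial size
        y = torch.cat(feats, dim=1)              # concatenate along the channel axis
        return self.pool(y).flatten(1)           # (B, 3 * out_ch) feature vector
```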
The embedding block uses an attention module. The core idea of the attention mechanism is similar to human attention: when observing a picture, a person focuses on the region of interest and ignores the other background information in the image. Since the input of the shallow network is an optical-flow map, which contains only a small amount of the information needed for micro-expression detection, the attention mechanism is intended to make the network focus on exactly those features.
After the visual features extracted by the multi-stream block are fed into the embedding block, the attention mechanism computes an attention matrix as follows:
a = softmax(w · y + b)  (8)
where w is the parameter to be learned, b is the bias, and y is the output of the multi-stream block.
Different degrees of attention are then assigned to each region according to the attention matrix to obtain new visual features, as follows:
[Formula (9): attention-weighted combination of the feature maps F_j with the attention weights W_A and the down-sampling d(·)]
where ⊙ denotes element-wise multiplication of the two matrices, d(·) denotes the down-sampling operation, W_A is the weight matrix of the attention unit, F_j is the j-th channel feature map of the input X, and N is the number of feature maps.
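The soft-attention unit can be sketched as below: equation (8) is applied to a pooled summary of the multi-stream features, and since the exact combination rule of equation (9) is not reproduced in the source, a simple channel-wise re-weighting is used here as an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Attention matrix a = softmax(w * y + b) over channels, used to
    re-weight the feature maps coming from the multi-stream block."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)    # learnable w and b of eq. (8)

    def forward(self, y):                          # y: (B, C, H, W)
        summary = y.mean(dim=(2, 3))               # per-channel summary of y
        a = F.softmax(self.fc(summary), dim=1)     # attention matrix, eq. (8)
        return y * a[:, :, None, None]             # re-weighted visual features
```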
2) Transformer encoder
In this method, a shallow Transformer encoder is selected as the global attention module; based on the visual features extracted by the shallow convolutional neural network, it models the relationship between the frames of the sequence and uses global attention over inter-frame similarity to model the dependencies between frames.
the self-attention module in the Transformer encoder models different positions of a single sequence, and Q, K, V three layers are used for representing the given sequence; wherein the self-attention function is used to map a query (Q layer) and a set of key value (K layer and V layer) pairs to an output, the function is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V  (10)
where Q, K and V denote the Q layer, K layer and V layer, respectively, and d_k is the dimension of the key vectors;
further, for an input sequence, it is often necessary to split the input sequence into a plurality of sequences and input the sequences into a plurality of heads for processing, and finally, an MHA (multi-head-self-attack) function performs the following processing on the plurality of heads:
MHA(Q, K, V) = Concat(head_0, ..., head_z) W^O,
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)  (11)
where W^O, W_i^Q, W_i^K and W_i^V are parameter matrices.
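A minimal sketch of the two-layer Transformer encoder on top of the per-frame features, using PyTorch's built-in multi-head self-attention; the model width, number of heads and feed-forward size are assumptions not stated in the source.

```python
import torch
import torch.nn as nn

d_model = 256                                   # assumed per-frame feature width
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=512, batch_first=True)
temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

frame_features = torch.randn(8, 32, d_model)    # (batch, frames, features)
encoded = temporal_encoder(frame_features)      # same shape, inter-frame context added
```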
Step 3: network model training and micro-expression detection
1) Network model training
The extracted optical-flow sequence is fed into the feature analysis model to obtain the prediction outputs of the two fully connected layers that follow the Transformer encoder. The output p of the first fully connected layer predicts whether the input optical-flow sequence contains a macro-expression or a micro-expression (binary output); the second fully connected layer outputs the regression information of the optical-flow sequence, which contains the offset of the micro-expression relative to the temporal position of the current optical-flow sequence and the length of the interval in frames.
A loss function is constructed from the prediction outputs of the fully connected layers and consists of three parts. The first loss term is determined by comparing the macro/micro-expression classification result with the ground-truth annotation:
Loss_ce = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]  (12)
where y_i is the label of example i and p_i is the predicted probability for i. The second and third loss terms are determined by comparing the regression information of the output optical-flow sequence with the ground-truth annotation; these losses are given by equations (13) and (14):
[Formula (13): regression loss on the temporal offset]
where d̂_i and d_i denote the estimated value and the ground-truth value of the offset, respectively;
[Formula (14): regression loss on the interval length]
where l̂_i and l_i denote the estimated value and the ground-truth value of the interval length, respectively.
The final loss function is obtained from the first, second and third loss terms:
Loss = λ_1 Loss_ce + λ_2 Loss_o + λ_3 Loss_l  (15)
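As a sketch of equation (15), assuming binary cross-entropy for the classification term and smooth L1 for the two regression terms (the source gives the structure of the loss but not the exact regression formulas), with the weighting coefficients left as an assumed parameter:

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()      # classification term, Loss_ce
reg = nn.SmoothL1Loss()           # assumed form of the regression terms

def total_loss(p_logit, y, d_hat, d, l_hat, l, lambdas=(1.0, 1.0, 0.5)):
    """Weighted sum of the three loss terms of equation (15); the lambda
    values are assumptions, not taken from the source."""
    loss_ce = bce(p_logit, y)     # macro/micro classification
    loss_o = reg(d_hat, d)        # temporal-offset regression, Loss_o
    loss_l = reg(l_hat, l)        # interval-length regression, Loss_l
    l1, l2, l3 = lambdas
    return l1 * loss_ce + l2 * loss_o + l3 * loss_l
```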
2) Micro-expression detection
During detection, the long video is divided into N segments and each segment is detected independently. In a long video this causes detection intervals to overlap, and several prediction boxes may repeatedly localize the same micro-expression interval. These repeated boxes are removed by non-maximum suppression (NMS) to obtain the best-matching micro-expression interval, in the following steps: 1) sort all detection intervals by confidence; 2) compare the interval with the highest confidence with the remaining intervals, and delete a current interval if its intersection-over-union with the highest-confidence interval exceeds a certain threshold; 3) repeat the same operation for the interval with the second-highest confidence, and so on, until the final detection intervals are obtained.
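A sketch of the interval NMS described above, operating on predicted (start, end) frame intervals and their confidences; the IoU threshold is an assumed parameter.

```python
import numpy as np

def temporal_nms(intervals, scores, iou_thr=0.5):
    """Keep the best-matching intervals: sort by confidence, then repeatedly
    suppress intervals whose overlap with the current best exceeds iou_thr."""
    intervals = np.asarray(intervals, dtype=float)   # shape (M, 2): (start, end)
    order = np.argsort(scores)[::-1]                 # step 1: sort by confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        s1, e1 = intervals[best]
        rest = order[1:]
        s2, e2 = intervals[rest, 0], intervals[rest, 1]
        inter = np.maximum(0.0, np.minimum(e1, e2) - np.maximum(s1, s2))
        union = (e1 - s1) + (e2 - s2) - inter
        iou = inter / np.maximum(union, 1e-8)        # step 2: overlap with the best
        order = rest[iou <= iou_thr]                 # step 3: drop overlaps, repeat
    return keep
```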
Advantageous effects:
the effectiveness of the invention is measured by three evaluation indexes of F1 score (F1-score), Precision (Precision) and recall (call). In two public data sets SAMM and CAS (ME)2The proposed method was evaluated and shown to be competitive. The method has the advantages of simple network input, shallow network structure and reduced calculated amount.
Description of the drawings:
FIG. 1 is a diagram of a shallow network-based micro-expression detection framework according to the present invention
FIG. 2 is an example optical flow diagram extracted from macro-expressions and micro-expressions
FIG. 3 is a diagram of a shallow CNN model according to the present invention
FIG. 4 is a diagram of a Transformer model
Detailed description of embodiments:
step 1: data preprocessing and feature extraction
The method preprocesses the face image sequence to reduce the influence of factors such as face size and position, and extracts optical-flow motion features from the processed face sequence as the input of the subsequent feature analysis model.
1) Extracting faces from the video sequence
The spatial position of the face is extracted from the micro-expression long video, and the whole video sequence is cropped with a fixed position box to obtain a fixed face region, in the following three steps: locate the face in the first frame with the OpenCV toolbox to obtain a spatial position box; further refine the position box and denoise (image downsampling); crop the remaining images of the video sequence with the position box.
During video preprocessing, the OpenCV toolbox is first used to obtain the face position coordinates (x_0, y_0) and (x_1, y_1), which denote the top-left and bottom-right corners of the face position box, respectively. The detected box still contains noise such as headphones and hair and cannot be fed directly into the feature analysis model, so it is cropped further. From the initial face coordinates (x_0, y_0), (x_1, y_1), the cropped coordinates (x'_0, y'_0), (x'_1, y'_1) are obtained as follows:
(x'_0, y'_0) = (x_0 + a, y_0 - b)  (16)
(x'_1, y'_1) = (x_1 - a, y_1 + b)  (17)
where a is the horizontal cropping distance and b is the vertical cropping distance.
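The first step above, locating the face on the reference frame with OpenCV, might look like the sketch below; the source only says that an OpenCV toolbox is used, so the Haar-cascade detector here is an illustrative choice rather than the detector of the invention.

```python
import cv2

def detect_face_box(first_frame):
    """Return (x0, y0, x1, y1) for the largest face found in the reference frame."""
    gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # pick the largest detection
    return x, y, x + w, y + h
```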
The video sequence is then cropped uniformly: the first frame of each micro-expression long video is taken as the reference frame, the face image in the reference frame is taken as the model face, the cropping matrix of the model face is obtained, and the remaining images of the video sequence are cropped with this matrix to complete face extraction.
To ensure that the video sequences fed into the feature analysis model have the same length, the micro-expression long video sequences are temporally normalized, so that sequences of different lengths become the same length.
2) Optical flow motion feature extraction
Optical-flow motion features are extracted from the preprocessed face image sequence by an optical-flow method. An optical-flow map is a two-dimensional vector field that reflects the trend of change of the gray value at each point of the image. The algorithm uses the temporal pixel changes in the video sequence to find the correspondence between the current frame and a previous frame, and computes the spatial motion information between adjacent frames.
In the preprocessed face image sequence, assume that the gray value at pixel (x, y) at time t is I(x, y, t) and that after Δt the pixel moves to (x + Δx, y + Δy). Conservation of gray value gives:
I(x, y, t) = I(x + Δx, y + Δy, t + Δt)  (18)
Expanding the right-hand side with the Taylor formula gives:
I(x + Δx, y + Δy, t + Δt) = I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt + β  (19)
where β denotes the second- and higher-order terms; neglecting these higher-order terms yields:
I_x u + I_y v + I_t = 0  (20)
wherein:
u = dx/dt,  v = dy/dt,  I_x = ∂I/∂x,  I_y = ∂I/∂y,  I_t = ∂I/∂t  (21)
I_x, I_y and I_t denote the partial derivatives of the pixel gray value along the x, y and t directions of the image, respectively, and (u, v) is the velocity vector of the optical flow in the horizontal and vertical directions. (u, v) forms the first two dimensions of the optical-flow image; the result of further processing (u, v) with equation (22) is used as the third dimension of the optical-flow features, so that the feature vector (optical-flow map) can be organized like an RGB image:
[Formula (22): third optical-flow channel computed from (u, v)]
before extracting optical flow features, time normalization is required to be performed on a video sequence, so that when extracting optical flow, k values between a current frame and a previous frame and optical flow intervals between frames need to be adjusted.
Step 2: feature analysis model construction
The invention provides a micro-expression detection model that combines a shallow convolutional neural network with a Transformer encoder. The shallow convolutional neural network (CNN model) extracts visual features from each single-frame optical-flow image, the extracted visual features are fed into a two-layer Transformer encoder that models the relationship between the frames of the sequence, and finally two different fully connected layers implement two subtasks: recognition and localization, where recognition is constructed as an auxiliary task that supports the main task.
1) CNN model
Because samples for the micro-expression spotting task are limited, a shallow structure is adopted to complete the localization task. This sub-network consists of two blocks: a multi-stream block and an embedding block. The multi-stream block contains three streams with different parameters, representing different feature extraction methods; the convolution filters of the three streams perform dilated convolution with different filter sizes. The embedding block introduces a soft attention mechanism, performs several densely connected convolution operations and, through parameter sharing, one embedding block is treated as several convolution layers. In each convolution block, batch normalization and a rectified linear unit (ReLU) apply a non-linear operation to the depth features; the features of the convolution layers of the downstream modules are concatenated, an adaptive average pooling layer is applied, and the pooled output is flattened into a feature vector.
The embedding block uses an attention module. The core idea of the attention mechanism is similar to human attention: when observing a picture, a person focuses on the region of interest and ignores the other background information in the image. Since the input of the shallow network is an optical-flow map, which contains only a small amount of the information needed for micro-expression detection, the attention mechanism is intended to make the network focus on exactly those features.
After the visual features extracted by the multi-stream block are fed into the embedding block, the attention mechanism computes an attention matrix as follows:
a = softmax(w · y + b)  (23)
where w is the parameter to be learned, b is the bias, and y is the output of the multi-stream block.
Different degrees of attention are then assigned to each region according to the attention matrix to obtain new visual features, as follows:
[Formula (24): attention-weighted combination of the feature maps F_j with the attention weights W_A and the down-sampling d(·)]
where ⊙ denotes element-wise multiplication of the two matrices, d(·) denotes the down-sampling operation, W_A is the weight matrix of the attention unit, F_j is the j-th channel feature map of the input X, and N is the number of feature maps.
2) Transformer encoder
In this method, a shallow Transformer encoder is selected as the global attention module; based on the visual features extracted by the shallow convolutional neural network, it models the relationship between the frames of the sequence and uses global attention over inter-frame similarity to model the dependencies between frames.
the self-attention module in the Transformer encoder models different positions of a single sequence, and Q, K, V three layers are used for representing the given sequence; wherein the self-attention function is used to map a query (Q layer) and a set of key value (K layer and V layer) pairs to an output, the function is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V  (25)
where Q, K and V denote the Q layer, K layer and V layer, respectively, and d_k is the dimension of the key vectors;
further, for an input sequence, it is often necessary to split the input sequence into a plurality of sequences and input the sequences into a plurality of heads for processing, and finally, an MHA (multi-head-self-attack) function performs the following processing on the plurality of heads:
MHA(Q, K, V) = Concat(head_0, ..., head_z) W^O,
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)  (26)
where W^O, W_i^Q, W_i^K and W_i^V are parameter matrices.
Step 3: network model training and micro-expression detection
1) Network model training
The extracted optical-flow sequence is fed into the feature analysis model to obtain the prediction outputs of the two fully connected layers that follow the Transformer encoder. The output p of the first fully connected layer predicts whether the input optical-flow sequence contains a macro-expression or a micro-expression (binary output); the second fully connected layer outputs the regression information of the optical-flow sequence, which contains the offset of the micro-expression relative to the temporal position of the current optical-flow sequence and the length of the interval in frames.
A loss function is constructed from the prediction outputs of the fully connected layers and consists of three parts. The first loss term is determined by comparing the macro/micro-expression classification result with the ground-truth annotation:
Loss_ce = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]  (27)
where y_i is the label of example i and p_i is the predicted probability for i;
The second and third loss terms are determined by comparing the regression information of the output optical-flow sequence with the ground-truth annotation; these losses are given by equations (28) and (29):
[Formula (28): regression loss on the temporal offset]
where d̂_i and d_i denote the estimated value and the ground-truth value of the offset, respectively;
[Formula (29): regression loss on the interval length]
where l̂_i and l_i denote the estimated value and the ground-truth value of the interval length, respectively. The final loss function is obtained from the first, second and third loss terms:
Loss = λ_1 Loss_ce + λ_2 Loss_o + λ_3 Loss_l  (30)
the three linear coefficients are set to 1.1,0.5, respectively.
2) Micro-expression detection
During detection, the long video is divided into N segments and each segment is detected independently. In a long video this causes detection intervals to overlap, and several prediction boxes may repeatedly localize the same micro-expression interval. These repeated boxes are removed by non-maximum suppression (NMS) to obtain the best-matching micro-expression interval, in the following steps: 1) sort all detection intervals by confidence; 2) compare the interval with the highest confidence with the remaining intervals, and delete a current interval if its intersection-over-union with the highest-confidence interval exceeds a certain threshold; 3) repeat the same operation for the interval with the second-highest confidence, and so on, until the final detection intervals are obtained.

Claims (1)

1. A long video micro-expression detection method based on a shallow network, mainly comprising the following three parts: a data preprocessing mode and optical-flow features, a feature analysis model, and model training and detection.
(1) Data preprocessing mode and optical-flow features
The spatial position of the face is extracted from the micro-expression long video, and the whole video sequence is cropped with a fixed position box to obtain a fixed face region: locate the face in the first frame to obtain a spatial position box; further refine the position box and denoise (image downsampling); crop the remaining images of the video sequence with the position box.
After the face position coordinates are obtained, further cropping is needed. From the initial face coordinates (x_0, y_0), (x_1, y_1), the cropped coordinates (x'_0, y'_0), (x'_1, y'_1) are obtained as follows:
(x'_0, y'_0) = (x_0 + a, y_0 - b)  (1)
(x'_1, y'_1) = (x_1 - a, y_1 + b)  (2)
where a is the horizontal cropping distance and b is the vertical cropping distance. The first frame of each micro-expression long video is taken as the reference frame, the face image in the reference frame is taken as the model face, the cropping matrix of the model face is obtained, and the remaining images of the video sequence are cropped with this matrix to complete face extraction.
Optical-flow motion features are extracted from the preprocessed face image sequence: the temporal pixel changes in the video sequence are used to find the correspondence between the current frame and a previous frame and to compute the spatial motion information between adjacent frames; using conservation of light intensity before and after pixel motion and a Taylor-expansion simplification, the extracted optical flow is obtained and denoted (u, v), which is further processed by equation (3) to form a three-dimensional composite optical-flow feature map:
[Formula (3): third optical-flow channel computed from (u, v)]
Before extracting the optical-flow features, the video sequence is time-normalized; therefore, when extracting the optical flow, the value k between the current frame and the preceding frame is adjusted to set the inter-frame optical-flow interval.
(2) Feature analysis model
The invention provides a novel shallow network structure for further extracting the motion features of micro-expressions. The network is built from a new shallow convolutional network and a shallow Transformer encoder.
The shallow convolutional network consists of two modules: a multi-stream module and an embedding module. The multi-stream module contains three streams with different parameters, representing different feature extraction methods; the convolution filters of the three streams perform dilated convolution with different filter sizes. The embedding block introduces a soft attention mechanism, performs several densely connected convolution operations and, through parameter sharing, one embedding block is treated as several convolution layers. In each convolution block, batch normalization and a rectified linear unit (ReLU) apply a non-linear operation to the depth features; the features of the convolution layers of the downstream modules are concatenated, an adaptive average pooling layer is applied, and the pooled output is flattened into a feature vector.
The shallow Transformer encoder serves as a global attention module: it integrates the visual features extracted by the shallow convolutional neural network, models the relationship between the frames of the sequence, and models the dependencies between frames with global attention over inter-frame similarity. The self-attention module in the encoder models the different positions of a single sequence, representing the given sequence with three layers Q, K and V and mapping a query (Q layer) and a set of key-value pairs (K layer and V layer) to the output. The prediction outputs of the two fully connected layers of the encoder are: the output p of the first fully connected layer, which predicts whether the input optical-flow sequence contains a macro-expression or a micro-expression (binary output); and the regression information of the optical-flow sequence output by the second fully connected layer, which contains the offset of the micro-expression relative to the temporal position of the current optical-flow sequence and the length of the interval in frames.
(3) Model training and detection
The invention combines several loss functions to train the model; the loss function consists of three parts. The first loss term is determined by comparing the macro/micro-expression classification result with the ground-truth annotation:
Loss_ce = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
where y_i is the label of example i and p_i is the predicted probability for i;
The second and third loss terms are determined by comparing the regression information of the output optical-flow sequence with the ground-truth annotation; these losses are given by formula (4) and formula (5):
[Formula (4): regression loss on the temporal offset]
where d̂_i and d_i denote the estimated value and the ground-truth value of the offset, respectively;
[Formula (5): regression loss on the interval length]
where l̂_i and l_i denote the estimated value and the ground-truth value of the interval length, respectively.
The final loss function is obtained from the first, second and third loss terms:
Loss = λ_1 Loss_ce + λ_2 Loss_o + λ_3 Loss_l  (6)
the parameters of the model are learned using back propagation algorithms using the loss function in equation (6).
During detection, the long video is divided into N segments and each video segment is detected independently with the depth model. Detection intervals overlap in a long video, and several prediction boxes may repeatedly localize the same micro-expression interval; these repeated boxes are removed by non-maximum suppression (NMS) to obtain the best-matching micro-expression interval: 1) sort all detection intervals by confidence; 2) compare the interval with the highest confidence with the remaining intervals, and delete a current interval if its intersection-over-union with the highest-confidence interval exceeds a certain threshold; 3) repeat the same operation for the interval with the second-highest confidence, and so on, until the final detection intervals are obtained.
CN202210075626.3A 2022-01-22 2022-01-22 Long video micro expression detection method based on shallow network Active CN114639136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210075626.3A CN114639136B (en) 2022-01-22 2022-01-22 Long video micro expression detection method based on shallow network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210075626.3A CN114639136B (en) 2022-01-22 2022-01-22 Long video micro expression detection method based on shallow network

Publications (2)

Publication Number Publication Date
CN114639136A true CN114639136A (en) 2022-06-17
CN114639136B CN114639136B (en) 2024-03-08

Family

ID=81946013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210075626.3A Active CN114639136B (en) 2022-01-22 2022-01-22 Long video micro expression detection method based on shallow network

Country Status (1)

Country Link
CN (1) CN114639136B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842539A (en) * 2022-05-30 2022-08-02 山东大学 Micro-expression discovery method and system based on attention mechanism and one-dimensional convolution sliding window
CN116091956A (en) * 2022-09-08 2023-05-09 北京中关村科金技术有限公司 Video-based micro-expression recognition method, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001241A (en) * 2020-07-16 2020-11-27 山东大学 Micro-expression identification method and system based on channel attention mechanism
WO2021042547A1 (en) * 2019-09-04 2021-03-11 平安科技(深圳)有限公司 Behavior identification method, device and computer-readable storage medium
CN112560812A (en) * 2021-02-19 2021-03-26 中国科学院自动化研究所 Micro-expression recognition method based on fusion depth features
CN113496217A (en) * 2021-07-08 2021-10-12 河北工业大学 Method for identifying human face micro expression in video image sequence
WO2021259005A1 (en) * 2020-06-23 2021-12-30 平安科技(深圳)有限公司 Video-based micro-expression recognition method and apparatus, computer device, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021042547A1 (en) * 2019-09-04 2021-03-11 平安科技(深圳)有限公司 Behavior identification method, device and computer-readable storage medium
WO2021259005A1 (en) * 2020-06-23 2021-12-30 平安科技(深圳)有限公司 Video-based micro-expression recognition method and apparatus, computer device, and storage medium
CN112001241A (en) * 2020-07-16 2020-11-27 山东大学 Micro-expression identification method and system based on channel attention mechanism
CN112560812A (en) * 2021-02-19 2021-03-26 中国科学院自动化研究所 Micro-expression recognition method based on fusion depth features
CN113496217A (en) * 2021-07-08 2021-10-12 河北工业大学 Method for identifying human face micro expression in video image sequence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
牛斌; 张钥迪; 马利: "Application of interpretable spatio-temporal convolutional networks to micro-expression recognition", Journal of Liaoning University (Natural Science Edition), no. 02, 15 May 2020 (2020-05-15) *
王晓华; 潘丽娟; 彭穆子; 胡敏; 金春花; 任福继: "Facial expression recognition in video sequences based on a hierarchical attention model", Journal of Computer-Aided Design & Computer Graphics, no. 01, 15 January 2020 (2020-01-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842539A (en) * 2022-05-30 2022-08-02 山东大学 Micro-expression discovery method and system based on attention mechanism and one-dimensional convolution sliding window
CN116091956A (en) * 2022-09-08 2023-05-09 北京中关村科金技术有限公司 Video-based micro-expression recognition method, device and storage medium

Also Published As

Publication number Publication date
CN114639136B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111310659B (en) Human body action recognition method based on enhanced graph convolution neural network
CN114639136A (en) Long video micro-expression detection method based on shallow network
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN111598928B (en) Abrupt motion target tracking method based on semantic evaluation and region suggestion
CN108830170B (en) End-to-end target tracking method based on layered feature representation
CN112489081A (en) Visual target tracking method and device
CN112288778B (en) Infrared small target detection method based on multi-frame regression depth network
CN113706581A (en) Target tracking method based on residual channel attention and multilevel classification regression
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
CN114972426A (en) Single-target tracking method based on attention and convolution
CN113901922A (en) Hidden representation decoupling network-based occluded pedestrian re-identification method and system
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
CN111242003B (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN113033283A (en) Improved video classification system
CN116311518A (en) Hierarchical character interaction detection method based on human interaction intention information
CN116630641A (en) Long-time target tracking method based on attention mechanism
CN116051601A (en) Depth space-time associated video target tracking method and system
CN110580712A (en) Improved CFNet video target tracking method using motion information and time sequence information
CN116188555A (en) Monocular indoor depth estimation algorithm based on depth network and motion information
CN114581485A (en) Target tracking method based on language modeling pattern twin network
CN107016675A (en) A kind of unsupervised methods of video segmentation learnt based on non local space-time characteristic
Bharathi et al. Texture feature extraction of infrared river ice images using second-order spatial statistics
Chen et al. The obstacles detection for outdoor robot based on computer vision in deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant