CN114639136A - Long video micro-expression detection method based on shallow network - Google Patents

Long video micro-expression detection method based on shallow network

Info

Publication number
CN114639136A
Authority
CN
China
Prior art keywords
micro
expression
optical flow
sequence
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210075626.3A
Other languages
Chinese (zh)
Other versions
CN114639136B (en)
Inventor
夏召强
郭旭鹏
梁桓
胡智星
张雨姗
冯晓毅
蒋晓悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210075626.3A priority Critical patent/CN114639136B/en
Publication of CN114639136A publication Critical patent/CN114639136A/en
Application granted granted Critical
Publication of CN114639136B publication Critical patent/CN114639136B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

To address the low accuracy and weak detection capability of existing micro-expression detection, the invention provides a long-video micro-expression detection method based on a shallow network. By combining the complementary characteristics of a shallow convolutional network and a Transformer encoder, the method achieves higher detection precision, higher speed and lower error.

Description

Long video micro-expression detection method based on shallow network
Technical field:
The invention relates to the field of computer vision, and in particular to a long video micro-expression detection method based on a shallow network.
Background art:
A micro-expression is a genuine feeling that a person reveals involuntarily, usually expressed as a slight, transient movement of the face. Micro-expression research plays an important role in lie detection, medicine, negotiation and other fields, but micro-expressions are difficult to detect because of their short duration and small movement amplitude. In recent years, facial expression analysis based on computer vision has attracted wide attention because it can automatically reveal human emotions. Micro-expression research is broadly divided into two categories: micro-expression detection and micro-expression recognition. Among these tasks, obtaining the temporal localization of expression frames from a long video sequence is one of the most challenging problems. Compared with the recognition task, micro-expression detection requires more research to further improve performance and to support subsequent recognition. Automatic analysis of micro-expression data with computer vision has therefore become one of the hot topics in affective computing.
From early hand-crafted macro-expression representation models to end-to-end learning methods based on deep models, the performance of micro-expression analysis has improved markedly. Existing macro-expression descriptors (such as LBP-TOP and MDMO) and general convolutional networks (such as VGGNet and ResNet) can describe expression changes fairly accurately, but because micro-expressions are short and weak, automatically extracting facial micro-expression information from a long video sequence remains a difficult point in automatic micro-expression analysis.
The document "Local Bilinear temporal Neural Network for spoting Macro-and Micro-expression in Long Video Sequences" Hang Pan and the like uses Bilinear temporal Neural Network (BCNN) to extract global and Local features of each frame picture in a Long Video, and classifies the features after the global and Local features are fused to obtain a final Micro-expression detection interval. However, the detection accuracy of the technology is still low, correlation information between frames is less, and robustness is poor when facial expressions change.
Objects of the invention:
aiming at the problems of low detection accuracy and poor detection capability of micro-expression in the current long video, the invention provides a long video micro-expression detection method based on a shallow network.
Summary of the invention:
The invention mainly studies a long-video micro-expression detection algorithm based on a shallow convolutional neural network and a Transformer encoder. A shallow convolutional neural network (MeNet) extracts the visual features of each image from the preprocessed video sequence, a Transformer encoder extracts dynamic features from the resulting sequence of visual features, and micro-expressions are finally detected with a sliding window. The invention mainly comprises three steps: data preprocessing and feature extraction, feature analysis model construction, and network model training and micro-expression detection.
Step 1: data preprocessing and feature extraction
A video database containing both macro- and micro-expressions also contains rich irrelevant content, such as background and headphone noise, which can strongly affect micro-expression detection, so the video sequences of the database need to be preprocessed. Preprocessing the face image sequence effectively reduces the influence of factors such as face size and position. After preprocessing, optical-flow motion features are extracted from the processed face sequences and used as the input of the subsequent feature analysis model.
1) Extracting faces from the video sequence
The spatial position of the face is extracted from the micro-expression long video, and the whole video sequence is cropped with a fixed position box to obtain a fixed face region, in the following three steps: locate the face in the first frame to obtain a spatial position box; further refine the position box and denoise (image downsampling); crop the remaining images of the video sequence with the position box.
During video preprocessing, a face detection algorithm is first used to obtain the face position coordinates (x_0, y_0) and (x_1, y_1), which denote the top-left and bottom-right corners of the face position box, respectively. The detected box still contains noise such as headphones and hair and cannot be fed directly into the feature analysis model, so it is cropped further. From the initial face coordinates (x_0, y_0), (x_1, y_1), the cropped coordinates (x'_0, y'_0), (x'_1, y'_1) are obtained as follows:
(x'_0, y'_0) = (x_0 + a, y_0 - b)  (1)
(x'_1, y'_1) = (x_1 - a, y_1 + b)  (2)
where a is the horizontal cropping distance and b is the vertical cropping distance.
The video sequence is then cropped uniformly: the first frame of each micro-expression long video is taken as the reference frame, the face image in the reference frame is taken as the model face, the cropping matrix of the model face is obtained, and the remaining images of the video sequence are cropped with this matrix to complete face extraction.
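As an illustration of the cropping step, the following Python sketch applies equations (1) and (2) to the box detected on the reference frame and then crops every frame of the sequence with the same box; the function and variable names are illustrative assumptions, not part of the original disclosure.

```python
def crop_face_sequence(frames, box, a, b):
    """Crop all frames with the box found on the first (reference) frame,
    adjusted by the horizontal distance a and vertical distance b as in
    equations (1) and (2).  frames: iterable of H x W x 3 NumPy images."""
    x0, y0, x1, y1 = box                      # top-left and bottom-right corners
    nx0, ny0 = x0 + a, y0 - b                 # (x'_0, y'_0) = (x_0 + a, y_0 - b)
    nx1, ny1 = x1 - a, y1 + b                 # (x'_1, y'_1) = (x_1 - a, y_1 + b)
    cropped = []
    for frame in frames:
        h, w = frame.shape[:2]
        cy0, cy1 = max(ny0, 0), min(ny1, h)   # clamp to the image borders
        cx0, cx1 = max(nx0, 0), min(nx1, w)
        cropped.append(frame[cy0:cy1, cx0:cx1])
    return cropped
```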
To ensure that the video sequences fed into the feature analysis model have the same length, the micro-expression long video sequences are temporally normalized, so that sequences of different lengths become the same length.
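One simple way to bring sequences of different lengths to a common length is uniform resampling of the frame index; the patent does not fix a particular normalization method, so the following is only an assumed sketch.

```python
import numpy as np

def normalize_sequence_length(frames, target_len):
    """Resample a face-image sequence (T, H, W, C) to target_len frames by
    uniformly spaced frame indices (an assumed stand-in for the temporal
    normalization step)."""
    frames = np.asarray(frames)
    idx = np.linspace(0, len(frames) - 1, target_len)
    return frames[np.round(idx).astype(int)]
```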
2) Optical flow motion feature extraction
Optical-flow motion features are extracted from the preprocessed face image sequence by an optical-flow method. An optical-flow map is a two-dimensional vector field that reflects the trend of change of the gray value at each point of the image. The algorithm uses the temporal pixel changes in the video sequence to find the correspondence between the current frame and a previous frame, and computes the spatial motion information between adjacent frames.
In the preprocessed face image sequence, assume that the gray value at pixel (x, y) at time t is I(x, y, t) and that after Δt the pixel moves to (x + Δx, y + Δy). Conservation of gray value gives:
I(x, y, t) = I(x + Δx, y + Δy, t + Δt)  (3)
Expanding the right-hand side with the Taylor formula gives:
I(x + Δx, y + Δy, t + Δt) = I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt + β  (4)
where β denotes the second- and higher-order terms; neglecting these higher-order terms yields:
I_x u + I_y v + I_t = 0  (5)
wherein:
u = dx/dt,  v = dy/dt,  I_x = ∂I/∂x,  I_y = ∂I/∂y,  I_t = ∂I/∂t  (6)
I_x, I_y and I_t denote the partial derivatives of the pixel gray value along the x, y and t directions of the image, respectively, and (u, v) is the velocity vector of the optical flow in the horizontal and vertical directions. (u, v) forms the first two dimensions of the optical-flow image; the result of further processing (u, v) with equation (7) is used as the third dimension of the optical-flow features, so that the feature vector (optical-flow map) can be organized like an RGB image:
[Formula (7): third optical-flow channel computed from (u, v)]
before extracting the optical flow features, the video sequence needs to be time-normalized, so when extracting the optical flow, the k value between the current frame and the previous frame, that is, the inter-frame optical flow interval needs to be adjusted.
Step 2: feature analysis model construction
The invention provides a micro-expression detection model that combines a shallow convolutional neural network with a Transformer encoder. The shallow convolutional neural network (CNN model) extracts visual features from each single-frame optical-flow image, the extracted visual features are fed into a two-layer Transformer encoder that models the relationship between the frames of the sequence, and finally two different fully connected layers implement two subtasks: recognition and localization, where recognition is constructed as an auxiliary task that supports the main task.
1) CNN model
Because samples for the micro-expression spotting task are limited, a shallow structure is adopted to complete the localization task. This sub-network consists of two blocks: a multi-stream block and an embedding block. The multi-stream block contains three streams with different parameters, representing different feature extraction methods; the convolution filters of the three streams perform dilated convolution with different filter sizes. The embedding block introduces a soft attention mechanism, performs several densely connected convolution operations and, through parameter sharing, one embedding block is treated as several convolution layers. In each convolution block, batch normalization and a rectified linear unit (ReLU) apply a non-linear operation to the depth features; the features of the convolution layers of the downstream modules are concatenated, an adaptive average pooling layer is applied, and the pooled output is flattened into a feature vector.
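A minimal PyTorch sketch of the multi-stream block is given below; the channel counts, kernel sizes and dilation rates are assumptions, since the source only states that the three streams use convolution filters of different sizes with dilated convolution.

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """Three parallel dilated-convolution streams over the 3-channel flow map,
    concatenated, pooled and flattened into a feature vector (illustrative)."""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        def stream(kernel, dilation):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel,
                          padding=dilation * (kernel // 2), dilation=dilation),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
        self.streams = nn.ModuleList([stream(3, 1), stream(3, 2), stream(5, 1)])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                        # x: (B, 3, H, W) optical-flow map
        feats = [s(x) for s in self.streams]     # three streams, same spatial size
        y = torch.cat(feats, dim=1)              # concatenate along the channel axis
        return self.pool(y).flatten(1)           # (B, 3 * out_ch) feature vector
```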
The embedding block uses an attention module. The core idea of the attention mechanism is similar to human attention: when observing a picture, a person focuses on the region of interest and ignores the other background information in the image. Since the input of the shallow network is an optical-flow map, which contains only a small amount of the information needed for micro-expression detection, the attention mechanism is intended to make the network focus on exactly those features.
After the visual features extracted by the multi-stream block are fed into the embedding block, the attention mechanism computes an attention matrix as follows:
a = softmax(w · y + b)  (8)
where w is the parameter to be learned, b is the bias, and y is the output of the multi-stream block.
Different degrees of attention are then assigned to each region according to the attention matrix to obtain new visual features, as follows:
[Formula (9): attention-weighted combination of the feature maps F_j with the attention weights W_A and the down-sampling d(·)]
where ⊙ denotes element-wise multiplication of the two matrices, d(·) denotes the down-sampling operation, W_A is the weight matrix of the attention unit, F_j is the j-th channel feature map of the input X, and N is the number of feature maps.
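The soft-attention unit can be sketched as below: equation (8) is applied to a pooled summary of the multi-stream features, and since the exact combination rule of equation (9) is not reproduced in the source, a simple channel-wise re-weighting is used here as an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Attention matrix a = softmax(w * y + b) over channels, used to
    re-weight the feature maps coming from the multi-stream block."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)    # learnable w and b of eq. (8)

    def forward(self, y):                          # y: (B, C, H, W)
        summary = y.mean(dim=(2, 3))               # per-channel summary of y
        a = F.softmax(self.fc(summary), dim=1)     # attention matrix, eq. (8)
        return y * a[:, :, None, None]             # re-weighted visual features
```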
2) Transformer encoder
In this method, a shallow Transformer encoder is selected as the global attention module; based on the visual features extracted by the shallow convolutional neural network, it models the relationship between the frames of the sequence and uses global attention over inter-frame similarity to model the dependencies between frames.
the self-attention module in the Transformer encoder models different positions of a single sequence, and Q, K, V three layers are used for representing the given sequence; wherein the self-attention function is used to map a query (Q layer) and a set of key value (K layer and V layer) pairs to an output, the function is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V  (10)
where Q, K and V denote the Q layer, K layer and V layer, respectively, and d_k is the dimension of the key vectors;
further, for an input sequence, it is often necessary to split the input sequence into a plurality of sequences and input the sequences into a plurality of heads for processing, and finally, an MHA (multi-head-self-attack) function performs the following processing on the plurality of heads:
MHA(Q, K, V) = Concat(head_0, ..., head_z) W^O,
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)  (11)
where W^O, W_i^Q, W_i^K and W_i^V are parameter matrices.
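A minimal sketch of the two-layer Transformer encoder on top of the per-frame features, using PyTorch's built-in multi-head self-attention; the model width, number of heads and feed-forward size are assumptions not stated in the source.

```python
import torch
import torch.nn as nn

d_model = 256                                   # assumed per-frame feature width
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=512, batch_first=True)
temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

frame_features = torch.randn(8, 32, d_model)    # (batch, frames, features)
encoded = temporal_encoder(frame_features)      # same shape, inter-frame context added
```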
Step 3: network model training and micro-expression detection
1) Network model training
The extracted optical-flow sequence is fed into the feature analysis model to obtain the prediction outputs of the two fully connected layers that follow the Transformer encoder. The output p of the first fully connected layer predicts whether the input optical-flow sequence contains a macro-expression or a micro-expression (binary output); the second fully connected layer outputs the regression information of the optical-flow sequence, which contains the offset of the micro-expression relative to the temporal position of the current optical-flow sequence and the length of the interval in frames.
A loss function is constructed from the prediction outputs of the fully connected layers and consists of three parts. The first loss term is determined by comparing the macro/micro-expression classification result with the ground-truth annotation:
Loss_ce = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]  (12)
where y_i is the label of example i and p_i is the predicted probability for i. The second and third loss terms are determined by comparing the regression information of the output optical-flow sequence with the ground-truth annotation; these losses are given by equations (13) and (14):
[Formula (13): regression loss on the temporal offset]
where d̂_i and d_i denote the estimated value and the ground-truth value of the offset, respectively;
[Formula (14): regression loss on the interval length]
where l̂_i and l_i denote the estimated value and the ground-truth value of the interval length, respectively.
The final loss function is obtained from the first, second and third loss terms:
Loss = λ_1 Loss_ce + λ_2 Loss_o + λ_3 Loss_l  (15)
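As a sketch of equation (15), assuming binary cross-entropy for the classification term and smooth L1 for the two regression terms (the source gives the structure of the loss but not the exact regression formulas), with the weighting coefficients left as an assumed parameter:

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()      # classification term, Loss_ce
reg = nn.SmoothL1Loss()           # assumed form of the regression terms

def total_loss(p_logit, y, d_hat, d, l_hat, l, lambdas=(1.0, 1.0, 0.5)):
    """Weighted sum of the three loss terms of equation (15); the lambda
    values are assumptions, not taken from the source."""
    loss_ce = bce(p_logit, y)     # macro/micro classification
    loss_o = reg(d_hat, d)        # temporal-offset regression, Loss_o
    loss_l = reg(l_hat, l)        # interval-length regression, Loss_l
    l1, l2, l3 = lambdas
    return l1 * loss_ce + l2 * loss_o + l3 * loss_l
```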
2) Micro-expression detection
During detection, the long video is divided into N segments and each segment is detected independently. In a long video this causes detection intervals to overlap, and several prediction boxes may repeatedly localize the same micro-expression interval. These repeated boxes are removed by non-maximum suppression (NMS) to obtain the best-matching micro-expression interval, in the following steps: 1) sort all detection intervals by confidence; 2) compare the interval with the highest confidence with the remaining intervals, and delete a current interval if its intersection-over-union with the highest-confidence interval exceeds a certain threshold; 3) repeat the same operation for the interval with the second-highest confidence, and so on, until the final detection intervals are obtained.
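A sketch of the interval NMS described above, operating on predicted (start, end) frame intervals and their confidences; the IoU threshold is an assumed parameter.

```python
import numpy as np

def temporal_nms(intervals, scores, iou_thr=0.5):
    """Keep the best-matching intervals: sort by confidence, then repeatedly
    suppress intervals whose overlap with the current best exceeds iou_thr."""
    intervals = np.asarray(intervals, dtype=float)   # shape (M, 2): (start, end)
    order = np.argsort(scores)[::-1]                 # step 1: sort by confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        s1, e1 = intervals[best]
        rest = order[1:]
        s2, e2 = intervals[rest, 0], intervals[rest, 1]
        inter = np.maximum(0.0, np.minimum(e1, e2) - np.maximum(s1, s2))
        union = (e1 - s1) + (e2 - s2) - inter
        iou = inter / np.maximum(union, 1e-8)        # step 2: overlap with the best
        order = rest[iou <= iou_thr]                 # step 3: drop overlaps, repeat
    return keep
```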
Advantageous effects:
the effectiveness of the invention is measured by three evaluation indexes of F1 score (F1-score), Precision (Precision) and recall (call). In two public data sets SAMM and CAS (ME)2The proposed method was evaluated and shown to be competitive. The method has the advantages of simple network input, shallow network structure and reduced calculated amount.
Description of the drawings:
FIG. 1 is a diagram of a shallow network-based micro-expression detection framework according to the present invention
FIG. 2 is an example optical flow diagram extracted from macro-expressions and micro-expressions
FIG. 3 is a diagram of a shallow CNN model according to the present invention
FIG. 4 is a diagram of a Transformer model
Detailed description of embodiments:
step 1: data preprocessing and feature extraction
The method preprocesses the face image sequence to reduce the influence of factors such as face size and position, and extracts optical-flow motion features from the processed face sequence as the input of the subsequent feature analysis model.
1) Extracting faces from the video sequence
The spatial position of the face is extracted from the micro-expression long video, and the whole video sequence is cropped with a fixed position box to obtain a fixed face region, in the following three steps: locate the face in the first frame with the OpenCV toolbox to obtain a spatial position box; further refine the position box and denoise (image downsampling); crop the remaining images of the video sequence with the position box.
During video preprocessing, the OpenCV toolbox is first used to obtain the face position coordinates (x_0, y_0) and (x_1, y_1), which denote the top-left and bottom-right corners of the face position box, respectively. The detected box still contains noise such as headphones and hair and cannot be fed directly into the feature analysis model, so it is cropped further. From the initial face coordinates (x_0, y_0), (x_1, y_1), the cropped coordinates (x'_0, y'_0), (x'_1, y'_1) are obtained as follows:
(x'_0, y'_0) = (x_0 + a, y_0 - b)  (16)
(x'_1, y'_1) = (x_1 - a, y_1 + b)  (17)
where a is the horizontal cropping distance and b is the vertical cropping distance.
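The first step above, locating the face on the reference frame with OpenCV, might look like the sketch below; the source only says that an OpenCV toolbox is used, so the Haar-cascade detector here is an illustrative choice rather than the detector of the invention.

```python
import cv2

def detect_face_box(first_frame):
    """Return (x0, y0, x1, y1) for the largest face found in the reference frame."""
    gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # pick the largest detection
    return x, y, x + w, y + h
```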
The video sequence is then cropped uniformly: the first frame of each micro-expression long video is taken as the reference frame, the face image in the reference frame is taken as the model face, the cropping matrix of the model face is obtained, and the remaining images of the video sequence are cropped with this matrix to complete face extraction.
To ensure that the video sequences fed into the feature analysis model have the same length, the micro-expression long video sequences are temporally normalized, so that sequences of different lengths become the same length.
2) Optical flow motion feature extraction
Optical-flow motion features are extracted from the preprocessed face image sequence by an optical-flow method. An optical-flow map is a two-dimensional vector field that reflects the trend of change of the gray value at each point of the image. The algorithm uses the temporal pixel changes in the video sequence to find the correspondence between the current frame and a previous frame, and computes the spatial motion information between adjacent frames.
In the preprocessed face image sequence, assume that the gray value at pixel (x, y) at time t is I(x, y, t) and that after Δt the pixel moves to (x + Δx, y + Δy). Conservation of gray value gives:
I(x, y, t) = I(x + Δx, y + Δy, t + Δt)  (18)
Expanding the right-hand side with the Taylor formula gives:
I(x + Δx, y + Δy, t + Δt) = I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt + β  (19)
where β denotes the second- and higher-order terms; neglecting these higher-order terms yields:
I_x u + I_y v + I_t = 0  (20)
wherein:
u = dx/dt,  v = dy/dt,  I_x = ∂I/∂x,  I_y = ∂I/∂y,  I_t = ∂I/∂t  (21)
I_x, I_y and I_t denote the partial derivatives of the pixel gray value along the x, y and t directions of the image, respectively, and (u, v) is the velocity vector of the optical flow in the horizontal and vertical directions. (u, v) forms the first two dimensions of the optical-flow image; the result of further processing (u, v) with equation (22) is used as the third dimension of the optical-flow features, so that the feature vector (optical-flow map) can be organized like an RGB image:
[Formula (22): third optical-flow channel computed from (u, v)]
before extracting optical flow features, time normalization is required to be performed on a video sequence, so that when extracting optical flow, k values between a current frame and a previous frame and optical flow intervals between frames need to be adjusted.
Step 2: feature analysis model construction
The invention provides a micro-expression detection model that combines a shallow convolutional neural network with a Transformer encoder. The shallow convolutional neural network (CNN model) extracts visual features from each single-frame optical-flow image, the extracted visual features are fed into a two-layer Transformer encoder that models the relationship between the frames of the sequence, and finally two different fully connected layers implement two subtasks: recognition and localization, where recognition is constructed as an auxiliary task that supports the main task.
1) CNN model
Because samples for the micro-expression spotting task are limited, a shallow structure is adopted to complete the localization task. This sub-network consists of two blocks: a multi-stream block and an embedding block. The multi-stream block contains three streams with different parameters, representing different feature extraction methods; the convolution filters of the three streams perform dilated convolution with different filter sizes. The embedding block introduces a soft attention mechanism, performs several densely connected convolution operations and, through parameter sharing, one embedding block is treated as several convolution layers. In each convolution block, batch normalization and a rectified linear unit (ReLU) apply a non-linear operation to the depth features; the features of the convolution layers of the downstream modules are concatenated, an adaptive average pooling layer is applied, and the pooled output is flattened into a feature vector.
The embedding block uses an attention module. The core idea of the attention mechanism is similar to human attention: when observing a picture, a person focuses on the region of interest and ignores the other background information in the image. Since the input of the shallow network is an optical-flow map, which contains only a small amount of the information needed for micro-expression detection, the attention mechanism is intended to make the network focus on exactly those features.
After the visual features extracted by the multi-stream block are fed into the embedding block, the attention mechanism computes an attention matrix as follows:
a = softmax(w · y + b)  (23)
where w is the parameter to be learned, b is the bias, and y is the output of the multi-stream block.
Different degrees of attention are then assigned to each region according to the attention matrix to obtain new visual features, as follows:
[Formula (24): attention-weighted combination of the feature maps F_j with the attention weights W_A and the down-sampling d(·)]
where ⊙ denotes element-wise multiplication of the two matrices, d(·) denotes the down-sampling operation, W_A is the weight matrix of the attention unit, F_j is the j-th channel feature map of the input X, and N is the number of feature maps.
2) Transformer encoder
In this method, a shallow Transformer encoder is selected as the global attention module; based on the visual features extracted by the shallow convolutional neural network, it models the relationship between the frames of the sequence and uses global attention over inter-frame similarity to model the dependencies between frames.
the self-attention module in the Transformer encoder models different positions of a single sequence, and Q, K, V three layers are used for representing the given sequence; wherein the self-attention function is used to map a query (Q layer) and a set of key value (K layer and V layer) pairs to an output, the function is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V  (25)
where Q, K and V denote the Q layer, K layer and V layer, respectively, and d_k is the dimension of the key vectors;
further, for an input sequence, it is often necessary to split the input sequence into a plurality of sequences and input the sequences into a plurality of heads for processing, and finally, an MHA (multi-head-self-attack) function performs the following processing on the plurality of heads:
MHA(Q, K, V) = Concat(head_0, ..., head_z) W^O,
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)  (26)
where W^O, W_i^Q, W_i^K and W_i^V are parameter matrices.
Step 3: network model training and micro-expression detection
1) Network model training
The extracted optical-flow sequence is fed into the feature analysis model to obtain the prediction outputs of the two fully connected layers that follow the Transformer encoder. The output p of the first fully connected layer predicts whether the input optical-flow sequence contains a macro-expression or a micro-expression (binary output); the second fully connected layer outputs the regression information of the optical-flow sequence, which contains the offset of the micro-expression relative to the temporal position of the current optical-flow sequence and the length of the interval in frames.
A loss function is constructed from the prediction outputs of the fully connected layers and consists of three parts. The first loss term is determined by comparing the macro/micro-expression classification result with the ground-truth annotation:
Loss_ce = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]  (27)
where y_i is the label of example i and p_i is the predicted probability for i;
The second and third loss terms are determined by comparing the regression information of the output optical-flow sequence with the ground-truth annotation; these losses are given by equations (28) and (29):
[Formula (28): regression loss on the temporal offset]
where d̂_i and d_i denote the estimated value and the ground-truth value of the offset, respectively;
[Formula (29): regression loss on the interval length]
where l̂_i and l_i denote the estimated value and the ground-truth value of the interval length, respectively. The final loss function is obtained from the first, second and third loss terms:
Loss = λ_1 Loss_ce + λ_2 Loss_o + λ_3 Loss_l  (30)
the three linear coefficients are set to 1.1,0.5, respectively.
2) Micro-expression detection
During detection, the long video is divided into N segments and each segment is detected independently. In a long video this causes detection intervals to overlap, and several prediction boxes may repeatedly localize the same micro-expression interval. These repeated boxes are removed by non-maximum suppression (NMS) to obtain the best-matching micro-expression interval, in the following steps: 1) sort all detection intervals by confidence; 2) compare the interval with the highest confidence with the remaining intervals, and delete a current interval if its intersection-over-union with the highest-confidence interval exceeds a certain threshold; 3) repeat the same operation for the interval with the second-highest confidence, and so on, until the final detection intervals are obtained.

Claims (1)

1. A long video micro-expression detection method based on a shallow network, mainly comprising the following three parts: a data preprocessing mode and optical-flow features, a feature analysis model, and model training and detection.
(1) Data preprocessing mode and optical-flow features
The spatial position of the face is extracted from the micro-expression long video, and the whole video sequence is cropped with a fixed position box to obtain a fixed face region: locate the face in the first frame to obtain a spatial position box; further refine the position box and denoise (image downsampling); crop the remaining images of the video sequence with the position box.
After the face position coordinates are obtained, further cropping is needed. From the initial face coordinates (x_0, y_0), (x_1, y_1), the cropped coordinates (x'_0, y'_0), (x'_1, y'_1) are obtained as follows:
(x'_0, y'_0) = (x_0 + a, y_0 - b)  (1)
(x'_1, y'_1) = (x_1 - a, y_1 + b)  (2)
where a is the horizontal cropping distance and b is the vertical cropping distance. The first frame of each micro-expression long video is taken as the reference frame, the face image in the reference frame is taken as the model face, the cropping matrix of the model face is obtained, and the remaining images of the video sequence are cropped with this matrix to complete face extraction.
Optical-flow motion features are extracted from the preprocessed face image sequence: the temporal pixel changes in the video sequence are used to find the correspondence between the current frame and a previous frame and to compute the spatial motion information between adjacent frames; using conservation of light intensity before and after pixel motion and a Taylor-expansion simplification, the extracted optical flow is obtained and denoted (u, v), which is further processed by equation (3) to form a three-dimensional composite optical-flow feature map:
[Formula (3): third optical-flow channel computed from (u, v)]
Before extracting the optical-flow features, the video sequence is time-normalized; therefore, when extracting the optical flow, the value k between the current frame and the preceding frame is adjusted to set the inter-frame optical-flow interval.
(2) Feature analysis model
The invention provides a novel shallow network structure for further extracting the motion features of micro-expressions. The network is built from a new shallow convolutional network and a shallow Transformer encoder.
The shallow convolutional network consists of two modules: a multi-stream module and an embedding module. The multi-stream module contains three streams with different parameters, representing different feature extraction methods; the convolution filters of the three streams perform dilated convolution with different filter sizes. The embedding block introduces a soft attention mechanism, performs several densely connected convolution operations and, through parameter sharing, one embedding block is treated as several convolution layers. In each convolution block, batch normalization and a rectified linear unit (ReLU) apply a non-linear operation to the depth features; the features of the convolution layers of the downstream modules are concatenated, an adaptive average pooling layer is applied, and the pooled output is flattened into a feature vector.
The shallow Transformer encoder serves as a global attention module: it integrates the visual features extracted by the shallow convolutional neural network, models the relationship between the frames of the sequence, and models the dependencies between frames with global attention over inter-frame similarity. The self-attention module in the encoder models the different positions of a single sequence, representing the given sequence with three layers Q, K and V and mapping a query (Q layer) and a set of key-value pairs (K layer and V layer) to the output. The prediction outputs of the two fully connected layers of the encoder are: the output p of the first fully connected layer, which predicts whether the input optical-flow sequence contains a macro-expression or a micro-expression (binary output); and the regression information of the optical-flow sequence output by the second fully connected layer, which contains the offset of the micro-expression relative to the temporal position of the current optical-flow sequence and the length of the interval in frames.
(3) Model training and detection
The invention combines several loss functions to train the model; the loss function consists of three parts. The first loss term is determined by comparing the macro/micro-expression classification result with the ground-truth annotation:
Loss_ce = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
where y_i is the label of example i and p_i is the predicted probability for i;
The second and third loss terms are determined by comparing the regression information of the output optical-flow sequence with the ground-truth annotation; these losses are given by formula (4) and formula (5):
[Formula (4): regression loss on the temporal offset]
where d̂_i and d_i denote the estimated value and the ground-truth value of the offset, respectively;
[Formula (5): regression loss on the interval length]
where l̂_i and l_i denote the estimated value and the ground-truth value of the interval length, respectively.
The final loss function is obtained from the first, second and third loss terms:
Loss = λ_1 Loss_ce + λ_2 Loss_o + λ_3 Loss_l  (6)
the parameters of the model are learned using back propagation algorithms using the loss function in equation (6).
During detection, the long video is divided into N segments and each video segment is detected independently with the depth model. Detection intervals overlap in a long video, and several prediction boxes may repeatedly localize the same micro-expression interval; these repeated boxes are removed by non-maximum suppression (NMS) to obtain the best-matching micro-expression interval: 1) sort all detection intervals by confidence; 2) compare the interval with the highest confidence with the remaining intervals, and delete a current interval if its intersection-over-union with the highest-confidence interval exceeds a certain threshold; 3) repeat the same operation for the interval with the second-highest confidence, and so on, until the final detection intervals are obtained.
CN202210075626.3A 2022-01-22 2022-01-22 Long video micro expression detection method based on shallow network Active CN114639136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210075626.3A CN114639136B (en) 2022-01-22 2022-01-22 Long video micro expression detection method based on shallow network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210075626.3A CN114639136B (en) 2022-01-22 2022-01-22 Long video micro expression detection method based on shallow network

Publications (2)

Publication Number Publication Date
CN114639136A true CN114639136A (en) 2022-06-17
CN114639136B CN114639136B (en) 2024-03-08

Family

ID=81946013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210075626.3A Active CN114639136B (en) 2022-01-22 2022-01-22 Long video micro expression detection method based on shallow network

Country Status (1)

Country Link
CN (1) CN114639136B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842539A (en) * 2022-05-30 2022-08-02 山东大学 Micro-expression discovery method and system based on attention mechanism and one-dimensional convolution sliding window
CN116091956A (en) * 2022-09-08 2023-05-09 北京中关村科金技术有限公司 Video-based micro-expression recognition method, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001241A (en) * 2020-07-16 2020-11-27 山东大学 Micro-expression identification method and system based on channel attention mechanism
WO2021042547A1 (en) * 2019-09-04 2021-03-11 平安科技(深圳)有限公司 Behavior identification method, device and computer-readable storage medium
CN112560812A (en) * 2021-02-19 2021-03-26 中国科学院自动化研究所 Micro-expression recognition method based on fusion depth features
CN113496217A (en) * 2021-07-08 2021-10-12 河北工业大学 Method for identifying human face micro expression in video image sequence
WO2021259005A1 (en) * 2020-06-23 2021-12-30 平安科技(深圳)有限公司 Video-based micro-expression recognition method and apparatus, computer device, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021042547A1 (en) * 2019-09-04 2021-03-11 平安科技(深圳)有限公司 Behavior identification method, device and computer-readable storage medium
WO2021259005A1 (en) * 2020-06-23 2021-12-30 平安科技(深圳)有限公司 Video-based micro-expression recognition method and apparatus, computer device, and storage medium
CN112001241A (en) * 2020-07-16 2020-11-27 山东大学 Micro-expression identification method and system based on channel attention mechanism
CN112560812A (en) * 2021-02-19 2021-03-26 中国科学院自动化研究所 Micro-expression recognition method based on fusion depth features
CN113496217A (en) * 2021-07-08 2021-10-12 河北工业大学 Method for identifying human face micro expression in video image sequence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
牛斌; 张钥迪; 马利: "Application of interpretable spatio-temporal convolutional networks to micro-expression recognition", Journal of Liaoning University (Natural Science Edition), no. 02, 15 May 2020 (2020-05-15) *
王晓华; 潘丽娟; 彭穆子; 胡敏; 金春花; 任福继: "Facial expression recognition in video sequences based on a hierarchical attention model", Journal of Computer-Aided Design & Computer Graphics, no. 01, 15 January 2020 (2020-01-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842539A (en) * 2022-05-30 2022-08-02 山东大学 Micro-expression discovery method and system based on attention mechanism and one-dimensional convolution sliding window
CN116091956A (en) * 2022-09-08 2023-05-09 北京中关村科金技术有限公司 Video-based micro-expression recognition method, device and storage medium

Also Published As

Publication number Publication date
CN114639136B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111310659B (en) Human body action recognition method based on enhanced graph convolution neural network
CN114639136A (en) Long video micro-expression detection method based on shallow network
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN111598928B (en) Abrupt motion target tracking method based on semantic evaluation and region suggestion
CN108830170B (en) End-to-end target tracking method based on layered feature representation
CN112489081A (en) Visual target tracking method and device
CN112288778B (en) Infrared small target detection method based on multi-frame regression depth network
CN113706581A (en) Target tracking method based on residual channel attention and multilevel classification regression
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
CN114972426A (en) Single-target tracking method based on attention and convolution
CN113901922A (en) Hidden representation decoupling network-based occluded pedestrian re-identification method and system
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
CN111242003B (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN113033283A (en) Improved video classification system
CN116311518A (en) Hierarchical character interaction detection method based on human interaction intention information
CN116630641A (en) Long-time target tracking method based on attention mechanism
CN116051601A (en) Depth space-time associated video target tracking method and system
CN110580712A (en) Improved CFNet video target tracking method using motion information and time sequence information
CN116188555A (en) Monocular indoor depth estimation algorithm based on depth network and motion information
CN114581485A (en) Target tracking method based on language modeling pattern twin network
CN107016675A (en) A kind of unsupervised methods of video segmentation learnt based on non local space-time characteristic
Bharathi et al. Texture feature extraction of infrared river ice images using second-order spatial statistics
Chen et al. The obstacles detection for outdoor robot based on computer vision in deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant