CN114005157B

CN114005157B - Micro-expression recognition method for pixel displacement vector based on convolutional neural network

Info

Publication number: CN114005157B
Application number: CN202111204917.XA
Authority: CN
Inventors: 何双江; 项金桥; 董喆; 方博; 鄢浩; 赵俭辉; 赵慧娟; 翟芷君
Original assignee: Hubei Provincial People's Procuratorate; Wuhan Fiberhome Information Integration Technologies Co ltd
Current assignee: Hubei Provincial People's Procuratorate; Wuhan Fiberhome Information Integration Technologies Co ltd
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2024-05-10
Anticipated expiration: 2041-10-15
Also published as: CN114005157A

Abstract

The invention provides a pixel displacement vector micro-expression recognition method based on a convolutional neural network, which comprises the steps of establishing an end-to-end micro-expression recognition network based on a pixel displacement generation module, and processing flow based on the micro-expression recognition network: selecting a maximum frame, wherein a certain frame before and after the original maximum frame is randomly selected as a maximum frame image in the training process; inputting the selected maximum frame image and the initial frame image into a pixel displacement generating module together, and outputting a pixel displacement vector feature map between the two images; calculating a correlation loss function, which comprises the steps of firstly sampling the generated displacement vector feature image to obtain a displacement feature image, then sampling to generate an approximate maximum frame image, and calculating reconstruction loss and regular loss; normalizing operation, including normalizing the generated pixel displacement vector feature map; and performing feature learning and microexpressive classification, namely connecting the largest frame image with the normalized pixel displacement vector feature image, and inputting the connected image into a classification network to obtain a classification prediction result.

Description

Micro-expression recognition method for pixel displacement vector based on convolutional neural network

Technical Field

The invention belongs to the technical field of micro-expression recognition, and relates to a micro-expression recognition technology based on dynamic feature representation.

Background

Currently, the mainstream deep learning methods for micro-expression recognition are divided into two main categories:

The first major category is to sequentially perform feature extraction on each frame in an image sequence and input the feature extraction into a time-series neural network, and learn spatial distribution and time-varying features at the same time. As a ELRCN network (document 1) has been proposed in recent years, experimental results indicate that temporal and spatial features play different roles in microexpressive recognition, and that good recognition effects depend on the effective combination of both.

The second major category extracts the variation characteristics of the whole expression sequence as a characteristic map, and the characteristic map is directly input into a classification network for prediction, and is generally classified by utilizing variation difference characteristics between a starting frame and a maximum frame of a micro expression segment. The feature extraction method is continuously improved, and LBP-TOP (document 2) is widely used in the early stage to extract the spatiotemporal variation features of micro expressions and is used as a reference method in the field. Based on this approach, a series of LBP variants have also been proposed one by one to improve the quality and robustness of the extracted features. Later gradually replaced by Optical flow (literature 3), optical flow estimates the change of the object position between two frames, characterizes the direction and the size of the image pixel movement, and can extract the inter-frame object movement information more robustly. Bi-WOOF (document 4) calculates Optical strain as a supplement on the basis of Optical flow. In addition, the method for extracting the change characteristics of the micro-expression segment also includes DYNAMIC IMAGING (document 5) method for the motion recognition field, which compresses a picture sequence into an RGB image, wherein the RGB image contains the spatial characteristics and the time dynamic characteristics of the whole image sequence.

However, the extraction of the change characteristics of the expression sequence is realized in the preprocessing process of training at present, is limited to the respective processing process, is not fused with a deep learning network for classification, cannot adjust the generated dynamic characteristics according to the feedback of the classification effect, and lacks sufficient flexibility and adaptability.

Related literature:

[ literature ] 1】H.Khor,J.See,R.C.Phan,W.Lin,"Enriched Long-term Recurrent Convolutional Network for Facial Micro-Expression Recognition,"Proceedings of the 2018International Conference on Automatic Face&Gesture Recognition(FG),2018,pp.667–674.

[ Literature ] 2】G.Zhao,M.Pietikainen,"Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions,"Pattern Analysis and Machine Intelligence,IEEE Transactions,2009,pp.915–928.

[ Document 3 ] D.Fleet, Y.Weiss, "Optical Flow Estimation," Springer US,2006.

[ Literature ] 4】Liong ST,See J,Wong K,Phan RC,"Less is more:Micro-expression recognition from video using apex frame,"Signal Processing:Image Communication,2018,pp.62:82–92.

[ Literature ] 5】H.Bilen,B.Fernando,E.Gavves,A.Vedaldiand S.Gould,"Dynamic image networks for action recognition,"In Proc.IEEE Int.Conf.Comput.Vis.Pattern Recognit,2016,pp.3034–3042.

Disclosure of Invention

Aiming at the defects of the existing micro-expression recognition method, the invention provides an end-to-end micro-expression recognition network based on a pixel displacement generation module based on deep learning, and more spaces capable of being automatically adjusted according to data are provided for the displacement feature extraction and expression recognition classification module so as to increase the overall fitness of the model.

The technical proposal of the invention is a micro-expression recognition method based on a pixel displacement vector of a convolutional neural network, which establishes an end-to-end micro-expression recognition network based on a pixel displacement generation module, the processing flow based on the micro-expression recognition network comprises the following steps,

Selecting a maximum frame, wherein a certain frame before and after the original maximum frame is randomly selected as a maximum frame image in the training process;

Generating a pixel displacement vector feature map, which comprises inputting a selected maximum frame image and a starting frame image into a pixel displacement generating module, and outputting the pixel displacement vector feature map between the two images through the learning and feature fusion of each convolution layer;

Calculating a correlation loss function, which comprises the steps of firstly carrying out bilinear interpolation up-sampling on the generated displacement vector feature image to obtain a displacement feature image with the same size as a maximum frame, then carrying out sampling on an original initial frame image according to the displacement feature image to generate an approximate maximum frame image, and calculating reconstruction loss and regular loss according to the generated approximate maximum frame image and the original selected maximum frame image;

Normalizing operation, including normalizing the generated pixel displacement vector feature map;

and performing feature learning and microexpressive classification, namely connecting the previously selected maximum frame image with the normalized pixel displacement vector feature image, and inputting the connected maximum frame image and normalized pixel displacement vector feature image into a classification network to obtain a classification prediction result.

In the training process, the selection of the maximum frame is realized through a randomization process, a certain frame in a certain range before and after the original maximum frame is randomly selected, and the image pair actually used for training is increased; if in the verification or test stage, directly adopting the original maximum frame image;

the generated pixel displacement vector features are normalized before being input into the classification network, and each displacement vector feature map is divided by the average value of the first several values of the absolute values of each displacement vector feature map.

Moreover, for the generated pixel displacement vector feature map, the loss function thereof includes a reconstruction loss between the original maximum frame and the maximum frame reconstructed from the start frame and the displacement vector, and an L1 canonical loss calculated for the displacement vector feature map itself.

And the selected maximum frame image and the generated pixel displacement vector feature image are input into a classification network together for learning, and after a classification prediction result is obtained, the classification loss is calculated according to the requirement so as to use the related evaluation index.

Compared with the prior art, the invention has the following advantages and positive effects:

(1) The pixel displacement generation module provided by the invention can be combined with a classification network to perform end-to-end unified training, and classification loss can be reversely transmitted to the pixel displacement generation module, so that the pixel displacement generation module automatically adjusts parameters according to classification effects to generate displacement characteristics easier to classify, and meanwhile, the overall model also has higher fitness.

(2) The random maximum frame selection operation provided by the invention can increase the image pair actually used for training, enhance the robustness of the network and the sensitivity to slight change, and improve the generation and classification effects of displacement characteristics.

(3) The normalization operation provided by the invention is equivalent to reducing the expression displacement with larger amplitude, amplifying the expression displacement with smaller amplitude, playing a role in self-adaptive expression amplitude adjustment, reducing the influence of amplitude difference between different image pairs on the classification network, and enabling the classification network to be easier to learn.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate and explain the application and, together with the description, serve to explain the principles of the application.

Fig. 1 is a schematic diagram of an overall structure of an end-to-end micro-expression recognition network based on a pixel displacement generation module according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a pixel displacement generating module according to an embodiment of the present invention;

Detailed Description

The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description is presented by way of illustration or example only, and is not intended to limit the invention.

The invention discloses a full convolution pixel Displacement Generation Module (DGM) which is used for generating a pixel displacement vector feature map (displacement) between two frames to replace dynamic features generated by a traditional Optical Flow or DYNAMIC IMAGING method. The module is combined with the existing LEARNet classification network to form an end-to-end micro-expression recognition model. We also disclose a randomization operation on the largest frame to increase training sample pairs and increase the sensitivity of the network to fine expression changes. The invention directly takes a starting frame in the expression sequence and a maximum frame selected by randomizing operation as input, generates a pixel displacement vector feature map by using a pixel displacement generating module, performs normalization processing on the feature map, and then is connected with a maximum frame image to be input to LEARNet classification network for learning and prediction. The model can back-propagate the gradient of the classification loss to the DGM, so that the gradient can adjust parameters according to classification results, and displacement features easier to classify are generated.

As shown in fig. 1, an embodiment of the present invention provides an end-to-end micro-expression recognition method based on a pixel displacement generating Module, and the end-to-end micro-expression recognition network based on the pixel displacement generating Module provided by the present invention includes two parts of neural networks for generating pixel displacement vector features and for learning and classifying, the network model performs end-to-end training, and dynamic features of micro-expressions are represented by pixel displacement vectors between a start frame and a maximum frame generated by a convolution Module DGM (i.e., a pixel displacement generating Module DISPLACEMENT GENERATING Module shown in fig. 2).

In an embodiment, the main flow based on the micro-expression recognition network is as follows:

(1) Selecting the maximum frame: in the training process, a randomization process is needed to select the maximum frame, a certain frame in a certain range before and after the original maximum frame is randomly selected as the maximum frame image, and the image pair actually used for training is increased. If in the verification or test stage, the original maximum frame image is directly adopted.

Let the initial frame index be I, the maximum frame index be j, and the index of the last frame in the image sequence be I _last, then the randomizing selected frame index I _select is calculated by the following formula:

I_select＝random[MAX(i+1,round(j-(j-i)*0.2)),MIN(round(j+(j-i)*0.2),I_max)]

Where MAX represents selecting the larger of the two and MIN represents selecting the smaller of the two to ensure that the selected largest frame is after the start frame and does not exceed the last frame of the sequence. random () represents any integer within the randomly chosen interval.

(2) Generating a pixel displacement vector feature map: the selected maximum frame image and the initial frame image are input into a pixel displacement generating module together, and a pixel displacement vector feature diagram between the two images is output through the learning and feature fusion of each convolution layer.

(3) Calculating a correlation loss function: the generated displacement vector feature image is subjected to bilinear interpolation up-sampling to obtain the displacement feature image with the same size as the maximum frame, and then the original initial frame image is sampled according to the displacement feature image to generate an approximate maximum frame image. If the value of the displacement vector is not an integer, bilinear interpolation is adopted to calculate the pixel value of the corresponding point. Then, based on the generated approximate maximum frame image and the original selected maximum frame image, the L _rec reconstruction loss and the L ₁ regular loss are calculated.

(4) Normalization operation: and normalizing the generated pixel displacement vector feature map. Let the network generated displacement feature map be I _f, the M (I, n) function represents taking the average of the first n numbers of the image I, and the normalized image I _n can be obtained by the following formula:

Wherein the comparison with 0.0001 is to avoid zero errors.

(5) Feature learning and micro expression classification are carried out: and connecting the previously selected maximum frame image with the normalized pixel displacement vector feature image (concat), inputting the maximum frame image and the normalized pixel displacement vector feature image into a classification network to obtain a classification prediction result (micro expressions are divided into three types of negative, positive and surprise) and calculating the classification Loss Softmax Loss, the evaluation indexes such as UF1 and UAR according to the requirement.

The invention can be considered to provide an end-to-end micro-expression recognition model based on a pixel displacement generation module, which comprises a displacement vector feature generation module, a randomization processing module, a normalization processing module and a classification network module. Wherein:

The displacement vector feature generation module takes a starting frame in the expression sequence and a maximum frame selected by randomization operation as inputs, and a convolutional neural network provided by the pixel displacement generation module generates a pixel displacement vector feature map (DISPLACEMENTS) between the two frames to replace a dynamic image generated by a traditional Optical flow or DYNAMIC IMAGING method. The pixel displacement vector feature represents the displacement of each pixel of the maximum frame image on the basis of the initial frame image, the range of values is between (-1, 1), and in order to make the network concentrate on the features around each pixel point, the generated pixel displacement vector feature map is multiplied by a scaling factor alpha epsilon (0, 1) to limit the range between (-alpha, alpha). The correlation loss function includes: reconstruction loss between the original maximum frame and the maximum frame reconstructed from the start frame and the displacement vector; l1 canonical loss calculated on the displacement vector feature map itself.

In the randomization processing module, because different image sequences among data sets and in the data sets have different expression amplitudes, in order to make a network for generating displacement vector characteristic images more robust, a certain frame in a certain range before and after the original maximum frame is randomly selected as the maximum frame when data is loaded.

In the normalization processing module, in order to normalize the pixel displacement vector feature maps with different magnitudes, the invention divides each displacement vector feature map by the average value of the first n large values of the absolute values of the displacement vector feature maps. Averaging instead of maximum is to reduce the interference of larger noise points that may occur.

The classifying network can select different existing network structures, and the pixel displacement vector feature images obtained before are normalized and then input into the classifying network together with the selected maximum frame image for learning and prediction, so that the characteristics of time dimension and space dimension are maintained. According to the embodiment of the invention, LEARNet(Verma Monu,Vipparthi Santosh Kumar,Singh Girdhari,Murala Subrahmanyam,"LEARNet:Dynamic Imaging Network for Micro Expression Recognition,"IEEE Transactions on Image Processing,2019,pp.99.) is selected as the classification network, and compared with the classical ResNet and VGG structure, the network can retain more details and better learn and distinguish the characteristics of different expression categories.

As shown in fig. 2, the present invention provides a schematic structure diagram of a pixel displacement generation module, in which Conv, conv1, conv2, conv3, conv4, up, conv5, conv6 are sequentially set, and the Up output is connected (Concat) with the output of the Conv1 layer as the input of Conv 5. The module thus comprises two downsamples (implemented by Conv1 and Conv3 layers with stride 2) and one upsample (implemented by Up layer), the specific parameter configurations of the respective convolution layers being shown in the following table. Wherein each convolutional layer Conv, conv1, conv2, conv3, conv4, conv5 is followed by BN layer (batch normalization layer) and leak_ relu activation function layer, and the last Conv6 layer is followed by BN layer and Tanh activation function layer. The Up layer represents an Up-sampling layer using bilinear interpolation, the output of which is connected to the output of the Conv1 layer as input to the next layer.

For the input images with the width and the height of w and h respectively, the final output channel number is 2, and the pixel displacement vector characteristic diagrams with the width and the height of w/2 and h/2 are used for classification. Wherein the first channel represents displacement in the X direction and the second channel represents displacement in the Y direction. And meanwhile, carrying out bilinear interpolation up-sampling on the generated pixel displacement vector feature map to obtain a displacement feature map with width and height of w and h for calculating a loss function. The method comprises the steps of firstly carrying out grid sampling on an initial frame image according to an up-sampled displacement characteristic image to generate an approximate maximum frame image, then calculating L _rec loss of the approximate maximum frame and an original selected maximum frame, and simultaneously calculating L ₁ regular loss of the displacement characteristic image. The associated loss function is set as follows:

(1) Let the original starting frame image be designated as I _s, the selected largest frame image be designated as I _t, and T (I _s) represent the approximate largest frame image obtained by sampling the starting frame according to the displacement signature, then the L _rec reconstruction loss is calculated by:

L_rec＝||T(I_s)-I_t||₁

(2) To further refine the generated displacement features, let T _xy represent the pixel displacement vector at (x, y), calculate the L ₁ canonical loss for the pixel displacement vector feature map as shown in the following equation:

(3) The classification loss of the microexpressions uses classical Cross Entropy cross entropy loss, denoted as L _c, and the overall loss of the network is calculated as follows:

L＝w₁×L_c+w₂×L_rec+w₃×L₁

Where w ₁,w₂ and w ₃ are the weights of the L _c,L_rec and L ₁ loss functions, respectively. The three loss functions can be respectively and reversely propagated to the displacement generation module, and the weight coefficient is selected according to the magnitude difference, and the weight is set to enable the gradient to have higher magnitude than that of the L _c and the L ₁ because the module takes the reconstruction loss L _rec as the main loss. The examples preferably take the experimental value w ₁＝0.0001,w₂＝1000,w₃ =1.

The pixel displacement value is expressed as a percentage relative to the image width and height, assuming T _xy＝(Δ_x,Δ_y) as the pixel displacement vector at (x, y), it indicates that the pixel at (x, y) of the original starting frame has moved to (x+w×Δ _x,y+h×Δ_y) of the approximate maximum frame image. In order to make the network concentrate on the displacement characteristics around the pixels, multiplying the displacement characteristics by a scaling factor alpha E (0, 1) to obtain a final pixel displacement vector characteristic diagram in the range of [ -alpha, alpha ], namely limiting the actual X-direction displacement component size between [ -w X alpha, w X alpha ] and Y-direction displacement component size between [ -h X alpha, h X alpha ].

For the convenience of understanding the technical effects of the present invention, the following experimental results are attached:

TABLE 1 ablation experiment results on networks proposed by the patent

TABLE 2 UF1 and UAR results comparison of networks using conventional dynamic feature extraction methods and networks proposed by the patent

In particular, the method according to the technical solution of the present invention may be implemented by those skilled in the art using computer software technology to implement an automatic operation flow, and a system apparatus for implementing the method, such as a computer readable storage medium storing a corresponding computer program according to the technical solution of the present invention, and a computer device including the operation of the corresponding computer program, should also fall within the protection scope of the present invention.

In some possible embodiments, a micro-expression recognition system based on a pixel displacement vector of a convolutional neural network is provided, and the micro-expression recognition system comprises a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the stored instructions in the memory to execute a micro-expression recognition method based on the pixel displacement vector of the convolutional neural network.

In some possible embodiments, a micro-expression recognition system based on a pixel displacement vector of a convolutional neural network is provided, which comprises a readable storage medium, wherein a computer program is stored on the readable storage medium, and the computer program is executed to realize the micro-expression recognition method based on the pixel displacement vector of the convolutional neural network.

It should be understood that parts of the specification not specifically set forth herein are all prior art.

It should be understood that the foregoing description of the implementation examples of the current popular framework is not to be construed as limiting the scope of the invention, but that the appended claims are intended to cover all such alternatives and modifications as may be included within the scope of the invention as defined by the appended claims.

Claims

1. A micro-expression recognition method of pixel displacement vector based on convolutional neural network is characterized in that: establishing an end-to-end micro-expression recognition network based on a pixel displacement generation module, wherein the processing flow based on the micro-expression recognition network comprises the following steps of selecting a maximum frame, and randomly selecting a certain frame before and after the original maximum frame as a maximum frame image in a training process;

2. The method for identifying the microexpressions of the pixel displacement vectors based on the convolutional neural network according to claim 1, wherein the method comprises the following steps: in the training process, the selection of the maximum frame is realized through a randomization process, a certain frame in a certain range before and after the original maximum frame is randomly selected, and an image pair actually used for training is increased; if in the verification or test stage, the original maximum frame image is directly adopted.

3. The method for identifying the microexpressions of the pixel displacement vectors based on the convolutional neural network according to claim 1, wherein the method comprises the following steps: the generated pixel displacement vector features are normalized before being input into the classification network, and each displacement vector feature map is divided by the average value of the first several values of the absolute values of each displacement vector feature map.

4. The method for identifying the microexpressions of the pixel displacement vectors based on the convolutional neural network according to claim 1, wherein the method comprises the following steps: for the generated pixel displacement vector feature map, the loss function comprises reconstruction loss between the original maximum frame and the maximum frame reconstructed according to the initial frame and the displacement vector, and L1 regular loss calculated on the displacement vector feature map.

5. The micro-expression recognition method of the pixel displacement vector based on the convolutional neural network according to claim 1,2, 3 or 4, wherein: the selected maximum frame image and the generated pixel displacement vector feature image are input into a classification network together for learning, and after a classification prediction result is obtained, the classification loss is calculated according to the need so as to correlate with an evaluation index.