CN114005157A - Micro-expression recognition method of pixel displacement vector based on convolutional neural network - Google Patents


Info

Publication number
CN114005157A
CN114005157A (application CN202111204917.XA)
Authority
CN
China
Prior art keywords
displacement vector
maximum frame
pixel displacement
frame image
micro
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111204917.XA
Other languages
Chinese (zh)
Other versions
CN114005157B (en)
Inventor
何双江
项金桥
董喆
方博
鄢浩
赵俭辉
赵慧娟
翟芷君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Provincial People's Procuratorate
Wuhan Fiberhome Information Integration Technologies Co ltd
Original Assignee
Hubei Provincial People's Procuratorate
Wuhan Fiberhome Information Integration Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Provincial People's Procuratorate, Wuhan Fiberhome Information Integration Technologies Co ltd filed Critical Hubei Provincial People's Procuratorate
Priority to CN202111204917.XA priority Critical patent/CN114005157B/en
Publication of CN114005157A publication Critical patent/CN114005157A/en
Application granted granted Critical
Publication of CN114005157B publication Critical patent/CN114005157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a micro-expression recognition method based on pixel displacement vectors and a convolutional neural network. An end-to-end micro-expression recognition network is built around a pixel displacement generation module, and processing based on this network proceeds as follows: selecting a maximum frame, where during training a frame is randomly chosen from a range around the original maximum frame to serve as the maximum frame image; inputting the selected maximum frame image and the initial frame image together into the pixel displacement generation module, which outputs a pixel displacement vector feature map between the two images; calculating the associated loss functions, where the generated displacement vector feature map is upsampled to obtain a displacement feature map, an approximate maximum frame image is then generated by sampling, and the reconstruction loss and regularization loss are computed; normalizing the generated pixel displacement vector feature map; and performing feature learning and micro-expression classification, where the maximum frame image is concatenated with the normalized pixel displacement vector feature map and input into a classification network to obtain a classification prediction result.

Description

Micro-expression recognition method of pixel displacement vector based on convolutional neural network
Technical Field
The invention belongs to the technical field of micro-expression identification, and relates to a micro-expression identification technology based on dynamic feature representation.
Background
Currently, mainstream deep learning methods for micro-expression recognition are divided into two major categories:
the first category is to perform feature extraction on each frame in an image sequence in sequence and input the frame into a time-series neural network, and simultaneously learn spatial distribution and time variation features. As for the ELRCN network (document 1) proposed in recent years, experimental results thereof show that temporal and spatial features play different roles in micro-expression recognition, and a good recognition effect depends on an effective combination of the two.
The second major category extracts the variation features of the whole expression sequence into a single feature map and inputs that map directly into a classification network for prediction, typically classifying the variation between the initial frame and the maximum frame of the micro-expression segment. The feature extraction methods have steadily improved. Early work generally used LBP-TOP (document 2) to extract the spatio-temporal variation features of micro-expressions, and it serves as the reference method in the field; a series of LBP variants were subsequently proposed to improve the quality and robustness of the extracted features. LBP-TOP was later gradually replaced by optical flow (document 3), which estimates the change in object position between two frames, characterizes the direction and magnitude of pixel movement, and extracts inter-frame motion information more robustly. Bi-WOOF (document 4) additionally computes optical strain as a supplement to the optical flow. Beyond these, the dynamic imaging method (document 5) from the action recognition field has also been used to extract the variation features of micro-expression segments: it compresses a picture sequence into a single RGB image containing the spatial features and temporal dynamics of the whole sequence.
However, the extraction of variation features from the expression sequence is currently performed during training preprocessing, confined to its own processing pipeline and not fused with the deep learning network used for classification. The generated dynamic features therefore cannot be adjusted according to feedback from the classification results, and the approach lacks flexibility and adaptability.
The related documents are:
[Document 1] H. Khor, J. See, R. C.-W. Phan, W. Lin, "Enriched Long-Term Recurrent Convolutional Network for Facial Micro-Expression Recognition," Proceedings of the 2018 International Conference on Automatic Face & Gesture Recognition (FG), 2018, pp. 667-674.
[Document 2] G. Zhao, M. Pietikäinen, "Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, pp. 915-928.
[Document 3] D. Fleet, Y. Weiss, "Optical Flow Estimation," Springer US, 2006.
[Document 4] Liong S.-T., See J., Wong K., Phan R. C.-W., "Less is more: Micro-expression recognition from video using apex frame," Signal Processing: Image Communication, vol. 62, 2018, pp. 82-92.
[Document 5] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi and S. Gould, "Dynamic image networks for action recognition," In Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3034-3042.
Disclosure of Invention
Aiming at the defects of existing micro-expression recognition methods, the invention provides a deep-learning-based, end-to-end micro-expression recognition network built on a pixel displacement generation module. The modules for displacement feature extraction and for expression recognition and classification are given more room to adjust automatically according to the data, increasing the overall fit of the model.
The technical scheme of the invention is a micro-expression recognition method of pixel displacement vectors based on a convolutional neural network, which establishes an end-to-end micro-expression recognition network based on a pixel displacement generation module. The processing flow of the micro-expression recognition network comprises the following steps:
selecting a maximum frame, including randomly selecting, during training, a frame from a range around the original maximum frame as the maximum frame image;
generating a pixel displacement vector feature map, including inputting the selected maximum frame image and the initial frame image together into the pixel displacement generation module, which outputs a pixel displacement vector feature map between the two images through per-layer convolutional learning and feature fusion;
calculating the associated loss functions, including first upsampling the generated displacement vector feature map by bilinear interpolation to obtain a displacement feature map of the same size as the maximum frame, then sampling the original initial frame image according to this displacement feature map to generate an approximate maximum frame image, and calculating the reconstruction loss and regularization loss from the generated approximate maximum frame image and the originally selected maximum frame image;
performing a normalization operation, including normalizing the generated pixel displacement vector feature map;
performing feature learning and micro-expression classification, namely concatenating the selected maximum frame image with the normalized pixel displacement vector feature map and inputting them into a classification network to obtain a classification prediction result.
In the training process, the selection of the maximum frame is realized through a randomization process: a frame within a certain range before or after the original maximum frame is randomly selected, which enlarges the set of image pairs actually used for training. In the verification or test stage, the original maximum frame image is used directly.
the generated pixel displacement vector features are normalized before being input into the classification network, and each bit displacement vector feature map is divided by the average value of the previous values of the absolute values of the bit displacement vector feature maps.
Moreover, for the generated pixel displacement vector feature map, the loss function includes the reconstruction loss between the original maximum frame and the maximum frame reconstructed from the starting frame and the displacement vectors, and an L1 regularization loss computed on the displacement vector feature map itself.
The selected maximum frame image and the generated pixel displacement vector feature map are input together into a classification network for learning; after the classification prediction result is obtained, the classification loss is calculated as needed to derive the relevant evaluation metrics.
Compared with the prior art, the invention has the following advantages and positive effects:
(1) The pixel displacement generation module provided by the invention can be trained end to end together with the classification network: the classification loss is back-propagated into the pixel displacement generation module, so its parameters are adjusted automatically according to the classification results and it generates displacement features that are easier to classify, while the overall model achieves a better fit.
(2) The random maximum frame selection operation provided by the invention enlarges the set of image pairs actually used for training, enhances the robustness of the network and its sensitivity to subtle changes, and improves the generation and classification of displacement features.
(3) The normalization operation provided by the invention is equivalent to simultaneously shrinking large-amplitude expression displacements and amplifying small-amplitude ones. It thus provides adaptive adjustment of expression amplitude, reducing the influence of amplitude differences between image pairs on the classification network and making the classification network easier to train.
Drawings
The accompanying drawings are included to provide further explanation of the present application and constitute a part of this specification.
FIG. 1 is a schematic diagram of an overall structure of an end-to-end micro expression recognition network based on a pixel displacement generation module according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a pixel displacement generation module according to an embodiment of the present invention;
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting it.
The invention discloses a fully convolutional pixel Displacement Generation Module (DGM) that generates a pixel displacement vector feature map ("displacements") between two frames, replacing the dynamic features produced by traditional optical flow or dynamic imaging methods. The module is combined with the existing LEARNet classification network to form an end-to-end micro-expression recognition model. A randomization operation on the maximum frame is also disclosed, which increases the number of training sample pairs and improves the network's sensitivity to subtle expression changes. The invention takes the initial frame of the expression sequence and the maximum frame chosen by the randomization operation directly as input, uses the pixel displacement generation module to generate a pixel displacement vector feature map, and, after normalization, concatenates this feature map with the maximum frame image and feeds both into the LEARNet classification network for learning and prediction. The model back-propagates the gradient of the classification loss into the DGM, so the DGM can adjust its parameters according to the classification results and generate displacement features that are easier to classify.
As shown in FIG. 1, an embodiment of the present invention provides an end-to-end micro-expression recognition method based on a pixel displacement generation module, implemented on the end-to-end micro-expression recognition network provided by the invention. The micro-expression recognition network comprises two neural-network parts, one for generating the pixel displacement vector features and one for learning and classification; the model is trained end to end, and the dynamic features of micro-expressions are represented by the pixel displacement vectors between the start frame and the maximum frame generated by the convolutional module DGM (the pixel displacement generation module shown in FIG. 2).
In the embodiment, the main flow of the micro-expression recognition network is as follows:
(1) Selecting the maximum frame: during training, the maximum frame is chosen through a randomization process; a frame within a certain range before or after the original maximum frame is randomly selected as the maximum frame image, enlarging the set of image pairs actually used for training. In the verification or test stage, the original maximum frame image is used directly.
Let i be the index of the initial frame, j the index of the maximum frame, and I_last the index of the last frame in the image sequence. The randomized frame index I_select is computed as:
I_select = random[MAX(i+1, round(j - (j-i) × 0.2)), MIN(round(j + (j-i) × 0.2), I_last)]
where MAX takes the larger of its two arguments and MIN the smaller, ensuring that the selected maximum frame lies after the starting frame and not beyond the last frame of the sequence, and random[a, b] denotes an integer drawn uniformly at random from the interval [a, b].
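The randomized apex selection above can be sketched in a few lines of Python. The function name, the `ratio` parameter, and the training flag are illustrative, not from the patent; only the formula itself is taken from the text:

```python
import random

def select_apex_frame(i, j, i_last, ratio=0.2, training=True):
    """Randomly pick an apex ("maximum") frame near the annotated one.

    i: index of the onset (initial) frame
    j: index of the annotated apex (maximum) frame
    i_last: index of the last frame in the sequence
    In the verification/test stage the original apex index is returned.
    """
    if not training:
        return j
    # MAX(i+1, round(j - (j-i)*ratio)) keeps the frame after the onset;
    # MIN(round(j + (j-i)*ratio), i_last) keeps it inside the sequence.
    lo = max(i + 1, round(j - (j - i) * ratio))
    hi = min(round(j + (j - i) * ratio), i_last)
    return random.randint(lo, hi)
```

With i = 0, j = 10, and a sequence of 21 frames, the candidate window is [8, 12], so each training epoch may pair the onset frame with a slightly different apex image.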
(2) Generating a pixel displacement vector feature map: and the selected maximum frame image and the initial frame image are input into a pixel displacement generation module together, and a pixel displacement vector characteristic diagram between the two images is output through learning and characteristic fusion of each convolution layer.
(3) Calculating the associated loss functions: first, the generated displacement vector feature map is upsampled by bilinear interpolation to obtain a displacement feature map of the same size as the maximum frame; the original initial frame image is then sampled according to this displacement feature map to generate an approximate maximum frame image. If a displacement vector value is not an integer, the pixel value of the corresponding point is computed by bilinear interpolation. The reconstruction loss L_rec and the L1 regularization loss are then calculated from the generated approximate maximum frame image and the originally selected maximum frame image.
(4) Normalization operation: the generated pixel displacement vector feature map is normalized. Let I_f be the displacement feature map generated by the network, and let M(I, n) denote the mean of the n largest values of image I. The normalized image I_n is obtained by:
I_n = I_f / max(M(|I_f|, n), 0.0001)
where the comparison with 0.0001 avoids division-by-zero errors.
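A minimal NumPy sketch of this normalization, assuming the 0.0001 guard is applied to the divisor as described; the helper name and the default n are illustrative:

```python
import numpy as np

def normalize_displacement(feat, n=100):
    """Divide a displacement feature map by the mean of its n largest
    absolute values; a 0.0001 floor on the divisor avoids division by
    zero. Averaging over the top n (rather than taking the maximum)
    damps the influence of isolated large noise points."""
    flat = np.abs(feat).ravel()
    n = min(n, flat.size)
    top_n = np.sort(flat)[-n:]            # n largest absolute values
    scale = max(top_n.mean(), 0.0001)
    return feat / scale
```

After this step, large-amplitude displacement maps are shrunk and small-amplitude ones amplified, so different image pairs reach the classifier at comparable scales.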
(5) Feature learning and micro-expression classification: the previously selected maximum frame image is concatenated (concat) with the normalized pixel displacement vector feature map and input into the classification network to obtain a classification prediction result (in the invention, micro-expressions are divided into three classes: negative, positive, and surprised); the Softmax classification loss and evaluation metrics such as UF1 and UAR are calculated as needed.
The invention can be considered as providing an end-to-end micro expression recognition model based on a pixel displacement generation module, which comprises a displacement vector feature generation module, a randomization processing module, a normalization processing module and a classification network module. Wherein:
the displacement vector feature generation module takes a starting frame in an expression sequence and a maximum frame selected by randomization operation as input, and generates a pixel displacement vector feature map (displacements) between two frames through a convolutional neural network provided by the pixel displacement generation module to replace a Dynamic image generated by a traditional Optical flow or Dynamic imaging method. The pixel displacement vector characteristic represents the displacement of each pixel of the maximum frame image on the basis of the initial frame image, the value range is between (-1,1), in order to enable the network to concentrate more on the characteristics around each pixel point, the generated pixel displacement vector characteristic graph is multiplied by a scaling factor alpha epsilon (0,1), and the range is limited to be between (-alpha, alpha). The correlation loss function includes: reconstruction loss between the original maximum frame and the maximum frame reconstructed from the starting frame and the displacement vector; the L1 regularized penalty computed for the displacement vector feature map itself.
In the randomization processing module, because different image sequences, both across and within data sets, have different expression amplitudes, a frame within a certain range before or after the original maximum frame is randomly selected as the maximum frame when the data is loaded, making the network that generates the displacement vector feature maps more robust.
In the normalization processing module, to normalize pixel displacement vector feature maps of different amplitudes, the invention divides each displacement vector feature map by the mean of the n largest values of its absolute value. The mean is used instead of the maximum to reduce interference from occasional large noise points.
The classification network can be any of several existing network structures; the pixel displacement vector feature map obtained above is normalized and input into the classification network together with the selected maximum frame image for learning and prediction, so that features of both the temporal and spatial dimensions are retained. The embodiment of the invention selects LEARNet (Verma Monu, Vipparthi Santosh Kumar, Singh Girdhari, Murala Subrahmanyam, "LEARNet: Dynamic Imaging Network for Micro Expression Recognition," IEEE Transactions on Image Processing, 2019) as the classification network; compared with the classical ResNet and VGG structures, this network retains more detail and better learns to distinguish the features of different expression classes.
As shown in FIG. 2, the pixel displacement generation module consists of the layers Conv, Conv1, Conv2, Conv3, Conv4, Up, Conv5, and Conv6 in sequence, where the output of Up is concatenated (Concat) with the output of the Conv1 layer to form the input of Conv5. The module therefore performs two downsamplings (the Conv1 and Conv3 layers with stride 2) and one upsampling (the Up layer); the specific parameter configuration of each convolutional layer is shown in the following table. Each of the convolutional layers Conv, Conv1, Conv2, Conv3, Conv4, and Conv5 is followed by a BN (batch normalization) layer and a leaky ReLU activation layer, while the final Conv6 layer is followed by a BN layer and a Tanh activation layer. The Up layer performs upsampling by bilinear interpolation, and its output is concatenated with the output of the Conv1 layer as the input of the next layer.
[Table: per-layer parameter configuration of the pixel displacement generation module; the table content is not recoverable from the text extraction.]
For an input image of width w and height h, the final output has 2 channels and width and height w/2 and h/2; this pixel displacement vector feature map is used for classification. The first channel represents displacement in the X direction and the second channel displacement in the Y direction. The generated pixel displacement vector feature map is also upsampled by bilinear interpolation to a displacement feature map of width w and height h for computing the loss functions: grid sampling is first performed on the initial frame image according to the upsampled displacement feature map to generate an approximate maximum frame image, then the L_rec loss is computed between the approximate maximum frame and the originally selected maximum frame, and the L1 regularization loss is computed on the displacement feature map. The loss functions are set as follows:
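As a rough sanity check of the shapes just described, the two stride-2 convolutions and the single 2x bilinear upsampling reduce an input of width w and height h to a 2-channel map of size w/2 x h/2. The channel counts live in the unrecoverable parameter table, so only the spatial scale factors from the text are modeled here; all names are illustrative:

```python
# Layer sequence from the text; each entry is (name, spatial scale factor).
# Stride-2 convolutions halve the spatial size, the Up layer doubles it,
# and stride-1 convolutions (assumed padded) preserve it.
DGM_LAYERS = [("Conv", 1.0), ("Conv1", 0.5), ("Conv2", 1.0), ("Conv3", 0.5),
              ("Conv4", 1.0), ("Up", 2.0), ("Conv5", 1.0), ("Conv6", 1.0)]

def dgm_output_shape(w, h):
    """Track the spatial size of the feature map through the module.
    The final map has 2 channels: X displacement and Y displacement."""
    for _name, scale in DGM_LAYERS:
        w, h = int(w * scale), int(h * scale)
    return 2, w, h
```

For a 128x128 input this yields a 2x64x64 displacement map, consistent with the half-resolution output the loss computation then upsamples back to full size.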
(1) Let I_s be the original initial frame image, I_t the selected maximum frame image, and T(I_s) the approximate maximum frame image obtained by sampling the starting frame according to the displacement feature map. The reconstruction loss L_rec is computed as:
L_rec = ||T(I_s) - I_t||_1
(2) To make the generated displacement features more compact, let T_xy denote the pixel displacement vector at (x, y). The L1 regularization loss over the pixel displacement vector feature map is computed as:
L_1 = Σ_(x,y) ||T_xy||_1
(3) The classification loss for micro-expressions uses the classical cross-entropy loss, denoted L_c. The overall loss L of the network is then:
L = w_1 × L_c + w_2 × L_rec + w_3 × L_1
where w_1, w_2, and w_3 are the weights of the loss functions L_c, L_rec, and L_1 respectively. The gradients of the three loss functions can each be back-propagated into the displacement generation module, and the weight coefficients are chosen according to the differing magnitudes of the losses: the module takes the reconstruction loss L_rec as the main loss, so its weight is set such that its gradient is of higher order than those of L_c and L_1. The embodiment preferably takes the experimental values w_1 = 0.0001, w_2 = 1000, w_3 = 1.
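A hedged NumPy sketch of the combined objective above. The mean reduction and the helper signature are assumptions (the patent specifies only the weighted sum and the example weights); in the actual model these terms would be autograd losses, not NumPy arrays:

```python
import numpy as np

def total_loss(warped, target, disp, l_cls,
               w1=0.0001, w2=1000.0, w3=1.0):
    """L = w1*Lc + w2*Lrec + w3*L1, per the patent's overall loss.

    warped: approximate maximum frame T(I_s) reconstructed from the
            start frame and the displacement map
    target: originally selected maximum frame I_t
    disp:   displacement vector feature map
    l_cls:  precomputed cross-entropy classification loss Lc
    """
    l_rec = np.abs(warped - target).mean()   # L1 reconstruction error
    l_1 = np.abs(disp).mean()                # L1 penalty on displacements
    return w1 * l_cls + w2 * l_rec + w3 * l_1
```

With the example weights, L_rec dominates the gradient, matching the text's statement that reconstruction is the module's main loss.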
Pixel displacement values are expressed as percentages of the image width and height. Let T_xy = (Δ_x, Δ_y) be the pixel displacement vector at (x, y); it means that the pixel at (x, y) of the original starting frame has moved to (x + w × Δ_x, y + h × Δ_y) in the approximate maximum frame image. The network outputs a displacement feature map in the range [-1, 1], which is multiplied by a scaling factor α ∈ (0, 1) so that the network focuses on displacement features around each pixel, yielding a final pixel displacement vector feature map in [-α, α]; the actual X-direction displacement component is thereby limited to [-w × α, w × α] and the Y-direction component to [-h × α, h × α]. If a displacement vector value is not an integer, the pixel value of the corresponding point is computed by bilinear interpolation.
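The sampling step can be sketched as follows for a grayscale frame. This follows the grid-sampling convention (the output at (x, y) reads the start frame at the displaced position), which may differ in sign from the forward-motion description above; boundary handling and all names are assumptions for illustration:

```python
import numpy as np

def warp_by_displacement(start, disp, alpha=0.5):
    """Sample the start frame at displaced positions.

    start: (h, w) grayscale image
    disp:  (h, w, 2) displacements as fractions of width/height in [-1, 1]
           (channel 0 is X, channel 1 is Y, as in the text)
    alpha: scaling factor restricting displacements to [-alpha, alpha]
    Bilinear interpolation handles non-integer sample positions.
    """
    h, w = start.shape
    out = np.zeros_like(start, dtype=float)
    for y in range(h):
        for x in range(w):
            dx, dy = disp[y, x] * alpha
            sx, sy = x + w * dx, y + h * dy        # percentage -> pixels
            x0 = int(np.clip(np.floor(sx), 0, w - 2))
            y0 = int(np.clip(np.floor(sy), 0, h - 2))
            fx = np.clip(sx - x0, 0.0, 1.0)
            fy = np.clip(sy - y0, 0.0, 1.0)
            # Bilinear blend of the four neighbouring pixels.
            out[y, x] = ((1 - fy) * ((1 - fx) * start[y0, x0]
                                     + fx * start[y0, x0 + 1])
                         + fy * ((1 - fx) * start[y0 + 1, x0]
                                 + fx * start[y0 + 1, x0 + 1]))
    return out
```

A zero displacement map reproduces the start frame exactly; a constant Δ_x of 0.25 on a 4-pixel-wide image (with α = 1) shifts the sampling one pixel to the right.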
For the sake of understanding the technical effects of the present invention, the following experimental results are attached:
TABLE 1 Ablation test results for the network proposed by this patent
[Table content not recoverable from the text extraction.]
TABLE 2 UF1 and UAR results for networks using conventional dynamic feature extraction methods and for the network proposed by this patent
[Table content not recoverable from the text extraction.]
In specific implementation, a person skilled in the art can realize the above process as an automated procedure using computer software. System devices implementing the method according to the technical solution of the present invention, such as a computer-readable storage medium storing the corresponding computer program and a computer device including and running that program, should also fall within the scope of the present invention.
In some possible embodiments, a micro-expression recognition system based on a pixel displacement vector of a convolutional neural network is provided, which includes a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the stored instructions in the memory to execute a micro-expression recognition method based on the pixel displacement vector of the convolutional neural network.
In some possible embodiments, a micro-expression recognition system based on pixel displacement vectors of a convolutional neural network is provided, which includes a readable storage medium, on which a computer program is stored, and when the computer program is executed, the micro-expression recognition system based on pixel displacement vectors of a convolutional neural network is implemented.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description is for illustrative purposes only and is not intended to limit the scope of the present disclosure, which is to be accorded the full scope consistent with the claims appended hereto.

Claims (5)

1. A micro-expression recognition method of pixel displacement vectors based on a convolutional neural network, characterized in that: an end-to-end micro-expression recognition network is established based on a pixel displacement generation module, and the processing flow of the micro-expression recognition network comprises the following steps: selecting a maximum frame, including randomly selecting, during training, a frame from a range around the original maximum frame as the maximum frame image;
generating a pixel displacement vector characteristic diagram, wherein the pixel displacement vector characteristic diagram comprises that the selected maximum frame image and the initial frame image are input into a pixel displacement generation module together, and the pixel displacement vector characteristic diagram between the two images is output through the learning and characteristic fusion of each convolution layer;
calculating a correlation loss function, including firstly performing bilinear interpolation up-sampling on the generated displacement vector feature map to obtain a displacement feature map with the same size as the maximum frame, then performing sampling on the original initial frame image according to the displacement feature map to generate an approximate maximum frame image, and calculating reconstruction loss and regular loss according to the generated approximate maximum frame image and the originally selected maximum frame image;
normalization operation, including normalizing the generated pixel displacement vector characteristic diagram;
and (3) performing feature learning and micro-expression classification, namely connecting the selected maximum frame image with the normalized pixel displacement vector feature map and inputting the maximum frame image and the normalized pixel displacement vector feature map into a classification network to obtain a classification prediction result.
2. The micro-expression recognition method of pixel displacement vectors based on a convolutional neural network according to claim 1, wherein: during training, the selection of the maximum frame is randomized, with a frame within a certain range before or after the original maximum frame being selected at random, thereby augmenting the image pairs actually used for training; during the validation or test stage, the original maximum frame image is used directly.
3. The micro-expression recognition method of pixel displacement vectors based on a convolutional neural network according to claim 1, wherein: the generated pixel displacement vector feature maps are normalized before being input into the classification network, each displacement vector feature map being divided by the mean of the several largest absolute values in that map.
4. The micro-expression recognition method of pixel displacement vectors based on a convolutional neural network according to claim 1, wherein: for the generated pixel displacement vector feature map, the loss function comprises a reconstruction loss between the original maximum frame and the maximum frame reconstructed from the starting frame and the displacement vectors, and an L1 regularization loss computed on the displacement vector feature map.
5. The micro-expression recognition method of pixel displacement vectors based on a convolutional neural network according to any one of claims 1 to 4, wherein: the selected maximum frame image and the generated pixel displacement vector feature map are input into a classification network for learning, and a classification loss is calculated as required to obtain a classification prediction result and the related evaluation indices.
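Outside the claim language, the randomized maximum-frame (apex) selection of claim 2 can be sketched as follows. The function name, the `jitter` window size, and the index-clamping behaviour are illustrative assumptions not specified in the claims:

```python
import random

def select_apex_index(apex_idx, n_frames, training, jitter=2):
    """Claim 2 sketch: during training, pick a random frame within
    +/- jitter of the annotated apex frame as the maximum frame
    (augmenting the image pairs used for training); during validation
    or testing, return the annotated apex frame itself.
    jitter is an assumed hyper-parameter, not fixed by the patent."""
    if not training:
        return apex_idx
    lo = max(0, apex_idx - jitter)          # clamp to the clip start
    hi = min(n_frames - 1, apex_idx + jitter)  # clamp to the clip end
    return random.randint(lo, hi)           # inclusive on both ends
```

Each training epoch therefore pairs the fixed onset frame with a slightly different apex frame, which is what enlarges the effective set of training image pairs.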
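The sampling-and-loss computation of claims 1 and 4 can be sketched roughly as below. The helper names, the nearest-neighbour sampling (the patent upsamples the displacement map with bilinear interpolation before sampling), and the `reg_weight` balance factor are assumptions for illustration, not part of the patent:

```python
import numpy as np

def warp_with_displacement(onset, disp):
    """Sample the onset frame at positions shifted by the displacement
    field to reconstruct an approximate maximum (apex) frame.
    onset: (H, W) grayscale image; disp: (2, H, W) per-pixel (dx, dy).
    Nearest-neighbour sampling is used here for simplicity."""
    h, w = onset.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + disp[1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + disp[0]).astype(int), 0, w - 1)
    return onset[src_y, src_x]

def displacement_losses(onset, apex, disp, reg_weight=0.1):
    """Claim 4 sketch: reconstruction loss between the selected apex
    frame and the apex frame reconstructed from onset + displacement,
    plus an L1 regularization loss on the displacement field."""
    recon = warp_with_displacement(onset, disp)
    recon_loss = np.abs(recon - apex).mean()  # reconstruction term
    reg_loss = np.abs(disp).mean()            # L1 regularization term
    return recon_loss + reg_weight * reg_loss, recon
```

With a zero displacement field the reconstruction equals the onset frame, so when onset and apex coincide both loss terms vanish; in the patent this computation supervises the pixel displacement generation module end-to-end.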
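The normalization of claim 3 divides each displacement feature map by the mean of its several largest absolute values; a minimal sketch follows, where `top_k` is an assumed hyper-parameter (the claim does not fix how many of the largest values are averaged):

```python
import numpy as np

def normalize_displacement(disp, top_k=100):
    """Claim 3 sketch: normalize each channel of the displacement
    feature map (shape (C, H, W)) by the mean of the top_k largest
    absolute values in that channel, before it is concatenated with
    the apex frame and fed to the classification network."""
    out = np.empty_like(disp, dtype=float)
    for c in range(disp.shape[0]):
        flat = np.abs(disp[c]).ravel()
        k = min(top_k, flat.size)
        scale = np.sort(flat)[-k:].mean()   # mean of the k largest magnitudes
        out[c] = disp[c] / (scale + 1e-8)   # epsilon guards against all-zero maps
    return out
```

Scaling by the largest magnitudes rather than the full mean keeps the typically tiny micro-expression displacements in a consistent numeric range across samples without letting near-zero background pixels dominate the statistic.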
CN202111204917.XA 2021-10-15 2021-10-15 Micro-expression recognition method for pixel displacement vector based on convolutional neural network Active CN114005157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111204917.XA CN114005157B (en) 2021-10-15 2021-10-15 Micro-expression recognition method for pixel displacement vector based on convolutional neural network


Publications (2)

Publication Number Publication Date
CN114005157A true CN114005157A (en) 2022-02-01
CN114005157B CN114005157B (en) 2024-05-10

Family

ID=79923097


Country Status (1)

Country Link
CN (1) CN114005157B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627218A (en) * 2022-05-16 2022-06-14 成都市谛视无限科技有限公司 Human face fine expression capturing method and device based on virtual engine

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020037965A1 (en) * 2018-08-21 2020-02-27 北京大学深圳研究生院 Method for multi-motion flow deep convolutional network model for video prediction
CN112183419A (en) * 2020-10-09 2021-01-05 福州大学 Micro-expression classification method based on optical flow generation network and reordering
CN112766159A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Cross-database micro-expression identification method based on multi-feature fusion
CN112800891A (en) * 2021-01-18 2021-05-14 南京邮电大学 Discriminative feature learning method and system for micro-expression recognition


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU Jin; MIN Yu; MA Simin; ZHANG Weihua: "A Micro-Expression Recognition Algorithm Based on the Combination of CNN and LSTM", Telecommunication Engineering, no. 01, 28 January 2020 (2020-01-28) *


Also Published As

Publication number Publication date
CN114005157B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
CN111639692B (en) Shadow detection method based on attention mechanism
CN111062872B (en) Image super-resolution reconstruction method and system based on edge detection
CN112052886B (en) Intelligent human body action posture estimation method and device based on convolutional neural network
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
CN110110624B (en) Human body behavior recognition method based on DenseNet and frame difference method characteristic input
CN108648197B (en) Target candidate region extraction method based on image background mask
CN112507617B (en) Training method of SRFlow super-resolution model and face recognition method
US20230080693A1 (en) Image processing method, electronic device and readable storage medium
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN113688723A (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN113283444B (en) Heterogeneous image migration method based on generation countermeasure network
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN107123091A Near-infrared face image super-resolution reconstruction method based on deep learning
CN111476178A (en) Micro-expression recognition method based on 2D-3D CNN
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN114897728A (en) Image enhancement method and device, terminal equipment and storage medium
CN114022506A (en) Image restoration method with edge prior fusion multi-head attention mechanism
CN115359534A (en) Micro expression recognition method based on multi-feature fusion and double-flow network
CN117351542A (en) Facial expression recognition method and system
Kan et al. A GAN-based input-size flexibility model for single image dehazing
CN113379606B (en) Face super-resolution method based on pre-training generation model
CN114005157B (en) Micro-expression recognition method for pixel displacement vector based on convolutional neural network
CN113763417B (en) Target tracking method based on twin network and residual error structure
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
CN114463176B (en) Image super-resolution reconstruction method based on improved ESRGAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant