CN112800891B - Discriminative feature learning method and system for micro-expression recognition - Google Patents

Discriminative feature learning method and system for micro-expression recognition

Info

Publication number
CN112800891B
Authority
CN
China
Prior art keywords
expression
micro
layer
image
frame
Prior art date
Legal status
Active
Application number
CN202110060936.3A
Other languages
Chinese (zh)
Other versions
CN112800891A (en)
Inventor
卢官明
韩震
卢峻禾
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202110060936.3A
Publication of CN112800891A
Application granted
Publication of CN112800891B
Legal status: Active

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
              • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
                • G06V40/174 Facial expression recognition
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00 Pattern recognition
            • G06F18/20 Analysing
              • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F18/24 Classification techniques
                • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
              • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a discriminative feature learning method and system for micro-expression recognition. First, the initial frame and the peak frame of a micro-expression video sequence are extracted and preprocessed, and the optical flow between the peak frame and the initial frame is computed to obtain an optical flow map. Then, an image whose expression category differs from that of the peak frame is selected from an ordinary-expression image library and cropped, and the cropped image block replaces the corresponding region of the peak frame image to obtain a composite image. Next, a dual-stream convolutional neural network model based on a class activation map attention mechanism is constructed; the optical flow map and the composite image are fed into the two branches of the network, and the model is trained. Finally, the trained model is used to extract highly discriminative features from an input video sequence for micro-expression classification and recognition. The method effectively prevents the model from overfitting, enables it to learn highly discriminative micro-expression features, and improves the accuracy of micro-expression recognition.

Description

Discriminative feature learning method and system for micro-expression recognition
Technical Field
The invention relates to a method and a system for learning discriminative features for micro-expression recognition, and belongs to the field of micro-expression recognition and artificial intelligence.
Background
Facial expressions are a non-verbal behavior through which humans convey emotion, and they are also an important channel for machines to understand human emotion. Ordinary (macro-)expressions are produced when a person does not suppress the expression of emotion; the facial movements have large amplitude and last a relatively long time. In some situations, however, people deliberately suppress and hide their emotions, and the suppressed emotions are spontaneously revealed by extremely brief facial movements called micro-expressions. A micro-expression lasts an extremely short time, less than 0.2 s, and the facial motion is so subtle that humans recognize micro-expressions with low accuracy. At present, micro-expression recognition means classifying micro-expression sample sequences from existing databases, and it is generally divided into two steps: feature extraction and classification. Most of the work concentrates on feature extraction, and micro-expression recognition methods can be roughly divided into two categories according to how features are extracted: the first is based on hand-crafted features, and the second is based on features extracted by convolutional neural networks.
Methods based on hand-crafted features have achieved certain results in micro-expression recognition over decades of development, but they require specialized prior knowledge and a complicated parameter-tuning process, and their generalization ability and robustness are poor. With the rapid development of machine learning and deep learning, convolutional neural networks have achieved good performance in many fields of computer vision, and more and more researchers have applied them to micro-expression recognition. Ruicong proposed combining 3D convolutional neural networks (3D-CNNs) with transfer learning: the 3D-CNN is first trained with supervision on the ordinary-expression database Oulu-CASIA, and the pre-trained model is then fine-tuned on micro-expression data; to alleviate the shortage of database samples, the authors expanded the database sevenfold by image flipping and rotation. Kim combined a convolutional neural network with a long short-term memory network (LSTM) to extract the spatial and temporal information of micro-expression video sequences: the convolutional neural network learns the spatial information of each frame, and the LSTM then learns the temporal information across frames; experimental results show that this method outperforms LBP-TOP and its variants. Liong et al. computed the optical flow between the initial frame and the peak frame of a micro-expression, extracted and fused features from the horizontal- and vertical-direction optical flow maps with a dual-stream convolutional neural network, and finally performed classification.
The Chinese patent application "Micro-expression recognition method and system based on a channel attention mechanism" (application No. CN202010687230.5, publication No. CN112001241A) computes the horizontal component, the vertical component, and the optical flow magnitude of the optical flow between the peak frame and the initial frame to form a three-dimensional tensor, feeds this tensor into a micro-expression recognition network model based on a channel attention mechanism, and finally obtains the classification result. Because the input of this method is based only on optical flow information, it cannot effectively extract the spatial information of the micro-expression video sequence.
The Chinese patent application "A micro-expression recognition method based on a 3D convolutional neural network" (application No. CN201610954555.9, publication No. CN106570474A) extracts, for every frame of the micro-expression video sequence, a grayscale channel feature map, horizontal- and vertical-direction gradient channel feature maps, and horizontal- and vertical-direction optical flow channel feature maps to obtain a feature map group for the sequence to be recognized, and then feeds the feature map group into a 3D convolutional neural network for further feature extraction and classification. This method processes every frame of the micro-expression video sequence, so the amount of computation is extremely large; moreover, the training data are not augmented, so the model easily overfits during training.
Although convolutional neural networks have achieved excellent performance in micro-expression recognition, many challenges remain. First, training a convolutional neural network requires a large number of samples, while micro-expression databases are small: the micro-expression video library CASME II contains only 256 micro-expression video sequences, which easily leads to model overfitting. Second, compared with ordinary expressions, micro-expressions have small motion amplitude and weak intensity; a generic convolutional neural network model usually attends only to regions with obvious facial changes (such as the mouth and eyes) and ignores regions with small changes, so the information it extracts is insufficient. How to improve the model's ability to learn discriminative micro-expression features is therefore an important factor in improving micro-expression recognition accuracy.
Disclosure of Invention
Purpose of the invention: Aiming at problems such as model overfitting and insufficient extraction of discriminative micro-expression features in micro-expression recognition methods that extract features with convolutional neural networks, the present invention provides a discriminative feature learning method and system for micro-expression recognition. In addition, to enhance the model's ability to learn spatio-temporal discriminative features, the method uses the spatial stream branch of a dual-stream convolutional neural network to generate a class activation map and uses this map to apply attention enhancement to the input of the temporal stream branch.
Technical solution: To achieve the above purpose, the invention adopts the following technical solution:
a discriminative feature learning method for micro-expression recognition comprises the following steps:
(1) extracting initial frames and peak frames of video sequence samples in a micro-expression video library;
(2) normalizing the sizes of the images of the initial frame and the peak frame to be uniform into NxN pixels, and amplifying the normalized images by using different amplification factors to perform Euler motion to obtain a plurality of groups of images of the micro expression initial frame and the peak frame;
(3) calculating optical flow information between each group of micro expression peak frames and the initial frame to obtain an optical flow graph;
(4) for each micro expression peak frame image, selecting an image with an expression type different from the peak frame from a common expression image library, cutting the image, and replacing a corresponding area of the peak frame image with the cut image block to obtain a composite image containing two different expression type labels; the positions of the image blocks to be cut are randomly selected, and the sizes of the image blocks to be cut are controlled by the superparameters which are uniformly distributed from 0 to 1;
(5) constructing a double-current convolutional neural network model based on a class activation graph attention machine mechanism; the model is divided into a time flow branch and a space flow branch, the space flow branch sequentially comprises a feature extraction layer, a global average pooling layer, a full connection layer, a classification layer and a class activation map generation layer, the time flow branch sequentially comprises an attention enhancement layer, a feature extraction layer, a global average pooling layer, a full connection layer and a classification layer, and finally, a decision fusion layer is used for combining the outputs of the two flow classification layers; the class activation graph generation layer outputs a class activation graph according to the feature graph output by the feature extraction layer of the spatial stream branch and the weight between the full connection layer and the global average pooling layer; the attention enhancement layer of the time stream branch utilizes the class activation map output by the spatial stream branch to carry out attention enhancement on the input of the time stream branch;
(6) respectively inputting the light flow graph and the synthetic image into two branches of the constructed double-current convolution neural network model, and training the model;
(7) extracting an initial frame and a peak frame from an input video sequence, carrying out size normalization and Euler motion amplification pretreatment on the initial frame and the peak frame, further calculating optical flow information between the peak frame and the initial frame to obtain an optical flow diagram, respectively inputting the optical flow diagram and a pretreated peak frame image into two branches of a trained double-current convolution neural network model, extracting and obtaining micro-expression characteristics with strong discriminative power, and using the micro-expression characteristics for micro-expression classification identification.
Further, step (1) comprises the following sub-steps:
(1.1) taking the first frame of the micro-expression video sequence as the initial frame of the sequence;
(1.2) letting k be the total number of frames of the micro-expression video sequence, and subtracting the first frame image from each subsequent frame image to obtain difference images:

$D_m = F_m \ominus F_1, \quad m = 2, 3, \ldots, k$

where $\ominus$ denotes pixel-wise subtraction, m is the frame index, $F_m$ is the m-th frame image, and $F_1$ is the first frame image;
(1.3) computing the sum of pixel values of each difference image:

$S_m = \sum_i \sum_j D_m(i, j)$

where $D_m(i, j)$ is the pixel value of the difference image at coordinates (i, j);
(1.4) obtaining the frame index of the peak frame image:

$p = \arg\max_m S_m$

(1.5) the peak frame image is the p-th frame image $F_p$ of the micro-expression video sequence corresponding to the frame index p.
Further, step (4) comprises the following sub-steps:
(4.1) letting G be a micro-expression peak frame image from step (3) with category label $l_G$, and selecting an ordinary-expression image O whose category label $l_O$ differs from that of the micro-expression peak frame image;
(4.2) normalizing the ordinary-expression image O to N×N pixels, the same size as the micro-expression peak frame image;
(4.3) generating the coordinates $R = (C_x, C_y, C_h, C_w)$ of the bounding box of the cropping region, the purpose being to remove the pixels of the micro-expression peak frame image G inside the cropping region and replace them with the pixels of the ordinary expression O inside the same region, where $C_x$ and $C_y$ are the abscissa and ordinate of the center of the bounding box and $C_h$ and $C_w$ are its height and width:

$C_h = N\sqrt{1-\delta}, \quad C_w = N\sqrt{1-\delta}$

where δ is a hyper-parameter obeying a uniform distribution on [0, 1], and $C_x$ and $C_y$ obey a uniform distribution on [0, N];
(4.4) generating from the cropping-region bounding box R a binary mask $T \in \{0, 1\}^{N \times N}$; the mask T has size N×N and consists of 0s and 1s, its value being 0 inside the region corresponding to the bounding box R and 1 elsewhere;
(4.5) generating from the binary mask T a composite image carrying two different expression category labels:

$\hat{G} = T \odot G + (I - T) \odot O$

where $\hat{G}$ is the resulting composite image, I is an N×N mask whose values are all 1, and $\odot$ denotes element-wise multiplication.
Further, the specific structure of the spatial stream branch of the dual-stream convolutional neural network model based on the class activation map attention mechanism constructed in step (5) is as follows:
The feature extraction layer of the spatial stream branch extracts features from the input of the spatial stream branch to obtain a multi-channel feature map $M \in \mathbb{R}^{H \times H \times c}$, where the feature map size is H×H and the number of channels is c.
The global average pooling layer of the spatial stream branch uses an H×H pooling kernel to convert the feature map M output by the feature extraction layer into c feature values:

$\theta_n = \frac{1}{H \times H}\sum_{i=1}^{H}\sum_{j=1}^{H} M_n(i, j)$

where $\theta_n$ is the n-th feature value output by the global average pooling layer and $M_n(i, j)$ is the value of the n-th channel feature map at coordinates (i, j).
The fully connected layer of the spatial stream branch fully connects the output of the global average pooling layer to v output neurons and outputs a v-dimensional feature vector:

$\xi_n = \sum_{j=1}^{c} w_j^n \theta_j$

where $\xi_n$ is the n-th feature value output by the fully connected layer and $w_j^n$ is the weight connecting the n-th output neuron of the fully connected layer with the j-th feature value output by the global average pooling layer.
The Softmax classification layer of the spatial stream branch fully connects the feature vector output by the fully connected layer to v output nodes corresponding to the expression categories and outputs a v-dimensional vector in which each component represents the probability of belonging to the corresponding category, v being the number of categories.
The class activation map generation layer of the spatial stream branch generates the class activation map corresponding to a given category:

$A^n = \sum_{j=1}^{c} w_j^n M_j$

where $M_j$ is the j-th channel feature map output by the feature extraction layer, $w_j^n$ is the weight connecting the n-th output neuron of the fully connected layer with the j-th feature value output by the global average pooling layer, and $A^n$ is the class activation map corresponding to the n-th category, of size H×H. During training, the class activation map generation layer outputs the class activation map of the category given by the micro-expression peak frame image label; when the trained dual-stream convolutional neural network model is used for micro-expression recognition, it outputs the class activation map of the category with the highest probability in the Softmax classification layer of the spatial stream branch.
Further, the specific structure of the temporal stream branch of the dual-stream convolutional neural network model based on the class activation map attention mechanism constructed in step (5) is as follows:
The attention enhancement layer of the temporal stream branch uses the class activation map output by the spatial stream branch to apply attention enhancement to the input of the temporal stream branch. The size of the class activation map is first aligned with that of the input of the temporal stream branch:

$A_{up} = \mathrm{Upsample}(A^n)$

where Upsample() is an upsampling function that changes the size of the class activation map from H×H to N×N. The values of the upsampled class activation map are then mapped into the range (0, 1):

$\tilde{A} = \mathrm{Sig}(A_{up})$

where Sig() is the Sigmoid function, which maps the values of the class activation map into (0, 1). Finally, the class activation map is used to apply attention enhancement to the input of the temporal stream branch:

$\hat{a} = a \odot (I + \tilde{A})$

where a is the input of the temporal stream branch, $\hat{a}$ is the input after attention enhancement, I is an N×N mask whose values are all 1, and $\odot$ denotes element-wise multiplication.
The attention-enhanced input of the temporal stream branch then passes, in order, through the feature extraction layer, the global average pooling layer, and the fully connected layer of the temporal stream branch, and finally the Softmax classification layer of the temporal stream outputs the probability that the temporal stream input belongs to each category.
Further, step (6) comprises the following sub-steps:
(6.1) initializing the network weights with a random initialization method;
(6.2) feeding the composite image of step (4) into the spatial stream branch of the dual-stream convolutional neural network and constructing the loss function of the spatial stream branch from the output of its Softmax classification layer:

$L_s = \left(-\phi_s[l_G] + \log\sum_j \exp(\phi_s[j])\right)\cdot\delta + \left(-\phi_s[l_O] + \log\sum_j \exp(\phi_s[j])\right)\cdot(1-\delta)$

where $l_G$ and $l_O$ are the category labels of the micro-expression peak frame image G and of the ordinary expression O used for synthesis, δ is the hyper-parameter of step (4), $\phi_s[j]$ is the value of the spatial stream branch Softmax classification layer output corresponding to category label j, $\phi_s[l_G]$ is the value corresponding to category label $l_G$, and $\phi_s[l_O]$ is the value corresponding to category label $l_O$;
(6.3) feeding the optical flow map of step (3) into the temporal stream branch of the dual-stream convolutional neural network and constructing the loss function of the temporal stream branch from the output of its Softmax classification layer:

$L_t = -\phi_t[l_G] + \log\sum_j \exp(\phi_t[j])$

where $\phi_t[l_G]$ is the value of the temporal stream branch Softmax classification layer output corresponding to category label $l_G$ and $\phi_t[j]$ is the value corresponding to category label j;
(6.4) adding the spatial stream loss function and the temporal stream loss function to obtain the total loss function of the dual-stream convolutional neural network:

$L_{sum} = L_t + L_s$

and performing gradient computation and weight updating of the dual-stream convolutional neural network model according to the total loss $L_{sum}$;
(6.5) obtaining the trained dual-stream convolutional neural network model after repeated iterative training.
Based on the same inventive concept, the invention discloses a discriminative feature learning system for micro-expression recognition, which comprises:
a preprocessing module for extracting the initial frame and the peak frame of each video sequence sample in a micro-expression video library, normalizing the initial frame and peak frame images to a uniform size of N×N pixels, and applying Eulerian motion magnification to the normalized images with different magnification factors to obtain multiple groups of micro-expression initial frame and peak frame images;
an optical flow computation module for computing the optical flow between the peak frame and the initial frame of each group to obtain an optical flow map;
an image synthesis module for selecting, for each micro-expression peak frame image, an image from an ordinary-expression image library whose expression category differs from that of the peak frame, cropping it, and replacing the corresponding region of the peak frame image with the cropped image block to obtain a composite image carrying two different expression category labels, the position of the cropped image block being chosen at random and its size being controlled by a hyper-parameter drawn from a uniform distribution on [0, 1];
a network model construction and training module for constructing a dual-stream convolutional neural network model based on a class activation map attention mechanism, the model being divided into a temporal stream branch and a spatial stream branch, the spatial stream branch consisting, in order, of a feature extraction layer, a global average pooling layer, a fully connected layer, a classification layer and a class activation map generation layer, the temporal stream branch consisting, in order, of an attention enhancement layer, a feature extraction layer, a global average pooling layer, a fully connected layer and a classification layer, with a decision fusion layer finally merging the outputs of the classification layers of the two streams, wherein the class activation map generation layer outputs a class activation map from the feature maps produced by the feature extraction layer of the spatial stream branch and the weights between the fully connected layer and the global average pooling layer, and the attention enhancement layer of the temporal stream branch uses the class activation map output by the spatial stream branch to apply attention enhancement to the input of the temporal stream branch; the module feeds the optical flow map and the composite image into the two branches of the constructed dual-stream convolutional neural network model and trains the model;
a micro-expression recognition module for extracting the initial frame and the peak frame from an input video sequence, preprocessing them with size normalization and Eulerian motion magnification, computing the optical flow between the peak frame and the initial frame to obtain an optical flow map, feeding the optical flow map and the preprocessed peak frame image into the two branches of the trained dual-stream convolutional neural network model, and extracting highly discriminative micro-expression features for micro-expression classification and recognition.
Based on the same inventive concept, the invention also discloses a discriminative feature learning system for micro-expression recognition comprising at least one computing device, the computing device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the computer program, when loaded into the processor, implementing the above discriminative feature learning method for micro-expression recognition.
Beneficial effects: Compared with the prior art, the invention has the following advantages:
(1) The invention constructs a dual-stream convolutional neural network based on a class activation map attention mechanism in which the spatial stream branch and the temporal stream branch are not independent of each other: the spatial stream branch generates a class activation map, and this class activation map is used to apply attention enhancement to the input of the temporal stream branch. The class activation map generated by the spatial stream branch indicates which regions carry strongly discriminative micro-expression features in the spatial domain; in order to make the model also attend to the regions with strongly discriminative features in the temporal domain, the class activation map is used to apply attention enhancement to the input of the temporal stream branch. The class activation map is prior knowledge generated by the spatial stream branch, and this prior knowledge supplements the information available to the temporal stream branch, which strengthens the model's ability to learn spatio-temporal discriminative features and thereby improves micro-expression recognition accuracy.
(2) In the model training stage, the constructed dual-stream convolutional neural network model is trained with composite images and optical flow maps, where a composite image carries two different expression category labels: the category label of the micro-expression peak frame and the category label of the ordinary expression. The benefits are as follows. First, for a micro-expression peak frame image of a given category, composite images are obtained by combining it with ordinary-expression images of different categories, which further enlarges the training set and prevents the model from overfitting. Second, a composite image contains an ordinary-expression part and a micro-expression peak frame part; compared with an entire micro-expression peak frame image, which is hard to recognize, the composite image is more suitable for training the network, because ordinary expressions are easier to recognize than micro-expressions: when the network is trained with composite images, its main task is to recognize the micro-expression part of the composite image, i.e., it only needs to extract discriminative micro-expression features from a certain region, which reduces the training difficulty of the model. Third, during training, the micro-expression peak frame category label of the composite image is combined with the loss function of the network to guide the model to learn features of the regions of the micro-expression peak frame image that are not replaced by the ordinary-expression image; because the replaced regions of the micro-expression image are chosen at random, i.e., every part of the micro-expression image may be replaced, the model can, as training proceeds, fully learn the discriminative micro-expression features of every facial region instead of attending only to certain regions where facial changes are obvious (such as the mouth and eyes).
(3) The discriminative feature learning method for micro-expression recognition provided by the invention achieves automatic feature extraction through end-to-end training, without the need to design feature extractors by hand, and is simple and efficient.
Drawings
FIG. 1 is a flow chart of the method of an embodiment of the present invention;
FIG. 2 is a structural diagram of the dual-stream convolutional neural network based on a class activation map attention mechanism constructed in an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
As shown in fig. 1, the discriminative feature learning method for micro-expression recognition disclosed in the embodiment of the present invention specifically includes the following steps:
Step (1): extracting the initial frame and the peak frame of each video sequence sample in the micro-expression video library. In this embodiment, the CASME II database is used as the data source, and the initial frame and the peak frame of each micro-expression video sequence sample are extracted through the following sub-steps:
(1.1) taking the first frame $F_1$ of the micro-expression video sequence as the initial frame of the micro-expression image sequence;
(1.2) letting k be the total number of frames of the micro-expression video sequence, and subtracting the first frame image from each subsequent frame image to obtain difference images:

$D_m = F_m \ominus F_1, \quad m = 2, 3, \ldots, k$

where $\ominus$ denotes pixel-wise subtraction, m is the frame index, $F_m$ is the m-th frame image, and $F_1$ is the first frame image, i.e. the initial frame image;
(1.3) computing the sum of pixel values of each difference image:

$S_m = \sum_i \sum_j D_m(i, j)$

where $D_m(i, j)$ is the pixel value of the difference image at coordinates (i, j);
(1.4) obtaining the frame index of the peak frame image:

$p = \arg\max_m S_m$

(1.5) the peak frame image is the p-th frame image $F_p$ of the micro-expression video sequence corresponding to the frame index p, as illustrated by the sketch below.
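The following is a minimal sketch of this peak-frame selection, assuming the sequence has already been decoded into equally sized grayscale NumPy arrays; the function and variable names are illustrative rather than taken from the patent, and the pixel-wise subtraction is taken here as an absolute difference.

```python
import numpy as np

def find_apex_frame(frames):
    """Return (frame index p counted from 1, peak frame) for a list of grayscale frames."""
    onset = frames[0].astype(np.int32)
    best_idx, best_score = 1, -1.0
    for m in range(1, len(frames)):                          # frames 2..k
        diff = np.abs(frames[m].astype(np.int32) - onset)    # D_m = F_m (-) F_1, taken as |.|
        score = float(diff.sum())                            # S_m: sum of pixel values of D_m
        if score > best_score:
            best_score, best_idx = score, m
    return best_idx + 1, frames[best_idx]                    # p = argmax_m S_m
```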
Step (2): normalizing the initial frame and peak frame images obtained in step (1) to a uniform size of N×N pixels (N may be chosen from 112 to 448), and applying Eulerian motion magnification to the normalized images with several different magnification factors (the magnification factor α may be chosen from 2 to 20) to obtain multiple groups of micro-expression initial frame and peak frame images. In this embodiment the image size is set to 224×224 pixels, and Eulerian motion magnification with factors 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 is applied, so that the subtle facial changes are amplified; after Eulerian motion magnification with ten different factors, the number of samples is ten times the original.
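The sketch below is a heavily simplified stand-in for this magnification step: it only amplifies, by the factor α, the per-pixel difference of the peak frame from the initial frame. The full Eulerian video magnification method instead amplifies a temporally filtered multi-scale (Laplacian-pyramid) decomposition of the sequence, so this is an illustration of the augmentation idea rather than the actual algorithm.

```python
import numpy as np

def magnify_pair(onset, apex, alpha):
    """Amplify the motion from the initial frame to the peak frame by factor alpha (e.g. 3..12)."""
    onset_f = onset.astype(np.float32)
    apex_f = apex.astype(np.float32)
    magnified = onset_f + alpha * (apex_f - onset_f)         # linear difference amplification
    return np.clip(magnified, 0, 255).astype(np.uint8)

# Ten magnification factors give ten (initial frame, magnified peak frame) pairs per sample:
# pairs = [(onset, magnify_pair(onset, apex, a)) for a in range(3, 13)]
```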
Step (3): computing the optical flow between the peak frame and the initial frame of each group to obtain an optical flow map. This specifically comprises the following sub-steps:
(3.1) computing the optical flow between the peak frame and its initial frame with the DeepFlow algorithm to obtain the optical flow maps $U_x$ and $U_y$ along the x-axis and y-axis directions; at every position, squaring the optical flow values of $U_x$ and $U_y$, adding them, and taking the square root to obtain a further optical flow map $U_z$:

$U_z(i, j) = \sqrt{U_x(i, j)^2 + U_y(i, j)^2}$

where $U_z(i, j)$, $U_x(i, j)$ and $U_y(i, j)$ are the optical flow values of $U_z$, $U_x$ and $U_y$ at coordinates (i, j);
(3.2) linearly rescaling $U_x$, $U_y$ and $U_z$ into the interval [0, 1]:

$\tilde{U}(i, j) = \dfrac{U(i, j) - U_{\min}}{U_{\max} - U_{\min}}, \quad U \in \{U_x, U_y, U_z\}$

where $U_{\min}$ and $U_{\max}$ are the minimum and maximum optical flow values of the corresponding optical flow map, $\tilde{U}$ is the linearly rescaled optical flow map, and $\tilde{U}(i, j)$ and $U(i, j)$ are the optical flow values at coordinates (i, j) of the rescaled and original optical flow maps;
(3.3) stacking $\tilde{U}_x$, $\tilde{U}_y$ and $\tilde{U}_z$ to form the final three-channel optical flow map U of size 224×224×3, as sketched below.
Step (4): obtaining composite images from the micro-expression peak frame images and ordinary-expression images whose categories differ from those of the peak frame images. Specifically, an image whose expression category differs from that of the peak frame is selected from an ordinary-expression image library and cropped, and the cropped image block replaces the corresponding region of the peak frame image, yielding a composite image carrying two different expression category labels. In this embodiment the ordinary-expression images come from the FERPlus database, which has 7 ordinary-expression categories; only the ordinary-expression images of the three shared categories (happiness, surprise and disgust) are used. For a micro-expression peak frame image of a given category, composite images are obtained with ordinary-expression images of categories different from it, which further enlarges the number of samples. For example, if the expression category of the micro-expression peak frame is happiness, composite images are obtained with ordinary-expression images of the categories surprise and disgust; if the expression category of the micro-expression peak frame is repression, composite images are obtained with ordinary expressions of the categories happiness, surprise and disgust. Obtaining a composite image from an ordinary-expression image and a micro-expression peak frame image comprises the following sub-steps:
(4.1) letting G be the peak frame image of step (3) with category label $l_G$, and selecting an ordinary expression O whose category label $l_O$ differs from that of the micro-expression peak frame image;
(4.2) normalizing the ordinary-expression image O to 224×224 pixels, the same size as the micro-expression peak frame image;
(4.3) generating the coordinates $R = (C_x, C_y, C_h, C_w)$ of the bounding box of the cropping region, the purpose being to remove the pixels of the micro-expression peak frame image G inside the cropping region and replace them with the pixels of the ordinary expression O inside the same region:

$C_h = N\sqrt{1-\delta}, \quad C_w = N\sqrt{1-\delta}$

where N is the normalized size of step (2), i.e. 224, δ is a hyper-parameter obeying a uniform distribution on [0, 1], and $C_x$ and $C_y$ obey a uniform distribution on [0, N];
(4.4) generating from the cropping-region bounding box R a binary mask $T \in \{0, 1\}^{N \times N}$; the mask T has size N×N and consists of 0s and 1s, its value being 0 inside the region corresponding to the bounding box R and 1 elsewhere;
(4.5) generating from the binary mask T a composite image carrying two different expression category labels (a code sketch of this construction follows):

$\hat{G} = T \odot G + (I - T) \odot O$

where $\hat{G}$ is the resulting composite image, I is an N×N mask whose values are all 1, and $\odot$ denotes element-wise multiplication.
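The following is a sketch of the composite-image construction (CutMix-style mixing). The $N\sqrt{1-\delta}$ box size follows the formula as reconstructed above and should be treated as an assumption; the names are illustrative.

```python
import numpy as np

def make_composite(peak_img, ordinary_img, rng=np.random):
    """Replace a random box of the micro-expression peak frame with the ordinary-expression
    image; returns (composite image, delta), where delta later weights the two labels."""
    n = peak_img.shape[0]                          # images are N x N (x 3) and the same size
    delta = rng.uniform(0.0, 1.0)                  # hyper-parameter, uniform on [0, 1]
    ch = cw = int(round(n * np.sqrt(1.0 - delta))) # assumed box size N * sqrt(1 - delta)
    cx, cy = rng.randint(0, n), rng.randint(0, n)  # box centre, uniform on [0, N)

    x1, x2 = np.clip(cx - cw // 2, 0, n), np.clip(cx + cw // 2, 0, n)
    y1, y2 = np.clip(cy - ch // 2, 0, n), np.clip(cy + ch // 2, 0, n)

    mask = np.ones((n, n, 1), dtype=peak_img.dtype)   # T: 0 inside the box, 1 outside
    mask[y1:y2, x1:x2, :] = 0
    composite = mask * peak_img + (1 - mask) * ordinary_img
    return composite, delta
```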
Step (5): constructing the dual-stream convolutional neural network model based on a class activation map attention mechanism; as shown in FIG. 2, the network model is divided into a temporal stream branch and a spatial stream branch. The spatial stream branch consists, in order, of a feature extraction layer, a global average pooling layer, a fully connected layer, a Softmax classification layer and a class activation map generation layer; the temporal stream branch consists, in order, of an attention enhancement layer, a feature extraction layer, a global average pooling layer, a fully connected layer and a Softmax classification layer; finally, a decision fusion layer merges the outputs of the Softmax classification layers of the two streams. The specific functions of these layers are as follows:
The feature extraction layer of the spatial stream branch extracts features from the input of the spatial stream branch to obtain a multi-channel feature map $M \in \mathbb{R}^{H \times H \times c}$, where the feature map size is H×H and the number of channels is c. The feature extraction layer may use the feature extraction part of any convolutional neural network used in deep learning (e.g., ResNet, VGGNet, AlexNet). In this embodiment the multi-channel feature map is $M \in \mathbb{R}^{7 \times 7 \times 512}$, i.e. the feature map size is 7×7 and the number of channels is 512, and the feature extraction layer adopts the feature extraction part of ResNet-18 (i.e., the part of ResNet-18 from the first convolutional layer to the end of the last convolutional layer).
The global average pooling layer of the spatial stream branch uses an H×H pooling kernel to convert the feature map M output by the feature extraction layer into c feature values:

$\theta_n = \frac{1}{H \times H}\sum_{i=1}^{H}\sum_{j=1}^{H} M_n(i, j)$

where $\theta_n$ is the n-th feature value output by the global average pooling layer and $M_n(i, j)$ is the value of the n-th channel feature map at coordinates (i, j).
The fully connected layer of the spatial stream branch fully connects the output of the global average pooling layer to v output neurons and outputs a v-dimensional feature vector:

$\xi_n = \sum_{j=1}^{c} w_j^n \theta_j$

where $\xi_n$ is the n-th feature value output by the fully connected layer and $w_j^n$ is the weight connecting the n-th output neuron of the fully connected layer with the j-th feature value output by the global average pooling layer.
The Softmax classification layer of the spatial stream branch fully connects the feature vector output by the fully connected layer to v output nodes corresponding to the expression categories and outputs a v-dimensional vector in which each component represents the probability of belonging to the corresponding category, v being the number of categories. The micro-expression video library CASME II adopted in this embodiment has five micro-expression categories (happiness, surprise, disgust, repression and others), so v = 5.
The class activation map generation layer of the spatial stream branch generates the class activation map corresponding to a given category:

$A^n = \sum_{j=1}^{c} w_j^n M_j$

where $M_j$ is the j-th channel feature map output by the feature extraction layer, $w_j^n$ is the weight connecting the n-th output neuron of the fully connected layer with the j-th feature value output by the global average pooling layer, and $A^n$ is the class activation map corresponding to the n-th category, of size H×H; in this embodiment the class activation map has size 7×7. Because the label is available during training, the class activation map generation layer outputs the class activation map of the category given by the micro-expression peak frame image label; when the trained dual-stream convolutional neural network is used for micro-expression recognition, the class activation map generation layer outputs the class activation map of the category with the highest probability in the Softmax classification layer of the spatial stream branch.
The attention enhancement layer of the temporal stream branch uses the class activation map output by the spatial stream branch to apply attention enhancement to the input of the temporal stream branch. The size of the class activation map is first aligned with that of the input of the temporal stream branch:

$A_{up} = \mathrm{Upsample}(A^n)$

where Upsample() is an upsampling function that changes the size of the class activation map from H×H to N×N. The values of the upsampled class activation map are then mapped into the range (0, 1):

$\tilde{A} = \mathrm{Sig}(A_{up})$

where Sig() is the Sigmoid function, which maps the values of the class activation map into (0, 1). Finally, the class activation map is used to apply attention enhancement to the input of the temporal stream branch:

$\hat{a} = a \odot (I + \tilde{A})$

where a is the input of the temporal stream branch, $\hat{a}$ is the input after attention enhancement, and I is an N×N mask whose values are all 1.
The attention-enhanced input then passes, in order, through the feature extraction layer, the global average pooling layer and the fully connected layer of the temporal stream branch, whose structures are the same as those of the spatial stream, and finally the Softmax classification layer outputs the probability that the temporal stream input belongs to each category.
The decision fusion layer adds the output of the temporal stream branch Softmax classification layer and the output of the spatial stream branch Softmax classification layer to obtain the category scores of the whole dual-stream convolutional neural network for the input, and takes the category with the largest score as the final classification result.
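The PyTorch sketch below puts the two branches, the class-activation-map attention enhancement and the decision fusion together. The ResNet-18 backbone, the layer sizes (224×224 input, 7×7×512 features, v = 5 classes) and the residual form of the enhancement follow this embodiment as described above, but the code is an illustrative reconstruction under those assumptions, not the patented implementation itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class StreamBranch(nn.Module):
    """Feature extraction + global average pooling + fully connected layer of one stream."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # conv part only
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        fmap = self.features(x)                       # (B, 512, 7, 7)
        logits = self.fc(self.gap(fmap).flatten(1))   # (B, v)
        return fmap, logits

class CamAttentionTwoStream(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.spatial = StreamBranch(num_classes)      # input: composite / peak frame image
        self.temporal = StreamBranch(num_classes)     # input: 3-channel optical flow map

    def forward(self, peak_img, flow_map, target_class=None):
        fmap, s_logits = self.spatial(peak_img)
        # class index: ground-truth label during training, predicted class at recognition time
        cls = s_logits.argmax(1) if target_class is None else target_class
        w = self.spatial.fc.weight[cls]                            # (B, 512) FC weights w_j^n
        cam = torch.einsum('bc,bchw->bhw', w, fmap).unsqueeze(1)   # CAM: sum_j w_j^n * M_j
        cam = F.interpolate(cam, size=flow_map.shape[-2:],
                            mode='bilinear', align_corners=False)  # 7x7 -> 224x224
        att = torch.sigmoid(cam)                                   # map values into (0, 1)
        enhanced_flow = flow_map * (1.0 + att)                     # a * (I + A~)
        _, t_logits = self.temporal(enhanced_flow)
        fused = F.softmax(s_logits, dim=1) + F.softmax(t_logits, dim=1)  # decision fusion
        return s_logits, t_logits, fused
```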
and (6): training the double-current convolutional neural network constructed in the step (5) by using the data obtained in the steps (3) and (4), wherein the training comprises the following sub-steps:
(6.1) initializing network weights using a random initialization method;
(6.2) new image obtained in step (4)
Figure BDA0002902318110000141
Inputting into spatial stream branches of a double-stream convolutional neural network according to spaceThe output of the stream branch Softmax classification layer constructs a loss function of the spatial stream branch:
L s =(-φ s [l G ]+log((∑ j exp(φ s [j])))×δ+(-φ s [l o ]+log(∑ j exp(φ s [j]) 1-delta) in the above formula G 、l O Respectively a category label corresponding to the micro expression peak value frame image G and a category label corresponding to the common expression O in the step (4), delta is a hyper-parameter phi in the step (4) s [j]Represents the value, phi, of the spatial stream tributary Softmax classification layer output corresponding to the class label j s [l G ]Representing the class label l in the output of the Softmax classification layer of the spatial stream branch G Value of (phi) s [l O ]Representing class label l in spatial stream branch Softmax classification layer output O A value of (d);
(6.3) inputting the optical flow diagram U obtained in the step (3) into a time flow branch of the dual-flow convolutional neural network, and constructing a loss function of the time flow branch according to the output of the Softmax classification layer of the time flow branch:
Figure BDA0002902318110000142
middle phi of the above formula t [l G ]Representing the corresponding class label l in the output of the Softmax classification layer of the time flow branch G Value of (phi) t [j]Representing the value of the time flow branch Softmax classification layer output corresponding to the class label j;
(6.4) adding the spatial flow loss function and the time flow loss function to obtain the total loss function of the dual-flow convolution neural network:
L sum =L t +L s
from the total loss function L of the double-flow convolutional neural network sum Performing gradient calculation and weight updating on the double-current convolution neural network model by using a back propagation algorithm;
and (6.5) carrying out iterative training for multiple times (such as 50 times) to obtain a trained double-current convolutional neural network model.
Step (7): extracting the initial frame and the peak frame from an input video sequence, preprocessing them with size normalization and Eulerian motion magnification (a single specific magnification factor is used at recognition time), computing the optical flow between the peak frame and the initial frame to obtain an optical flow map, feeding the optical flow map and the preprocessed peak frame image into the two branches of the trained dual-stream convolutional neural network model, and extracting highly discriminative micro-expression features for micro-expression classification and recognition.
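An end-to-end recognition sketch for this step, combining the helper functions sketched above with the trained model, is given below; the single magnification factor and the grayscale-to-3-channel replication of the peak frame are illustrative assumptions.

```python
import numpy as np
import torch

def recognize(model, frames, alpha=8, device='cpu'):
    """frames: list of grayscale face frames already normalized to 224 x 224."""
    p, apex = find_apex_frame(frames)                 # step (1): peak frame index and image
    onset = frames[0]
    apex_mag = magnify_pair(onset, apex, alpha)       # step (2): one magnification factor
    flow = optical_flow_map(onset, apex_mag)          # step (3): 224 x 224 x 3 in [0, 1]

    to_tensor = lambda a: torch.from_numpy(np.ascontiguousarray(a)).float()
    # grayscale peak frame replicated to 3 channels for the ResNet-style backbone
    peak_t = to_tensor(apex_mag)[None, None].repeat(1, 3, 1, 1) / 255.0
    flow_t = to_tensor(flow).permute(2, 0, 1)[None]

    model.eval()
    with torch.no_grad():
        _, _, fused = model(peak_t.to(device), flow_t.to(device))
    return int(fused.argmax(1).item())                # predicted expression category
```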
Based on the same inventive concept, the embodiment of the present invention discloses a discriminative feature learning system for micro-expression recognition, which comprises:
a preprocessing module for extracting the initial frame and the peak frame of each video sequence sample in the micro-expression video library, normalizing the initial frame and peak frame images to a uniform size of N×N pixels, and applying Eulerian motion magnification to the normalized images with different magnification factors to obtain multiple groups of micro-expression initial frame and peak frame images;
an optical flow computation module for computing the optical flow between the peak frame and the initial frame of each group to obtain an optical flow map;
an image synthesis module for selecting, for each micro-expression peak frame image, an image from an ordinary-expression image library whose expression category differs from that of the peak frame, cropping it, and replacing the corresponding region of the peak frame image with the cropped image block to obtain a composite image carrying two different expression category labels, the position of the cropped image block being chosen at random and its size being controlled by a hyper-parameter drawn from a uniform distribution on [0, 1];
a network model construction and training module for constructing a dual-stream convolutional neural network model based on a class activation map attention mechanism, the model being divided into a temporal stream branch and a spatial stream branch, the spatial stream branch consisting, in order, of a feature extraction layer, a global average pooling layer, a fully connected layer, a classification layer and a class activation map generation layer, the temporal stream branch consisting, in order, of an attention enhancement layer, a feature extraction layer, a global average pooling layer, a fully connected layer and a classification layer, with a decision fusion layer finally merging the outputs of the classification layers of the two streams, wherein the class activation map generation layer outputs a class activation map from the feature maps produced by the feature extraction layer of the spatial stream branch and the weights between the fully connected layer and the global average pooling layer, and the attention enhancement layer of the temporal stream branch uses the class activation map output by the spatial stream branch to apply attention enhancement to the input of the temporal stream branch; the module feeds the optical flow map and the composite image into the two branches of the constructed dual-stream convolutional neural network model and trains the model;
a micro-expression recognition module for extracting the initial frame and the peak frame from an input video sequence, preprocessing them with size normalization and Eulerian motion magnification, computing the optical flow between the peak frame and the initial frame to obtain an optical flow map, feeding the optical flow map and the preprocessed peak frame image into the two branches of the trained dual-stream convolutional neural network model, and extracting highly discriminative micro-expression features for micro-expression classification and recognition.
Based on the same inventive concept, the discriminative feature learning system for micro-expression recognition disclosed in the embodiment of the present invention comprises at least one computing device, the computing device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the computer program, when loaded into the processor, implementing the above discriminative feature learning method for micro-expression recognition.
The technical solutions described above are only preferred technical solutions of the present invention; modifications that a person skilled in the art may make to some parts of them still embody the principles of the present invention and fall within the protection scope of the present invention.

Claims (8)

1. A method for discriminative feature learning for microexpression recognition, the method comprising the steps of:
(1) extracting initial frames and peak frames of video sequence samples in a micro-expression video library;
(2) normalizing the sizes of the images of the initial frame and the peak frame to be uniform into NxN pixels, and amplifying the normalized images by using different amplification factors to perform Euler motion to obtain a plurality of groups of images of the micro expression initial frame and the peak frame;
(3) calculating optical flow information between each group of micro expression peak frames and the initial frame to obtain an optical flow graph;
(4) for each micro expression peak value frame image, selecting an image of which the expression type is different from that of the peak value frame from a common expression image library, cutting the image, and replacing a corresponding area of the peak value frame image with an image block obtained by cutting to obtain a composite image containing two different expression type labels; the positions of the image blocks to be cut are randomly selected, and the sizes of the image blocks to be cut are controlled by the superparameters which are uniformly distributed from 0 to 1;
(5) constructing a double-current convolution neural network model based on a class activation graph attention mechanism; the model is divided into a time flow branch and a space flow branch, the space flow branch sequentially comprises a feature extraction layer, a global average pooling layer, a full connection layer, a classification layer and a class activation map generation layer, the time flow branch sequentially comprises an attention enhancement layer, a feature extraction layer, a global average pooling layer, a full connection layer and a classification layer, and finally, a decision fusion layer is used for combining the outputs of the two flow classification layers; the class activation graph generation layer outputs a class activation graph according to the feature graph output by the feature extraction layer of the spatial stream branch and the weight between the full connection layer and the global average pooling layer; the attention enhancement layer of the time stream branch utilizes the class activation map output by the spatial stream branch to carry out attention enhancement on the input of the time stream branch;
(6) respectively inputting the optical flow map and the composite image into the two branches of the constructed dual-stream convolutional neural network model, and training the model;
(7) extracting an initial frame and a peak frame from an input video sequence, performing size normalization and Eulerian motion magnification preprocessing on the initial frame and the peak frame, then calculating the optical flow information between the peak frame and the initial frame to obtain an optical flow map, respectively inputting the optical flow map and the preprocessed peak frame image into the two branches of the trained dual-stream convolutional neural network model, extracting micro-expression features with strong discriminative power, and using the micro-expression features for micro-expression classification and recognition.
2. The method for discriminative feature learning for micro-expression recognition according to claim 1, wherein the step (1) comprises the following sub-steps:
(1.1) using a first frame of the micro-expression video sequence as an initial frame of the micro-expression video sequence;
(1.2) setting the total frame number of the micro-expression video sequence as k, and subtracting the first frame image from each frame image starting from the second frame to obtain difference images:

$D_m = F_m \ominus F_1, \quad m = 2, 3, \ldots, k$

where $\ominus$ denotes the subtraction operation on corresponding pixels, $m$ denotes the frame number, $F_m$ denotes the m-th frame image, and $F_1$ is the first frame image;
(1.3) calculating the sum of the pixel values of each difference image:

$S_m = \sum_{i} \sum_{j} D_m(i, j)$

where $D_m(i, j)$ denotes the pixel value of the difference image at the coordinate position $(i, j)$;
(1.4) obtaining the frame number of the peak frame image:

$p = \arg\max_{m \in \{2, \ldots, k\}} S_m$
(1.5) the peak frame image is the p-th frame image $F_p$ in the micro-expression video sequence corresponding to the frame number p.
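A minimal sketch of the peak-frame selection described in claim 2 above, assuming the frames are available as grayscale NumPy arrays; the function name and the absolute-difference interpretation of the pixel-wise subtraction are assumptions for illustration.

```python
# Hypothetical sketch of claim 2: pick the apex (peak) frame as the frame whose
# difference image with respect to the first frame has the largest pixel sum.
import numpy as np

def find_peak_frame(frames):
    """frames: sequence of k grayscale images of identical size."""
    first = frames[0].astype(np.float32)                      # F_1, the initial frame
    sums = []
    for m in range(1, len(frames)):                           # frames 2 .. k
        diff = np.abs(frames[m].astype(np.float32) - first)   # D_m, difference image
        sums.append(diff.sum())                               # S_m, sum of its pixels
    p = int(np.argmax(sums)) + 1                              # 0-based index of the peak frame
    return p, frames[p]                                       # frame number and F_p
```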
3. The method for discriminative feature learning for micro-expression recognition according to claim 1, wherein the step (4) comprises the following sub-steps:
(4.1) denoting the micro-expression peak frame image in the step (3) as $G$ and its corresponding class label as $l_G$, and selecting a common expression image $O$ whose class label $l_O$ differs from that of the micro-expression peak frame image;
(4.2) normalizing the size of the common expression image $O$ to N×N pixels, the same size as the micro-expression peak frame image;
(4.3) generating the coordinates $R = (C_x, C_y, C_h, C_w)$ of the bounding box of the cropping region, the purpose being to remove the pixels of the micro-expression peak frame image $G$ inside the cropping region and replace them with the pixels of the corresponding cropping region of the common expression image $O$, wherein $C_x$ and $C_y$ respectively denote the abscissa and ordinate of the centre point of the bounding box, and $C_h$ and $C_w$ respectively denote the height and width of the bounding box, which are determined by the hyperparameter $\delta$; $\delta$ obeys a uniform distribution between 0 and 1, and $C_x$ and $C_y$ obey a uniform distribution between 0 and N;
(4.4) generating a binary mask $T \in \{0, 1\}^{N \times N}$ from the cropping-region bounding box $R$; the mask $T$ has a size of N×N and consists of 0s and 1s, its value inside the region corresponding to the bounding box $R$ being 0 and all remaining values being 1;
(4.5) generating, from the binary mask $T$, a composite image containing two different expression class labels:

$\tilde{G} = T \odot G + (I - T) \odot O$

where $\tilde{G}$ is the resulting composite image, $I$ is a mask of size N×N whose values are all 1, and $\odot$ denotes the multiplication of corresponding elements.
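A minimal sketch of the image synthesis described in claim 3 above, assuming single-channel N×N images; the square-root relation between the hyperparameter δ and the bounding-box size is an assumption, chosen so that the label weights δ and 1−δ used in claim 6 match the preserved and replaced image areas.

```python
# Hypothetical sketch of claim 3: replace a randomly placed box of the peak frame
# image G with the corresponding region of a common expression image O.
import numpy as np

def make_composite(G, O, N, rng=np.random.default_rng()):
    """G, O: N x N arrays carrying different expression class labels."""
    delta = rng.uniform(0.0, 1.0)                     # hyperparameter in (0, 1)
    ch = cw = int(round(N * np.sqrt(1.0 - delta)))    # assumed box height/width
    cx, cy = rng.integers(0, N, size=2)               # box centre, uniform on [0, N)
    x1, x2 = max(cx - cw // 2, 0), min(cx + cw // 2, N)
    y1, y2 = max(cy - ch // 2, 0), min(cy + ch // 2, N)
    T = np.ones((N, N), dtype=G.dtype)                # binary mask: 0 inside the box
    T[y1:y2, x1:x2] = 0
    composite = T * G + (1 - T) * O                   # T ⊙ G + (I − T) ⊙ O
    return composite, delta                           # delta later weights the two labels
```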
4. The method for discriminative feature learning for micro-expression recognition according to claim 1, wherein the spatial stream branch of the dual-stream convolutional neural network model based on the class activation map attention mechanism constructed in the step (5) has the following specific structure:
the feature extraction layer of the spatial stream branch extracts features from the input of the spatial stream branch to obtain a multi-channel feature map $M$, the size of the feature map being H×H and the number of channels being c;
the global average pooling layer of the spatial stream branch uses an H×H pooling kernel to convert the feature map $M$ output by the feature extraction layer into c feature values:

$\theta_n = \frac{1}{H \times H} \sum_{i=1}^{H} \sum_{j=1}^{H} M_n(i, j)$

where $\theta_n$ denotes the n-th feature value output by the global average pooling layer and $M_n(i, j)$ denotes the value of the n-th channel feature map at the coordinate position $(i, j)$;
the fully connected layer of the spatial stream branch fully connects the output of the global average pooling layer to v output neurons and outputs a v-dimensional feature vector:

$\xi_n = \sum_{j=1}^{c} w_n^{j} \theta_j$

where $\xi_n$ denotes the n-th feature value output by the fully connected layer and $w_n^{j}$ denotes the weight connecting the n-th output neuron of the fully connected layer with the j-th feature value output by the global average pooling layer;
the Softmax classification layer of the spatial stream branch fully connects the feature vector output by the fully connected layer to v output nodes corresponding to the expression classes and outputs a v-dimensional vector, each component of which represents the probability of belonging to the corresponding class, where v is the number of classes;
the class activation map generation layer of the spatial stream branch generates the class activation map corresponding to a given class n:

$A_n = \sum_{j=1}^{c} w_n^{j} M_j$

where $M_j$ denotes the j-th channel feature map output by the feature extraction layer, $w_n^{j}$ denotes the weight connecting the n-th output neuron of the fully connected layer with the j-th feature value output by the global average pooling layer, and the class activation map $A_n$ has a size of H×H; during training, the class activation map generation layer outputs the class activation map of the class corresponding to the label of the micro-expression peak frame image, and during micro-expression recognition with the trained dual-stream convolutional neural network model it outputs the class activation map of the class with the highest probability in the Softmax classification layer of the spatial stream branch.
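A minimal sketch of the class activation map generation described in claim 4 above, assuming the spatial-stream feature maps and fully connected weights are available as NumPy arrays; array shapes and names are assumptions for illustration.

```python
# Hypothetical sketch of claim 4: weight the c channel feature maps by the fully
# connected weights of one class and sum them into an H x H class activation map.
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (c, H, H) output of the spatial-stream feature extraction layer.
    fc_weights: (v, c) weights between the global-average-pooled features and the
    v class neurons of the fully connected layer."""
    w = fc_weights[class_idx]                       # weights w_n^j for class n
    cam = np.tensordot(w, feature_maps, axes=1)     # sum_j w_n^j * M_j  ->  (H, H)
    return cam
```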
5. The method for discriminative feature learning for micro-expression recognition according to claim 4, wherein the time stream branch of the dual-stream convolutional neural network model based on the class activation map attention mechanism constructed in the step (5) has the following specific structure:
the attention enhancement layer of the time stream branch performs attention enhancement on the input of the time stream branch by using the class activation map output by the spatial stream branch: the size of the class activation map is first aligned with the input of the time stream branch,

$A' = \mathrm{Upsample}(A_n)$

where Upsample() is an upsampling function that changes the size of the class activation map from H×H to N×N; the values of the upsampled class activation map are then mapped to between 0 and 1,

$A'' = \mathrm{Sig}(A')$

where Sig() is the Sigmoid function; finally, attention enhancement is performed on the input of the time stream branch using the class activation map:

$\tilde{a} = (I + A'') \odot a$

where $a$ is the input of the time stream branch, $\tilde{a}$ is the input after the attention enhancement, $I$ is a mask of size N×N whose values are all 1, and $\odot$ denotes the multiplication of corresponding elements;
the time flow branch input subjected to attention enhancement sequentially passes through a feature extraction layer, a global average pooling layer and a full connection layer of the time flow branch, and finally the probability that the time flow input belongs to each category is output through a time flow Softmax classification layer.
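A minimal sketch of the attention enhancement described in claim 5 above, assuming a two-channel optical flow input and using OpenCV for the upsampling; the residual form (I + A″) ⊙ a follows the all-ones mask I mentioned in the claim and is otherwise an assumption.

```python
# Hypothetical sketch of claim 5: upsample the class activation map, squash it with
# a Sigmoid and use it to re-weight the optical-flow input of the time stream branch.
import cv2
import numpy as np

def enhance_time_stream_input(flow_input, cam):
    """flow_input: (N, N, 2) optical-flow image; cam: (H, H) class activation map."""
    N = flow_input.shape[0]
    cam_up = cv2.resize(cam.astype(np.float32), (N, N))   # Upsample: H x H -> N x N
    cam_sig = 1.0 / (1.0 + np.exp(-cam_up))               # Sigmoid into (0, 1)
    return (1.0 + cam_sig)[..., None] * flow_input        # (I + A'') ⊙ a, per channel
```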
6. The method for discriminative feature learning for micro-expression recognition according to claim 1, wherein the step (6) comprises the following sub-steps:
(6.1) initializing network weights using a random initialization method;
(6.2) inputting the composite image from the step (4) into the spatial stream branch of the dual-stream convolutional neural network, and constructing the loss function of the spatial stream branch from the output of the Softmax classification layer of the spatial stream branch:
$L_s = \left(-\phi_s[l_G] + \log\Big(\sum_j \exp(\phi_s[j])\Big)\right) \times \delta + \left(-\phi_s[l_O] + \log\Big(\sum_j \exp(\phi_s[j])\Big)\right) \times (1 - \delta)$

where $l_G$ and $l_O$ are respectively the class labels of the micro-expression peak frame image $G$ and of the common expression image $O$ used for synthesis, $\delta$ is the hyperparameter in the step (4), $\phi_s[j]$ denotes the value corresponding to class label $j$ in the output of the Softmax classification layer of the spatial stream branch, $\phi_s[l_G]$ denotes the value corresponding to class label $l_G$ in that output, and $\phi_s[l_O]$ denotes the value corresponding to class label $l_O$ in that output;
(6.3) inputting the optical flow map from the step (3) into the time stream branch of the dual-stream convolutional neural network, and constructing the loss function of the time stream branch from the output of the Softmax classification layer of the time stream branch:
$L_t = -\phi_t[l_G] + \log\Big(\sum_j \exp(\phi_t[j])\Big)$

where $\phi_t[l_G]$ denotes the value corresponding to class label $l_G$ in the output of the Softmax classification layer of the time stream branch, and $\phi_t[j]$ denotes the value corresponding to class label $j$ in that output;
(6.4) adding the spatial stream loss function and the time stream loss function to obtain the total loss function of the dual-stream convolutional neural network:

$L_{sum} = L_t + L_s$

and performing gradient computation and weight updating on the dual-stream convolutional neural network model according to the total loss function $L_{sum}$;
(6.5) obtaining the trained dual-stream convolutional neural network model through repeated iterative training.
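A minimal sketch of the loss construction described in claim 6 above, written with a numerically stable log-sum-exp and assuming φ_s and φ_t are the raw v-dimensional scores of the two classification layers; variable names are assumptions for illustration.

```python
# Hypothetical sketch of claim 6: spatial-stream loss mixed with weights delta and
# (1 - delta), time-stream loss on the micro-expression label, summed into L_sum.
import numpy as np

def log_sum_exp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def total_loss(phi_s, phi_t, l_G, l_O, delta):
    """phi_s, phi_t: v-dimensional outputs of the spatial / time stream classifiers."""
    lse_s, lse_t = log_sum_exp(phi_s), log_sum_exp(phi_t)
    L_s = (-phi_s[l_G] + lse_s) * delta + (-phi_s[l_O] + lse_s) * (1.0 - delta)
    L_t = -phi_t[l_G] + lse_t
    return L_t + L_s   # L_sum used for gradient computation and weight updates
```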
7. A system for discriminative feature learning for micro-expression recognition, comprising:
the preprocessing module is used for extracting the initial frame and the peak frame of each video sequence sample in the micro-expression video library, normalizing the images of the initial frame and the peak frame to a uniform size of N×N pixels, and applying Eulerian motion magnification with different magnification factors to the normalized images to obtain a plurality of groups of micro-expression initial frame and peak frame images;
the optical flow information calculation module is used for calculating the optical flow information between each group of micro-expression peak frame and initial frame to obtain an optical flow map;
the image synthesis module is used for selecting, for each micro-expression peak frame image, an image whose expression class differs from that of the peak frame from a common expression image library, cropping it, and replacing the corresponding region of the peak frame image with the cropped image block to obtain a composite image containing two different expression class labels; the position of the image block to be cropped is selected randomly, and its size is controlled by a hyperparameter obeying a uniform distribution between 0 and 1;
the network model building and training module is used for building a dual-stream convolutional neural network model based on a class activation map attention mechanism, the model being divided into a time stream branch and a spatial stream branch, the spatial stream branch sequentially comprising a feature extraction layer, a global average pooling layer, a fully connected layer, a classification layer and a class activation map generation layer, the time stream branch sequentially comprising an attention enhancement layer, a feature extraction layer, a global average pooling layer, a fully connected layer and a classification layer, and finally a decision fusion layer combining the outputs of the classification layers of the two streams; the class activation map generation layer outputs a class activation map according to the feature map output by the feature extraction layer of the spatial stream branch and the weights between the fully connected layer and the global average pooling layer; the attention enhancement layer of the time stream branch uses the class activation map output by the spatial stream branch to perform attention enhancement on the input of the time stream branch; and the module respectively inputs the optical flow map and the composite image into the two branches of the constructed dual-stream convolutional neural network model and trains the model;
and the micro-expression recognition module is used for extracting an initial frame and a peak frame from an input video sequence, performing size normalization and Eulerian motion magnification preprocessing on the initial frame and the peak frame, then calculating the optical flow information between the peak frame and the initial frame to obtain an optical flow map, respectively inputting the optical flow map and the preprocessed peak frame image into the two branches of the trained dual-stream convolutional neural network model, extracting micro-expression features with strong discriminative power, and using the micro-expression features for micro-expression classification and recognition.
8. A system for discriminative feature learning for micro-expression recognition, comprising at least one computing device, the computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the method for discriminative feature learning for micro-expression recognition according to any one of claims 1 to 6.
CN202110060936.3A 2021-01-18 2021-01-18 Discriminative feature learning method and system for micro-expression recognition Active CN112800891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110060936.3A CN112800891B (en) 2021-01-18 2021-01-18 Discriminative feature learning method and system for micro-expression recognition

Publications (2)

Publication Number Publication Date
CN112800891A CN112800891A (en) 2021-05-14
CN112800891B true CN112800891B (en) 2022-08-26

Family

ID=75809985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110060936.3A Active CN112800891B (en) 2021-01-18 2021-01-18 Discriminative feature learning method and system for micro-expression recognition

Country Status (1)

Country Link
CN (1) CN112800891B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723287A (en) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 Micro-expression identification method, device and medium based on bidirectional cyclic neural network
CN114005157B (en) * 2021-10-15 2024-05-10 武汉烽火信息集成技术有限公司 Micro-expression recognition method for pixel displacement vector based on convolutional neural network
CN114550272B (en) * 2022-03-14 2024-04-09 东南大学 Micro-expression recognition method and device based on video time domain dynamic attention model
CN116311472B (en) * 2023-04-07 2023-10-31 湖南工商大学 Micro-expression recognition method and device based on multi-level graph convolution network
CN117456586A (en) * 2023-11-17 2024-01-26 江南大学 Micro expression recognition method, system, equipment and medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287805A (en) * 2019-05-31 2019-09-27 东南大学 Micro- expression recognition method and system based on three stream convolutional neural networks
CN110516571A (en) * 2019-08-16 2019-11-29 东南大学 Inter-library micro- expression recognition method and device based on light stream attention neural network
CN112115796A (en) * 2020-08-21 2020-12-22 西北大学 Attention mechanism-based three-dimensional convolution micro-expression recognition algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A facial expression recognition method based on an improved convolutional neural network; Zou Jiancheng et al.; Journal of North China University of Technology; 2020-04-15 (No. 02); full text *

Also Published As

Publication number Publication date
CN112800891A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112800891B (en) Discriminative feature learning method and system for micro-expression recognition
Giannopoulos et al. Deep learning approaches for facial emotion recognition: A case study on FER-2013
Oyedotun et al. Deep learning in vision-based static hand gesture recognition
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN110276248B (en) Facial expression recognition method based on sample weight distribution and deep learning
Gaddam et al. Human facial emotion detection using deep learning
CN111582136B (en) Expression recognition method and device, electronic equipment and storage medium
Ali et al. Facial emotion detection using neural network
DANDIL et al. Real-time Facial Emotion Classification Using Deep Learning Article Sidebar
CN113392766A (en) Attention mechanism-based facial expression recognition method
Xu et al. Face expression recognition based on convolutional neural network
Santhoshkumar et al. Deep learning approach: emotion recognition from human body movements
Zhao et al. Cbph-net: A small object detector for behavior recognition in classroom scenarios
CN114170659A (en) Facial emotion recognition method based on attention mechanism
Gantayat et al. Study of algorithms and methods on emotion detection from facial expressions: a review from past research
Gupta et al. Performance improvement in handwritten devanagari character classification
Kumar et al. Bird species classification from images using deep learning
Kale et al. Age, gender and ethnicity classification from face images with CNN-based features
Handa et al. Incremental approach for multi-modal face expression recognition system using deep neural networks
Kanungo Analysis of Image Classification Deep Learning Algorithm
Srininvas et al. A framework to recognize the sign language system for deaf and dumb using mining techniques
CN113469116A (en) Face expression recognition method combining LBP (local binary pattern) features and lightweight neural network
Pradeep et al. Recognition of Indian Classical Dance Hand Gestures
Nayak et al. Facial Expression Recognition based on Feature Enhancement and Improved Alexnet
Thiruthuvanathan et al. EMONET: A Cross Database Progressive Deep Network for Facial Expression.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant