CN112200065B - Micro-expression classification method based on action amplification and self-adaptive attention area selection - Google Patents
- Publication number
- CN112200065B (application CN202011070118.3A / CN202011070118A)
- Authority
- CN
- China
- Prior art keywords
- frame
- apex
- micro
- attention area
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Abstract
The invention relates to a micro-expression classification method based on action amplification and self-adaptive attention area selection. First, a micro-expression data set is acquired and the start frame and peak frame are extracted; the extracted start frame and peak frame are then fed into an action amplification network to generate an action-amplified image; the amplified image is then preprocessed; finally, the preprocessed image is recognized with a self-adaptive attention area selection method to obtain the final classification result.
Description
Technical Field
The invention relates to the field of pattern recognition and computer vision, in particular to a micro-expression classification method based on action amplification and self-adaptive attention area selection.
Background
Humans sometimes disguise or hide their emotions; in such cases, no useful information can be obtained from the macroscopic expression of the face. To mine useful information from disguised facial expressions, Ekman identified a transient, involuntary, rapid facial emotion, the micro-expression, which appears involuntarily in the face when a person tries to hide some real emotion. A typical micro-expression lasts 1/25 to 1/5 of a second and usually appears only in a specific part of the face.
Micro-expressions hold great promise for national security, criminal interrogation, and medical applications, but their subtlety and brevity pose a great challenge to the human eye, so in recent years much work has been proposed to recognize micro-expressions automatically with computer vision and machine learning algorithms.
Disclosure of Invention
The invention aims to provide a micro-expression classification method based on action amplification and self-adaptive attention area selection, which can effectively classify micro-expression images.
In order to achieve the purpose, the technical scheme of the invention is as follows: a micro-expression classification method based on action amplification and self-adaptive attention area selection comprises the following steps:
step S1, acquiring a micro expression data set, and extracting a start frame and a peak frame;
step S2, inputting the extracted initial frame and peak frame into an action amplification network to generate an action amplified image;
step S3, preprocessing the amplified image, and dividing a training set and a test set according to an LOSO principle;
step S4, recognizing the preprocessed image by using a self-adaptive attention area selection method to obtain a final classification result.
In an embodiment of the present invention, the step S1 specifically includes the following steps:
step S11, acquiring a micro expression data set, and cutting the image into 224 × 224 images after face alignment;
step S12, extracting the initial frame and the peak value frame directly according to the marked content for the micro expression data set with the initial frame and the peak value frame;
step S13, for a micro-expression data set without start-frame and peak-frame labels, the start frame and the peak frame of the video sequence are extracted with a frame difference method; the frame difference method is as follows: let P = {p_i}, i = 1, 2, …, n, denote the input image sequence, where p_i denotes the i-th input frame; the first frame of the sequence is taken as the start frame, i.e. p_start = p_1; the gray values of the pixels at (x, y) in the first frame and the n-th frame are denoted f1(x, y) and fn(x, y); the gray values of corresponding pixels in the two frames are subtracted and the absolute value is taken to obtain the difference image Dn, Dn(x, y) = |fn(x, y) − f1(x, y)|, and the average inter-frame difference Dn_avg of the difference image is calculated as:

Dn_avg = (Σ_x Σ_y Dn(x, y)) / (Dn.shape[0] · Dn.shape[1])

where Dn.shape[0] denotes the height of the difference image Dn and Dn.shape[1] denotes its width; the average inter-frame difference between the start frame and every other frame is calculated and sorted, and the frame with the largest average inter-frame difference is the peak frame p_apex of the image sequence.
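The frame difference method of steps S13–S14 can be sketched as follows. This is a minimal pure-Python illustration assuming grayscale frames stored as nested lists; the function names are illustrative, not from the patent:

```python
def mean_frame_diff(frame_a, frame_b):
    """Average absolute gray-value difference between two equal-size frames."""
    h, w = len(frame_a), len(frame_a[0])
    total = sum(abs(frame_b[y][x] - frame_a[y][x])
                for y in range(h) for x in range(w))
    return total / (h * w)  # divide by Dn.shape[0] * Dn.shape[1]

def find_apex_index(sequence):
    """Index of the frame with the largest mean difference from the start frame."""
    start = sequence[0]  # p_start = p_1
    return max(range(1, len(sequence)),
               key=lambda i: mean_frame_diff(start, sequence[i]))
```

Sorting all per-frame averages, as the patent describes, gives the same apex frame as taking the maximum directly.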
In an embodiment of the present invention, the step S2 specifically includes the following steps:
step S21, an encoder is designed to extract shape and texture features from the start frame p_start and the peak frame p_apex; the encoder consists of convolution layers and ResBlocks; let T(·) denote the texture-feature extraction function of the encoder, then t = T(p), where t = t_start|apex denotes the texture features of the input frame; let S(·) denote the shape-feature extraction function of the encoder, then s = S(p), where s = s_start|apex denotes the shape features of the input frame;
step S22, an amplifier is designed to amplify the shape features of the start frame p_start and the peak frame p_apex; the convolution layers and activation functions of the neural network simulate the action-amplification effect of a band-pass filter, strengthening the signal at frequencies with large variation intensity and filtering out the noise at frequencies with small variation intensity; let G(·) denote the mapping formed in the amplifier by a k3s1 convolution and the ReLU activation function, and H(·) the mapping formed by a k3s1 convolution and a ResBlock; the final amplified result is:

M(s_start, s_apex, α) = s_start + H(α · G(s_apex − s_start))

where M(·) denotes the mapping function of the amplifier, α denotes the amplification factor, s_start denotes the shape features of the start frame, and s_apex denotes the shape features of the peak frame;

step S23, a decoder is designed to imitate the pyramid reconstruction and fusion process of Lagrangian motion amplification; the decoder is also a small convolutional neural network whose inputs are the texture features t_start|apex and the amplified shape features M(s_start, s_apex, α) produced by the amplifier; the texture features t_start|apex are first upsampled, and the amplified shape features M(s_start, s_apex, α) are then concatenated with them; this is equivalent to amplifying the shape features s_start|apex that need strengthening by α and then superimposing the unamplified texture features t_start|apex back on; afterwards, 9 ResBlocks, one more upsampling, and two k3s1 convolution layers produce the final output.
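The amplification rule M(s_start, s_apex, α) = s_start + H(α · G(s_apex − s_start)) of step S22 can be sketched as follows. In the patent, G(·) and H(·) are small learned convolutional mappings (k3s1 + ReLU, k3s1 + ResBlock); here they are injectable stand-ins defaulting to the identity, so this sketch shows only the data flow, not the learned filters:

```python
def amplify_shape(s_start, s_apex, alpha, G=None, H=None):
    """M(s_start, s_apex, alpha) = s_start + H(alpha * G(s_apex - s_start)).

    s_start / s_apex are flat feature vectors; G and H stand in for the
    amplifier's learned k3s1-conv mappings (identity by default).
    """
    G = G or (lambda v: v)
    H = H or (lambda v: v)
    diff = [a - b for a, b in zip(s_apex, s_start)]   # s_apex - s_start
    boosted = H([alpha * g for g in G(diff)])          # H(alpha * G(.))
    return [s + b for s, b in zip(s_start, boosted)]   # s_start + ...
```

With identity G and H, α = 1 reproduces the peak-frame features exactly, and α > 1 pushes the features beyond them, which is the intuition behind the magnification.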
In an embodiment of the present invention, the step S3 specifically includes the following steps:
step S31, the amplified micro-expression image is sharpened to counter the pixel blurring that amplification may introduce; the calculation is:

a(i, j) = p(i, j) − k_τ · ∇²p(i, j)

where k_τ is a coefficient related to the diffusion effect; here k_τ = 1;
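The sharpening rule a(i, j) = p(i, j) − k_τ · ∇²p(i, j) can be sketched with a 4-neighbour discrete Laplacian; borders are left untouched, and the function name is illustrative:

```python
def laplacian_sharpen(img, k_tau=1.0):
    """a(i,j) = p(i,j) - k_tau * laplacian(p)(i,j), 4-neighbour stencil."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]              # borders copied unchanged
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            lap = (img[i - 1][j] + img[i + 1][j]
                   + img[i][j - 1] + img[i][j + 1]
                   - 4 * img[i][j])
            out[i][j] = img[i][j] - k_tau * lap
    return out
```

Subtracting the Laplacian boosts pixels that differ from their neighbourhood, which is why too large a k_τ makes contours overshoot.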
Step S32, each data set contains several subjects, each subject representing one participant, and each subject contains several micro-expression sequences produced by that participant; following the leave-one-subject-out principle, when the data set is divided, one subject of the data set is taken as the test set at a time and all remaining subjects are combined as the training set, so each data set finally yields Sub_i training/test splits, where Sub_i denotes the number of subjects in the data set.
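The leave-one-subject-out split of step S32 can be sketched as follows (Sub_i splits for Sub_i subjects; the names are illustrative):

```python
def loso_splits(subjects):
    """Yield (train, test) pairs: each subject is held out exactly once."""
    for i, held_out in enumerate(subjects):
        train = subjects[:i] + subjects[i + 1:]
        yield train, [held_out]
```

Because splits are made per subject rather than per sequence, no participant ever appears in both the training and the test set.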
In an embodiment of the present invention, the step S4 specifically includes the following steps:
step S41, designing a self-adaptive attention area selection network to classify the input amplified and preprocessed micro-expression images, wherein the self-adaptive attention area selection network comprises three scales of sub-networks, the three scales of sub-networks have the same structure but different parameters, and each scale of sub-network comprises two modules which are respectively a classification module and an attention area selection module;
the classification module is composed of a convolution layer, an activation layer and a pooling layer and is used for extracting features of the input micro-expression image, and the calculation process is as follows:
c(X) = u(w_i * X)
where X denotes the vector representation of the input image, w_i denotes the parameters of the network layers, and w_i * X is the extracted feature; the function u(·) denotes the final fully connected layer and softmax layer, which yield the probability of each category for the feature;
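The classification rule c(X) = u(w_i * X) can be sketched as a linear map followed by softmax. In the patent the features come from stacked convolution/activation/pooling layers and u(·) is a fully connected layer plus softmax; this pure-Python stand-in collapses the feature extraction into a single linear map, and all names are illustrative:

```python
import math

def softmax(z):
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def classify(x, weights):
    """c(X) = u(w * X): one weight vector per class, then softmax."""
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    return softmax(logits)
```

The output is a probability distribution over micro-expression categories, which is what the final softmax layer of each sub-network produces.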
the attention area selection module is composed of two stacked fully-connected layers, and let e (-) denote the mapping function of the attention area selection module, and the calculation process is as follows:
[l_x, l_y, l_half] = e(w_i * X)
where l_x and l_y denote the coordinates of the center of the region selected by the attention area selection module, and l_half denotes half the side length of the selected attention region;
step S42, each sub-network passes through its classification module and attention area selection module in turn; the input of the next sub-network is the region cropped at the position located by the previous sub-network; the cropping operation is implemented with a rectangular function: first, the top-left and bottom-right corners of the attention region are determined:
l_x(tl) = l_x − l_half
l_y(tl) = l_y − l_half
l_x(br) = l_x + l_half
l_y(br) = l_y + l_half
where l_x(tl) and l_y(tl) denote the horizontal and vertical coordinates of the top-left corner, and l_x(br) and l_y(br) those of the bottom-right corner;
then, the mask N(·) of the attention area is computed as follows:
N(·) = [v(x − l_x(tl)) − v(x − l_x(br))] · [v(y − l_y(tl)) − v(y − l_y(br))]
where N(·) is a two-dimensional square-pulse function built from v(x) = 1/(1 + exp(−k·x)), with k a very large positive number, so that the value of v(x) is determined only by the sign of x; x and y are the horizontal and vertical coordinates in the current image; for x > 0, v(x) ≈ 1, and for x < 0, v(x) ≈ 0; therefore N = 1 if and only if l_x(tl) < x < l_x(br) and l_y(tl) < y < l_y(br), and N = 0 otherwise, so the image can be cropped with the mask matrix N; finally, the cropped result is calculated:
X_att = X ⊙ N(l_x, l_y, l_half)
where ⊙ denotes element-wise multiplication and X_att denotes the cropped result;
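The square-pulse mask and crop of step S42 can be sketched as follows, using the smooth step v(x) = 1/(1 + exp(−k·x)) so the crop stays differentiable. This is a reconstruction consistent with the text; k, the image size, and the function names are illustrative:

```python
import math

def v(x, k=50.0):
    """Smooth step: ~0 for x < 0, ~1 for x > 0; large k approaches a hard step."""
    return 1.0 / (1.0 + math.exp(-k * x))

def attention_mask(h, w, lx, ly, lhalf, k=50.0):
    """N(x,y) ~ 1 inside the square centred at (lx, ly) with half-side lhalf."""
    x_tl, y_tl = lx - lhalf, ly - lhalf   # top-left corner
    x_br, y_br = lx + lhalf, ly + lhalf   # bottom-right corner
    return [[(v(x - x_tl, k) - v(x - x_br, k))
             * (v(y - y_tl, k) - v(y - y_br, k))
             for x in range(w)] for y in range(h)]

def soft_crop(image, lx, ly, lhalf):
    """X_att = X (*) N(lx, ly, lhalf): element-wise product with the mask."""
    mask = attention_mask(len(image), len(image[0]), lx, ly, lhalf)
    return [[p * m for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]
```

Unlike a hard crop, this mask has nonzero gradients with respect to l_x, l_y, and l_half, which is what lets the rectangle's parameters be optimized by back-propagation.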
step S43, during training, the classification module is initialized with the parameters of a pre-trained VGGNet, the attention area selection module is initialized with the region of highest response in the last convolution layer of the classification module, and the two modules are then trained iteratively until convergence to obtain the final result.
Compared with the prior art, the invention has the following beneficial effects:
1. The micro-expression classification method based on action amplification and self-adaptive attention area selection constructed by the invention can effectively classify micro-expression images and improves the classification performance on micro-expression images.
2. The method generates the action-amplified result between two frames with a convolutional neural network; compared with traditional action amplification methods it produces less noise and edge blurring, and is more robust with better performance.
3. For the problem that traditional micro-expression recognition requires local-area attention through strictly aligned face segmentation, the invention provides a self-adaptive attention area discovery method.
Drawings
Fig. 1 is a schematic diagram of the principle of the present invention.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a micro-expression classification method based on motion amplification and adaptive attention area selection, which specifically includes the following steps:
step S1: acquiring a micro expression data set, and extracting a start frame and a peak frame;
step S2: inputting the extracted initial frame and the extracted peak frame into an action amplification network to generate an action amplified image;
step S3: preprocessing the amplified image, and dividing a training set and a testing set according to an LOSO principle;
step S4: and identifying the preprocessed image by using a self-adaptive attention area selection method to obtain a final classification result.
In this embodiment, the step S1 includes the following steps:
step S11: acquiring a micro expression data set, aligning the face, and uniformly cutting the face into 224 × 224 sizes;
step S12: for a micro expression data set with initial frame and peak frame labels, extracting the initial frame and the peak frame directly according to the labeled contents;
step S13: extracting the initial frame and the peak frame of the video sequence by using a frame difference method for the micro expression data set which is not marked by the initial frame and the peak frame;
step S14: the frame difference method is as follows: let P = {p_i}, i = 1, 2, …, n, denote the input image sequence, where p_i denotes the i-th input frame; the first frame of the sequence is taken as the start frame, i.e. p_start = p_1; the gray values of the pixels at (x, y) in the first frame and the n-th frame are denoted f1(x, y) and fn(x, y); the gray values of corresponding pixels in the two frames are subtracted and the absolute value is taken to obtain the difference image Dn, Dn(x, y) = |fn(x, y) − f1(x, y)|, and the average inter-frame difference Dn_avg of the difference image is calculated as:

Dn_avg = (Σ_x Σ_y Dn(x, y)) / (Dn.shape[0] · Dn.shape[1])

where Dn.shape[0] denotes the height of the difference image Dn and Dn.shape[1] denotes its width. The average inter-frame difference between the start frame and every other frame is calculated and sorted, and the frame with the largest average inter-frame difference is the peak frame p_apex of the image sequence;
In this embodiment, step S2 specifically includes the following steps:
step S21: an encoder is designed to extract shape and texture features from the input start frame p_start and peak frame p_apex; the encoder mainly comprises convolution layers and ResBlocks; let T(·) denote the texture-feature extraction function of the encoder, then t = T(p), where t = t_start|apex denotes the texture features of the input frame; let S(·) denote the shape-feature extraction function of the encoder, then s = S(p), where s = s_start|apex denotes the shape features of the input frame;
step S22: an amplifier is designed to amplify the shape features of the start frame p_start and the peak frame p_apex, mainly simulating the action-amplification effect of a band-pass filter through the convolution layers and activation functions of the neural network, strengthening the signal at frequencies with large variation intensity and filtering out the noise at frequencies with small variation intensity; let G(·) denote the mapping formed in the amplifier by a k3s1 convolution and the ReLU activation function, and H(·) the mapping formed by a k3s1 convolution and a ResBlock; the final amplified result is

M(s_start, s_apex, α) = s_start + H(α · G(s_apex − s_start))

where M(·) denotes the mapping function of the amplifier and α denotes the amplification factor;

step S23: a decoder is designed to imitate the pyramid reconstruction and fusion process of Lagrangian motion amplification; the decoder is also a small convolutional neural network whose inputs are the texture features t_start|apex and the coarsely amplified shape features M(s_start, s_apex, α) produced by the amplifier; the texture features t_start|apex are first upsampled, and the amplified shape features M(s_start, s_apex, α) are then concatenated with them; this is equivalent to amplifying the shape features s_start|apex that need strengthening by α and superimposing the unamplified texture features t_start|apex back on, thereby suppressing noise that might be introduced; then 9 ResBlocks, one more upsampling, and two k3s1 convolution layers produce the final output; ResBlock alleviates the vanishing-gradient problem so that the network back-propagates well; moreover, generating the action-amplified result between two frames with a neural network gives a more robust amplification effect and better performance than traditional action amplification methods, which introduce noise and edge blurring;
in this embodiment, step S3 specifically includes the following steps:
step S31: the amplified micro-expression image is sharpened to counter the pixel blurring that amplification may introduce; the calculation is:

a(i, j) = p(i, j) − k_τ · ∇²p(i, j)

where k_τ is a coefficient related to the diffusion effect. The coefficient should be chosen reasonably: if k_τ is too large, the image contours overshoot; if k_τ is too small, the sharpening effect is not obvious. In this algorithm, k_τ = 1.
Step S32: according to the principle of leave-one-leave-out, when dividing the data set, one leave of one data set is taken as a test set at a time, and all the other leave are combined together to be taken as a training set, so that finally, the data set is subjected to one data setThe Sub can be obtained by the collection i A training set and a test set, Sub i Representing the number of subjects in a data set.
In this embodiment, step S4 specifically includes the following steps:
step S41: the self-adaptive attention area selection network is designed to classify the input amplified and preprocessed micro expression images, and mainly comprises three-scale sub-networks, wherein the three-scale sub-networks have the same structure but different parameters, and each scale sub-network comprises two modules which are respectively a classification module and an attention area selection module;
step S42: the classification module mainly comprises a plurality of convolution layers, an activation layer and a pooling layer and is used for extracting the characteristics of the input micro-expression image, and the calculation process is as follows
c(X) = u(w_i * X)
Where X denotes the vector representation of the input image, w_i denotes the parameters of some network layers, and w_i * X is the extracted feature; the function u(·) denotes the final fully connected layer and softmax layer, which yield the probability of each category for the feature;
step S43: the attention area selection module mainly comprises two stacked fully-connected layers; let e(·) denote the mapping function of the attention area selection module; the calculation process is as follows:
[l_x, l_y, l_half] = e(w_i * X)
where l_x and l_y denote the coordinates of the center of the region selected by the attention area selection module, and l_half denotes half the side length of the selected attention region;
step S44: each sub-network passes through its classification module and attention area selection module in turn; the input of the next sub-network is the region cropped at the position located by the previous sub-network; the cropping operation is implemented with a rectangular function: first, the top-left and bottom-right corners of the attention region are determined:
l_x(tl) = l_x − l_half
l_y(tl) = l_y − l_half
l_x(br) = l_x + l_half
l_y(br) = l_y + l_half
where l_x(tl) and l_y(tl) denote the horizontal and vertical coordinates of the top-left corner, and l_x(br) and l_y(br) those of the bottom-right corner;
then, the mask N(·) of the attention area is computed as follows:
N(·) = [v(x − l_x(tl)) − v(x − l_x(br))] · [v(y − l_y(tl)) − v(y − l_y(br))]
where N(·) is a two-dimensional square-pulse function built from v(x) = 1/(1 + exp(−k·x)), with k a very large positive number, so that the value of v(x) is determined only by the sign of x; x and y are the horizontal and vertical coordinates in the current image; for x > 0, v(x) ≈ 1, and for x < 0, v(x) ≈ 0; therefore N = 1 if and only if l_x(tl) < x < l_x(br) and l_y(tl) < y < l_y(br), and N = 0 otherwise, so the image can be cropped with the mask matrix N. Finally, the cropped result is calculated:
X_att = X ⊙ N(l_x, l_y, l_half)
where ⊙ denotes element-wise multiplication and X_att denotes the cropped result. The advantage of the rectangular function is that it behaves like a direct crop yet remains differentiable, so during optimization gradients can back-propagate through it to optimize the parameters of the rectangular box;
step S45: during training, the classification module is initialized with the parameters of a pre-trained VGGNet, the attention area selection module is initialized with the region of highest response in the last convolution layer of the classification module, and the two modules are trained iteratively until convergence to obtain the final result. With this self-adaptive attention-area discovery method, the same expression is analyzed at different scales from coarse to fine to locate the region that finally discriminates between micro-expressions, which alleviates the difficulty of identifying and localizing key facial regions and improves the classification performance.
The above are preferred embodiments of the present invention; all changes made according to the technical scheme of the present invention that produce equivalent functional effects without exceeding the scope of the technical scheme of the present invention fall within the protection scope of the present invention.
Claims (3)
1. A micro-expression classification method based on action amplification and self-adaptive attention area selection is characterized by comprising the following steps:
step S1, acquiring a micro expression data set, and extracting a start frame and a peak frame;
step S2, inputting the extracted initial frame and peak frame into an action amplification network to generate an action amplified image;
step S3, preprocessing the amplified image, and dividing a training set and a test set according to the principle of leave-one-subject-out;
step S4, recognizing the preprocessed image by using a self-adaptive attention area selection method to obtain a final classification result;
the step S2 specifically includes the following steps:
step S21, an encoder is designed to extract shape and texture features from the start frame p_start and the peak frame p_apex; the encoder consists of convolution layers and ResBlocks; let T(·) denote the texture-feature extraction function of the encoder, then t = T(p), where t = t_start|apex denotes the texture features of the input frame; let S(·) denote the shape-feature extraction function of the encoder, then s = S(p), where s = s_start|apex denotes the shape features of the input frame;
step S22, an amplifier is designed to amplify the shape features of the start frame p_start and the peak frame p_apex; the convolution layers and activation functions of the neural network simulate the action-amplification effect of a band-pass filter, strengthening the signal at frequencies with large variation intensity and filtering out the noise at frequencies with small variation intensity; let G(·) denote the mapping formed in the amplifier by a k3s1 convolution and the ReLU activation function, and H(·) the mapping formed by a k3s1 convolution and a ResBlock; the final amplified result is:
M(s_start, s_apex, α) = s_start + H(α · G(s_apex − s_start))
where M(·) denotes the mapping function of the amplifier, α denotes the amplification factor, s_start denotes the shape features of the start frame, and s_apex denotes the shape features of the peak frame;
step S23, a decoder is designed to imitate the pyramid reconstruction and fusion process of Lagrangian motion amplification; the decoder is also a small convolutional neural network whose inputs are the texture features t_start|apex and the amplified shape features M(s_start, s_apex, α) produced by the amplifier; the texture features t_start|apex are first upsampled, and the amplified shape features M(s_start, s_apex, α) are then concatenated with them; this is equivalent to amplifying the shape features s_start|apex that need strengthening by α and superimposing the unamplified texture features t_start|apex back on, thereby suppressing noise that might be introduced; then 9 ResBlocks, one more upsampling, and two k3s1 convolution layers produce the final output;
the step S4 specifically includes the following steps:
step S41, designing a self-adaptive attention area selection network to classify the input amplified and preprocessed micro-expression images, wherein the self-adaptive attention area selection network comprises three scales of sub-networks, the three scales of sub-networks have the same structure but different parameters, and each scale of sub-network comprises two modules which are a classification module and an attention area selection module respectively;
the classification module is composed of a convolution layer, an activation layer and a pooling layer and is used for extracting features of the input micro-expression image, and the calculation process is as follows:
c(X) = u(w_i * X)
where X denotes the vector representation of the input image, w_i denotes the parameters of the network layers, and w_i * X is the extracted feature; the function u(·) denotes the final fully connected layer and softmax layer, which yield the probability of each category for the feature;
the attention area selection module is composed of two stacked fully-connected layers, and let e (-) denote the mapping function of the attention area selection module, and the calculation process is as follows:
[l_x, l_y, l_half] = e(w_i * X)
where l_x and l_y denote the coordinates of the center of the region selected by the attention area selection module, and l_half denotes half the side length of the selected attention region;
step S42, for each sub-network, after passing through the classification module and the attention area selection module, the input of the next sub-network is the crop of the region located by the previous sub-network; the cropping operation is implemented with a rectangular function, first determining the top-left and bottom-right corners of the attention region:

l_x(tl) = l_x - l_half
l_y(tl) = l_y - l_half
l_x(br) = l_x + l_half
l_y(br) = l_y + l_half

wherein l_x(tl) and l_y(tl) are the horizontal and vertical coordinates of the top-left corner, and l_x(br) and l_y(br) are the horizontal and vertical coordinates of the bottom-right corner;
then the mask N(·) of the attention region is computed; the calculation process is as follows:

N(·) = [v(x - l_x(tl)) - v(x - l_x(br))] · [v(y - l_y(tl)) - v(y - l_y(br))]

where N(·) is a two-dimensional square-pulse function, x and y are the horizontal and vertical coordinates in the current image, and v is a step-like function (e.g. a logistic function with slope k) in which k is a very large positive number, so that the value of v(x) is determined only by the sign of x: v(x) ≈ 1 for x > 0 and v(x) ≈ 0 for x < 0; thus N = 1 if and only if l_x(tl) < x < l_x(br) and l_y(tl) < y < l_y(br), and N = 0 otherwise, so the picture can be cropped with the mask matrix N; finally the cropped result is computed:

X_att = X ⊙ N(l_x, l_y, l_half)

where ⊙ denotes element-by-element multiplication and X_att is the cropped result;
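The mask-based cropping of step S42 can be reproduced directly in numpy. A logistic function is used for the step-like v (an assumption consistent with the claim's description of a large k); the clipping of the exponent only avoids floating-point overflow:

```python
import numpy as np

def v(x, k=100.0):
    """Step-like logistic: ~1 for x > 0, ~0 for x < 0 when k is large."""
    z = np.clip(k * x, -500.0, 500.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def attention_mask(h, w, lx, ly, lhalf):
    """Two-dimensional square-pulse mask N over an h x w image."""
    ys, xs = np.mgrid[0:h, 0:w]
    nx = v(xs - (lx - lhalf)) - v(xs - (lx + lhalf))
    ny = v(ys - (ly - lhalf)) - v(ys - (ly + lhalf))
    return nx * ny

X = np.ones((20, 20))                               # toy image
N = attention_mask(20, 20, lx=10, ly=10, lhalf=4)
X_att = X * N  # element-wise product keeps only the attention region
```

Because v is smooth, the crop stays differentiable, which is what allows the attention region selection module to be trained by back-propagation.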
step S43, during training, the classification module is initialised with the parameters of a pre-trained VGGNet, the attention area selection module is initialised from the region of highest response in the last convolutional layer of the classification module, and the two modules are then trained iteratively until convergence to obtain the final result.
2. The micro-expression classification method based on motion amplification and adaptive attention area selection as claimed in claim 1, wherein the step S1 specifically comprises the following steps:
step S11, acquiring a micro-expression data set and, after face alignment, cropping the images to 224 × 224;
step S12, for micro-expression data sets annotated with initial and peak frames, extracting the initial frame and the peak frame directly from the annotations;
step S13, for micro-expression data sets without annotated initial and peak frames, extracting the initial frame and the peak frame of the video sequence with a frame difference method, which proceeds as follows: let P = {p_i}, i = 1, 2, ..., n, denote the input image sequence, where p_i is the i-th input picture; the first frame of the sequence is taken as the initial frame, i.e. p_start = p_1; denote the grey values of corresponding pixels in the first frame and the n-th frame of the video sequence by f1(x, y) and fn(x, y); subtracting the grey values of corresponding pixels in the two frames and taking the absolute value gives the difference image Dn, with Dn(x, y) = |fn(x, y) - f1(x, y)|; the average inter-frame difference Dn_avg of the difference image is then calculated as

Dn_avg = (Σ_x Σ_y Dn(x, y)) / (Dn.shape[0] × Dn.shape[1])

wherein Dn.shape[0] is the height of the difference image Dn and Dn.shape[1] is its width; the average inter-frame difference between the initial frame and every other frame is computed and sorted, and the frame with the largest average inter-frame difference is the peak frame p_apex of the image sequence.
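The frame difference method of step S13 reduces to a few lines of numpy. This sketch assumes grey-scale frames as 2-D arrays; the function name `find_apex` is illustrative:

```python
import numpy as np

def find_apex(frames):
    """Locate the apex (peak) frame of a grey-scale sequence by the
    frame difference method of step S13: the frame whose mean absolute
    difference from the first (initial) frame is largest."""
    f1 = frames[0].astype(np.float64)
    diffs = [np.abs(f.astype(np.float64) - f1).mean() for f in frames[1:]]
    return 1 + int(np.argmax(diffs))  # index of the apex frame in the sequence

# toy sequence: the second frame deviates most from the initial frame
seq = [np.zeros((4, 4)), np.full((4, 4), 10.0), np.full((4, 4), 3.0)]
print(find_apex(seq))  # 1
```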
3. The micro-expression classification method based on action amplification and adaptive attention area selection according to claim 1, wherein the step S3 specifically comprises the following steps:
step S31, sharpening the amplified micro-expression image; in the sharpening computation, k_τ is the coefficient associated with the diffusion effect, with k_τ = 1;
step S32, each data set contains multiple subjects, each subject representing one participant, and each subject contains multiple micro-expression sequences produced by that participant; following the leave-one-subject-out principle, when dividing a data set, one subject is taken as the test set and all remaining subjects together form the training set, so that each data set ultimately yields Sub_i training-set/test-set partitions, wherein Sub_i is the number of subjects in that data set.
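The leave-one-subject-out protocol of step S32 can be sketched as below. The dict-of-lists representation of the data set and the function name `loso_splits` are hypothetical conveniences, not part of the claim:

```python
def loso_splits(data):
    """Leave-one-subject-out: each subject in turn becomes the test set,
    and all remaining subjects together form the training set.
    `data` maps subject id -> list of micro-expression sequences."""
    subjects = sorted(data)
    for test_sub in subjects:
        train = [seq for s in subjects if s != test_sub for seq in data[s]]
        test = list(data[test_sub])
        yield test_sub, train, test

data = {"sub1": ["a", "b"], "sub2": ["c"], "sub3": ["d", "e"]}
splits = list(loso_splits(data))
print(len(splits))  # 3 partitions, one per subject (Sub_i = 3)
```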
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011070118.3A CN112200065B (en) | 2020-10-09 | 2020-10-09 | Micro-expression classification method based on action amplification and self-adaptive attention area selection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112200065A CN112200065A (en) | 2021-01-08 |
CN112200065B true CN112200065B (en) | 2022-08-09 |
Family
ID=74013087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011070118.3A Active CN112200065B (en) | 2020-10-09 | 2020-10-09 | Micro-expression classification method based on action amplification and self-adaptive attention area selection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112200065B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112950922B (en) * | 2021-01-26 | 2022-06-10 | 浙江得图网络有限公司 | Fixed-point returning method for sharing electric vehicle |
CN115049957A (en) * | 2022-05-31 | 2022-09-13 | 东南大学 | Micro-expression identification method and device based on contrast amplification network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409222A (en) * | 2018-09-20 | 2019-03-01 | 中国地质大学(武汉) | A kind of multi-angle of view facial expression recognizing method based on mobile terminal |
CN110175596A (en) * | 2019-06-04 | 2019-08-27 | 重庆邮电大学 | The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks |
CN110287805A (en) * | 2019-05-31 | 2019-09-27 | 东南大学 | Micro- expression recognition method and system based on three stream convolutional neural networks |
CN110516571A (en) * | 2019-08-16 | 2019-11-29 | 东南大学 | Inter-library micro- expression recognition method and device based on light stream attention neural network |
CN110580461A (en) * | 2019-08-29 | 2019-12-17 | 桂林电子科技大学 | Facial expression recognition algorithm combined with multilevel convolution characteristic pyramid |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11361225B2 (en) * | 2018-12-18 | 2022-06-14 | Microsoft Technology Licensing, Llc | Neural network architecture for attention based efficient model adaptation |
Non-Patent Citations (4)
Title |
---|
Facial Micro-expression Recognition with Adaptive Video Motion Magnification; Zhilin Lei et al.; International Conference in Communications, Signal Processing, and Systems; 20200404; pp. 2107-2116 *
Real-time facial expression and gender classification based on depthwise separable convolutional neural networks; Liu Shangwang et al.; 《计算机应用》 (Journal of Computer Applications); 20200410 (No. 04); pp. 990-995 *
Research on facial expression recognition algorithms based on deep learning; Xia Tian; China Masters' Theses Full-text Database, Information Science and Technology (monthly); 20200615 (No. 06); pp. 1-82 *
Fine-grained expression recognition with attention bilinear pooling based on feature fusion; Liu Liyuan et al.; Journal of Ludong University (Natural Science Edition); 20200430; Vol. 36 (No. 02); pp. 130-136 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103116763B (en) | A kind of living body faces detection method based on hsv color Spatial Statistical Character | |
CN111445410B (en) | Texture enhancement method, device and equipment based on texture image and storage medium | |
CN107729820B (en) | Finger vein identification method based on multi-scale HOG | |
CN111967427A (en) | Fake face video identification method, system and readable storage medium | |
CN110728209A (en) | Gesture recognition method and device, electronic equipment and storage medium | |
CN111667400B (en) | Human face contour feature stylization generation method based on unsupervised learning | |
CN112200065B (en) | Micro-expression classification method based on action amplification and self-adaptive attention area selection | |
CN111967363B (en) | Emotion prediction method based on micro-expression recognition and eye movement tracking | |
CN113537008B (en) | Micro expression recognition method based on self-adaptive motion amplification and convolutional neural network | |
CN111476178A (en) | Micro-expression recognition method based on 2D-3D CNN | |
CN111476727B (en) | Video motion enhancement method for face-changing video detection | |
CN111178130A (en) | Face recognition method, system and readable storage medium based on deep learning | |
CN113822157A (en) | Mask wearing face recognition method based on multi-branch network and image restoration | |
CN107506713A (en) | Living body faces detection method and storage device | |
CN112396036A (en) | Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction | |
CN112183419A (en) | Micro-expression classification method based on optical flow generation network and reordering | |
CN109522865A (en) | A kind of characteristic weighing fusion face identification method based on deep neural network | |
CN112861588B (en) | Living body detection method and device | |
CN116311403A (en) | Finger vein recognition method of lightweight convolutional neural network based on FECAGhostNet | |
CN109165551B (en) | Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics | |
Zabihi et al. | Vessel extraction of conjunctival images using LBPs and ANFIS | |
CN115984919A (en) | Micro-expression recognition method and system | |
CN116030516A (en) | Micro-expression recognition method and device based on multi-task learning and global circular convolution | |
CN104850861A (en) | Fungal keratitis image recognition method based on RX anomaly detection and texture analysis | |
CN115188039A (en) | Depth forgery video technology tracing method based on image frequency domain information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||