CN112818764A

CN112818764A - Low-resolution image facial expression recognition method based on feature reconstruction model

Info

Publication number: CN112818764A
Application number: CN202110055946.8A
Authority: CN
Inventors: 田锋; 经纬; 南方; 洪振鑫; 郑庆华
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-01-15
Filing date: 2021-01-15
Publication date: 2021-05-18
Anticipated expiration: 2041-01-15
Also published as: CN112818764B

Abstract

The invention discloses a low-resolution image facial expression recognition method based on a feature reconstruction model, and belongs to the field of facial image expression recognition. The method comprises the steps of constructing a training and testing data set; then training a facial expression recognition model of the feature reconstruction model, extracting image expression features by using a feature extraction network with fixed parameters, then obtaining an expression feature generator and a feature discriminator by adopting a generation confrontation network mode training model, and using FSRG as inputImage reconstruction feature yields F_SR(ii) a Classifier pair feature F consisting of fully connected network and softmax function layer_SRClassifying, and re-weighting the sample loss by using the probability value of the correct category corresponding to the sample output by the softmax layer; the method is insensitive to the resolution of the input image, improves the identification accuracy rate under lower resolution, and has more stable identification effect on each resolution.

Description

Low-resolution image facial expression recognition method based on feature reconstruction model

Technical Field

The invention belongs to the field of facial image expression recognition, and particularly relates to a low-resolution image facial expression recognition method based on a feature reconstruction model.

Background

Facial expressions are one of the most direct, natural signals that humans express emotions. Facial expression recognition is a hot topic of researches such as human-computer natural interaction, computer vision, emotion calculation, image processing and the like, and is widely applied to the fields of human-computer interaction, remote education, security, intelligent robot development, medical treatment, animation production and the like.

Under different scenes, due to the change of equipment and environment and the imaging principle of a pinhole camera, the face images of people under the multi-person photographic scene have the problem of different resolutions, and the images can be compressed in network transmission and storage, so that the quality and the resolution of the images are reduced. The recognition accuracy of the algorithm can be severely impacted in low resolution scenarios. In order to more accurately recognize the expression of a person, it is necessary to reduce the influence of the resolution change. With the development of technologies such as deep learning and image super-resolution, when processing low-resolution input images, methods of performing super-resolution reconstruction on the images and then performing recognition are often used. The method of reconstructing an image has the following disadvantages, first: although the method is improved compared with the method of directly using the low-resolution image to recognize the expression, the method causes the problems of a large amount of calculation, unstable effect and the like. Secondly, the method comprises the following steps: since the expression recognition target is a human face, the problem of privacy disclosure is easily caused by high-resolution reconstruction of a human face image, and this point is increasingly emphasized in international research.

Disclosure of Invention

The invention aims to overcome the defects of large calculated amount and easy privacy disclosure of reconstructed face images and provides a low-resolution image facial expression recognition method based on a feature reconstruction model.

In order to achieve the purpose, the invention adopts the following technical scheme to realize the purpose:

a low-resolution image facial expression recognition method based on a feature reconstruction model comprises the following steps:

1) collecting facial expression images with resolution ratio more than or equal to 100x100 pixels and labeling expression types as original images I_HR(ii) a Carrying out 2-8 times integer multiplying factor down-sampling on the original image to obtain a corresponding low-resolution image, wherein the expression category label of the low-resolution image is consistent with the original image; dividing an original image and a corresponding low-resolution image into a training set and a test set;

2) training a neural network model by adopting a generative confrontation network method;

inputting an original image and low-resolution images of respective magnifications into a feature extractor E, and extracting and calculating a feature matrix F of the original image by the feature extractor E_HRAnd low resolution image feature matrix F of each magnification_LR；

Low resolution image feature matrix F_LRInputting the data into an expression feature generator FSRG, and outputting a generated reconstruction feature matrix F_SR；

Feature matrix F of the original image_HRAnd corresponding reconstructed feature matrix F of the low resolution image_SRInputting the data into a characteristic discriminator FSRD, comparing the difference of the two in a distribution space, and optimizing the characteristic discriminator FSRG through back propagation;

reconstruction of the expressive features F_SRInputting the samples into a double-layer fully-connected expression classifier C for classification, calculating the probability of the samples being classified into various categories by the expression classifier C, calculating a weight coefficient by using the probability value of each sample being correctly classified to carry out weight weighting on the loss of the samples, and accelerating the convergence of a neural network;

repeating the training process until a trained neural network model is obtained;

3) inputting the facial image of the expression to be recognized into a trained neural network model, and extracting an input image by a feature extractor EAn image feature matrix F, a feature generator FSRG for generating a reconstructed feature matrix F_SRAnd the classifier C calculates and outputs the class label of the recognition result.

Further, the feature extractor E in step 2) is formed by combining a plurality of convolution layers and nonlinear activation layers, and is a feature extraction part of the expression recognition model pre-trained by the original image data set.

Further, the process of extracting the features in the feature extractor E in step 2) is as follows:

extracting a three-dimensional feature tensor T for the input image I, wherein the size of the feature tensor T is w x h x n, w and h are the length and width of the feature tensor, and n is the number of channels;

calculating a covariance matrix M of the feature tensor T:

wherein ,f_iOne channel representing the characteristic tensor T,

for the mean value of the channels of the feature tensor, M ∈^n*nN is the number of channels of the feature tensor T;

correcting the characteristic value of the covariance matrix M to obtain a corrected covariance matrix M⁺：

M⁺＝M+λ*trace(M)*I (2)

Where λ is a coefficient greater than zero, I is an identity matrix, trace (M) is the trace of matrix M;

covariance matrix M for rectification⁺And performing pooling operation and logarithm of the characteristic value to obtain a characteristic matrix F.

Further, the corrected covariance matrix M⁺Performing pooling operation and logarithm of the characteristic value to obtain a characteristic matrix, wherein the process of obtaining the characteristic matrix is as follows:

F_cov＝WM⁺W^T (3)

wherein ,

for pooling parameter matrices, matrices

To F_covPerforming eigenvalue decomposition and eigenvalue correction to obtain a matrix F⁺The method comprises the following specific operations:

F_cov＝U₁Σ₁U₁ ^T (4)

F⁺＝U₁max(εI,Σ₁)U₁ ^T (5)

wherein max () is the maximum value of the corresponding elements of the two matrices;

to F⁺Performing eigenvalue decomposition and logarithm of the eigenvalue to obtain an eigen matrix F, specifically:

F⁺＝U₂Σ₂U₂ ^T (6)

F＝U₂log(Σ₂)U₂ ^T (7)

wherein, log (Σ)₂) Finger-to-eigenvalue matrix sigma₂Is logarithmic.

Further, the feature generator FSRG in step 2) is a full convolution network, and is composed of a convolution neural network and a nonlinear activation layer, and the process of reconstructing the feature matrix by the feature generator FSRG is as follows:

feature matrix F with low resolution images_LRFor input, the reconstructed feature matrix F is output_SRThe dimension of the matrix before and after reconstruction is consistent.

Further, the feature discriminator FSRD in step 2) compares the difference between the two in the distribution space, specifically:

the feature discriminator FSRD uses the feature matrix F corresponding to the same image_SR and F_HRAs an input, the corresponding scores are output, and the absolute value of the difference between the scores represents the Wasserstein distance of the two in the feature space.

Further, step 2) of trainingIn the process, the loss function of the feature generator FSRG is formed by the penalty L_GANFeature matrix F_SR and F_HRThe perceptual loss L between_PAnd two-norm loss L₂Composition is carried out;

against loss L_GANComprises the following steps:

wherein b is the size of the data batch;

loss of feature perception L_PComprises the following steps:

wherein ,C_FC() Represents the output of the last fully connected layer of classifier C;

two norm loss L₂Comprises the following steps:

the loss of the feature generator FSRG is a linear sum of the three:

L_FSRG＝L_GAN+λ₁L_P+λ₂L₂ (11)

wherein ,λ₁ and λ₂Are all weight coefficients greater than zero.

Further, in the training process of step 2), the loss function of the feature discriminator FSRD is:

wherein ,

theta is a random number between 0 and 1, and ensures data of each batchIn

Is F_SR and F_HRThe linear interpolation result of (2); p and k are each

The exponential parameter and the coefficient parameter of the term.

Further, the expression classifier C in the step 2) calculates that the sample belongs to each Class by using softmax_iThe loss of the real category is reweighted by using the probability value corresponding to the real category, wherein the probability value i is 1.

w＝(σ-logit)^r (13)

Where, logic is the probability that the sample output by the softmax function corresponds to its true class, and the parameters σ and r are set to 1.5 and 2, respectively.

Compared with the prior art, the invention has the following beneficial effects:

the low-resolution image facial expression recognition method based on the feature reconstruction model comprises the steps of constructing a training and testing data set, carrying out different-magnification down-sampling on a high-resolution facial expression image to generate a plurality of high-low-resolution image pairs with multiple magnifications, and simultaneously keeping a category label; then training a facial expression recognition model of the feature reconstruction model, and extracting high-resolution image expression features F by using a feature extraction network with fixed parameters_HRAnd corresponding low resolution image expressive features F_LR(ii) a Then, an expression feature generator FSRG and a feature discriminator FSRD are obtained by adopting a training model in a mode of generating an antagonistic network, and the FSRG is used for reconstructing features of an input image to obtain F_SR(ii) a Classifier C consisting of fully connected network and softmax function layer pairs feature F_SRClassifying, and re-weighting the sample loss by using the probability value of the correct category corresponding to the sample output by the softmax layer, so as to accelerate the convergence of the model; identification processComprises the following steps: the model extracts a feature matrix F of the input image, and then a feature generator FSRG generates a reconstructed feature matrix F_SRAnd calculating and outputting a class label of the recognition result by using the classifier C obtained by training. The invention provides a method for generating a network by combining deep learning countermeasure and reconstructing image characteristics to identify facial expressions. Compared with the traditional method, the method is insensitive to the resolution of the input image, and the identification accuracy under the lower resolution is improved; compared with a method for reconstructing an image, the method has more stable identification effect on each resolution, can avoid the problems of increased calculated amount and possible privacy disclosure caused by the reconstructed image, and has great industrial application value.

Drawings

FIG. 1 is an overall network of a low-resolution image facial expression recognition method based on a feature reconstruction model according to the present invention;

FIG. 2 is a network structure of a feature extractor of the present invention;

FIG. 3 is a network structure of the feature generation part of the present invention, wherein FIG. 3(a) is a feature generator network structure and FIG. 3(b) is a dense connection block structure;

fig. 4 is a network structure of the feature discriminator of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention provides a method for generating a network by combining deep learning countermeasure and reconstructing image characteristics to identify facial expressions. Compared with the traditional method, the method disclosed by the invention is insensitive to the resolution of the input image, and the identification accuracy under a lower resolution is improved; compared with a method for reconstructing an image, the method has more stable recognition effect on each resolution, can avoid the problems of increased calculated amount and possible privacy disclosure caused by the reconstructed image, and has great industrial application value in the fields of education analysis and management and entertainment.

The invention is described in further detail below with reference to the accompanying drawings:

referring to fig. 1, fig. 1 is an overall network of the low-resolution image facial expression recognition method based on the feature reconstruction model of the present invention; the network comprises four main parts, namely a feature extractor, a feature generator, a feature discriminator and an expression classifier.

Referring to fig. 2, fig. 2 is a network structure of the feature extractor of the present invention; the feature extractor contains six convolutional layers (Conv Layer) with convolutional kernels of 3x3 and step size of 1. The number of output characteristic channels of each convolutional layer is 64, 96, 128 and 256 in sequence, each convolutional layer is followed by an activation layer, and the activation function is a ReLu function. After each of the first, second and fourth active layers there is a pooling Layer, using a maximum pooling (Max boosting Layer), with a pooling window size of 2x2, with a step size of 2.

Referring to fig. 3, fig. 3(a) is a feature generator network structure and fig. 3(b) is a dense connection block structure; the method is composed of three cascade of dense connection blocks and residual connection between input and output, wherein the structure of a single dense block is shown in fig. 3(b), and comprises five convolution layer-batch normalization layer (BN) combinations, and dense connection is adopted between groups and an lreul function is added to serve as an active layer.

Referring to fig. 4, fig. 4 is a network structure of the feature discriminator of the present invention, which is formed by cascading five rolling blocks and two fully connected layers, where the number of output channels of each rolling block is sequentially 8, 16, 32, 64, and 64; each convolution block is formed by alternately arranging two convolution layers and two active layers, wherein the convolution kernel of the former convolution layer is 3x3 and has the step length of 1, the convolution kernel of the latter convolution layer is 5x5 and has the step length of 2; the output dimensions of the last two fully connected layers are, in turn, 100 and 1 dimensions.

The invention discloses a low-resolution image facial expression recognition method based on a feature reconstruction model, which comprises the following implementation processes:

a model training part:

step 1: collecting facial expression images with resolution ratio more than or equal to 100x100 pixels and labeling expression types as original images I_HR(ii) a Adopting bicubic interpolation mode to make 2-8 times integer multiplying power down-sampling for original image (the length and width of image are changed into original resolution ratio)

To

) Obtaining a plurality of low resolution images (I)_LR-2To I_LR-8) (ii) a The expression category label of the low-resolution image is consistent with the original image;

step 2: feature extractor E using fixed parameter pre-training extracts feature matrix F of original resolution image_HRFeature matrix F of low-resolution image corresponding to each magnification_LRThe feature extractor E includes a convolutional layer and a nonlinear active layer. The method comprises the steps that a high-low resolution image pair is input at one time, for one image I, a corresponding three-dimensional feature tensor T is extracted by using a feature extractor, the size of the feature tensor T is w x h x n, w and h are the length and the width of the corresponding feature tensor, and n is the number of channels;

and step 3: calculating the covariance matrix of the respective feature tensors T:

wherein ,f_iOne channel representing the characteristic tensor T,

for the mean value of the channels of the feature tensor, M ∈^n*nAnd n is the number of channels of the feature tensor T.

And 4, step 4: in order to ensure the positive nature of the matrix, the eigenvalue correction is carried out on each covariance matrix:

M⁺＝M+λ*trace(M)*I (2)

wherein, λ is a coefficient larger than zero, and since the covariance matrix is symmetric and semi-positive, in order to reduce the influence of this operation on the feature matrix and ensure positive, the value of λ is 0.0001; i is the identity matrix.

And 5: for covariance matrix M⁺Performing pooling operation on the characteristic values and logarithm of the characteristic values to obtain a characteristic matrix, wherein the specific operation is as follows:

F_cov＝WM⁺W^T (3)

wherein ,

for pooling parameter matrices, the specific parameters are optimized by back-propagation learning

Step 6: for matrix F_covEigenvalue decomposition is performed and the matrix F is obtained by the following operation⁺The method comprises the following specific operations:

F_cov＝U₁Σ₁U₁ ^T (4)

F⁺＝U₁max(εI,Σ₁)U₁ ^T (5)

where max () is the maximum value element by element for both matrices.

And 7: f to the matrix⁺Taking logarithm of the eigenvalue to obtain a final eigen matrix F, and specifically operating as follows:

F_cov＝U₂Σ₂U₂ ^T (6)

F＝U₂log(Σ₂)U₂ ^T (7)

wherein, log (Σ)₂) Finger-to-feature matrix sigma₂Is logarithmic.

And 7: model structure initialization

The feature generator FSRG is a full convolution network, and the invention adopts ResNet-50 to realize the feature matrix F of low-resolution images_LRFor input, the reconstructed feature matrix F is output_SRThe dimension of the input and output characteristic matrix is kept consistent, so that the original pooling operation in ResNet is removed; the feature discriminator FSRD adopts a VGG-16 network to respectively use an original image feature matrix F_HRCharacteristic matrix F of low resolution image of corresponding multiplying power_SRIs input; the expression classifier C consists of two full-connection layers and a softmax function layer and outputs the probability of each classification column.

And 8: setting a loss function

During training, the loss function of the feature generator FSRG is determined by the penalty L_GANFeature matrix F_SR and F_HRThe perceptual loss L between_PAnd L2 distance loss L₂In which L is_GANComprises the following steps:

where b is the size of the data batch, L_GANIs used for restraining the feature perception loss, L, of the feature generator FSRG and the feature discriminator FSRD_PComprises the following steps:

wherein ,C_FC() Representing the output of the last fully connected layer of classifier C.

The loss of the feature generator FSRG is a linear sum of the three:

L_FSRG＝L_GAN+λ₁L_P+λ₂L₂ (10)

wherein ,λ₁ and λ₂Both are adjustable weight coefficients greater than zero, both coefficients being set to 0.1 in the present invention.

The loss function calculation mode of the feature discriminator FSRD is as follows:

wherein ,

theta is a random number between 0 and 1, and is ensured in the data of each batch

Is F_SR and F_HRAs a result of linear interpolation, p and k are respectively

The index parameter and the coefficient parameter of the term, p is 6 and k is 2, can obtain the best effect in the experiment.

The concrete operations of calculating the probability of the sample belonging to each category by using softmax and re-weighting the sample loss are as follows:

w＝(σ-logit)^r (12)

wherein, logic is the probability of the sample output by the softmax function corresponding to the real category, and parameters sigma and r are respectively set to 1.5 and 2; the penalty function for classifier C is set to cross entropy penalty.

And step 9: model training

The gradient was updated using an Adam optimizer, the learning rate was set to 0.00002, Adam's first order momentum parameter was 0.1, and second order momentum parameter was 0.999. The number of data set training iterations (Epoch) was set to 400 and the data batch size (batch size) was set to 16.

The model using part:

extracting an image feature tensor T by using a feature extractor E, and then performing feature reconstruction by using a feature generator FSRG to obtain corresponding reconstructed features F_SRThen, the classifier C calculates the probability that the sample belongs to each class, and classifies the sample into the class having the highest probability.

Referring to table 1, table 1 shows the average accuracy of expression recognition on the face image downsampled at each magnification of the RAF-DB data set by different methods, and the method provided by the present invention is significantly improved compared with a method for directly performing Bicubic interpolation amplification on a low-resolution image. Compared with a super-resolution method RCAN and Meta-SR for reconstructing images, the method has better effect on images with lower resolution and higher average identification accuracy of images with various scales. The method provided by the invention is obviously improved compared with a method for directly carrying out Bicubic interpolation amplification on the low-resolution image. Compared with the super-resolution method RCAN and Meta-SR of the reconstructed image, the method has better effect on the image with lower resolution and higher average identification accuracy of the image with each scale.

TABLE 1 mean rate of accuracy of expression recognition on RAF-DB data set for each-magnification downsampling face image by different methods

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A low-resolution image facial expression recognition method based on a feature reconstruction model is characterized by comprising the following steps:

3) inputting a facial image with an expression to be recognized into a trained neural network model, extracting a feature matrix F of the input image by a feature extractor E, and generating a reconstructed feature matrix F by a feature generator FSRG_SRAnd the classifier C calculates and outputs the class label of the recognition result.

2. The feature reconstruction model-based low-resolution image facial expression recognition method as claimed in claim 1, wherein the feature extractor E in step 2) is formed by combining a plurality of convolution layers and nonlinear activation layers and is a feature extraction part of an expression recognition model pre-trained by an original image data set.

3. The method for recognizing the facial expression of the low-resolution image based on the feature reconstruction model as claimed in claim 1, wherein the feature extraction process in the feature extractor E in the step 2) is as follows:

calculating a covariance matrix M of the feature tensor T:

wherein ,f_iOne channel representing the characteristic tensor T,

M⁺＝M+λ*trace(M)*I (2)

4. According to claimThe feature reconstruction model-based low-resolution image facial expression recognition method of claim 3, characterized in that the corrected covariance matrix M is corrected⁺Performing pooling operation and logarithm of the characteristic value to obtain a characteristic matrix, wherein the process of obtaining the characteristic matrix is as follows:

F_cov＝WM⁺W^T (3)

wherein ,

for pooling parameter matrices, matrices

F_cov＝U₁Σ₁U₁ ^T (4)

F⁺＝U₁max(εI,Σ₁)U₁ ^T (5)

F⁺＝U₂Σ₂U₂ ^T (6)

F＝U₂log(Σ₂)U₂ ^T (7)

wherein, log (Σ)₂) Finger-to-eigenvalue matrix sigma₂Is logarithmic.

5. The method for recognizing facial expressions of low-resolution images based on a feature reconstruction model according to claim 1, wherein the feature generator FSRG in the step 2) is a full convolution network, which is composed of a convolution neural network and a nonlinear activation layer, and the process of reconstructing the feature matrix by the feature generator FSRG is as follows:

6. The feature reconstruction model-based low-resolution image facial expression recognition method according to claim 1, wherein the feature discriminator FSRD in step 2) compares the difference between the two in the distribution space, specifically:

7. The method for recognizing facial expressions of low-resolution images based on feature reconstruction model as claimed in claim 1, wherein in the training process of step 2), the loss function of the feature generator FSRG is represented by the antagonistic loss L_GANFeature matrix F_SR and F_HRThe perceptual loss L between_PAnd two-norm loss L₂Composition is carried out;

against loss L_GANComprises the following steps:

wherein b is the size of the data batch;

loss of feature perception L_PComprises the following steps:

two norm loss L₂Comprises the following steps:

the loss of the feature generator FSRG is a linear sum of the three:

L_FSRG＝L_GAN+λ₁L_P+λ₂L₂ (11)

wherein ,λ₁ and λ₂Are all weight coefficients greater than zero.

8. The method for recognizing the facial expressions of the low-resolution images based on the feature reconstruction model as claimed in claim 1, wherein in the training process of the step 2), the loss function of the feature discriminator FSRD is:

wherein ,

Is F_SR and F_HRThe linear interpolation result of (2); p and k are each

The exponential parameter and the coefficient parameter of the term.

9. The feature reconstruction model-based low-resolution image facial expression recognition method as claimed in claim 1, wherein the expression classifier C in step 2) uses softmax to calculate that the sample belongs to each Class_iThe loss of the real category is reweighted by using the probability value corresponding to the real category, wherein the probability value i is 1.

w＝(σ-logit)^r (13)