CN112818764B

CN112818764B - Low-resolution image facial expression recognition method based on feature reconstruction model

Info

Publication number: CN112818764B
Application number: CN202110055946.8A
Authority: CN
Inventors: 田锋; 经纬; 南方; 洪振鑫; 郑庆华
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-01-15
Filing date: 2021-01-15
Publication date: 2023-05-02
Anticipated expiration: 2041-01-15
Also published as: CN112818764A

Abstract

The invention discloses a low-resolution image facial expression recognition method based on a feature reconstruction model, and belongs to the field of facial image expression recognition. The invention includes constructing training and testing data sets; then training a facial expression recognition model of a feature reconstruction model, extracting image expression features by using a feature extraction network with fixed parameters, obtaining an expression feature generator and a feature discriminator by adopting a training model in a mode of generating an countermeasure network, and obtaining F by using FSRG as an input image reconstruction feature _SR The method comprises the steps of carrying out a first treatment on the surface of the Classifier pair feature F consisting of fully connected network and softmax function layer _SR Classifying, and re-weighting the sample loss by using the probability value of the correct category corresponding to the sample output by the softmax layer; the invention is insensitive to the resolution of the input image, improves the recognition accuracy under lower resolution, and has more stable recognition effect on each resolution.

Description

Low-resolution image facial expression recognition method based on feature reconstruction model

Technical Field

The invention belongs to the field of facial image expression recognition, and particularly relates to a low-resolution image facial expression recognition method based on a feature reconstruction model.

Background

Facial expression is one of the most direct, natural signals that humans express emotion. Facial expression recognition is a hot topic of research such as man-machine natural interaction, computer vision, emotion calculation, image processing and the like, and has wide application in the fields of man-machine interaction, distance education, safety, intelligent robot development, medical treatment, animation production and the like.

Under different scenes, due to the change of equipment and environment and the imaging principle of a pinhole camera, the problem of different resolutions of 'near-large-far-small' exists in face images of people in a multi-person photographing scene, and the images can be compressed in network transmission and storage, so that the quality and resolution of the images are reduced. The recognition accuracy of the algorithm may be severely affected in low resolution scenarios. In order to more accurately recognize the expression of the person, it is necessary to reduce the influence of the change in resolution. With the development of deep learning, image super-resolution and other technologies, when processing low-resolution input images, methods of performing super-resolution reconstruction on the images and then performing recognition are mostly adopted. The method of reconstructing an image has the following disadvantages, first: although the expression recognition is improved compared with the method of directly using the low-resolution image, the method has the problems of greatly increased calculated amount, unstable effect and the like. Second,: because the object of expression recognition is a human face, the problem of privacy leakage is easily caused by high-resolution reconstruction of a human face image, and the problem is more and more paid attention to in international research.

Disclosure of Invention

The invention aims to overcome the defects of large calculation amount and easy privacy leakage of reconstructed face images and provides a low-resolution image facial expression recognition method based on a feature reconstruction model.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

a low-resolution image facial expression recognition method based on a feature reconstruction model comprises the following steps:

1) Collecting facial expression images with resolution ratio of more than or equal to 100x100 pixels and labeling expression typesAs an original image I _HR The method comprises the steps of carrying out a first treatment on the surface of the Downsampling the original image by an integer multiplying power of 2-8 times to obtain a corresponding low-resolution image, wherein the expression class label of the low-resolution image is consistent with that of the original image; dividing an original image and a corresponding low-resolution image into a training set and a testing set;

2) Training a neural network model by adopting a generated countermeasure network method;

inputting the original image and the low resolution image of each magnification into a feature extractor E, and the feature extractor E extracts and calculates a feature matrix F of the original image _HR And a low-resolution image feature matrix F of each magnification _LR ；

Low resolution image feature matrix F _LR Input into an expression feature generator FSRG, and output to generate a reconstructed feature matrix F _SR ；

Feature matrix F of original image _HR And a corresponding reconstructed feature matrix F of the low resolution image _SR Inputting the difference in the distribution space of the two characteristics into a characteristic discriminator FSRD, and optimizing the characteristic discriminator FSRG through back propagation;

reconstructing expression feature F _SR Inputting the samples into a double-layer fully-connected expression classifier C for classification, calculating the probability of classifying the samples into various categories by the expression classifier C, and re-weighting the loss of each sample by using the probability value calculated weight coefficient of each sample correctly classified so as to accelerate the convergence of a neural network;

repeating the training process until a trained neural network model is obtained;

3) Inputting a face image of an expression to be recognized into a trained neural network model, extracting a feature matrix F of the input image by a feature extractor E, and generating a reconstructed feature matrix F by a feature generator FSRG _SR The classifier C calculates and outputs a class label of the recognition result.

Further, the feature extractor E in step 2) is formed by combining a plurality of convolution layers and nonlinear activation layers, and is a feature extraction part of the expression recognition model pre-trained by the original image dataset.

Further, the feature extraction process in the feature extractor E in step 2) is as follows:

for an input image I, extracting a three-dimensional characteristic tensor T, wherein the size of the characteristic tensor T is w x h x n, w and h are the length and width of the characteristic tensor, and n is the channel number;

calculating a covariance matrix M of the characteristic tensor T:

wherein ,f_i Representing one channel of the feature tensor T,

m E is the average value of each channel of the characteristic tensor ^n*n N is the number of channels of the feature tensor T;

correcting the eigenvalue of the covariance matrix M to obtain a corrected covariance matrix M ⁺ ：

M ⁺ ＝M+λ*trace(M)*I (2)

Where λ is a coefficient greater than zero, I is the identity matrix, trace (M) is the trace of matrix M;

covariance matrix M for correction ⁺ And carrying out pooling operation and taking logarithm of the characteristic value to obtain a characteristic matrix F.

Further, for corrected covariance matrix M ⁺ The process of pooling operation and taking logarithm of the characteristic value to obtain the characteristic matrix is as follows:

F _cov ＝WM ⁺ W ^T (3)

wherein ,

for pooling the parameter matrix, matrix->

For F _cov Performing eigenvalue decomposition and eigenvalue correction to obtain matrix F ⁺ The specific operation is as follows:

F _cov ＝U ₁ Σ ₁ U ₁ ^T (4)

F ⁺ ＝U ₁ max(εI,Σ ₁ )U ₁ ^T (5)

wherein, max () is the maximum value of the corresponding elements of the two matrixes;

for F ⁺ The specific operation is as follows:

F ⁺ ＝U ₂ Σ ₂ U ₂ ^T (6)

F＝U ₂ log(Σ ₂ )U ₂ ^T (7)

wherein log (Σ ₂ ) Finger pair eigenvalue matrix Σ ₂ The operation of taking the logarithm of each element of (a).

Further, the feature generator FSRG in step 2) is a full convolution network, which is composed of a convolution neural network and a nonlinear activation layer, and the process of reconstructing a feature matrix by the feature generator FSRG is as follows:

feature matrix F of image with low resolution _LR For inputting and outputting reconstructed characteristic matrix F _SR The matrix before and after reconstruction remains dimensionally consistent.

Further, in step 2), the feature discriminator FSRD compares the differences between the two in the distribution space, specifically:

the feature discriminator FSRD respectively uses the feature matrix F corresponding to the same image _SR and F_HR As input, corresponding scores are output, and the absolute value of the difference between the scores represents the wasperstein distance of both in the feature space.

Further, during the training of step 2), the loss function of the feature generator FSRG is determined by countering the loss L _GAN Feature matrix F _SR and F_HR Perceptual loss L between _P And a two-norm loss L ₂ Composition;

countering loss L _GAN The method comprises the following steps:

where b is the size of the data batch;

loss of feature perception L _P The method comprises the following steps:

wherein ,C_FC () Representing the output of the last full-connection layer of the classifier C;

loss of two norms L ₂ The method comprises the following steps:

the loss of the feature generator FSRG is a linear sum of the three:

L _FSRG ＝L _GAN +λ ₁ L _P +λ ₂ L ₂ (11)

wherein ,λ₁ and λ₂ Are all weight coefficients greater than zero.

Further, in the training process of step 2), the loss function of the feature discriminator FSRD is:

wherein ,

θ is a random number between 0 and 1, ensuring that each batch of data is

Is F _SR and F_HR Linear interpolation results of (2); p and k are each->

An exponential parameter and a coefficient parameter of the term.

Further, the expression classifier C in step 2) calculates that the sample belongs to each Class using softmax _i I=1,..z, z is the total number of categories, the loss of which is re-weighted with the probability value corresponding to the real category therein, the specific operation being:

w＝(σ-logit) ^r (13)

where logit is the probability that the sample output by the softmax function corresponds to its true class, parameters σ and r are set to 1.5 and 2, respectively.

Compared with the prior art, the invention has the following beneficial effects:

according to the low-resolution image facial expression recognition method based on the feature reconstruction model, a training and testing data set is constructed, different multiplying power downsampling is carried out on high-resolution facial expression images to generate high-low resolution image pairs with multiple multiplying power, and category labels are reserved; then training a facial expression recognition model of the feature reconstruction model, and extracting high-resolution image expression features F by using a feature extraction network with fixed parameters _HR And corresponding low resolution image expressive features F _LR The method comprises the steps of carrying out a first treatment on the surface of the Then training a model by adopting a mode of generating an countermeasure network to obtain an expression feature generator FSRG and a feature discriminator FSRD, and reconstructing features by using the FSRG as an input image to obtain F _SR The method comprises the steps of carrying out a first treatment on the surface of the Classifier C pair feature F composed of fully connected network and softmax function layer _SR Classifying, re-weighting the sample loss by using the probability value of the correct category corresponding to the sample output by the softmax layer, and accelerating model convergence; the identification process is as follows: the model extracts a feature matrix F of the input image, and then a feature generator FSRG generates a reconstructed feature matrix F _SR And calculating and outputting the class labels of the recognition results by using the classifier C obtained through training. The invention provides a method for reconstructing image features by combining a deep learning countermeasure generation network to recognize facial expressions. Compared with the traditional method, the invention is insensitive to the resolution of the input image, and improves the lower resolutionThe recognition accuracy under the rate; compared with the method for reconstructing the image, the method has the advantages that the identification effect on each resolution is more stable, the problems of increased calculation amount and possible privacy leakage caused by reconstructing the image can be avoided, and the method has great industrial application value.

Drawings

FIG. 1 is an overall network of a low resolution image facial expression recognition method based on a feature reconstruction model of the present invention;

FIG. 2 is a network architecture of the feature extractor of the present invention;

FIG. 3 is a network structure of a signature generating part of the present invention, wherein FIG. 3 (a) is a signature generator network structure and FIG. 3 (b) is a dense connection block structure;

fig. 4 is a network structure of the feature discriminator of the invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention provides a method for reconstructing image features by combining a deep learning countermeasure generation network to recognize facial expressions. Compared with the traditional method, the method is insensitive to the resolution of the input image, and improves the recognition accuracy under lower resolution; compared with the method for reconstructing the image, the method has more stable recognition effect on each resolution, can avoid the problems of increased calculation amount and possible privacy leakage caused by reconstructing the image, and has great industrial application value in the fields of education analysis, management and entertainment.

The invention is described in further detail below with reference to the attached drawing figures:

referring to fig. 1, fig. 1 is an overall network of a low-resolution image facial expression recognition method based on a feature reconstruction model according to the present invention; the network comprises four main parts, namely a feature extractor, a feature generator, a feature discriminator and an expression classifier.

Referring to fig. 2, fig. 2 is a network structure of the feature extractor of the present invention; the feature extractor comprises six convolutional layers (Conv Layer) each having a 3x3 convolutional kernel and a step size of 1. The number of output characteristic channels of each convolution layer is 64, 96, 128 and 256 in turn, each convolution layer is followed by an activation layer, and the activation function is a ReLu function. After each of the first, second and fourth activation layers there is a pooling Layer, using a maximum pooling Layer, with a pooling window size of 2x2 and a step size of 2.

Referring to fig. 3, fig. 3 (a) is a feature generator network structure, and fig. 3 (b) is a dense connection block structure; the structure of a single dense block is shown in fig. 3 (b), and the structure comprises five convolution layer-batch normalization layer (BatchNormalization, BN) combinations, wherein dense connection is adopted among groups and an LReLu function is added as an activation layer.

Referring to fig. 4, fig. 4 is a network structure of the feature discriminator of the invention, which is composed of five convolution blocks and two full-connection hierarchies, and the number of output channels of each convolution block is 8, 16, 32, 64 and 64 in sequence; each convolution block is formed by alternately arranging two convolution layers and two activation layers, wherein the convolution kernel of the former convolution layer is 3x3, the step length is 1, the convolution kernel of the latter convolution layer is 5x5, and the step length is 2; the output dimensions of the last two fully connected layers are sequentially 100 dimensions and 1 dimension.

The invention discloses a low-resolution image facial expression recognition method based on a feature reconstruction model, which comprises the following implementation processes:

model training part:

step 1: collecting facial expression images with resolution ratio of 100x100 pixels or more, and labeling expression types to serve as an original image I _HR The method comprises the steps of carrying out a first treatment on the surface of the Downsampling the original image by 2-8 times of integer multiplying power (the length and width of the image are changed into original resolution) by bicubic interpolation

To->

) Obtaining a plurality of low resolution images (I _LR-2 To I _LR-8 ) The method comprises the steps of carrying out a first treatment on the surface of the The expression category label of the low-resolution image is consistent with the original image;

step 2: feature extractor E, pre-trained using fixed parameters, extracts feature matrix F of original resolution image _HR Feature matrix F of low-resolution image corresponding to each magnification _LR The feature extractor E comprises a convolution layer and a nonlinear activation layer. One input is a high-low resolution image pair, for one image I, a feature extractor is used for extracting a corresponding three-dimensional feature tensor T, the size of the feature tensor T is w x h x n, w and h are the length and the width of the corresponding feature tensor, and n is the channel number;

step 3: calculating covariance matrices of the respective feature tensors T:

wherein ,f_i Representing one channel of the feature tensor T,

m E is the average value of each channel of the characteristic tensor ^n*n N is the number of channels of the feature tensor T.

Step 4: to ensure the positive nature of the matrix, eigenvalue correction is performed on each covariance matrix:

M ⁺ ＝M+λ*trace(M)*I (2)

wherein lambda is a coefficient larger than zero, and as the covariance matrix is symmetrically semi-positive, the lambda takes a value of 0.0001 in order to reduce the influence of the operation on the feature matrix and ensure positive quality; i is the identity matrix.

Step 5: for covariance matrix M ⁺ Carrying out pooling operation on the characteristic values and taking logarithms of the characteristic values to obtain a characteristic matrix, wherein the specific operation is as follows:

F _cov ＝WM ⁺ W ^T (3)

wherein ,

to pool the parameter matrix, the specific parameters are optimized by back propagation learning, matrix +.>

Step 6: for matrix F _cov Decomposing the eigenvalue and performing the following operation to obtain a matrix F ⁺ The specific operation is as follows:

F _cov ＝U ₁ Σ ₁ U ₁ ^T (4)

F ⁺ ＝U ₁ max(εI,Σ ₁ )U ₁ ^T (5)

where max () is the maximum value of two matrices element by element.

Step 7: f to matrix ⁺ Taking the logarithm of the eigenvalue to obtain a final eigenvector F, wherein the specific operation is as follows:

F _cov ＝U ₂ Σ ₂ U ₂ ^T (6)

F＝U ₂ log(Σ ₂ )U ₂ ^T (7)

wherein log (Σ ₂ ) Finger pair feature matrix Σ ₂ The operation of taking the logarithm of each element of (a).

Step 7: model structure initialization

The feature generator FSRG is a full convolution network, and is implemented by ResNet-50 in the invention, and features matrix F of low-resolution image is adopted _LR For inputting and outputting reconstructed characteristic matrix F _SR The dimension of the input and output feature matrix is consistent, so that the original pooling operation in ResNet is removed; the feature discriminator FSRD is adopted in the invention that the VGG-16 network respectively uses the original image feature matrix F _HR Feature matrix F of low-resolution image of magnification corresponding thereto _SR Is input; the expression classifier C consists of two full-connection layers and a softmax function layer, and outputs the probability of each classification list.

Step 8: setting a loss function

During training, the loss function of the feature generator FSRG is determined by countering the loss L _GAN Feature matrix F _SR and F_HR Perceptual loss L between _P And L2 distance loss L ₂ Composition, wherein L _GAN The method comprises the following steps:

where b is the size of the data batch, L _GAN Is to constrain the feature perception loss of the feature generator FSRG and the feature discriminator FSRD, L _P The method comprises the following steps:

wherein ,C_FC () Representing the output of the last fully connected layer of classifier C.

The loss of the feature generator FSRG is a linear sum of the three:

L _FSRG ＝L _GAN +λ ₁ L _P +λ ₂ L ₂ (10)

wherein ,λ₁ and λ₂ All are adjustable weight coefficients greater than zero, and in the invention, both coefficients are set to 0.1.

The loss function calculation mode of the feature discriminator FSRD is as follows:

wherein ,

θ is a random number between 0 and 1, ensuring that each batch of data is

Is F _SR and F_HR Is p and k are respectively +.>

The exponential parameters and coefficient parameters of the terms, p=6 and k=2 in the experiment, give the best results.

The concrete operation of calculating the probability that a sample belongs to each class using softmax and re-weighting the sample loss is:

w＝(σ-logit) ^r (12)

wherein, the logit is the probability that the sample output by the softmax function corresponds to the real category of the sample, and the parameters sigma and r are respectively set to be 1.5 and 2; the loss function of classifier C is set to cross entropy loss.

Step 9: model training

The gradient was updated using Adam optimizer, the learning rate was set to 0.00002, the Adam's one-order momentum parameter was 0.1, and the second-order momentum parameter was 0.999. The data set training iteration number (Epoch) was set to 400 and the data batch size (batch size) was set to 16.

Model use part:

extracting an image feature tensor T by using a feature extractor E, and carrying out feature reconstruction by using a feature generator FSRG to obtain a corresponding reconstructed feature F _SR The probability that the sample belongs to each class is then calculated by the classifier C, classifying the sample into the class with the highest probability.

Referring to table 1, table 1 shows that the expression recognition average accuracy of different methods on the face image downsampled by each multiplying power of the RAF-DB data set is obviously improved compared with the method for directly carrying out Bicubic interpolation amplification on the low-resolution image. Compared with the super-resolution method RCAN and the Meta-SR of the reconstructed image, the method has better effect on the image with lower resolution and higher average recognition accuracy of each scale image. Compared with the method for directly carrying out Bicubic interpolation amplification on the low-resolution image, the method provided by the invention has obvious improvement. Compared with the super-resolution method RCAN and the Meta-SR of the reconstructed image, the method has better effect on the image with lower resolution and higher average recognition accuracy of each scale image.

TABLE 1 average accuracy of expression recognition on down-sampled face images for each magnification of RAF-DB dataset by different methods

The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. The low-resolution image facial expression recognition method based on the feature reconstruction model is characterized by comprising the following steps of:

1) Collecting facial expression images with resolution ratio of 100x100 pixels or more, and labeling expression types to serve as an original image I _HR The method comprises the steps of carrying out a first treatment on the surface of the Both the length and width of the image are changed to the original resolution

To->

Obtaining a corresponding low-resolution image, wherein the expression category label of the low-resolution image is consistent with that of the original image; taking one part of the original image and the corresponding low-resolution image as a training set, and taking the other part of the original image and the corresponding low-resolution image as a test set;

during the training of step 2), the loss function of the feature generator FSRG is determined by countering the loss L _GAN Feature matrix F _SR and F_HR Perceptual loss L between _P And a two-norm loss L ₂ Composition;

countering loss L _GAN The method comprises the following steps:

where b is the size of the data batch;

loss of feature perception L _P The method comprises the following steps:

loss of two norms L ₂ The method comprises the following steps:

the loss of the feature generator FSRG is a linear sum of the three:

L _FSRG ＝L _GAN +λ ₁ L _P +λ ₂ L ₂ (11)

wherein ,λ₁ and λ₂ All are weight coefficients greater than zero;

2. The low resolution image facial expression recognition method based on a feature reconstruction model according to claim 1, wherein the feature extractor E in step 2) is composed of a plurality of convolution layers and a nonlinear activation layer, and is a feature extraction portion of an expression recognition model pre-trained by an original image dataset.

3. The low-resolution image facial expression recognition method based on the feature reconstruction model according to claim 1, wherein the feature extraction process in the feature extractor E in step 2) is as follows:

calculating a covariance matrix M of the characteristic tensor T:

wherein ,f_i Representing one channel of the feature tensor T,

mean value of the channels of the feature tensor, +.>

n is the number of channels of the feature tensor T;

M ⁺ ＝M+λ*trace(M)*I (2)

4. A low resolution image facial expression recognition method based on a feature reconstruction model as claimed in claim 3, wherein the covariance matrix M for correction ⁺ The process of pooling operation and taking logarithm of the characteristic value to obtain the characteristic matrix is as follows:

F _cov ＝WM ⁺ W ^T (3)

wherein ,

for pooling parameter matrix, < >>

Is an output matrix;

F _cov ＝U ₁ Σ ₁ U ₁ ^T (4)

F ⁺ ＝U ₁ max(εI,Σ ₁ )U ₁ ^T (5)

for F ⁺ The specific operation is as follows:

F ⁺ ＝U ₂ Σ ₂ U ₂ ^T (6)

F＝U ₂ log(Σ ₂ )U ₂ ^T (7)

5. The method for recognizing facial expression of a low-resolution image based on a feature reconstruction model according to claim 1, wherein the feature generator FSRG in the step 2) is a full convolution network, which is composed of a convolution neural network and a nonlinear activation layer, and the process of reconstructing a feature matrix by the feature generator FSRG is as follows:

6. The method for recognizing facial expression of a low-resolution image based on a feature reconstruction model according to claim 1, wherein the feature discriminator FSRD in step 2) compares the differences between the two in the distribution space, specifically:

7. The method for recognizing facial expressions of low-resolution images based on a feature reconstruction model according to claim 1, wherein the training process of step 2) includes the following steps:

wherein ,

θ is a random number between 0 and 1, ensuring +.>

Is F _SR and F_HR Linear interpolation results of (2); p and k are each->

An exponential parameter and a coefficient parameter of the term.

8. The facial expression recognition method of a low-resolution image based on a feature reconstruction model as set forth in claim 1, wherein the expression classifier C in step 2) calculates samples belonging to each Class using softmax _i I=1,..a., z, z is the total number of categories, and its loss is re-weighted by the probability value corresponding to the real category, specifically：

w＝(σ-logit) ^r (13)