CN110717434A - Expression recognition method based on feature separation - Google Patents

Expression recognition method based on feature separation Download PDF

Info

Publication number
CN110717434A
Authority
CN
China
Prior art keywords
expression
sample
feature
loss
separation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910941100.7A
Other languages
Chinese (zh)
Other versions
CN110717434B (en)
Inventor
谢龙汉
杨烈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910941100.7A priority Critical patent/CN110717434B/en
Publication of CN110717434A publication Critical patent/CN110717434A/en
Application granted granted Critical
Publication of CN110717434B publication Critical patent/CN110717434B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an expression recognition method based on feature separation. The method uses a feature separation framework based on a generative adversarial network to separate expression-related features from expression-irrelevant features, then ignores the irrelevant features and performs expression recognition using only the expression-related features. The proposed framework consists of a generator G and a discriminator D, where the generator G comprises two parts, an encoder Gen and a decoder Gde. During training, partial feature exchange together with the constraints of a center loss, a content loss, an adversarial loss and a classification loss causes the features that truly determine the expression to converge into the expression feature part and the expression-irrelevant features to converge into the irrelevant feature part. In the testing stage, a classifier trained only on the expression features performs expression recognition. Through feature separation, the method overcomes the interference of irrelevant factors in the expression recognition process and improves the accuracy of expression recognition.

Description

Expression recognition method based on feature separation
Technical Field
The invention relates to the field of facial expression recognition, in particular to an expression recognition method based on feature separation.
Background
Facial expression is one of the important ways in which humans express feelings and intentions. Facial expression recognition has high potential application value in social robots, medical instruments, fatigue-driving monitoring and many other human-computer interaction systems, and a great deal of research has therefore been devoted to it. Traditional methods recognize facial expressions from hand-crafted features or through shallow learning, and their recognition ability is very limited. With the development of deep learning, more and more studies adopt deep learning methods to recognize facial expressions, which has greatly promoted the development of the field. Although deep learning has strong feature learning ability and achieves good results in facial expression recognition, some problems remain in practical applications, mainly: (1) there are large individual differences between different subjects; (2) facial pose and illumination conditions vary in real application scenes; (3) occlusion and similar problems also exist in real application scenes.
In recent years, facial expression recognition based on deep learning has made progress, but individual differences remain one of the important factors limiting further improvement of recognition accuracy. In an expression recognition task, it is generally desirable that images of the same expression lie closer together in the feature space than images of different expressions. However, because different individuals differ greatly in age, sex, hair style, skin color, facial appearance, personality and so on, the distance between different expressions of the same individual can be smaller than the distance between the same expression of different individuals, which causes expressions of different individuals to be recognized incorrectly.
Disclosure of Invention
In order to solve the problem of individual differences in the expression recognition process and avoid interference of irrelevant features with the expression recognition task, the invention provides a feature separation framework based on a generative adversarial network. The framework separates expression-related features from expression-irrelevant features, ignores the irrelevant features and recognizes the expression only from the expression-related features, thereby overcoming the interference of irrelevant factors and improving the accuracy of expression recognition.
The purpose of the invention is realized by at least one of the following technical solutions.
An expression recognition method based on feature separation comprises the following steps:
S1, providing a feature separation framework, and using the framework to perform expression feature exchange between two input images;
S2, training the feature separation framework under the constraint of multiple loss functions, so that during feature separation the features that truly determine the expression converge into the expression feature vector and the expression-irrelevant features converge into the irrelevant feature vector;
S3, after training of the feature separation framework is finished, training a classifier on the expression feature vectors separated by the framework, and recognizing the expression.
Further, the step S1 specifically includes the following steps:
S1.1, first, providing a feature separation framework based on a generative adversarial network, which is composed of a generator G and a discriminator D, where the generator G includes an encoder Gen and a decoder Gde;
S1.2, arbitrarily reading two images from a training set as an input first sample and an input second sample, and then respectively performing feature extraction on the input first sample and the input second sample by using the encoder Gen;
S1.3, dividing the extracted feature vector into an expression feature vector and an irrelevant feature vector according to a set proportion to obtain a first expression feature vector, a second expression feature vector, a first irrelevant feature vector and a second irrelevant feature vector; then exchanging the first expression feature vector and the second expression feature vector, and keeping the first irrelevant feature vector and the second irrelevant feature vector unchanged;
S1.4, generating two new images, namely a first generated sample and a second generated sample, by using the decoder Gde according to the feature vectors recombined after the exchange;
S1.5, inputting the first generated sample and its corresponding real sample, namely the third sample, into the discriminator D, inputting the second generated sample and its corresponding real sample, namely the fourth sample, into the discriminator D, and simultaneously carrying out real/fake discrimination and classification.
Further, the step S2 includes the following steps:
S2.1, in order to make the expression-irrelevant features converge into the irrelevant vector, a content loss L_con is introduced; the first sample and the third sample, and the second sample and the fourth sample, differ only in expression while their other characteristics are the same, so by constraining the content loss between the constraint samples, namely the third and fourth samples, and the generated images, the generator G converges the expression-irrelevant features into the unchanged feature vector that does not take part in the exchange;
S2.2, in order to make the expression-related features converge into the expression feature vector, a classification loss is introduced; during training of the feature separation framework, the classification loss on real images L_cls^r is used to optimize the discriminator D and thereby improve its classification ability, while the classification loss on generated images L_cls^f is used to optimize the generator G, so that the generator realizes expression exchange of the generated images through feature exchange and the information determining the expression of a generated image converges into the expression feature vector that takes part in the exchange;
S2.3, in order to make the generated images close to the real samples, so that the content loss and the classification loss converge better, an adversarial loss L_adv is introduced;
S2.4, in order to reduce the distance between the expression features of samples of the same class in the feature space, improve the separation purity of the expression features and thus improve the expression recognition accuracy, a center loss L_cen is introduced;
S2.5, finally, training the feature separation framework according to the combined loss function L_G of the generator G and the combined loss function L_D of the discriminator D; L_G and L_D are respectively:
L_G = L_adv + λ_cls·L_cls^f + λ_con·L_con + λ_cen·L_cen;
L_D = -L_adv + λ_cls·L_cls^r;
wherein λ_cls, λ_con and λ_cen are respectively the weights of the classification loss, the content loss and the center loss in the final loss function, and need to be determined through extensive experiments.
Further, in step S2.1, the expression of the content loss L_con is as follows:
L_con = E_x[ ||x_o - G(x_i)||_1 ];
wherein x_i and x_o are the input sample and the corresponding constraint sample respectively, G(x_i) represents the generated sample obtained from the input sample x_i, and E_x represents the mathematical expectation over the input samples.
Further, in step S2.2, L_cls^r and L_cls^f are respectively:
L_cls^r = E_{x,c}[ -log D_cls(c|x) ];
L_cls^f = E_{x,c}[ -log D_cls(c|G(x)) ];
where x represents the input sample, c represents the class of sample x, G(x) represents the sample generated from sample x, D_cls(c|x) represents the probability that the discriminator D recognizes the input sample x as class c, and E_{x,c} indicates the mathematical expectation over the input sample x and its class c.
Further, in step S2.3, the expression of the adversarial loss L_adv is as follows:
L_adv = E_x[ log D_src(x_o) ] + E_x[ log(1 - D_src(G(x_i))) ];
wherein x_i and x_o are respectively the input sample and the constraint sample, G(x_i) is the image generated from the input sample x_i, D_src(x) is the probability that the discriminator D judges sample x to be a real sample, and E_x represents the mathematical expectation over the input samples.
Further, in step S2.4, the expression of the center loss L_cen is as follows:
L_cen = (1/2) Σ_i ||e_i - c^k_{y_i}||_2^2;
wherein e_i represents the expression feature of sample i, y_i represents the class of sample i, and c^k_{y_i} represents the center feature vector of the class of sample i in the k-th training iteration; the center feature vector of each expression is initialized to a random value, and during training the center feature vector of each expression is updated as follows:
Δc^t_j = ( Σ_i δ(y_i = j)·(c^t_j - e_i) ) / ( 1 + Σ_i δ(y_i = j) );
c^{t+1}_j = c^t_j - α·Δc^t_j;
wherein Δc^t_j represents the average distance between the samples of class j and the corresponding center vector in the t-th iteration, c^t_j and c^{t+1}_j respectively represent the center vector of class j in the t-th and (t+1)-th iterations, α is the learning rate for updating the center vectors, α ∈ (0,1), and its specific value needs to be determined through extensive experiments.
Further, the step S3 specifically includes the following steps:
S3.1, after training of the feature separation framework is completed, first using the trained encoder Gen to perform feature extraction and separation on any input sample, and then training a simple convolutional neural network as a classifier on the separated expression feature vectors, the classifier using the cross-entropy loss as its optimization objective function;
S3.2, after training of the classifier is finished, reading in test samples from the test set, first using the previously trained encoder Gen to extract and separate the features of each test sample, and then using the classifier to recognize the expression from the separated expression feature vector.
Compared with the prior art, the invention has the advantages that:
the invention provides a feature separation framework based on generation of a countermeasure network, which can separate features related to expressions and features unrelated to the expressions, ignore the features unrelated to the expressions and only recognize the expressions according to the features related to the expressions, thereby overcoming the interference of the unrelated features to the expression recognition process and improving the accuracy of the expression recognition.
Drawings
Fig. 1 is a diagram of a feature separation framework according to an embodiment of the present invention.
Fig. 2 is a network structure diagram for expression recognition in the testing phase according to the embodiment of the present invention.
Detailed Description
The practice of the present invention will be further illustrated by the following examples and drawings, but the practice and protection of the present invention is not limited thereto.
An expression recognition method based on feature separation comprises the following steps:
S1, providing a feature separation framework, and using the framework to perform expression feature exchange between two input images, which specifically comprises the following steps:
S1.1, first, a feature separation framework based on a generative adversarial network is proposed, which is composed of a generator G and a discriminator D, wherein the generator G comprises an encoder Gen and a decoder Gde. In this example, the network structure of the generator is shown in Table 1, and the network structure of the discriminator is shown in Table 2.
TABLE 1 network architecture of generators
In table 1, the convolution module contains one convolution layer, one instance normalization layer, one ReLU activation function layer, and one Dropout layer, and h and w are the height and width of the input image, respectively.
TABLE 2 network architecture of the discriminator
In Table 2, the discriminating module includes a convolution layer, a LeakyReLU activation function layer and a Dropout layer, and h and w are the height and width of the input image, respectively.
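For illustration only, the following is a minimal PyTorch sketch of the two building blocks just described; the kernel size, stride, channel widths and dropout rate are assumptions, since the full layer lists of Tables 1 and 2 appear only as images in the original publication.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Generator building block: convolution + instance normalization + ReLU + dropout."""
    def __init__(self, in_ch, out_ch, kernel_size=4, stride=2, padding=1, dropout=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding),
            nn.InstanceNorm2d(out_ch, affine=True),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.block(x)

class DiscModule(nn.Module):
    """Discriminator building block: convolution + LeakyReLU + dropout."""
    def __init__(self, in_ch, out_ch, kernel_size=4, stride=2, padding=1, dropout=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.block(x)
```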
S1.2, two images are read arbitrarily from the training set as the input first sample and second sample; in this embodiment the size of an input image is 128 × 128; the encoder Gen then performs feature extraction on the input first sample and second sample respectively.
S1.3, dividing the extracted feature vector into an expression feature vector and an irrelevant feature vector according to a proportion, wherein the feature separation ratio in the example is 124:900, and obtaining a first expression feature vector, a second expression feature vector, a first irrelevant feature vector and a second irrelevant feature vector; and then exchanging the first expression feature vector and the second expression feature vector, wherein the first irrelevant feature vector and the second irrelevant feature vector are kept unchanged.
S1.4, generating two new images, namely a first generation sample and a second generation sample, by using a decoder Gde according to the feature vectors recombined after exchange;
S1.5, the first generated sample and its corresponding real sample, namely the third sample, are input into the discriminator D, and the second generated sample and its corresponding real sample, namely the fourth sample, are input into the discriminator D, which simultaneously performs real/fake discrimination and classification.
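A minimal sketch of the partial feature exchange in steps S1.2 to S1.5 is given below, assuming the encoder output is split along its feature dimension in the 124:900 ratio of this embodiment; G_en, G_de and D are placeholders for the encoder, decoder and discriminator sketched above, and the discriminator is assumed to return a real/fake score together with expression class logits.

```python
import torch

EXPR_DIM, IRR_DIM = 124, 900  # feature separation ratio used in this embodiment

def exchange_and_generate(G_en, G_de, D, x1, x2):
    """Swap the expression parts of two samples and decode two new images."""
    f1, f2 = G_en(x1), G_en(x2)                            # feature extraction
    e1, u1 = torch.split(f1, [EXPR_DIM, IRR_DIM], dim=1)   # expression / irrelevant parts
    e2, u2 = torch.split(f2, [EXPR_DIM, IRR_DIM], dim=1)
    g1 = G_de(torch.cat([e2, u1], dim=1))  # first generated sample (expression of x2)
    g2 = G_de(torch.cat([e1, u2], dim=1))  # second generated sample (expression of x1)
    # The discriminator returns a real/fake score and class logits for each generated image.
    src1, cls1 = D(g1)
    src2, cls2 = D(g2)
    return g1, g2, (src1, cls1), (src2, cls2)
```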
S2, the feature separation framework is trained under the constraint of multiple loss functions, so that during feature separation the features that truly determine the expression converge into the expression feature vector and the expression-irrelevant features converge into the irrelevant feature vector; the specific steps are as follows:
S2.1, in order to make the expression-irrelevant features converge into the irrelevant vector, a content loss L_con is introduced; the first sample and the third sample, and the second sample and the fourth sample, differ only in expression while their other characteristics are the same, so by constraining the content loss between the constraint samples, namely the third and fourth samples, and the generated images, the generator G converges the expression-irrelevant features into the unchanged feature vector that does not take part in the exchange. The expression of the content loss L_con is as follows:
L_con = E_x[ ||x_o - G(x_i)||_1 ];
wherein x_i and x_o are the input sample and the corresponding constraint sample respectively, G(x_i) represents the generated sample obtained from the input sample x_i, and E_x represents the mathematical expectation over the input samples.
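A one-line sketch of this term, assuming (as above) an L1 pixel distance between each generated image and its constraint sample:

```python
import torch.nn.functional as F

def content_loss(generated, constraint):
    # L_con: mean absolute pixel difference between G(x_i) and the constraint sample x_o
    return F.l1_loss(generated, constraint)
```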
S2.2, in order to make the expression-related features converge into the expression feature vector, a classification loss is introduced; during training of the feature separation framework, the classification loss on real images L_cls^r is used to optimize the discriminator D and thereby improve its classification ability, while the classification loss on generated images L_cls^f is used to optimize the generator G, so that the generator realizes expression exchange of the generated images through feature exchange and the information determining the expression of a generated image converges into the expression feature vector that takes part in the exchange; L_cls^r and L_cls^f are respectively:
L_cls^r = E_{x,c}[ -log D_cls(c|x) ];
L_cls^f = E_{x,c}[ -log D_cls(c|G(x)) ];
where x represents the input sample, c represents the class of sample x, G(x) represents the sample generated from sample x, D_cls(c|x) represents the probability that the discriminator D recognizes the input sample x as class c, and E_{x,c} indicates the mathematical expectation over the input sample x and its class c.
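A sketch of the two classification terms, assuming the discriminator returns (real/fake score, class logits) and that -log D_cls(c|x) is realized with a softmax cross-entropy; the class passed for a generated image is assumed to be the expression carried over by the exchange.

```python
import torch.nn.functional as F

def cls_loss_real(D, x, c):
    """L_cls^r: expression classification loss on real images, used to optimize D."""
    _, logits = D(x)
    return F.cross_entropy(logits, c)

def cls_loss_fake(D, g, c):
    """L_cls^f: classification loss on generated images, used to optimize G so that
    the generated image is recognized as the exchanged expression class c."""
    _, logits = D(g)
    return F.cross_entropy(logits, c)
```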
S2.3, in order to make the generated images close to the real samples, so that the content loss and the classification loss converge better, an adversarial loss L_adv is introduced; the expression of the adversarial loss L_adv is as follows:
L_adv = E_x[ log D_src(x_o) ] + E_x[ log(1 - D_src(G(x_i))) ];
wherein x_i and x_o are respectively the input sample and the constraint sample, G(x_i) is the image generated from the input sample x_i, D_src(x) is the probability that the discriminator D judges sample x to be a real sample, and E_x represents the mathematical expectation over the input samples.
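A sketch of this adversarial term split into its discriminator and generator sides; expressing the log-likelihoods through binary cross-entropy on the raw real/fake score is an implementation assumption, not something specified in the text above.

```python
import torch
import torch.nn.functional as F

def adv_loss_D(D, x_o, g):
    """Discriminator side: maximize log D_src(x_o) + log(1 - D_src(G(x_i)))."""
    src_real, _ = D(x_o)
    src_fake, _ = D(g.detach())   # do not backpropagate into the generator here
    return (F.binary_cross_entropy_with_logits(src_real, torch.ones_like(src_real))
            + F.binary_cross_entropy_with_logits(src_fake, torch.zeros_like(src_fake)))

def adv_loss_G(D, g):
    """Generator side: push D to score the generated image as real."""
    src_fake, _ = D(g)
    return F.binary_cross_entropy_with_logits(src_fake, torch.ones_like(src_fake))
```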
S2.4, in order to reduce the distance between the expression features of samples of the same class in the feature space, improve the separation purity of the expression features and thus improve the expression recognition accuracy, a center loss L_cen is introduced; the expression of the center loss L_cen is as follows:
L_cen = (1/2) Σ_i ||e_i - c^k_{y_i}||_2^2;
wherein e_i represents the expression feature of sample i, y_i represents the class of sample i, and c^k_{y_i} represents the center feature vector of the class of sample i in the k-th training iteration; the center feature vector of each expression is initialized to a random value, and during training the center feature vector of each expression is updated as follows:
Δc^t_j = ( Σ_i δ(y_i = j)·(c^t_j - e_i) ) / ( 1 + Σ_i δ(y_i = j) );
c^{t+1}_j = c^t_j - α·Δc^t_j;
wherein Δc^t_j represents the average distance between the samples of class j and the corresponding center vector in the t-th iteration, c^t_j and c^{t+1}_j respectively represent the center vector of class j in the t-th and (t+1)-th iterations, α is the learning rate for updating the center vectors (α ∈ (0,1)), and its specific value needs to be determined through extensive experiments;
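A sketch of the center loss and its running center update, assuming per-class centers initialized at random and moved with learning rate alpha as described above; the batch-wise update rule is a standard center-loss formulation and is an assumption here.

```python
import torch

class ExpressionCenters:
    """Per-class centers for the center loss; centers start from random values."""
    def __init__(self, num_classes, feat_dim, alpha=0.5):
        self.centers = torch.randn(num_classes, feat_dim)
        self.alpha = alpha  # center learning rate, alpha in (0, 1)

    def loss(self, e, y):
        # L_cen = 1/2 * mean over the batch of || e_i - c_{y_i} ||^2
        return 0.5 * ((e - self.centers[y]) ** 2).sum(dim=1).mean()

    @torch.no_grad()
    def update(self, e, y):
        # Move each class center towards the expression features of that class in the batch.
        for j in y.unique():
            mask = (y == j)
            delta = (self.centers[j] - e[mask]).mean(dim=0)
            self.centers[j] -= self.alpha * delta
```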
S2.5, finally, the feature separation framework is trained according to the combined loss function L_G of the generator G and the combined loss function L_D of the discriminator D; L_G and L_D are respectively:
L_G = L_adv + λ_cls·L_cls^f + λ_con·L_con + λ_cen·L_cen;
L_D = -L_adv + λ_cls·L_cls^r;
wherein λ_cls, λ_con and λ_cen are respectively the weights of the classification loss, the content loss and the center loss in the final loss function; their values were determined through extensive experiments and in this example are λ_cls = 3, λ_con = 10 and λ_cen = 10.
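Putting the pieces together, the following is a rough sketch of one training iteration with the weights of this example; it reuses the helper functions from the sketches above, and the pairing of generated samples with constraint samples and target classes (g1 with x3 and c2, g2 with x4 and c1) is an assumption based on the exchange described in step S1.

```python
import torch

LAM_CLS, LAM_CON, LAM_CEN = 3.0, 10.0, 10.0  # loss weights used in this example

def train_step(G_en, G_de, D, opt_G, opt_D, centers, x1, x2, c1, c2, x3, x4):
    """One sketch iteration: x3/x4 are the constraint samples wearing the exchanged expressions."""
    # --- discriminator update: adversarial loss + classification loss on real images ---
    g1, g2, _, _ = exchange_and_generate(G_en, G_de, D, x1, x2)
    d_loss = (adv_loss_D(D, x3, g1) + adv_loss_D(D, x4, g2)
              + LAM_CLS * (cls_loss_real(D, x3, c2) + cls_loss_real(D, x4, c1)))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- generator update: adversarial + classification on fakes + content + center losses ---
    g1, g2, _, _ = exchange_and_generate(G_en, G_de, D, x1, x2)
    e1, _ = torch.split(G_en(x1), [EXPR_DIM, IRR_DIM], dim=1)
    e2, _ = torch.split(G_en(x2), [EXPR_DIM, IRR_DIM], dim=1)
    g_loss = (adv_loss_G(D, g1) + adv_loss_G(D, g2)
              + LAM_CLS * (cls_loss_fake(D, g1, c2) + cls_loss_fake(D, g2, c1))
              + LAM_CON * (content_loss(g1, x3) + content_loss(g2, x4))
              + LAM_CEN * (centers.loss(e1, c1) + centers.loss(e2, c2)))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

    # update the expression centers with the freshly extracted expression features
    centers.update(torch.cat([e1, e2]).detach(), torch.cat([c1, c2]))
    return d_loss.item(), g_loss.item()
```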
S3, after training of the feature separation framework is finished, a classifier is trained on the expression feature vectors separated by the framework and used to recognize the expression, as shown in FIG. 2; the specific steps are as follows:
S3.1, after training of the feature separation framework is completed, the trained encoder Gen is first used to perform feature extraction and separation on any input sample, and then a simple convolutional neural network, whose network structure is shown in Table 3, is trained as a classifier on the separated expression feature vectors, the classifier using the cross-entropy loss as its optimization objective function;
TABLE 3 network architecture of convolutional neural networks
The classification module in the table contains a convolution layer, a LeakyReLU activation function layer and a Dropout layer, where h and w are the height and width of the input image, respectively.
S3.2, after training of the classifier is finished, test samples are read in from the test set; the previously trained encoder Gen is first used to extract and separate the features of each test sample, and the classifier then recognizes the expression from the separated expression feature vector.
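A sketch of the test stage described above; a small fully connected head is used here in place of the convolutional classifier of Table 3, whose exact structure is given only as an image, and the seven-class output dimension is an assumption.

```python
import torch
import torch.nn as nn

class ExpressionClassifier(nn.Module):
    """Simple classifier trained only on the separated expression feature vector."""
    def __init__(self, expr_dim=124, num_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(expr_dim, 256), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, e):
        return self.net(e)

@torch.no_grad()
def predict_expression(G_en, clf, x):
    """Test stage: extract features with the trained encoder, keep only the expression part."""
    e, _ = torch.split(G_en(x), [EXPR_DIM, IRR_DIM], dim=1)
    return clf(e).argmax(dim=1)
```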
The framework can thus separate the expression-related features from the expression-irrelevant features, ignore the irrelevant features and recognize the expression only from the expression-related features, thereby overcoming the interference of irrelevant features with the expression recognition process and improving the accuracy of expression recognition.

Claims (8)

1. An expression recognition method based on feature separation is characterized by comprising the following steps:
S1, providing a feature separation framework, and using the framework to perform expression feature exchange between two input images;
S2, training the feature separation framework under the constraint of multiple loss functions, so that during feature separation the features that truly determine the expression converge into the expression feature vector and the expression-irrelevant features converge into the irrelevant feature vector;
S3, after training of the feature separation framework is finished, training a classifier on the expression feature vectors separated by the framework, and recognizing the expression.
2. The expression recognition method based on feature separation according to claim 1, wherein the step S1 specifically includes the following steps:
S1.1, first, providing a feature separation framework based on a generative adversarial network, which is composed of a generator G and a discriminator D, where the generator G includes an encoder Gen and a decoder Gde;
S1.2, arbitrarily reading two images from a training set as an input first sample and an input second sample, and then respectively performing feature extraction on the input first sample and the input second sample by using the encoder Gen;
S1.3, dividing the extracted feature vector into an expression feature vector and an irrelevant feature vector according to a set proportion to obtain a first expression feature vector, a second expression feature vector, a first irrelevant feature vector and a second irrelevant feature vector; then exchanging the first expression feature vector and the second expression feature vector, and keeping the first irrelevant feature vector and the second irrelevant feature vector unchanged;
S1.4, generating two new images, namely a first generated sample and a second generated sample, by using the decoder Gde according to the feature vectors recombined after the exchange;
S1.5, inputting the first generated sample and its corresponding real sample, namely the third sample, into the discriminator D, inputting the second generated sample and its corresponding real sample, namely the fourth sample, into the discriminator D, and simultaneously carrying out real/fake discrimination and classification.
3. The expression recognition method based on feature separation according to claim 1, wherein the step S2 comprises the following steps:
S2.1, in order to make the expression-irrelevant features converge into the irrelevant vector, a content loss L_con is introduced; the first sample and the third sample, and the second sample and the fourth sample, differ only in expression while their other characteristics are the same, so by constraining the content loss between the constraint samples, namely the third and fourth samples, and the generated images, the generator G converges the expression-irrelevant features into the unchanged feature vector that does not take part in the exchange;
S2.2, in order to make the expression-related features converge into the expression feature vector, a classification loss is introduced; during training of the feature separation framework, the classification loss on real images L_cls^r is used to optimize the discriminator D and thereby improve its classification ability, while the classification loss on generated images L_cls^f is used to optimize the generator G, so that the generator realizes expression exchange of the generated images through feature exchange and the information determining the expression of a generated image converges into the expression feature vector that takes part in the exchange;
S2.3, in order to make the generated images close to the real samples, so that the content loss and the classification loss converge better, an adversarial loss L_adv is introduced;
S2.4, in order to reduce the distance between the expression features of samples of the same class in the feature space, improve the separation purity of the expression features and thus improve the expression recognition accuracy, a center loss L_cen is introduced;
S2.5, finally, training the feature separation framework according to the combined loss function L_G of the generator G and the combined loss function L_D of the discriminator D; L_G and L_D are respectively:
L_G = L_adv + λ_cls·L_cls^f + λ_con·L_con + λ_cen·L_cen;
L_D = -L_adv + λ_cls·L_cls^r;
wherein λ_cls, λ_con and λ_cen are respectively the weights of the classification loss, the content loss and the center loss in the final loss function, and need to be determined through extensive experiments.
4. The expression recognition method based on feature separation according to claim 3, wherein in step S2.1, the expression of the content loss L_con is as follows:
L_con = E_x[ ||x_o - G(x_i)||_1 ];
wherein x_i and x_o are the input sample and the corresponding constraint sample respectively, G(x_i) represents the generated sample obtained from the input sample x_i, and E_x represents the mathematical expectation over the input samples.
5. The expression recognition method based on feature separation according to claim 3, wherein in step S2.2, L_cls^r and L_cls^f are respectively:
L_cls^r = E_{x,c}[ -log D_cls(c|x) ];
L_cls^f = E_{x,c}[ -log D_cls(c|G(x)) ];
where x represents the input sample, c represents the class of sample x, G(x) represents the sample generated from sample x, D_cls(c|x) represents the probability that the discriminator D recognizes the input sample x as class c, and E_{x,c} indicates the mathematical expectation over the input sample x and its class c.
6. The expression recognition method based on feature separation according to claim 3, wherein in step S2.3, the expression of the adversarial loss L_adv is as follows:
L_adv = E_x[ log D_src(x_o) ] + E_x[ log(1 - D_src(G(x_i))) ];
wherein x_i and x_o are respectively the input sample and the constraint sample, G(x_i) is the image generated from the input sample x_i, D_src(x) is the probability that the discriminator D judges sample x to be a real sample, and E_x represents the mathematical expectation over the input samples.
7. The expression recognition method based on feature separation according to claim 3, wherein in step S2.4, the expression of the center loss L_cen is as follows:
L_cen = (1/2) Σ_i ||e_i - c^k_{y_i}||_2^2;
wherein e_i represents the expression feature of sample i, y_i represents the class of sample i, and c^k_{y_i} represents the center feature vector of the class of sample i in the k-th training iteration; the center feature vector of each expression is initialized to a random value, and during training the center feature vector of each expression is updated as follows:
Δc^t_j = ( Σ_i δ(y_i = j)·(c^t_j - e_i) ) / ( 1 + Σ_i δ(y_i = j) );
c^{t+1}_j = c^t_j - α·Δc^t_j;
wherein Δc^t_j represents the average distance between the samples of class j and the corresponding center vector in the t-th iteration, c^t_j and c^{t+1}_j respectively represent the center vector of class j in the t-th and (t+1)-th iterations, α is the learning rate for updating the center vectors, α ∈ (0,1), and its specific value needs to be determined through extensive experiments.
8. The expression recognition method based on feature separation according to claim 1, wherein the step S3 specifically includes the following steps:
S3.1, after training of the feature separation framework is completed, first using the trained encoder Gen to perform feature extraction and separation on any input sample, and then training a simple convolutional neural network as a classifier on the separated expression feature vectors, the classifier using the cross-entropy loss as its optimization objective function;
S3.2, after training of the classifier is finished, reading in test samples from the test set, first using the previously trained encoder Gen to extract and separate the features of each test sample, and then using the classifier to recognize the expression from the separated expression feature vector.
CN201910941100.7A 2019-09-30 2019-09-30 Expression recognition method based on feature separation Active CN110717434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910941100.7A CN110717434B (en) 2019-09-30 2019-09-30 Expression recognition method based on feature separation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910941100.7A CN110717434B (en) 2019-09-30 2019-09-30 Expression recognition method based on feature separation

Publications (2)

Publication Number Publication Date
CN110717434A true CN110717434A (en) 2020-01-21
CN110717434B CN110717434B (en) 2023-05-23

Family

ID=69212138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910941100.7A Active CN110717434B (en) 2019-09-30 2019-09-30 Expression recognition method based on feature separation

Country Status (1)

Country Link
CN (1) CN110717434B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042952A1 (en) * 2017-08-03 2019-02-07 Beijing University Of Technology Multi-task Semi-Supervised Online Sequential Extreme Learning Method for Emotion Judgment of User
CN109508669A (en) * 2018-11-09 2019-03-22 厦门大学 A kind of facial expression recognizing method based on production confrontation network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042952A1 (en) * 2017-08-03 2019-02-07 Beijing University Of Technology Multi-task Semi-Supervised Online Sequential Extreme Learning Method for Emotion Judgment of User
CN109508669A (en) * 2018-11-09 2019-03-22 厦门大学 A kind of facial expression recognizing method based on production confrontation network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡敏; 余胜男; 王晓华: "Facial expression recognition method based on a constrained cycle-consistent generative adversarial network" *

Also Published As

Publication number Publication date
CN110717434B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
Li et al. Lightweight attention convolutional neural network for retinal vessel image segmentation
CN108537743B (en) Face image enhancement method based on generation countermeasure network
CN108615010B (en) Facial expression recognition method based on parallel convolution neural network feature map fusion
Sargin et al. Audiovisual synchronization and fusion using canonical correlation analysis
He et al. Multi view facial action unit detection based on CNN and BLSTM-RNN
CN114398961B (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN111274921B (en) Method for recognizing human body behaviors by using gesture mask
CN111523462A (en) Video sequence list situation recognition system and method based on self-attention enhanced CNN
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN111709306A (en) Double-current network behavior identification method based on multilevel space-time feature fusion enhancement
Huang et al. End-to-end continuous emotion recognition from video using 3D ConvLSTM networks
CN110826462A (en) Human body behavior identification method of non-local double-current convolutional neural network model
CN110969073B (en) Facial expression recognition method based on feature fusion and BP neural network
CN112115796A (en) Attention mechanism-based three-dimensional convolution micro-expression recognition algorithm
CN114299542A (en) Video pedestrian re-identification method based on multi-scale feature fusion
Garg et al. Facial expression recognition & classification using hybridization of ICA, GA, and neural network for human-computer interaction
CN111401116B (en) Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
Yin et al. Dynamic difference learning with spatio-temporal correlation for deepfake video detection
CN114511912A (en) Cross-library micro-expression recognition method and device based on double-current convolutional neural network
CN110415261B (en) Expression animation conversion method and system for regional training
Liu et al. Discriminative Feature Representation Based on Cascaded Attention Network with Adversarial Joint Loss for Speech Emotion Recognition.
CN113850182A (en) Action identification method based on DAMR-3 DNet
Wei et al. A survey of facial expression recognition based on deep learning
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture
CN116884072A (en) Facial expression recognition method based on multi-level and multi-scale attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant