CN112036281B - Facial expression recognition method based on improved capsule network - Google Patents

Facial expression recognition method based on improved capsule network

Info

Publication number
CN112036281B
CN112036281B (application number CN202010860025.4A)
Authority
CN
China
Prior art keywords
capsule
layer
expression
capsules
network
Prior art date
Legal status
Active
Application number
CN202010860025.4A
Other languages
Chinese (zh)
Other versions
CN112036281A (en)
Inventor
张会焱
敖文刚
刘宗敏
Current Assignee
Chongqing Technology and Business University
Original Assignee
Chongqing Technology and Business University
Priority date
Filing date
Publication date
Application filed by Chongqing Technology and Business University filed Critical Chongqing Technology and Business University
Publication of CN112036281A publication Critical patent/CN112036281A/en
Application granted granted Critical
Publication of CN112036281B publication Critical patent/CN112036281B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a facial expression recognition method based on an improved capsule network, which comprises the following steps: inputting sample pictures into the improved capsule network for training; inputting real-world pictures into the improved capsule network for recognition and extracting the facial expressions in them. Inputting sample pictures into the improved capsule network for training specifically comprises: S1, extracting a face region from a picture through a multi-task convolutional neural network; S2, labeling the extracted face region to obtain the expression and the head pose of the face region; S3, inputting the expression and the head pose of the face region into a generative adversarial network, the generative adversarial network generating a face region bearing the expression; S4, inputting the face region bearing the expression into the improved capsule network to train the improved capsule network. The method accurately recognizes facial expressions under different head poses without requiring the pose of the person to be considered, so recognition accuracy is ensured while recognition efficiency is effectively improved.

Description

Facial expression recognition method based on improved capsule network
Technical Field
The invention relates to a facial expression recognition method, in particular to a facial expression recognition method based on an improved capsule network.
Background
Facial expression recognition is widely used in modern production and daily life. It mainly relies on deep convolutional neural network frameworks, which can recognize a human face to a certain extent but struggle to recognize facial expressions under different head poses; the face angle must be adjusted during recognition, which reduces recognition efficiency. Moreover, facial expression recognition relies on the individual features of organs such as the eyes, nose and mouth, and existing methods cannot capture the relative positions among these organs, so recognition accuracy is low.
Therefore, a technical means is needed to solve the above problems.
Disclosure of Invention
In view of the above, the present invention aims to provide a facial expression recognition method based on an improved capsule network, which can accurately recognize facial expressions under different poses without considering the pose of the human body, so that recognition accuracy is ensured and recognition efficiency is effectively improved.
The invention provides a facial expression recognition method based on an improved capsule network, which comprises the following steps:
inputting sample pictures into an improved capsule network for training;
inputting a real-world picture into the improved capsule network for recognition, and extracting the facial expression in the real-world picture;
inputting sample pictures into the improved capsule network for training specifically comprises:
S1, extracting a face region from a picture through a multi-task convolutional neural network;
S2, labeling the extracted face region to obtain the expression and the head pose of the face region;
S3, inputting the expression and the head pose of the face region into a generative adversarial network, the generative adversarial network generating a face region bearing the expression;
S4, inputting the face region bearing the expression into the improved capsule network to train the improved capsule network.
Further, in step S3, the generative adversarial network generating the face region bearing the expression specifically comprises:
the generative adversarial network comprises an encoder and a decoder;
inputting the expression and the head pose of the face region into the encoder for processing, the encoder outputting the face picture features, the expression and the pose;
inputting the face picture features, the expression and the pose into the decoder for processing, the decoder outputting a face picture bearing the expression;
constructing the objective function of the generative adversarial network:

$$\min_G \max_D V(D,G)=\mathbb{E}_{x,y\sim p_d(x,y)}\big[\log D(x,y)\big]+\mathbb{E}_{x,y\sim p_d(x,y)}\big[\log\big(1-D(G(x,y),y)\big)\big]$$

wherein x is the face region, y is the expression label and pose label, D(x, y) is the output of the discriminator, which is true or false; G(x, y) is the output of the generator, namely a generated face picture; D(G(x, y), y) is the result of the discriminator D judging the face picture generated by the generator G; p_d(x, y) is the joint probability of x and y; E_{x,y~p_d(x,y)} is the expectation with respect to p_d(x, y);
the face picture bearing the expression output by the decoder is judged with the objective function of the generative adversarial network, and the face picture bearing the expression whose judging result is true is output.
Further, step S4 specifically comprises:
the improved capsule network has a ReLU convolution layer, an initial capsule layer prim_cap, a first capsule convolution layer conv_cap1, a second capsule convolution layer conv_cap2, and a classification capsule layer class_cap;
inputting the face picture bearing the expression generated by the generative adversarial network into the ReLU convolution layer for processing, and outputting the local features of the face picture;
the initial capsule layer prim_cap processes the local features of the face picture output by the ReLU convolution layer and outputs 32 capsules;
the first capsule convolution layer conv_cap1 processes the 32 capsules output by the initial capsule layer and outputs 32 capsules;
the second capsule convolution layer conv_cap2 processes the 32 capsules output by the first capsule convolution layer conv_cap1 and outputs 32 capsules;
the classification capsule layer class_cap processes the 32 capsules output by the second capsule convolution layer conv_cap2 and outputs 7 capsules, the 7 capsules corresponding to the 7 classes of facial expressions.
Further, the transfer of the 32 capsules output by the initial capsule layer prim_cap through the first capsule convolution layer conv_cap1 and the second capsule convolution layer conv_cap2 to the classification capsule layer class_cap is realized in turn by T-EM routing, which specifically comprises:
determining the voting matrix V_ij from lower-layer capsule i to higher-layer capsule j:

$$V_{ij}=P_i\cdot W_{ij}\qquad(1)$$

wherein P_i is the pose matrix of lower-layer capsule i and W_ij is the viewpoint-invariant transformation matrix from lower-layer capsule i to higher-layer capsule j; the k-th element of the voting matrix V_ij is denoted V_ij^k.
The probability that element V_ij^k belongs to higher-layer capsule j is determined by the T distribution:

$$p_{ij}^{k}=\frac{\Gamma\!\left(\frac{\nu_j^k+1}{2}\right)}{\Gamma\!\left(\frac{\nu_j^k}{2}\right)\sqrt{\pi\,\nu_j^k\,(\sigma_j^k)^2}}\left(1+\frac{(\delta_{ij}^k)^2}{\nu_j^k}\right)^{-\frac{\nu_j^k+1}{2}}\qquad(2)$$

wherein Γ(·) is the gamma function, δ_ij^k is the Mahalanobis distance from element V_ij^k to the mean μ_j^k, μ_j^k is the expectation of the T distribution, ν_j^k is the degrees of freedom of the T distribution, (σ_j^k)² is the variance of the T distribution, and π is the circular constant;
wherein

$$\delta_{ij}^k=\frac{V_{ij}^k-\mu_j^k}{\sigma_j^k}.$$

The loss function C for classifying the I lower-layer capsules into the J higher-layer capsules is:

$$C=-\sum_{i=1}^{I}\sum_{j=1}^{J}R_{ij}\,\ln\prod_{k}p_{ij}^{k}\qquad(3)$$

wherein R_ij is the weight with which the i-th lower-layer capsule is assigned to the j-th higher-layer capsule.
The pose matrix P_j and the activation a_j of a higher-layer capsule are obtained from the pose matrices P_i and the activations a_i of the lower-layer capsules by minimizing formula (3) through the T-EM routing process, specifically:
initializing the parameters:

$$R_{ij}=\frac{1}{J},\quad i=1,\ldots,I,\ j=1,\ldots,J;$$

wherein J is the number of higher-layer capsules.
M step:

$$R_{ij}=R_{ij}\times a_i,\quad i=1,\ldots,I;$$

$$\mu_j^k=\frac{\sum_i R_{ij}V_{ij}^k}{\sum_i R_{ij}};\qquad (\sigma_j^k)^2=\frac{\sum_i R_{ij}\,(V_{ij}^k-\mu_j^k)^2}{\sum_i R_{ij}};$$

$$cost_j^k=\big(\beta_v+\ln\sigma_j^k\big)\sum_i R_{ij};$$

wherein β_a and β_v are trainable variables and λ is a temperature coefficient with value 0.01;

$$a_j=\operatorname{logistic}\!\Big(\lambda\big(\beta_a-\sum_k cost_j^k\big)\Big);$$

the degrees of freedom ν_j^k are obtained by numerically solving a fixed-point equation in which the degrees of freedom of the T distribution from the previous iteration appear.
E step: determining the T-distribution-based routing from the parameters computed in the M step:

$$p_{ij}=\prod_k p_{ij}^k;\qquad R_{ij}=\frac{a_j\,p_{ij}}{\sum_{j'=1}^{J}a_{j'}\,p_{ij'}};$$

wherein the operator Π_k denotes a product and p_ij^k is evaluated with the T distribution (2).
After a set number of iterations of the M step and the E step, the pose matrix P_j of the higher-layer capsule is obtained, and each element of the pose matrix P_j is the mean μ_j^k of the corresponding elements V_ij^k of the voting matrices.
Further, the capsule network is trained through a propagation loss function, and the propagation loss function by which the lower-layer capsules activate the t-th higher-layer capsule is:

$$L=\sum_{i\neq t}\big(\max\big(0,\,m-(a_t-a_i)\big)\big)^2;$$

wherein m is a variable margin with initial value 0.2 and maximum value 0.9, a_t is the activation value of the activated parent capsule, and a_i is the activation value of an inactivated parent capsule.
The invention has the beneficial effect that facial expressions under different poses can be accurately recognized without considering the pose of the human body, so that recognition accuracy is ensured while recognition efficiency is effectively improved.
Drawings
The invention is further described below with reference to the accompanying drawings and examples:
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
The invention provides a facial expression recognition method based on an improved capsule network, which comprises the following steps:
inputting sample pictures into the improved capsule network for training;
inputting a real-world picture into the improved capsule network for recognition, and extracting the facial expression in the real-world picture;
inputting sample pictures into the improved capsule network for training specifically comprises:
S1, extracting a face region from a picture through a multi-task convolutional neural network;
S2, labeling the extracted face region to obtain the expression and the head pose of the face region;
S3, inputting the expression and the head pose of the face region into a generative adversarial network, the generative adversarial network generating a face region bearing the expression;
S4, inputting the face region bearing the expression into the improved capsule network to train the improved capsule network; a minimal end-to-end sketch of these four steps is given below. In this way, facial expressions under different poses can be accurately recognized without considering the pose of the human body, so that recognition accuracy is ensured while recognition efficiency is effectively improved.
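For readers who want a concrete picture of steps S1-S4, the following is a minimal Python sketch of the training pipeline, assuming PyTorch and the facenet-pytorch MTCNN detector as the multi-task convolutional neural network; the gan and capsule_net objects and all function names are illustrative placeholders rather than the patent's own implementation.

```python
# Sketch of steps S1-S4; names and the facenet-pytorch dependency are assumptions.
import torch
from facenet_pytorch import MTCNN   # multi-task CNN face detector (assumed stand-in)

mtcnn = MTCNN(image_size=224)       # S1: face-region extractor, crops 224x224 faces

def prepare_training_batch(images, expression_labels, pose_labels, gan, capsule_net):
    # S1: cropped face regions (mtcnn returns None when no face is found; skipped here)
    faces = torch.stack([mtcnn(img) for img in images])
    # S2: expression and head-pose labels are assumed to accompany the sample pictures
    y = torch.cat([expression_labels, pose_labels], dim=1)
    # S3: the conditional GAN synthesizes expressive face pictures for augmentation
    synthetic_faces = gan.generator(faces, y)
    # S4: real and synthetic faces are fed to the improved capsule network for training
    logits = capsule_net(torch.cat([faces, synthetic_faces], dim=0))
    return logits
```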
In this embodiment, in step S3, the generative adversarial network generating the face region bearing the expression specifically comprises:
the generative adversarial network comprises an encoder and a decoder;
inputting the expression and the head pose of the face region into the encoder for processing, the encoder outputting the face picture features, the expression and the pose; the input of the encoder is a 224 × 224 × 3 face picture and the output is a 50-dimensional face picture feature f; the encoder consists of five convolution layers and one fully connected layer, wherein the convolution kernels are 5 × 5 with ReLU activation functions and the fully connected layer has a tanh activation function;
inputting the face picture features, the expression and the pose into the decoder for processing, the decoder outputting a face picture bearing the expression; the decoder consists of seven deconvolution layers with 5 × 5 convolution kernels, the first six deconvolution layers having ReLU activation functions and the last deconvolution layer having a tanh activation function;
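The encoder and decoder described above can be sketched as follows in PyTorch; the channel widths, strides, and the way the expression/pose labels are concatenated and tiled are assumptions chosen only so that the stated layer counts, kernel sizes, activations, and input/output sizes (224 × 224 × 3 in, 50-dimensional feature f, 224 × 224 × 3 out) line up.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Five 5x5 conv layers (ReLU) + one fully connected layer (tanh): 224x224x3 -> 50-d feature f."""
    def __init__(self, feat_dim=50):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256]            # channel widths are an assumption
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, kernel_size=5, stride=2, padding=2),
                       nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)            # 224 -> 112 -> 56 -> 28 -> 14 -> 7
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(256 * 7 * 7, feat_dim), nn.Tanh())

    def forward(self, x):
        return self.fc(self.conv(x))

class Decoder(nn.Module):
    """Seven 5x5 transposed-conv layers; ReLU on the first six, tanh on the last: (f, labels) -> 224x224x3."""
    def __init__(self, feat_dim=50, label_dim=10):
        super().__init__()
        cin = feat_dim + label_dim                    # conditioning by concatenation (assumption)
        chans = [cin, 256, 256, 128, 64, 32, 16, 3]
        strides = [1, 2, 2, 2, 2, 2, 1]               # 7 -> 7 -> 14 -> 28 -> 56 -> 112 -> 224 -> 224
        layers = []
        for k, (ci, co, s) in enumerate(zip(chans[:-1], chans[1:], strides)):
            layers.append(nn.ConvTranspose2d(ci, co, kernel_size=5, stride=s,
                                             padding=2, output_padding=s - 1))
            layers.append(nn.Tanh() if k == 6 else nn.ReLU(inplace=True))
        self.deconv = nn.Sequential(*layers)

    def forward(self, f, y):
        z = torch.cat([f, y], dim=1)                  # tile the conditioned vector over a 7x7 map
        return self.deconv(z[:, :, None, None].expand(-1, -1, 7, 7))
```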
constructing the objective function of the generative adversarial network:

$$\min_G \max_D V(D,G)=\mathbb{E}_{x,y\sim p_d(x,y)}\big[\log D(x,y)\big]+\mathbb{E}_{x,y\sim p_d(x,y)}\big[\log\big(1-D(G(x,y),y)\big)\big]$$

wherein x is the face region, y is the expression label and pose label, D(x, y) is the output of the discriminator, which is true or false; G(x, y) is the output of the generator, namely a generated face picture; D(G(x, y), y) is the result of the discriminator D judging the face picture generated by the generator G; p_d(x, y) is the joint probability of x and y; E_{x,y~p_d(x,y)} is the expectation with respect to p_d(x, y);
the face picture bearing the expression output by the decoder is judged with the objective function of the generative adversarial network, and the face picture bearing the expression whose judging result is true is output. In this way the facial expression in the sample picture can be accurately extracted, which facilitates subsequent training and final recognition.
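A sketch of one training step under the objective above, using the usual binary-cross-entropy surrogate for the log D and log(1 − D) terms; the discriminator signature D(x, y) returning a logit is an assumption.

```python
import torch
import torch.nn.functional as F

def gan_step(encoder, decoder, D, x, y, opt_g, opt_d):
    f = encoder(x)                               # face-picture features
    x_fake = decoder(f, y)                       # G(x, y): generated expressive face

    # Discriminator: maximize log D(x, y) + log(1 - D(G(x, y), y))
    d_real = D(x, y)
    d_fake = D(x_fake.detach(), y)
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator so that D(G(x, y), y) is judged "true"
    d_fake = D(x_fake, y)
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```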
In this embodiment, step S4 specifically comprises:
the improved capsule network has a ReLU convolution layer, an initial capsule layer prim_cap, a first capsule convolution layer conv_cap1, a second capsule convolution layer conv_cap2, and a classification capsule layer class_cap; the input of the ReLU convolution layer conv_relu is a 28 × 28 × 3 facial expression picture and the output is a 14 × 14 × 32 local feature map; it consists of a 5 × 5 convolution layer, a batch normalization layer and a ReLU layer;
inputting the face picture bearing the expression generated by the generative adversarial network into the ReLU convolution layer for processing, and outputting the local features of the face picture;
the initial capsule layer prim_cap takes as input the local features output by the ReLU convolution layer and outputs 32 capsules; it consists of two parallel 1 × 1 convolution layers with stride 1, which respectively produce the pose matrices and the activations of the output capsules;
the first capsule convolution layer conv_cap1 processes the 32 capsules output by the initial capsule layer and outputs 32 capsules;
the second capsule convolution layer conv_cap2 processes the 32 capsules output by the first capsule convolution layer conv_cap1 and outputs 32 capsules;
the classification capsule layer class_cap processes the 32 capsules output by the second capsule convolution layer conv_cap2 and outputs 7 capsules, the 7 capsules corresponding to the 7 classes of facial expressions.
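The conv_relu and prim_cap layers can be sketched as below; a 4 × 4 pose matrix per capsule and a sigmoid on the activation branch are assumptions, and the conv_cap1, conv_cap2 and class_cap layers would apply the T-EM routing sketched further on.

```python
import torch
import torch.nn as nn

class CapsuleStem(nn.Module):
    """conv_relu + prim_cap as described above; pose size 4x4 and the sigmoid are assumptions."""
    def __init__(self, n_caps=32, pose=4):
        super().__init__()
        # conv_relu: 28x28x3 -> 14x14x32 (5x5 convolution, batch normalization, ReLU)
        self.conv_relu = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
        )
        # prim_cap: two parallel 1x1 convolutions, stride 1 -> pose matrices and activations
        self.pose_conv = nn.Conv2d(32, n_caps * pose * pose, kernel_size=1, stride=1)
        self.act_conv = nn.Conv2d(32, n_caps, kernel_size=1, stride=1)
        self.n_caps, self.pose = n_caps, pose

    def forward(self, x):                         # x: (B, 3, 28, 28)
        h = self.conv_relu(x)                     # (B, 32, 14, 14)
        poses = self.pose_conv(h).reshape(x.size(0), self.n_caps, self.pose, self.pose, 14, 14)
        acts = torch.sigmoid(self.act_conv(h))    # (B, 32, 14, 14) capsule activations
        return poses, acts
```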
Specifically: the 32 capsules output by the initial capsule layer prim_cap are sequentially realized from a first capsule convolution layer conv_cap1, a second capsule convolution layer conv_cap2 to a classification capsule layer class_cap through a T-EM route, and the method specifically comprises the following steps:
determining a voting matrix V of lower layer capsules i to higher layer capsules j ij
V ij =P i ·W ij
wherein ,Pi Is the gesture matrix of the capsule i of the lower layer, W ij A view invariant matrix for lower layer capsules i to higher layer capsules j;
wherein the voting matrix V ij The kth element in (a)
Figure BDA0002647769550000081
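A one-line NumPy sketch of the vote computation for 4 × 4 pose matrices (the pose size is an assumption):

```python
import numpy as np

def votes(P, W):
    """P: (I, 4, 4) lower-layer pose matrices; W: (I, J, 4, 4) trainable
    viewpoint-invariant matrices. Returns V: (I, J, 4, 4) voting matrices V_ij = P_i @ W_ij."""
    return np.einsum('iab,ijbc->ijac', P, W)
```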
The probability that element V_ij^k belongs to higher-layer capsule j is determined by the T distribution:

$$p_{ij}^{k}=\frac{\Gamma\!\left(\frac{\nu_j^k+1}{2}\right)}{\Gamma\!\left(\frac{\nu_j^k}{2}\right)\sqrt{\pi\,\nu_j^k\,(\sigma_j^k)^2}}\left(1+\frac{(\delta_{ij}^k)^2}{\nu_j^k}\right)^{-\frac{\nu_j^k+1}{2}}\qquad(2)$$

wherein Γ(·) is the gamma function, δ_ij^k is the Mahalanobis distance from element V_ij^k to the mean μ_j^k, μ_j^k is the expectation of the T distribution, ν_j^k is the degrees of freedom of the T distribution, (σ_j^k)² is the variance of the T distribution, and π is the circular constant;
wherein

$$\delta_{ij}^k=\frac{V_{ij}^k-\mu_j^k}{\sigma_j^k}.$$
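The per-element T density of formula (2) can be evaluated in log space for numerical stability; a NumPy/SciPy sketch:

```python
import numpy as np
from scipy.special import gammaln

def t_log_density(v, mu, sigma2, nu):
    """v, mu, sigma2, nu: arrays broadcastable to (I, J, K); returns log p_ij^k of formula (2)."""
    delta2 = (v - mu) ** 2 / sigma2                      # squared Mahalanobis distance
    return (gammaln((nu + 1) / 2) - gammaln(nu / 2)
            - 0.5 * np.log(np.pi * nu * sigma2)
            - (nu + 1) / 2 * np.log1p(delta2 / nu))
```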
The loss function C for classifying the I lower-layer capsules into the J higher-layer capsules is:

$$C=-\sum_{i=1}^{I}\sum_{j=1}^{J}R_{ij}\,\ln\prod_{k}p_{ij}^{k}\qquad(3)$$

wherein R_ij is the weight with which the i-th lower-layer capsule is assigned to the j-th higher-layer capsule.
The pose matrix P_j and the activation a_j of a higher-layer capsule are obtained from the pose matrices P_i and the activations a_i of the lower-layer capsules by minimizing formula (3) through the T-EM routing process, specifically:
initializing the parameters:

$$R_{ij}=\frac{1}{J},\quad i=1,\ldots,I,\ j=1,\ldots,J;$$

wherein J is the number of higher-layer capsules.
M step:

$$R_{ij}=R_{ij}\times a_i,\quad i=1,\ldots,I;$$

$$\mu_j^k=\frac{\sum_i R_{ij}V_{ij}^k}{\sum_i R_{ij}};\qquad (\sigma_j^k)^2=\frac{\sum_i R_{ij}\,(V_{ij}^k-\mu_j^k)^2}{\sum_i R_{ij}};$$

$$cost_j^k=\big(\beta_v+\ln\sigma_j^k\big)\sum_i R_{ij};$$

wherein β_a and β_v are trainable variables and λ is a temperature coefficient with value 0.01;

$$a_j=\operatorname{logistic}\!\Big(\lambda\big(\beta_a-\sum_k cost_j^k\big)\Big);$$

the degrees of freedom ν_j^k are obtained by numerically solving a fixed-point equation in which the degrees of freedom of the T distribution from the previous iteration appear.
E step: determining the T-distribution-based routing from the parameters computed in the M step:

$$p_{ij}=\prod_k p_{ij}^k;\qquad R_{ij}=\frac{a_j\,p_{ij}}{\sum_{j'=1}^{J}a_{j'}\,p_{ij'}};$$

wherein the operator Π_k denotes a product and p_ij^k is evaluated with the T distribution (2).
After a set number of iterations of the M step and the E step, the pose matrix P_j of the higher-layer capsule is obtained, and each element of the pose matrix P_j is the mean μ_j^k of the corresponding elements V_ij^k of the voting matrices. In this way each capsule can be trained, which ensures the accuracy of the final recognition.
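Putting the M step and E step together, the following NumPy sketch runs T-EM routing between two capsule layers; it reuses the t_log_density helper from the earlier sketch, and for brevity it keeps the degrees of freedom fixed instead of solving the patent's fixed-point update. The shapes and the small numerical-stability constants are assumptions.

```python
import numpy as np

def t_em_routing(V, a_in, beta_a, beta_v, lambda_=0.01, nu=5.0, iters=3):
    """V: (I, J, K) votes; a_in: (I,) lower-layer activations.
    Returns pose means mu: (J, K) and higher-layer activations a_out: (J,)."""
    I, J, K = V.shape
    R = np.full((I, J), 1.0 / J)                          # initialization: R_ij = 1/J
    for _ in range(iters):
        # ---- M step ----
        Rw = R * a_in[:, None]                            # R_ij <- R_ij * a_i
        s = Rw.sum(axis=0)[:, None]                       # (J, 1)
        mu = (Rw[:, :, None] * V).sum(axis=0) / s         # weighted mean of the votes
        sigma2 = (Rw[:, :, None] * (V - mu) ** 2).sum(axis=0) / s + 1e-8
        cost = (beta_v + 0.5 * np.log(sigma2)) * s        # per-element cost (ln sigma = 0.5 ln sigma^2)
        a_out = 1.0 / (1.0 + np.exp(-lambda_ * (beta_a - cost.sum(axis=1))))
        # ---- E step ----
        logp = t_log_density(V, mu, sigma2, nu).sum(axis=2)   # log of prod_k p_ij^k
        logit = np.log(a_out + 1e-8) + logp
        R = np.exp(logit - logit.max(axis=1, keepdims=True))
        R /= R.sum(axis=1, keepdims=True)                 # normalized routing weights R_ij
    return mu, a_out
```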
In this embodiment, the capsule network is trained through a propagation loss function, and the trainable variables β_a and β_v are obtained during training. The propagation loss function by which the lower-layer capsules activate the t-th higher-layer capsule is:

$$L=\sum_{i\neq t}\big(\max\big(0,\,m-(a_t-a_i)\big)\big)^2;$$

wherein m is a variable margin with initial value 0.2 and maximum value 0.9, a_t is the activation value of the activated parent capsule, and a_i is the activation value of an inactivated parent capsule.
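A PyTorch sketch of this propagation (spread) loss with the margin m scheduled from 0.2 towards 0.9; the linear schedule over training steps is an assumption.

```python
import torch

def spread_loss(activations, target, step, total_steps):
    """activations: (B, 7) class-capsule activations; target: (B,) expression class indices."""
    m = 0.2 + 0.7 * min(1.0, step / total_steps)          # margin grows from 0.2 to 0.9
    a_t = activations.gather(1, target[:, None])          # activation of the target capsule
    mask = torch.ones_like(activations).scatter(1, target[:, None], 0.0)  # exclude i == t
    loss = torch.clamp(m - (a_t - activations), min=0.0) ** 2
    return (loss * mask).sum(dim=1).mean()
```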
Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the technical solution of the present invention, all of which are intended to be covered by the scope of the claims of the present invention.

Claims (2)

1. A facial expression recognition method based on an improved capsule network, characterized by comprising the following steps:
inputting sample pictures into the improved capsule network for training;
inputting a real-world picture into the improved capsule network for recognition, and extracting the facial expression in the real-world picture;
inputting sample pictures into the improved capsule network for training specifically comprises:
S1, extracting a face region from a picture through a multi-task convolutional neural network;
S2, labeling the extracted face region to obtain the expression and the head pose of the face region;
S3, inputting the expression and the head pose of the face region into a generative adversarial network, the generative adversarial network generating a face region bearing the expression;
S4, inputting the face region bearing the expression into the improved capsule network to train the improved capsule network;
in step S3, the generative adversarial network generating the face region bearing the expression specifically comprises:
the generative adversarial network comprises an encoder and a decoder;
inputting the expression and the head pose of the face region into the encoder for processing, the encoder outputting the face picture features, the expression and the pose;
inputting the face picture features, the expression and the pose into the decoder for processing, the decoder outputting a face picture bearing the expression;
constructing the objective function of the generative adversarial network:

$$\min_G \max_D V(D,G)=\mathbb{E}_{x,y\sim p_d(x,y)}\big[\log D(x,y)\big]+\mathbb{E}_{x,y\sim p_d(x,y)}\big[\log\big(1-D(G(x,y),y)\big)\big]$$

wherein x is the face region, y is the expression label and pose label, D(x, y) is the output of the discriminator, which is true or false; G(x, y) is the output of the generator, namely a generated face picture; D(G(x, y), y) is the result of the discriminator D judging the face picture generated by the generator G; p_d(x, y) is the joint probability of x and y; E_{x,y~p_d(x,y)} is the expectation with respect to p_d(x, y);
the face picture bearing the expression output by the decoder is judged with the objective function of the generative adversarial network, and the face picture bearing the expression whose judging result is true is output;
step S4 specifically comprises:
the improved capsule network has a ReLU convolution layer, an initial capsule layer prim_cap, a first capsule convolution layer conv_cap1, a second capsule convolution layer conv_cap2, and a classification capsule layer class_cap;
inputting the face picture bearing the expression generated by the generative adversarial network into the ReLU convolution layer for processing, and outputting the local features of the face picture;
the initial capsule layer prim_cap processes the local features of the face picture output by the ReLU convolution layer and outputs 32 capsules;
the first capsule convolution layer conv_cap1 processes the 32 capsules output by the initial capsule layer and outputs 32 capsules;
the second capsule convolution layer conv_cap2 processes the 32 capsules output by the first capsule convolution layer conv_cap1 and outputs 32 capsules;
the classification capsule layer class_cap processes the 32 capsules output by the second capsule convolution layer conv_cap2 and outputs 7 capsules, the 7 capsules corresponding to the 7 classes of facial expressions;
the transfer of the 32 capsules output by the initial capsule layer prim_cap through the first capsule convolution layer conv_cap1 and the second capsule convolution layer conv_cap2 to the classification capsule layer class_cap is realized in turn by T-EM routing, which specifically comprises:
determining the voting matrix V_ij from lower-layer capsule i to higher-layer capsule j:

$$V_{ij}=P_i\cdot W_{ij}\qquad(1)$$

wherein P_i is the pose matrix of lower-layer capsule i and W_ij is the viewpoint-invariant transformation matrix from lower-layer capsule i to higher-layer capsule j; the k-th element of the voting matrix V_ij is denoted V_ij^k;
the probability that element V_ij^k belongs to higher-layer capsule j is determined by the T distribution:

$$p_{ij}^{k}=\frac{\Gamma\!\left(\frac{\nu_j^k+1}{2}\right)}{\Gamma\!\left(\frac{\nu_j^k}{2}\right)\sqrt{\pi\,\nu_j^k\,(\sigma_j^k)^2}}\left(1+\frac{(\delta_{ij}^k)^2}{\nu_j^k}\right)^{-\frac{\nu_j^k+1}{2}}\qquad(2)$$

wherein Γ(·) is the gamma function, δ_ij^k is the Mahalanobis distance from element V_ij^k to the mean μ_j^k, μ_j^k is the expectation of the T distribution, ν_j^k is the degrees of freedom of the T distribution, (σ_j^k)² is the variance of the T distribution, and π is the circular constant;
wherein

$$\delta_{ij}^k=\frac{V_{ij}^k-\mu_j^k}{\sigma_j^k};$$
the loss function C for classifying the I lower-layer capsules into the J higher-layer capsules is:

$$C=-\sum_{i=1}^{I}\sum_{j=1}^{J}R_{ij}\,\ln\prod_{k}p_{ij}^{k}\qquad(3)$$

wherein R_ij is the weight with which the i-th lower-layer capsule is assigned to the j-th higher-layer capsule;
the pose matrix P_j and the activation a_j of a higher-layer capsule are obtained from the pose matrices P_i and the activations a_i of the lower-layer capsules by minimizing formula (3) through the T-EM routing process, specifically:
initializing the parameters:

$$R_{ij}=\frac{1}{J},\quad i=1,\ldots,I,\ j=1,\ldots,J;$$

wherein J is the number of higher-layer capsules.
M step:

$$R_{ij}=R_{ij}\times a_i,\quad i=1,\ldots,I;$$

$$\mu_j^k=\frac{\sum_i R_{ij}V_{ij}^k}{\sum_i R_{ij}};\qquad (\sigma_j^k)^2=\frac{\sum_i R_{ij}\,(V_{ij}^k-\mu_j^k)^2}{\sum_i R_{ij}};$$

$$cost_j^k=\big(\beta_v+\ln\sigma_j^k\big)\sum_i R_{ij};$$

wherein β_a and β_v are trainable variables and λ is a temperature coefficient;

$$a_j=\operatorname{logistic}\!\Big(\lambda\big(\beta_a-\sum_k cost_j^k\big)\Big);$$

the degrees of freedom ν_j^k are obtained by numerically solving a fixed-point equation in which the degrees of freedom of the T distribution from the previous iteration appear;
E step: determining the T-distribution-based routing from the parameters computed in the M step:

$$p_{ij}=\prod_k p_{ij}^k;\qquad R_{ij}=\frac{a_j\,p_{ij}}{\sum_{j'=1}^{J}a_{j'}\,p_{ij'}};$$

wherein the operator Π_k denotes a product and p_ij^k is evaluated with the T distribution (2);
after a set number of iterations of the M step and the E step, the pose matrix P_j of the higher-layer capsule is obtained, and each element of the pose matrix P_j is the mean μ_j^k of the corresponding elements V_ij^k of the voting matrices.
2. The facial expression recognition method based on an improved capsule network according to claim 1, characterized in that the capsule network is trained through a propagation loss function, and the propagation loss function by which the lower-layer capsules activate the t-th higher-layer capsule is:

$$L=\sum_{i\neq t}\big(\max\big(0,\,m-(a_t-a_i)\big)\big)^2;$$

wherein m is a variable margin with initial value 0.2 and maximum value 0.9, a_t is the activation value of the activated parent capsule, and a_i is the activation value of an inactivated parent capsule.
CN202010860025.4A 2020-07-29 2020-08-24 Facial expression recognition method based on improved capsule network Active CN112036281B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010746073 2020-07-29
CN2020107460730 2020-07-29

Publications (2)

Publication Number Publication Date
CN112036281A CN112036281A (en) 2020-12-04
CN112036281B true CN112036281B (en) 2023-06-09

Family

ID=73581053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010860025.4A Active CN112036281B (en) 2020-07-29 2020-08-24 Facial expression recognition method based on improved capsule network

Country Status (1)

Country Link
CN (1) CN112036281B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507916B (en) * 2020-12-16 2021-07-27 苏州金瑞阳信息科技有限责任公司 Face detection method and system based on facial expression

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446609A (en) * 2018-03-02 2018-08-24 南京邮电大学 A kind of multi-angle human facial expression recognition method based on generation confrontation network
CN108764031A (en) * 2018-04-17 2018-11-06 平安科技(深圳)有限公司 Identify method, apparatus, computer equipment and the storage medium of face
CN109063724A (en) * 2018-06-12 2018-12-21 中国科学院深圳先进技术研究院 A kind of enhanced production confrontation network and target sample recognition methods
CN109934116A (en) * 2019-02-19 2019-06-25 华南理工大学 A kind of standard faces generation method based on generation confrontation mechanism and attention mechanism
CN110197125A (en) * 2019-05-05 2019-09-03 上海资汇信息科技有限公司 Face identification method under unconfined condition
CN110533004A (en) * 2019-09-07 2019-12-03 哈尔滨理工大学 A kind of complex scene face identification system based on deep learning
CN111241958A (en) * 2020-01-06 2020-06-05 电子科技大学 Video image identification method based on residual error-capsule network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190303742A1 (en) * 2018-04-02 2019-10-03 Ca, Inc. Extension of the capsule network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446609A (en) * 2018-03-02 2018-08-24 南京邮电大学 A kind of multi-angle human facial expression recognition method based on generation confrontation network
CN108764031A (en) * 2018-04-17 2018-11-06 平安科技(深圳)有限公司 Identify method, apparatus, computer equipment and the storage medium of face
CN109063724A (en) * 2018-06-12 2018-12-21 中国科学院深圳先进技术研究院 A kind of enhanced production confrontation network and target sample recognition methods
CN109934116A (en) * 2019-02-19 2019-06-25 华南理工大学 A kind of standard faces generation method based on generation confrontation mechanism and attention mechanism
CN110197125A (en) * 2019-05-05 2019-09-03 上海资汇信息科技有限公司 Face identification method under unconfined condition
CN110533004A (en) * 2019-09-07 2019-12-03 哈尔滨理工大学 A kind of complex scene face identification system based on deep learning
CN111241958A (en) * 2020-01-06 2020-06-05 电子科技大学 Video image identification method based on residual error-capsule network

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
Capsule GAN Using Capsule Network for Generator Architecture; Kanako Marusaki et al.; arXiv; 2020-03-18; pp. 1-7 *
Capsule Networks Need an Improved Routing Algorithm; Inyoung Paik et al.; arXiv; 2019-07-31; pp. 1-14 *
Dynamic Routing Between Capsules; Sara Sabour et al.; arXiv; 2017-10-26; pp. 1-11 *
Matrix Capsules with EM Routing; Geoffrey E. Hinton et al.; ICLR 2018; 2019-02-18; last paragraph of p. 3, Fig. 1 on p. 4, paragraphs 1-2 of p. 4 *
Facial expression recognition based on GAN; 陈霖 et al.; 《电子技术与软件工程》 (Electronic Technology & Software Engineering); 2020-01-15 (No. 01); Fig. 1 on p. 1, first paragraph of the left column on p. 2 *
Robust facial expression recognition based on generative adversarial networks; 姚乃明 et al.; 《自动化学报》 (Acta Automatica Sinica); 2018-04-18 (No. 05); pp. 865-877 *
Research on facial expression feature extraction and recognition algorithms based on capsule networks; 姚玉倩; 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology); 2020-01-15 (No. 01); Figs. 2-4 on p. 14, pp. 29, 48-51 *
A survey of generative adversarial networks; 罗佳 et al.; 《仪器仪表学报》 (Chinese Journal of Scientific Instrument); 2019-03-15 (No. 03); pp. 74-84 *
A survey of capsule network models; 杨巨成 et al.; 《山东大学学报(工学版)》 (Journal of Shandong University (Engineering Science)); 2019-11-05 (No. 06); pp. 1-10 *

Also Published As

Publication number Publication date
CN112036281A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN109409297B (en) Identity recognition method based on dual-channel convolutional neural network
US9317785B1 (en) Method and system for determining ethnicity category of facial images based on multi-level primary and auxiliary classifiers
JP4571628B2 (en) Face recognition system and method
Tivive et al. A gender recognition system using shunting inhibitory convolutional neural networks
CN107704813B (en) Face living body identification method and system
CN106529504B (en) A kind of bimodal video feeling recognition methods of compound space-time characteristic
Kaluri et al. An enhanced framework for sign gesture recognition using hidden Markov model and adaptive histogram technique.
CN112733627B (en) Finger vein recognition method based on fusion local and global feature network
CN112329683A (en) Attention mechanism fusion-based multi-channel convolutional neural network facial expression recognition method
CN112001215B (en) Text irrelevant speaker identity recognition method based on three-dimensional lip movement
CN110555463B (en) Gait feature-based identity recognition method
CN112036281B (en) Facial expression recognition method based on improved capsule network
US20220207305A1 (en) Multi-object detection with single detection per object
CN109063626A (en) Dynamic human face recognition methods and device
CN112200074A (en) Attitude comparison method and terminal
Garg et al. Facial expression recognition & classification using hybridization of ICA, GA, and neural network for human-computer interaction
Tautkutė et al. Classifying and visualizing emotions with emotional DAN
CN113159002B (en) Facial expression recognition method based on self-attention weight auxiliary module
CN111680550A (en) Emotion information identification method and device, storage medium and computer equipment
CN113076916A (en) Dynamic facial expression recognition method and system based on geometric feature weighted fusion
Abedi et al. Modification of deep learning technique for face expressions and body postures recognitions
US11048926B2 (en) Adaptive hand tracking and gesture recognition using face-shoulder feature coordinate transforms
Xu et al. Skeleton guided conflict-free hand gesture recognition for robot control
Nimitha et al. Supervised chromosomal anomaly detection using VGG-16 CNN model
Boulahia et al. 3D multistroke mapping (3DMM): Transfer of hand-drawn pattern representation for skeleton-based gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant