CN112651301A - Expression recognition method integrating global and local features of human face - Google Patents


Info

Publication number
CN112651301A
CN112651301A
Authority
CN
China
Prior art keywords
global
local
features
network
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011422314.2A
Other languages
Chinese (zh)
Inventor
张繁
陆秀芹
李小薪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202011422314.2A
Publication of CN112651301A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/245Classification techniques relating to the decision surface
    • G06F18/2453Classification techniques relating to the decision surface non-linear, e.g. polynomial classifier
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/169Holistic features and representations, i.e. based on the facial image taken as a whole
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Abstract

An expression recognition method integrating global and local features of a human face designs three branches in the feature extraction stage of the model: the first branch extracts global facial features from the whole face image; the second branch cuts the face into two parts from top to bottom, and the third branch cuts it into three parts from top to bottom, extracting local facial features. The extracted global and local features are fused and input into a softmax classifier to classify the facial expression, yielding a recognition model that integrates the global and local features of the face. Experiments show that extracting the global and local features separately and then fusing them improves the accuracy of facial expression recognition.

Description

Expression recognition method integrating global and local features of human face
Technical Field
The invention belongs to the fields of facial expression recognition, image processing and artificial intelligence, and particularly relates to an expression recognition method integrating global and local features of a human face.
Background
Facial Expression Recognition (FER) is an important means for machines to perceive human emotion and interact with humans, and is an important research direction in human-computer interaction. Analyzing facial expressions can roughly reveal a person's current emotion and latent intention, which plays a very important role in social interaction between people. At the same time, progress in facial expression recognition technology can promote the development of multiple fields such as image processing and artificial intelligence.
Traditional facial expression recognition methods extract facial features manually and then input them into a classifier to classify expressions; they are complex to design and achieve low recognition rates. Convolutional neural networks integrate feature extraction and classification into an end-to-end network, overcoming the drawbacks of manual feature extraction to a certain extent. However, most existing research uses a single network to extract only the global features of the face, which easily ignores inconspicuous details and features with low occurrence frequency and loses part of the facial detail features. In addition, directly training a deep network on a small dataset easily causes overfitting and degrades the network model, which affects the accuracy of facial expression recognition.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an expression recognition method integrating global and local features of a human face. By dividing the face into several parts and extracting both global and local features, it addresses the incomplete facial feature extraction of prior methods, as well as the overfitting, gradient vanishing and model degradation problems of deep network training, thereby improving the accuracy of facial expression recognition.
The technical scheme adopted by the invention for solving the technical problems is as follows:
an expression recognition method integrating global and local features of a human face comprises the following steps:
Step 1: preprocessing the sample images to obtain face gray-scale images of uniform size;
Step 2: dividing the images preprocessed in step 1 into a training set and a testing set for training and testing the network model;
Step 3: extracting the global facial features using the Global branch of the facial expression recognition network model integrating Global and Local facial features (GL-FER), and extracting the detail features of the face using two Local branches;
Step 4: fusing the global and local facial features extracted in step 3, and inputting the fused features into a softmax classifier for training, obtaining a recognition model integrating the global and local features of the face;
Step 5: inputting the test sample data into the recognition model designed in step 4, achieving more accurate output of the expression recognition results.
Further, in step 1, the preprocessing process includes: first performing face detection on the sample data in the database using the Haar-feature-based Adaboost algorithm; then obtaining a gray-scale image with uniform illumination using a gray-level normalization method based on histogram equalization; then performing size normalization on the gray-scale image to obtain gray-scale images of the same size; and finally performing pixel normalization on the images so that the network converges quickly.
Still further, the original images of step 1 are all resized to a uniform size of 64 × 64 pixels.
Furthermore, in step 3, the process by which the expression recognition network model integrating global and local facial features (GL-FER) extracts the global and detail features of the face includes:
convolving the face image with a 7 × 7 convolution and adding BN and ReLU activation functions, which accelerates the network learning rate and prevents network overfitting, then outputting the dimensionality-reduced image through a max pooling operation;
designing a network residual block (Basic Block);
designing 4 layers of Conv Blocks: Conv Block2, Conv Block3, Conv Block4 and Conv Block5, where each Conv Block layer consists of a different number of network residual blocks;
designing one Global branch structure and two Local branch structures (Local branch-1 and Local branch-2) after Conv Block3;
performing average pooling on the global branch after Conv Block5;
in the fifth-layer network structure, Local branch-1 and Local branch-2 cut the image from top to bottom, Local branch-1 into two blocks and Local branch-2 into three blocks, to extract the detail features of the face;
performing dimensionality reduction on the global features (Global features) and detail features (Local features-1 and Local features-2) extracted by the three branches using max pooling;
performing a dimensionality reduction operation on the extracted local and global feature output vectors using 1 × 1 convolutions to keep the feature vector dimensions consistent;
designing the network residual block as follows: convolving the input of the previous layer with a 1 × 1 kernel and adding BN and ReLU activation functions, convolving with a 3 × 3 kernel and adding BN and ReLU activation functions, convolving again with a 1 × 1 kernel, and using a 1 × 1 convolution between the input and output of each network residual block to keep the output dimensions consistent.
Preferably, in the GL-FER network model, Conv Block2 consists of three network residual blocks, Conv Block3 of four, Conv Block4 of six, and Conv Block5 of three.
The first three network layers of the three GL-FER branches share the Conv Block2 and Conv Block3 layers; the network splits into three branches at Conv Block4. The first branch is the Global branch, which extracts global features; the second and third branches, Local branch-1 and Local branch-2, extract detail features.
The average pooling operation in GL-FER operates only on the feature vectors in the global branch and the global feature vectors in Local branch-1 and Local branch-2. The local features in Local branch-1 and Local branch-2 are not average-pooled, in order to retain more image information and let the network learn more detailed features.
In step 4, the GL-FER network model training process includes: using the softmax function and the Triplet Loss function for metric learning as the loss functions during training, and continuously modifying the network weights and biases in back propagation to optimize the network model.
Preferably, in step 4, the global feature vectors of the three branches in GL-FER use the triplet loss, and the detail feature vectors use the softmax loss.
The invention has the following beneficial effects: 1. The expression recognition method integrating global and local features can extract the global and local features of the face simultaneously. The global features represent the contour information and the spatial distribution of the facial organs, while the local features represent the detail information of the face; the two are complementary, which improves the accuracy of facial expression recognition. 2. A residual network structure is designed and the batch normalization technique is used, solving the gradient dispersion and network degradation problems in deep network training, optimizing the network model, and alleviating the overfitting problem to a certain extent.
Drawings
Fig. 1 is a flowchart of an expression recognition method integrating global and local features of a human face.
Fig. 2 is a diagram of a GL-FER network architecture.
Fig. 3 is a diagram of a residual network architecture.
FIG. 4 shows the recognition rate of the method of the present invention on the CK+ database.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 4, the expression recognition method integrating global and local features of a human face first preprocesses the original images, including gray-level normalization, size normalization and pixel normalization, and divides the preprocessed images into a training set and a test set. It then constructs the GL-FER network model, inputs the training-set images into the GL-FER network model, and extracts the global and local features of the face; the global and local features are fused and input into a softmax classifier for training and recognizing facial expressions, obtaining a recognition model integrating the global and local features of the face. Tests show that fusing the global and local features of the face in this way effectively improves the accuracy of facial expression recognition.
The main implementation process of the invention comprises preprocessing, GL-FER model construction, and model testing. Each is described in detail below.
The sample image preprocessing process is as follows:
the invention adopts images in a CK + database as sample data. The database contains 593 picture sequences from 123 subjects, each with a label of an action unit in the last frame, of which 327 picture sequences of 118 subjects mark seven basic emoji labels.
The method comprises the steps of firstly, carrying out face detection on sample data in a database by using an Adaboost algorithm based on Haar features, and judging whether the sample data is a face image. And then processing the gray level image in the database by using a histogram equalization-based method to obtain a gray level image with uniform illumination. And then, carrying out size normalization on the gray level image to obtain a gray level image with the same size, and finally carrying out pixel normalization processing on the image to scale the pixel value to 0-1, so that the network can be conveniently and rapidly converged.
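As a concrete illustration, the preprocessing pipeline above might be sketched in Python as follows. This is a minimal sketch assuming OpenCV's bundled Haar cascade serves as the Haar-feature Adaboost detector; the function name and detector parameters are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

# OpenCV ships this Haar cascade file; it implements a Haar-feature
# Adaboost face detector of the kind the preprocessing step describes.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(image_path, size=64):
    """Face detection -> histogram equalization -> resize -> pixel scaling."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                          # sample is not a face image
    x, y, w, h = faces[0]                    # keep the first detected face
    face = gray[y:y + h, x:x + w]
    face = cv2.equalizeHist(face)            # gray-level normalization
    face = cv2.resize(face, (size, size))    # size normalization (64 x 64)
    return face.astype(np.float32) / 255.0   # pixel normalization to [0, 1]
```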
The preprocessed images are divided into a training set and a testing set for later model training and testing.
The GL-FER network construction process is as follows: FIG. 2 is a diagram of a GL-FER network architecture.
The GL-FER backbone network uses ResNet50 and splits the part after Conv Block3 into three branches with similar structures but different downsampling rates.
The leftmost branch, called the Global branch, extracts global features; the middle (Local branch-1) and right (Local branch-2) branches are local branches that extract local features. The specific structure is described in sections 2.1–2.8 below.
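One plausible way to realize this split on top of torchvision's ResNet50 is sketched below, under the assumption that Conv Block2–Conv Block5 correspond to torchvision's layer1–layer4; the branch copies and names are illustrative only.

```python
import copy
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)

# Shared trunk: stem + Conv Block2 (layer1) + Conv Block3 (layer2).
shared = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                       backbone.maxpool, backbone.layer1, backbone.layer2)

def make_branch():
    # Each branch gets its own copy of Conv Block4 (layer3) + Conv Block5 (layer4).
    return nn.Sequential(copy.deepcopy(backbone.layer3),
                         copy.deepcopy(backbone.layer4))

global_branch, local_branch1, local_branch2 = (make_branch() for _ in range(3))
```

The "different downsampling rates" noted above could be obtained by, for example, setting the first stride of layer4 to 1 in the local branches so that their feature maps keep a higher resolution before being cut into strips; that variation is omitted here for brevity.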
2.1 GL-FER first-layer network structure:
The preprocessed feature map matrix $C_1$ is convolved with a 7 × 7 convolution kernel, as shown in equation (1), where $C_2^i$ is the $i$-th output feature map of the first convolutional layer, $w_2^i$ is the convolution-kernel weight of the $i$-th layer-2 feature map obtained by network training, and $b_2^i$ is the bias of the current layer:

$$C_2^i = C_1 * w_2^i + b_2^i \tag{1}$$

The values $A = (x_1, x_2, \ldots, x_n)$ of all neurons of the feature map $C_2$ are forced back to a standard normal distribution with mean 0 and variance 1 using the BN technique:

$$\hat{x}_i = \frac{x_i - \mu_A}{\sqrt{\sigma_A^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta \tag{2}$$

where $\mu_A$ is the mean of the sample data $A$, $\sigma_A^2$ is the variance of the sample data $A$, and $\gamma$ and $\beta$ are two adjustable parameters introduced during network training; the training weights are changed using $\gamma$ and $\beta$ in forward and backward propagation. $\mathrm{ReLU}(x) = \max(0, x)$ is the activation function that introduces non-linearity into the neurons.

Max pooling is then performed on the output feature map, as shown in equation (3), where $m$ is the size of the pooled image region:

$$p = \max_{(x, y) \in m \times m} C(x, y) \tag{3}$$
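For orientation, the first-layer structure just described (7 × 7 convolution per equation (1), BN per equation (2), ReLU, and max pooling per equation (3)) could be written in PyTorch as below; the channel count of 64 is an illustrative assumption, and the single input channel reflects the gray-scale preprocessing.

```python
import torch.nn as nn

# First-layer stem: 7x7 convolution (eq. 1), BN (eq. 2), ReLU, max pooling (eq. 3).
stem = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),     # learns the gamma/beta rescaling of equation (2)
    nn.ReLU(inplace=True),  # ReLU(x) = max(0, x)
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
```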
2.2 Residual network structure in GL-FER
The residual network fits a potential identity mapping function to the network so that the network remains in an optimal state and its performance does not degrade as the depth increases.
As shown in equation (4), the feature $x_L$ of an arbitrarily deep unit $L$ can be expressed as the feature $x_l$ of a shallow unit $l$ plus a sum of residual functions of the form $F(x_i, W_i)$:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \tag{4}$$

Fig. 3 shows a residual network block in GL-FER. Each residual block contains three convolution operations, each followed by the BN technique and the ReLU activation function, with a 1 × 1 convolution between the input and output of each residual block to keep the image dimensions consistent.
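A minimal PyTorch sketch of this residual block follows; the channel widths are illustrative assumptions, and the 1 × 1 projection on the shortcut mirrors the dimension-matching convolution described above.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions with BN and ReLU, plus a 1x1
    convolution between input and output to keep dimensions consistent."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(          # 1x1 projection shortcut
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x_L = x_l + F(x_l, W), as in equation (4)
        return self.relu(self.body(x) + self.shortcut(x))
```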
2.3 GL-FER second-layer network structure:
Specifically: the second-layer network structure Conv Block2 is designed using the basic network residual block. Conv Block2 consists of three of the network residual blocks designed in section 2.2.
2.4 GL-FER third-layer network structure:
Specifically: the third-layer network structure Conv Block3 is designed using the basic network residual block. Conv Block3 consists of four of the network residual blocks designed in section 2.2.
2.5 GL-FER fourth-layer network structure:
Specifically: the GL-FER splits into three branches starting from the fourth-layer network Conv Block4, as shown in fig. 2. The leftmost branch, called the Global branch, extracts global features; the middle (Local branch-1) and right (Local branch-2) branches extract local features. Each branch consists of six of the network residual blocks designed in section 2.2.
2.6 GL-FER fifth-layer network structure:
Specifically: following the branch structure of section 2.5, three network residual blocks extract features on each branch in the Conv Block5 network layer.
After feature extraction, the Global branch is average-pooled, as shown in equation (5), where $C^n$ is the feature map of the $n$-th layer with size $m \times m$:

$$a^n = \frac{1}{m \times m} \sum_{x=1}^{m} \sum_{y=1}^{m} C^n(x, y) \tag{5}$$
After feature extraction, Local branch-1 longitudinally divides the image into two blocks from top to bottom and extracts local features.
After feature extraction, Local branch-2 longitudinally divides the image into three blocks from top to bottom and extracts local features.
The local features extracted by Local branch-1 and Local branch-2 are not average-pooled; only the global features are average-pooled according to equation (5).
A max pooling operation is then performed on the global and detail features extracted by the three branches, and the image is compressed through a 1 × 1 convolution to keep the feature vector dimensions consistent.
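The branch heads of sections 2.5–2.6 might be sketched as follows; this is one assumed reading of the strip cutting and pooling, with feat_g, feat_l1 and feat_l2 standing for the Conv Block5 outputs of the three branches (the names are hypothetical).

```python
import torch
import torch.nn.functional as F

def branch_heads(feat_g, feat_l1, feat_l2):
    """feat_*: (N, C, H, W) feature maps output by Conv Block5 on each branch."""
    # Global branch: average pooling over the whole map, equation (5).
    g = F.adaptive_avg_pool2d(feat_g, 1).flatten(1)              # (N, C)

    # Local branch-1: cut top-to-bottom into two horizontal strips; the strips
    # are only max-pooled (no average pooling), preserving more detail.
    parts1 = [F.adaptive_max_pool2d(s, 1).flatten(1)
              for s in feat_l1.chunk(2, dim=2)]

    # Local branch-2: cut top-to-bottom into three horizontal strips.
    parts2 = [F.adaptive_max_pool2d(s, 1).flatten(1)
              for s in feat_l2.chunk(3, dim=2)]

    # A 1x1 convolution (omitted here) would then bring all vectors
    # to a consistent dimension, as the text describes.
    return g, parts1 + parts2     # 1 global vector + 5 local vectors
```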
2.7 GL-FER network loss function
The global features train the network parameters using the Triplet Loss function for metric learning.
As shown in equation (6), $x^a$ is an anchor sample, $x^p$ and $x^n$ denote a sample of the same class as $x^a$ and a sample of a different class, respectively, and $\alpha$ is the margin required between the distance from $x^a$ to $x^n$ and the minimum distance from $x^a$ to $x^p$. The subscript $+$ means that the bracketed value is taken as the loss when it is greater than zero, and the loss is zero when it is less than zero:

$$L_{triplet} = \sum \left[ \lVert f(x^a) - f(x^p) \rVert_2^2 - \lVert f(x^a) - f(x^n) \rVert_2^2 + \alpha \right]_+ \tag{6}$$
The local features use the softmax loss function to train the network parameters.
In equation (7), $m$ denotes the total number of input samples, $n$ the number of classes, and $w$ the network weights. The indicator $p\{y_i = j\}$ tests whether the true label equals the output label: if they are equal, $p\{y_i = j\} = 1$; otherwise its value is 0:

$$L_{softmax} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} p\{y_i = j\} \log \frac{e^{w_j^{T} x_i}}{\sum_{k=1}^{n} e^{w_k^{T} x_i}} \tag{7}$$
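A sketch of how the two loss functions might be combined during training, using PyTorch's built-in implementations of equations (6) and (7); the margin value and the simple unweighted sum are assumptions, not parameters stated in the patent.

```python
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.3)   # equation (6); margin assumed
softmax_loss = nn.CrossEntropyLoss()              # equation (7)

def gl_fer_loss(anchor, positive, negative, local_logits, labels):
    """anchor/positive/negative: global embeddings of x^a, x^p, x^n;
    local_logits: list of per-strip classifier outputs; labels: class ids."""
    loss = triplet_loss(anchor, positive, negative)
    for logits in local_logits:   # each local feature vector gets a softmax loss
        loss = loss + softmax_loss(logits, labels)
    return loss
```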
2.8 Network feature fusion
The global feature vector and the 5 local feature vectors output in section 2.6 are concatenated to obtain the fused features used for expression recognition and classification.
In equation (8), $x(i, j)$ and $y(i, j)$ denote feature vectors of the same dimension at the same position $(i, j)$ across channels $d$, and $z(i, j)$ denotes the concatenated feature vector:

$$z(i, j) = \mathrm{concat}\big( x(i, j),\; y(i, j) \big) \tag{8}$$
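In implementation terms, the channel-wise concatenation of equation (8) reduces to a single torch.cat over the six vectors; a minimal sketch:

```python
import torch

def fuse(global_vec, local_vecs):
    """global_vec: (N, C); local_vecs: list of five (N, C) local vectors.
    Returns the (N, 6*C) fused feature of equation (8) for classification."""
    return torch.cat([global_vec] + local_vecs, dim=1)
```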
The GL-FER network test process is as follows: after being preprocessed as in section 1, the test samples are input into the GL-FER network model trained as described in section 2, finally obtaining the expression recognition accuracy shown in FIG. 4, namely 90.24%. The bold data on the diagonal represent the correct recognition rate for each expression.
The embodiments described in this specification merely illustrate implementations of the inventive concept and are intended for purposes of illustration only. The scope of the present invention should not be construed as being limited to the particular forms set forth in the embodiments, but rather covers the equivalents that those skilled in the art can conceive based on the inventive concept.

Claims (9)

1. An expression recognition method integrating global and local features of a human face is characterized by comprising the following steps:
Step 1: preprocessing the sample images to obtain face gray-scale images of uniform size;
Step 2: dividing the images preprocessed in step 1 into a training set and a testing set for training and testing the network model;
Step 3: extracting the global facial features using the global branch of the expression recognition network model GL-FER integrating global and local facial features, and extracting the detail features of the face using two local branches;
Step 4: fusing the global and local facial features extracted in step 3, and inputting the fused features into a softmax classifier for training, obtaining a recognition model integrating the global and local features of the face;
Step 5: inputting the test sample data into the recognition model designed in step 4, achieving more accurate output of the expression recognition results.
2. The expression recognition method integrating global and local features of a human face according to claim 1, wherein in step 1 the preprocessing process comprises: first performing face detection on the sample data in the database using the Haar-feature-based Adaboost algorithm; then obtaining a gray-scale image with uniform illumination using a gray-level normalization method based on histogram equalization; then performing size normalization on the gray-scale image to obtain gray-scale images of the same size; and finally performing pixel normalization on the images so that the network converges quickly.
3. The expression recognition method integrating global and local features of a human face according to claim 1 or 2, wherein the original images of step 1 are all resized to a uniform size of 64 × 64 pixels.
4. The expression recognition method integrating global and local features of a human face according to claim 1 or 2, wherein in step 3 the process by which the expression recognition network model integrating global and local facial features extracts the global and detail features of the face comprises:
convolving the face image with a 7 × 7 convolution and adding BN and ReLU activation functions, which accelerates the network learning rate and prevents network overfitting, then outputting the dimensionality-reduced image through a max pooling operation;
designing a network residual block (Basic Block);
designing 4 layers of Conv Blocks: Conv Block2, Conv Block3, Conv Block4 and Conv Block5, where each Conv Block layer consists of a different number of network residual blocks;
designing one Global branch structure and two Local branch structures (Local branch-1 and Local branch-2) after Conv Block3;
performing average pooling on the global branch after Conv Block5;
in the fifth-layer network structure, Local branch-1 and Local branch-2 cut the image from top to bottom, Local branch-1 into two blocks and Local branch-2 into three blocks, to extract the detail features of the face;
performing dimensionality reduction on the global features (Global features) and detail features (Local features-1 and Local features-2) extracted by the three branches using max pooling;
performing a dimensionality reduction operation on the extracted local and global feature output vectors using 1 × 1 convolutions to keep the feature vector dimensions consistent;
designing the network residual block as follows: convolving the input of the previous layer with a 1 × 1 kernel and adding BN and ReLU activation functions, convolving with a 3 × 3 kernel and adding BN and ReLU activation functions, convolving again with a 1 × 1 kernel, and using a 1 × 1 convolution between the input and output of each network residual block to keep the output dimensions consistent.
5. The expression recognition method integrating global and local features of a human face according to claim 1 or 2, wherein in the GL-FER network model, Conv Block2 consists of three network residual blocks, Conv Block3 of four, Conv Block4 of six, and Conv Block5 of three.
6. The expression recognition method integrating global and local features of a human face according to claim 1 or 2, wherein the first three network layers of the three GL-FER branches share the Conv Block2 and Conv Block3 network layers, and the network splits into three branches at Conv Block4: the first branch is the Global branch, which extracts global features; the second and third branches, Local branch-1 and Local branch-2, extract detail features.
7. The expression recognition method integrating global and local features of a human face according to claim 1 or 2, wherein the average pooling operation in GL-FER operates only on the feature vectors in the global branch and the global feature vectors in Local branch-1 and Local branch-2; the local features in Local branch-1 and Local branch-2 are not average-pooled, so as to retain more image information and help the network learn more detailed features.
8. The expression recognition method integrating global and local features of a human face according to claim 1 or 2, wherein in step 4 the GL-FER network model training process comprises: using the softmax function and the Triplet Loss function for metric learning as the loss functions during training, and continuously modifying the network weights and biases in back propagation to optimize the network model.
9. The expression recognition method integrating global and local features of a human face according to claim 1 or 2, wherein in step 4 the global feature vectors of the three branches in GL-FER use the triplet loss, and the detail feature vectors use the softmax loss.
CN202011422314.2A 2020-12-08 2020-12-08 Expression recognition method integrating global and local features of human face Pending CN112651301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011422314.2A CN112651301A (en) 2020-12-08 2020-12-08 Expression recognition method integrating global and local features of human face

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011422314.2A CN112651301A (en) 2020-12-08 2020-12-08 Expression recognition method integrating global and local features of human face

Publications (1)

Publication Number Publication Date
CN112651301A true CN112651301A (en) 2021-04-13

Family

ID=75350433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011422314.2A Pending CN112651301A (en) 2020-12-08 2020-12-08 Expression recognition method integrating global and local features of human face

Country Status (1)

Country Link
CN (1) CN112651301A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107742117A (en) * 2017-11-15 2018-02-27 北京工业大学 A kind of facial expression recognizing method based on end to end model
CN110069994A (en) * 2019-03-18 2019-07-30 中国科学院自动化研究所 Face character identifying system, method based on face multizone
CN110532900A (en) * 2019-08-09 2019-12-03 西安电子科技大学 Facial expression recognizing method based on U-Net and LS-CNN
CN111144348A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159002A (en) * 2021-05-26 2021-07-23 重庆大学 Facial expression recognition method based on self-attention weight auxiliary module
WO2023065503A1 (en) * 2021-10-19 2023-04-27 中国科学院深圳先进技术研究院 Facial expression classification method and electronic device
CN114612987A (en) * 2022-03-17 2022-06-10 深圳集智数字科技有限公司 Expression recognition method and device
WO2023173646A1 (en) * 2022-03-17 2023-09-21 深圳须弥云图空间科技有限公司 Expression recognition method and apparatus

Similar Documents

Publication Publication Date Title
CN105069400B (en) Facial image gender identifying system based on the sparse own coding of stack
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN113496217B (en) Method for identifying human face micro expression in video image sequence
CN112651301A (en) Expression recognition method integrating global and local features of human face
CN108648191B (en) Pest image recognition method based on Bayesian width residual error neural network
US20190228268A1 (en) Method and system for cell image segmentation using multi-stage convolutional neural networks
CN109101938B (en) Multi-label age estimation method based on convolutional neural network
CN109359541A (en) A kind of sketch face identification method based on depth migration study
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
CN109753950B (en) Dynamic facial expression recognition method
CN111639544A (en) Expression recognition method based on multi-branch cross-connection convolutional neural network
CN112052772A (en) Face shielding detection algorithm
CN111523462A (en) Video sequence list situation recognition system and method based on self-attention enhanced CNN
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN111639719A (en) Footprint image retrieval method based on space-time motion and feature fusion
CN112800876A (en) Method and system for embedding hypersphere features for re-identification
CN112446891A (en) Medical image segmentation method based on U-Net network brain glioma
CN109002755A (en) Age estimation model building method and estimation method based on facial image
CN112800891B (en) Discriminative feature learning method and system for micro-expression recognition
CN114038037A (en) Expression label correction and identification method based on separable residual attention network
CN112633257A (en) Potato disease identification method based on improved convolutional neural network
CN111783698A (en) Method for improving training stability of face recognition model
CN113011243A (en) Facial expression analysis method based on capsule network
CN112364705A (en) Light-weight CNN expression recognition method based on multilevel feature fusion
CN114463812A (en) Low-resolution face recognition method based on dual-channel multi-branch fusion feature distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination