CN113642383A - Face expression recognition method based on joint loss multi-feature fusion - Google Patents

Face expression recognition method based on joint loss multi-feature fusion

Info

Publication number
CN113642383A
CN113642383A (application CN202110697155.5A)
Authority
CN
China
Prior art keywords
layer
feature
network
features
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110697155.5A
Other languages
Chinese (zh)
Inventor
苗壮
林克正
李靖宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110697155.5A priority Critical patent/CN113642383A/en
Publication of CN113642383A publication Critical patent/CN113642383A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application relates to a facial expression recognition method based on joint loss multi-feature fusion, which comprises the following steps: detecting a human face to obtain a face image; extracting features from the face image through an improved ResNet network and a VGG network respectively; reducing the dimension of the extracted features through fully-connected layers; fusing the features by weighted fusion; and sending the fused features to a Softmax layer for classification and outputting the facial expression category. The method extracts features with two neural network architectures and fully fuses the extracted features. During training, a loss function that combines cosine loss and cross-entropy loss by weighting is used; this joint loss pulls samples of the same category closely together while pushing different categories far apart.

Description

Face expression recognition method based on joint loss multi-feature fusion
Technical Field
The invention relates to a facial expression recognition method, and belongs to the field of image recognition.
Background
Facial expression recognition is one of the research hotspots in computer vision, with a wide range of applications including human-computer interaction, safe driving, intelligent monitoring, driver assistance, and criminal investigation. Current facial expression recognition algorithms are mainly based on traditional methods and deep learning methods. Traditional face feature extraction algorithms mainly include Principal Component Analysis (PCA), Scale-Invariant Feature Transform (SIFT), Local Binary Patterns (LBP), Gabor wavelet transform, and Histogram of Oriented Gradients (HOG). As research deepened and artificial intelligence technology developed, deep learning methods came to excel in the field of image recognition, and deep neural networks (DNN) applied to expression recognition achieve better performance.
However, current expression recognition methods are easily affected by image noise and human interference factors, resulting in poor recognition rates. A single-channel neural network starts from the global image and easily ignores local features, causing feature loss; the single feature extracted by a single network model is one of the reasons for the low recognition rate.
Disclosure of Invention
The invention aims to solve the technical problem of feature loss in a single convolutional neural network during facial expression recognition, and provides a facial expression recognition method based on joint loss multi-feature fusion.
In order to achieve this purpose, the invention adopts the following technical scheme:
S1, carrying out face detection on the image to be recognized to obtain a face region;
S2, extracting features from the obtained face image through an improved ResNet network;
S3, extracting features from the obtained face image through a VGG network;
S4, sending the features obtained in steps S2 and S3 into fully-connected layers for dimension reduction;
S5, fusing the dimension-reduced features from step S4 into new features by weighted fusion;
and S6, sending the new features from step S5 into the fully-connected layer for dimension reduction, then performing class prediction on the features with a Softmax layer, and outputting the class information.
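For orientation, these six steps can be sketched as a single forward pass. The following PyTorch-style code is an illustrative assumption, not the patent's reference implementation: the framework choice, the class name DualChannelFER, and the s_dim/v_dim arguments are all introduced here for the sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualChannelFER(nn.Module):
        # Two-channel recognizer: improved ResNet (global) and VGG (local) branches,
        # per-branch dimension reduction, weighted fusion, Softmax classification.
        def __init__(self, resnet_branch, vgg_branch, s_dim, v_dim, k=0.5, num_classes=7):
            super().__init__()
            self.resnet_branch = resnet_branch      # S2: improved ResNet feature extractor
            self.vgg_branch = vgg_branch            # S3: VGG feature extractor
            self.k = k                              # S5: channel weighting coefficient
            self.fc_s = nn.Sequential(nn.Linear(s_dim, 512), nn.RReLU(), nn.Linear(512, num_classes))
            self.fc_v = nn.Sequential(nn.Linear(v_dim, 512), nn.RReLU(), nn.Linear(512, num_classes))

        def forward(self, face):                    # face: S1 output, e.g. a 1x48x48 grayscale crop
            f_s = torch.flatten(self.resnet_branch(face), 1)   # global features
            f_v = torch.flatten(self.vgg_branch(face), 1)      # local features
            f_s, f_v = self.fc_s(f_s), self.fc_v(f_v)          # S4: reduce both to 7 dimensions
            f_z = self.k * f_s + (1 - self.k) * f_v            # S5: weighted fusion
            return F.softmax(f_z, dim=1)                       # S6: class probabilities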
Further, the face detection in step S1 uses an MSSD network model to obtain the face region, and includes:
S11, based on the SSD object detection network, replacing the original base network VGG-16 with the lightweight network MobileNet.
S12, fusing the 7th depthwise separable convolution layer (shallow features) of the network in step S11 with the feature maps of the last 5 layers (deep features): the six feature maps are each reshaped into one-dimensional vectors and then concatenated in series to realize multi-scale face detection.
S13, the detection network extracts features through the base network, and the meta-structure performs classification and bounding-box regression.
Further, the specific method of feature extraction from the obtained face image through the improved ResNet network in step S2 is: the residual block in the ResNet network is improved by adding a convolution operation, reducing the parameter count, modifying the number of network layers, and introducing a pre-activation method. Step S2 includes:
S21, the face image X = (x_1, x_2, ..., x_n) detected in S1 is fed into the improved ResNet network, and the corresponding global feature f_S = (f_S^1, f_S^2, ..., f_S^m) is obtained after processing by several residual blocks. The convolution operation is:

x_{l+1} = f(h(x_l) + F(x_l, W_l))

where x_l and x_{l+1} are the input and output of the l-th residual unit, F is the residual function, h(x_l) = x_l represents the identity mapping, and f is the RRelu activation function. The features learned from a shallow layer l to a deep layer L are:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

S22, after the flatten layer, the features yield the feature vector F_S.
Further, the specific content of feature extraction by the VGG network in step S3 is:
The VGG network replaces a larger convolution kernel with stacks of consecutive 3 × 3 kernels; for a given receptive field, several small kernels work better than one large kernel, because the extra activation functions provide more non-linearity and a better network structure can be trained at no extra cost. The feature extraction process of the network is as follows:
The face image detected in S1 passes through several convolution and max-pooling layers of the VGG network to obtain the corresponding local feature f_V = (f_V^1, f_V^2, ..., f_V^k); after the flatten layer, the feature vector F_V is obtained.
Further, the specific dimension-reduction method in step S4 is:
S41, the feature vector F_S extracted in step S2 is fed into two fully-connected layers fc_{1-1} and fc_{1-2} for dimension reduction, using the RRelu activation function:

RRelu(x) = x, if x >= 0; RRelu(x) = a * x, if x < 0, where a is drawn from a uniform distribution U(lower, upper)

The structure of each fully-connected layer is:

fc_{1-1} = {s_1, s_2, ..., s_512}
fc_{1-2} = {s_1, s_2, ..., s_7}

where s denotes a neuron of the current fully-connected layer; fc_{1-1} has 512 neurons and fc_{1-2} has 7 neurons, so the final output of the fully-connected layers is the 7-dimensional feature vector F'_S.
S42, the feature vector F_V extracted in step S3 is fed into two fully-connected layers fc_{2-1} and fc_{2-2} for dimension reduction, with layer structures:

fc_{2-1} = {l_1, l_2, ..., l_512}
fc_{2-2} = {l_1, l_2, ..., l_7}

where l denotes a neuron of the current fully-connected layer; fc_{2-1} has 512 neurons and fc_{2-2} has 7 neurons, so the final output of the fully-connected layers is the 7-dimensional feature vector F'_V.
Further, step S5 is specifically:
The features F'_S and F'_V obtained in step S4 are weighted and fused into the new feature F_z; a weight coefficient k is set to adjust the proportion of the two channels' features, and the fusion process is:

F_z = k * F'_S + (1 - k) * F'_V

When k takes 0 or 1, only one convolutional neural network extracts features.
Further, the Softmax activation function classification process in step S6 is:

y_i = exp(z_i) / Σ_{c=1}^{C} exp(z_c)

where Z is the output of the previous layer and the input of Softmax, with dimension C equal to the number of classes, and y_i is the predicted probability of class i.
The invention has the advantages that:
1. The method extracts features with a dual convolutional neural network, improves the base networks into structures with better performance, and then fuses the two feature vectors by weighted fusion to obtain more effective feature information.
2. Local features and global features are effectively fused in the convolutional neural network; during feature extraction, the fused features are fed into subsequent convolution layers for further extraction, enriching the information of the feature maps.
3. A new joint loss function is adopted: the weighted combination of cosine loss and cross-entropy loss pulls samples of the same category closely together while pushing different categories far apart, enhancing the discriminability of the features extracted by the neural network.
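The patent text does not reproduce the joint loss formula at this point. The following is a minimal sketch under two stated assumptions: that the cosine term pulls each sample's feature toward a learned per-class center, and that lam is the weighting coefficient between the two terms; neither detail is confirmed by the source.

    import torch
    import torch.nn.functional as F

    def joint_loss(features, logits, targets, class_centers, lam=0.5):
        # Assumed form: cross entropy plus a cosine term that pulls each sample's
        # feature toward its class center, weighted by lam. The exact formula is
        # not given in the patent text; this is an illustrative sketch only.
        ce = F.cross_entropy(logits, targets)
        centers = class_centers[targets]            # one center vector per sample
        cos = 1.0 - F.cosine_similarity(features, centers).mean()
        return ce + lam * cos                       # small cos => tight same-class clusters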
Drawings
Fig. 1 is a network diagram of MSSD face detection.
Fig. 2 is a structural diagram of an improved ResNet network.
FIG. 3 is a flow chart of a facial expression recognition method based on joint loss multi-feature fusion.
Fig. 4 is an overall structure diagram of the neural network for extracting expressive features.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiment 1
Referring to fig. 1 to 4, embodiment 1 provides a facial expression recognition method based on joint loss multi-feature fusion, which comprises the following steps:
S1, carrying out face detection on the image to be recognized to obtain a face region;
Referring to fig. 1, the biggest highlight of MobileNet is the depthwise separable convolution, which consists of a depthwise convolution followed by a pointwise convolution and greatly speeds up training and recognition, so the network is built from depthwise separable convolutions. In the MSSD network, the input passes through 1 standard convolution layer with a 3 × 3 kernel and stride 2, then through 13 depthwise separable convolution layers; the back end connects 4 standard convolution layers alternating 1 × 1 and 3 × 3 kernels and 1 max-pooling layer. Because pooling layers lose some effective features, the network's standard convolution layers use stride-2 kernels in place of pooling layers. Shallow layers have smaller receptive fields and richer detail information, which favors detecting small targets, so the MSSD face detection network fuses shallow features with deep features. Fusing the shallow features of layer 7 with the deep features works best, so the network uses the fused features of layers 7, 15, 16, 17, 18 and 19: the six feature maps are each reshaped into one-dimensional vectors and then concatenated in series to realize multi-scale face detection.
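A minimal sketch of the two building blocks just described, assuming PyTorch; the function names depthwise_separable and fuse_multiscale are illustrative, not from the patent.

    import torch
    import torch.nn as nn

    def depthwise_separable(in_ch, out_ch, stride=1):
        # Depthwise 3x3 convolution followed by a pointwise 1x1 convolution: the
        # MobileNet building block that MSSD stacks 13 times after the stem layer.
        return nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def fuse_multiscale(feature_maps):
        # Reshape each of the six selected feature maps (layers 7 and 15-19) into a
        # one-dimensional vector per sample, then concatenate them in series.
        return torch.cat([fm.flatten(1) for fm in feature_maps], dim=1)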
In step S1, the image to be recognized comes from public international facial expression datasets such as FER2013, CK+ and JAFFE, or is captured by a camera; it is then used for face detection and segmentation, with the following specific steps:
S11, based on the SSD object detection network, replacing the original base network VGG-16 with the lightweight network MobileNet.
S12, fusing the 7th depthwise separable convolution layer (shallow features) of the network in step S11 with the feature maps of the last 5 layers (deep features): the six feature maps are each reshaped into one-dimensional vectors and then concatenated in series to realize multi-scale face detection.
S13, the detection network extracts features through the base network, and the meta-structure performs classification and bounding-box regression.
Specifically, in step S1 an image is acquired from a facial expression database or a camera, the MSSD network performs face detection on it, the face region with the highest confidence is selected, background interference in the image is removed, and finally a 48 × 48 face grayscale image is obtained.
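A minimal sketch of this acquisition-and-cropping step, assuming OpenCV and a detector that returns the highest-confidence box as (x, y, w, h); the function name crop_face is illustrative.

    import cv2

    def crop_face(image_bgr, box):
        # box: (x, y, w, h) of the highest-confidence detection from the MSSD network.
        x, y, w, h = box
        face = image_bgr[y:y + h, x:x + w]             # remove background interference
        gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
        return cv2.resize(gray, (48, 48))              # 48x48 grayscale face image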
S2, extracting the characteristics of the obtained face image through an improved ResNet network;
Referring to fig. 2, the improvement changes the residual block into three convolution layers: the first and last use 1 × 1 kernels while the kernel size of the middle layer is unchanged, which adds one convolution operation yet greatly reduces the network's parameter count. Pre-activation is achieved by moving the BN layer and the activation layer in front of the convolution layers; the modified ResNet trains faster and with lower error than the original ResNet.
Step S2 specifically includes:
S21, the face image X = (x_1, x_2, ..., x_n) detected in S1 is fed into the improved ResNet network, and the corresponding global feature f_S = (f_S^1, f_S^2, ..., f_S^m) is obtained after processing by several residual blocks. The convolution operation is:

x_{l+1} = f(h(x_l) + F(x_l, W_l))

where x_l and x_{l+1} are the input and output of the l-th residual unit, F is the residual function, h(x_l) = x_l represents the identity mapping, and f is the RRelu activation function. The features learned from a shallow layer l to a deep layer L are:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

S22, after the flatten layer, the features yield the feature vector F_S.
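One possible realization of this pre-activation bottleneck, assuming PyTorch; the block name PreActBottleneck and the channel arguments are illustrative assumptions.

    import torch.nn as nn

    class PreActBottleneck(nn.Module):
        # Three convolutions (1x1 -> 3x3 -> 1x1) with BN and RReLU moved in front of
        # each convolution (pre-activation); the 1x1 layers cut the parameter count.
        def __init__(self, channels, mid_channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.BatchNorm2d(channels), nn.RReLU(),
                nn.Conv2d(channels, mid_channels, 1, bias=False),
                nn.BatchNorm2d(mid_channels), nn.RReLU(),
                nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(mid_channels), nn.RReLU(),
                nn.Conv2d(mid_channels, channels, 1, bias=False),
            )

        def forward(self, x):
            # identity mapping h(x_l) = x_l plus residual F(x_l, W_l); in the
            # pre-activation form no further activation follows the addition.
            return x + self.body(x)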
S3, extracting features from the obtained face image through a VGG network:
Specifically, the VGG network replaces large convolution kernels with stacks of consecutive 3 × 3 kernels; for a given receptive field, several small kernels work better than one large kernel, because the extra activation functions provide more non-linearity and a better network structure can be trained at no extra cost. In this basic structure the convolution kernels are 3 × 3 with zero-padding of 1, which keeps the feature-map size unchanged after each convolution; each max-pooling layer then halves the feature-map size. The image passes through five convolution stages in total, whose kernel channel counts are 64, 128, 256, 512 and 512. Two branches are used for feature fusion, with a convolution-pooling layer adjusting sizes before fusion. The two channels are transformed into feature vectors by fully-connected layers and fused together, and a dropout layer is introduced to prevent overfitting; the result is then passed to the following fully-connected layer and the subsequent softmax layer for classification prediction. The face image detected in S1 passes through several convolution and max-pooling layers of the VGG network to obtain the corresponding local feature f_V = (f_V^1, f_V^2, ..., f_V^k); after the flatten layer, the feature vector F_V is obtained.
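A compressed sketch of such a VGG-style extractor, assuming PyTorch and, for brevity, one 3 × 3 convolution per stage where VGG proper stacks several; the function name vgg_features and the single-channel 48 × 48 input are assumptions.

    import torch.nn as nn

    def vgg_features(in_ch=1, channels=(64, 128, 256, 512, 512)):
        # Stacked 3x3 convolutions with padding 1 keep the feature-map size; each
        # 2x2 max pooling then halves it (48 -> 24 -> 12 -> 6 -> 3 -> 1).
        layers = []
        for out_ch in channels:
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        return nn.Sequential(*layers)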
Step S4 specifically includes:
S41, the feature vector F_S extracted in step S2 is fed into two fully-connected layers fc_{1-1} and fc_{1-2} for dimension reduction, using the RRelu activation function:

RRelu(x) = x, if x >= 0; RRelu(x) = a * x, if x < 0, where a is drawn from a uniform distribution U(lower, upper)

The structure of each layer is:

fc_{1-1} = {s_1, s_2, ..., s_512}
fc_{1-2} = {s_1, s_2, ..., s_7}

where s denotes a neuron of the current fully-connected layer; fc_{1-1} has 512 neurons and fc_{1-2} has 7 neurons, so the final output of the fully-connected layers is the 7-dimensional feature vector F'_S.
S42, the feature vector F_V extracted in step S3 is fed into two fully-connected layers fc_{2-1} and fc_{2-2} for dimension reduction, with layer structures:

fc_{2-1} = {l_1, l_2, ..., l_512}
fc_{2-2} = {l_1, l_2, ..., l_7}

where l denotes a neuron of the current fully-connected layer; fc_{2-1} has 512 neurons and fc_{2-2} has 7 neurons, so the final output of the fully-connected layers is the 7-dimensional feature vector F'_V.
Specifically, the features output by the two convolutional neural networks are each reduced to the same dimensionality, in preparation for feature fusion.
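A minimal sketch of one such dimension-reduction head, assuming PyTorch; the function name reduction_head is illustrative.

    import torch.nn as nn

    def reduction_head(in_dim):
        # Two fully-connected layers per channel: a 512-neuron layer with RReLU,
        # then a 7-neuron layer, so both branches end in a 7-dimensional vector.
        return nn.Sequential(nn.Linear(in_dim, 512), nn.RReLU(), nn.Linear(512, 7))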
S5, fusing the features subjected to dimensionality reduction in the step S4 into new features in a weighting fusion mode;
referring to fig. 4, the overall network structure is to perform a clipping operation on the VGG19 network, and then merge the network with the improved ResNet network. Then, the shallow information and the deep information are combined together and input into the next convolution layer, so that the extracted characteristic information can be more complete. The network structure can better obtain image features beneficial to classification without increasing training time. Compared with the characteristics extracted through a single channel, the characteristics after fusion are easier to match with a real label, and the recognition effect is better. Characterizing in step S4
Figure RE-GDA0003286355790000051
And
Figure RE-GDA0003286355790000052
formation of new features F after weighted fusionzSetting a weight coefficient k to adjust the characteristic proportion of the two channels, wherein the fusion process is as follows:
Figure RE-GDA0003286355790000053
when k takes 0 or 1, it means a network with only one single channel.
The advantage of weighted fusion is that the proportion of different neural network output characteristics can be adjusted, and the optimal value of k is found to be 0.5 through a large number of experiments.
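The fusion itself is a one-line operation; a sketch assuming 7-dimensional tensors from both branches:

    def weighted_fusion(f_s, f_v, k=0.5):
        # F_z = k * F'_S + (1 - k) * F'_V; k = 0.5 is the empirically best value,
        # while k = 0 or k = 1 degenerates to a single-channel network.
        return k * f_s + (1 - k) * f_v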
S6, feeding the new feature from step S5 into a fully-connected layer, classifying it with the Softmax activation function, and outputting the expression;
The Softmax activation function classification process in step S6 is:

y_i = exp(z_i) / Σ_{c=1}^{C} exp(z_c)

where Z is the output of the previous layer and the input of Softmax, with dimension C equal to the number of classes, and y_i is the predicted probability of class i. The expressions are divided into 7 classes, namely anger, disgust, fear, happiness, sadness, surprise and neutral, and the final classification result is the class whose neuron node outputs the largest probability value.
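As a minimal sketch of this final step, assuming PyTorch, a single 7-dimensional fused vector, and an illustrative label ordering:

    import torch

    LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

    def predict_expression(fused):
        # Softmax turns the 7-dimensional fused feature into class probabilities;
        # the predicted expression is the class with the largest probability.
        probs = torch.softmax(fused, dim=-1)
        return LABELS[int(probs.argmax())], probs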
The invention is not described in detail, but is well known to those skilled in the art.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (7)

1. A facial expression recognition method based on joint loss multi-feature fusion is characterized by comprising the following steps:
S1, carrying out face detection on the image to be recognized to obtain a face region;
S2, extracting features from the obtained face image through an improved ResNet network;
S3, extracting features from the obtained face image through a VGG network;
S4, sending the features obtained in steps S2 and S3 into fully-connected layers for dimension reduction;
S5, fusing the dimension-reduced features from step S4 into new features by weighted fusion;
and S6, sending the new features from step S5 into the fully-connected layer for dimension reduction, then performing class prediction on the features with a Softmax layer, and outputting the class information.
2. The facial expression recognition method based on joint loss multi-feature fusion according to claim 1, wherein the step S1 comprises:
S11, based on the SSD object detection network, replacing the original base network VGG-16 with the lightweight network MobileNet;
S12, fusing the 7th depthwise separable convolution layer (shallow features) of the network in step S11 with the feature maps of the last 5 layers (deep features): the six feature maps are each reshaped into one-dimensional vectors and then concatenated in series to realize multi-scale face detection;
S13, the detection network extracting features through the base network, and the meta-structure performing classification and bounding-box regression.
3. The method for recognizing facial expressions based on joint loss multi-feature fusion according to claim 2, wherein the step S2 includes:
S21, the face image X = (x_1, x_2, ..., x_n) detected in S1 is fed into the improved ResNet network, and the corresponding global feature f_S = (f_S^1, f_S^2, ..., f_S^m) is obtained after processing by several residual blocks, the convolution operation being:
x_{l+1} = f(h(x_l) + F(x_l, W_l))
where x_l and x_{l+1} are the input and output of the l-th residual unit, F is the residual function, h(x_l) = x_l represents the identity mapping, and f is the RRelu activation function; the features learned from a shallow layer l to a deep layer L are
x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)
S22, after the flatten layer, the features yield the feature vector F_S.
4. The method for recognizing facial expressions based on joint loss multi-feature fusion according to claim 3, wherein in the step S3, the face image detected in S1 passes through several convolution and max-pooling layers of the VGG network to obtain the corresponding local feature f_V = (f_V^1, f_V^2, ..., f_V^k); after the flatten layer, the feature vector F_V is obtained.
5. The method for recognizing facial expressions based on joint loss multi-feature fusion according to claim 4, wherein the step S4 includes:
S41, the feature vector F_S extracted in step S2 is fed into two fully-connected layers fc_{1-1} and fc_{1-2} for dimension reduction, using the RRelu activation function:
RRelu(x) = x, if x >= 0; RRelu(x) = a * x, if x < 0, where a is drawn from a uniform distribution U(lower, upper)
The structure of each layer is:
fc_{1-1} = {s_1, s_2, ..., s_512}
fc_{1-2} = {s_1, s_2, ..., s_7}
where s denotes a neuron of the current fully-connected layer; fc_{1-1} has 512 neurons and fc_{1-2} has 7 neurons, so the final output of the fully-connected layers is the 7-dimensional feature vector F'_S;
S42, the feature vector F_V extracted in step S3 is fed into two fully-connected layers fc_{2-1} and fc_{2-2} for dimension reduction, with layer structures:
fc_{2-1} = {l_1, l_2, ..., l_512}
fc_{2-2} = {l_1, l_2, ..., l_7}
where l denotes a neuron of the current fully-connected layer; fc_{2-1} has 512 neurons and fc_{2-2} has 7 neurons, so the final output of the fully-connected layers is the 7-dimensional feature vector F'_V.
6. The method for recognizing facial expressions based on joint loss multi-feature fusion according to claim 5, wherein the weighted fusion in the step S5 is calculated as follows: the features F'_S and F'_V obtained in step S4 are weighted and fused into the new feature F_z, with a weight coefficient k set to adjust the proportion of the two channels' features, the fusion process being:
F_z = k * F'_S + (1 - k) * F'_V
when k takes 0 or 1, only one convolutional neural network extracts features.
7. The method for recognizing facial expressions based on joint loss multi-feature fusion according to claim 6, wherein in the step S6 the Softmax activation function is expressed as:
y_i = exp(z_i) / Σ_{c=1}^{C} exp(z_c)
where Z is the output of the previous layer and the input of Softmax, with dimension C equal to the number of classes, and y_i is the predicted probability of class i; the expressions are divided into 7 classes, namely anger, disgust, fear, happiness, sadness, surprise and neutral, and the final classification result is the class whose neuron node outputs the largest probability value.
CN202110697155.5A 2021-06-23 2021-06-23 Face expression recognition method based on joint loss multi-feature fusion Pending CN113642383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110697155.5A CN113642383A (en) 2021-06-23 2021-06-23 Face expression recognition method based on joint loss multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110697155.5A CN113642383A (en) 2021-06-23 2021-06-23 Face expression recognition method based on joint loss multi-feature fusion

Publications (1)

Publication Number Publication Date
CN113642383A true CN113642383A (en) 2021-11-12

Family

ID=78416120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110697155.5A Pending CN113642383A (en) 2021-06-23 2021-06-23 Face expression recognition method based on joint loss multi-feature fusion

Country Status (1)

Country Link
CN (1) CN113642383A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186617A (en) * 2021-11-23 2022-03-15 浙江大学 Mechanical fault diagnosis method based on distributed deep learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190216334A1 (en) * 2018-01-12 2019-07-18 Futurewei Technologies, Inc. Emotion representative image to derive health rating
CN110414371A (en) * 2019-07-08 2019-11-05 西南科技大学 A kind of real-time face expression recognition method based on multiple dimensioned nuclear convolution neural network
CN110543895A (en) * 2019-08-08 2019-12-06 淮阴工学院 image classification method based on VGGNet and ResNet
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD
CN111259954A (en) * 2020-01-15 2020-06-09 北京工业大学 Hyperspectral traditional Chinese medicine tongue coating and tongue quality classification method based on D-Resnet
CN112418330A (en) * 2020-11-26 2021-02-26 河北工程大学 Improved SSD (solid State drive) -based high-precision detection method for small target object
CN112597873A (en) * 2020-12-18 2021-04-02 南京邮电大学 Dual-channel facial expression recognition method based on deep learning
CN112766413A (en) * 2021-02-05 2021-05-07 浙江农林大学 Bird classification method and system based on weighted fusion model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190216334A1 (en) * 2018-01-12 2019-07-18 Futurewei Technologies, Inc. Emotion representative image to derive health rating
CN110414371A (en) * 2019-07-08 2019-11-05 西南科技大学 A kind of real-time face expression recognition method based on multiple dimensioned nuclear convolution neural network
CN110543895A (en) * 2019-08-08 2019-12-06 淮阴工学院 image classification method based on VGGNet and ResNet
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD
CN111259954A (en) * 2020-01-15 2020-06-09 北京工业大学 Hyperspectral traditional Chinese medicine tongue coating and tongue quality classification method based on D-Resnet
CN112418330A (en) * 2020-11-26 2021-02-26 河北工程大学 Improved SSD (solid State drive) -based high-precision detection method for small target object
CN112597873A (en) * 2020-12-18 2021-04-02 南京邮电大学 Dual-channel facial expression recognition method based on deep learning
CN112766413A (en) * 2021-02-05 2021-05-07 浙江农林大学 Bird classification method and system based on weighted fusion model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GU SHENGTAO et al.: "Facial expression recognition based on global and local feature fusion with CNNs", 2019 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC) *
MINHYUK JUNG et al.: "Human activity classification based on sound recognition and residual convolutional neural network", Automation in Construction *
李旻择 et al.: "Real-time facial expression recognition based on multi-scale kernel feature convolutional neural network", Journal of Computer Applications (计算机应用) *
李春虹 et al.: "Facial expression recognition based on depthwise separable convolution", Computer Engineering and Design (计算机工程与设计) *
李校林 et al.: "Facial expression recognition with feature fusion based on VGG-NET", Computer Engineering and Science (计算机工程与科学) *
郑锡聪: "Research on bimodal learning-state recognition fusing ResNet and DS evidence theory", China Masters' Theses Full-text Database, Social Sciences II (中国优秀博硕士学位论文全文数据库(硕士) 社会科学Ⅱ辑) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186617A (en) * 2021-11-23 2022-03-15 浙江大学 Mechanical fault diagnosis method based on distributed deep learning
CN114186617B (en) * 2021-11-23 2022-08-30 浙江大学 Mechanical fault diagnosis method based on distributed deep learning

Similar Documents

Publication Publication Date Title
Guo et al. A survey on deep learning based face recognition
Dino et al. Facial expression classification based on SVM, KNN and MLP classifiers
Mane et al. A survey on supervised convolutional neural network and its major applications
Zeng et al. Multi-stage contextual deep learning for pedestrian detection
Yan et al. Multi-attributes gait identification by convolutional neural networks
Rajan et al. Novel deep learning model for facial expression recognition based on maximum boosted CNN and LSTM
Sun et al. Facial expression recognition based on a hybrid model combining deep and shallow features
Sajjanhar et al. Deep learning models for facial expression recognition
Peng et al. Towards facial expression recognition in the wild: A new database and deep recognition system
Moustafa et al. Age-invariant face recognition based on deep features analysis
CN110276248B (en) Facial expression recognition method based on sample weight distribution and deep learning
Yang et al. Semi-supervised learning of feature hierarchies for object detection in a video
KR101777601B1 (en) Distinction method and system for characters written in caoshu characters or cursive characters
Julina et al. Facial emotion recognition in videos using hog and lbp
CN112883941A (en) Facial expression recognition method based on parallel neural network
Bodapati et al. A deep learning framework with cross pooled soft attention for facial expression recognition
Sujanaa et al. Emotion recognition using support vector machine and one-dimensional convolutional neural network
CN113642383A (en) Face expression recognition method based on joint loss multi-feature fusion
CN113516047A (en) Facial expression recognition method based on deep learning feature fusion
Khemakhem et al. Facial expression recognition using convolution neural network enhancing with pre-processing stages
Dewan et al. Fish detection and classification
Espinel et al. Face gesture recognition using deep-learning models
TWI722383B (en) Pre feature extraction method applied on deep learning
Mahmoodzadeh Human Activity Recognition based on Deep Belief Network Classifier and Combination of Local and Global Features
CN113052132A (en) Video emotion recognition method based on face key point track feature map

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20211112