CN111523461A - Expression recognition system and method based on enhanced CNN and cross-layer LSTM - Google Patents
- Publication number: CN111523461A
- Application number: CN202010324539.8A
- Authority: CN (China)
- Prior art keywords: layer, CNN, LSTM, module, cross
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/174—Facial expression recognition
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
The invention discloses an expression recognition system and method based on an enhanced CNN and a cross-layer LSTM. The system comprises a feature-enhanced CNN module, a cross-layer LSTM module, and a fully connected layer; the feature-enhanced CNN module and the cross-layer LSTM module are cascaded for end-to-end training. The feature-enhanced CNN module taps a feature-enhancement branch from an intermediate layer of the backbone CNN network and fuses the branch's output with the backbone's output. The cross-layer LSTM module, built on a cascade of at least two LSTM layers, feeds the output of the feature-enhanced CNN module to the first LSTM layer and simultaneously cross-connects that output to the input of a later LSTM layer. The method helps obtain accurate temporal expression information from video sequences, effectively improves the accuracy of unconstrained facial expression recognition, and has broad application prospects in human-computer interaction, intelligent education, patient monitoring, and similar fields.
Description
Technical Field
The invention relates to the technical field of expression recognition, in particular to an expression recognition system and method based on enhanced CNN and cross-layer LSTM.
Background
Facial expressions carry rich emotional information: they are one of the principal modes of human emotional expression and an effective means of non-verbal emotional communication. People express their own emotions through facial expressions and can also accurately perceive changes in the inner emotions of others. Accurate facial expression recognition therefore has important research value and application prospects, and has become a research hotspot in artificial intelligence in recent years.
A facial expression recognition system generally comprises four steps: image preprocessing, face detection and face-region segmentation, expression feature extraction, and expression classification; feature extraction and classification are the two key steps. Common traditional feature extraction methods include LBP, HOG, SIFT, Gabor, and their improved operators; common traditional classifiers include Support Vector Machines (SVM), Random Forests (RF), Gaussian Processes (GP), and Hidden Markov Models (HMM).
However, as facial expression recognition is applied more and more widely in practice, benchmark databases have gradually shifted from those collected in simple experimental environments (frontal, unoccluded face images, with subjects exaggerating the required expressions) to those collected in complex real-world environments (face images subject to mixed interference from real-world illumination, pose changes, occlusion, accessories, and other factors, with subjects showing natural expressions of varying intensity). Traditional machine learning algorithms struggle with such complex, variable, unconstrained facial expression recognition. Deep neural networks, with their powerful learning ability, have therefore been applied to unconstrained facial expression recognition with remarkable results. For example, Mayya et al. recognized facial expressions automatically with a DCNN; Connie et al. improved accuracy with a hybrid CNN-SIFT network; Bargal et al. extracted expression features with three different networks (VGG13, VGG16, and ResNet), fused them, and classified with a Support Vector Machine (SVM); Liang et al. proposed an adaptive Gabor convolution-kernel coding network that improves the traditional Gabor kernel and raises the recognition rate.
The above methods are based on static single-frame images. Compared with single frames, video sequences convey richer expression-change information and more accurately reflect the full motion process of an expression, so facial expression recognition based on video sequences is both more practically valuable and more challenging. Zhao et al. proposed a peak-piloted deep network (PPDN) that uses peak-expression samples to supervise the intermediate features of non-peak samples of the same class, achieving expression-intensity invariance; Yu et al. proposed a deeper cascaded peak-piloted network (DCPN) for weak-expression recognition that enhances feature discrimination and avoids overfitting through cascaded fine-tuning; Jung et al. proposed a jointly fine-tuned network (DTAGN) based on two different models, one deep network extracting temporal-variation features from the video sequence and the other extracting geometric-variation features from facial key points in single frames, improving the accuracy of video-sequence facial expression recognition.
Currently, a common approach to unconstrained expression recognition in video sequences combines a CNN with a long short-term memory (LSTM) network to model the spatiotemporal changes of facial expressions. To obtain better recognition results, a deep CNN is typically used to extract spatial information, and multiple cascaded LSTM layers are used to capture temporal information. This increases the computational overhead of the network and, as the number of layers grows, introduces the gradient vanishing problem.
In summary, although facial expression recognition has achieved certain results, there are some disadvantages:
(1) most existing research targets static single-frame images; there is comparatively little research on video-sequence facial expression recognition, and results are mostly validated on video databases collected in laboratory environments, such as CK+, MMI, and Oulu-CASIA, where the expressions are exaggerated and little disturbed by noise, limiting their value as a reference for practical applications;
(2) facial expression video data collected in real environments is scarce, so deep neural networks lack training samples, which seriously degrades network performance; meanwhile, differences between individuals (age, gender, race) and variations within individuals (illumination, pose, occlusion, accessories) make the quality of collected expression samples uneven.
These factors make it difficult to design a real-time, accurate unconstrained facial expression recognition system, and existing deep-neural-network-based facial expression recognition research still has considerable room for performance improvement.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention discloses an expression recognition system and method based on an enhanced CNN and a cross-layer LSTM. The feature-enhanced CNN module aims to obtain accurate unconstrained facial expression features from video sequences; the cross-layer LSTM module helps reduce the risk of gradient vanishing, ensures effective transmission of relevant information between frames, and obtains accurate temporal expression information from video sequences. Cascading the two for end-to-end network training effectively improves the accuracy of unconstrained facial expression recognition, with broad application prospects in human-computer interaction, intelligent education, patient monitoring, and similar fields.
The technical scheme is as follows. An expression recognition system based on enhanced CNN and cross-layer LSTM comprises a feature-enhanced CNN module, a cross-layer LSTM module, and a fully connected layer. A video sequence is input into the feature-enhanced CNN module, which acquires the spatial expression information of the sequence; the feature-enhanced CNN module and the cross-layer LSTM module are cascaded for end-to-end training; the feature vector output by the feature-enhanced CNN module is input into the cross-layer LSTM module, which captures the temporal expression information of the sequence; and the feature vector output by the cross-layer LSTM module is input into the fully connected layer, which maps the learned deep semantic features to the sample label space to realize classification;
the feature-enhanced CNN module comprises a backbone CNN network and a feature-enhancement branch: the branch is tapped from an intermediate layer of the backbone CNN network, and its output is fused with the output of the backbone CNN network;
the cross-layer LSTM module comprises at least two cascaded LSTM networks: the output of the feature-enhanced CNN module is input to the first LSTM layer and is simultaneously cross-connected to the input of a later LSTM layer.
Preferably, the feature-enhancement branch of the feature-enhanced CNN module comprises convolutional layers, batch normalization layers, and a flatten layer: the input of the first convolutional layer is connected to an intermediate layer of the backbone CNN network; each convolutional layer's output feeds a batch normalization layer; after several convolution/batch-normalization pairs are cascaded, the output of the last batch normalization layer feeds the flatten layer, whose output is fused with the output of a fully connected layer of the backbone CNN network.
Preferably, the backbone CNN network of the feature-enhanced CNN module adopts a VGG-16 network.
Preferably, the feature-enhancement branch comprises two convolutional layers: the first uses 7 × 7 convolution kernels and the second uses 1 × 1 convolution kernels.
Preferably, the output of the feature-enhancement branch is connected to the output of the first fully connected layer of the backbone CNN network.
Preferably, the cross-layer LSTM module comprises two cascaded LSTM layers, wherein the output of the feature-enhanced CNN module is input to the first LSTM layer and is also cross-connected to the input of the second LSTM layer.
Preferably, the feature vectors output by both LSTM layers have dimension 2048.
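For concreteness, the following is a minimal sketch of the preferred two-layer cross-layer LSTM module in TensorFlow/Keras, the platform used in the experiments below. The sequence length, the CNN feature dimension, and the use of concatenation as the fusion operation at the second layer's input are assumptions made for the sketch; the patent only specifies that the CNN output is cross-connected to that input.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cross_layer_lstm(seq_len=10, feat_dim=4096, num_classes=7):
    # Per-frame feature vectors produced by the feature-enhanced CNN module.
    cnn_feats = layers.Input(shape=(seq_len, feat_dim))
    # First LSTM layer: consumes the CNN features directly.
    h1 = layers.LSTM(2048, return_sequences=True)(cnn_feats)
    # Cross-layer connection: the CNN features are also fed to the second
    # LSTM layer's input (fusion by concatenation is assumed here).
    h2 = layers.LSTM(2048)(layers.Concatenate()([h1, cnn_feats]))
    # Fully connected layer maps the deep semantic features to class scores.
    probs = layers.Dense(num_classes, activation="softmax")(h2)
    return Model(cnn_feats, probs)
```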
A recognition method for the expression recognition system based on enhanced CNN and cross-layer LSTM, characterized by comprising the following steps:
Step A: perform face detection on the facial expression video, crop the face ROI (region of interest), and remove background interference;
Step B: divide the preprocessed facial expression video into several video sequences in groups of n frames;
Step C: input each video sequence in turn into the expression recognition system based on enhanced CNN and cross-layer LSTM, and compute the probability that the sequence belongs to each expression class through the fully connected layer and activation function;
Step D: average the probability values of all sequences of the video for each expression class; the expression class with the largest average probability is the video's expression label.
Preferably, in step B, n is at most 1/2 of the facial expression video length, and adjacent video sequences overlap by n/2 frames.
Beneficial effects: the invention has the following beneficial effects.
1. The feature-enhanced CNN module improves on the conventional CNN: a feature-enhancement branch is tapped from an intermediate layer of the backbone CNN network and fused with the deep features output by the backbone, so as to accurately acquire unconstrained facial expression features of video sequences at different levels and enrich the expression information.
2. The cross-layer LSTM module improves on the conventional LSTM: on top of the two-layer LSTM network, the spatial expression information of the video sequence output by the feature-enhanced CNN module is cross-connected to the input of the second LSTM layer. This reduces the risk of gradient vanishing caused by deepening the network, ensures effective transmission of relevant information between frames, and yields more accurate temporal expression information of the video sequence.
3. Compared with the recognition performance of non-end-to-end and end-to-end CNN-LSTM networks on video datasets, cascading the feature-enhanced CNN module with the cross-layer LSTM module for end-to-end training effectively improves the accuracy of unconstrained facial expression recognition, simplifies the training and testing process, and has broad application prospects in human-computer interaction, intelligent education, patient monitoring, and similar fields.
Drawings
FIG. 1 is an overall system block diagram of the present invention;
FIG. 2 is a structural diagram of the feature-enhanced CNN module in the present invention;
FIG. 3 is a structural diagram of the cross-layer LSTM module of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The invention discloses an expression recognition system based on enhanced CNN and cross-layer LSTM, as shown in FIG. 1, comprising a feature-enhanced CNN (Feature-enhanced CNN) module and a cross-layer LSTM (Cross-layer LSTM) module. The feature-enhanced CNN module acquires accurate spatial expression information from the video sequence, and the cross-layer LSTM module captures temporal expression information; the two are cascaded for end-to-end training, which effectively improves the discriminability of unconstrained facial expression features. Finally, a fully connected layer maps the learned deep semantic features to the sample label space to realize classification.
Convolutional neural networks have achieved great success in visual recognition tasks in recent years; classical CNNs include AlexNet, VGG, GoogLeNet, and ResNet. Weighing training cost against recognition accuracy, the VGG-16 network is used as the backbone of the feature-enhanced CNN module. Because VGG-16 has a limited number of layers, the unconstrained facial expression features it extracts are not ideal when samples suffer mixed interference from real-world illumination, pose changes, occlusion, accessories, and other factors, and when individual cultural differences cause the same emotion to be expressed to different degrees. The invention therefore improves the discriminability of unconstrained facial expression features by increasing the network's width rather than its depth.
The specific improvement to the VGG-16 network is as follows: a feature-enhancement branch is tapped from an intermediate layer of the backbone CNN network and fused with the deep features output by the backbone, so as to obtain facial expression information at different levels. On the branch, only one 7 × 7 convolutional layer and one 1 × 1 convolutional layer are used; this combination captures rich expression information without increasing model complexity. The implementation framework is shown in FIG. 2.
As shown in FIG. 2, the feature-enhancement branch consists of convolutional layers (Conv), batch normalization (BN) layers, and a flatten layer (Flatten). The input of the first convolutional layer is connected to an intermediate layer of the backbone CNN network; each convolutional layer's output feeds a batch normalization layer; after the convolution/batch-normalization pairs are cascaded, the output of the last batch normalization layer feeds the flatten layer, whose output is fused with a fully connected layer of the backbone CNN network. The functions of each layer are as follows:
- the first convolutional layer uses 7 × 7 kernels, exploiting a larger receptive field to capture more spatial expression information while adding little convolution depth to the branch;
- the second convolutional layer uses 1 × 1 kernels to compress the dimensionality of the high-dimensional input features, fuse them further, and reduce model complexity;
- each convolutional layer is followed by a batch normalization layer that normalizes the features, stabilizing the feature distribution and accelerating model learning;
- the flatten layer vectorizes the multi-dimensional features into one dimension for fusion or concatenation with a fully connected layer of the backbone CNN network.
The parameter settings of the feature-enhancement branch are shown in Table 1 below; a code sketch of the branch follows the table.
TABLE 1
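By way of illustration only, a minimal Keras sketch of such a feature-enhancement branch on a VGG-16 backbone is given here. The tap point (block4_pool), the filter counts, and fusion by concatenation with the Fc1 output are assumptions made for the sketch, not values taken from Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_feature_enhanced_cnn(input_shape=(224, 224, 3)):
    backbone = VGG16(include_top=False, weights=None, input_shape=input_shape)
    # Tap the feature-enhancement branch from an intermediate backbone layer.
    mid = backbone.get_layer("block4_pool").output
    b = layers.Conv2D(1024, 7, activation="relu")(mid)   # 7x7: large receptive field
    b = layers.BatchNormalization()(b)
    b = layers.Conv2D(256, 1, activation="relu")(b)      # 1x1: dimension compression
    b = layers.BatchNormalization()(b)
    b = layers.Flatten()(b)                              # one-dimensional vectorization
    # Backbone head: first fully connected layer (Fc1).
    fc1 = layers.Dense(4096, activation="relu", name="fc1")(
        layers.Flatten()(backbone.output))
    # Fuse the branch output with the Fc1 output (concatenation assumed).
    fused = layers.Concatenate()([fc1, b])
    return Model(backbone.input, fused)
```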
At present, facial expression video data is usually processed by cascading a convolutional network with a multi-layer LSTM. Trained end to end, such a network can suffer gradient vanishing as the number of layers grows, to the point where it cannot be trained at all. ResNet solved the difficulty of training deep networks by adding identity shortcut connections to the CNN. Inspired by this, the same idea is applied to the cross-layer LSTM module: on top of the two-layer LSTM network, the spatial expression information of the video sequence output by the feature-enhanced CNN module is cross-connected to the input of the second LSTM layer. This ensures effective transmission of inter-frame information, yields more accurate temporal expression information of the video sequence, and avoids the gradient vanishing caused by deepening the network. The implementation block diagram is shown in FIG. 3.
Based on the system, the invention also discloses an expression recognition method based on the enhanced CNN and the cross-layer LSTM, which comprises the following steps:
and A, carrying out face detection on the face expression video data, intercepting a face ROI (region of interest) area, and removing background interference.
And step B, dividing the preprocessed facial expression video into a plurality of video sequences by taking n frames as a group, wherein n is less than or equal to 1/2 of the minimum video length, and n/2 frames of image overlap exists between every two adjacent groups of video sequences.
If the divided video sequence has a length less than n frames, the last frame of the video sequence is used to complement the divided video sequence into n frames.
And step C, sequentially inputting each group of video sequences (n frames of images) into the expression recognition system based on the enhanced CNN and the cross-layer LSTM, and calculating the probability values of various expressions through the full-connection layer and the activation function.
And D, averaging the expression probability values of all groups of sequences belonging to the video, wherein the expression type corresponding to the maximum probability value is the final identification label.
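Putting the steps above together, a NumPy-based sketch of the sequence splitting, padding, and probability averaging might look as follows; `model` is assumed to be the end-to-end cascaded network, mapping a batch of n-frame face-cropped sequences to per-class probabilities.

```python
import numpy as np

def split_sequences(frames, n=10):
    """Step B: n-frame groups with n/2-frame overlap; a group shorter than
    n frames is padded by repeating the video's last frame."""
    step = max(n // 2, 1)
    seqs = []
    for start in range(0, len(frames), step):
        chunk = list(frames[start:start + n])
        chunk += [frames[-1]] * (n - len(chunk))    # pad short tail groups
        seqs.append(chunk)
        if start + n >= len(frames):
            break
    return seqs

def predict_video(model, frames, n=10):
    """Steps C-D: score every group, average the class probabilities over
    all groups, and take the argmax as the video's expression label."""
    batch = np.stack([np.stack(s) for s in split_sequences(frames, n)])
    probs = model.predict(batch)                    # (num_groups, num_classes)
    return int(probs.mean(axis=0).argmax())
```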
The network training parameters of the expression recognition system based on enhanced CNN and cross-layer LSTM were determined, and its training performance evaluated, as follows:
the experimental code of the invention is completed under a TensorFlow platform in a Ubuntu 16.4 system, and a host is provided with 2 GPU video cards with the model number of NVIDIAGTX 1080 Ti.
1. Database introduction
Experimental simulations were carried out on four databases: AFEW, CK+, SFEW, and FER2013. AFEW, SFEW, and FER2013 are unconstrained facial expression databases collected in real environments: the samples suffer mixed interference from environmental illumination, pose changes, occlusion, accessories, resolution, shooting angles, complex backgrounds, and other factors, and individual cultural differences cause the same emotion to be expressed to different degrees. CK+ is a constrained facial expression database collected in a laboratory: the faces are frontal and unoccluded, and the subjects were asked to exaggerate the various emotions. Recognition of unconstrained facial expressions is therefore the more challenging task.
It should be noted that AFEW and CK + are face expression video databases used to verify the effectiveness of the cross-layer LSTM module and the expression recognition system based on enhanced CNN and cross-layer LSTM, and SFEW and FER2013 are face expression image databases used to verify the effectiveness of the feature-enhanced CNN module. The four databases are described in detail below.
(1) AFEW database
The AFEW (Acted Facial Expressions in the Wild) database consists of video clips selected from different movies. The subjects show spontaneous facial expressions under mixed interference from real-world illumination, pose changes, occlusion, accessories, shooting angles, resolution, complex backgrounds, and other factors. AFEW has served as the evaluation data of the EmotiW challenge since 2013 and is adjusted by the organizing committee each year. The 2017 challenge data, AFEW 7.0, is used for the experiments here. The AFEW 7.0 database is divided into three parts: a training set (773 samples), a validation set (383 samples), and a test set (653 samples); the three sets share no subjects, which avoids any influence of face identity on expression recognition. The expression labels are anger, disgust, fear, happiness, neutral, sadness, and surprise.
(2) SFEW database
The SFEW (Static Facial Expressions in the Wild) database consists of static single frames taken from the AFEW database; the expression key frames were selected by computing the changes of facial key points in each video. SFEW is also divided into three sets: a training set (958 samples), a validation set (436 samples), and a test set (372 samples). The expression categories are consistent with AFEW: the seven basic emotions of anger, disgust, fear, happiness, neutral, sadness, and surprise.
Note that AFEW and SFEW are competition datasets whose test-set labels are not publicly released, so the validation sets are used for performance testing.
(3) FER2013 database
The FER2013 database was built by collecting facial expression images from the Internet with the Google Image Search API. It comprises 28,709 training samples, 3,589 validation samples, and 3,589 test samples, each a 48 × 48 pixel image. FER2013 also covers the seven basic expression classes: anger, disgust, fear, happiness, neutral, sadness, and surprise.
(4) CK + database
The CK+ database is the laboratory database most widely used to evaluate facial expression recognition systems, containing 593 videos from 123 subjects. The videos range from 10 to 60 frames and progress from a neutral expression to the most exaggerated expression. Of these, 327 videos from 118 subjects are labeled with seven basic emotions according to the Facial Action Coding System (FACS): anger, contempt, disgust, fear, happiness, sadness, and surprise. Since CK+ provides no training/test split, the 327 videos are divided into 10-frame sequences, yielding 978 sequences in total; 80% are used for training and 20% for testing, and the experimental results are obtained by 5-fold cross-validation.
2. Network pre-processing and data amplification
MTCNN is used for face-detection preprocessing on the AFEW, SFEW, and CK+ databases to eliminate the influence of complex backgrounds on facial expression recognition; the FER2013 samples are already face-detected crops, so this operation is not applied to FER2013. The four databases are also augmented by scaling to increase the number of training samples.
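As a sketch, the detection-and-crop step could be written with the open-source `mtcnn` package (the package choice and the helper name are assumptions; the patent does not name an implementation):

```python
import cv2
from mtcnn import MTCNN

detector = MTCNN()

def crop_face_roi(frame_bgr):
    """Detect the largest face and return the cropped ROI, removing
    background interference; returns None if no face is found."""
    faces = detector.detect_faces(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not faces:
        return None
    x, y, w, h = max(faces, key=lambda f: f["box"][2] * f["box"][3])["box"]
    x, y = max(x, 0), max(y, 0)
    return frame_bgr[y:y + h, x:x + w]
```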
3. Pre-training and fine-tuning of networks
Because AFEW is the most complex of the four databases, network pre-training and fine-tuning are based mainly on it, as follows:
first, VGG-FACE weights are used as the initial weights of the backbone CNN network;
then, the feature-enhanced CNN module is fine-tuned with part of the SFEW and FER2013 samples;
finally, the feature-enhanced CNN module is trained with the AFEW training set and the augmented training samples to obtain the optimal network parameters; the other three databases are trained and tested directly on this module.
4. Network performance analysis
(1) Performance analysis of cross-layer LSTM modules
The effectiveness of the cross-layer LSTM module is verified here on the two facial expression video databases AFEW and CK+.
1) Selection of network layer number and parameters of cross-layer LSTM module
Experiments are conducted on the conventional CNN-LSTM framework to select an appropriate number of LSTM layers; the CNN remains the classical VGG-16. The evaluation criteria are accuracy and F1 score.
The accuracy is computed as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The F1 score is computed as:

F1 = 2 × P × R / (P + R)

where the precision P = TP / (TP + FP) and the recall R = TP / (TP + FN). The F1 score is the harmonic mean of precision and recall; its maximum is 1 and minimum 0, with larger values indicating a better model. TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
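Equivalently, in code (per-class counts; averaging the per-class scores across the seven expressions is an assumption about the evaluation protocol):

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Accuracy and F1 from the confusion-matrix counts defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1
```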
Table 2 shows the results of independently training CNN networks and LSTM networks with different numbers of layers and parameters on the AFEW database.
TABLE 2
The first two rows of Table 2 are results with a one-layer LSTM network; the last three rows are results with a cascade of two LSTM layers; the values in parentheses give the dimension of the feature vector output by each LSTM layer.
As Table 2 shows, the best F1 score of the two-layer LSTM networks is 0.3279, while the best F1 score of the one-layer LSTM networks is 0.2954, indicating that two LSTM layers perform better than one. Among the two-layer configurations with different output dimensions, CNN-LSTM (3000, 3000) has the highest recognition accuracy and CNN-LSTM (2048, 2048) the highest F1 score. Weighing the two indicators, the LSTM (2048, 2048) network is chosen as the basis of the cross-layer LSTM module: its accuracy is only 0.27% lower than that of CNN-LSTM (3000, 3000), while its F1 score is 2.1% higher.
2) Performance verification of cross-layer LSTM modules
To isolate the effect of the cross-layer LSTM module, the CNN here remains the conventional VGG-16. The LSTM network uses the best structure and parameters from Table 2, i.e., two LSTM layers each with a 2048-dimensional output, and is trained end to end. Experiments on the AFEW and CK+ databases give the unconstrained facial expression recognition results shown in Table 3.
TABLE 3
As Table 3 shows, end-to-end training yields higher facial expression recognition accuracy than non-end-to-end training (i.e., independent training, as in Table 2); moreover, adding the cross-layer connection further improves the accuracy of unconstrained facial expression recognition.
Tables 4 and 5 are the confusion matrices of the end-to-end CNN plus cross-layer LSTM network on the AFEW and CK+ databases, respectively, where the horizontal axis is the true label, the vertical axis the predicted label, and all values are percentages.
TABLE 4
TABLE 5
Tables 4 and 5 show that, compared with CK+, correctly classified test samples on AFEW no longer hold an absolute majority; sometimes far fewer videos are classified correctly than incorrectly. For example, only 17.78% of videos labeled "surprise" are classified correctly, while 40% are misclassified as "angry" and 20% as "happy". Similarly, only 12.5% of videos labeled "disgust" are classified correctly, while 22.5% are misclassified as "angry", 20% as "happy", and 17.5% as "sad". This is because emotions in real life are often not pure; mixtures of several emotions occur. For example, anger, disgust, and sadness frequently accompany one another, and the facial morphological changes of fear, surprise, and joy share certain similarities. Meanwhile, unconstrained facial expression data suffers mixed interference from age, gender, race, illumination, pose changes, occlusion, resolution, complex backgrounds, and other factors, making correct discrimination difficult even for deep learning models. From another perspective, this illustrates that correctly recognizing unconstrained facial expressions remains a very challenging research topic.
(2) Performance analysis of feature-enhanced CNN modules
Using the optimal network parameter settings above, experiments were run on the two static facial expression databases SFEW and FER2013; the results are shown in Table 6.
TABLE 6
Tables 7 and 8 are the confusion matrices of the expression recognition system based on enhanced CNN and cross-layer LSTM on the AFEW and CK+ databases, respectively, where the horizontal axis is the true label, the vertical axis the predicted label, and all values are percentages.
TABLE 7
TABLE 8
Compared with the conventional CNN, the feature-enhanced CNN module improves the recognition rate by 0.97% on SFEW and by 1.14% on FER2013. This further shows that increasing the network's width, without increasing its depth, can improve the discriminability of unconstrained facial expression features.
(3) Performance analysis of expression recognition system based on enhanced CNN and cross-layer LSTM
Having analyzed the performance of the cross-layer LSTM module, the feature-enhanced CNN module is now added on top of it to further verify the effectiveness of the invention for video facial expression recognition.
Tables 9 and 10 give the results of the expression recognition system based on enhanced CNN and cross-layer LSTM on the AFEW and CK+ databases. The parameters in parentheses denote the fully connected layer of the backbone CNN network with which the feature-enhancement branch is fused, and the kernel size of the first convolutional layer in the branch. For example, (Fc1, 5 × 5) means the branch is fused with the first fully connected (Fc1) layer of the backbone CNN network (see FIG. 2) and the first convolutional layer of the branch uses 5 × 5 × 1024 kernels (see Table 1); similarly, (Fc2, 7 × 7) means the branch is fused with the second fully connected (Fc2) layer and the first convolutional layer uses 7 × 7 × 1024 kernels.
TABLE 9
Watch 10
The data in Tables 9 and 10 show that network performance is best when the output of the feature-enhancement branch is fused with the output of the first fully connected layer of the backbone CNN network and the first convolutional layer of the branch uses 7 × 7 × 1024 kernels. In this configuration, the F1 score on the AFEW database is 0.3816 and the accuracy 41.25%, which is 2.44% above the official competition baseline. Likewise, the highest accuracy of 97.47% is achieved on the CK+ database.
(4) Comparison with advanced algorithms
To further demonstrate the superiority of the proposed method, experiments were conducted on the CK+ and SFEW databases and compared with existing state-of-the-art algorithms; the results are shown in Tables 11 and 12.
TABLE 11
TABLE 12
The data in Tables 11 and 12 show that, compared with existing state-of-the-art algorithms, the proposed method achieves the highest accuracy on both CK+ (97.47%) and SFEW (54.37%), further demonstrating its rationality and superiority.
The above describes only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are intended to fall within its scope.
Claims (9)
1. An expression recognition system based on enhanced CNN and cross-layer LSTM, characterized by comprising a feature-enhanced CNN module, a cross-layer LSTM module, and a fully connected layer; wherein a video sequence is input into the feature-enhanced CNN module, which acquires spatial expression information of the video sequence; the feature-enhanced CNN module and the cross-layer LSTM module are cascaded for end-to-end training; the feature vector output by the feature-enhanced CNN module is input into the cross-layer LSTM module, which captures temporal expression information of the video sequence; and the feature vector output by the cross-layer LSTM module is input into the fully connected layer, which maps the learned deep semantic features to the sample label space to realize classification;
the feature-enhanced CNN module comprises a backbone CNN network and a feature-enhancement branch: the branch is tapped from an intermediate layer of the backbone CNN network, and its output is fused with the output of the backbone CNN network;
the cross-layer LSTM module comprises at least two cascaded LSTM networks: the output of the feature-enhanced CNN module is input to the first LSTM layer and is simultaneously cross-connected to the input of a later LSTM layer.
2. The expression recognition system based on enhanced CNN and cross-layer LSTM of claim 1, wherein the feature-enhancement branch of the feature-enhanced CNN module comprises convolutional layers, batch normalization layers, and a flatten layer: the input of the first convolutional layer is connected to an intermediate layer of the backbone CNN network; the output of each convolutional layer is connected to the input of a batch normalization layer; after the convolution/batch-normalization pairs are cascaded, the output of the last batch normalization layer is connected to the input of the flatten layer, and the output of the flatten layer is fused with the output of a fully connected layer of the backbone CNN network.
3. The system of claim 2, wherein the backbone CNN network of the feature-enhanced CNN module is a VGG-16 network.
4. The expression recognition system based on enhanced CNN and cross-layer LSTM of claim 3, wherein the feature-enhancement branch comprises two convolutional layers, the first using 7 × 7 convolution kernels and the second using 1 × 1 convolution kernels.
5. The enhanced CNN and cross-layer LSTM based expression recognition system of claim 4, wherein the output of the feature enhancement branch is connected to the output of the first fully-connected layer of the backbone CNN network.
6. The system of claim 1, wherein the cross-layer LSTM module comprises two cascaded LSTM layers, wherein the output of the feature-enhanced CNN module is input to the first LSTM layer and is also cross-connected to the input of the second LSTM layer.
7. The system of claim 6, wherein the feature vectors output by the two-layer LSTM network have dimensions of 2048.
8. The recognition method of the expression recognition system based on enhanced CNN and cross-layer LSTM as claimed in any one of claims 1 to 7, comprising the steps of:
Step A: perform face detection on the facial expression video, crop the face ROI (region of interest), and remove background interference;
Step B: divide the preprocessed facial expression video into several video sequences in groups of n frames;
Step C: input each video sequence in turn into the expression recognition system based on enhanced CNN and cross-layer LSTM, and compute the probability that the sequence belongs to each expression class through the fully connected layer and activation function;
Step D: average the probability values of all sequences of the video for each expression class; the expression class with the largest average probability is the video's expression label.
9. The recognition method of the expression recognition system based on enhanced CNN and cross-layer LSTM of claim 8, wherein in step B, n is at most 1/2 of the facial expression video length, and adjacent video sequences overlap by n/2 frames.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010324539.8A | 2020-04-22 | 2020-04-22 | Expression recognition system and method based on enhanced CNN and cross-layer LSTM |

Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN111523461A | 2020-08-11 |
Family

ID=71903191

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202010324539.8A | Expression recognition system and method based on enhanced CNN and cross-layer LSTM | 2020-04-22 | 2020-04-22 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN111523461A (en) |
Patent Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| US20170046613A1 | 2015-08-10 | 2017-02-16 | Facebook, Inc. | Systems and methods for content classification and detection using convolutional neural networks |
| CN108304823A | 2018-02-24 | 2018-07-20 | Chongqing University of Posts and Telecommunications | Expression recognition method based on dual-convolution CNN and long short-term memory network |
| CN110059662A | 2019-04-26 | 2019-07-26 | Shandong University | Deep video behavior recognition method and system |

Non-Patent Citations (2)

| Title |
| --- |
| Lu Guanming et al.: "Speech emotion recognition based on long short-term memory and convolutional neural networks", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) |
| Chen Le: "Video expression recognition with an end-to-end feature-enhanced neural network", Journal of Chongqing University of Technology (Natural Science) |
Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN114170540A | 2020-08-21 | 2022-03-11 | Sichuan University | Individual emotion recognition method fusing expression and gesture |
| CN114170540B | 2020-08-21 | 2023-06-13 | Sichuan University | Individual emotion recognition method integrating expression and gesture |
| CN112734911A | 2021-01-07 | 2021-04-30 | Beijing Union University | Single-image three-dimensional face reconstruction method and system based on convolutional neural network |
Legal Events

| Code | Title | Description |
| --- | --- | --- |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200811 |
RJ01 | Rejection of invention patent application after publication |