CN113903063A - Facial expression recognition method and system based on deep spatiotemporal network decision fusion - Google Patents

Facial expression recognition method and system based on deep spatiotemporal network decision fusion

Info

Publication number
CN113903063A
Authority
CN
China
Prior art keywords
facial expression
facial
image
classification result
expressions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111136083.3A
Other languages
Chinese (zh)
Inventor
陈宣池
郑向伟
张利峰
郑法
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202111136083.3A priority Critical patent/CN113903063A/en
Publication of CN113903063A publication Critical patent/CN113903063A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a facial expression recognition method based on deep spatiotemporal network decision fusion, which comprises the following steps: preprocessing each facial expression image of an original facial expression data set, extracting the facial landmark vector of each preprocessed facial expression image, and selecting a peak expression image from each preprocessed facial expression image sequence; obtaining the global temporal features of the facial expression from the facial landmark point vectors, and classifying these global temporal features to obtain a first facial expression classification result; obtaining the spatial features of the facial expression from the selected peak expression image, and classifying these spatial features to obtain a second facial expression classification result; and performing decision-level fusion of the first facial expression classification result and the second facial expression classification result to obtain the final facial expression classification result. By using both the global temporal features and the spatial features of the facial expression images to obtain the final classification result, a better facial expression recognition effect is achieved.

Description

Facial expression recognition method and system based on deep spatiotemporal network decision fusion
Technical Field
The disclosure belongs to the field of facial expression recognition in the technical field of emotion recognition, and particularly relates to a facial expression recognition method and system based on deep spatiotemporal network decision fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Facial expression is the most common way for human beings to express inner emotion and is a key factor for machine perception of human emotion. With the increase of computing power, facial expression recognition is becoming a research hotspot in the field of human-computer interaction. Facial Expression Recognition (FER) refers to analyzing a person's inner emotional changes by capturing facial expressions and their changes with a computer. It has been widely used in many fields such as distance education, public safety, and commercial marketing. However, FER is a rather complicated face analysis task, and even humans find it difficult to accurately recognize the emotions of others from a single facial feature.
Current FER methods can be broadly divided into two categories: FER methods based on a single static image and FER methods based on a dynamic image sequence. FER methods based on a single static image can effectively extract spatial features, but cannot exploit the dynamic information in the course of a facial expression. FER methods based on a dynamic image sequence can capture the dynamic change of the face across several consecutive frames in an image sequence and thereby extract temporal features, but often ignore the spatial information of the images. Both kinds of method attend to only one type of facial expression feature, and it is difficult for either to achieve very high recognition accuracy. Therefore, how to extract more effective temporal and spatial features and apply them to the FER task has become a key challenge.
The prior art has the following technical problems:
Convolutional Neural Networks (CNNs) have become an effective facial expression feature extraction model, but the small size of facial expression databases is a major obstacle to directly applying these CNN models to recognize facial expressions. Some researchers have proposed that reducing the hidden-layer depth of the neural network is a feasible way to overcome the overfitting problem, but a shallow network can hardly achieve the desired effect and is not conducive to extracting deep features. How to retain the spatial features in the hidden layers of a convolutional neural network in a targeted manner has therefore become a major technical difficulty.
The generation of a facial expression can be regarded as the dynamic change of the key parts of the face, and the overall change process of the facial expression is formed by the joint expression of these dynamic changes. Traditional facial expression recognition methods usually adopt hand-crafted descriptors to extract the temporal features hidden in facial images. With the wide application of deep learning, researchers have directly fed facial image sequences into Recurrent Neural Networks (RNNs) to extract the temporal features of facial expressions, achieving good results. How to capture the dynamic information of the key parts of the face in a facial expression video is a technical problem that currently needs to be solved.
Disclosure of Invention
In order to solve the technical problems existing in the background art, the present disclosure provides a facial expression recognition method and system based on deep spatiotemporal network decision fusion, which divides a facial expression image sequence into four subsequences according to face regions and constructs a BiLSTM model for each of the four subsequences to extract local temporal features, so that the local morphological features of facial expressions can be captured in more detail and the temporal features of dynamic facial expressions are exploited to the greatest extent; a VGG network is used to extract the shallow spatial feature map of the peak expression image and SENet is used to assign channel weights to it, so that effective spatial features can be retained in a targeted manner and the risk of model overfitting is reduced.
In order to achieve the purpose, the following technical scheme is adopted in the disclosure:
the first aspect of the present disclosure provides a facial expression recognition method based on deep spatiotemporal network decision fusion, including:
preprocessing each facial expression image of an original facial expression data set, extracting facial landmark vectors of each preprocessed facial expression image, and selecting a peak expression image of each preprocessed facial expression image;
obtaining global time sequence characteristics of the facial expressions according to the facial marker point vectors, and carrying out facial expression classification on the global time sequence characteristics of the facial expressions to obtain a first facial expression classification result;
obtaining the spatial characteristics of the facial expressions according to the selected peak expression images, and carrying out facial expression classification on the spatial characteristics of the facial expressions to obtain a second facial expression classification result;
and performing decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
A second aspect of the present disclosure provides a facial expression recognition system based on deep spatiotemporal network decision fusion, including:
the data preprocessing module is configured to preprocess each facial expression image of the original facial expression data set, extract a facial landmark vector of each preprocessed facial expression image, and select a peak expression image;
the time sequence feature extraction module is configured to obtain the global time sequence features of the facial expressions according to the facial marker point vectors, and perform facial expression classification on the global time sequence features of the facial expressions to obtain a first facial expression classification result;
the spatial feature extraction module is configured to obtain spatial features of the facial expressions according to the selected peak expression images, and perform facial expression classification on the spatial features of the facial expressions to obtain a second facial expression classification result;
and the decision fusion module is configured to perform decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in a method for facial expression recognition based on deep spatiotemporal network decision fusion as described in the first aspect above.
A fourth aspect of the present disclosure provides a computer device.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for facial expression recognition based on deep spatiotemporal network decision fusion as described in the first aspect.
Compared with the prior art, the beneficial effect of this disclosure is:
the method comprises the steps of firstly preprocessing an original facial expression image sequence, wherein the preprocessing step comprises face cutting, gray level processing, data enhancement, facial marker point extraction and peak image frame selection; secondly, a time sequence feature extraction module based on a bidirectional Long Short-Term Memory network (BilSTM) is provided, facial mark points are respectively input into the BilSTM according to different face areas to extract local time sequence features, the local time sequence features are fused to obtain global time sequence features of the facial expression, and the global time sequence features are input into softmax to calculate a facial expression classification result; thirdly, a spatial feature extraction module based on VGG (Visual Geometry Group, VGG) and SENet (query and acquisition Networks, SENet) is provided, a shallow spatial feature map of a peak expression image is extracted by using a VGG network, channel weights are distributed to the shallow spatial feature map by using the SENet, the obtained weighted feature map is used as the spatial features of the facial expressions, and the spatial features are input into softmax to calculate the facial expression classification result; and fourthly, integrating the facial expression classification results of the time sequence feature extraction module and the spatial feature extraction module by adopting a weighted average fusion mode to obtain a final facial expression classification result. Finally, The performance of The method was evaluated using The public data sets CK + (The Extended Cohn-Kanade Dataset, CK +) and Ouu-CASIA (Ouu-CASIA NIR & VIS Facial Expression Database, Ouu-CASIA).
The present disclosure consists of 4 parts: a data preprocessing module, a temporal feature extraction module, a spatial feature extraction module, and a decision fusion module. Analysis shows that the generation of a facial expression can often be regarded as the dynamic changes of the key parts of the face (eyebrows, eyes, nose, and mouth), and the overall change process of the facial expression is formed by the joint expression of these dynamic changes. The invention therefore provides a temporal feature extraction module that divides the facial expression image sequence into four subsequences according to face regions and constructs a BiLSTM model for each of the four subsequences to extract local temporal features, which can capture the local morphological features of facial expressions in more detail and make maximal use of the temporal features of dynamic facial expressions.
In addition, the spatial features need to be retained in the hidden layers of the convolutional neural network in a targeted manner. The present disclosure provides a spatial feature extraction module that uses a VGG network to extract the shallow spatial feature map of the peak expression image and uses SENet to assign channel weights to it, so that effective spatial features can be retained in a targeted manner and the risk of model overfitting is reduced.
Finally, in order to fully integrate the information of the time sequence feature and the spatial feature, the facial expression classification results of the time sequence feature extraction module and the spatial feature extraction module are fused in a weighted average mode, and the final facial expression classification result is obtained and output.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a flowchart of a facial expression recognition method based on deep spatiotemporal network decision fusion according to a first embodiment of the present disclosure;
FIG. 2 is a flowchart of an example of a facial expression recognition method based on deep spatiotemporal network decision fusion in an embodiment of the present disclosure;
FIG. 3 is a block diagram of the timing feature extraction module according to a second embodiment of the disclosure;
fig. 4 is a design diagram of a spatial feature extraction module in the second embodiment of the disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
It is noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a segment, or a portion of code, which may comprise one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example one
As shown in fig. 1, the present embodiment provides a facial expression recognition method based on deep spatiotemporal network decision fusion, and the present embodiment is exemplified by applying the method to a server, it can be understood that the method may also be applied to a terminal, and may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network server, cloud communication, middleware service, a domain name service, a security service CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. In this embodiment, the method includes the steps of:
step S1: aiming at an original facial expression data set D, carrying out face cutting, gray processing and data enhancement on each image in the data set D to obtain a data set D*
Step S2: for data set D*Extracting the face of each gray-level human face imagePart of the landmark point vector, and from the data set D*Selecting a peak expression image from each facial expression image sequence;
step S3: dividing the facial landmark point vector into four sub-vectors according to key parts of the face, such as eyebrows, eyes, a nose and a mouth; and inputting the facial expression data into four BilSTMs respectively, and extracting local time sequence characteristics of the facial expression. Obtaining global time sequence characteristic F of facial expression by fusing local time sequence characteristicsbenmWill FbenmSending the facial expression classification result P into a softmax classifier to calculateT
Step S4: extracting shallow space features of the peak expression image by using a VGG network, distributing channel weights to the shallow space features by using SEnet to obtain a weighted feature map, and taking the weighted feature map as the space features F of the facial expressionsWill FsSending the facial expression classification result P into a softmax classifier to calculateS
Step S5: to PTAnd PSAnd performing weighted fusion, and calculating and outputting a final facial expression classification result.
In step S1 of the embodiment, face cropping is performed on the image sequences in the original facial expression recognition data set, and images irrelevant to the facial expression are removed to obtain facial expression image sequences; grayscale processing is applied to the facial expression image sequences, with the aim of keeping only the facial expression features and avoiding the influence of factors such as illumination and color on the classification result; data enhancement is performed on the grayscale-processed image sequences, expanding the data set 14-fold by rotation and flipping; and the facial landmark points of each facial expression image in the expanded data set are extracted, and a peak expression image is selected for each facial expression image sequence.
The input original facial expression data set is initialized as D = [S_1, S_2, …, S_N], where S_1, S_2, …, S_N denote the facial expression image sequences of the subjects, n ∈ N, and N is the number of subjects. For each image S_ni in each subject's facial expression image sequence in the data set D, face cropping and grayscale processing are carried out to obtain a grayscale face image S_ni*, where i ∈ I and I is the number of images contained in a facial expression image sequence; the Dlib toolkit is used for the face cropping and grayscale processing, and the cropped face images are 64 × 64 pixels;
Then an offline data enhancement method is used: each training image is rotated by the angles {-15°, -10°, -5°, 0°, 5°, 10°, 15°}, and each rotated image is also flipped about the X-axis, so that each S_ni* is flipped and rotated into 14 variants, yielding the data set D*.
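As an illustration of this preprocessing stage, a minimal Python sketch using OpenCV and the Dlib frontal face detector is given below; the helper names and the exact cropping logic are assumptions added for illustration, while the 14 variants come from the seven rotation angles combined with an X-axis flip as described above.

```python
import cv2
import dlib
import numpy as np

# Hypothetical helpers; the angle set and the 64x64 crop size follow the description above.
ROTATION_ANGLES = [-15, -10, -5, 0, 5, 10, 15]   # 7 angles x (rotated + flipped) = 14 variants
detector = dlib.get_frontal_face_detector()

def crop_and_gray(image_bgr):
    """Detect the face, crop it, convert to grayscale, and resize to 64x64 pixels."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None                          # frame irrelevant to the facial expression
    f = faces[0]
    face = gray[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()]
    return cv2.resize(face, (64, 64))

def augment_14x(face_64):
    """Rotate by each angle and flip each rotated image about the X-axis (14 variants)."""
    h, w = face_64.shape
    out = []
    for angle in ROTATION_ANGLES:
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(face_64, m, (w, h))
        out.append(rotated)
        out.append(cv2.flip(rotated, 0))     # flip about the X-axis
    return out                               # 7 x 2 = 14 images per input frame
```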
Video data sets are usually stored in the form of frames, i.e., each video is cut into an image sequence. The data set D here contains N facial expression image sequences (that is, the expression-change videos of N subjects cut into image sequences), and S_1, S_2, …, S_N denote the expression image sequences of the N subjects (one per person). Each subject's expression image sequence contains multiple images, and S_ni denotes any one of them.
In step S2 of the present embodiment, the Dlib toolkit is used to extract 68 facial landmark points from each grayscale face image in the data set D*, and the last image of each facial expression image sequence is selected as the peak expression image. For each grayscale face image S_ni* in D*, its facial landmark point vector is extracted as
P_ni* = [p_1, p_2, …, p_M]
where M denotes the number of facial landmark points, and the peak expression image S_npf* is selected from each facial expression image sequence in the data set D*.
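A sketch of the 68-point landmark extraction and peak-frame selection could look like the following; the shape-predictor model file name and the fallback to a full-image rectangle when no face is detected are assumptions for illustration, while using the last frame as the peak expression image follows the description above.

```python
import dlib
import numpy as np

# Assumed model file: the commonly distributed dlib 68-landmark predictor.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
detector = dlib.get_frontal_face_detector()

def landmarks_68(gray_face):
    """Return a (68, 2) array of facial landmark coordinates for one grayscale face image."""
    rects = detector(gray_face, 1)
    rect = rects[0] if rects else dlib.rectangle(0, 0, gray_face.shape[1], gray_face.shape[0])
    shape = predictor(gray_face, rect)
    return np.array([[shape.part(m).x, shape.part(m).y] for m in range(68)], dtype=np.float32)

def sequence_features(gray_sequence):
    """Landmark vectors for every frame plus the last frame as the peak expression image."""
    points = np.stack([landmarks_68(img) for img in gray_sequence])   # (T, 68, 2)
    peak_image = gray_sequence[-1]                                    # last frame = peak expression
    return points, peak_image
```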
In step S3 of the present embodiment, the facial landmark point vector is divided into four sub-vectors according to the key parts of the face (eyebrows, eyes, nose, and mouth); the numbers of facial landmark points in the four sub-vectors are 10, 12, 9, and 22, respectively. First, the four sub-vectors are input into four BiLSTMs respectively to extract the local temporal features of the facial expression; then the local temporal features are fused to obtain the global temporal feature F_benm of the facial expression; finally, F_benm is sent into a softmax classifier to compute the first facial expression classification result P_T.
For the design of the temporal feature extraction module, the facial landmark point vector P_ni* is divided according to the key parts of the face (eyebrows, eyes, nose, and mouth) into four sub-vectors, yielding the vector matrices V_eb, V_ey, V_no, and V_mo corresponding to the eyebrows, eyes, nose, and mouth. The four vector matrices are each input into a BiLSTM, and the local temporal features of the eyebrows, eyes, nose, and mouth, corresponding to F_eb, F_ey, F_no, and F_mo, are obtained at the BiLSTM output layers. Feature fusion is then performed on these local temporal features to obtain the global temporal feature F_benm of the facial expression, which is sent into softmax for facial expression classification; the first facial expression classification result is stored as P_T = [P_T(1), P_T(2), …, P_T(K)] with 0 ≤ P_T(k) ≤ 1 and k ∈ K, where K is the number of facial expression classification categories.
Taking the local temporal feature F_eb of the eyebrows as an example, the BiLSTM hidden-layer computation for the t-th input image is as follows:
f_bt = σ[w_bf(h_{bt-1}, x_bt) + b_bf]   (1)
f_bt is the forget gate, which determines through a sigmoid activation function how much information from the previous state should be discarded; w_bf is the weight of the forget gate, b_bf is the bias of the forget gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
i_bt = σ[w_bi(h_{bt-1}, x_bt) + b_bi]   (2)
i_bt is the input gate, which determines the information that the node needs to retain at the current time, where σ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, w_bi is the weight of the input gate, b_bi is the bias of the input gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
c̃_bt = tanh[w_bc̃(h_{bt-1}, x_bt) + b_bc̃]   (3)
c̃_bt is the current candidate update unit, which contains all the update information of the current time node; how much of this information is retained is determined by the current update unit c_bt, where w_bc̃ is the weight of the candidate update unit, b_bc̃ is the bias of the candidate update unit, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
c_bt = f_bt ⊙ c_{bt-1} + i_bt ⊙ c̃_bt   (4)
o_bt = σ[w_bo(h_{bt-1}, x_bt) + b_bo]   (5)
h→_bt = o_bt ⊙ tanh(c_bt)   (6)
c_bt is the current update unit, which not only takes the available information of the candidate update unit but also, through the forget gate f_bt, the information c_{bt-1} of the previous image, and uses the sigmoid activation function to determine the output of the current update unit.
o_bt denotes the output gate; the output information it controls is multiplied with c_bt after tanh processing to obtain the forward LSTM hidden-layer output h→_bt of the t-th input image, where w_bo is the weight of the output gate, b_bo is the bias of the output gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input.
The backward LSTM hidden-layer output h←_bt is computed in the same way as the forward direction.
Merging the forward and backward LSTM hidden-layer outputs h→_bt and h←_bt of the BiLSTM gives the BiLSTM hidden-layer output h_bt covering both forward and backward information, which is taken as the local temporal feature F_eb:
h→_bt = LSTM(x_bt, h→_{bt-1})   (7)
h←_bt = LSTM(x_bt, h←_{bt-1})   (8)
h_bt = [h→_bt ; h←_bt]   (9)
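For concreteness, the gate computations of equations (1)-(6) and the forward/backward merging of equation (9) can be written out directly; the NumPy sketch below is illustrative, with assumed weight-matrix shapes and a simple Python loop rather than the patented implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_bt, h_prev, c_prev, w, b):
    """One LSTM step following equations (1)-(6); w and b hold the four gate parameter sets."""
    z = np.concatenate([h_prev, x_bt])
    f_bt = sigmoid(w["f"] @ z + b["f"])          # forget gate, eq. (1)
    i_bt = sigmoid(w["i"] @ z + b["i"])          # input gate, eq. (2)
    c_tilde = np.tanh(w["c"] @ z + b["c"])       # candidate update unit, eq. (3)
    c_bt = f_bt * c_prev + i_bt * c_tilde        # current update unit, eq. (4)
    o_bt = sigmoid(w["o"] @ z + b["o"])          # output gate, eq. (5)
    h_bt = o_bt * np.tanh(c_bt)                  # hidden output, eq. (6)
    return h_bt, c_bt

def bilstm_hidden(xs, w_fw, b_fw, w_bw, b_bw, hidden=64):
    """Run the sequence forward and backward and concatenate the outputs, eq. (9)."""
    h_f, c_f = np.zeros(hidden), np.zeros(hidden)
    h_b, c_b = np.zeros(hidden), np.zeros(hidden)
    fw, bw = [], []
    for x in xs:                                 # forward pass
        h_f, c_f = lstm_step(x, h_f, c_f, w_fw, b_fw)
        fw.append(h_f)
    for x in reversed(xs):                       # backward pass
        h_b, c_b = lstm_step(x, h_b, c_b, w_bw, b_bw)
        bw.append(h_b)
    bw.reverse()
    return [np.concatenate([hf, hb]) for hf, hb in zip(fw, bw)]
```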
The local temporal features F_eb, F_ey, F_no, and F_mo corresponding to the eyebrows, eyes, nose, and mouth are spliced to obtain the global temporal feature matrix F_benm:
F_benm = [F_eb; F_ey; F_no; F_mo]   (10)
The feature matrix F_benm is input into the softmax function for facial expression classification. P_T(k) denotes the predicted probability that the current subject's expression belongs to the k-th class, k ∈ K, where K is the number of facial expression classification categories and z denotes the input vector of the softmax function:
P_T(k) = exp(z_k) / Σ_{j=1}^{K} exp(z_j)   (11)
where z_j denotes the output value computed for the j-th class in the output vector and z_k denotes the output value of the class currently being computed. The loss function can be defined as:
Loss_T = -Σ_{k=1}^{K} T_k log(P_T(k))   (12)
where k ∈ K, K is the number of facial expression classification categories, and T_k denotes the expression label value of the current subject's ground truth;
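Assembling the module, a compact PyTorch sketch of the four-region BiLSTM branch with a softmax head might look as follows; the hidden size, the use of the last time step as the region feature, and the per-region input dimensions (two coordinates per landmark, with 10/12/9/22 landmarks as stated above) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    """Four BiLSTMs (eyebrows, eyes, nose, mouth) -> concatenation -> softmax, eqs. (10)-(11)."""
    def __init__(self, n_classes=7, hidden=64, region_points=(10, 12, 9, 22)):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(input_size=p * 2, hidden_size=hidden,
                    batch_first=True, bidirectional=True)
            for p in region_points
        )
        self.classifier = nn.Linear(2 * hidden * len(region_points), n_classes)

    def forward(self, regions):
        # regions: list of 4 tensors, each (batch, T, points * 2) for one face region
        feats = []
        for seq, lstm in zip(regions, self.lstms):
            out, _ = lstm(seq)               # (batch, T, 2 * hidden) local temporal features
            feats.append(out[:, -1, :])      # last time step taken as the region feature
        f_benm = torch.cat(feats, dim=1)     # global temporal feature, eq. (10)
        return torch.softmax(self.classifier(f_benm), dim=1)   # P_T, eq. (11)
```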
in step S4 of the present embodiment, a peak expression image pixel matrix having dimensions of 64 × 64 × 1 is input. Firstly, a shallow spatial feature map U with dimensions of 8 × 8 × 512 is obtained through convolution operations of 2 3 × 03 × 164, 2 3 × 3 × 128, 4 3 × 3 × 256, and 1 3 × 3 × 512 in a VGG network. Then, 512 channels of the shallow spatial feature map are assigned with weights by using SENEt and weighted, so as to obtain a weighted feature map. Finally, the weighted feature graph is used as the spatial feature F of the facial expressionsAnd sending the facial expression classification result P into softmax to calculate a second facial expression classification result PS
For the design of the spatial feature extraction module, the peak expression image S_npf* is first input into the VGG network to extract the shallow spatial feature map U_{A×B×G}, where A × B × G is the spatial feature dimension and G is the number of feature map channels. Then SENet is used to explicitly model the inter-dependencies between the feature map channels, automatically obtain the weight of each feature map channel, and assign these channel weights to obtain the spatial feature F_s of the facial expression. Finally, the spatial feature F_s is sent into softmax for facial expression classification, and the second facial expression classification result is stored as P_S = [P_S(1), P_S(2), …, P_S(K)] with 0 ≤ P_S(k) ≤ 1.
The calculation process for assigning the feature map channel weights using SENet is as follows:
(1) For the feature map U_{A×B×G}, an aggregate statistic V_g is calculated for each feature map channel g, g ∈ G, as:
V_g = (1 / (A × B)) Σ_{a=1}^{A} Σ_{b=1}^{B} u_g(a, b)   (13)
where A and B are the height and width of the two-dimensional feature map in each feature channel g, and u_g(a, b) denotes an entry of the g-th two-dimensional feature matrix in the shallow feature map U.
(2) The parameters w are trained using the information in V_g to assign weights to the feature map channels; the weight S_g of each feature map channel g is computed as:
S_g = σ(w_2 δ(w_1 V_g))   (14)
where δ denotes the ReLU activation function, w_1 ∈ R^{(G/r)×G} and w_2 ∈ R^{G×(G/r)} are the weight matrices of the two fully connected layers, and r is a hyperparameter.
(3) The channel weight S_g obtained in step (2) is multiplied with the original spatial feature map U_g to obtain the weighted feature map f_s(g):
f_s(g) = S_g U_g   (15)
(4) The spatial feature of the facial expression is then F_S = [f_s(1), f_s(2), …, f_s(G)], where g ∈ G and f_s(g) denotes the two-dimensional feature map of the g-th feature channel; F_S now represents the weighted spatial feature with dimensions A × B × G.
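Steps (1)-(4) follow the standard squeeze-and-excitation computation; a minimal PyTorch rendering, with the reduction ratio r treated as a free hyperparameter, might look like this:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel reweighting per equations (13)-(15): squeeze, excitation, scale."""
    def __init__(self, channels=512, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # w_1
        self.fc2 = nn.Linear(channels // r, channels)   # w_2

    def forward(self, u):                 # u: (batch, G, A, B) shallow feature map
        v = u.mean(dim=(2, 3))            # aggregate statistic V_g, eq. (13)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))   # channel weights S_g, eq. (14)
        return u * s.unsqueeze(-1).unsqueeze(-1)                # weighted map f_s(g), eq. (15)
```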
The spatial feature F_S is input into the softmax function for facial expression classification. P_S(k) denotes the predicted probability that the current subject's expression belongs to the k-th class, k ∈ K, where K is the number of facial expression classification categories and z denotes the input vector of the softmax function:
P_S(k) = exp(z_k) / Σ_{j=1}^{K} exp(z_j)   (16)
where z_j denotes the output value computed for the j-th class in the output vector and z_k denotes the output value of the class currently being computed. The loss function can be defined as:
Loss_S = -Σ_{k=1}^{K} T_k log(P_S(k))   (17)
where k ∈ K, K is the number of facial expression classification categories, and T_k denotes the expression label value of the current subject's ground truth;
in step S5 of the present embodiment, for the design of the decision fusion algorithm, P is selectedS(k) And PT(k) And performing weighted fusion, and calculating and outputting a final facial expression classification result. The calculation formula of the current facial expression classification result prediction (k) is as follows:
Prediction(k)=argmax(αPT(k)+(1-α)PS(k)) (18)
wherein K belongs to K, K is the number of face expression classification categories, alpha is 0.5, PT(k) Representing the first facial expression classification result, PS(k) And representing a second facial expression classification result.
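A direct NumPy rendering of formula (18), with α = 0.5 as stated above, could be the following sketch; the class probabilities in the usage example are illustrative values only.

```python
import numpy as np

def fuse_decisions(p_t, p_s, alpha=0.5):
    """Weighted decision-level fusion of the two classification results, formula (18)."""
    fused = alpha * np.asarray(p_t) + (1.0 - alpha) * np.asarray(p_s)
    return int(np.argmax(fused))        # index k of the predicted expression class

# Usage example with K = 7 expression classes.
p_t = [0.05, 0.10, 0.60, 0.05, 0.05, 0.10, 0.05]   # temporal-branch softmax output P_T
p_s = [0.10, 0.05, 0.55, 0.10, 0.05, 0.10, 0.05]   # spatial-branch softmax output P_S
print(fuse_decisions(p_t, p_s))                    # -> 2
```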
Example two
The embodiment provides a facial expression recognition system based on deep spatiotemporal network decision fusion, which comprises:
the data preprocessing module is configured to preprocess each facial expression image of the original facial expression data set, extract a facial landmark vector of each preprocessed facial expression image, and select a peak expression image;
the time sequence feature extraction module is configured to obtain the global time sequence features of the facial expressions according to the facial marker point vectors, and perform facial expression classification on the global time sequence features of the facial expressions to obtain a first facial expression classification result;
the spatial feature extraction module is configured to obtain spatial features of the facial expressions according to the selected peak expression images, and perform facial expression classification on the spatial features of the facial expressions to obtain a second facial expression classification result;
and the decision fusion module is configured to perform decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
It should be noted that the data preprocessing module, the temporal feature extraction module, the spatial feature extraction module, and the decision fusion module correspond to steps S1 to S5 in the first embodiment, and the modules are the same as the corresponding steps in the implementation example and application scenarios, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
Example three
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in a method for facial expression recognition based on deep spatiotemporal network decision fusion as described in the first embodiment above.
Example four
The embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the processor implements the steps in the method for recognizing facial expressions based on deep spatiotemporal network decision fusion as described in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A facial expression recognition method based on deep spatiotemporal network decision fusion is characterized by comprising the following steps:
preprocessing each facial expression image of an original facial expression data set, extracting facial landmark vectors of each preprocessed facial expression image, and selecting a peak expression image of each preprocessed facial expression image;
obtaining global time sequence characteristics of the facial expressions according to the facial marker point vectors, and carrying out facial expression classification on the global time sequence characteristics of the facial expressions to obtain a first facial expression classification result;
obtaining the spatial characteristics of the facial expressions according to the selected peak expression images, and carrying out facial expression classification on the spatial characteristics of the facial expressions to obtain a second facial expression classification result;
and performing decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
2. The method for facial expression recognition based on deep spatiotemporal network decision fusion as claimed in claim 1, wherein the preprocessing is performed on each facial expression image of the original facial expression dataset, specifically:
performing face clipping processing on each facial expression image of the original facial expression data set, and removing images irrelevant to facial expressions to obtain a facial expression image sequence;
carrying out gray level processing on the facial expression image sequence, and only keeping facial expression characteristics;
and performing data enhancement on the human face expression image sequence after the gray processing, and expanding the data set by 14 times by adopting a rotating and turning mode.
3. The method for recognizing facial expressions based on decision fusion of a deep space-time network as claimed in claim 1, wherein the global time series characteristics of the facial expressions are obtained according to the facial landmark vectors, and the global time series characteristics of the facial expressions are subjected to facial expression classification to obtain a first facial expression classification result, specifically:
dividing the facial landmark point vector into four sub-vectors according to eyebrows, eyes, a nose and a mouth;
inputting the four sub-vectors into four bidirectional Long Short-Term Memory networks (BiLSTMs) respectively, and respectively extracting the local temporal features of the facial expression;
fusing local time sequence characteristics of the facial expressions to obtain global time sequence characteristics of the facial expressions;
and classifying the global features of the facial expressions by using a softmax classifier to obtain a first facial expression classification result.
4. The facial expression recognition method based on the deep space-time network decision fusion as claimed in claim 1, wherein the spatial features of the facial expressions are obtained according to the selected peak expression image, and the spatial features of the facial expressions are subjected to facial expression classification to obtain a second facial expression classification result, specifically:
extracting a shallow spatial feature map of the peak expression image by using a VGG network;
assigning channel weights to the shallow spatial feature map by using SENet, and taking the weighted feature map as the spatial feature of the facial expression;
and classifying the spatial features of the facial expressions by using a softmax classifier to obtain a second facial expression classification result.
5. The method for recognizing facial expressions based on deep spatiotemporal network decision fusion as claimed in claim 3, wherein the four sub-vectors are respectively input into four bidirectional Long Short-Term Memory networks (BiLSTMs) to respectively extract the local temporal features of the facial expression, specifically comprising:
inputting the four sub-vectors into a BiLSTM respectively, and obtaining the local temporal features of the eyebrows, eyes, nose, and mouth, corresponding to F_eb, F_ey, F_no, and F_mo, at the output layer of each BiLSTM;
taking the local temporal feature F_eb of the eyebrows as an example, the BiLSTM hidden-layer computation for the t-th input image is as follows:
f_bt = σ[w_bf(h_{bt-1}, x_bt) + b_bf]   (1)
f_bt is the forget gate, which determines through a sigmoid activation function how much information from the previous state needs to be discarded; w_bf is the weight of the forget gate, b_bf is the bias of the forget gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
i_bt = σ[w_bi(h_{bt-1}, x_bt) + b_bi]   (2)
i_bt is the input gate, which determines the information that the node needs to retain at the current time, where σ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, w_bi is the weight of the input gate, b_bi is the bias of the input gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
c̃_bt = tanh[w_bc̃(h_{bt-1}, x_bt) + b_bc̃]   (3)
c̃_bt is the current candidate update unit, which contains all the update information of the current time node; how much of this information is retained is determined by the current update unit c_bt, where w_bc̃ is the weight of the candidate update unit, b_bc̃ is the bias of the candidate update unit, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
c_bt = f_bt ⊙ c_{bt-1} + i_bt ⊙ c̃_bt   (4)
o_bt = σ[w_bo(h_{bt-1}, x_bt) + b_bo]   (5)
h→_bt = o_bt ⊙ tanh(c_bt)   (6)
c_bt is the current update unit, which takes the available information of the candidate update unit and, through the forget gate f_bt, the information c_{bt-1} of the previous image, and uses the sigmoid activation function to determine the output of the current update unit; o_bt denotes the output gate, and the output information it controls is multiplied with c_bt after tanh processing to obtain the forward LSTM hidden-layer output h→_bt of the t-th input image, where w_bo is the weight of the output gate, b_bo is the bias of the output gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input; the backward LSTM hidden-layer output h←_bt is computed in the same way as the forward direction;
merging the forward and backward BiLSTM hidden-layer outputs h→_bt and h←_bt gives the BiLSTM hidden-layer output h_bt covering both forward and backward information, which is taken as the local temporal feature F_eb:
h→_bt = LSTM(x_bt, h→_{bt-1})   (7)
h←_bt = LSTM(x_bt, h←_{bt-1})   (8)
h_bt = [h→_bt ; h←_bt]   (9)
6. The facial expression recognition method based on deep spatiotemporal network decision fusion as claimed in claim 4, characterized in that the SENet is used to assign channel weights to the shallow space feature map, and the specific process is as follows:
Step (1): for the feature map U_{A×B×G}, an aggregate statistic V_g is calculated for each feature map channel g, g ∈ G, as:
V_g = (1 / (A × B)) Σ_{a=1}^{A} Σ_{b=1}^{B} u_g(a, b)   (13)
where A and B are the height and width of the two-dimensional feature map in each feature channel g, and u_g(a, b) denotes an entry of the g-th two-dimensional feature matrix in the shallow feature map U;
Step (2): the parameters w are trained using the information in V_g to assign weights to the feature map channels; the weight S_g of each feature map channel g is computed as:
S_g = σ(w_2 δ(w_1 V_g))   (14)
where δ denotes the ReLU activation function, w_1 ∈ R^{(G/r)×G} and w_2 ∈ R^{G×(G/r)} are the weight matrices of the two fully connected layers, and r is a hyperparameter;
Step (3): the channel weight S_g obtained in step (2) is multiplied with the original spatial feature map U_g to obtain the weighted feature map f_s(g):
f_s(g) = S_g U_g   (15);
Step (4): the spatial feature of the facial expression is F_S = [f_s(1), f_s(2), …, f_s(G)],
where g ∈ G and f_s(g) denotes the two-dimensional feature map of the g-th feature channel;
F_S then represents the weighted spatial feature with dimensions A × B × G.
7. The method for facial expression recognition based on decision fusion of deep spatiotemporal network as claimed in claim 1, wherein the calculation formula for decision-level fusion of the first facial expression classification result and the second facial expression classification result is:
Prediction(k) = argmax(α P_T(k) + (1 − α) P_S(k))   (18).
8. a facial expression recognition system based on deep spatiotemporal network decision fusion is characterized by comprising:
the data preprocessing module is configured to preprocess each facial expression image of the original facial expression data set, extract a facial landmark vector of each preprocessed facial expression image, and select a peak expression image;
the time sequence feature extraction module is configured to obtain the global time sequence features of the facial expressions according to the facial marker point vectors, and perform facial expression classification on the global time sequence features of the facial expressions to obtain a first facial expression classification result;
the spatial feature extraction module is configured to obtain spatial features of the facial expressions according to the selected peak expression images, and perform facial expression classification on the spatial features of the facial expressions to obtain a second facial expression classification result;
and the decision fusion module is configured to perform decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of a method for facial expression recognition based on deep spatiotemporal network decision fusion according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of a method for facial expression recognition based on deep spatiotemporal network decision fusion as claimed in any one of claims 1 to 7.
CN202111136083.3A 2021-09-27 2021-09-27 Facial expression recognition method and system based on deep spatiotemporal network decision fusion Pending CN113903063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111136083.3A CN113903063A (en) 2021-09-27 2021-09-27 Facial expression recognition method and system based on deep spatiotemporal network decision fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111136083.3A CN113903063A (en) 2021-09-27 2021-09-27 Facial expression recognition method and system based on deep spatiotemporal network decision fusion

Publications (1)

Publication Number Publication Date
CN113903063A true CN113903063A (en) 2022-01-07

Family

ID=79029812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111136083.3A Pending CN113903063A (en) 2021-09-27 2021-09-27 Facial expression recognition method and system based on deep spatiotemporal network decision fusion

Country Status (1)

Country Link
CN (1) CN113903063A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275070A (en) * 2023-10-11 2023-12-22 中邮消费金融有限公司 Video facial mask processing method and system based on micro-expressions


Similar Documents

Publication Publication Date Title
CN109919830B (en) Method for restoring image with reference eye based on aesthetic evaluation
Lin Face detection in complicated backgrounds and different illumination conditions by using YCbCr color space and neural network
Kae et al. Augmenting CRFs with Boltzmann machine shape priors for image labeling
Miksik et al. Efficient temporal consistency for streaming video scene analysis
US20230081982A1 (en) Image processing method and apparatus, computer device, storage medium, and computer program product
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN112800903A (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Sadeghi et al. HistNet: Histogram-based convolutional neural network with Chi-squared deep metric learning for facial expression recognition
CN111158491A (en) Gesture recognition man-machine interaction method applied to vehicle-mounted HUD
CN111476806A (en) Image processing method, image processing device, computer equipment and storage medium
Güçlü et al. End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
Minematsu et al. Simple background subtraction constraint for weakly supervised background subtraction network
Guo et al. Attribute-controlled face photo synthesis from simple line drawing
Reddi et al. CNN Implementing Transfer Learning for Facial Emotion Recognition
CN113903063A (en) Facial expression recognition method and system based on deep spatiotemporal network decision fusion
CN113449550A (en) Human body weight recognition data processing method, human body weight recognition method and device
Kakkar Facial expression recognition with LDPP & LTP using deep belief network
Chen Evaluation technology of classroom students’ learning state based on deep learning
Zhong et al. Unsupervised self-attention lightweight photo-to-sketch synthesis with feature maps
Zhao et al. Affective video classification based on spatio-temporal feature fusion
Sreemathy et al. Indian sign language interpretation using convolutional neural networks
Kartbayev et al. Development of a computer system for identity authentication using artificial neural networks
Nan et al. 3D RES-inception network transfer learning for multiple label crowd behavior recognition
Álvarez et al. Exploiting large image sets for road scene parsing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination