CN113903063A - Facial expression recognition method and system based on deep spatiotemporal network decision fusion - Google Patents

Facial expression recognition method and system based on deep spatiotemporal network decision fusion

Info

Publication number
CN113903063A
Authority
CN
China
Prior art keywords
facial expression
facial
image
classification result
expressions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111136083.3A
Other languages
Chinese (zh)
Inventor
陈宣池
郑向伟
张利峰
郑法
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202111136083.3A priority Critical patent/CN113903063A/en
Publication of CN113903063A publication Critical patent/CN113903063A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a facial expression recognition method based on deep spatiotemporal network decision fusion, which comprises the following steps: preprocessing each facial expression image of an original facial expression data set, extracting the facial landmark vector of each preprocessed facial expression image, and selecting a peak expression image from each preprocessed facial expression image sequence; obtaining the global temporal features of the facial expression from the facial landmark point vectors, and classifying these global temporal features to obtain a first facial expression classification result; obtaining the spatial features of the facial expression from the selected peak expression image, and classifying these spatial features to obtain a second facial expression classification result; and performing decision-level fusion of the first facial expression classification result and the second facial expression classification result to obtain the final facial expression classification result. By using both the global temporal features and the spatial features of the facial expression images to obtain the final classification result, a better facial expression recognition effect is achieved.

Description

Facial expression recognition method and system based on deep spatiotemporal network decision fusion
Technical Field
The disclosure belongs to the field of facial expression recognition in the technical field of emotion recognition, and particularly relates to a facial expression recognition method and system based on deep spatiotemporal network decision fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Facial expression is the most common way for human beings to express inner emotion and is a key factor for machine perception of human emotion. With the increase of computing power, facial expression recognition is becoming a research hotspot in the field of human-computer interaction. Facial Expression Recognition (FER) refers to analyzing a person's inner emotional changes by capturing facial expressions and their changes with a computer. It has been widely used in many fields such as distance education, public safety, and commercial marketing. However, FER is a rather complicated face analysis task, and even humans find it difficult to accurately recognize the emotions of others from a single facial feature.
Current FER methods can be broadly divided into two categories: FER methods based on a single static image and FER methods based on a dynamic image sequence. FER methods based on a single static image can effectively extract spatial features, but cannot exploit the dynamic information in the course of a facial expression. FER methods based on a dynamic image sequence can capture the dynamic change of the face across several consecutive frames in an image sequence and thereby extract temporal features, but often ignore the spatial information of the images. Both kinds of method attend to only one type of facial expression feature, and it is difficult for either to achieve very high recognition accuracy. Therefore, how to extract more effective temporal and spatial features and apply them to the FER task has become a key challenge.
The prior art has the following technical problems:
Convolutional Neural Networks (CNNs) have become an effective facial expression feature extraction model, but the small size of facial expression databases is a major obstacle to directly applying these CNN models to recognize facial expressions. Some researchers have proposed that reducing the hidden-layer depth of the neural network is a feasible way to overcome the overfitting problem, but a shallow network can hardly achieve the desired effect and is not conducive to extracting deep features. How to retain the spatial features in the hidden layers of a convolutional neural network in a targeted manner has therefore become a major technical difficulty.
The generation of a facial expression can be regarded as the dynamic change of the key parts of the face, and the overall change process of the facial expression is formed by the joint expression of these dynamic changes. Traditional facial expression recognition methods usually adopt hand-crafted descriptors to extract the temporal features hidden in facial images. With the wide application of deep learning, researchers have directly fed facial image sequences into Recurrent Neural Networks (RNNs) to extract the temporal features of facial expressions, achieving good results. How to capture the dynamic information of the key parts of the face in a facial expression video is a technical problem that currently needs to be solved.
Disclosure of Invention
In order to solve the technical problems existing in the background art, the present disclosure provides a facial expression recognition method and system based on deep spatiotemporal network decision fusion, which divides a facial expression image sequence into four subsequences according to face regions and constructs a BiLSTM model for each of the four subsequences to extract local temporal features, so that the local morphological features of facial expressions can be captured in more detail and the temporal features of dynamic facial expressions are exploited to the greatest extent; a VGG network is used to extract the shallow spatial feature map of the peak expression image and SENet is used to assign channel weights to it, so that effective spatial features can be retained in a targeted manner and the risk of model overfitting is reduced.
In order to achieve the purpose, the following technical scheme is adopted in the disclosure:
the first aspect of the present disclosure provides a facial expression recognition method based on deep spatiotemporal network decision fusion, including:
preprocessing each facial expression image of an original facial expression data set, extracting facial landmark vectors of each preprocessed facial expression image, and selecting a peak expression image of each preprocessed facial expression image;
obtaining global time sequence characteristics of the facial expressions according to the facial marker point vectors, and carrying out facial expression classification on the global time sequence characteristics of the facial expressions to obtain a first facial expression classification result;
obtaining the spatial characteristics of the facial expressions according to the selected peak expression images, and carrying out facial expression classification on the spatial characteristics of the facial expressions to obtain a second facial expression classification result;
and performing decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
A second aspect of the present disclosure provides a facial expression recognition system based on deep spatiotemporal network decision fusion, including:
the data preprocessing module is configured to preprocess each facial expression image of the original facial expression data set, extract a facial landmark vector of each preprocessed facial expression image, and select a peak expression image;
the time sequence feature extraction module is configured to obtain the global time sequence features of the facial expressions according to the facial marker point vectors, and perform facial expression classification on the global time sequence features of the facial expressions to obtain a first facial expression classification result;
the spatial feature extraction module is configured to obtain spatial features of the facial expressions according to the selected peak expression images, and perform facial expression classification on the spatial features of the facial expressions to obtain a second facial expression classification result;
and the decision fusion module is configured to perform decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in a method for facial expression recognition based on deep spatiotemporal network decision fusion as described in the first aspect above.
A fourth aspect of the present disclosure provides a computer device.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for facial expression recognition based on deep spatiotemporal network decision fusion as described in the first aspect.
Compared with the prior art, the beneficial effect of this disclosure is:
the method comprises the steps of firstly preprocessing an original facial expression image sequence, wherein the preprocessing step comprises face cutting, gray level processing, data enhancement, facial marker point extraction and peak image frame selection; secondly, a time sequence feature extraction module based on a bidirectional Long Short-Term Memory network (BilSTM) is provided, facial mark points are respectively input into the BilSTM according to different face areas to extract local time sequence features, the local time sequence features are fused to obtain global time sequence features of the facial expression, and the global time sequence features are input into softmax to calculate a facial expression classification result; thirdly, a spatial feature extraction module based on VGG (Visual Geometry Group, VGG) and SENet (query and acquisition Networks, SENet) is provided, a shallow spatial feature map of a peak expression image is extracted by using a VGG network, channel weights are distributed to the shallow spatial feature map by using the SENet, the obtained weighted feature map is used as the spatial features of the facial expressions, and the spatial features are input into softmax to calculate the facial expression classification result; and fourthly, integrating the facial expression classification results of the time sequence feature extraction module and the spatial feature extraction module by adopting a weighted average fusion mode to obtain a final facial expression classification result. Finally, The performance of The method was evaluated using The public data sets CK + (The Extended Cohn-Kanade Dataset, CK +) and Ouu-CASIA (Ouu-CASIA NIR & VIS Facial Expression Database, Ouu-CASIA).
The present disclosure consists of 4 parts: a data preprocessing module, a temporal feature extraction module, a spatial feature extraction module, and a decision fusion module. Analysis shows that the generation of a facial expression can often be regarded as the dynamic changes of the key parts of the face (eyebrows, eyes, nose, and mouth), and the overall change process of the facial expression is formed by the joint expression of these dynamic changes. The invention therefore provides a temporal feature extraction module that divides the facial expression image sequence into four subsequences according to face regions and constructs a BiLSTM model for each of the four subsequences to extract local temporal features, which can capture the local morphological features of facial expressions in more detail and make maximal use of the temporal features of dynamic facial expressions.
In addition, the spatial features need to be retained in the hidden layers of the convolutional neural network in a targeted manner. The present disclosure provides a spatial feature extraction module that uses a VGG network to extract the shallow spatial feature map of the peak expression image and uses SENet to assign channel weights to it, so that effective spatial features can be retained in a targeted manner and the risk of model overfitting is reduced.
Finally, in order to fully integrate the information of the time sequence feature and the spatial feature, the facial expression classification results of the time sequence feature extraction module and the spatial feature extraction module are fused in a weighted average mode, and the final facial expression classification result is obtained and output.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a flowchart of a facial expression recognition method based on deep spatiotemporal network decision fusion according to a first embodiment of the present disclosure;
FIG. 2 is a flowchart of an example of a facial expression recognition method based on deep spatiotemporal network decision fusion in an embodiment of the present disclosure;
FIG. 3 is a block diagram of the timing feature extraction module according to a second embodiment of the disclosure;
fig. 4 is a design diagram of a spatial feature extraction module in the second embodiment of the disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
It is noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a segment, or a portion of code, which may comprise one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example one
As shown in fig. 1, the present embodiment provides a facial expression recognition method based on deep spatiotemporal network decision fusion, and the present embodiment is exemplified by applying the method to a server, it can be understood that the method may also be applied to a terminal, and may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network server, cloud communication, middleware service, a domain name service, a security service CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. In this embodiment, the method includes the steps of:
step S1: aiming at an original facial expression data set D, carrying out face cutting, gray processing and data enhancement on each image in the data set D to obtain a data set D*
Step S2: for data set D*Extracting the face of each gray-level human face imagePart of the landmark point vector, and from the data set D*Selecting a peak expression image from each facial expression image sequence;
step S3: dividing the facial landmark point vector into four sub-vectors according to key parts of the face, such as eyebrows, eyes, a nose and a mouth; and inputting the facial expression data into four BilSTMs respectively, and extracting local time sequence characteristics of the facial expression. Obtaining global time sequence characteristic F of facial expression by fusing local time sequence characteristicsbenmWill FbenmSending the facial expression classification result P into a softmax classifier to calculateT
Step S4: extracting shallow space features of the peak expression image by using a VGG network, distributing channel weights to the shallow space features by using SEnet to obtain a weighted feature map, and taking the weighted feature map as the space features F of the facial expressionsWill FsSending the facial expression classification result P into a softmax classifier to calculateS
Step S5: to PTAnd PSAnd performing weighted fusion, and calculating and outputting a final facial expression classification result.
In step S1 of the embodiment, face cropping is performed on the image sequences in the original facial expression recognition data set, and images irrelevant to the facial expression are removed to obtain facial expression image sequences; grayscale processing is applied to the facial expression image sequences, with the aim of keeping only the facial expression features and avoiding the influence of factors such as illumination and color on the classification result; data enhancement is performed on the grayscale-processed image sequences, expanding the data set 14-fold by rotation and flipping; and the facial landmark points of each facial expression image in the expanded data set are extracted, and a peak expression image is selected for each facial expression image sequence.
The input original facial expression data set is initialized as D = [S_1, S_2, …, S_N], where S_1, S_2, …, S_N denote the facial expression image sequences of the subjects, n ∈ N, and N is the number of subjects. For each image S_ni in each subject's facial expression image sequence in the data set D, face cropping and grayscale processing are carried out to obtain a grayscale face image S_ni*, where i ∈ I and I is the number of images contained in a facial expression image sequence; the Dlib toolkit is used for the face cropping and grayscale processing, and the cropped face images are 64 × 64 pixels;
Then an offline data enhancement method is used: each training image is rotated by the angles {-15°, -10°, -5°, 0°, 5°, 10°, 15°}, and each rotated image is also flipped about the X-axis, so that each S_ni* is flipped and rotated into 14 variants, yielding the data set D*.
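As an illustration of this preprocessing stage, a minimal Python sketch using OpenCV and the Dlib frontal face detector is given below; the helper names and the exact cropping logic are assumptions added for illustration, while the 14 variants come from the seven rotation angles combined with an X-axis flip as described above.

```python
import cv2
import dlib
import numpy as np

# Hypothetical helpers; the angle set and the 64x64 crop size follow the description above.
ROTATION_ANGLES = [-15, -10, -5, 0, 5, 10, 15]   # 7 angles x (rotated + flipped) = 14 variants
detector = dlib.get_frontal_face_detector()

def crop_and_gray(image_bgr):
    """Detect the face, crop it, convert to grayscale, and resize to 64x64 pixels."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None                          # frame irrelevant to the facial expression
    f = faces[0]
    face = gray[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()]
    return cv2.resize(face, (64, 64))

def augment_14x(face_64):
    """Rotate by each angle and flip each rotated image about the X-axis (14 variants)."""
    h, w = face_64.shape
    out = []
    for angle in ROTATION_ANGLES:
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(face_64, m, (w, h))
        out.append(rotated)
        out.append(cv2.flip(rotated, 0))     # flip about the X-axis
    return out                               # 7 x 2 = 14 images per input frame
```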
Video data sets are usually stored in the form of frames, i.e., each video is cut into an image sequence. The data set D here contains N facial expression image sequences (that is, the expression-change videos of N subjects cut into image sequences), and S_1, S_2, …, S_N denote the expression image sequences of the N subjects (one per person). Each subject's expression image sequence contains multiple images, and S_ni denotes any one of them.
In step S2 of the present embodiment, the Dlib toolkit is used to extract 68 facial landmark points from each grayscale face image in the data set D*, and the last image of each facial expression image sequence is selected as the peak expression image. For each grayscale face image S_ni* in D*, its facial landmark point vector is extracted as
P_ni* = [p_1, p_2, …, p_M]
where M denotes the number of facial landmark points, and the peak expression image S_npf* is selected from each facial expression image sequence in the data set D*.
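A sketch of the 68-point landmark extraction and peak-frame selection could look like the following; the shape-predictor model file name and the fallback to a full-image rectangle when no face is detected are assumptions for illustration, while using the last frame as the peak expression image follows the description above.

```python
import dlib
import numpy as np

# Assumed model file: the commonly distributed dlib 68-landmark predictor.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
detector = dlib.get_frontal_face_detector()

def landmarks_68(gray_face):
    """Return a (68, 2) array of facial landmark coordinates for one grayscale face image."""
    rects = detector(gray_face, 1)
    rect = rects[0] if rects else dlib.rectangle(0, 0, gray_face.shape[1], gray_face.shape[0])
    shape = predictor(gray_face, rect)
    return np.array([[shape.part(m).x, shape.part(m).y] for m in range(68)], dtype=np.float32)

def sequence_features(gray_sequence):
    """Landmark vectors for every frame plus the last frame as the peak expression image."""
    points = np.stack([landmarks_68(img) for img in gray_sequence])   # (T, 68, 2)
    peak_image = gray_sequence[-1]                                    # last frame = peak expression
    return points, peak_image
```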
In step S3 of the present embodiment, the facial landmark point vector is divided into four sub-vectors according to the key parts of the face (eyebrows, eyes, nose, and mouth); the numbers of facial landmark points in the four sub-vectors are 10, 12, 9, and 22, respectively. First, the four sub-vectors are input into four BiLSTMs respectively to extract the local temporal features of the facial expression; then the local temporal features are fused to obtain the global temporal feature F_benm of the facial expression; finally, F_benm is sent into a softmax classifier to compute the first facial expression classification result P_T.
For the design of the temporal feature extraction module, the facial landmark point vector P_ni* is divided according to the key parts of the face (eyebrows, eyes, nose, and mouth) into four sub-vectors, yielding the vector matrices V_eb, V_ey, V_no, and V_mo corresponding to the eyebrows, eyes, nose, and mouth. The four vector matrices are each input into a BiLSTM, and the local temporal features of the eyebrows, eyes, nose, and mouth, corresponding to F_eb, F_ey, F_no, and F_mo, are obtained at the BiLSTM output layers. Feature fusion is then performed on these local temporal features to obtain the global temporal feature F_benm of the facial expression, which is sent into softmax for facial expression classification; the first facial expression classification result is stored as P_T = [P_T(1), P_T(2), …, P_T(K)] with 0 ≤ P_T(k) ≤ 1 and k ∈ K, where K is the number of facial expression classification categories.
Taking the local temporal feature F_eb of the eyebrows as an example, the BiLSTM hidden-layer computation for the t-th input image is as follows:
f_bt = σ[w_bf(h_{bt-1}, x_bt) + b_bf]   (1)
f_bt is the forget gate, which determines through a sigmoid activation function how much information from the previous state should be discarded; w_bf is the weight of the forget gate, b_bf is the bias of the forget gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
i_bt = σ[w_bi(h_{bt-1}, x_bt) + b_bi]   (2)
i_bt is the input gate, which determines the information that the node needs to retain at the current time, where σ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, w_bi is the weight of the input gate, b_bi is the bias of the input gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
c̃_bt = tanh[w_bc̃(h_{bt-1}, x_bt) + b_bc̃]   (3)
c̃_bt is the current candidate update unit, which contains all the update information of the current time node; how much of this information is retained is determined by the current update unit c_bt, where w_bc̃ is the weight of the candidate update unit, b_bc̃ is the bias of the candidate update unit, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
c_bt = f_bt ⊙ c_{bt-1} + i_bt ⊙ c̃_bt   (4)
o_bt = σ[w_bo(h_{bt-1}, x_bt) + b_bo]   (5)
h→_bt = o_bt ⊙ tanh(c_bt)   (6)
c_bt is the current update unit, which not only takes the available information of the candidate update unit but also, through the forget gate f_bt, the information c_{bt-1} of the previous image, and uses the sigmoid activation function to determine the output of the current update unit.
o_bt denotes the output gate; the output information it controls is multiplied with c_bt after tanh processing to obtain the forward LSTM hidden-layer output h→_bt of the t-th input image, where w_bo is the weight of the output gate, b_bo is the bias of the output gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input.
The backward LSTM hidden-layer output h←_bt is computed in the same way as the forward direction.
Merging the forward and backward LSTM hidden-layer outputs h→_bt and h←_bt of the BiLSTM gives the BiLSTM hidden-layer output h_bt covering both forward and backward information, which is taken as the local temporal feature F_eb:
h→_bt = LSTM(x_bt, h→_{bt-1})   (7)
h←_bt = LSTM(x_bt, h←_{bt-1})   (8)
h_bt = [h→_bt ; h←_bt]   (9)
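For concreteness, the gate computations of equations (1)-(6) and the forward/backward merging of equation (9) can be written out directly; the NumPy sketch below is illustrative, with assumed weight-matrix shapes and a simple Python loop rather than the patented implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_bt, h_prev, c_prev, w, b):
    """One LSTM step following equations (1)-(6); w and b hold the four gate parameter sets."""
    z = np.concatenate([h_prev, x_bt])
    f_bt = sigmoid(w["f"] @ z + b["f"])          # forget gate, eq. (1)
    i_bt = sigmoid(w["i"] @ z + b["i"])          # input gate, eq. (2)
    c_tilde = np.tanh(w["c"] @ z + b["c"])       # candidate update unit, eq. (3)
    c_bt = f_bt * c_prev + i_bt * c_tilde        # current update unit, eq. (4)
    o_bt = sigmoid(w["o"] @ z + b["o"])          # output gate, eq. (5)
    h_bt = o_bt * np.tanh(c_bt)                  # hidden output, eq. (6)
    return h_bt, c_bt

def bilstm_hidden(xs, w_fw, b_fw, w_bw, b_bw, hidden=64):
    """Run the sequence forward and backward and concatenate the outputs, eq. (9)."""
    h_f, c_f = np.zeros(hidden), np.zeros(hidden)
    h_b, c_b = np.zeros(hidden), np.zeros(hidden)
    fw, bw = [], []
    for x in xs:                                 # forward pass
        h_f, c_f = lstm_step(x, h_f, c_f, w_fw, b_fw)
        fw.append(h_f)
    for x in reversed(xs):                       # backward pass
        h_b, c_b = lstm_step(x, h_b, c_b, w_bw, b_bw)
        bw.append(h_b)
    bw.reverse()
    return [np.concatenate([hf, hb]) for hf, hb in zip(fw, bw)]
```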
The local temporal features F_eb, F_ey, F_no, and F_mo corresponding to the eyebrows, eyes, nose, and mouth are spliced to obtain the global temporal feature matrix F_benm:
F_benm = [F_eb; F_ey; F_no; F_mo]   (10)
The feature matrix F_benm is input into the softmax function for facial expression classification. P_T(k) denotes the predicted probability that the current subject's expression belongs to the k-th class, k ∈ K, where K is the number of facial expression classification categories and z denotes the input vector of the softmax function:
P_T(k) = exp(z_k) / Σ_{j=1}^{K} exp(z_j)   (11)
where z_j denotes the output value computed for the j-th class in the output vector and z_k denotes the output value of the class currently being computed. The loss function can be defined as:
Loss_T = -Σ_{k=1}^{K} T_k log(P_T(k))   (12)
where k ∈ K, K is the number of facial expression classification categories, and T_k denotes the expression label value of the current subject's ground truth;
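Assembling the module, a compact PyTorch sketch of the four-region BiLSTM branch with a softmax head might look as follows; the hidden size, the use of the last time step as the region feature, and the per-region input dimensions (two coordinates per landmark, with 10/12/9/22 landmarks as stated above) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    """Four BiLSTMs (eyebrows, eyes, nose, mouth) -> concatenation -> softmax, eqs. (10)-(11)."""
    def __init__(self, n_classes=7, hidden=64, region_points=(10, 12, 9, 22)):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(input_size=p * 2, hidden_size=hidden,
                    batch_first=True, bidirectional=True)
            for p in region_points
        )
        self.classifier = nn.Linear(2 * hidden * len(region_points), n_classes)

    def forward(self, regions):
        # regions: list of 4 tensors, each (batch, T, points * 2) for one face region
        feats = []
        for seq, lstm in zip(regions, self.lstms):
            out, _ = lstm(seq)               # (batch, T, 2 * hidden) local temporal features
            feats.append(out[:, -1, :])      # last time step taken as the region feature
        f_benm = torch.cat(feats, dim=1)     # global temporal feature, eq. (10)
        return torch.softmax(self.classifier(f_benm), dim=1)   # P_T, eq. (11)
```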
in step S4 of the present embodiment, a peak expression image pixel matrix having dimensions of 64 × 64 × 1 is input. Firstly, a shallow spatial feature map U with dimensions of 8 × 8 × 512 is obtained through convolution operations of 2 3 × 03 × 164, 2 3 × 3 × 128, 4 3 × 3 × 256, and 1 3 × 3 × 512 in a VGG network. Then, 512 channels of the shallow spatial feature map are assigned with weights by using SENEt and weighted, so as to obtain a weighted feature map. Finally, the weighted feature graph is used as the spatial feature F of the facial expressionsAnd sending the facial expression classification result P into softmax to calculate a second facial expression classification result PS
For the design of the spatial feature extraction module, the peak expression image S_npf* is first input into the VGG network to extract the shallow spatial feature map U_{A×B×G}, where A × B × G is the spatial feature dimension and G is the number of feature map channels. Then SENet is used to explicitly model the inter-dependencies between the feature map channels, automatically obtain the weight of each feature map channel, and assign these channel weights to obtain the spatial feature F_s of the facial expression. Finally, the spatial feature F_s is sent into softmax for facial expression classification, and the second facial expression classification result is stored as P_S = [P_S(1), P_S(2), …, P_S(K)] with 0 ≤ P_S(k) ≤ 1.
The calculation process for assigning the feature map channel weights using SENet is as follows:
(1) For the feature map U_{A×B×G}, an aggregate statistic V_g is calculated for each feature map channel g, g ∈ G, as:
V_g = (1 / (A × B)) Σ_{a=1}^{A} Σ_{b=1}^{B} u_g(a, b)   (13)
where A and B are the height and width of the two-dimensional feature map in each feature channel g, and u_g(a, b) denotes an entry of the g-th two-dimensional feature matrix in the shallow feature map U.
(2) The parameters w are trained using the information in V_g to assign weights to the feature map channels; the weight S_g of each feature map channel g is computed as:
S_g = σ(w_2 δ(w_1 V_g))   (14)
where δ denotes the ReLU activation function, w_1 ∈ R^{(G/r)×G} and w_2 ∈ R^{G×(G/r)} are the weight matrices of the two fully connected layers, and r is a hyperparameter.
(3) The channel weight S_g obtained in step (2) is multiplied with the original spatial feature map U_g to obtain the weighted feature map f_s(g):
f_s(g) = S_g U_g   (15)
(4) The spatial feature of the facial expression is then F_S = [f_s(1), f_s(2), …, f_s(G)], where g ∈ G and f_s(g) denotes the two-dimensional feature map of the g-th feature channel; F_S now represents the weighted spatial feature with dimensions A × B × G.
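Steps (1)-(4) follow the standard squeeze-and-excitation computation; a minimal PyTorch rendering, with the reduction ratio r treated as a free hyperparameter, might look like this:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel reweighting per equations (13)-(15): squeeze, excitation, scale."""
    def __init__(self, channels=512, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # w_1
        self.fc2 = nn.Linear(channels // r, channels)   # w_2

    def forward(self, u):                 # u: (batch, G, A, B) shallow feature map
        v = u.mean(dim=(2, 3))            # aggregate statistic V_g, eq. (13)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))   # channel weights S_g, eq. (14)
        return u * s.unsqueeze(-1).unsqueeze(-1)                # weighted map f_s(g), eq. (15)
```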
The spatial feature F_S is input into the softmax function for facial expression classification. P_S(k) denotes the predicted probability that the current subject's expression belongs to the k-th class, k ∈ K, where K is the number of facial expression classification categories and z denotes the input vector of the softmax function:
P_S(k) = exp(z_k) / Σ_{j=1}^{K} exp(z_j)   (16)
where z_j denotes the output value computed for the j-th class in the output vector and z_k denotes the output value of the class currently being computed. The loss function can be defined as:
Loss_S = -Σ_{k=1}^{K} T_k log(P_S(k))   (17)
where k ∈ K, K is the number of facial expression classification categories, and T_k denotes the expression label value of the current subject's ground truth;
in step S5 of the present embodiment, for the design of the decision fusion algorithm, P is selectedS(k) And PT(k) And performing weighted fusion, and calculating and outputting a final facial expression classification result. The calculation formula of the current facial expression classification result prediction (k) is as follows:
Prediction(k)=argmax(αPT(k)+(1-α)PS(k)) (18)
wherein K belongs to K, K is the number of face expression classification categories, alpha is 0.5, PT(k) Representing the first facial expression classification result, PS(k) And representing a second facial expression classification result.
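A direct NumPy rendering of formula (18), with α = 0.5 as stated above, could be the following sketch; the class probabilities in the usage example are illustrative values only.

```python
import numpy as np

def fuse_decisions(p_t, p_s, alpha=0.5):
    """Weighted decision-level fusion of the two classification results, formula (18)."""
    fused = alpha * np.asarray(p_t) + (1.0 - alpha) * np.asarray(p_s)
    return int(np.argmax(fused))        # index k of the predicted expression class

# Usage example with K = 7 expression classes.
p_t = [0.05, 0.10, 0.60, 0.05, 0.05, 0.10, 0.05]   # temporal-branch softmax output P_T
p_s = [0.10, 0.05, 0.55, 0.10, 0.05, 0.10, 0.05]   # spatial-branch softmax output P_S
print(fuse_decisions(p_t, p_s))                    # -> 2
```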
Example two
The embodiment provides a facial expression recognition system based on deep spatiotemporal network decision fusion, which comprises:
the data preprocessing module is configured to preprocess each facial expression image of the original facial expression data set, extract a facial landmark vector of each preprocessed facial expression image, and select a peak expression image;
the time sequence feature extraction module is configured to obtain the global time sequence features of the facial expressions according to the facial marker point vectors, and perform facial expression classification on the global time sequence features of the facial expressions to obtain a first facial expression classification result;
the spatial feature extraction module is configured to obtain spatial features of the facial expressions according to the selected peak expression images, and perform facial expression classification on the spatial features of the facial expressions to obtain a second facial expression classification result;
and the decision fusion module is configured to perform decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
It should be noted that the data preprocessing module, the temporal feature extraction module, the spatial feature extraction module, and the decision fusion module correspond to steps S1 to S5 in the first embodiment, and the modules are the same as the corresponding steps in the implementation example and application scenarios, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
Example three
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in a method for facial expression recognition based on deep spatiotemporal network decision fusion as described in the first embodiment above.
Example four
The embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the processor implements the steps in the method for recognizing facial expressions based on deep spatiotemporal network decision fusion as described in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A facial expression recognition method based on deep spatiotemporal network decision fusion is characterized by comprising the following steps:
preprocessing each facial expression image of an original facial expression data set, extracting facial landmark vectors of each preprocessed facial expression image, and selecting a peak expression image of each preprocessed facial expression image;
obtaining global time sequence characteristics of the facial expressions according to the facial marker point vectors, and carrying out facial expression classification on the global time sequence characteristics of the facial expressions to obtain a first facial expression classification result;
obtaining the spatial characteristics of the facial expressions according to the selected peak expression images, and carrying out facial expression classification on the spatial characteristics of the facial expressions to obtain a second facial expression classification result;
and performing decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
2. The method for facial expression recognition based on deep spatiotemporal network decision fusion as claimed in claim 1, wherein the preprocessing is performed on each facial expression image of the original facial expression dataset, specifically:
performing face clipping processing on each facial expression image of the original facial expression data set, and removing images irrelevant to facial expressions to obtain a facial expression image sequence;
carrying out gray level processing on the facial expression image sequence, and only keeping facial expression characteristics;
and performing data enhancement on the human face expression image sequence after the gray processing, and expanding the data set by 14 times by adopting a rotating and turning mode.
3. The method for recognizing facial expressions based on decision fusion of a deep space-time network as claimed in claim 1, wherein the global time series characteristics of the facial expressions are obtained according to the facial landmark vectors, and the global time series characteristics of the facial expressions are subjected to facial expression classification to obtain a first facial expression classification result, specifically:
dividing the facial landmark point vector into four sub-vectors according to eyebrows, eyes, a nose and a mouth;
inputting the four sub-vectors into four bidirectional Long Short-Term Memory networks (BiLSTMs) respectively, and respectively extracting the local temporal features of the facial expression;
fusing local time sequence characteristics of the facial expressions to obtain global time sequence characteristics of the facial expressions;
and classifying the global features of the facial expressions by using a softmax classifier to obtain a first facial expression classification result.
4. The facial expression recognition method based on the deep space-time network decision fusion as claimed in claim 1, wherein the spatial features of the facial expressions are obtained according to the selected peak expression image, and the spatial features of the facial expressions are subjected to facial expression classification to obtain a second facial expression classification result, specifically:
extracting a shallow spatial feature map of the peak expression image by using a VGG network;
assigning channel weights to the shallow spatial feature map by using SENet, and taking the weighted feature map as the spatial feature of the facial expression;
and classifying the spatial features of the facial expressions by using a softmax classifier to obtain a second facial expression classification result.
5. The method for recognizing facial expressions based on deep spatiotemporal network decision fusion as claimed in claim 3, wherein the four sub-vectors are respectively input into four bidirectional Long Short-Term Memory networks (BiLSTMs) to respectively extract the local temporal features of the facial expression, specifically comprising:
inputting the four sub-vectors into a BiLSTM respectively, and obtaining the local temporal features of the eyebrows, eyes, nose, and mouth, corresponding to F_eb, F_ey, F_no, and F_mo, at the output layer of each BiLSTM;
taking the local temporal feature F_eb of the eyebrows as an example, the BiLSTM hidden-layer computation for the t-th input image is as follows:
f_bt = σ[w_bf(h_{bt-1}, x_bt) + b_bf]   (1)
f_bt is the forget gate, which determines through a sigmoid activation function how much information from the previous state needs to be discarded; w_bf is the weight of the forget gate, b_bf is the bias of the forget gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
i_bt = σ[w_bi(h_{bt-1}, x_bt) + b_bi]   (2)
i_bt is the input gate, which determines the information that the node needs to retain at the current time, where σ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, w_bi is the weight of the input gate, b_bi is the bias of the input gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
c̃_bt = tanh[w_bc̃(h_{bt-1}, x_bt) + b_bc̃]   (3)
c̃_bt is the current candidate update unit, which contains all the update information of the current time node; how much of this information is retained is determined by the current update unit c_bt, where w_bc̃ is the weight of the candidate update unit, b_bc̃ is the bias of the candidate update unit, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input;
c_bt = f_bt ⊙ c_{bt-1} + i_bt ⊙ c̃_bt   (4)
o_bt = σ[w_bo(h_{bt-1}, x_bt) + b_bo]   (5)
h→_bt = o_bt ⊙ tanh(c_bt)   (6)
c_bt is the current update unit, which takes the available information of the candidate update unit and, through the forget gate f_bt, the information c_{bt-1} of the previous image, and uses the sigmoid activation function to determine the output of the current update unit; o_bt denotes the output gate, and the output information it controls is multiplied with c_bt after tanh processing to obtain the forward LSTM hidden-layer output h→_bt of the t-th input image, where w_bo is the weight of the output gate, b_bo is the bias of the output gate, x_bt is the input vector, and h_{bt-1} denotes the BiLSTM hidden-layer output with the (t-1)-th image as input; the backward LSTM hidden-layer output h←_bt is computed in the same way as the forward direction;
merging the forward and backward BiLSTM hidden-layer outputs h→_bt and h←_bt gives the BiLSTM hidden-layer output h_bt covering both forward and backward information, which is taken as the local temporal feature F_eb:
h→_bt = LSTM(x_bt, h→_{bt-1})   (7)
h←_bt = LSTM(x_bt, h←_{bt-1})   (8)
h_bt = [h→_bt ; h←_bt]   (9)
6. The facial expression recognition method based on deep spatiotemporal network decision fusion as claimed in claim 4, characterized in that the SENet is used to assign channel weights to the shallow space feature map, and the specific process is as follows:
Step (1): for the feature map U_{A×B×G}, an aggregate statistic V_g is calculated for each feature map channel g, g ∈ G, as:
V_g = (1 / (A × B)) Σ_{a=1}^{A} Σ_{b=1}^{B} u_g(a, b)   (13)
where A and B are the height and width of the two-dimensional feature map in each feature channel g, and u_g(a, b) denotes an entry of the g-th two-dimensional feature matrix in the shallow feature map U;
Step (2): the parameters w are trained using the information in V_g to assign weights to the feature map channels; the weight S_g of each feature map channel g is computed as:
S_g = σ(w_2 δ(w_1 V_g))   (14)
where δ denotes the ReLU activation function, w_1 ∈ R^{(G/r)×G} and w_2 ∈ R^{G×(G/r)} are the weight matrices of the two fully connected layers, and r is a hyperparameter;
Step (3): the channel weight S_g obtained in step (2) is multiplied with the original spatial feature map U_g to obtain the weighted feature map f_s(g):
f_s(g) = S_g U_g   (15);
Step (4): the spatial feature of the facial expression is F_S = [f_s(1), f_s(2), …, f_s(G)],
where g ∈ G and f_s(g) denotes the two-dimensional feature map of the g-th feature channel;
F_S then represents the weighted spatial feature with dimensions A × B × G.
7. The method for facial expression recognition based on decision fusion of deep spatiotemporal network as claimed in claim 1, wherein the calculation formula for decision-level fusion of the first facial expression classification result and the second facial expression classification result is:
Prediction(k) = argmax(α P_T(k) + (1 − α) P_S(k))   (18).
8. a facial expression recognition system based on deep spatiotemporal network decision fusion is characterized by comprising:
the data preprocessing module is configured to preprocess each facial expression image of the original facial expression data set, extract a facial landmark vector of each preprocessed facial expression image, and select a peak expression image;
the time sequence feature extraction module is configured to obtain the global time sequence features of the facial expressions according to the facial marker point vectors, and perform facial expression classification on the global time sequence features of the facial expressions to obtain a first facial expression classification result;
the spatial feature extraction module is configured to obtain spatial features of the facial expressions according to the selected peak expression images, and perform facial expression classification on the spatial features of the facial expressions to obtain a second facial expression classification result;
and the decision fusion module is configured to perform decision-level fusion on the first facial expression classification result and the second facial expression classification result to obtain a final facial expression classification result.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of a method for facial expression recognition based on deep spatiotemporal network decision fusion according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of a method for facial expression recognition based on deep spatiotemporal network decision fusion as claimed in any one of claims 1 to 7.
CN202111136083.3A 2021-09-27 2021-09-27 Facial expression recognition method and system based on deep spatiotemporal network decision fusion Pending CN113903063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111136083.3A CN113903063A (en) 2021-09-27 2021-09-27 Facial expression recognition method and system based on deep spatiotemporal network decision fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111136083.3A CN113903063A (en) 2021-09-27 2021-09-27 Facial expression recognition method and system based on deep spatiotemporal network decision fusion

Publications (1)

Publication Number Publication Date
CN113903063A true CN113903063A (en) 2022-01-07

Family

ID=79029812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111136083.3A Pending CN113903063A (en) 2021-09-27 2021-09-27 Facial expression recognition method and system based on deep spatiotemporal network decision fusion

Country Status (1)

Country Link
CN (1) CN113903063A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275070A (en) * 2023-10-11 2023-12-22 中邮消费金融有限公司 Video facial mask processing method and system based on micro-expressions


Similar Documents

Publication Publication Date Title
CN109919830B (en) Method for restoring image with reference eye based on aesthetic evaluation
Lin Face detection in complicated backgrounds and different illumination conditions by using YCbCr color space and neural network
Kae et al. Augmenting CRFs with Boltzmann machine shape priors for image labeling
Miksik et al. Efficient temporal consistency for streaming video scene analysis
US20230081982A1 (en) Image processing method and apparatus, computer device, storage medium, and computer program product
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN112800903A (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Sadeghi et al. HistNet: Histogram-based convolutional neural network with Chi-squared deep metric learning for facial expression recognition
CN111158491A (en) Gesture recognition man-machine interaction method applied to vehicle-mounted HUD
CN111476806A (en) Image processing method, image processing device, computer equipment and storage medium
Güçlü et al. End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
Minematsu et al. Simple background subtraction constraint for weakly supervised background subtraction network
Guo et al. Attribute-controlled face photo synthesis from simple line drawing
Reddi et al. CNN Implementing Transfer Learning for Facial Emotion Recognition
CN113903063A (en) Facial expression recognition method and system based on deep spatiotemporal network decision fusion
CN113449550A (en) Human body weight recognition data processing method, human body weight recognition method and device
Kakkar Facial expression recognition with LDPP & LTP using deep belief network
Chen Evaluation technology of classroom students’ learning state based on deep learning
Zhong et al. Unsupervised self-attention lightweight photo-to-sketch synthesis with feature maps
Zhao et al. Affective video classification based on spatio-temporal feature fusion
Sreemathy et al. Indian sign language interpretation using convolutional neural networks
Kartbayev et al. Development of a computer system for identity authentication using artificial neural networks
Nan et al. 3D RES-inception network transfer learning for multiple label crowd behavior recognition
Álvarez et al. Exploiting large image sets for road scene parsing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination