CN114360026A - Natural occlusion expression recognition method and system with accurate attention - Google Patents

Natural occlusion expression recognition method and system with accurate attention

Info

Publication number
CN114360026A
Authority
CN
China
Prior art keywords
attention
occlusion
expression
image
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210025377.7A
Other languages
Chinese (zh)
Inventor
马昕 (Ma Xin)
姜美娟 (Jiang Meijuan)
李贻斌 (Li Yibin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202210025377.7A priority Critical patent/CN114360026A/en
Publication of CN114360026A publication Critical patent/CN114360026A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a natural occlusion expression recognition method and system with accurate attention. An image to be recognized is acquired and preprocessed, and the preprocessed image is processed by a pre-trained natural occlusion expression recognition network to obtain an expression recognition result. To construct and train the network, face key point detection is performed on preprocessed occlusion expression images with known expressions, a number of interest points are screened from the key points, Gaussian maps are generated from the interest points, and an occlusion indication map of each image is obtained. An attention descriptor is then computed from the neuron activation values of the deep features of the occlusion expression image; an attention loss forces the attention descriptor of the deep features toward the occlusion indication map, and an expression classification loss forces the attention of the deep features to adapt to different expressions. The invention achieves high recognition accuracy.

Description

Natural occlusion expression recognition method and system with accurate attention
Technical Field
The invention belongs to the technical field of expression recognition, and particularly relates to a natural occlusion expression recognition method and system with accurate attention.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In recent years, facial expression recognition has received wide attention and achieved great success, since it plays an important role in numerous applications such as affective computing, behavior prediction, human-computer interaction, and mental health services. When expressions are recognized in the real world, faces are frequently occluded, for example by facial accessories such as sunglasses or scarves, or by random objects such as hands, hair, and cups. Existing expression recognition methods perform well on ideal, non-occluded facial expressions, but when they are extended to natural occlusion conditions, the recognition accuracy drops sharply. It is therefore necessary to specifically study the recognition of facial expressions under real occlusion conditions.
Recognizing occluded facial expressions is a challenging problem, because facial occlusions remove part of the expression information from the original image and can even mislead expression recognition. Recently, several approaches have been proposed to address this challenge, which can be roughly grouped into four categories: occlusion image completion methods, methods that use non-occluded images as privileged information, methods that attend to non-occluded regions, and feature reconstruction methods.
Occlusion image completion methods reconstruct the face image with a deep generative model. Some studies use a pixel-level generative model as the bottom-most layer of a Deep Belief Network (DBN); the DBN reconstructs a complete face from the occluded face and then predicts the expression class from the completed face. Other studies build one generator and two discriminators based on the WGAN model and complete the occluded face image using reconstruction, triplet, and adversarial losses. However, because occlusions vary widely in position and type, these methods cannot reconstruct the face image well.
A non-occluded image provides more information for expression recognition than an occluded one, so non-occluded images can be used to assist learning of an occluded facial expression classifier. Some studies combine a triplet loss with knowledge distillation to transfer privileged information from a teacher network trained on non-occluded faces to a student network trained on occluded faces. Other studies introduce a non-occluded face image paired with each occluded face image and use it as privileged information to guide the learning of the occlusion classifier in both the label space and the feature space, helping the classifier learn more robust feature representations and make better predictions. However, such methods require paired occluded/non-occluded images and are therefore only applicable to synthetic occlusion datasets, not to real occluded facial expression recognition.
Occluders interfere with expression classification, and this interference can be reduced by making the expression classification network attend to non-occluded regions. Some studies divide the face into blocks, predict the occlusion probability of each block with a region gating unit, and then reweight the features with the learned weights; other studies propose a region attention network that adjusts the importance of facial regions with a self-attention module and a relation-attention module to alleviate occlusion and related problems. However, the non-occlusion scores and weights in these methods are learned without any occlusion information and may be biased, making it difficult for the network to accurately locate the non-occluded face regions. Some studies propose an occlusion-adaptive deep network in which the global features are adjusted by 24 attention maps based on facial landmark points, directing the model to focus on important non-occluded facial regions and filter out occluded ones. However, this approach is not end-to-end, and its attention maps are determined solely by the occlusion locations, are not supervised by the expression label, and cannot adapt to the specific expression.
Feature reconstruction methods use a detection algorithm to locate the occluded region so that it can be reconstructed. Some studies propose a robust method that extracts a set of Gabor-based partial face templates from gallery images with a Monte Carlo method and converts the templates into template-matching distance features. Other studies reconstruct the occluded face region with RPCA and extract census transform histogram features, then apply k-nearest-neighbor and support vector machine classifiers. However, these methods generalize poorly.
Among these many explorations of recognizing occluded facial expressions in natural scenes, the methods that attend to non-occluded regions are notably effective. In such methods, obtaining accurate attention is the focus and the difficulty of the research. In current work of this kind, either the network learns attention weights without any occlusion information, so the learned weights are not accurate enough, or occlusion information is used to directly weight the deep features, so the attention consists only of occlusion information and is not suited to the facial expression recognition task, and is likewise not accurate enough. For example, an attention map learned without any occlusion information may still cover partially occluded regions. As another example, the key expression information of an occluded disgust expression image lies in the nose and mouth regions, while the key information of an occluded surprise expression image lies in the eye and mouth regions; occlusion information can only indicate which regions are occluded, so attention built solely from it treats all non-occluded key regions such as the eyes, nose, and mouth equally and is not tailored to a specific expression.
Disclosure of Invention
In order to solve the problems, the invention provides a natural occlusion expression recognition method and system with accurate attention, and the recognition accuracy is high.
According to some embodiments, the invention adopts the following technical scheme:
a natural occlusion expression recognition method with accurate attention comprises the following steps:
acquiring an image to be identified, and preprocessing the image;
processing the preprocessed image to be recognized by utilizing a pre-trained natural occlusion expression recognition network to obtain an expression recognition result;
the construction and training process of the natural occlusion expression recognition network comprises the steps of carrying out face key point detection on an occlusion expression image with a known expression after preprocessing, screening a plurality of interest points from the key points, generating a Gaussian map based on the interest points, and obtaining an occlusion indication map of a corresponding image;
and calculating an attention descriptor according to the neuron activation value of the depth feature of the occlusion expression image, forcing the attention descriptor of the depth feature to be close to the occlusion indication map by constructing attention loss, and forcing the attention of the depth feature to be suitable for different expressions by constructing expression classification loss.
As an alternative embodiment, the preprocessing process includes detecting facial landmark points of an original image by using a face detection model, obtaining an aligned face image after similarity transformation, and adjusting the aligned face image to a specified pixel size.
As an alternative embodiment, the specific process of performing face keypoint detection includes performing preliminary keypoint detection of each face, obtaining coordinates and confidence scores thereof, selecting, as final keypoints, points covering the regions around the eyes, eyebrows, nose, and mouth, and points covering the regions between the cheeks and eyes and eyebrows of the face, among the preliminary keypoints.
As an alternative embodiment, the specific process of screening a plurality of interest points from the key points to obtain an occlusion indication map of a corresponding image includes setting a confidence score threshold, removing key points with confidence scores smaller than the threshold to obtain interest points, and generating a gaussian distribution map with coordinates of each interest point as a center, where, for those occluded points, the value of the gaussian distribution map is zero everywhere.
As an alternative embodiment, the generation process of the occlusion indication map comprises: computing a statistic of the Gaussian distribution maps at each spatial position to obtain the occlusion indication map of an input expression image:

A^T(x_i) = \sum_{m=1}^{M} |G_m(x_i)|

where G_m(x_i) refers to the Gaussian distribution map generated for the m-th interest point of the input image x_i, M represents the number of interest points, and |·| refers to the pixel-level absolute value.
As an alternative embodiment, the specific process of calculating the attention descriptor is as follows:

A^S(x_i) = \sum_{p=1}^{C_k} |F_p^k(x_i)|

where F_p^k(x_i) refers to the p-th feature map of the k-th layer activation for the input image x_i, C_k represents the number of feature maps in the k-th layer of the deep network, and |·| refers to the pixel-level absolute value.
As an alternative embodiment, the attention loss function is:

L_Q = \sum_i \left\| \frac{Q^S(x_i)}{\|Q^S(x_i)\|_2} - \frac{Q^T(x_i)}{\|Q^T(x_i)\|_2} \right\|_2

where Q^S(x_i) = vec(A^S(x_i)) is the vectorized form of the attention descriptor based on neuron activation values and Q^T(x_i) = vec(resize(A^T(x_i))) is the vectorized form of the interest-point-based occlusion indication map; before vectorization, the occlusion indication map is first resized to H × W by linear interpolation to match the size of the attention descriptor extracted by the network.

The expression classification loss is:

L_E = -\sum_{c=1}^{N} y_{i,c} \log \hat{y}_{i,c}

where L_E is the expression classification loss, \hat{y}_i is the expression prediction, y_i is the expression label value, and N is the number of expression categories.
A natural occlusion expression recognition system with accurate attention, comprising:
the image preprocessing module is configured to acquire an image to be identified and preprocess the image;
the expression recognition module is configured to process the preprocessed image to be recognized by utilizing a pre-trained natural occlusion expression recognition network to obtain an expression recognition result;
the expression recognition module further comprises:
the occlusion indication image generation module is configured to detect face key points of an occlusion expression image with a known expression, screen a plurality of interest points from the key points, generate a Gaussian image based on each interest point, and obtain an occlusion indication image of a corresponding image;
an attention descriptor extraction module configured to calculate an attention descriptor from a neuron activation value of a depth feature that occludes the expression image;
an attention guidance module configured to force an attention descriptor of the depth feature to approach the occlusion indication map by constructing an attention loss, the attention of the depth feature being adapted to different expressions by constructing an expression classification loss.
An electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the steps of the above method.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the above method.
Compared with the prior art, the invention has the beneficial effects that:
the invention introduces the shielding information into the network to guide the network to learn the expression characteristics with accurate attention, and represents the attention degree of the convolutional neural network to a specific space position by calculating the attention descriptor based on the neuron activation, thereby providing a channel for the introduction of the subsequent shielding information; by generating an occlusion indication map based on the points of interest, occluded face regions and non-occluded key regions are indicated.
The invention further proposes attention guidance based on the two spatial attention maps, namely forcing the attention descriptor of the deep features toward the occlusion indication map by constructing an attention loss, while forcing the attention of the deep features to adapt to different expressions by constructing an expression classification loss. As a result, the deep features learned by the network have accurate attention suited to the facial expression classification task: the neuron activation values of non-occluded regions that benefit the recognition of a specific expression are large, while those of regions that do not contribute to the specific expression are small, yielding good accuracy and efficiency.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a propagation path of a natural occlusion expression recognition network with accurate attention according to the present embodiment;
FIG. 2 is a schematic drawing of an extracted attention descriptor of the present embodiment;
FIG. 3 is a schematic diagram of 68 face key points directly detected by SAN and 24 face key points further obtained from the 68 key points in this embodiment;
FIG. 4 is a schematic diagram of a point of interest according to this embodiment;
FIGS. 5(a) - (c) are confusion matrices for RAF-DB, AffectNet and FED-RO data sets of the present embodiment;
fig. 6 is a comparison of activation maps of the models Conv5_ x, the first row being a baseline and the second row being the method proposed in the present embodiment;
FIG. 7 is a graph of the effect of the confidence threshold T on the accuracy of expression recognition;
fig. 8 is a comparison of activation maps for Conv4_ x and Conv5_ x for natural occlusion expression recognition networks with accurate attention at different confidence score thresholds T.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The invention realizes the identification of the occlusion expression in the natural scene by using the natural occlusion expression identification network with accurate attention.
The method specifically comprises the following steps:
acquiring an image to be identified, and preprocessing the image;
and processing the preprocessed image to be recognized by utilizing the pre-trained natural occlusion expression recognition network to obtain an expression recognition result.
As shown in fig. 1, the solid line is the forward propagation path of the training phase or the testing phase, and the dotted line is the backward propagation path of the training phase.
The respective sections are specifically described below.
Firstly, generating an occlusion indication map based on interest points:
for an input expression image, the embodiment proposes to generate an occlusion indication map for the input expression image based on the facial interest point, wherein the occlusion indication map indicates an occluded facial region and an unoccluded key region. The occlusion indication graph leads occlusion information into an expression recognition network by guiding an attention descriptor of the depth feature in a training stage, so that the network-learned depth feature has more accurate attention.
In this embodiment, the occlusion indication map is generated mainly by an occlusion indication map generation module, and the generation of the occlusion indication map includes the following 6 steps:
(1) Image preprocessing. Five facial landmark points of the original image are detected using a standard multi-task convolutional neural network (MTCNN); after a similarity transformation, an aligned face image is obtained and resized to 224×224 pixels.
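A minimal sketch of this preprocessing step is given below, assuming the five landmarks have already been detected (e.g., by MTCNN); the reference landmark template and function names are illustrative assumptions, not the patent's implementation.

```python
# Illustrative sketch: align a face to 224x224 from five detected landmarks
# (left eye, right eye, nose tip, mouth corners) via a similarity transform.
import cv2
import numpy as np

# Hypothetical reference landmark positions inside a 224x224 crop.
REFERENCE_5PTS = np.array([[77.0,  92.0], [147.0,  92.0],
                           [112.0, 127.0], [84.0, 161.0], [140.0, 161.0]],
                          dtype=np.float32)

def align_face(image_bgr, landmarks_5pts, out_size=224):
    """Estimate a similarity transform from the detected landmarks to the
    reference template and warp the image to out_size x out_size."""
    src = np.asarray(landmarks_5pts, dtype=np.float32)
    # Partial affine = rotation + uniform scale + translation (a similarity transform).
    matrix, _ = cv2.estimateAffinePartial2D(src, REFERENCE_5PTS)
    return cv2.warpAffine(image_bgr, matrix, (out_size, out_size))
```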
(2) Detecting 68 keypoints. The preprocessed expression image is fed into a SAN network trained on the 300-W dataset to obtain the coordinates of its 68 facial key points and their confidence scores.
(3) 24 keypoints were calculated. Based on the directly detected 68 face keypoints and their confidence scores, 24 face keypoints and their confidence scores are calculated. Specifically, 16 points were selected from the 68 keypoints to cover the area around the eyes, eyebrows, nose, and mouth. In addition, 8 new points were further calculated based on the 68 keypoints to cover the cheek and the region between the eyes and the eyebrows. The confidence scores of these recalculated points are the minimum confidence scores for the points used to calculate them.
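The derivation of the 24 points can be sketched as follows. The landmark indices and midpoint pairs below are hypothetical placeholders, since the patent does not enumerate them, but the confidence rule (each new point inherits the minimum confidence of the points used to compute it) follows the description above.

```python
# Sketch: derive 24 points (16 reused + 8 newly computed) from 68 landmarks.
import numpy as np

# Hypothetical: 16 of the 68 landmark indices around eyes, eyebrows, nose, mouth.
SELECTED_16 = [17, 21, 22, 26, 36, 39, 42, 45, 30, 31, 35, 48, 51, 54, 57, 62]
# Hypothetical: 8 new points as midpoints of landmark pairs on the cheeks and
# between the eyes and eyebrows.
MIDPOINT_PAIRS = [(1, 31), (15, 35), (3, 48), (13, 54),
                  (19, 37), (24, 44), (20, 38), (23, 43)]

def derive_24_points(pts68, scores68):
    """pts68: (68, 2) landmark coordinates; scores68: (68,) confidence scores."""
    pts, scores = [], []
    for i in SELECTED_16:                        # directly reused landmarks
        pts.append(pts68[i]); scores.append(scores68[i])
    for i, j in MIDPOINT_PAIRS:                  # newly computed points
        pts.append((pts68[i] + pts68[j]) / 2.0)
        scores.append(min(scores68[i], scores68[j]))  # minimum score of source points
    return np.array(pts), np.array(scores)
```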
(4) Obtaining the interest points. The confidence score of a point reflects, to some extent, the degree of occlusion at that position: the lower the score, the more severe the occlusion. A suitable threshold is set manually, and key points with confidence scores below the threshold are removed to obtain the interest points. Specifically, the interest points are obtained by the following rule:

p_m = \begin{cases} (x_m, y_m), & s_m \geq T \\ \text{discarded (treated as occluded)}, & s_m < T \end{cases}

where p_m denotes the m-th interest point, (x_m, y_m) are its coordinates, s_m is its confidence score ranging from 0 to 1, and T is the threshold.
In this embodiment, the threshold T is set to 0.2, thereby removing keypoints with confidence scores less than 0.2. The resulting points of interest are shown in FIG. 4, and it can be seen that the key points of the occlusion region are removed.
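A minimal sketch of this screening rule, assuming the 24 points and scores are NumPy arrays:

```python
# Sketch: keep a keypoint only if its confidence reaches the threshold T (0.2 here).
import numpy as np

def select_interest_points(points24, scores24, threshold=0.2):
    """points24: (24, 2) coordinates; scores24: (24,) confidence scores.
    Returns the retained interest points and a mask of discarded (occluded) slots."""
    keep = scores24 >= threshold
    return points24[keep], ~keep
```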
(5) 24 gaussians were generated. A gaussian distribution map centered on the point of interest coordinates is generated, for a total of 24. For those occluded points, the value of the Gaussian distribution map is everywhere zero.
(6) Synthesizing the occlusion indication map. A statistic of the 24 Gaussian distribution maps at each spatial position is computed to obtain the occlusion indication map of the input expression image:

A^T(x_i) = \sum_{m=1}^{M} |G_m(x_i)|

where G_m(x_i) refers to the Gaussian distribution map generated for the m-th interest point of the input image x_i, M represents the number of interest points (here 24), and |·| refers to the pixel-level absolute value. The result satisfies A^T(x_i) ∈ R^{224×224}.
It should be noted that the occlusion indication map is generated and fixed for subsequent guidance of the attention descriptor.
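Steps (4)–(6) can be sketched together as follows. The Gaussian standard deviation sigma is a hypothetical value, since the patent does not specify it; the aggregation follows the summation form given above.

```python
# Sketch: one Gaussian map per interest point (all zeros for occluded points),
# then a pixel-wise aggregation into the 224x224 occlusion indication map A_T.
import numpy as np

def gaussian_map(center_xy, size=224, sigma=8.0):
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float32)
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def occlusion_indication_map(points24, scores24, threshold=0.2, size=224):
    maps = []
    for (x, y), s in zip(points24, scores24):
        if s < threshold:                      # occluded point: map is zero everywhere
            maps.append(np.zeros((size, size), dtype=np.float32))
        else:
            maps.append(gaussian_map((x, y), size))
    # Aggregate the 24 maps at each spatial position (pixel-level absolute values).
    return np.abs(np.stack(maps)).sum(axis=0)  # A_T(x_i) in R^{224x224}
```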
The above steps generate, for an occluded face image, an occlusion indication map that contains the occlusion information and indicates the occluded facial regions and the non-occluded key regions.
The embodiment also provides an attention descriptor extraction module, and the extracted attention descriptor can provide a path for introducing occlusion information in the next step.
Recent studies have shown that, for the activation tensor F(x_i) of a CNN layer, the absolute value of a hidden neuron's activation indicates the importance of that neuron for a particular input; an attention descriptor of the activation tensor can therefore be constructed by aggregating these values over the channel dimension, as shown in FIG. 2.
The attention descriptor based on neuron activation values is calculated according to the following formula:

A^S(x_i) = \sum_{p=1}^{C_k} |F_p^k(x_i)|

where F_p^k(x_i) refers to the p-th feature map of the k-th layer activation for the input image x_i, C_k represents the number of feature maps in the k-th layer of the deep network, and |·| refers to the pixel-level absolute value. The resulting attention descriptor satisfies A^S(x_i) ∈ R^{H×W}.
In the training stage, the extracted depth features of the occlusion expression images are input into an attention descriptor extraction module, an attention descriptor is calculated according to the neuron activation values of the depth features, the attention descriptor indicates an area where the depth features focus on, and a channel is provided for introducing subsequent occlusion information. Our goal is to make the attention descriptor of the depth feature more accurate, i.e. to make the depth feature have more accurate attention, by training.
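A minimal sketch of this computation in PyTorch (the framework named in the experiments below), assuming the guided layer's activation is available as a 4-D tensor:

```python
# Sketch: sum absolute activation values over the channel dimension,
# giving one H x W spatial attention descriptor per image.
import torch

def attention_descriptor(features: torch.Tensor) -> torch.Tensor:
    """features: (B, C_k, H, W) activation tensor of a CNN layer.
    Returns A_S of shape (B, H, W)."""
    return features.abs().sum(dim=1)
```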
The above process yields two spatial attention maps: the occlusion indication map based on the interest points, which indicates the occluded facial regions and the non-occluded key regions, and the attention descriptor based on neuron activation, which represents the attention the convolutional neural network pays to each spatial position.
Based on these two spatial attention maps, the present embodiment also proposes attention guidance, i.e. to force the attention descriptor close to the occlusion indication map by constructing a loss of attention, so that depth features taking into account occlusion information will have more accurate attention. Specifically, the attention loss function is as follows:
L_Q = \sum_i \left\| \frac{Q^S(x_i)}{\|Q^S(x_i)\|_2} - \frac{Q^T(x_i)}{\|Q^T(x_i)\|_2} \right\|_2

where Q^S(x_i) = vec(A^S(x_i)) is the vectorized form of the attention descriptor based on neuron activation values and Q^T(x_i) = vec(resize(A^T(x_i))) is the vectorized form of the interest-point-based occlusion indication map.
Before vectorization, the size of the occlusion indication map is first adjusted to H × W by linear interpolation to match the size of the attention descriptor extracted by the network. It is worth emphasizing that the normalization of the occlusion indication map is crucial for the success of the training.
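A sketch of this attention loss under the definitions above; the L2 normalization of both vectorized maps is an assumption consistent with the emphasis on normalization, and bilinear interpolation stands in for the linear interpolation mentioned above.

```python
# Sketch: resize the indication map to H x W, vectorize and L2-normalize both
# maps, then take the L2 distance between them.
import torch
import torch.nn.functional as F

def attention_loss(a_s: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
    """a_s: (B, H, W) attention descriptor; a_t: (B, 224, 224) indication map."""
    h, w = a_s.shape[-2:]
    a_t = F.interpolate(a_t.unsqueeze(1), size=(h, w),
                        mode='bilinear', align_corners=False).squeeze(1)
    q_s = F.normalize(a_s.flatten(1), dim=1)   # vec(A_S) / ||vec(A_S)||_2
    q_t = F.normalize(a_t.flatten(1), dim=1)   # vec(resize(A_T)) / ||vec(A_T)||_2
    return (q_s - q_t).norm(p=2, dim=1).mean()
```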
The depth features used to extract the attention descriptors will also be used for expression classification, and the cross entropy loss function is used to compute the expression classification loss:
L_E = -\sum_{c=1}^{N} y_{i,c} \log \hat{y}_{i,c}

where L_E is the expression classification loss, \hat{y}_i is the expression prediction, y_i is the expression label value, and N is the number of expression categories.
The overall loss function is constructed by combining the expression classification loss L_E and the attention loss L_Q. Back-propagating both losses lets the occlusion indication map guide the attention descriptor of the deep features while the expression labels supervise the deep features, so that the deep features learned by the network have accurate attention suited to the occluded expression classification task: the neuron activation values of non-occluded regions that benefit the recognition of a specific expression are large, while those of occluded regions or regions that do not contribute to the specific expression are small. The complete objective function of the natural occlusion expression recognition network with accurate attention is:

L = \lambda_1 L_E + \lambda_2 L_Q

where \lambda_1 and \lambda_2 are hyperparameters that weight the losses L_E and L_Q defined above.
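A sketch of one training step under this joint objective; the backbone interface (returning the logits together with the guided layer's activation) and the lambda values are assumptions for illustration, and the helper functions are the sketches given above.

```python
# Sketch: one optimization step with L = lambda1 * L_E + lambda2 * L_Q.
import torch.nn.functional as F

def training_step(backbone, images, labels, indication_maps,
                  optimizer, lambda1=1.0, lambda2=1.0):
    # Assumed interface: backbone returns logits and the guided layer's activation.
    logits, guided_features = backbone(images)          # (B, N) and (B, C_k, H, W)
    loss_e = F.cross_entropy(logits, labels)            # expression classification loss
    a_s = attention_descriptor(guided_features)         # sketch above
    loss_q = attention_loss(a_s, indication_maps)       # sketch above
    loss = lambda1 * loss_e + lambda2 * loss_q
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```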
The parameters and specific settings described above may be adjusted or replaced.
Comparative tests were carried out as follows.
Expression data sets for experiments are introduced, and implementation of the experiments is described in detail. Then, the method provided by the embodiment is compared with the latest method, and experiments are also made on the recognition of different expressions to verify the effect of the method of the embodiment. Finally, an attention map visualization is carried out to show the superiority of the attention map obtained by the method of the embodiment, and an ablation experiment is also constructed to study the influence of different thresholds on the experiment effect.
First, the experimental data set includes:
the real facial expression data sets comprise facial expressions in the real world under various occlusion, posture, illumination intensity and other uncontrollable conditions, and the effectiveness of the method provided by the embodiment is verified on two largest real facial expression data sets RAF-DB and AffectNet. The method provided by the present embodiment is also evaluated in three real-world occlusion datasets: occlusion-AffectNet, occlusion-FERplus, and FED-RO, the color, shape, location, and occlusion ratio of the occlusions of these real-world occlusion datasets varying.
RAF-DB contains about 30,000 real facial expression images, annotated with basic or compound expressions by 40 independent annotators. In this experiment, only images with the seven basic expressions were used, comprising 12,271 samples for training and 3,068 samples for testing.
AffectNet is currently the largest real facial expression dataset, with approximately 400,000 images manually annotated with expression categories as well as valence and arousal values. Following the experimental setup in the literature, only images of neutral and the six basic emotions were used, comprising about 280,000 images for training and 3,500 images from the validation set for testing, since the test set is not publicly available. Wang et al. published the Occlusion-AffectNet and Pose-AffectNet datasets, selecting only images with challenging conditions as test sets: for Occlusion-AffectNet, each image is occluded by at least one occluder, such as a face mask or glasses, for a total of 682 images; for Pose-AffectNet, images with pose angles greater than 30° and 45° were collected, numbering 1,949 and 985 images respectively.
FED-RO is a facial expression dataset containing real-world occlusions. It contains 400 expression images, each labeled as one of the 7 basic expressions, and every image has natural occlusions such as sunglasses, medical masks, hands, and hair. The network model provided in this embodiment was trained on the joint training data of RAF-DB and AffectNet and then tested on FED-RO.
FERPlus is a real facial expression dataset that relabels the data introduced in the ICML 2013 challenge. It consists of 28,709 training images, 3,589 validation images, and 3,589 test images, each labeled with one of 8 expressions by 10 independent annotators. More recently, Wang et al. collected images from the FERPlus test set under occlusion and large pose (>30° and >45°) and published the Occlusion-FERplus and Pose-FERplus datasets. Occlusion-FERplus contains 605 images in total, while Pose-FERplus contains 1,171 images with pose greater than 30° and 634 images with pose greater than 45°. The network model provided by this embodiment was trained on the FERPlus training data and tested on these challenging datasets.
In the experiments, this embodiment used ResNet-50 as the backbone of the network, initializing the model with weights pre-trained on ImageNet. The mini-batch size was set to 32, the momentum to 0.9, and the weight decay to 0.0005. The initial learning rate was 0.01, and from the 20th epoch the learning rate dropped to 0.1 times its previous value every 10 epochs; the model was trained for 80 epochs in total. The optimization algorithm was stochastic gradient descent (SGD). During training, only random flipping was used for data augmentation. The setup was identical for all experiments, and the deep learning framework used was PyTorch.
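The reported optimization settings could be configured roughly as follows in PyTorch; the MultiStepLR milestones are one way to express the stated schedule (decay by 10× every 10 epochs from epoch 20), and the model placeholder is hypothetical.

```python
# Sketch: SGD with momentum 0.9, weight decay 0.0005, lr 0.01, decayed by 0.1
# at epochs 20, 30, ..., 70, for 80 epochs in total.
import torch

model = torch.nn.Linear(2048, 7)   # placeholder for the ResNet-50-based network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 30, 40, 50, 60, 70], gamma=0.1)

for epoch in range(80):
    # ... run training_step over mini-batches of size 32 ...
    scheduler.step()
```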
The real facial expression datasets RAF-DB and AffectNet contain facial expressions captured in the real world with various occlusions, poses, illumination intensities, and other uncontrolled conditions. Since occlusion can to some extent be caused by occluders, large poses, illumination, and so on, the validity of the proposed method is first verified on these real facial expression datasets. The proposed method was compared with the recent PG-CNN, gACNN, RAN, and OADN methods, and also with a ResNet-50 baseline pre-trained on ImageNet; Table 1 reports the comparison. The proposed method has clear advantages over the baseline, with gains of about 12% and 9% on RAF-DB and AffectNet, respectively, which shows that fine-tuning attention with the proposed method significantly improves performance when a convolutional neural network is used for facial expression recognition. Table 1 also shows that the proposed network outperforms all other models on both RAF-DB and AffectNet, exceeding the second-best OADN by 2% and 3% respectively, which again underscores that the proposed method with accurate attention is effective.
The proposed model is also compared with the state of the art methods on real world occlusion datasets in this section to verify the robustness of the method of the present embodiment to occlusions. Specifically, the proposed model was compared with PG-CNN, gACNN, RAN, OADN methods on a real Occlusion dataset FED-RO, and with RAN and OADN methods on Occlusion-AffectNet, Pose-AffectNet (>30 ° and >45 °), Occlusion-FERplus, and Pose-FERplus (>30 ° and >45 °).
Table 2 shows the comparison results. The method of this embodiment performs well on real-world occlusion datasets. On FED-RO, it reaches an accuracy of 74.28%, more than 6% higher than the second-best method, OADN. On the challenging subtest sets of AffectNet and FERPlus, the method of this embodiment is about 1% to 6% higher than RAN and OADN under occlusion, and about 1% to 9% higher than RAN and OADN under pose variation.
Table 1: Performance comparison with state-of-the-art methods on real facial expression datasets (table reproduced only as an image in the original document).

Table 2: Performance comparison with state-of-the-art methods on real-world occlusion datasets (table reproduced only as an image in the original document).
For the per-expression analysis, considering the imbalance among expression categories, confusion matrices are used to present the detailed classification results of all expressions, with the diagonal entries representing the recognition rate of each expression. FIG. 5 plots the confusion matrices for the RAF-DB, AffectNet, and FED-RO datasets. On all three datasets, the proposed method achieves a high recognition rate on the "happy" expression. The "disgust" expression is one of the most easily confused expressions for the proposed method, but the same holds for other models, mainly due to the subtlety of the "disgust" expression and the imbalance of the sample distribution.
Comparison of the attention maps of the baseline network and the natural occlusion expression recognition network with accurate attention. Conv5_x is the last convolutional layer of both networks and is closely related to the final expression classification result. To further observe the effect of the proposed method and explore the salient regions each model ultimately focuses on in the presence of occlusion, the attention maps of Conv5_x for the baseline and for the proposed network are visualized in FIG. 6. As FIG. 6 shows, the baseline's Conv5_x attention may not fully cover the key regions that benefit expression recognition, such as the mouth regions in row 1, columns 1 and 4, and the left cheek and eyebrow regions in row 1, column 3, which can lead to the omission of expression information. The baseline's Conv5_x attention may also focus on occluded regions, such as the book region in row 1, column 2, and the mustache region in row 1, column 5; attending to occluded regions is not only useless for expression recognition but can also mislead the model. In contrast, the Conv5_x attention of the proposed network tends to exclude the occluded regions and covers the non-occluded key regions that benefit expression recognition relatively fully. Clearly, compared with the baseline, the Conv5_x attention of the proposed network is more accurate and closer to the regions a human observer attends to when distinguishing expressions under occlusion.
The effect of the confidence threshold T. The confidence scores of the keypoints are used to screen the interest points in the non-occluded regions. To study the effect of different thresholds on the experimental results, an ablation experiment was constructed; the results are shown in FIG. 7. As FIG. 7 shows, the proposed network obtains the best performance when T is 0.2. When T is increased further, performance degrades quickly, because some important facial regions that are not actually occluded are also discarded; conversely, when T is less than 0.2, performance also starts to deteriorate, because noise from the occluded regions is included, which reduces the purity of the features.
To further observe the effect of the confidence threshold T, the region that the natural occlusion expression recognition network with accurate attention is finally focused on at different confidence thresholds T was explored, and the activation maps of Conv4_ x and Conv5_ x of the natural occlusion expression recognition network with accurate attention at different thresholds T are visualized in fig. 8. Wherein Conv5_ x is the last convolutional layer of the model before expression classification, and has a direct relationship with the expression classification result; while Conv4_ x is a convolutional layer directly guided by the occlusion indication map, the influence of the threshold T on the model can be visually seen through Conv4_ x. As can be seen from fig. 8, the first row in fig. 8 is the activation map for Conv4_ x, and the second row is the activation map for Conv5_ x. The first to third columns correspond to thresholds 0, 0.2 and 0.5, respectively.
It can be seen that when the threshold is 0, the occlusion region cannot be excluded; when the threshold value is 0.5, key areas such as the left cheek and the nose cannot be included; when the threshold value is 0.2, the shielded area can be excluded and the non-shielded area can be completely covered. It can be seen that when the threshold T is set to 0.2, the attention of the model is the most accurate.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. A natural occlusion expression recognition method with accurate attention is characterized by comprising the following steps:
acquiring an image to be identified, and preprocessing the image;
processing the preprocessed image to be recognized by utilizing a pre-trained natural occlusion expression recognition network to obtain an expression recognition result;
the construction and training process of the natural occlusion expression recognition network comprises the steps of carrying out face key point detection on an occlusion expression image with a known expression after preprocessing, screening a plurality of interest points from the key points, generating a Gaussian map based on the interest points, and obtaining an occlusion indication map of a corresponding image;
and calculating an attention descriptor according to the neuron activation value of the depth feature of the occlusion expression image, forcing the attention descriptor of the depth feature to be close to the occlusion indication map by constructing attention loss, and forcing the attention of the depth feature to be suitable for different expressions by constructing expression classification loss.
2. The method as claimed in claim 1, wherein the preprocessing comprises using a face detection model to detect facial marker points of the original image, obtaining an aligned face image after similarity transformation, and adjusting to a specified pixel size.
3. The method as claimed in claim 1, wherein the specific process of face keypoint detection includes performing preliminary face keypoint detection, obtaining coordinates and confidence scores thereof, selecting, as final keypoints, points covering the area around the eyes, eyebrows, nose and mouth, and points covering the area between the cheeks and eyes and eyebrows.
4. The method as claimed in claim 1, wherein the step of screening a plurality of interest points from the key points to obtain an occlusion indication map of the corresponding image comprises setting a confidence score threshold, removing the key points with confidence scores smaller than the threshold to obtain the interest points, and generating a gaussian distribution map centered around the coordinates of each interest point, wherein the gaussian distribution map has values of zero everywhere for those occluded points.
5. The method for recognizing natural blocking expressions with accurate attention according to claim 1, wherein the generation of the blocking indication map comprises: calculating the statistical value of each Gaussian distribution map at the same spatial position to obtain an occlusion indication map of an input expression image:
A^T(x_i) = \sum_{m=1}^{M} |G_m(x_i)|

wherein G_m(x_i) refers to the Gaussian distribution map generated for the m-th interest point of the input image x_i, M represents the number of interest points, and |·| refers to the pixel-level absolute value.
6. The method for recognizing the naturally-occluded expression with accurate attention according to claim 1, wherein the specific process of calculating the attention descriptor is as follows:
attention descriptor

A^S(x_i) = \sum_{p=1}^{C_k} |F_p^k(x_i)|

wherein F_p^k(x_i) refers to the p-th feature map of the k-th layer activation for the input image x_i, C_k represents the number of feature maps in the k-th layer of the deep network, and |·| refers to the pixel-level absolute value.
7. A natural occlusion expression recognition method with accurate attention as claimed in claim 1, wherein the attention loss function is:
L_Q = \sum_i \left\| \frac{Q^S(x_i)}{\|Q^S(x_i)\|_2} - \frac{Q^T(x_i)}{\|Q^T(x_i)\|_2} \right\|_2

wherein Q^S(x_i) = vec(A^S(x_i)) is the vectorized form of the attention descriptor based on neuron activation values and Q^T(x_i) = vec(resize(A^T(x_i))) is the vectorized form of the interest-point-based occlusion indication map; before vectorization, the size of the occlusion indication map is first adjusted to H × W by linear interpolation to match the size of the attention descriptor extracted by the network;
or, the expression classification loss is:

L_E = -\sum_{c=1}^{N} y_{i,c} \log \hat{y}_{i,c}

wherein L_E is the expression classification loss, \hat{y}_i is the expression prediction, y_i is the expression label value, and N is the number of expression categories.
8. A natural occlusion expression recognition system with accurate attention, comprising:
the image preprocessing module is configured to acquire an image to be identified and preprocess the image;
the expression recognition module is configured to process the preprocessed image to be recognized by utilizing a pre-trained natural occlusion expression recognition network to obtain an expression recognition result;
the expression recognition module further comprises:
the occlusion indication image generation module is configured to detect face key points of an occlusion expression image with a known expression, screen a plurality of interest points from the key points, generate a Gaussian image based on each interest point, and obtain an occlusion indication image of a corresponding image;
an attention descriptor extraction module configured to calculate an attention descriptor from a neuron activation value of a depth feature that occludes the expression image;
an attention guidance module configured to force an attention descriptor of the depth feature to approach the occlusion indication map by constructing an attention loss, the attention of the depth feature being adapted to different expressions by constructing an expression classification loss.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN202210025377.7A 2022-01-11 2022-01-11 Natural occlusion expression recognition method and system with accurate attention Pending CN114360026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210025377.7A CN114360026A (en) 2022-01-11 2022-01-11 Natural occlusion expression recognition method and system with accurate attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210025377.7A CN114360026A (en) 2022-01-11 2022-01-11 Natural occlusion expression recognition method and system with accurate attention

Publications (1)

Publication Number Publication Date
CN114360026A true CN114360026A (en) 2022-04-15

Family

ID=81109334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210025377.7A Pending CN114360026A (en) 2022-01-11 2022-01-11 Natural occlusion expression recognition method and system with accurate attention

Country Status (1)

Country Link
CN (1) CN114360026A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821717A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN114821717B (en) * 2022-04-20 2024-03-12 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Chintha et al. Recurrent convolutional structures for audio spoof and video deepfake detection
Jaiswal et al. Deep learning the dynamic appearance and shape of facial action units
CN109359538B (en) Training method of convolutional neural network, gesture recognition method, device and equipment
Pigou et al. Gesture and sign language recognition with temporal residual networks
CN112330526A (en) Training method of face conversion model, storage medium and terminal equipment
CN111488773A (en) Action recognition method, device, equipment and storage medium
CN111563417A (en) Pyramid structure convolutional neural network-based facial expression recognition method
Urmee et al. Real-time bangla sign language detection using xception model with augmented dataset
Dar et al. Efficient-SwishNet based system for facial emotion recognition
CN114973383A (en) Micro-expression recognition method and device, electronic equipment and storage medium
Gündüz et al. Turkish sign language recognition based on multistream data fusion
CN114360026A (en) Natural occlusion expression recognition method and system with accurate attention
CN116311472B (en) Micro-expression recognition method and device based on multi-level graph convolution network
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
Le Cornu et al. Voicing classification of visual speech using convolutional neural networks
CN108197593B (en) Multi-size facial expression recognition method and device based on three-point positioning method
Cai et al. Performance analysis of distance teaching classroom based on machine learning and virtual reality
Tolba et al. GAdaBoost: accelerating adaboost feature selection with genetic algorithms
Galiyawala et al. Dsa-pr: discrete soft biometric attribute-based person retrieval in surveillance videos
Wang et al. Dynamic tracking attention model for action recognition
Roig et al. Multi-modal pyramid feature combination for human action recognition
WO2021056531A1 (en) Face gender recognition method, face gender classifier training method and device
Sadek et al. Intelligent real-time facial expression recognition from video sequences based on hybrid feature tracking algorithms
Pu et al. Differential residual learning for facial expression recognition
Sychandran et al. A hybrid xception-ensemble model for the detection of computer generated images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination