LU503919B1 - Facial expression recognition method based on attention-modulated contextual spatial information

Facial expression recognition method based on attention-modulated contextual spatial information

Info

Publication number
LU503919B1
LU503919B1 (application LU503919A)
Authority
LU
Luxembourg
Prior art keywords
contextual
attention
convolution
facial expression
features
Prior art date
Application number
LU503919A
Other languages
French (fr)
Inventor
Datong Xu
Mingyan Cui
Huawei Tao
Yajun Fan
Xue Li
Chunhua Zhu
Weiliang Han
Xinying Guo
Shuguang Zou
Jing Yang
Hongliang Fu
Original Assignee
Univ Henan Technology
Priority date
Filing date
Publication date
Application filed by Univ Henan Technology filed Critical Univ Henan Technology
Application granted granted Critical
Publication of LU503919B1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial expression recognition method based on attention-modulated contextual spatial information. The specific steps are as follows: S1: Acquire a public data set of natural scene facial expression images to be trained, and pre-process the facial expression images; S2: Construct an attention-modulated contextual spatial information network model for facial expression recognition in natural scenes; S3: Train the attention-modulated contextual spatial information (ACSI) network model by using the pre-processed facial expression images; S4: Repeat the model training in step S3 until the set training times are reached, obtain the trained deep residual network model, and use the trained deep residual network model to recognize facial expressions. The combination of contextual convolution and coordinated attention can significantly improve the performance of facial expression recognition. Compared with similar algorithms, ACSI has higher recognition performance on public expression datasets.

Description

DESCRIPTION
FACIAL EXPRESSION RECOGNITION METHOD BASED ON
ATTENTION-MODULATED CONTEXTUAL SPATIAL INFORMATION
TECHNICAL FIELD
The invention relates to the technical field of automatic expression recognition. Specifically, it relates to an expression recognition algorithm, and in particular to a facial expression recognition method based on attention-modulated contextual spatial information.
BACKGROUND
Facial expression is rich in characteristic information, and facial expression recognition has been widely used in human-computer interaction, mental health assessment, and so on.
Traditional facial expression recognition methods can be divided into two categories. One is based on facial Action Units (AUs), which usually transforms the task of facial expression recognition (FER) into an AU detection task; an AU is a small but recognizable muscle action related to expression. However, it is difficult to detect local changes in the face using this method, and factors such as illumination or changes in posture will also reduce the performance of AU detection. The other is to realize facial expression recognition by artificially designing features to represent face images and training expression classifiers. However, in natural scenes, due to uncontrollable factors, the performance of facial expression recognition methods based on artificially designed features is limited. In recent years, deep learning-based facial expression recognition has become a research hotspot. Related work has shifted from controlled laboratory scenes to natural scenes, and some progress has been made. The convolutional neural network (CNN) is the main model for facial expression recognition, and CNN has strong generalization ability in this task. Since then, various improved methods have appeared. Among these methods, on the one hand, to solve the problem of incomplete expression features, Zhao et al. designed a symmetric structure to learn the multiscale features in the residual block and keep the facial expression information at the granularity level. Li et al. proposed a slide-patch (SP) scheme, which slides a window on each feature map to extract the global features of facial expressions. Fan et al. proposed a hierarchical scale network (HSNet) for facial expression recognition, in which an expansion starting block is added to enhance the kernel scale information extraction. Liang et al. adopted a two-branch network for facial expression recognition, in which one branch used a CNN to capture local edge information and the other branch used a visual Transformer to obtain a better global representation. Mao et al. proposed to use convolution kernels of different sizes to form pyramidal convolution units to extract expression features and improve the nonlinear expression ability of the model. However, the above methods improve the completeness of the extracted expression features by adding an auxiliary network layer or adopting a branching structure. On the other hand, to solve the problem of fuzzy classification boundaries between expression classes, Xie et al. proposed a module called the Salient Expressional Region Descriptor (SERD) to highlight the salient features related to expressions and improve the feature representation ability. Gera et al. proposed a new spatial-channel attention network (SCAN) to obtain the local and global attention of each channel and each spatial position, and to process the expression features in the spatial and channel dimensions instead of directly performing feature dimension reduction and compression. Wang et al. designed an attention branch using a U-Net-like architecture to highlight subtle local expression information. After extracting multiscale features, Song Yugin et al. used the CBAM attention mechanism to screen the expression features and improve the representation of effective expression features. The above methods extract more subtle deep facial expression features by adding a network auxiliary layer or using a branch structure, thus improving the performance of the model.
However, these methods ignore the potential contextual relationship between local regions of the human face, and the complex network structure is not conducive to a lightweight model.
Chinese patent document (application number: 202010537198.2) discloses a facial expression recognition method based on a deep residual network. Firstly, the multiscale features of the enlarged facial expression image are extracted through the deep residual network model, then the extracted features are reduced in dimension and compressed, and the processed features are used for expression classification. There are three defects in this method: (1) the standard convolution kernel with a fixed receptive field is used in the residual network, which cannot obtain a wide range of facial expression information; (2) redundant information is removed by the feature dimension reduction and compression scheme, and some important information related to expression is lost; (3) it performs well on laboratory-controlled data sets, but its recognition performance on uncontrolled data sets needs to be verified. The above points limit the completeness of the expression features extracted by this method, and the representation ability of the features needs to be improved.
Chinese patent document (application number: 202110133950.1) discloses a dynamic facial expression recognition method and system based on a representation stream embedding network.
A differentiable representation stream layer is embedded in a convolutional neural network to extract dynamic expression features from a video sequence, and spatial attention weights are used to weight the output features. This method has two defects: (1) only spatial attention is used and feature optimization is not carried out from the channel dimension; (2) it involves the collection and processing of video data, and the working steps are complicated, resulting in high operating cost.
The existing methods have the following shortcomings: 1) in the feature extraction stage, only the global or local features of facial expressions are considered, which limits the completeness of features; 2) in the feature processing stage, the feature is reduced in dimension and compressed, which leads to fuzzy classification boundaries between classes.
SUMMARY
The invention provides a facial expression recognition method based on attention-modulated contextual spatial information and proposes a new natural-scene facial expression recognition model called the attention-modulated contextual spatial information (ACSI) model, wherein contextual convolution is used to replace the standard convolutions in the residual network, and CoResNet18 and CoResNet50 are constructed to extract multiscale features and obtain more subtle expression information without increasing network complexity.
In CoResNet, coordinate attention is embedded in each residual block to pay attention to salient features, enhance the useful information related to expressions, and suppress redundant information in the input feature map, thereby effectively reducing the sensitivity of deep convolution to face occlusion and posture changes.
To solve the technical problems, the technical scheme adopted by the invention is that the facial expression recognition method based on attention-modulated contextual spatial information specifically comprises the following steps:
S1: Acquire a public data set of a natural scene facial expression image to be trained, and pre-process the facial expression image;
S2: Construct an attention-modulated contextual spatial information network model for facial expression recognition in natural scenes;
S3: Train the attention-modulated contextual spatial information (ACSI) network model by using the pre-processed facial expression image;
S4: Repeat the model training in step S3 until the set training times are reached, obtain the trained deep residual network model, and use the trained deep residual network model to recognize facial expression.
By adopting the technical scheme, a facial expression recognition model is constructed based on attention-modulated contextual spatial information. Firstly, convolution kernels with a low expansion rate are used to capture local contextual information; secondly, convolution kernels with a high expansion rate are used to merge global contextual information, so as to extract discriminative local features and related global features of faces and ensure the complementarity of expression feature information; finally, attention weights are assigned to the extracted features by using a coordinate attention mechanism. The differences in features between expression classes increase, and the ability to represent features is strengthened. Experiments on the AffectNet-7 and RAF-DB data sets verify the effectiveness of the ACSI model, and the proposed model has better recognition performance than similar models.
As the preferred technical scheme of the invention, Step S2 comprises the following specific steps:
S21: Replace the middle convolution layer of the residual block with a contextual convolution block to form a contextual convolution residual module and construct a contextual convolution residual network;
S22: Use the coordinated attention (CA) to construct a coordinated attention module, so as to assign attention weights to the multiscale features extracted from the CoResNet to strengthen the feature representation ability.
By adopting the technical scheme, firstly, the contextual convolution is used to replace the standard convolution in the convolution residual block, the contextual convolution residual network (CoResNet) is constructed as the feature extraction part, and convolution kernels with different expansion rates are used to capture local and merge global contextual information; secondly, the coordinated attention module is embedded in CoResNet as the feature processing part, and attention weights are assigned to the extracted features, which highlights the salient features and increases the feature differences between expression classes. Finally, the ACSI model is formed for facial expression recognition.
As the preferred technical scheme of the invention, Step S21 is specifically as follows:
S211: The contextual convolution block receives the input feature map $M^{in}$ and applies convolution kernels $D = \{d_1, d_2, d_3, \ldots, d_n\}$ with different expansion rates at different levels $L = \{1, 2, 3, \ldots, n\}$; that is, the convolution kernel on level i has an expansion rate $d_i, \forall i \in L$;
S212: At the different levels of contextual convolution, the contextual convolution outputs a plurality of feature maps $M^{out}$, and each map has a width $W^{out}$ and a height $H^{out}$ for all $i \in L$;
S213: Maintain the residual structure and combine the correlation between layers to obtain the contextual convolution residual module;
S214: Adjust the level of the contextual convolution block in each layer according to the size of the feature map to construct the contextual convolution residual network. The contextual convolution residual network (CoResNet) constructed in step S2 includes CoResNet18 and CoResNet50. In CoResNet18, each contextual residual module consists of a contextual convolution residual module and a 1*1 standard convolution layer; in CoResNet50, each contextual residual module consists of a contextual convolution residual module and two 1*1 standard convolution layers. The contextual convolution residual module is used for multiscale feature extraction, and the 1*1 standard convolution layer is used for channel transformation.
According to the size of the input feature map, contextual convolution blocks of different levels are used in each contextual residual module. In the first contextual convolution residual module, a contextual convolution block with level 4 (level=4) is used; in the second contextual convolution residual module, the level is equal to 3; in the third contextual convolution residual module, the level is equal to 2; and in the last contextual convolution residual module, the level is equal to 1.
Here, when level=n, the contextual convolution block contains convolution kernels with expansion rates $d_i = i, i = 1, \ldots, n-1, n$.
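For illustration, a minimal PyTorch sketch of such a contextual convolution block is given below. The class name CoConvBlock, the even split of output channels across levels, and the concatenation of the per-level outputs are assumptions made for readability; the description above only fixes the use of parallel 3x3 kernels with expansion rates $d_i = i$.

```python
import torch
import torch.nn as nn

class CoConvBlock(nn.Module):
    """Sketch of a contextual convolution (CoConv) block.

    One 3x3 kernel per level, with dilation rate d_i = i. Lower dilation rates
    capture local facial detail; higher dilation rates merge wider context.
    How the per-level outputs are combined is not fixed by the description;
    here the output channels are split evenly and the outputs are concatenated.
    """

    def __init__(self, in_channels: int, out_channels: int, level: int):
        super().__init__()
        assert out_channels % level == 0, "out_channels must split evenly across levels"
        per_level = out_channels // level
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, per_level, kernel_size=3,
                      padding=d, dilation=d, bias=False)   # padding=d keeps H x W unchanged
            for d in range(1, level + 1)                    # dilation rates 1..level
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every branch sees the same input; all outputs share the spatial size,
        # so they can be concatenated along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: a level-4 block as used in the first network layer (illustrative shapes).
if __name__ == "__main__":
    block = CoConvBlock(in_channels=64, out_channels=64, level=4)
    y = block(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Because each dilated kernel uses padding equal to its dilation rate, all branches keep the input spatial size, so the per-level feature maps can be merged directly.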
As the preferred technical scheme of the invention, in step S21, the learnable parameters of contextual convolution and the number of floating-point operations are calculated by formulas (1) and (2):
$params = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out}$ (1);
$FLOPs = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out} \cdot W^{out} \cdot H^{out}$ (2);
where $M^{in}$ and $M^{out}$ represent the numbers of input and output feature maps, $K^{w}$ and $K^{h}$ represent the width and height of the convolution kernel, and $W^{out}$ and $H^{out}$ represent the width and height of the output feature map.
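A small helper following formulas (1) and (2) makes the counts concrete; the function names and the example numbers below are illustrative assumptions, not values taken from the patent.

```python
def conv_params(m_in: int, k_w: int, k_h: int, m_out: int) -> int:
    """Formula (1): learnable weights of a (contextual) convolution layer."""
    return m_in * k_w * k_h * m_out

def conv_flops(m_in: int, k_w: int, k_h: int, m_out: int, w_out: int, h_out: int) -> int:
    """Formula (2): floating-point operations for one forward pass."""
    return m_in * k_w * k_h * m_out * w_out * h_out

# A 3x3 layer with 64 input and 64 output maps on a 56x56 output (illustrative values):
print(conv_params(64, 3, 3, 64))          # 36864
print(conv_flops(64, 3, 3, 64, 56, 56))   # 115605504
```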
As the preferred technical scheme of the invention, Step S22 is specifically as follows:
S221: Write the feature extracted by CoResNet as X. First, code each channel along the horizontal and vertical coordinate directions by using average pooling kernels of sizes (H, 1) and (1, W). The coded output $y_c^h(h)$ of the c-th channel at height h is calculated by formula (3):
$y_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (3);
The coded output $y_c^w(w)$ of the c-th channel at width w is calculated by formula (4):
$y_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (4);
S222: Perform feature aggregation on the two transformations in step S221 along the two spatial directions and return a pair of direction-aware attention maps;
S223: Concatenate the pair of direction-aware attention maps generated in step S222 and send them into a 1×1 convolution transformation function F:
$f = \delta(F([y^h, y^w]))$ (5);
where $[\cdot, \cdot]$ represents the concatenation operation along the spatial dimension, $\delta$ is a nonlinear sigmoid activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions; to reduce the complexity of the model, an appropriate reduction rate r is adopted to reduce the number of channels of f;
S224: Decompose f into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$ along the spatial dimension, and use two 1×1 convolution transformations $F_h$ and $F_w$ to transform $f^h$ and $f^w$ into tensors with the same number of channels, according to formulas (6) and (7):
$m^h = \sigma(F_h(f^h))$ (6);
$m^w = \sigma(F_w(f^w))$ (7);
where $\sigma$ is the sigmoid function, and the outputs $m^h$ and $m^w$ are taken as attention weights; finally, the output Z of the coordinated attention module is shown in formula (8):
$z_c(i, j) = x_c(i, j) \times m_c^h(i) \times m_c^w(j)$ (8);
where $z_c(i, j)$ is the output, $x_c(i, j)$ is the input, and $m_c^h(i)$ and $m_c^w(j)$ are the attention weights.
This technical scheme pays attention to the salient features and enhances the feature differences between expression classes: the coordinated attention mechanism is adopted, and the coordinated attention (CA) module is embedded in the contextual convolution residual network for feature processing to enhance the expression-related information in the input feature map and suppress redundant information. By embedding coordinated attention in the network, the long-distance dependence between input features can be captured in one spatial direction while the position information of the expression-related face region is maintained in the other spatial direction; the obtained feature map is then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied to the input feature map to enhance the subtle expression information. A CA module is added after each contextual convolution block and after CoResNet to screen key scale features and emphasize salient face regions, strengthening the feature representation ability and thus improving the recognition performance.
As the preferred technical scheme of the invention, Step S1 is specifically as follows: First, adjust the size of the input image to 256x256, then crop its upper, lower, left, right, and central parts to obtain five 224x224 face images with the same expression tag, and then horizontally flip them with a probability of 0.5.
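A possible torchvision sketch of this pre-processing is shown below; torchvision's FiveCrop takes the four corner crops plus the centre crop, which is assumed here to correspond to the five described regions, and stacking the five crops into one tensor (with the label repeated per crop) is an assumption about how the crops are batched.

```python
import torch
from torchvision import transforms

# Resize to 256x256, take four corner crops plus the centre crop at 224x224,
# flip each crop horizontally with probability 0.5, and convert the crops to tensors.
# All five crops keep the same expression label, so the label is repeated per crop.
flip = transforms.RandomHorizontalFlip(p=0.5)
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.FiveCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(flip(crop)) for crop in crops])),
])
```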
As the preferred technical scheme of the invention, Step S3 is specifically as follows:
S31: Carry out multiscale feature extraction and contextual spatial information integration on the input facial expression image through the contextual convolution residual network (CoResNet);
S32: Embed an attention module in each contextual convolution residual module to pay attention to the salient scale features, and use coordinated attention to weight the extracted features for CoResNet output features, so as to capture the correlation of expression information in two spatial directions and keep the key area information of the face;
S33: Down-sample the attention-weighted features and classify the down-sampled features.
As the preferred technical scheme of the invention, in step S3, the attention-modulated contextual spatial information network model (ACSI) includes a convolution layer, a bn layer, a relu layer, a maxpool layer, four contextual residual modules, a coordinated attention (CA) module, a global average pooling layer, an fc layer, and a softmax classification layer, which are sequentially connected. The convolution layer performs a 3*3 standard convolution operation on the input facial expression image to extract features; the bn layer normalizes the extracted features in batches to prevent the gradient from vanishing or exploding; then the relu layer performs nonlinear activation on them; the maxpool layer is used for feature dimension reduction; the four contextual convolution modules are used to extract multiscale face features from the reduced-dimension features; the coordinated attention (CA) module embedded in each contextual convolution module is used to pay attention to features at different scales; the CA module behind the CoResNet output feature layer performs attention weighting on the output features; the global average pooling layer and the fc layer perform down-sampling, and the down-sampled facial expression features are classified by the softmax classifier.
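The following PyTorch skeleton sketches this layer ordering only; the four contextual residual stages and the CA blocks are stood in for by nn.Identity placeholders, and the stem width and strides are assumptions, so this illustrates the data flow rather than the full ACSI implementation.

```python
import torch
import torch.nn as nn

class ACSISkeleton(nn.Module):
    """Layer ordering described above: conv -> bn -> relu -> maxpool -> four contextual
    residual stages (each with an embedded CA module) -> CA -> global average pooling ->
    fc -> softmax. The stages and CA modules are placeholders here."""

    def __init__(self, num_classes: int = 7, stem_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(3, stem_channels, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(stem_channels)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.stages = nn.ModuleList([nn.Identity() for _ in range(4)])  # contextual residual stages
        self.ca_out = nn.Identity()             # CA module after the CoResNet output feature layer
        self.avgpool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Linear(stem_channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.maxpool(self.relu(self.bn(self.conv(x))))
        for stage in self.stages:
            x = stage(x)
        x = self.ca_out(x)
        x = self.avgpool(x).flatten(1)
        logits = self.fc(x)
        # Formula (9); during training, nn.CrossEntropyLoss is usually applied to the raw logits.
        return torch.softmax(logits, dim=1)
```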
As the preferred technical scheme of the invention, the input of the softmax classifier is an arbitrary real-number vector, and the output is a vector in which the value of each element lies in (0, 1) and the elements sum to 1; the calculation formula of softmax is formula (9):
$softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$ (9);
where $x_i$ represents the i-th element, $softmax(x_i)$ represents the output value of the i-th element, and n represents the number of elements, that is, the number of classified categories. Through the softmax function, the multi-class output values are converted into a probability distribution in the range [0, 1] that sums to 1.
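As a quick numerical illustration of formula (9) (the input scores below are arbitrary):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Formula (9): exponentiate each element and normalise so the outputs sum to 1."""
    e = np.exp(x - x.max())          # subtracting the max improves numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099], sums to 1
```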
As the preferred technical scheme of the invention, in step S3, before training the attention-modulated contextual spatial information network model (ACSI) with the facial expression datasets, the large-scale facial expression data set MS-CELEB-1M (including 10 million face images of nearly 100,000 subjects), with more than 10 million samples, is used as a training set to pretrain ACSI; then the facial expression data sets AffectNet-7 and RAF-DB are respectively input into the pretrained ACSI model, the output value is obtained through forward propagation, and the loss value of the ACSI model is calculated from the output value (the predicted category probability) using a cross-entropy loss function. The calculation formula of the cross-entropy loss function is shown in formula (10):
$loss = -\frac{1}{N} \sum_{i=1}^{N} p(x_i) \log(q(x_i))$ (10);
where $p(x_i)$ refers to the real class probability and $q(x_i)$ is the predicted class probability of the model;
In step S4, according to the loss value of the ACSI model calculated by formula (10), the network weights are updated by backpropagation, and the training is repeated until the set training times are reached to obtain the trained attention-modulated contextual spatial information network model, namely the ACSI model.
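A minimal training-loop sketch of steps S3-S4 is given below; the function and variable names (train_acsi, train_loader) are assumptions, and torch.nn.CrossEntropyLoss is used as the cross-entropy of formula (10) applied to hard labels.

```python
import torch
import torch.nn as nn

def train_acsi(model: nn.Module, train_loader, epochs: int, device: str = "cuda") -> nn.Module:
    """Sketch of steps S3-S4: forward propagation, cross-entropy loss (formula (10)),
    backpropagation, and weight updates, repeated for the set number of training epochs."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()  # cross-entropy loss; expects raw class scores (logits)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)            # forward propagation
            loss = criterion(outputs, labels)  # loss value of the ACSI model
            optimizer.zero_grad()
            loss.backward()                    # backpropagation
            optimizer.step()                   # update the network weights
    return model
```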
Compared with the prior methods, the facial expression recognition method based on attention-modulated contextual spatial information has the following beneficial effects: (1) Contextual convolution blocks built from convolution kernels with different expansion rates replace some convolution layers in the residual network, and contextual spatial information of face images is accessed at multiple network layers to extract more robust multiscale expression features; at the same time, the parameters and calculation cost remain similar to those of standard convolution layers of the same size; (2) A new attention mechanism, namely coordinated attention, is used, which can capture the dependence relationship between discriminative local features in one spatial direction while keeping the accurate location information of key face regions along the other spatial direction, thus reducing the sensitivity of the deep network to occlusion and posture changes and strengthening the feature representation ability; (3) The effectiveness and reliability of the constructed model for facial expression recognition in uncontrolled environments are verified on two large-scale natural-environment facial expression image data sets.
BRIEF DESCRIPTION OF THE FIGURES
Fig. 1 is a flow chart of a facial expression recognition method based on attention-modulated contextual spatial information of the present invention;
Fig. 2 is a block diagram of an attention-modulated contextual spatial information network (ACSI) model in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention;
Fig. 3 is a schematic diagram of a contextual convolution block in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention;
Fig. 4 is a schematic structural diagram of a coordinated attention module in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention;
Fig. 5 shows the t-SNE visualization results of the features extracted by the baseline method and the ACSI50 model on the AffectNet-7 dataset; (a) is a t-SNE visualization of the features extracted by the baseline method on the AffectNet-7 data set; (b) is a t-SNE visualization of the features extracted by the ACSI50 model on the AffectNet-7 data set;
Fig. 6 shows the t-SNE visualization results of features extracted by the baseline method and the ACSI50 model on RAF-DB in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention; (a) is a t-SNE visualization of the features extracted by the baseline method on RAF-DB; (b) is a t-SNE visualization of the features extracted by the ACSI50 model on RAF-DB;
Fig. 7 is a schematic diagram of attention visualization results on example expression images in the RAF-DB dataset in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
In the following, the technical scheme in the embodiment of the invention will be clearly and completely described with the figures attached in the embodiment of the invention.
Embodiment: As shown in Fig. 1, the facial expression recognition method based on attention-modulated contextual spatial information specifically comprises the following steps.
S1: Acquire a public data set of a natural scene facial expression image to be trained, and pre-process the facial expression image;
S1 is specifically as follows: First, adjust the size of the input image to 256x256, then crop its upper, lower, left, right, and central parts to obtain five 224x224 face images with the same expression tag, and then horizontally flip them with a probability of 0.5.
S2: Construct an attention-modulated contextual spatial information network model for facial expression recognition in natural scenes; firstly, the contextual convolution is used to replace the standard convolution in the convolution residual block, and the contextual convolution residual network (CoResNet) is constructed as the feature extraction part, and the convolution kernels with different expansion rates are used to capture local and merge global contextual information; secondly, the coordinated attention module is embedded in CoResNet as the feature processing part, and the attention weight is assigned to the extracted features, which highlights the salient features and increases the feature differences between expression classes.
Finally, the ACSI model is formed for facial expression recognition.
Step S2 comprises the following specific steps: LU503919
S21: Replace the middle convolution layer of the residual block with a contextual convolution block to form a contextual convolution residual module and construct a contextual convolution residual network. In the task of deep facial expression recognition, multiscale features are very important; they can capture richer local details while describing global semantic information. Contextual convolution blocks contain convolution kernels with different expansion rates, and multiscale features can be extracted through receptive fields of different sizes. In a CNN, standard convolution only uses a convolution kernel with a fixed receptive field, and its kernel size is usually 3x3, because increasing the kernel size will increase the parameter quantity and calculation time. The learnable parameters (weights) of standard convolution and the number of floating-point operations can be calculated by formulas (1) and (2). Like the standard convolution layer, all convolution kernels in the contextual convolution block are independent and allowed to be executed in parallel. Unlike the standard convolution layer, a contextual convolution of the same size can integrate contextual information while maintaining a similar number of parameters and calculation cost. Therefore, the contextual convolution block can be used as a direct substitute for the standard convolution layer to better complete feature extraction;
As shown in Fig. 2, step S21 is specifically as follows:
S211: The contextual convolution block receives the input feature map $M^{in}$ and applies convolution kernels $D = \{d_1, d_2, d_3, \ldots, d_n\}$ with different expansion rates at different levels $L = \{1, 2, 3, \ldots, n\}$; that is, the convolution kernel on level i has an expansion rate $d_i, \forall i \in L$. The expansion rate increases from level 1 to level n, so that more and more contextual information can be extracted. Among them, the convolution kernels with lower expansion rates are responsible for capturing the local details of the face from the input feature map, while the convolution kernels with higher expansion rates are responsible for merging the global contextual information, thus helping the whole facial expression recognition process.
S212: At the different levels of contextual convolution, the contextual convolution outputs a plurality of feature maps $M^{out}$, and each map has a width $W^{out}$ and a height $H^{out}$ for all $i \in L$;
S213: Maintain the residual structure and combine the correlation between layers to obtain the contextual convolution residual module;
S214: Adjust the level of the contextual convolution block in each layer according to the size of the feature map to construct the contextual convolution residual network. The contextual convolution residual network (CoResNet) constructed in step S2 includes CoResNet18 and CoResNet50. In CoResNet18, each contextual residual module consists of a contextual convolution residual module and a 1*1 standard convolution layer; in CoResNet50, each contextual residual module consists of a contextual convolution residual module and two 1*1 standard convolution layers. The contextual convolution residual module is used for multiscale feature extraction, and the 1*1 standard convolution layer is used for channel transformation.
According to the size of the input feature map, contextual convolution blocks of different levels are used in each contextual residual module. The schematic diagram of the contextual convolution block is shown in Fig. 3. In the first contextual convolution residual module, a contextual convolution block with level 4 (level=4) is used; in the second contextual convolution residual module, the level is equal to 3; in the third contextual convolution residual module, the level is equal to 2; and in the last contextual convolution residual module, the level is equal to 1. Here, when level=n, the contextual convolution block contains convolution kernels with expansion rates $d_i = i, i = 1, \ldots, n-1, n$. Different from previous work on network cascades, this technical scheme directly integrates the contextual convolution into the widely used residual network and improves the residual blocks in ResNet18 and ResNet50, respectively, to obtain the corresponding CoResNet18 and CoResNet50. CoResNet is mainly composed of four network layers, and each layer has contextual convolution residual blocks of a different level. Because the size of the feature map decreases as the network layer gets farther from the input, the level of the contextual convolution block in each layer is adjusted according to the size of the feature map. In the first layer, CoConv4 is adopted, that is, the contextual convolution block of level=4; in the second layer, it is CoConv3; in the third layer, it is CoConv2. However, since the resolution of the feature map in the last layer has been reduced to 7x7, it is no longer reasonable to use contextual convolution there, so only one standard convolution is used, which is also marked as CoConv1. The convolution parameters of the different CoConv levels are shown in Table 1.
Table 1 Convolution parameters of the contextual convolution residual block
level=4: 3x3, 16, d1=1; 3x3, 16, d2=2; 3x3, 16, d3=3; 3x3, 16, d4=4 (output size 56x56)
level=3: 3x3, 64, d1=1; 3x3, 32, d2=2; 3x3, 32, d3=3 (output size 28x28)
level=2: 3x3, 128, d2=2 (output size 14x14)
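The per-layer levels can be summarised in a small configuration sketch; the dictionary keys and the helper name below are illustrative assumptions, while the level and feature-map values follow the paragraph and Table 1 above.

```python
# Level of the contextual convolution block in each network layer, with the
# corresponding feature-map size reported for CoResNet.
COCONV_LEVELS = {
    "layer1": {"level": 4, "feature_size": 56},   # CoConv4
    "layer2": {"level": 3, "feature_size": 28},   # CoConv3
    "layer3": {"level": 2, "feature_size": 14},   # CoConv2
    "layer4": {"level": 1, "feature_size": 7},    # CoConv1 (plain 3x3 convolution)
}

def dilation_rates(level: int) -> list:
    """When level = n, the block holds kernels with dilation rates d_i = i for i = 1..n."""
    return list(range(1, level + 1))

print(dilation_rates(COCONV_LEVELS["layer1"]["level"]))  # [1, 2, 3, 4]
```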
As the preferred technical scheme of the invention, in step S21, the learnable parameters of contextual convolution and the number of floating-point operations are calculated by formulas (1) and (2):
$params = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out}$ (1);
$FLOPs = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out} \cdot W^{out} \cdot H^{out}$ (2);
where $M^{in}$ and $M^{out}$ represent the numbers of input and output feature maps, $K^{w}$ and $K^{h}$ represent the width and height of the convolution kernel, and $W^{out}$ and $H^{out}$ represent the width and height of the output feature map.
S22: Use the coordinated attention (CA) to construct a coordinated attention module (its structure is shown in Fig. 4) to assign attention weights to the multiscale features extracted from CoResNet and strengthen the feature representation ability;
And step S22 is specifically as follows:
S221: Write the feature extracted by CoResNet as X. First, code each channel along the horizontal and vertical coordinate directions by using average pooling kernels of sizes (H, 1) and (1, W). The coded output $y_c^h(h)$ of the c-th channel at height h is calculated by formula (3):
$y_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (3);
The coded output $y_c^w(w)$ of the c-th channel at width w is calculated by formula (4):
$y_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (4);
S222: Perform feature aggregation on the two transformations in step S221 along the two spatial directions and return a pair of direction-aware attention maps;
S223: Concatenate the pair of direction-aware attention maps generated in step S222 and send them into a 1×1 convolution transformation function F:
$f = \delta(F([y^h, y^w]))$ (5);
where $[\cdot, \cdot]$ represents the concatenation operation along the spatial dimension, $\delta$ is a nonlinear sigmoid activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions; to reduce the complexity of the model, an appropriate reduction rate r is adopted to reduce the number of channels of f;
S224: Decompose f into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$ along the spatial dimension, and use two 1×1 convolution transformations $F_h$ and $F_w$ to transform $f^h$ and $f^w$ into tensors with the same number of channels, according to formulas (6) and (7):
$m^h = \sigma(F_h(f^h))$ (6);
$m^w = \sigma(F_w(f^w))$ (7);
where $\sigma$ is the sigmoid function, and the outputs $m^h$ and $m^w$ are taken as attention weights; finally, the output Z of the coordinated attention module is shown in formula (8):
$z_c(i, j) = x_c(i, j) \times m_c^h(i) \times m_c^w(j)$ (8);
where $z_c(i, j)$ is the output, $x_c(i, j)$ is the input, and $m_c^h(i)$ and $m_c^w(j)$ are the attention weights.
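A PyTorch sketch of the coordinated attention module following formulas (3)-(8) is given below. The batch normalization before the non-linearity and the default reduction rate are assumptions borrowed from common coordinate-attention implementations rather than details fixed by this description.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Sketch of the coordinated attention (CA) module, formulas (3)-(8)."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)                  # reduced channel count C/r
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))        # formula (3): average over the width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))        # formula (4): average over the height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1) # shared 1x1 transformation F
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Sigmoid()                              # delta in formula (5)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h in formula (6)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w in formula (7)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        y_h = self.pool_h(x)                        # (n, c, h, 1)
        y_w = self.pool_w(x).permute(0, 1, 3, 2)    # (n, c, w, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([y_h, y_w], dim=2))))  # formula (5)
        f_h, f_w = torch.split(f, [h, w], dim=2)    # decompose along the spatial dimension
        m_h = torch.sigmoid(self.conv_h(f_h))                       # formula (6): (n, c, h, 1)
        m_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))   # formula (7): (n, c, 1, w)
        return x * m_h * m_w                        # formula (8), broadcast over H and W

# Quick shape check with illustrative sizes.
if __name__ == "__main__":
    ca = CoordAttention(channels=64, reduction=16)
    print(ca(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```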
To pay attention to the salient features and enhance the feature differences between expression classes, the coordinated attention mechanism is adopted, and the coordinated attention (CA) module is embedded in the contextual convolution residual network for feature processing, so as to enhance the expression-related information in the input feature map and suppress redundant information. As shown in the figures, by embedding coordinated attention in the network, the long-distance dependence between input features can be captured in one spatial direction while the position information of the expression-related face region is maintained in the other spatial direction; the obtained feature map is then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied to the input feature map to enhance the subtle expression information. A CA module is added after each contextual convolution block and after CoResNet to screen key scale features and emphasize salient face regions, strengthening the feature representation ability and thus improving the recognition performance.
S3: Train the attention-modulated contextual spatial information (ACSI) network by using the pre-processed facial expression images. In step S3, the contextual spatial information network model (ACSI) includes a convolution layer, a bn layer, a relu layer, a maxpool layer, four contextual residual modules, a coordinated attention (CA) module, a global average pooling layer, an fc layer, and a softmax classification layer, which are sequentially connected. The convolution layer performs a 3*3 standard convolution operation on the input facial expression image to extract features; the bn layer normalizes the extracted features in batches to prevent the gradient from vanishing or exploding; then the relu layer performs nonlinear activation on them; the maxpool layer is used for feature dimension reduction; the four contextual convolution modules are used to extract multiscale face features from the reduced-dimension features; the coordinated attention (CA) module embedded in each contextual convolution module is used to pay attention to features at different scales; the CA module behind the CoResNet output feature layer performs attention weighting on the output features; the global average pooling layer and the fc layer perform down-sampling, and the down-sampled facial expression features are classified by the softmax classifier.
Step S3 is specifically as follows:
S31: Carry out multiscale feature extraction and contextual spatial information integration on the input facial expression image through the contextual convolution residual network (CoResNet),
S32: Embed an attention module in each contextual convolution residual module to pay attention to the salient scale features, and use coordinated attention to weight the extracted features for CoResNet output features, so as to capture the correlation of expression information in two spatial directions and keep the key area information of the face;
S33: Down-sample the attention-weighted features and classify the down-sampled features with the softmax classifier.
The input of the softmax classifier is an arbitrary real-number vector, and the output is a vector in which the value of each element lies in (0,1) and the elements sum to 1; the calculation formula of softmax is formula (9):
$softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$ (9);
where $x_i$ represents the i-th element, $softmax(x_i)$ represents the output value of the i-th element, and n represents the number of elements, that is, the number of classified categories. Through the softmax function, the multi-class output values are converted into a probability distribution in the range [0,1] that sums to 1.
In step S3, before training the attention-modulated contextual spatial information network model (ACSI) with the facial expression datasets, the large-scale facial expression data set MS-CELEB-1M (including 10 million face images of nearly 100,000 subjects), with more than 10 million samples, is used as a training set to pre-train ACSI. Then the facial expression data sets AffectNet-7 and RAF-DB are respectively input into the pre-trained ACSI model, the output value is obtained through forward propagation, and the loss value of the ACSI model is calculated from the output value (the predicted category probability) using a cross-entropy loss function. The calculation formula of the cross-entropy loss function is shown in formula (10):
$loss = -\frac{1}{N} \sum_{i=1}^{N} p(x_i) \log(q(x_i))$ (10);
where $p(x_i)$ refers to the real class probability and $q(x_i)$ is the predicted class probability of the model;
S4: Repeat the model training in step S3 until the set training times are reached, obtain the trained deep residual network model, and use the trained deep residual network model to recognize facial expression. In step S4, according to the loss value of the ACSI model calculated by formula (10), the network weights are updated by backpropagation, and the training is repeated until the set training times are reached to obtain the trained attention-modulated contextual spatial information network, namely ACSI model.
Specific Application Embodiment: With the above technical scheme, to verify the effectiveness of the ACSI model proposed in this paper, experiments are carried out on two public facial expression databases, AffectNet and RAF-DB, both of which provide face images in natural scenes. Among them, the AffectNet database is one of the largest databases in the research field of facial emotion computing, with about 440,000 face images, including AffectNet-7 and AffectNet-8 (adding the "contempt" category); the RAF-DB database includes 7 basic facial expressions and 12 compound facial expressions, with a total of about 30,000 face images. As shown in Table 2, the face images of the 7 basic facial expressions (happiness, surprise, sadness, anger, disgust, fear, and neutrality) in the AffectNet-7 and RAF-DB databases are used as training sets. Because the test set is not available, testing is performed on the corresponding validation set to evaluate the performance of the proposed model.
In the image pre-processing stage in step S1, first, adjust the size of the input image to 256 x 256, then crop its upper, lower, left, right, and central parts to obtain five 224 x 224 face images with the same expression tag, and then horizontally flip them with a probability of 0.5. The model is implemented in PyTorch and trained on an NVIDIA GeForce GTX 1650 GPU. During the training process, the SGD algorithm is used for optimization.
The momentum is set to 0.9, and the initial learning rate is 0.01. The learning rate is multiplied by 0.1 every 20 iterations, the total number of iterations is 60, and the batch size is 16.
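A sketch of this training configuration in PyTorch is shown below; the helper name and the data-loader settings other than the batch size are assumptions, and the 20-iteration decay is read as a step decay of the learning rate by a factor of 0.1.

```python
import torch
from torch.utils.data import DataLoader, Dataset

def make_training_setup(model: torch.nn.Module, train_dataset: Dataset):
    """Optimizer, learning-rate schedule, and data loader matching the stated settings:
    SGD with momentum 0.9, initial learning rate 0.01, learning rate scaled by 0.1
    every 20 epochs, batch size 16; training then runs for 60 epochs in total."""
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    return loader, optimizer, scheduler  # call scheduler.step() once per epoch
```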
Table 2 Detailed information on the experimental data sets, including expression categories, number of training samples, and number of test samples.
The experimental results of this facial expression recognition method based on attention-modulated contextual spatial information on the AffectNet-7 and RAF-DB validation sets are shown in Table 3, in which CoResNet18 and CoResNet50 (the baseline models in this article) are contextual convolution residual networks. CoResNet18_CA_a and CoResNet50_CA_a embed the coordinated attention module behind the feature output layers of CoResNet18 and CoResNet50, respectively. CoResNet18_CA_b and CoResNet50_CA_b embed the coordinated attention module in each contextual convolution residual block of the corresponding CoResNet;
Table 3 The recognition accuracy of the ACSI model on the AffectNet-7 and RAF-DB validation sets
As can be seen from Table 3, on the AffectNet-7 validation set, the facial expression recognition accuracy of ACSI18 is increased by 1.70% compared to CoResNet18, and by 1.36% and 1.30% compared to CoResNet18_CA_a and CoResNet18_CA_b, respectively. The accuracy of facial expression recognition of ACSI50 is increased by 2.03% compared with CoResNet50, and by 0.80% and 0.25% compared with CoResNet50_CA_a and CoResNet50_CA_b, respectively. On the RAF-DB validation set, the accuracy of facial expression recognition of ACSI18 is increased by 1.89% compared with CoResNet18, and by 1.23% and 1.14% compared with CoResNet18_CA_a and CoResNet18_CA_b, respectively. The accuracy of facial expression recognition of ACSI50 is increased by 1.79% compared with CoResNet50, and by 0.35% and 0.06% compared with CoResNet50_CA_a and CoResNet50_CA_b, respectively. The above experimental results show the effectiveness and generalization of the proposed algorithm.
To further illustrate the effectiveness of the contextual spatial information (ACSI) network model constructed in the facial expression recognition method based on attention-modulated contextual spatial information, the performance of the constructed contextual spatial information (ACSI) network model is compared with other similar models in recent years on data sets
AffectNet-7 and RAF-DB, as shown in Table 4 and Table 5. As can be seen from Table 4, on AffectNet-7 the accuracy of ACSI50 increases by 1.61% compared to FMPN, by 0.97% compared to OADN, by 0.75% compared with Ensemble CNN, and by 0.52% compared to the DDA-Loss method. As can be seen from Table 5, on RAF-DB the ACSI50 proposed in this paper is increased by 2.5% compared with FSN, by 0.91% compared with CNN, by 0.76% compared with DLP-CNN, and by 0.33% compared with pACNN.
The results show that the recognition accuracy of the model proposed in this paper has been improved on AffectNet-7 and RAF-DB, and it is competitive compared to similar models.
Because these models cannot solve the problem of limited feature completeness or fuzzy classification boundaries between classes, the recognition performance is low. The model proposed in this paper can extract multiscale facial expression features by using contextual convolution. Embedding the coordinated attention module in the network can make the network pay attention to more discriminating expression features, and the correlation between layers can be better combined through the residual structure, which finally improves the recognition performance.
Table 4 Comparison of performance of models on AffectNet-7
Table 5 Comparison of performance of models on RAF-DB
To prove the inter-class differences of the expression features extracted by the ACSI model, in this section t-SNE visualization is performed on the features extracted by the ACSI model on the AffectNet-7 and RAF-DB validation sets, and the results are shown in Figs. 5 and 6. Both figures show the 7 basic facial expression classes, namely anger, disgust, fear, happiness, sadness, surprise, and neutral. As can be seen from the figures, compared to the baseline model, the features extracted by the ACSI50 model show relative dispersion between classes and relative aggregation within classes.
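A sketch of how such a t-SNE plot can be produced with scikit-learn is given below; the function name and plotting choices are assumptions, and `features`/`labels` stand for the extracted deep expression features and their class labels.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Project extracted expression features (N, D) to 2-D with t-SNE and colour the
    points by expression class (N,), as done for Figs. 5 and 6."""
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 5))
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.colorbar(label="expression class")
    plt.savefig(title.replace(" ", "_") + ".png", dpi=200)
```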
To further study the function of the attention module in the model, the class activation map (CAM) method is used to visualize the attention maps generated by the attention mechanism used in this paper.
The class activation map method is used to visualize the activated parts for different expressions and map the weights of the output layer onto the convolution feature map to identify the importance of different regions of the face image. Specifically, the facial activation region of the proposed ACSI network is visualized through CAM to obtain the attention map. To display the attention region on the original image, the attention map is generally adjusted to the same size as the input image and visualized on the original image through COLORMAP_JET color mapping. When this technical scheme is used, the specific steps are as follows: first, the visual attention map is adjusted to the same size as the input image, and then the attention map is visualized on the original image through color mapping. Fig. 7 shows the attention maps of different expression images in RAF-DB. There are 7 columns in the diagram, and each column shows one of the seven expressions; from left to right, they are anger, disgust, fear, happiness, sadness, surprise, and neutral. It can be clearly seen from Fig. 7 that the attention module used in this paper makes the network focus on the more discriminative face regions in the presence of occlusion and posture changes. The results show that the combination of contextual convolution and coordinated attention can significantly improve the performance of facial expression recognition. Compared with similar algorithms, ACSI has higher recognition performance on public expression datasets.
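A sketch of the described overlay step with OpenCV is shown below; the function name and the blending weight are assumptions, while the resizing and COLORMAP_JET mapping follow the description above.

```python
import cv2
import numpy as np

def overlay_attention(image_bgr: np.ndarray, attention_map: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Resize an attention/class-activation map to the input image size, map it through
    COLORMAP_JET, and blend it with the original image, as described for Fig. 7.
    `attention_map` is a 2-D array of non-negative activations."""
    h, w = image_bgr.shape[:2]
    att = cv2.resize(attention_map.astype(np.float32), (w, h))
    att = (att - att.min()) / (att.max() - att.min() + 1e-8)          # normalise to [0, 1]
    heatmap = cv2.applyColorMap(np.uint8(255 * att), cv2.COLORMAP_JET)
    return cv2.addWeighted(heatmap, alpha, image_bgr, 1 - alpha, 0)
```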
The above is only the preferred embodiment of the invention, and it is not used to limit the invention. Any modification, equivalent substitution, improvement, etc. made within the spirit and principle of the invention should be included in the protection scope of the invention.

Claims (10)

CLAIMS
1. A facial expression recognition method based on attention-modulated contextual spatial information, wherein it comprises the following specific steps: S1: acquire a public data set of natural scene facial expression images to be trained, and pre-process the facial expression images; S2: construct an attention-modulated contextual spatial information network model ACSI for facial expression recognition in natural scenes; S3: train the contextual spatial information network model ACSI by using the pre-processed facial expression images; S4: repeat the model training in step S3 until the set training times are reached, obtain the trained deep residual network model, and use the trained deep residual network model to recognize facial expressions.
2. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 1, wherein step S2 comprises the following specific steps: S21: replace the middle convolution layer of the residual block with a contextual convolution block to form a contextual convolution residual module and construct a contextual convolution residual network; S22: use the coordinate attention to construct a coordinate attention CA module, to assign attention weights to the multiscale features extracted from the contextual convolution residual network CoResNet constructed in step S21 to strengthen the feature representation ability.
3. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 2, wherein step S21 is specifically as follows: S211: the contextual convolution block receives the input feature map $M^{in}$ and applies convolution kernels $D = \{d_1, d_2, d_3, \ldots, d_n\}$ with different expansion rates at different levels $L = \{1, 2, 3, \ldots, n\}$, that is, the convolution kernel on level i has an expansion rate $d_i, \forall i \in L$; S212: at the different levels of contextual convolution, the contextual convolution outputs a plurality of feature maps $M^{out}$, and each map has a width $W^{out}$ and a height $H^{out}$ for all $i \in L$; S213: maintain the residual structure and combine the correlation between layers to obtain the contextual convolution residual module; S214: adjust the level of the contextual convolution block in each layer according to the size of the feature map to construct the contextual convolution residual network.
4. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 3, wherein, in step S21, the learnable parameters of contextual convolution and the number of floating-point operations are calculated by formulas (1) and (2): $params = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out}$ (1); $FLOPs = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out} \cdot W^{out} \cdot H^{out}$ (2); where $M^{in}$ and $M^{out}$ represent the numbers of input and output feature maps, $K^{w}$ and $K^{h}$ represent the width and height of the convolution kernel, and $W^{out}$ and $H^{out}$ represent the width and height of the output feature map.
5. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 2, wherein step S22 is specifically as follows: S221: write the feature extracted by CoResNet as X; first, code each channel along the horizontal and vertical coordinate directions by using average pooling kernels of sizes (H, 1) and (1, W); the coded output $y_c^h(h)$ of the c-th channel at height h is calculated by formula (3): $y_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (3); formula (3) calculates the coded output of the c-th channel when the height in the horizontal coordinate direction is h, summing the input features along the width i; the coded output $y_c^w(w)$ of the c-th channel at width w is calculated by formula (4): $y_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (4); formula (4) calculates the coded output of the c-th channel when the width in the vertical coordinate direction is w, summing the input features along the height j; S222: perform feature aggregation on the two transformations in step S221 along the two spatial directions, and return a pair of direction-aware attention maps; S223: concatenate the pair of direction-aware attention maps generated in step S222 and send them into a 1×1 convolution transformation function F: $f = \delta(F([y^h, y^w]))$ (5); where $[\cdot, \cdot]$ represents the concatenation operation along the spatial dimension, $\delta$ is a nonlinear sigmoid activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions; S224: decompose f into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$ along the spatial dimension, and use two 1×1 convolution transformations $F_h$ and $F_w$ to transform $f^h$ and $f^w$ into tensors with the same number of channels, according to formulas (6) and (7): $m^h = \sigma(F_h(f^h))$ (6); $m^w = \sigma(F_w(f^w))$ (7); where $\sigma$ is the sigmoid function, and the outputs $m^h$ and $m^w$ are taken as attention weights; finally, the output Z of the coordinated attention module is shown in formula (8): $z_c(i, j) = x_c(i, j) \times m_c^h(i) \times m_c^w(j)$ (8); where $z_c(i, j)$ is the output, $x_c(i, j)$ is the input, and $m_c^h(i)$ and $m_c^w(j)$ are the attention weights.
6. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 2, wherein step S1 is specifically as follows: first, adjust the size of the input image to 256 x 256, then crop its upper, lower, left, right, and central parts to obtain five 224 x 224 face images with the same expression tag, and then horizontally flip them with a probability of 0.5.
7. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 2, wherein step S3 is specifically as follows: S31: carry out multiscale feature extraction and contextual spatial information integration on the input facial expression image through the contextual convolution residual network CoResNet; S32: embed an attention module in each contextual convolution residual module to pay attention to the salient scale features, and use coordinate attention to weight the extracted features for the CoResNet output features to capture the correlation of expression information in two spatial directions and keep the key area information of the face; S33: down-sample the attention-weighted features and classify the down-sampled features.
8. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 7, wherein, in step S3, the attention-modulated contextual spatial information (ACSI) network model comprises a convolution layer, a BN layer, a ReLU layer, a max-pooling layer, four contextual convolution residual modules, a coordinate attention (CA) module, a global average pooling layer, an FC layer, and a softmax classification layer which are sequentially connected; the convolution layer performs a 3×3 standard convolution on the input facial expression image to extract features; the BN layer batch-normalizes the extracted features to prevent the gradient from vanishing or exploding; the ReLU layer then applies nonlinear activation; the max-pooling layer performs feature dimension reduction; the four contextual convolution residual modules extract multiscale facial features from the reduced-dimension features; the CA module embedded in each contextual convolution module attends to features at different scales; the CA module behind the CoResNet output feature layer performs attention weighting on the output features; the global average pooling layer and the FC layer downsample the attention-weighted features, and the downsampled facial expression features are classified by the softmax classifier.
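A minimal end-to-end sketch of the layer sequence recited in claim 8, reusing the two module sketches above; the channel width, the use of a single block per stage, and num_classes=7 are assumptions, not the patented configuration.

```python
import torch.nn as nn

class ACSI(nn.Module):
    """Sketch of the ACSI layer sequence of claim 8: conv -> BN -> ReLU ->
    max-pool -> four contextual residual modules -> CA -> global average
    pooling -> FC -> softmax."""
    def __init__(self, num_classes=7, channels=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1, bias=False),  # 3x3 convolution layer
            nn.BatchNorm2d(channels),                                      # BN layer
            nn.ReLU(inplace=True),                                         # ReLU layer
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),              # max-pooling layer
        )
        self.stages = nn.Sequential(
            *[ContextualResidualBlock(channels) for _ in range(4)]         # four contextual modules
        )
        self.ca = CoordinateAttention(channels)     # CA module on the CoResNet output features
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling layer
        self.fc = nn.Linear(channels, num_classes)  # FC layer

    def forward(self, x):
        x = self.stem(x)
        x = self.stages(x)
        x = self.ca(x)
        x = self.gap(x).flatten(1)
        return self.fc(x).softmax(dim=1)            # softmax classification layer
```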
9. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 8, wherein, the input of the softmax classifier is an arbitrary real-valued vector and the output is a vector in which the value of each element lies in (0, 1) and the elements sum to 1; the calculation formula of softmax is formula (9):
softmax(x_i) = e^{x_i} / Σ_{j=1}^{I} e^{x_j} (9);
where x_i represents the i-th element, softmax(x_i) represents the output value of the i-th element, and I represents the number of elements, that is, the number of classification categories; through the softmax function, the multi-class output values are converted into a probability distribution with range [0, 1] and sum 1.
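Formula (9) in a few lines of NumPy; the logits are made-up values for illustration, and the max-subtraction is a standard numerical-stability step not recited in the claim.

```python
import numpy as np

def softmax(x):
    """Formula (9): exponentiate each element and normalize so the outputs sum to 1."""
    e = np.exp(x - np.max(x))   # max subtraction only for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0, 0.0, 1.2, -0.3, 0.8])  # illustrative 7-class scores
probs = softmax(logits)
print(probs.round(3), probs.sum())  # each element in (0, 1); the sum is 1.0
```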
10. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 8, wherein, in step S3, before training the attention-modulated contextual spatial information network with the facial expression datasets, the large-scale face dataset MS-CELEB-1M, containing more than 10 million images, is used as a training set to pre-train ACSI; then the facial expression datasets AffectNet-7 and RAF-DB are respectively input into the pre-trained ACSI model, the output values are obtained through forward propagation, and the loss value of the ACSI model is calculated from the output values by using the cross-entropy loss function; the calculation formula of the cross-entropy loss function is shown in formula (10):
loss = -(1/N) · Σ_{i=1}^{N} p(x_i) · log(q(x_i)) (10);
where p(x_i) refers to the real class probability and q(x_i) is the predicted class probability of the model; in step S4, according to the loss value of the ACSI model calculated by formula (10), the network weights are updated by back propagation, and the training is repeated until the set number of training iterations is reached, to obtain the trained attention-modulated contextual spatial information network, namely the ACSI model.
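A hedged training-loop sketch for claim 10 and step S4: forward propagation, cross-entropy loss per formula (10), back propagation, and weight updates repeated for a set number of epochs. The Adam optimizer, learning rate, epoch counts, and loader names are assumptions; NLLLoss on log-probabilities is used because the ACSI sketch above already ends in a softmax layer.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs, lr=1e-3, device="cpu"):
    """Sketch of the training procedure of claim 10 / step S4."""
    model.to(device)
    criterion = nn.NLLLoss()   # on log-probabilities this matches formula (10) for one-hot p(x)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                      # repeat until the set training times are reached
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            probs = model(images)                # forward propagation -> predicted q(x)
            loss = criterion(torch.log(probs + 1e-12), labels)   # formula (10)
            optimizer.zero_grad()
            loss.backward()                      # back propagation
            optimizer.step()                     # update the network weights
    return model

# Usage sketch (pretrain_loader and affectnet_loader are hypothetical DataLoaders):
# model = train(ACSI(num_classes=7), pretrain_loader, epochs=10)   # pre-training
# model = train(model, affectnet_loader, epochs=40)                # fine-tuning
```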
LU503919A 2022-03-29 2023-02-01 Facial expression recognition method based on attention-modulated contextual spatial information LU503919B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210317680.4A CN114758383A (en) 2022-03-29 2022-03-29 Expression recognition method based on attention modulation context spatial information

Publications (1)

Publication Number Publication Date
LU503919B1 true LU503919B1 (en) 2023-10-06

Family

ID=82326864

Family Applications (1)

Application Number Title Priority Date Filing Date
LU503919A LU503919B1 (en) 2022-03-29 2023-02-01 Facial expression recognition method based on attention-modulated contextual spatial information

Country Status (3)

Country Link
CN (1) CN114758383A (en)
LU (1) LU503919B1 (en)
WO (1) WO2023185243A1 (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758383A (en) * 2022-03-29 2022-07-15 河南工业大学 Expression recognition method based on attention modulation context spatial information
CN116311105B (en) * 2023-05-15 2023-09-19 山东交通学院 Vehicle re-identification method based on inter-sample context guidance network
CN116758621B (en) * 2023-08-21 2023-12-05 宁波沃尔斯软件有限公司 Self-attention mechanism-based face expression depth convolution identification method for shielding people
CN117041601B (en) * 2023-10-09 2024-01-12 海克斯康制造智能技术(青岛)有限公司 Image processing method based on ISP neural network model
CN117055740A (en) * 2023-10-13 2023-11-14 福建省东南造物科技有限公司 Digital screen glasses adopting air non-inductive interaction technology and application method of digital screen glasses
CN117523267A (en) * 2023-10-26 2024-02-06 北京新数科技有限公司 Small target detection system and method based on improved YOLOv5
CN117437519B (en) * 2023-11-06 2024-04-12 北京市智慧水务发展研究院 Water level identification method and device for water-free ruler
CN117496243B (en) * 2023-11-06 2024-05-31 南宁师范大学 Small sample classification method and system based on contrast learning
CN117197727B (en) * 2023-11-07 2024-02-02 浙江大学 Global space-time feature learning-based behavior detection method and system
CN117235604A (en) * 2023-11-09 2023-12-15 江苏云幕智造科技有限公司 Deep learning-based humanoid robot emotion recognition and facial expression generation method
CN117252488B (en) * 2023-11-16 2024-02-09 国网吉林省电力有限公司经济技术研究院 Industrial cluster energy efficiency optimization method and system based on big data
CN117649579A (en) * 2023-11-20 2024-03-05 南京工业大学 Multi-mode fusion ground stain recognition method and system based on attention mechanism
CN117612024B (en) * 2023-11-23 2024-06-07 国网江苏省电力有限公司扬州供电分公司 Remote sensing image roof recognition method based on multi-scale attention
CN117671357B (en) * 2023-12-01 2024-07-05 广东技术师范大学 Pyramid algorithm-based prostate cancer ultrasonic video classification method and system
CN117423020B (en) * 2023-12-19 2024-02-27 临沂大学 Dynamic characteristic and context enhancement method for detecting small target of unmanned aerial vehicle
CN117746503B (en) * 2023-12-20 2024-07-09 大湾区大学(筹) Face action unit detection method, electronic equipment and storage medium
CN117576765B (en) * 2024-01-15 2024-03-29 华中科技大学 Facial action unit detection model construction method based on layered feature alignment
CN117668669B (en) * 2024-02-01 2024-04-19 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improvement YOLOv (YOLOv)
CN117676149B (en) * 2024-02-02 2024-05-17 中国科学技术大学 Image compression method based on frequency domain decomposition
CN117809318B (en) * 2024-03-01 2024-05-28 微山同在电子信息科技有限公司 Oracle identification method and system based on machine vision
CN117894058B (en) * 2024-03-14 2024-05-24 山东远桥信息科技有限公司 Smart city camera face recognition method based on attention enhancement
CN117893975B (en) * 2024-03-18 2024-05-28 南京邮电大学 Multi-precision residual error quantization method in power monitoring and identification scene
CN117912086B (en) * 2024-03-19 2024-05-31 中国科学技术大学 Face recognition method, system, equipment and medium based on broadcast-cut effect driving
CN117935060B (en) * 2024-03-21 2024-05-28 成都信息工程大学 Flood area detection method based on deep learning
CN117934338B (en) * 2024-03-22 2024-07-09 四川轻化工大学 Image restoration method and system
CN118015687B (en) * 2024-04-10 2024-06-25 齐鲁工业大学(山东省科学院) Improved expression recognition method and device for multi-scale attention residual relation perception
CN118135496A (en) * 2024-05-06 2024-06-04 武汉纺织大学 Classroom behavior identification method based on double-flow convolutional neural network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6788264B2 (en) * 2016-09-29 2020-11-25 国立大学法人神戸大学 Facial expression recognition method, facial expression recognition device, computer program and advertisement management system
CN111325108B (en) * 2020-01-22 2023-05-26 中能国际高新科技研究院有限公司 Multitasking network model, using method, device and storage medium
CN111797683A (en) * 2020-05-21 2020-10-20 台州学院 Video expression recognition method based on depth residual error attention network
CN113627376B (en) * 2021-08-18 2024-02-09 北京工业大学 Facial expression recognition method based on multi-scale dense connection depth separable network
CN114758383A (en) * 2022-03-29 2022-07-15 河南工业大学 Expression recognition method based on attention modulation context spatial information

Also Published As

Publication number Publication date
WO2023185243A1 (en) 2023-10-05
CN114758383A (en) 2022-07-15

Similar Documents

Publication Publication Date Title
LU503919B1 (en) Facial expression recognition method based on attention-modulated contextual spatial information
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN108830157A (en) Human bodys&#39; response method based on attention mechanism and 3D convolutional neural networks
Wang et al. NAS-guided lightweight multiscale attention fusion network for hyperspectral image classification
CN113239784B (en) Pedestrian re-identification system and method based on space sequence feature learning
CN110378208B (en) Behavior identification method based on deep residual error network
CN105913053B (en) A kind of facial expression recognizing method for singly drilling multiple features based on sparse fusion
CN113989890A (en) Face expression recognition method based on multi-channel fusion and lightweight neural network
CN111460980A (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
Ming et al. 3D-TDC: A 3D temporal dilation convolution framework for video action recognition
CN115171052B (en) Crowded crowd attitude estimation method based on high-resolution context network
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Tur et al. Evaluation of hidden markov models using deep cnn features in isolated sign recognition
CN115546500A (en) Infrared image small target detection method
CN115965864A (en) Lightweight attention mechanism network for crop disease identification
Liu et al. Lightweight ViT model for micro-expression recognition enhanced by transfer learning
Yawalkar et al. Automatic handwritten character recognition of Devanagari language: a hybrid training algorithm for neural network
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
Zhang et al. A novel CapsNet neural network based on MobileNetV2 structure for robot image classification
CN112348007B (en) Optical character recognition method based on neural network
CN111898479B (en) Mask wearing recognition method and device based on full convolution single-step target detection algorithm
Yu et al. Multimodal co-attention mechanism for one-stage visual grounding
CN115223220B (en) Face detection method based on key point regression
CN110569928A (en) Micro Doppler radar human body action classification method of convolutional neural network

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20231006