CN117133035A - Facial expression recognition method and system and electronic equipment - Google Patents

Facial expression recognition method and system and electronic equipment

Info

Publication number
CN117133035A
CN117133035A (application CN202311089347.3A)
Authority
CN
China
Prior art keywords
features
global
local
feature
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311089347.3A
Other languages
Chinese (zh)
Inventor
陈靓影
文仲淳
徐如意
杨宗凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University
Priority to CN202311089347.3A
Publication of CN117133035A
Legal status: Pending

Classifications

    • G06V40/174 Facial expression recognition
    • G06V40/168 Feature extraction; Face representation
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/806 Fusion of extracted features
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/0499 Feedforward networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a facial expression recognition method, system and electronic equipment. The facial expression recognition method comprises the following steps: preprocessing the facial sample image and extracting the facial feature point coordinates of the user; acquiring local features and global features with a convolutional neural network backbone; fusing the aggregated global features with each local feature through cross attention module 1 to obtain enhanced local features; processing the enhanced local features with a graph convolutional neural network to extract structured semantic information of the user's face; fusing the structured facial semantics into each global feature through cross attention module 2 to obtain structured-semantics-enhanced global features; further encoding the structured-semantics-enhanced global features with a visual self-attention model; and supervising the training of the whole model with a cross-entropy loss. The invention uses the graph convolutional neural network to enhance the feature representation capability of the visual self-attention model, and is applied to expression recognition tasks in natural scenes.

Description

Facial expression recognition method and system and electronic equipment
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a facial expression recognition method, a facial expression recognition system and electronic equipment.
Background
Facial expression recognition is the task of extracting and classifying facial expression features in images or videos using computer vision and deep learning technologies. Facial expressions are one of the main ways in which humans express their emotions. Through expression recognition, intrinsic emotional states such as happiness, sadness, disgust and anger can be obtained. In recent years, facial expression recognition has been explored and applied in a variety of fields on an unprecedented scale. For example, in the field of human-computer interaction research, the role of emotional interaction is particularly important. A system can determine its next interactive behavior by recognizing the emotional fluctuations of the user, thereby achieving a more humanized and intelligent interactive experience. Meanwhile, facial expression recognition is widely used for emotion monitoring. By analyzing people's facial expression changes in real time, the emotional changes over a period of time can be extracted, providing an important basis for mental health assessment and the diagnosis of physical diseases, and promoting the development of automatic early detection of psychological disorders and intelligent psychological counseling intervention.
The graph convolutional neural network is an effective method for solving problems in facial expression recognition. By using facial feature points as nodes and connecting the nodes in the order of the facial parts, a geometric facial feature point graph can be constructed. Performing graph convolution on this geometric facial feature point graph aggregates the information of neighboring nodes, so that each node becomes geometry-aware and its semantic expression capability is improved. Compared with the traditional convolutional neural network, the visual self-attention model can directly gather global information of the image through the calculation of the self-attention mechanism, which is more accurate and efficient in establishing long-distance relations. In expression recognition, the multi-head mechanism in the self-attention model not only performs parallel matrix operations to reduce time consumption, but also encodes features in multiple subspaces; through global semantic association learning, the model can capture richer and more complex features.
Although visual self-attention models have achieved significant advances in facial expression recognition performance, many problems remain. First, the visual self-attention model cannot explicitly explore the geometric structural semantics of a human face, and cannot extract the geometric association representations that are helpful for facial expression recognition. Second, the visual self-attention model does not explore the semantic associations between local features and global features, especially the local-global semantic associations around the facial organs where expressions are concentrated. Meanwhile, since the graph structure of the graph convolutional neural network is manually set and fixed, it cannot explore higher-level semantic associations between nodes or adaptively establish new node connections. For the above reasons, existing methods still lack robust and discriminative feature extraction capabilities.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a facial expression recognition method, a facial expression recognition system and electronic equipment, and aims to solve the problem of poor facial expression recognition effect in the prior art.
To achieve the above object, in a first aspect, the present invention provides a facial expression recognition method, comprising the steps of:
Acquiring a facial image of a user, and extracting facial feature points of the user;
extracting shallow features and deep features of a facial image of a user based on a convolutional neural network, cutting out corresponding local feature blocks in the shallow features by taking each facial feature point as a center to obtain a plurality of local features of the facial image, and cutting and projecting the deep features to obtain a plurality of global features of the facial image;
taking the similarity of the global features and each local feature as a corresponding attention weight, and then combining the corresponding attention weights to adaptively fuse a plurality of global features into each local feature respectively to obtain a plurality of enhanced local features;
carrying out graph convolution on feature point diagrams corresponding to facial feature points by combining a plurality of enhanced local features based on a graph convolution neural network, and extracting a plurality of structural semantic features of the face of the user;
the similarity between the structural semantic features and the global features is used as a corresponding attention weight, then a plurality of structural semantic features are respectively fused into each global feature by combining the corresponding attention weight, and the region with high association degree with facial expression recognition is enhanced in the global features to obtain enhanced global features;
The enhanced global features are further aggregated and recognized based on the visual self-attention model to obtain the facial expression of the user.
In one possible implementation, the shallow and deep features of the user's facial image are extracted over a ResNet-18 network or an IR-50 network.
In one possible implementation, a local sub-block of a preset size is cut out on the shallow features centered on each user facial feature point, giving N local sub-blocks of size C×H_local×W_local, where N represents the number of user facial feature points, C represents the number of channels, and H_local and W_local represent the size of the cropped local sub-blocks; finally, the final local features X_local ∈ ℝ^(N×d_local) are obtained through feature flattening and feature mapping, where d_local represents the feature vector dimension of each local feature.
The deep features output by the convolutional neural network are evenly segmented, and the global features X_global are then obtained after projection through a fully connected layer, with one global feature per segmented sub-block, where h and w represent the height and width of each segmented sub-block and d_global represents the number of channels of each global feature after projection.
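As a purely illustrative sketch (not the patented implementation), the following PyTorch-style code shows one way the landmark-centered cropping of the shallow features and the splitting/projection of the deep features could look; the class and argument names (crop_size, patch, d_local, d_global) and the omission of the batch dimension in the cropping helper are assumptions made for brevity.

```python
import torch
import torch.nn as nn

def crop_local_features(shallow, points, crop_size=5):
    """Crop a crop_size x crop_size window around each facial feature point.
    shallow: (C, H, W) shallow feature map; points: (N, 2) (x, y) assumed already
    scaled to the shallow feature-map resolution."""
    C, H, W = shallow.shape
    half = crop_size // 2
    crops = []
    for x, y in points.long().tolist():
        x = min(max(x, half), W - half - 1)            # keep the window inside the map
        y = min(max(y, half), H - half - 1)
        crops.append(shallow[:, y - half:y + half + 1, x - half:x + half + 1])
    return torch.stack(crops)                          # (N, C, crop_size, crop_size)

class TokenBuilder(nn.Module):
    """Flatten/project cropped patches into local tokens; split the deep map into global tokens."""
    def __init__(self, c_shallow=64, c_deep=512, crop_size=5, patch=2, d_local=256, d_global=256):
        super().__init__()
        self.patch = patch
        self.local_proj = nn.Linear(c_shallow * crop_size * crop_size, d_local)
        self.global_proj = nn.Linear(c_deep * patch * patch, d_global)

    def forward(self, crops, deep):
        # crops: (N, C, crop, crop) -> flatten each crop and map to d_local dimensions
        x_local = self.local_proj(crops.flatten(1))
        # deep: (C, H, W) -> non-overlapping patch x patch sub-blocks -> global tokens
        C, H, W = deep.shape
        blocks = deep.unfold(1, self.patch, self.patch).unfold(2, self.patch, self.patch)
        blocks = blocks.permute(1, 2, 0, 3, 4).reshape(-1, C * self.patch * self.patch)
        x_global = self.global_proj(blocks)            # (num_sub_blocks, d_global)
        return x_local, x_global
```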
In a possible embodiment, the enhanced local features X′_local are obtained as:
G2L_Attention = softmax((X_local·w)(X_global·w)ᵀ/√d)
X′_local = G2L_Attention ⊗ (X_global·w)
where d is the channel dimension of the feature vectors, softmax(·) is the activation function used to normalize the attention and speed up training convergence, X_local denotes the local features, X_global denotes the global features, w denotes projecting the features with fully connected layers, ⊗ denotes matrix multiplication, and G2L_Attention denotes the attention weights of the global features with respect to the local features.
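For illustration, a minimal single-head cross-attention module consistent with the formula above is sketched below; whether the patented method uses a shared projection w or separate projections, multiple heads, or a residual connection is not specified here, so those choices are assumptions.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Fuse 'source' tokens into 'target' tokens: the target attends to the source."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)

    def forward(self, target, source):
        # target: (N_t, d), source: (N_s, d)
        q, k, v = self.w_q(target), self.w_k(source), self.w_v(source)
        attn = torch.softmax(q @ k.t() / math.sqrt(q.size(-1)), dim=-1)   # (N_t, N_s)
        return attn @ v                                                    # enhanced target tokens

# Example: G2L fusion of 49 global tokens into 51 landmark-local tokens (shapes are illustrative).
x_local = torch.randn(51, 256)
x_global = torch.randn(49, 256)
g2l = CrossAttention(dim=256)
x_local_enh = g2l(x_local, x_global)    # enhanced local features X'_local, shape (51, 256)
```

The same module can be reused for the later L2G fusion by swapping the roles of the two inputs.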
In a possible embodiment, the structured semantic features Z_local are obtained as follows:
a graph model G = (v, e) is constructed from geometric prior knowledge of the user's facial feature points, where v denotes the nodes and e denotes the edges; each feature point sub-block x_i ∈ v is taken as one node of the graph model, and the edge a_ij ∈ e between any two nodes is initialized using the geometric relations of the user facial feature points, giving the adjacency matrix A_ij;
the enhanced local features X′_local and the adjacency matrix A_ij are input into the graph convolutional neural network, which aggregates and enhances the enhanced local features to obtain the structured semantic features Z_local of the face; the adjacency matrix is a learnable parameter of the graph convolutional neural network, so the network adaptively searches for associations between local areas of the user's face according to the target task, weakens connections between unrelated areas, and adds edge connections between semantically associated local areas, so as to extract the corresponding structured semantic features.
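A rough sketch of a two-layer graph convolution with a learnable adjacency matrix, matching the description above; the row normalization, ReLU non-linearities and layer widths are assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class LearnableGCN(nn.Module):
    """Two-layer graph convolution over the N landmark nodes with a learnable adjacency."""
    def __init__(self, dim, adj_init):
        super().__init__()
        self.adj = nn.Parameter(adj_init.clone())        # learnable adjacency, updated by gradient descent
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def normalized_adj(self):
        a = torch.relu(self.adj) + torch.eye(self.adj.size(0), device=self.adj.device)
        d = a.sum(-1, keepdim=True).clamp(min=1e-6)
        return a / d                                      # row-normalized aggregation weights

    def forward(self, x_local_enh):                       # (N, dim) enhanced local features X'_local
        a = self.normalized_adj()
        h = torch.relu(self.fc1(a @ x_local_enh))         # first aggregation + transform
        z_local = self.fc2(a @ h)                         # structured semantic features Z_local
        return z_local
```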
In a possible implementation, the enhanced global features X′_global are obtained as:
L2G_Attention = softmax((X_global·w)(Z_local·w)ᵀ/√d)
X′_global = L2G_Attention ⊗ (Z_local·w)
where d is the channel dimension of the feature vectors, softmax(·) is the activation function used to normalize the attention and speed up training convergence, Z_local denotes the structured semantic features, X_global denotes the global features, w denotes projecting the features with fully connected layers, ⊗ denotes matrix multiplication, and L2G_Attention denotes the attention weights of the structured semantic features with respect to the global features.
In one possible implementation, the visual self-attention model is composed of M encoders, each encoder comprising: multi-head self-attention (MSA) with skip connections and a multi-layer perceptron (MLP);
the enhanced global features are further aggregated based on a visual self-attention model, specifically:
MSA operation part: first, the enhanced global features X′_global are linearly projected into queries q, keys k, and values v, as follows:
[q, k, v] = X′_global[w_q, w_k, w_v]
where w_q, w_k and w_v are fully connected projection layers; the similarity between the features is explored by projecting them into subspaces through these fully connected layers; d_k and d_v are the numbers of channels after projection.
Second, the self-attention weights Z_global are calculated from the linear projection results, and all values are weighted and summed to obtain the output Z′_global:
Z_global = softmax(q·kᵀ/√d_k)
Z′_global = Z_global·v
MLP operation part: Z′_global is input into the multi-layer perceptron, which consists of a two-layer feedforward neural network with a ReLU activation function; linear mapping and nonlinear activation further improve the semantics of the feature Z′_global.
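A compact sketch of one such encoder block (MSA with skip connection followed by a two-layer MLP with ReLU); the number of heads, the hidden width and the use of layer normalization are assumptions.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One visual self-attention encoder: MSA with skip connection, then a two-layer MLP."""
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.ReLU(),                                         # ReLU non-linearity as in the description
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                      # x: (B, tokens, dim) enhanced global features
        h = self.norm1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]       # skip connection around MSA
        x = x + self.mlp(self.norm2(x))                        # skip connection around MLP
        return x
```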
In one possible implementation, the facial expression of the user is identified, specifically:
The features output by the visual self-attention model are input into a fully connected layer for mapping, mapped to the output heads of the corresponding expression categories, and finally normalized by a softmax function to obtain the probability distribution over the facial expression categories.
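A minimal sketch of this classification head; the feature dimension and the 7-way output (matching the RAF-DB experiment described later) are assumptions.

```python
import torch
import torch.nn as nn

class ExpressionHead(nn.Module):
    """Map the feature output by the last encoder to expression-class probabilities."""
    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, feature):                 # (B, dim) feature from the self-attention model
        logits = self.fc(feature)
        return torch.softmax(logits, dim=-1)    # probability distribution over expression categories
```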
In a second aspect, the present invention provides a facial expression recognition system, comprising:
a facial feature point extraction unit for acquiring a facial image of a user and extracting facial feature points of the user;
the facial feature determining unit is used for extracting shallow features and deep features of the facial image of the user based on the convolutional neural network, cutting out corresponding local feature blocks in the shallow features by taking each facial feature point as a center to obtain a plurality of local features of the facial image, and cutting and projecting the deep features to obtain a plurality of global features of the facial image;
the facial feature enhancement unit is used for taking the similarity between the global features and each local feature as corresponding attention weights, and then combining the corresponding attention weights to adaptively fuse a plurality of global features into each local feature respectively to obtain a plurality of enhanced local features; carrying out graph convolution on feature point diagrams corresponding to facial feature points by combining a plurality of enhanced local features based on a graph convolution neural network, and extracting a plurality of structural semantic features of the face of the user; the similarity between the structural semantic features and the global features is used as a corresponding attention weight, and then a plurality of structural semantic features are respectively fused into each global feature by combining the corresponding attention weights, so that a region with high association degree with facial expression recognition is enhanced in the global features, and the enhanced global features are obtained;
and a facial expression recognition unit for further aggregating and recognizing the enhanced global features based on the visual self-attention model to obtain the facial expression of the user.
In a third aspect, the present invention provides an electronic device comprising: at least one memory for storing a program; at least one processor for executing a memory-stored program, which when executed is adapted to carry out the method of the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, the invention provides a computer readable storage medium storing a computer program which, when run on a processor, causes the processor to perform the method described in the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, the invention provides a computer program product which, when run on a processor, causes the processor to perform the method described in the first aspect or any one of the possible implementations of the first aspect.
In general, the above technical solutions conceived by the present invention have the following beneficial effects compared with the prior art:
The invention provides a facial expression recognition method, system and electronic equipment, which fuse the facial structured semantic information extracted by a graph convolutional neural network into the embeddings of a visual self-attention model, thereby improving the representation capability of the visual self-attention model. The invention uses the prior information of the facial feature points to construct a topological graph, and uses a graph convolutional neural network to aggregate the local areas rich in facial expression information and extract facial structured semantic features; this compensates for the visual self-attention model's lack of inductive bias toward local facial areas and improves the characterization capability of the global features. The invention provides a bidirectionally coupled structure of a graph convolutional neural network and a visual self-attention model: the global features improve the robustness of the local features and promote the exploration of correlations between local areas, while the local features supplement the discriminative information lacking in the global features, improving the representation capability and thus the facial expression recognition accuracy.
The invention provides a facial expression recognition method, system and electronic equipment in which the local features guided by the facial feature points are combined with the global features provided by the encoders of the visual self-attention model as the inputs of the graph convolutional neural network, reducing the noise interference carried by the local features and improving the robustness of the facial structured semantics. The invention dynamically updates the adjacency matrix of the graph convolutional neural network through initialization before training and a regularization loss during training. While the inherent structural relations between the facial parts are preserved, the weights in the adjacency matrix are dynamically updated through gradient descent, which improves the adaptability of the graph convolutional neural network to each encoder block and thus the recognition accuracy of facial expressions.
Drawings
Fig. 1 is a flowchart of a facial expression recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for facial expression recognition of an enhanced visual self-attention model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a graphical model provided by an example of the present invention;
FIG. 4 is a schematic diagram of a facial expression recognition model according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a facial expression recognition system according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a facial expression recognition method, a facial expression recognition system and electronic equipment, which belong to the technical field of computer vision, and the method comprises the following steps: inputting the pre-identified facial image into a visual self-attention model enhanced by a graph convolution neural network to perform facial expression identification; the training method comprises the following steps: preprocessing the face sample image, and extracting the coordinates of the feature points of the face; acquiring local features and global features by adopting a convolutional neural network backbone; fusing the aggregated global features with each local feature by adopting a cross attention module 1 to obtain enhanced local features; processing the enhanced local features by adopting a graph convolution neural network, and extracting the structural semantic information of the human face; the cross attention module 2 is adopted to fuse the facial structuring semantics into each global feature, so as to obtain global features of facial structuring semantic enhancement; adopting a visual self-attention model to further encode global features of the structural semantic enhancement of the human face; the whole model training is supervised with cross entropy loss. The invention utilizes the graph convolution neural network to enhance the characteristic representation capability of the visual self-attention model, and is used for solving the expression recognition task in a natural scene.
Fig. 1 is a flowchart of a facial expression recognition method according to an embodiment of the present invention; as shown in fig. 1, the method comprises the following steps:
s11, acquiring a facial image of a user, and extracting facial feature points of the user;
s12, extracting shallow features and deep features of a facial image of a user based on a convolutional neural network, cutting out corresponding local feature blocks in the shallow features by taking each facial feature point as a center to obtain a plurality of local features of the facial image, and cutting and projecting the deep features to obtain a plurality of global features of the facial image;
s13, taking the similarity of the global features and each local feature as a corresponding attention weight, and then combining the corresponding attention weights to adaptively fuse a plurality of global features into each local feature respectively to obtain a plurality of enhanced local features;
s14, carrying out graph convolution on feature point diagrams corresponding to facial feature points by combining a plurality of enhanced local features based on a graph convolution neural network, and extracting a plurality of structural semantic features of the face of the user;
s15, taking the similarity of the structural semantic features and the global features as the corresponding attention weights, then combining the corresponding attention weights to respectively fuse a plurality of structural semantic features into each global feature, and enhancing the region with high association degree with facial expression recognition in the global features to obtain enhanced global features;
S16, further aggregating and recognizing the enhanced global features based on the visual self-attention model to obtain the facial expression of the user.
Specifically, preprocessing a facial sample image, and extracting facial feature points of a user; the convolution neural network extracts local features and global features; the global features are fused into the local features through the cross attention module 1, so that enhanced local features are obtained; constructing a graph convolution neural network to extract the structural semantics of the face of the user; the local features are fused into global features through the cross attention module 2, so that the global features of the structural semantic enhancement of the face of the user are obtained; and constructing a visual self-attention model coding global feature, inputting the feature into a full-connection layer for classification, and obtaining class probability output of each sample to finish the facial expression recognition task of the user.
Further, the user facial feature points refer to a set of key points containing key point position information; they are obtained by feeding the preprocessed user facial image into a pre-trained facial feature point detector, whose normalized outputs are then scaled back to the image size.
As shown in fig. 2, an embodiment of the present invention provides a facial expression recognition method for enhancing a visual self-attention model, including the steps of:
S101: preprocessing the data set sample, and extracting facial feature points of the user.
With the collection of natural-scene datasets, facial expression recognition faces two formidable challenges. The first is intra-class variability: even when two facial images are labeled with the same expression class, they may be difficult to match accurately even by the human eye. On the one hand, image blurring, illumination changes, head pose changes, and local occlusions can make the facial information blurred or incomplete. On the other hand, because of differences in race, age, gender and individual facial characteristics, the specific expression pattern of each person varies even when the same expression is expressed. The second is inter-class similarity: only tiny differences may exist between facial images of different expression classes, making them difficult to distinguish accurately. For example, the specific facial appearance may be similar between the anger and surprise categories; whether angry or surprised, similar expressive cues such as widely opened eyes or an open mouth may appear. For these two problems, not only the global appearance of the whole facial image should be considered, but attention also needs to be paid to some fine-grained local areas; for example, the appearance near the facial parts often changes with the expression. Taking these areas into account helps to identify expressions.
Further, the dataset samples are scaled to a size of 112×112, and a pre-trained user facial feature point detector, MobileFaceNet, is then used. The scaled samples are input into the MobileFaceNet detector to obtain 51 normalized user facial feature point positions, and the feature point positions are scaled up according to the sample size to obtain the position of each user facial feature point in the sample image.
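An illustrative sketch of this preprocessing step, assuming the detector returns 51 landmarks normalized to [0, 1]; the `detector` callable stands in for MobileFaceNet and is not an import from any specific library.

```python
import cv2
import numpy as np

def extract_landmarks(image, detector, size=112):
    """Resize the sample to 112x112, run the landmark detector, and rescale the
    51 normalized landmark coordinates back to pixel positions in the resized sample."""
    sample = cv2.resize(image, (size, size))
    norm_points = detector(sample)                       # assumed: (51, 2) array of (x, y) in [0, 1]
    points = np.asarray(norm_points, dtype=np.float32) * size
    return sample, points                                # image and landmark positions in pixels
```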
S102: and extracting shallow layer features and deep layer features of each sample in the data sample set, and further processing to obtain corresponding local features and global features.
Further, the feature extraction of the sample preferably takes one of the following two ways;
a. feature extraction method using ResNet-18 network
ResNet-18 is a convolutional neural network architecture that addresses the vanishing-gradient problem in deep neural networks, allowing deeper networks to be trained without performance degradation. The ResNet-18 architecture contains 18 layers in total, including 17 convolutional layers and 1 fully connected layer. It follows a building-block structure in which residual connections are introduced so that information can flow from one layer to another while skipping some layers. This helps to alleviate the vanishing-gradient problem and enables the network to learn more effective representations.
The network structure of ResNet-18 is divided into the following parts. First, an input layer receives the pixel values of the input image; this is followed by a convolution layer with a 7×7 kernel and a stride of 2, used to reduce the size of the input image. This layer has 64 filters and is followed by batch normalization and a ReLU activation function; a 3×3 max pooling layer with stride 2 then further reduces the feature map size. The core of ResNet-18 is the design of the residual blocks; each residual block contains two convolutional layers with 3×3 filters, batch normalization and ReLU activation functions. The outputs of the two convolutional layers are added to the original input of the residual block, creating a shortcut connection. This addition operation is called a skip connection or identity shortcut. The purpose of the skip connection is to ensure that information from earlier layers is preserved and can flow directly to later layers, making it easier for the network to learn the residual mapping.
The expression is as follows:
H(x) = BN(conv2(relu(BN(conv1(x)))))
F(x) = H(x) + x
where conv1(x) and conv2(x) both represent convolution operations; BN(x) represents a batch normalization operation that normalizes the mean and variance of each feature map; relu(x) represents the rectified linear unit activation function, which sets negative values to 0 and keeps positive values unchanged; H(x) represents the residual mapping and F(x) represents the entire operation of the residual block.
Further preferably, a ResNet-18 network that has already been pre-trained on the public face dataset Ms-Celeb-1M can be used, transferring the model to the facial expression recognition task; because the pre-trained parameters are already good, the training process only requires fine-tuning to extract better features. When a sample image is input into the ResNet-18 network, shallow features with a size of 64 x 32 and deep features with a size of 512 x 4 are extracted from the first residual block and the last residual block of the network, respectively.
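A sketch of how the shallow and deep feature maps could be tapped from a ResNet-18 backbone, using the torchvision architecture as a stand-in for the Ms-Celeb-1M pre-trained network; the exact tap points and spatial sizes in the patent may differ from the example sizes in the comments.

```python
import torch.nn as nn
from torchvision.models import resnet18

class ResNet18Backbone(nn.Module):
    """Expose the first-residual-stage output as shallow features and the
    last-residual-stage output as deep features."""
    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)                 # load face-pretrained weights separately if available
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):                            # x: (B, 3, 112, 112) face image
        x = self.stem(x)
        shallow = self.layer1(x)                     # e.g. (B, 64, 28, 28) for a 112x112 input
        x = self.layer2(shallow)
        x = self.layer3(x)
        deep = self.layer4(x)                        # e.g. (B, 512, 4, 4)
        return shallow, deep
```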
b. Feature extraction method using IR-50 network
In this feature extraction method of the present embodiment, the network used is still pre-trained on the Ms-Celeb-1M dataset. The network structure of IR-50 is similar to ResNet-50, but with some modifications before the last fully connected layer. First, an input layer receives the input facial image; this is followed by a convolution layer, typically with 64 3×3 filters and a stride of 1, used to extract the low-level features of the image. IR-50 also has residual blocks, similar in structure to those of ResNet-50, to alleviate the vanishing-gradient problem and help the network learn better feature representations. By contrast, the residual block of IR-50 consists of three convolution layers: the first uses a 1×1 kernel to reduce the number of channels, mainly to reduce the computational cost; the second convolution layer uses a 3×3 kernel for feature extraction; and the final third convolution layer uses a 1×1 kernel to restore the number of channels. A skip connection is added between the second and third convolution layers so that information can be transferred directly without being affected by the increase in the number of layers. Shallow and deep features are again extracted from the network after the first and last residual blocks.
The features obtained by the two feature extraction methods can be used as feature vectors of the samples in the embodiment.
After the shallow features and deep features are extracted, they need to be further processed into inputs acceptable to the graph convolutional neural network and the visual self-attention model. First, using the user facial feature points extracted in S101, 5×5 local sub-blocks are cropped on the shallow features centered on each of the 51 user facial feature points, where N represents the number of user facial feature points, C represents the number of channels, and H_local and W_local represent the size of the cropped local sub-blocks. Finally, the final local features X_local, with feature vector dimension d_local, are obtained through feature flattening and feature mapping.
Specifically, the above feature flattening and feature mapping flatten the C×H_local×W_local dimensions of each sub-block into one dimension, giving an N×(C×H_local×W_local) matrix, which is then mapped through a fully connected layer so that each C×H_local×W_local-dimensional vector is mapped into d_local dimensions.
The output of the last convolutional layer is further extracted, evenly divided, and projected through a fully connected layer to obtain the global features X_global, where h and w represent the height and width of each segmented sub-block, and d_global represents the number of channels of each global feature after projection.
S103: the global features are fused with the local features through cross-attention to enhance the local features.
Because the local features are obtained by positioning and clipping according to the facial feature points of the user, the detection of the facial feature points of the user in a natural scene is often not accurate, and the local features extracted in the last step may have noise. In order to improve the robustness of the local features, the global features are fused with the local features through a cross attention mechanism, so that the characterization capability of the local features is enhanced.
At the same time, a graph convolutional neural network is embedded in front of each encoder in the visual self-attention model and should be adapted to the corresponding encoder. To semantically enhance the local features aggregated by each graph convolutional neural network and to mine potential facial associations, the attention mechanism G2L_Attention is proposed to fuse the global features X_global into the local features X_local. Because each local feature has unequal similarity with all global features, the attention mechanism is adopted to search the global features for high semantic similarity with each local feature, obtaining the enhanced local features X′_local. The formula is expressed as follows:
G2L_Attention = softmax((X_local·w)(X_global·w)ᵀ/√d)
X′_local = G2L_Attention ⊗ (X_global·w)
where d is the channel dimension of the feature vectors (the channel dimensions of the global features and local features are the same), the scaling factor √d is used to normalize the attention, w denotes projecting the features with fully connected layers, and ⊗ denotes matrix multiplication.
Specifically, there are multiple global features and multiple local features. For example, with global features A and local features B, the attention weights between A and B are computed, and each B is then fused with multiple A according to these attention weights. Each global feature may be attended to by multiple local features, and each local feature may attend differently to different global regions.
Through a cross attention mechanism, each local feature searches for a region with the largest relevance with the local feature in the global face to be fused, so that semantic representation of the local feature is enhanced, and the influence of noise is reduced.
S104: the graph convolution neural network extracts structural semantic features of the user face.
The graph convolutional neural network is an effective method for solving problems in user facial expression recognition. By using the user's facial feature points as nodes and connecting the nodes in the order of the facial parts, a geometric user facial feature point graph can be constructed. Performing graph convolution on this geometric facial feature point graph aggregates the information of neighboring nodes, so that each node becomes geometry-aware and its semantic expression capability is improved. Meanwhile, since the neighbors of each node belong to the same facial part, the graph convolutional neural network can suppress the noise introduced by inaccurate feature point localization and improve the node's ability to represent its facial part. In a facial expression, several facial parts often change simultaneously. To better explore the potential relationships between the parts of the user's face, the graph can be defined as learnable.
In this way, the graph convolution neural network can adaptively explore the relation between different facial parts according to the input node characteristics so as to obtain better geometric perception user facial characteristics. The method based on the graph convolution neural network can effectively extract the association information between the face parts and has robustness for inaccurate feature point positioning. The method can better capture subtle changes and local characteristics of the facial expression of the user, thereby improving the performance of facial expression recognition of the user.
As shown in fig. 3, a graph model G = (v, e) is constructed with the aid of geometric prior knowledge of the user's facial feature points. Each feature point sub-block x_i ∈ v is taken as one node of the graph, and the edge a_ij ∈ e between any two nodes is initialized from the geometric user facial feature point diagram through the graph initialization function s(·) shown in fig. 3.
The geometric user facial feature point diagram is used to aggregate the enhanced local features to obtain the user facial structured semantic features, while improving the robustness and semantic characterization capability of each enhanced local feature. In addition, the whole adjacency matrix is set as a learnable parameter that, like the network parameters, is updated by gradient descent, so that the graph model can adaptively search for associations between local areas of the user's face according to the target task, weaken connections between unrelated areas, and add edge connections between semantically associated local areas.
The graph convolutional neural network is arranged in two layers; one input is the enhanced local features X′_local and the other input is the adjacency matrix A_ij, and the enhanced local features are aggregated through the graph model to extract the user facial structured semantics Z_local. Meanwhile, the global features extracted by the visual self-attention model are also fed into the graph convolutional neural network, which pushes it to explore the connections between different facial areas, using the global features as intermediate points to strengthen the connections between different local areas. The enhanced local features are aggregated through semantic relationships, so that the user facial structured semantic features become more abstract and their semantics are further improved.
S105: the user face structured semantic features are fused with the global features through cross attention to obtain enhanced global features.
The invention utilizes the attention mechanism to fuse the user face structural semantic features into the global features, provides context for the global features, has stronger feature characterization capability and improves the generalization capability of the model.
Specifically, the attention mechanism L2G_Attention calculates the similarity between each global feature and all of the user facial structured semantic features using a method similar to the attention mechanism above. The generation of an expression is often strongly related to certain parts of the face, and the attention mechanism is used to fully explore and strengthen the local areas that are discriminative for expression recognition; the specific formula is:
L2G_Attention = softmax((X_global·w)(Z_local·w)ᵀ/√d)
X′_global = L2G_Attention ⊗ (Z_local·w)
where the softmax(·) function normalizes the attention and speeds up the convergence of training.
Although the self-attention model captures long-range dependencies well, the self-attention computation lacks inductive bias and positional information, so its ability to extract the structured semantic features of the user's face is weak. The invention fuses the user facial structured semantic features extracted by the graph convolutional neural network into the global features, taking the connection between them into account, which further improves the semantics of the global features and helps the visual self-attention model incorporate the connections between the structured facial semantics when computing self-attention.
S106: the visual self-attention model extracts features with excellent characterizations.
The specific feature fusion enhancement model is shown in fig. 4. Compared with the traditional convolutional neural network, the visual self-attention model can directly gather global information of images in a global scope through calculation of a self-attention mechanism, and the method is more accurate and efficient in establishing long-distance relations. Secondly, the multi-head mechanism in the self-attention model can not only perform parallel matrix operation to reduce time consumption, but also perform feature representation through the multi-head mechanism, and the self-attention modules can learn global and local relations in the image, so that the model can capture richer and more complex features.
Since the cropped sub-blocks lose the spatial relations between sub-blocks, a spatial (positional) encoding is initialized and directly added to the global features to introduce spatial information. Meanwhile, to facilitate the final classification of the features, a classification vector is introduced and spliced with the global features, and the expression features are learned from the other global feature vectors.
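A small sketch of the spatial encoding and classification vector described above, written in the usual vision-transformer style; the learnable (rather than fixed) positional embedding and the prepended class token are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn

class TokenPreparation(nn.Module):
    """Add a learnable spatial encoding to the global tokens and prepend a class token."""
    def __init__(self, num_tokens, dim=256):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))

    def forward(self, x_global):                # (B, num_tokens, dim) global features
        cls = self.cls_token.expand(x_global.size(0), -1, -1)
        x = torch.cat([cls, x_global], dim=1)   # splice the classification vector with the global features
        return x + self.pos_embed               # inject spatial information
```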
Further, the attention mechanism just described is used to integrate the structured facial features into the global features; the enhanced global features thus obtained are then input into an encoder of the visual self-attention model. The visual self-attention model consists of M encoders. Each encoder consists of multi-head self-attention (MSA) with skip connections and a multi-layer perceptron (MLP).
First, the input X′_global is linearly projected into queries q, keys k, and values v, as follows:
[q, k, v] = X′_global[w_q, w_k, w_v]
where w_q, w_k and w_v are fully connected projection layers; the similarity between features is explored by projecting them into subspaces through these fully connected layers.
Second, the self-attention weights are calculated, and all values are weighted and summed to obtain the output:
Z_global = softmax(q·kᵀ/√d_k)
Z′_global = Z_global·v
the multi-head self-attention is that the self-attention mechanism operates for k times in parallel, and the spliced features are embedded through a full connection layer to form final output. The multi-layer perceptron is then composed of two fully connected layers for feature projection and an activation function RELU for nonlinearity. The design of the multi-head self-attention mechanism is used for embedding projections into respective spaces, adaptively calculating and aggregating the similarity among the features in each subspace, improving the diversity of the features, and further focusing on the correlation among the features from multiple perspectives, thereby increasing the robustness of the model.
The total loss of the whole model is composed of a classification loss and a graph loss.
The classification loss is the cross-entropy loss commonly used in classification tasks, which measures the difference between the model's prediction in the classification task and the true label; its specific formula is:
L_cls = −Σ_i y_i·log(ŷ_i)
where ŷ_i represents the sample label distribution predicted by the fully connected layer from the output of the last encoder of the visual self-attention model, and y_i represents the true distribution of the samples.
Further, the invention designs a graph loss function, which is used to preserve, to a certain extent, the connection relations of the original geometric user facial feature points while the graph is dynamically updated, ensuring the effectiveness of the user facial structured features extracted by the graph convolutional neural network. In this loss, A_initial represents the adjacency matrix initialized according to the geometric user facial feature point diagram, A represents the dynamically updated adjacency matrix, and ⊙ represents the Hadamard product.
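Since the exact expression of the graph loss is not reproduced here, the sketch below pairs the standard cross-entropy classification loss with one plausible reading of the description (a Hadamard-product penalty that keeps the learned adjacency close to A_initial on its original edges) and a weighting coefficient lam; all of these specifics are assumptions for illustration only.

```python
import torch.nn.functional as F

def total_loss(logits, labels, adj, adj_initial, lam=0.1):
    """Classification cross-entropy plus a graph regularizer that discourages the
    dynamically updated adjacency from drifting away from the initial landmark graph."""
    cls_loss = F.cross_entropy(logits, labels)                      # supervises expression prediction
    graph_loss = ((adj_initial * (adj - adj_initial)) ** 2).sum()   # assumed form of the graph loss
    return cls_loss + lam * graph_loss
```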
In one embodiment, the RAF-DB (Real-world Affective Faces Database) expression dataset is used. This dataset contains 29672 facial images collected from the Internet and annotated by 315 annotators (university students and staff); it covers 7 expressions in total: anger, disgust, fear, happiness, sadness, surprise and neutral;
According to the invention, all facial expression pictures are selected, and experiments are carried out according to the training set and the testing set which are divided by the expression library; when the ResNet-18 network is adopted as a convolutional neural network backbone, the expression recognition accuracy is 89.93%; when the IR-50 network is used as a backbone of the convolutional neural network, the obtained expression recognition precision is 92.24%.
In another embodiment, the FER+ (Hard-Label) expression dataset is used. This dataset is an extension of the original FER dataset in which the facial expression images are re-labeled as one of 8 emotion types: neutral, happiness, surprise, sadness, anger, disgust, fear and contempt.
According to the invention, all facial expression pictures in the data set are selected for training; when the ResNet-18 network is adopted as a convolutional neural network backbone, the obtained expression recognition accuracy is 90.25%; when the IR-50 network is used as a backbone of the convolutional neural network, the obtained expression recognition accuracy is 91.92%.
By contrast, the TransFER method achieved 90.91% accuracy on the RAF-DB dataset and 90.83% accuracy on the FER+ dataset. At the same time, the POSTER method achieves 92.05% accuracy on the RAF-DB dataset and 91.62% accuracy on the FER+ dataset. Compared with the prior advanced method, the method provided by the invention can be found to obtain more excellent performance and is the highest precision result known at present.
Compared with recent expression recognition methods based mainly on the visual self-attention model, the proposed method achieves higher performance on both datasets. Even when using the same convolutional neural network backbones as the corresponding methods, it achieves the highest recognition accuracy reported so far. In summary, the proposed method is the current SOTA method for expression recognition and surpasses a series of leading methods such as TransFER and POSTER.
Fig. 5 is a schematic diagram of a facial expression recognition system according to an embodiment of the present invention; as shown in fig. 5, includes:
a facial feature point extraction unit 510 for acquiring a user's facial image and extracting facial feature points of the user;
the facial feature determining unit 520 is configured to extract shallow features and deep features of a facial image of a user based on a convolutional neural network, cut out corresponding local feature blocks in the shallow features with each facial feature point as a center, obtain a plurality of local features of the facial image, and cut and project the deep features to obtain a plurality of global features of the facial image;
the facial feature enhancement unit 530 is configured to take the similarity between the global feature and each local feature as a corresponding attention weight, and then combine the corresponding attention weights to adaptively fuse multiple global features into each local feature, so as to obtain multiple enhanced local features; carrying out graph convolution on feature point diagrams corresponding to facial feature points by combining a plurality of enhanced local features based on a graph convolution neural network, and extracting a plurality of structural semantic features of the face of the user; the similarity between the structural semantic features and the global features is used as a corresponding attention weight, and then a plurality of structural semantic features are respectively fused into each global feature by combining the corresponding attention weights, so that a region with high association degree with facial expression recognition is enhanced in the global features, and the enhanced global features are obtained;
The facial expression recognition unit 540 is configured to further aggregate and recognize the enhanced global features based on the visual self-attention model to obtain the facial expression of the user.
It should be noted that, the detailed functional implementation of each unit may refer to the description in the foregoing method embodiment, and will not be described herein.
It should be understood that, the system is used to execute the method in the foregoing embodiment, and the corresponding program element in the system performs the principle and technical effects similar to those described in the foregoing method, and the working process of the system may refer to the corresponding process in the foregoing method, which is not repeated herein.
Based on the method in the above embodiment, the embodiment of the invention provides an electronic device. The apparatus may include: at least one memory for storing programs and at least one processor for executing the programs stored by the memory. Wherein the processor is adapted to perform the method described in the above embodiments when the program stored in the memory is executed.
Based on the method in the above embodiment, the embodiment of the present invention provides a computer-readable storage medium storing a computer program, which when executed on a processor, causes the processor to perform the method in the above embodiment.
Based on the method in the above embodiments, an embodiment of the present invention provides a computer program product, which when run on a processor causes the processor to perform the method in the above embodiments.
It is to be appreciated that the processor in embodiments of the invention may be a central processing unit (central processing unit, CPU), other general purpose processor, digital signal processor (digital signal processor, DSP), application specific integrated circuit (application specific integrated circuit, ASIC), field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, transistor logic device, hardware component, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The steps of the method in the embodiments of the present invention may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (random access memory, RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present invention are merely for ease of description and are not intended to limit the scope of the embodiments of the present invention.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A facial expression recognition method, comprising the steps of:
acquiring a facial image of a user, and extracting facial feature points of the user;
extracting shallow features and deep features of a facial image of a user based on a convolutional neural network, cutting out corresponding local feature blocks in the shallow features by taking each facial feature point as a center to obtain a plurality of local features of the facial image, and cutting and projecting the deep features to obtain a plurality of global features of the facial image;
taking the similarity of the global features and each local feature as a corresponding attention weight, and then combining the corresponding attention weights to adaptively fuse a plurality of global features into each local feature respectively to obtain a plurality of enhanced local features;
carrying out graph convolution on the feature point graph corresponding to the facial feature points in combination with the plurality of enhanced local features based on a graph convolutional neural network, and extracting a plurality of structural semantic features of the face of the user;
the similarity between the structural semantic features and the global features is used as a corresponding attention weight, then a plurality of structural semantic features are respectively fused into each global feature by combining the corresponding attention weight, and the region with high association degree with facial expression recognition is enhanced in the global features to obtain enhanced global features;
the enhanced global features are further aggregated and recognized based on the visual self-attention model to obtain the facial expression of the user.
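For orientation only, the following is a minimal, shape-level walkthrough of the claimed flow in PyTorch; all dimensions (68 landmarks, 196 patches, 256-dimensional features, 7 expression classes), the random stand-in modules and the pooling step are hypothetical assumptions for illustration and are not taken from the patent.

```python
import torch
import torch.nn as nn

N, P, D, K = 68, 196, 256, 7       # assumed: 68 landmarks, 196 global patches, 256-dim features, 7 expressions

X_local = torch.randn(N, D)        # local features cropped from shallow CNN features (claim 2)
X_global = torch.randn(P, D)       # global features projected from deep CNN features (claim 2)

def attn(q, k):
    # scaled dot-product attention weights between two feature sets
    return torch.softmax(q @ k.T / D ** 0.5, dim=-1)

X_local_enh = attn(X_local, X_global) @ X_global             # G2L fusion -> enhanced local features (claim 3)
A = torch.softmax(torch.randn(N, N), dim=-1)                  # stand-in learnable landmark adjacency (claim 4)
Z_local = torch.relu(A @ X_local_enh @ torch.randn(D, D))     # one graph-convolution step (claim 4)
X_global_enh = attn(X_global, Z_local) @ Z_local              # L2G fusion -> enhanced global features (claim 5)

encoder = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)  # stand-in ViT encoder (claim 6)
aggregated = encoder(X_global_enh.unsqueeze(0)).mean(dim=1)   # aggregate the enhanced global features
probs = torch.softmax(nn.Linear(D, K)(aggregated), dim=-1)    # expression probabilities (claim 7)
print(probs.shape)                                            # torch.Size([1, 7])
```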
2. The method of claim 1, wherein local sub-blocks of a preset size, of dimension ℝ^(N×C×H_local×W_local), are cut out on the shallow features centered on each facial feature point of the user, wherein N represents the number of facial feature points of the user, C represents the number of channels, and H_local and W_local represent the size of the cut-out local sub-blocks; the final local features X_local ∈ ℝ^(N×d_local) are then obtained through feature flattening and feature mapping, where d_local represents the feature vector dimension of a local feature;
the deep features output by the convolutional neural network, of dimension ℝ^(C×H×W), are evenly partitioned and then projected by a fully connected layer to obtain the global features X_global ∈ ℝ^(((H/h)·(W/w))×d_global), wherein H and W represent the size of the deep features, h and w represent the height and width of each partitioned sub-block, and d_global represents the number of channels of each global feature after projection.
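As a rough sketch of claim 2, the snippet below crops fixed-size blocks around assumed landmark positions on a stand-in shallow feature map and projects evenly partitioned deep-feature patches to global features; the backbone layers, the 7×7 crop size and all dimensions are illustrative assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn

N, C, D = 68, 64, 256                                    # assumed landmark count, shallow channels, feature dim
shallow_stage = nn.Conv2d(3, C, 3, stride=2, padding=1)  # stand-in shallow CNN stage
deep_stage = nn.Conv2d(C, 256, 3, stride=8, padding=1)   # stand-in deep CNN stage
to_local = nn.Linear(C * 7 * 7, D)                       # flatten + map each 7x7 landmark crop
to_global = nn.Linear(256, D)                            # fully connected projection of deep patches

img = torch.randn(1, 3, 224, 224)                        # one aligned face image
landmarks = torch.randint(4, 105, (N, 2)).tolist()       # assumed (x, y) landmark positions on the shallow map

shallow = shallow_stage(img)                             # (1, C, 112, 112) shallow features
deep = deep_stage(shallow)                               # (1, 256, 14, 14) deep features

# local features: cut a 7x7 sub-block centred on every facial feature point
crops = torch.stack([shallow[0, :, y - 3:y + 4, x - 3:x + 4] for x, y in landmarks])
X_local = to_local(crops.flatten(1))                     # (N, D) after flattening and mapping

# global features: evenly partition the deep features into patches and project them
X_global = to_global(deep.flatten(2).transpose(1, 2)[0]) # (196, D)
print(X_local.shape, X_global.shape)
```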
3. The method according to claim 1, wherein the enhanced local features X′_local are computed as:

G2L_Attention = softmax((X_local·w)(X_global·w)^T / √d)

X′_local = G2L_Attention ⊗ (X_global·w)

wherein d is the channel dimension of the feature vectors, softmax(·) is the activation function used to accelerate training convergence, X_local denotes the local features, X_global denotes the global features, w denotes projection of the features with a fully connected layer, ⊗ denotes matrix multiplication, and G2L_Attention is the attention weight of the global features with respect to the local features.
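A compact sketch of the global-to-local (G2L) fusion of claim 3, assuming the standard scaled dot-product form for the attention weights; the shared projection w and the absence of a residual connection are illustrative choices rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class G2LFusion(nn.Module):
    """Fuse global features into each local feature via attention weights (sketch of claim 3)."""

    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(d, d, bias=False)   # fully connected projection shared by both feature sets
        self.d = d

    def forward(self, x_local: torch.Tensor, x_global: torch.Tensor) -> torch.Tensor:
        q = self.w(x_local)                    # (N, d) projected local features
        k = self.w(x_global)                   # (P, d) projected global features
        g2l = torch.softmax(q @ k.T / self.d ** 0.5, dim=-1)   # similarity used as attention weights (N, P)
        return g2l @ k                         # each local feature adaptively absorbs the global features

x_local, x_global = torch.randn(68, 256), torch.randn(196, 256)
print(G2LFusion(256)(x_local, x_global).shape)   # torch.Size([68, 256]) enhanced local features
```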
4. A method according to claim 1 or 3, characterized in that the structural semantic features Z_local are obtained as follows:

a graph model G = (v, e) is constructed from geometric prior knowledge of the facial feature points of the user, wherein v denotes the nodes and e denotes the edges; each feature point sub-block x_i ∈ v is taken as one node of the graph model, and the edges a_ij ∈ e between any two nodes are initialized with the geometry of the facial feature points of the user to obtain the adjacency matrix A_ij;

the enhanced local features X′_local and the adjacency matrix A_ij are input into a graph convolutional neural network, and the structural semantic features Z_local of the face are extracted from the enhanced local features; the adjacency matrix is a learnable parameter of the graph convolutional neural network, so that the network adaptively searches for associations between local regions of the user's face according to the target task, weakens the connections between unrelated regions, and adds edge connections between semantically associated local regions, so as to extract the corresponding structural semantic features.
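A minimal sketch of the landmark graph convolution of claim 4 with a learnable adjacency matrix; the softmax normalisation of the adjacency, the single graph-convolution layer and the random geometric initialisation are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class LandmarkGCN(nn.Module):
    """One graph convolution over the facial feature point graph (sketch of claim 4)."""

    def __init__(self, d: int, init_adj: torch.Tensor):
        super().__init__()
        # adjacency initialised from the geometric layout of the landmarks, then learned end-to-end
        self.adj = nn.Parameter(init_adj.clone().float())
        self.proj = nn.Linear(d, d)

    def forward(self, x_local_enh: torch.Tensor) -> torch.Tensor:
        a = torch.softmax(self.adj, dim=-1)                  # keep the learned edge weights normalised per node
        return torch.relu(a @ self.proj(x_local_enh))        # structural semantic features Z_local

n, d = 68, 256
init_adj = (torch.rand(n, n) < 0.1).float()                  # stand-in for a geometry-based initial graph
z_local = LandmarkGCN(d, init_adj)(torch.randn(n, d))
print(z_local.shape)                                         # torch.Size([68, 256])
```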
5. The method of claim 1, wherein the enhanced global features X′_global are computed as:

L2G_Attention = softmax((X_global·w)(Z_local·w)^T / √d)

X′_global = L2G_Attention ⊗ (Z_local·w)

wherein d is the channel dimension of the feature vectors, softmax(·) is the activation function used to accelerate training convergence, Z_local denotes the structural semantic features, X_global denotes the global features, w denotes projection of the features with a fully connected layer, ⊗ denotes matrix multiplication, and L2G_Attention is the attention weight of the structural semantic features with respect to the global features.
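Claim 5 mirrors claim 3 with the roles swapped: the structural semantic features re-weight and enrich the global features. A short sketch under the same assumed scaled dot-product form:

```python
import torch

d = 256
z_local, x_global = torch.randn(68, d), torch.randn(196, d)
w = torch.randn(d, d) / d ** 0.5                              # stand-in for the fully connected projection
l2g = torch.softmax((x_global @ w) @ (z_local @ w).T / d ** 0.5, dim=-1)   # (196, 68) attention weights
x_global_enh = l2g @ (z_local @ w)                            # enhanced global features, (196, 256)
print(x_global_enh.shape)
```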
6. The method of claim 1, wherein the visual self-attention model consists of M encoders, each encoder comprising a multi-head self-attention (MSA) module with a long skip connection and a multi-layer perceptron (MLP);
the enhanced global features are further aggregated based on a visual self-attention model, specifically:
MSA operation part: first, the enhanced global features X′_global are linearly projected into a query q, a key k and a value v, as follows:
[q,k,v]=X′ global [w q ,w k ,w v ]
wherein w_q, w_k ∈ ℝ^(d×d_k) and w_v ∈ ℝ^(d×d_v) are fully connected layer projections into subspaces in which the similarity between features is explored, and d_k and d_v are the numbers of channels after projection;
second, the self-attention weight Z_global is calculated from the linear projection results:

Z_global = softmax(q·k^T / √d_k)

and all values are weighted and summed to obtain the output Z′_global:

Z′_global = Z_global v
MLP operation part: Z′_global is input into the multi-layer perceptron, which consists of a two-layer feed-forward neural network with a ReLU activation function; through linear mapping and non-linear activation, the semantic representation of the feature Z′_global is further improved.
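A sketch of one encoder of the visual self-attention model of claim 6, with multi-head self-attention and a two-layer MLP each wrapped in a long skip (residual) connection; the layer normalisation and the 4× hidden-size ratio are common Vision Transformer choices assumed here, not specified in the claim.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """One encoder of the visual self-attention model: MSA + MLP, each with a long skip connection."""

    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.msa(h, h, h)        # q, k, v are all projected from the enhanced global features
        x = x + attn_out                       # long skip connection around the MSA
        return x + self.mlp(self.norm2(x))     # long skip connection around the MLP

tokens = torch.randn(1, 196, 256)              # batch of enhanced global feature tokens
print(Encoder(256)(tokens).shape)              # torch.Size([1, 196, 256])
```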
7. The method according to claim 1 or 6, characterized in that the facial expression of the user is identified, in particular:
the features output by the visual self-attention model are input into a fully connected layer, mapped to the output heads of the corresponding expression categories, and finally normalized by a softmax function to obtain the probability distribution over the facial expression categories.
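A sketch of the classification head of claim 7: a fully connected layer maps the aggregated features to one output per expression category and a softmax produces the probability distribution; mean pooling over tokens and seven categories are assumptions for illustration.

```python
import torch
import torch.nn as nn

num_classes, d = 7, 256
head = nn.Linear(d, num_classes)               # fully connected mapping to the expression output heads

tokens = torch.randn(1, 196, d)                # features output by the visual self-attention model
logits = head(tokens.mean(dim=1))              # (1, 7): one output per expression category
probs = torch.softmax(logits, dim=-1)          # softmax normalisation into a probability distribution
print(probs)
```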
8. A facial expression recognition system, comprising:
a facial feature point extraction unit for acquiring a facial image of a user and extracting facial feature points of the user;
the facial feature determining unit is used for extracting shallow features and deep features of the facial image of the user based on the convolutional neural network, cutting out corresponding local feature blocks in the shallow features by taking each facial feature point as a center to obtain a plurality of local features of the facial image, and cutting and projecting the deep features to obtain a plurality of global features of the facial image;
the facial feature enhancement unit is used for taking the similarity between the global features and each local feature as the corresponding attention weight, and adaptively fusing the plurality of global features into each local feature according to the corresponding attention weights to obtain a plurality of enhanced local features; performing graph convolution on the feature point graph corresponding to the facial feature points in combination with the plurality of enhanced local features based on a graph convolutional neural network, and extracting a plurality of structural semantic features of the face of the user; and taking the similarity between the structural semantic features and the global features as the corresponding attention weight, and fusing the plurality of structural semantic features into each global feature according to the corresponding attention weights, so that the regions highly associated with facial expression recognition are strengthened in the global features to obtain the enhanced global features;
and the facial expression recognition unit is used for further aggregating and recognizing the enhanced global features based on the visual self-attention model to obtain the facial expression of the user.
9. An electronic device, comprising:
at least one memory for storing a program;
at least one processor for executing the memory-stored program, which processor is adapted to perform the method of any of claims 1-7 when the memory-stored program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when run on a processor, causes the processor to perform the method according to any one of claims 1-7.
CN202311089347.3A 2023-08-25 2023-08-25 Facial expression recognition method and system and electronic equipment Pending CN117133035A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311089347.3A CN117133035A (en) 2023-08-25 2023-08-25 Facial expression recognition method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311089347.3A CN117133035A (en) 2023-08-25 2023-08-25 Facial expression recognition method and system and electronic equipment

Publications (1)

Publication Number Publication Date
CN117133035A true CN117133035A (en) 2023-11-28

Family

ID=88862328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311089347.3A Pending CN117133035A (en) 2023-08-25 2023-08-25 Facial expression recognition method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN117133035A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117456590A (en) * 2023-12-25 2024-01-26 天津生联智慧科技发展有限公司 Face recognition method and device for visual transformation of intelligent city application
CN117456590B (en) * 2023-12-25 2024-04-02 天津生联智慧科技发展有限公司 Face recognition method and device for visual transformation of intelligent city application
CN117576765A (en) * 2024-01-15 2024-02-20 华中科技大学 Facial action unit detection model construction method based on layered feature alignment
CN117576765B (en) * 2024-01-15 2024-03-29 华中科技大学 Facial action unit detection model construction method based on layered feature alignment
CN117690178A (en) * 2024-01-31 2024-03-12 江西科技学院 Face image recognition method and system based on computer vision
CN117690178B (en) * 2024-01-31 2024-04-05 江西科技学院 Face image recognition method and system based on computer vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination