CN116311455A - Expression recognition method based on improved Mobile-former - Google Patents

Expression recognition method based on improved Mobile-former

Info

Publication number
CN116311455A
CN116311455A (application CN202310289138.7A)
Authority
CN
China
Prior art keywords
module
mobile
former
sub
expression recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310289138.7A
Other languages
Chinese (zh)
Inventor
严春满
张翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN202310289138.7A priority Critical patent/CN116311455A/en
Publication of CN116311455A publication Critical patent/CN116311455A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of expression recognition, and particularly relates to an expression recognition method based on an improved Mobile-former. The method addresses the problem that existing networks cannot effectively combine the local and global features of expressions, and the problem that oversized model parameters limit the application scenarios. The method comprises the following steps: the expression image is input into an improved Mobile-former module, where the ACmix module of the Mobile part performs preliminary filtering and merging of the image and extracts feature information; the Mobile sub-module and the Former sub-module extract the local and global features of the expression image, respectively, and the lightweight cross-attention module fuses the two kinds of features bidirectionally; the fused feature information is output through the Mobile sub-module and serves as the input of the next Mobile-former module; after the expression features have been extracted by a plurality of Mobile-former modules, the feature information is output to a classifier for classification through the Mobile sub-module in the last Mobile-former module. The method maintains the lightweight advantage of the model while effectively improving expression recognition accuracy.

Description

Expression recognition method based on improved Mobile-former
Technical Field
The invention belongs to the technical field of expression recognition, and particularly relates to an expression recognition method based on an improved Mobile-former.
Background
Facial expression is the most direct way in which humans express emotion. Studies have shown that 55% of the information in daily human communication is conveyed by facial expression. With the rapid development of artificial intelligence, the level of human-computer interaction keeps improving, and facial expression recognition has great potential in applications such as human-computer interaction, depression treatment and fatigue monitoring. It has also received a great deal of attention in the field of computer vision.
Convolutional neural networks (CNNs) have been widely applied to facial expression recognition (FER). Thanks to local perception and parameter sharing, CNNs achieve good performance on FER tasks, and the end-to-end recognition paradigm allows a well-designed CNN model to be trained on large numbers of images. In addition, the convolution layers of a CNN capture both low-level generic features and high-level semantic features for expression classification. Some researchers have also embedded attention mechanisms into CNNs so that the model weights the key regions of the facial expression, improving its local information processing capability.
However, a CNN focuses on locally perceived features and neglects global information, which limits its performance on FER tasks. In recent years, ViT (Vision Transformer) has demonstrated the advantage of global processing in expression recognition, with a performance edge over CNNs. ViT is a network built on self-attention, which captures correlations across the global context. On top of the Transformer, ViT introduces the concept of image patches: the image is first divided into a number of patches, each patch is flattened and projected into a feature vector of fixed length, and the feature vectors are then fed into the Transformer for training.
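As a point of reference, the patch-splitting and projection step just described can be sketched in a few lines of PyTorch; the 224x224 input size, 16x16 patch size and 768-dimensional embedding below are common ViT defaults, used here only as illustrative assumptions and not values taken from this patent.

```python
# Minimal sketch of ViT-style patch embedding (sizes are illustrative assumptions).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, in_chans=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided convolution is equivalent to cutting the image into patches,
        # flattening each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                            # torch.Size([1, 196, 768])
```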
Existing expression recognition models mainly suffer from the following problems: 1) CNNs and Transformers concentrate on either local or global features and cannot combine the two. 2) These models have overly complex structures, require large amounts of computing resources to train, and are difficult to deploy on embedded devices.
Disclosure of Invention
In order to solve the problem that the network cannot effectively combine the local and global features of expressions, and the problem that application scenarios are limited because the model parameters are too large, the invention provides an expression recognition method based on an improved Mobile-former network.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
An expression recognition method based on an improved Mobile-former comprises:
S1, inputting an expression image into an improved Mobile-former module, where an ACmix module of the Mobile part performs preliminary filtering and merging of the image and extracts feature information;
S2, the extracted feature information is transmitted to the Mobile sub-module and is also input into the Former sub-module through the lightweight cross-attention module; the Mobile sub-module extracts local features of the expression image while the Former sub-module extracts global features of the expression image, and the lightweight cross-attention module fuses the two kinds of features bidirectionally;
S3, the fused feature information is output through the Mobile sub-module and serves as the input of the next Mobile-former module;
S4, after the expression features are extracted by a plurality of Mobile-former modules, the feature information is output to the classifier for classification through the Mobile sub-module in the last Mobile-former module.
The stem part in the Mobile-former module is replaced by an ACmix module.
The lightweight cross-attention module consists of two bidirectional bridges that connect the Mobile sub-module and the Former sub-module and fuse the local and global features in both directions.
The point-wise convolutions in the Mobile sub-module are replaced with Ghost modules, and the SE attention mechanism in the Mobile sub-module is removed.
The beneficial effects of the invention are as follows: according to the characteristics of facial expressions, an ACmix module replaces the original stem part of the network. ACmix combines ordinary convolution with a self-attention mechanism, so that the network obtains a larger receptive field during the preliminary processing of the input image. Secondly, a new Mobile sub-module is introduced: Ghost modules replace the point-wise convolutions in the depthwise separable convolution, and the linear operations in the Ghost module reduce the feature-map redundancy generated when the network extracts features; the SE attention mechanism in the Mobile sub-module is removed, which reduces the overfitting caused by multiple attention mechanisms and improves the interaction efficiency between local and global features. Experimental results show that, compared with the baseline model and various other deep networks, the improved model maintains its lightweight advantage while effectively improving expression recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of a modified Mobile-former structure;
FIG. 2 is a schematic diagram showing the details of a Mobile-former module;
FIG. 3 is a block diagram of an ACmix module;
FIG. 4 is a diagram of a Mobile submodule architecture;
FIG. 5 is a block diagram of a Ghost module structure;
FIG. 6 shows partial samples of the datasets;
Detailed Description
The technical scheme of the invention is further described below by specific embodiments with reference to the accompanying drawings:
example 1
The invention provides an expression recognition method based on an improved Mobile-former, which comprises the following steps:
s1, inputting an expression image into an improved Mobile-former module, and primarily filtering and merging the image by an ACmix module of a Mobile part and extracting characteristic information;
s2, the extracted feature information is transmitted to a Mobile sub-module and is input into a former sub-module through a lightweight cross attention module, the Mobile sub-module extracts local features of the expression image, and meanwhile, the former sub-module extracts global features of the expression image, and the lightweight cross attention module carries out bidirectional fusion on the two features;
s3, outputting the combined characteristic information through a Mobile sub-module and taking the combined characteristic information as input of a next Mobile-former module;
s4, after the table characteristic is extracted through a plurality of Mobile-former modules, the characteristic information is output to the classifier for classification through the Mobile sub-module in the last Mobile-former.
The specific model structure is as follows:
the Mobile-former is designed by Mobile V3 and the transducer in parallel, and the middle is connected by a bidirectional bridge. The parallel structure enables the model to simultaneously consider the advantages of the mobileNet network in terms of local processing and the transducer in terms of global interaction. The bidirectional bridge achieves bidirectional fusion of local and global features. The Mobile-former network structure is shown in fig. 1.
The Mobile-former module is shown in detail in fig. 2, where the differently shaded regions represent its main constituent parts. The Mobile-former network is composed of a stack of Mobile-former modules, and each module is divided into three parts: a Mobile sub-module, a Former sub-module and a lightweight cross-attention module consisting of the two bridges Mobile→Former and Former→Mobile.
The Mobile-former module has two inputs: (a) a local feature map X ∈ R^{H×W×C}, where H is the height, W the width and C the number of channels, and (b) global tokens Z ∈ R^{M×d}, where M and d are the number and the dimension of the tokens, respectively. The module updates them and outputs X′ and Z′. The Mobile sub-module takes the feature map X as input and forms an inverted residual block built on depthwise separable convolution.
The Former sub-module consists of a multi-head self-attention mechanism (MHA) and a feed-forward network (FFN).
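A minimal sketch of such a Former sub-module is shown below, assuming pre-norm residual connections and an FFN expansion ratio of 2; these choices, and the token dimension, are illustrative assumptions rather than values stated in the patent.

```python
# Sketch of a Former sub-module: MHA over the global tokens followed by an FFN,
# both with residual connections (layer sizes are assumed).
import torch
import torch.nn as nn

class FormerSubModule(nn.Module):
    def __init__(self, dim=192, num_heads=4, ffn_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * ffn_ratio), nn.GELU(),
                                 nn.Linear(dim * ffn_ratio, dim))

    def forward(self, z):                                  # z: (B, M, d) global tokens
        h = self.norm1(z)
        z = z + self.mha(h, h, h, need_weights=False)[0]   # multi-head self-attention + residual
        z = z + self.ffn(self.norm2(z))                    # feed-forward network + residual
        return z

print(FormerSubModule()(torch.randn(2, 6, 192)).shape)     # torch.Size([2, 6, 192])
```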
The lightweight cross-attention is used to fuse the local feature map X with the global tokens Z. Specifically, the lightweight cross-attention from the local features to the global tokens (Mobile→Former) can be expressed as

A_{X \to Z} = [\mathrm{Attn}(\tilde{z}_i W_i^{Q}, \tilde{x}_i, \tilde{x}_i)]_{i=1:h} W^{O}    (1)

where the local feature map X and the global tokens Z are split into h heads, \tilde{x}_i and \tilde{z}_i denote the i-th head of X and Z (the head split \tilde{z}_i is different from the i-th token z_i ∈ R^d), W_i^{Q} is the query projection matrix of the i-th head, and W^{O} concatenates the h heads back together. \mathrm{Attn}(Q, K, V) = \mathrm{softmax}(QK^{T}/\sqrt{d_k})V is the standard attention function. In Mobile→Former the global tokens act as the queries (Q) and the local features act as the keys (K) and values (V), so only the query projection W^{Q} on the token side is kept, while the key and value projections W^{K} and W^{V} on the feature-map side are removed to save computation; Former→Mobile is the reverse. The global-to-local cross-attention (Former→Mobile) can be expressed as

A_{Z \to X} = [\mathrm{Attn}(\tilde{x}_i, \tilde{z}_i W_i^{K}, \tilde{z}_i W_i^{V})]_{i=1:h}    (2)

where [\cdot]_{1:h} denotes the concatenation of the h heads, and W_i^{K} and W_i^{V} are the key and value projection matrices on the Former side; the query projection matrix on the Mobile side is removed.
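For illustration, the two bridges of equations (1) and (2) could be sketched as below. Following the text, projections are kept only on the token side: Mobile→Former keeps W^Q (queries from the tokens) and drops W^K and W^V on the feature-map side, while Former→Mobile keeps W^K and W^V and drops W^Q. For simplicity the sketch assumes the flattened local features and the tokens share the same dimension d, which is not required in the actual network.

```python
# Sketch of the lightweight cross-attention bridges (assumes shared dimension d).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MobileToFormer(nn.Module):
    """Local feature map X -> global tokens Z (tokens act as queries)."""
    def __init__(self, dim=192, num_heads=4):
        super().__init__()
        self.h = num_heads
        self.q_proj = nn.Linear(dim, dim)   # W^Q, token side only
        self.out = nn.Linear(dim, dim)      # W^O, merges the heads

    def forward(self, x, z):                # x: (B, HW, d) flattened features, z: (B, M, d)
        B, N, d = x.shape
        hd = d // self.h
        q = self.q_proj(z).view(B, -1, self.h, hd).transpose(1, 2)   # (B, h, M, hd)
        kv = x.view(B, N, self.h, hd).transpose(1, 2)                # (B, h, HW, hd), no W^K / W^V
        attn = F.softmax(q @ kv.transpose(-2, -1) / hd ** 0.5, dim=-1)
        out = (attn @ kv).transpose(1, 2).reshape(B, -1, d)
        return z + self.out(out)

class FormerToMobile(nn.Module):
    """Global tokens Z -> local feature map X (features act as queries)."""
    def __init__(self, dim=192, num_heads=4):
        super().__init__()
        self.h = num_heads
        self.k_proj = nn.Linear(dim, dim)   # W^K, token side
        self.v_proj = nn.Linear(dim, dim)   # W^V, token side

    def forward(self, x, z):                # x: (B, HW, d), z: (B, M, d)
        B, N, d = x.shape
        hd = d // self.h
        q = x.view(B, N, self.h, hd).transpose(1, 2)                 # no W^Q on the feature side
        k = self.k_proj(z).view(B, -1, self.h, hd).transpose(1, 2)
        v = self.v_proj(z).view(B, -1, self.h, hd).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / hd ** 0.5, dim=-1)
        return x + (attn @ v).transpose(1, 2).reshape(B, N, d)

x, z = torch.randn(2, 49, 192), torch.randn(2, 6, 192)
z = MobileToFormer()(x, z)      # Mobile -> Former
x = FormerToMobile()(x, z)      # Former -> Mobile
print(x.shape, z.shape)         # torch.Size([2, 49, 192]) torch.Size([2, 6, 192])
```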
The stem part in the Mobile-former module is replaced by an ACmix module. The ACmix module has a larger receptive field, allowing the network to focus on important regions over a wider range. Consider a self-attention module with N heads operating on a feature map, where F ∈ R^{C_{in}×H×W} and G ∈ R^{C_{out}×H×W} denote the input and output, and f_{ij} and g_{ij} denote the feature vectors of pixel (i, j) in F and G, respectively. The multi-head self-attention output at pixel (i, j) is

g_{ij} = \Vert_{l=1}^{N} \Big( \sum_{a,b \in \mathcal{N}_k(i,j)} A\big(W_q^{(l)} f_{ij}, W_k^{(l)} f_{ab}\big) W_v^{(l)} f_{ab} \Big)    (3)

where \Vert denotes the concatenation of the N heads, W_q^{(l)}, W_k^{(l)} and W_v^{(l)} are the projection matrices of the queries, keys and values, \mathcal{N}_k(i, j) denotes the local pixel region of spatial extent k centered at (i, j), and A(W_q^{(l)} f_{ij}, W_k^{(l)} f_{ab}) is the attention weight assigned to the features within \mathcal{N}_k(i, j). Furthermore, multi-head self-attention can be decomposed into two stages:

Stage I:  q_{ij}^{(l)} = W_q^{(l)} f_{ij},  k_{ij}^{(l)} = W_k^{(l)} f_{ij},  v_{ij}^{(l)} = W_v^{(l)} f_{ij}    (4)

Stage II:  g_{ij} = \Vert_{l=1}^{N} \Big( \sum_{a,b \in \mathcal{N}_k(i,j)} A\big(q_{ij}^{(l)}, k_{ab}^{(l)}\big) v_{ab}^{(l)} \Big)    (5)
Ordinary convolution and self-attention share the same 1×1 convolution operations. Based on this, the ACmix module combines ordinary convolution with self-attention. As shown in fig. 3, the ACmix module runs in two stages. In the first stage, the input feature map is projected by three 1×1 convolutions and reshaped into N pieces, yielding a rich set of 3×N intermediate feature maps. In the second stage, these intermediate features are fed into two paths that follow the different computation patterns of convolution and self-attention, and new features are finally generated by aggregation and shifting. In the self-attention path, the three groups of feature maps produced by the 1×1 convolutions serve as the queries, keys and values of a multi-head self-attention module. In the convolution path with kernel size k, a fully connected layer generates k² feature maps; the input features are processed in the manner of ordinary convolution, collecting information from the local receptive field. Finally, the outputs of the two paths are added with learnable weights:

F_out = α F_att + β F_conv    (6)
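The two-path structure and the weighted sum of equation (6) can be illustrated with the simplified sketch below. It is an assumption-laden simplification, not the real ACmix: the self-attention path uses plain global attention over all pixels instead of ACmix's windowed attention, and a single convolution stands in for the fully connected layer plus shift-and-aggregate of the actual convolution path.

```python
# Simplified sketch of the ACmix idea: shared 1x1 projections feed an attention path and
# a convolution path, mixed by learnable weights alpha and beta (eq. (6)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACmixSketch(nn.Module):
    def __init__(self, channels=64, num_heads=4, kernel_size=3):
        super().__init__()
        self.h = num_heads
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)          # shared 1x1 projections
        self.conv_mix = nn.Conv2d(channels * 3, channels, kernel_size,
                                  padding=kernel_size // 2)                  # stands in for FC + shift
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        qkv = self.qkv(x)                                    # (B, 3C, H, W) intermediate features
        q, k, v = qkv.chunk(3, dim=1)

        # Self-attention path (global attention here, instead of ACmix's local windows).
        def heads(t):
            return t.reshape(B, self.h, C // self.h, H * W)
        q, k, v = heads(q), heads(k), heads(v)
        attn = F.softmax(q.transpose(-2, -1) @ k / (C // self.h) ** 0.5, dim=-1)  # (B, h, HW, HW)
        f_att = (v @ attn.transpose(-2, -1)).reshape(B, C, H, W)

        # Convolution path reuses the same intermediate features.
        f_conv = self.conv_mix(qkv)

        return self.alpha * f_att + self.beta * f_conv       # F_out = alpha*F_att + beta*F_conv

print(ACmixSketch()(torch.randn(1, 64, 28, 28)).shape)       # torch.Size([1, 64, 28, 28])
```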
The Mobile sub-module extracts the local features of the expression image; it is the most complex sub-module in the network and the key to extracting local features. However, the more Mobile sub-modules the network contains, the larger the overall parameter count of the model. This scheme therefore proposes a lighter and more efficient Mobile sub-module.
The original Mobile sub-module is shown in fig. 4(a). After the feature information enters the Mobile sub-module, the feature dimension is compressed by a point-wise convolution, the features are then spatially encoded by a depthwise convolution and weighted by an SE attention mechanism, and the feature information is output after being expanded by a second point-wise convolution. The original Mobile sub-module generates a large amount of feature-map redundancy when compressing and expanding the feature information, causing redundant computation. In addition, the SE modules introduce yet another attention mechanism into the network, and the interaction between the different attention mechanisms makes the network prone to overfitting during training.
To address these problems, the Mobile sub-module in the network is rebuilt with Ghost modules, making the whole network lighter and more efficient. The rebuilt structure is shown in fig. 4(b). The scheme uses 1×1 Ghost modules to rebuild the inverted residual structure. When the Ghost module transforms the dimension of the feature information, its configurable convolution kernels and cheap linear operations greatly reduce the feature-map redundancy caused by raising or lowering the dimension, thereby reducing the parameter count and computation of the whole network. At the same time, the SE attention mechanism is removed from all Mobile sub-modules, which reduces both the conflict between different attention mechanisms and the overall computational load of the network.
The Ghost module structure is shown in FIG. 5. It decomposes an ordinary convolution into two parts: the first part uses an ordinary convolution to generate a set of intrinsic feature maps Y′, and the second part uses cheap linear operations to augment the features and increase the number of channels.
Y′ = X * f′ + b    (7)

y_{ij} = \Phi_{i,j}(y′_i),  i = 1, …, m,  j = 1, …, s    (8)

where y′_i is the i-th intrinsic feature map in Y′, and \Phi_{i,j} is the cheap linear operation that generates the j-th ghost feature map from it, so each intrinsic feature map y′_i can generate one or more ghost feature maps {y_{ij}}_{j=1}^{s}. The last operation \Phi_{i,s} is the identity mapping, which preserves the original feature map.
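A minimal sketch of such a Ghost module is given below, using a 1×1 ordinary convolution for the intrinsic maps and a depthwise convolution as the cheap linear operation; the ratio of 2 and the 3×3 kernel are the usual GhostNet defaults, assumed here for illustration only.

```python
# Sketch of a Ghost module (eqs. (7)-(8)): ordinary conv for intrinsic maps,
# cheap depthwise conv for the extra "ghost" maps (defaults assumed).
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3, relu=True):
        super().__init__()
        init_ch = math.ceil(out_ch / ratio)             # intrinsic maps from the ordinary conv
        cheap_ch = init_ch * (ratio - 1)                # ghost maps from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),   # Y' = X * f'  (eq. (7))
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity())
        self.cheap = nn.Sequential(                     # Phi_{i,j}: cheap depthwise ops (eq. (8))
            nn.Conv2d(init_ch, cheap_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity())
        self.out_ch = out_ch

    def forward(self, x):
        y_prime = self.primary(x)                       # intrinsic feature maps
        y_ghost = self.cheap(y_prime)                   # ghost feature maps
        return torch.cat([y_prime, y_ghost], dim=1)[:, :self.out_ch]  # identity kept via concat

print(GhostModule(16, 32)(torch.randn(1, 16, 56, 56)).shape)   # torch.Size([1, 32, 56, 56])
```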
To sum up: the model improves the efficiency of combining local features and global features in the process of inputting images through a network on the basis of selecting a Mobile-former, and simultaneously maintains the light weight of the network. An ACmix module is first used to replace the original stem part in the network. ACmix combines the common convolution with a self-attention mechanism so that the network can obtain a larger receptive field when the input image is subjected to preliminary processing. Secondly, introducing a mobile sub-module, replacing point-to-point convolution in the depth separable convolution by a Ghost module, and reducing feature map redundancy generated when a network extracts features by utilizing linear operation in the Ghost module. In addition, the text model deletes the SE attention mechanism in the mobile sub-module, reduces overfitting caused by various attention mechanisms, and improves interaction efficiency of local and global features.
Example 2 case analysis
1 Experimental environment
The software and hardware configurations used in the experiments are shown in Table 1; all compared networks were run on the same platform.
2 Datasets
RAF-DB is a large facial expression database containing 30,000 facial images downloaded from the Internet. The dataset shows great diversity in the age, gender, head pose, lighting conditions and occlusion of the subjects. The training set used in the experiments contains 12,271 images and the validation set contains 3,068 images.
CK+ was extended in 2010 from the Cohn-Kanade dataset and is one of the most commonly used facial expression datasets, comprising 593 image sequences from 123 participants. The CK dataset contains static images, whereas CK+ contains both static images and image sequences; both datasets include emotion labels describing the participants' expressions.
Partial samples of the datasets are shown in fig. 6.
3 Experimental results and analysis
In order to test the performance of the model in expression recognition, training and testing were carried out on the RAF-DB and CK+ datasets, respectively. The weights were randomly initialized during training, and the NNI toolkit was used to count the number of parameters and the amount of computation. Because transfer learning has a large influence on accuracy on the RAF-DB dataset and makes it hard to judge the performance of the model itself, no pre-training or transfer learning was used. The training settings are shown in Table 2, and a minimal sketch of the corresponding optimizer setup is given after the table.
Table 2 Training settings

Dataset    Batch size    Learning rate    Optimizer    Momentum
RAF-DB     300           0.01             Adam         0.9
CK+        150           0.01             SGD          0.9
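For illustration, the settings in Table 2 could be mapped onto optimizer objects roughly as follows; the model, the loss function and anything beyond the listed hyper-parameters (for example the epoch count) are assumptions, and the 0.9 momentum entry is applied explicitly only in the SGD case, since Adam's momentum-like behaviour is governed by its default betas.

```python
# Minimal optimizer setup matching Table 2 (model and loss are placeholders).
import torch
import torch.nn as nn

def build_optimizer(model, dataset):
    if dataset == "RAF-DB":          # batch size 300, Adam, learning rate 0.01
        return torch.optim.Adam(model.parameters(), lr=0.01)
    if dataset == "CK+":             # batch size 150, SGD, learning rate 0.01, momentum 0.9
        return torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    raise ValueError(f"unknown dataset: {dataset}")

model = nn.Linear(10, 7)             # stand-in for the expression recognition network
optimizer = build_optimizer(model, "CK+")
criterion = nn.CrossEntropyLoss()    # assumed classification loss (not stated in the table)
```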
3.1 Comparison experiments on the improved modules
Table 3 Comparison results of the improved modules

Model                        RAF-DB     CK+        Params
Mobile-former                76.53%     93.97%     12.84M
Mobile-former + LE-mobile    78.58%     95.96%     11.78M
Mobile-former + ACmix        77.21%     94.87%     12.86M
Ours                         79.56%     96.97%     11.79M
In order to analyze the influence of the proposed improvements on accuracy, parameter count and network complexity, this experiment compares different combinations of the improvements on top of the Mobile-former expression recognition model. The comparison results for each module are shown in Table 3. By combining the advantages of MobileNetV3 in local processing and the Transformer in global interaction, Mobile-former extracts expression features efficiently, reaching 76.53% accuracy on RAF-DB and 93.97% on CK+ with 12.84M parameters. When the new Mobile sub-module is introduced, the Ghost module uses linear operations to reduce the redundant feature maps produced by the convolution layers, which effectively cuts redundant computation and improves the efficiency of expression feature extraction. Meanwhile, removing the SE attention mechanism from the original Mobile sub-module reduces the overfitting caused by having several different attention mechanisms and strengthens the network's ability to extract expression features. Introducing this module improves accuracy on the two datasets by 2.02% and 1.99%, respectively, while reducing the parameter count by 1.06M. ACmix enlarges the receptive field of the convolution when processing images by combining ordinary convolution with self-attention. Replacing the original convolution layer in the stem with the ACmix module allows the network to obtain a global receptive field during the preliminary filtering and merging of the input image. Introducing the ACmix module improves accuracy on the two datasets by 0.68% and 0.9%, while Params barely increases thanks to the lightweight design of ACmix. When both improvements are applied to the base network simultaneously, accuracy on the two datasets improves by 3.03% and 3%, respectively, while Params decreases by 1.05M. The proposed improvements therefore raise expression recognition accuracy to different degrees, and the introduction of the Ghost module clearly reduces the overall parameter count and floating-point operations of the network, preserving its lightweight character.
Example 3
The embodiment provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the expression recognition method based on the improved Mobile-former provided by the embodiment 1 when executing the computer program.
Example 4
The present embodiment provides a computer readable storage medium having a computer program stored thereon, wherein the program when executed by a processor implements an expression recognition method based on an improved Mobile-former provided in embodiment 1 of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The solutions in the embodiments of the present application may be implemented in various computer languages, for example, the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (6)

1. An expression recognition method based on an improved Mobile-former, characterized by comprising the following steps:
S1, inputting an expression image into an improved Mobile-former module, where an ACmix module of the Mobile part performs preliminary filtering and merging of the image and extracts feature information;
S2, the extracted feature information is transmitted to the Mobile sub-module and is also input into the Former sub-module through the lightweight cross-attention module; the Mobile sub-module extracts local features of the expression image while the Former sub-module extracts global features of the expression image, and the lightweight cross-attention module fuses the two kinds of features bidirectionally;
S3, the fused feature information is output through the Mobile sub-module and serves as the input of the next Mobile-former module;
S4, after the expression features are extracted by a plurality of Mobile-former modules, the feature information is output to the classifier for classification through the Mobile sub-module in the last Mobile-former module.
2. The improved Mobile-former-based expression recognition method of claim 1, wherein: the stem part in the Mobile-former module is replaced by an ACmix module.
3. The improved Mobile-former-based expression recognition method of claim 1, wherein: the lightweight cross-attention module consists of two bidirectional bridges that connect the Mobile sub-module and the Former sub-module and fuse the local and global features in both directions.
4. The improved Mobile-former-based expression recognition method of claim 1, wherein: the point-wise convolutions in the Mobile sub-module are replaced with Ghost modules, and the SE attention mechanism in the Mobile sub-module is removed.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements an improved Mobile-former based expression recognition method as claimed in any one of claims 1 to 4 when the computer program is executed by the processor.
6. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements an improved Mobile-former based expression recognition method as claimed in any one of claims 1 to 4.
CN202310289138.7A 2023-03-23 2023-03-23 Expression recognition method based on improved Mobile-former Pending CN116311455A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310289138.7A CN116311455A (en) 2023-03-23 2023-03-23 Expression recognition method based on improved Mobile-former

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310289138.7A CN116311455A (en) 2023-03-23 2023-03-23 Expression recognition method based on improved Mobile-former

Publications (1)

Publication Number Publication Date
CN116311455A true CN116311455A (en) 2023-06-23

Family

ID=86799361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310289138.7A Pending CN116311455A (en) 2023-03-23 2023-03-23 Expression recognition method based on improved Mobile-former

Country Status (1)

Country Link
CN (1) CN116311455A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721351A (en) * 2023-07-06 2023-09-08 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 Remote sensing intelligent extraction method for road environment characteristics in overhead line channel

Similar Documents

Publication Publication Date Title
Gao et al. A mutually supervised graph attention network for few-shot segmentation: The perspective of fully utilizing limited samples
Gao et al. MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
Liu et al. Face parsing via recurrent propagation
Zhang et al. Lightweight and efficient asymmetric network design for real-time semantic segmentation
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
Dandıl et al. Real-time facial emotion classification using deep learning
CN110852295B (en) Video behavior recognition method based on multitasking supervised learning
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114863539A (en) Portrait key point detection method and system based on feature fusion
CN116311455A (en) Expression recognition method based on improved Mobile-former
Yi et al. Elanet: effective lightweight attention-guided network for real-time semantic segmentation
CN113657272B (en) Micro video classification method and system based on missing data completion
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
Gao et al. Generalized pyramid co-attention with learnable aggregation net for video question answering
Xie et al. Facial expression recognition through multi-level features extraction and fusion
Zhao et al. Position fusing and refining for clear salient object detection
Jiang et al. Transformer-Based Fused Attention Combined with CNNs for Image Classification
Luo et al. Temporal-aware mechanism with bidirectional complementarity for video q&a
Tan et al. PPEDNet: Pyramid pooling encoder-decoder network for real-time semantic segmentation
Xu et al. A facial expression recognition method based on residual separable convolutional neural network
Liu et al. Face expression recognition based on improved convolutional neural network
Putro et al. A Fast Real-time Facial Expression Classifier Deep Learning-based for Human-robot Interaction
Qin et al. Research on Semantic Segmentation Algorithm for Autonomous Driving Based on Improved DeepLabv3+
Rui et al. Fast Real-time Semantic Segmentation Network with an Asymmetric Encoder-Decoder Structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination