CN116311455A - Expression recognition method based on improved Mobile-former - Google Patents

Expression recognition method based on improved Mobile-former

Info

Publication number
CN116311455A
CN116311455A (application CN202310289138.7A)
Authority
CN
China
Prior art keywords
module
mobile
former
sub
expression recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310289138.7A
Other languages
Chinese (zh)
Inventor
严春满
张翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN202310289138.7A priority Critical patent/CN116311455A/en
Publication of CN116311455A publication Critical patent/CN116311455A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of expression recognition, and particularly relates to an expression recognition method based on an improved Mobile-former. The method addresses the problem that existing networks cannot effectively combine the local and global features of expressions, and the problem that oversized model parameters limit the application scenarios. The method comprises the following steps: the expression image is input into an improved Mobile-former module, where the ACmix module of the Mobile part performs preliminary filtering and merging of the image and extracts feature information; the Mobile sub-module and the Former sub-module extract the local and global features of the expression image, respectively, and the lightweight cross-attention module fuses the two kinds of features bidirectionally; the fused feature information is output through the Mobile sub-module and serves as the input of the next Mobile-former module; after the expression features have been extracted by a plurality of Mobile-former modules, the feature information is output to a classifier for classification through the Mobile sub-module in the last Mobile-former module. The method maintains the lightweight advantage of the model while effectively improving expression recognition accuracy.

Description

Expression recognition method based on improved Mobile-former
Technical Field
The invention belongs to the technical field of expression recognition, and particularly relates to an expression recognition method based on an improved Mobile-former.
Background
Facial expression is the most direct way in which humans express emotion. Studies have shown that 55% of the information in daily human communication is conveyed by facial expression. With the rapid development of artificial intelligence, the level of human-computer interaction keeps improving, and facial expression recognition has great potential in applications such as human-computer interaction, depression treatment and fatigue monitoring. It has also received a great deal of attention in the field of computer vision.
Convolutional neural networks (CNNs) have been widely applied to facial expression recognition (FER). Thanks to local perception and parameter sharing, CNNs achieve good performance on FER tasks, and the end-to-end recognition paradigm allows a well-designed CNN model to be trained on large numbers of images. In addition, the convolution layers of a CNN capture both low-level generic features and high-level semantic features for expression classification. Some researchers have also embedded attention mechanisms into CNNs so that the model weights the key regions of the facial expression, improving its local information processing capability.
However, a CNN focuses on locally perceived features and neglects global information, which limits its performance on FER tasks. In recent years, ViT (Vision Transformer) has demonstrated the advantage of global processing in expression recognition, with a performance edge over CNNs. ViT is a network built on self-attention, which captures correlations across the global context. On top of the Transformer, ViT introduces the concept of image patches: the image is first divided into a number of patches, each patch is flattened and projected into a feature vector of fixed length, and the feature vectors are then fed into the Transformer for training.
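As a point of reference, the patch-splitting and projection step just described can be sketched in a few lines of PyTorch; the 224x224 input size, 16x16 patch size and 768-dimensional embedding below are common ViT defaults, used here only as illustrative assumptions and not values taken from this patent.

```python
# Minimal sketch of ViT-style patch embedding (sizes are illustrative assumptions).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, in_chans=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided convolution is equivalent to cutting the image into patches,
        # flattening each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                            # torch.Size([1, 196, 768])
```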
Existing expression recognition models mainly suffer from the following problems: 1) CNNs and Transformers concentrate on either local or global features and cannot combine the two. 2) These models have overly complex structures, require large amounts of computing resources to train, and are difficult to deploy on embedded devices.
Disclosure of Invention
In order to solve the problem that the network cannot effectively combine the local and global features of expressions, and the problem that application scenarios are limited because the model parameters are too large, the invention provides an expression recognition method based on an improved Mobile-former network.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
An expression recognition method based on an improved Mobile-former comprises:
S1, inputting an expression image into an improved Mobile-former module, where an ACmix module of the Mobile part performs preliminary filtering and merging of the image and extracts feature information;
S2, the extracted feature information is transmitted to the Mobile sub-module and is also input into the Former sub-module through the lightweight cross-attention module; the Mobile sub-module extracts local features of the expression image while the Former sub-module extracts global features of the expression image, and the lightweight cross-attention module fuses the two kinds of features bidirectionally;
S3, the fused feature information is output through the Mobile sub-module and serves as the input of the next Mobile-former module;
S4, after the expression features are extracted by a plurality of Mobile-former modules, the feature information is output to the classifier for classification through the Mobile sub-module in the last Mobile-former module.
The stem part in the Mobile-former module is replaced by an ACmix module.
The lightweight cross-attention module consists of two bidirectional bridges that connect the Mobile sub-module and the Former sub-module and fuse the local and global features in both directions.
The point-wise convolutions in the Mobile sub-module are replaced with Ghost modules, and the SE attention mechanism in the Mobile sub-module is removed.
The beneficial effects of the invention are as follows: according to the characteristics of facial expressions, an ACmix module replaces the original stem part of the network. ACmix combines ordinary convolution with a self-attention mechanism, so that the network obtains a larger receptive field during the preliminary processing of the input image. Secondly, a new Mobile sub-module is introduced: Ghost modules replace the point-wise convolutions in the depthwise separable convolution, and the linear operations in the Ghost module reduce the feature-map redundancy generated when the network extracts features; the SE attention mechanism in the Mobile sub-module is removed, which reduces the overfitting caused by multiple attention mechanisms and improves the interaction efficiency between local and global features. Experimental results show that, compared with the baseline model and various other deep networks, the improved model maintains its lightweight advantage while effectively improving expression recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of a modified Mobile-former structure;
FIG. 2 is a schematic diagram showing the details of a Mobile-former module;
FIG. 3 is a block diagram of an ACmix module;
FIG. 4 is a diagram of a Mobile submodule architecture;
FIG. 5 is a block diagram of a Ghost module structure;
FIG. 6 shows partial samples of the datasets;
Detailed Description
The technical scheme of the invention is further described below by specific embodiments with reference to the accompanying drawings:
example 1
The invention provides an expression recognition method based on an improved Mobile-former, which comprises the following steps:
s1, inputting an expression image into an improved Mobile-former module, and primarily filtering and merging the image by an ACmix module of a Mobile part and extracting characteristic information;
s2, the extracted feature information is transmitted to a Mobile sub-module and is input into a former sub-module through a lightweight cross attention module, the Mobile sub-module extracts local features of the expression image, and meanwhile, the former sub-module extracts global features of the expression image, and the lightweight cross attention module carries out bidirectional fusion on the two features;
s3, outputting the combined characteristic information through a Mobile sub-module and taking the combined characteristic information as input of a next Mobile-former module;
s4, after the table characteristic is extracted through a plurality of Mobile-former modules, the characteristic information is output to the classifier for classification through the Mobile sub-module in the last Mobile-former.
The specific model structure is as follows:
the Mobile-former is designed by Mobile V3 and the transducer in parallel, and the middle is connected by a bidirectional bridge. The parallel structure enables the model to simultaneously consider the advantages of the mobileNet network in terms of local processing and the transducer in terms of global interaction. The bidirectional bridge achieves bidirectional fusion of local and global features. The Mobile-former network structure is shown in fig. 1.
The Mobile-former module is shown in detail in fig. 2, where the differently shaded regions represent its main constituent parts. The Mobile-former network is composed of a stack of Mobile-former modules, and each module is divided into three parts: a Mobile sub-module, a Former sub-module and a lightweight cross-attention module consisting of the two bridges Mobile→Former and Former→Mobile.
The Mobile-former module has two inputs: (a) a local feature map X ∈ R^{H×W×C}, where H is the height, W the width and C the number of channels, and (b) global tokens Z ∈ R^{M×d}, where M and d are the number and the dimension of the tokens, respectively. The module updates them and outputs X′ and Z′. The Mobile sub-module takes the feature map X as input and forms an inverted residual block built on depthwise separable convolution.
The Former sub-module consists of a multi-head self-attention mechanism (MHA) and a feed-forward network (FFN).
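A minimal sketch of such a Former sub-module is shown below, assuming pre-norm residual connections and an FFN expansion ratio of 2; these choices, and the token dimension, are illustrative assumptions rather than values stated in the patent.

```python
# Sketch of a Former sub-module: MHA over the global tokens followed by an FFN,
# both with residual connections (layer sizes are assumed).
import torch
import torch.nn as nn

class FormerSubModule(nn.Module):
    def __init__(self, dim=192, num_heads=4, ffn_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * ffn_ratio), nn.GELU(),
                                 nn.Linear(dim * ffn_ratio, dim))

    def forward(self, z):                                  # z: (B, M, d) global tokens
        h = self.norm1(z)
        z = z + self.mha(h, h, h, need_weights=False)[0]   # multi-head self-attention + residual
        z = z + self.ffn(self.norm2(z))                    # feed-forward network + residual
        return z

print(FormerSubModule()(torch.randn(2, 6, 192)).shape)     # torch.Size([2, 6, 192])
```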
The lightweight cross-attention is used to fuse the local feature map X with the global tokens Z. Specifically, the lightweight cross-attention from the local features to the global tokens (Mobile→Former) can be expressed as

A_{X \to Z} = [\mathrm{Attn}(\tilde{z}_i W_i^{Q}, \tilde{x}_i, \tilde{x}_i)]_{i=1:h} W^{O}    (1)

where the local feature map X and the global tokens Z are split into h heads, \tilde{x}_i and \tilde{z}_i denote the i-th head of X and Z (the head split \tilde{z}_i is different from the i-th token z_i ∈ R^d), W_i^{Q} is the query projection matrix of the i-th head, and W^{O} concatenates the h heads back together. \mathrm{Attn}(Q, K, V) = \mathrm{softmax}(QK^{T}/\sqrt{d_k})V is the standard attention function. In Mobile→Former the global tokens act as the queries (Q) and the local features act as the keys (K) and values (V), so only the query projection W^{Q} on the token side is kept, while the key and value projections W^{K} and W^{V} on the feature-map side are removed to save computation; Former→Mobile is the reverse. The global-to-local cross-attention (Former→Mobile) can be expressed as

A_{Z \to X} = [\mathrm{Attn}(\tilde{x}_i, \tilde{z}_i W_i^{K}, \tilde{z}_i W_i^{V})]_{i=1:h}    (2)

where [\cdot]_{1:h} denotes the concatenation of the h heads, and W_i^{K} and W_i^{V} are the key and value projection matrices on the Former side; the query projection matrix on the Mobile side is removed.
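For illustration, the two bridges of equations (1) and (2) could be sketched as below. Following the text, projections are kept only on the token side: Mobile→Former keeps W^Q (queries from the tokens) and drops W^K and W^V on the feature-map side, while Former→Mobile keeps W^K and W^V and drops W^Q. For simplicity the sketch assumes the flattened local features and the tokens share the same dimension d, which is not required in the actual network.

```python
# Sketch of the lightweight cross-attention bridges (assumes shared dimension d).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MobileToFormer(nn.Module):
    """Local feature map X -> global tokens Z (tokens act as queries)."""
    def __init__(self, dim=192, num_heads=4):
        super().__init__()
        self.h = num_heads
        self.q_proj = nn.Linear(dim, dim)   # W^Q, token side only
        self.out = nn.Linear(dim, dim)      # W^O, merges the heads

    def forward(self, x, z):                # x: (B, HW, d) flattened features, z: (B, M, d)
        B, N, d = x.shape
        hd = d // self.h
        q = self.q_proj(z).view(B, -1, self.h, hd).transpose(1, 2)   # (B, h, M, hd)
        kv = x.view(B, N, self.h, hd).transpose(1, 2)                # (B, h, HW, hd), no W^K / W^V
        attn = F.softmax(q @ kv.transpose(-2, -1) / hd ** 0.5, dim=-1)
        out = (attn @ kv).transpose(1, 2).reshape(B, -1, d)
        return z + self.out(out)

class FormerToMobile(nn.Module):
    """Global tokens Z -> local feature map X (features act as queries)."""
    def __init__(self, dim=192, num_heads=4):
        super().__init__()
        self.h = num_heads
        self.k_proj = nn.Linear(dim, dim)   # W^K, token side
        self.v_proj = nn.Linear(dim, dim)   # W^V, token side

    def forward(self, x, z):                # x: (B, HW, d), z: (B, M, d)
        B, N, d = x.shape
        hd = d // self.h
        q = x.view(B, N, self.h, hd).transpose(1, 2)                 # no W^Q on the feature side
        k = self.k_proj(z).view(B, -1, self.h, hd).transpose(1, 2)
        v = self.v_proj(z).view(B, -1, self.h, hd).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / hd ** 0.5, dim=-1)
        return x + (attn @ v).transpose(1, 2).reshape(B, N, d)

x, z = torch.randn(2, 49, 192), torch.randn(2, 6, 192)
z = MobileToFormer()(x, z)      # Mobile -> Former
x = FormerToMobile()(x, z)      # Former -> Mobile
print(x.shape, z.shape)         # torch.Size([2, 49, 192]) torch.Size([2, 6, 192])
```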
The stem part in the Mobile-former module is replaced by an ACmix module. The ACmix module has a larger receptive field, allowing the network to focus on important regions over a wider range. Consider a self-attention module with N heads operating on a feature map, where F ∈ R^{C_{in}×H×W} and G ∈ R^{C_{out}×H×W} denote the input and output, and f_{ij} and g_{ij} denote the feature vectors of pixel (i, j) in F and G, respectively. The multi-head self-attention output at pixel (i, j) is

g_{ij} = \Vert_{l=1}^{N} \Big( \sum_{a,b \in \mathcal{N}_k(i,j)} A\big(W_q^{(l)} f_{ij}, W_k^{(l)} f_{ab}\big) W_v^{(l)} f_{ab} \Big)    (3)

where \Vert denotes the concatenation of the N heads, W_q^{(l)}, W_k^{(l)} and W_v^{(l)} are the projection matrices of the queries, keys and values, \mathcal{N}_k(i, j) denotes the local pixel region of spatial extent k centered at (i, j), and A(W_q^{(l)} f_{ij}, W_k^{(l)} f_{ab}) is the attention weight assigned to the features within \mathcal{N}_k(i, j). Furthermore, multi-head self-attention can be decomposed into two stages:

Stage I:  q_{ij}^{(l)} = W_q^{(l)} f_{ij},  k_{ij}^{(l)} = W_k^{(l)} f_{ij},  v_{ij}^{(l)} = W_v^{(l)} f_{ij}    (4)

Stage II:  g_{ij} = \Vert_{l=1}^{N} \Big( \sum_{a,b \in \mathcal{N}_k(i,j)} A\big(q_{ij}^{(l)}, k_{ab}^{(l)}\big) v_{ab}^{(l)} \Big)    (5)
Ordinary convolution and self-attention share the same 1×1 convolution operations. Based on this, the ACmix module combines ordinary convolution with self-attention. As shown in fig. 3, the ACmix module runs in two stages. In the first stage, the input feature map is projected by three 1×1 convolutions and reshaped into N pieces, yielding a rich set of 3×N intermediate feature maps. In the second stage, these intermediate features are fed into two paths that follow the different computation patterns of convolution and self-attention, and new features are finally generated by aggregation and shifting. In the self-attention path, the three groups of feature maps produced by the 1×1 convolutions serve as the queries, keys and values of a multi-head self-attention module. In the convolution path with kernel size k, a fully connected layer generates k² feature maps; the input features are processed in the manner of ordinary convolution, collecting information from the local receptive field. Finally, the outputs of the two paths are added with learnable weights:

F_out = α F_att + β F_conv    (6)
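The two-path structure and the weighted sum of equation (6) can be illustrated with the simplified sketch below. It is an assumption-laden simplification, not the real ACmix: the self-attention path uses plain global attention over all pixels instead of ACmix's windowed attention, and a single convolution stands in for the fully connected layer plus shift-and-aggregate of the actual convolution path.

```python
# Simplified sketch of the ACmix idea: shared 1x1 projections feed an attention path and
# a convolution path, mixed by learnable weights alpha and beta (eq. (6)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACmixSketch(nn.Module):
    def __init__(self, channels=64, num_heads=4, kernel_size=3):
        super().__init__()
        self.h = num_heads
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)          # shared 1x1 projections
        self.conv_mix = nn.Conv2d(channels * 3, channels, kernel_size,
                                  padding=kernel_size // 2)                  # stands in for FC + shift
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        qkv = self.qkv(x)                                    # (B, 3C, H, W) intermediate features
        q, k, v = qkv.chunk(3, dim=1)

        # Self-attention path (global attention here, instead of ACmix's local windows).
        def heads(t):
            return t.reshape(B, self.h, C // self.h, H * W)
        q, k, v = heads(q), heads(k), heads(v)
        attn = F.softmax(q.transpose(-2, -1) @ k / (C // self.h) ** 0.5, dim=-1)  # (B, h, HW, HW)
        f_att = (v @ attn.transpose(-2, -1)).reshape(B, C, H, W)

        # Convolution path reuses the same intermediate features.
        f_conv = self.conv_mix(qkv)

        return self.alpha * f_att + self.beta * f_conv       # F_out = alpha*F_att + beta*F_conv

print(ACmixSketch()(torch.randn(1, 64, 28, 28)).shape)       # torch.Size([1, 64, 28, 28])
```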
The Mobile sub-module extracts the local features of the expression image; it is the most complex sub-module in the network and the key to extracting local features. However, the more Mobile sub-modules the network contains, the larger the overall parameter count of the model. This scheme therefore proposes a lighter and more efficient Mobile sub-module.
The original Mobile sub-module is shown in fig. 4(a). After the feature information enters the Mobile sub-module, the feature dimension is compressed by a point-wise convolution, the features are then spatially encoded by a depthwise convolution and weighted by an SE attention mechanism, and the feature information is output after being expanded by a second point-wise convolution. The original Mobile sub-module generates a large amount of feature-map redundancy when compressing and expanding the feature information, causing redundant computation. In addition, the SE modules introduce yet another attention mechanism into the network, and the interaction between the different attention mechanisms makes the network prone to overfitting during training.
To address these problems, the Mobile sub-module in the network is rebuilt with Ghost modules, making the whole network lighter and more efficient. The rebuilt structure is shown in fig. 4(b). The scheme uses 1×1 Ghost modules to rebuild the inverted residual structure. When the Ghost module transforms the dimension of the feature information, its configurable convolution kernels and cheap linear operations greatly reduce the feature-map redundancy caused by raising or lowering the dimension, thereby reducing the parameter count and computation of the whole network. At the same time, the SE attention mechanism is removed from all Mobile sub-modules, which reduces both the conflict between different attention mechanisms and the overall computational load of the network.
The Ghost module structure is shown in FIG. 5. It decomposes an ordinary convolution into two parts: the first part uses an ordinary convolution to generate a set of intrinsic feature maps Y′, and the second part uses cheap linear operations to augment the features and increase the number of channels.
Y′ = X * f′ + b    (7)

y_{ij} = \Phi_{i,j}(y′_i),  i = 1, …, m,  j = 1, …, s    (8)

where y′_i is the i-th intrinsic feature map in Y′, and \Phi_{i,j} is the cheap linear operation that generates the j-th ghost feature map from it, so each intrinsic feature map y′_i can generate one or more ghost feature maps {y_{ij}}_{j=1}^{s}. The last operation \Phi_{i,s} is the identity mapping, which preserves the original feature map.
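A minimal sketch of such a Ghost module is given below, using a 1×1 ordinary convolution for the intrinsic maps and a depthwise convolution as the cheap linear operation; the ratio of 2 and the 3×3 kernel are the usual GhostNet defaults, assumed here for illustration only.

```python
# Sketch of a Ghost module (eqs. (7)-(8)): ordinary conv for intrinsic maps,
# cheap depthwise conv for the extra "ghost" maps (defaults assumed).
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3, relu=True):
        super().__init__()
        init_ch = math.ceil(out_ch / ratio)             # intrinsic maps from the ordinary conv
        cheap_ch = init_ch * (ratio - 1)                # ghost maps from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),   # Y' = X * f'  (eq. (7))
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity())
        self.cheap = nn.Sequential(                     # Phi_{i,j}: cheap depthwise ops (eq. (8))
            nn.Conv2d(init_ch, cheap_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity())
        self.out_ch = out_ch

    def forward(self, x):
        y_prime = self.primary(x)                       # intrinsic feature maps
        y_ghost = self.cheap(y_prime)                   # ghost feature maps
        return torch.cat([y_prime, y_ghost], dim=1)[:, :self.out_ch]  # identity kept via concat

print(GhostModule(16, 32)(torch.randn(1, 16, 56, 56)).shape)   # torch.Size([1, 32, 56, 56])
```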
To sum up: the model improves the efficiency of combining local features and global features in the process of inputting images through a network on the basis of selecting a Mobile-former, and simultaneously maintains the light weight of the network. An ACmix module is first used to replace the original stem part in the network. ACmix combines the common convolution with a self-attention mechanism so that the network can obtain a larger receptive field when the input image is subjected to preliminary processing. Secondly, introducing a mobile sub-module, replacing point-to-point convolution in the depth separable convolution by a Ghost module, and reducing feature map redundancy generated when a network extracts features by utilizing linear operation in the Ghost module. In addition, the text model deletes the SE attention mechanism in the mobile sub-module, reduces overfitting caused by various attention mechanisms, and improves interaction efficiency of local and global features.
Example 2 case analysis
1 Experimental environment
The software and hardware configurations used in the experiments are shown in Table 1; all compared networks were run on the same platform.
2 Datasets
RAF-DB is a large facial expression database containing 30,000 facial images downloaded from the Internet. The dataset shows great diversity in the age, gender, head pose, lighting conditions and occlusion of the subjects. The training set used in the experiments contains 12,271 images and the validation set contains 3,068 images.
CK+ was extended in 2010 from the Cohn-Kanade dataset and is one of the most commonly used facial expression datasets, comprising 593 image sequences from 123 participants. The CK dataset contains static images, whereas CK+ contains both static images and image sequences; both datasets include emotion labels describing the participants' expressions.
Partial samples of the datasets are shown in fig. 6.
3 Experimental results and analysis
In order to test the performance of the model in expression recognition, training and testing were carried out on the RAF-DB and CK+ datasets, respectively. The weights were randomly initialized during training, and the NNI toolkit was used to count the number of parameters and the amount of computation. Because transfer learning has a large influence on accuracy on the RAF-DB dataset and makes it hard to judge the performance of the model itself, no pre-training or transfer learning was used. The training settings are shown in Table 2, and a minimal sketch of the corresponding optimizer setup is given after the table.
Table 2 Training settings

Dataset    Batch size    Learning rate    Optimizer    Momentum
RAF-DB     300           0.01             Adam         0.9
CK+        150           0.01             SGD          0.9
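For illustration, the settings in Table 2 could be mapped onto optimizer objects roughly as follows; the model, the loss function and anything beyond the listed hyper-parameters (for example the epoch count) are assumptions, and the 0.9 momentum entry is applied explicitly only in the SGD case, since Adam's momentum-like behaviour is governed by its default betas.

```python
# Minimal optimizer setup matching Table 2 (model and loss are placeholders).
import torch
import torch.nn as nn

def build_optimizer(model, dataset):
    if dataset == "RAF-DB":          # batch size 300, Adam, learning rate 0.01
        return torch.optim.Adam(model.parameters(), lr=0.01)
    if dataset == "CK+":             # batch size 150, SGD, learning rate 0.01, momentum 0.9
        return torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    raise ValueError(f"unknown dataset: {dataset}")

model = nn.Linear(10, 7)             # stand-in for the expression recognition network
optimizer = build_optimizer(model, "CK+")
criterion = nn.CrossEntropyLoss()    # assumed classification loss (not stated in the table)
```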
3.1 Comparison experiments on the improved modules
Table 3 Comparison results of the improved modules

Model                        RAF-DB     CK+        Params
Mobile-former                76.53%     93.97%     12.84M
Mobile-former + LE-mobile    78.58%     95.96%     11.78M
Mobile-former + ACmix        77.21%     94.87%     12.86M
Ours                         79.56%     96.97%     11.79M
In order to analyze the influence of the proposed improvements on accuracy, parameter count and network complexity, this experiment compares different combinations of the improvements on top of the Mobile-former expression recognition model. The comparison results for each module are shown in Table 3. By combining the advantages of MobileNetV3 in local processing and the Transformer in global interaction, Mobile-former extracts expression features efficiently, reaching 76.53% accuracy on RAF-DB and 93.97% on CK+ with 12.84M parameters. When the new Mobile sub-module is introduced, the Ghost module uses linear operations to reduce the redundant feature maps produced by the convolution layers, which effectively cuts redundant computation and improves the efficiency of expression feature extraction. Meanwhile, removing the SE attention mechanism from the original Mobile sub-module reduces the overfitting caused by having several different attention mechanisms and strengthens the network's ability to extract expression features. Introducing this module improves accuracy on the two datasets by 2.02% and 1.99%, respectively, while reducing the parameter count by 1.06M. ACmix enlarges the receptive field of the convolution when processing images by combining ordinary convolution with self-attention. Replacing the original convolution layer in the stem with the ACmix module allows the network to obtain a global receptive field during the preliminary filtering and merging of the input image. Introducing the ACmix module improves accuracy on the two datasets by 0.68% and 0.9%, while Params barely increases thanks to the lightweight design of ACmix. When both improvements are applied to the base network simultaneously, accuracy on the two datasets improves by 3.03% and 3%, respectively, while Params decreases by 1.05M. The proposed improvements therefore raise expression recognition accuracy to different degrees, and the introduction of the Ghost module clearly reduces the overall parameter count and floating-point operations of the network, preserving its lightweight character.
Example 3
The embodiment provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the expression recognition method based on the improved Mobile-former provided by the embodiment 1 when executing the computer program.
Example 4
The present embodiment provides a computer readable storage medium having a computer program stored thereon, wherein the program when executed by a processor implements an expression recognition method based on an improved Mobile-former provided in embodiment 1 of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The solutions in the embodiments of the present application may be implemented in various computer languages, for example, the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (6)

1. An expression recognition method based on an improved Mobile-former, characterized by comprising the following steps:
S1, inputting an expression image into an improved Mobile-former module, where an ACmix module of the Mobile part performs preliminary filtering and merging of the image and extracts feature information;
S2, the extracted feature information is transmitted to the Mobile sub-module and is also input into the Former sub-module through the lightweight cross-attention module; the Mobile sub-module extracts local features of the expression image while the Former sub-module extracts global features of the expression image, and the lightweight cross-attention module fuses the two kinds of features bidirectionally;
S3, the fused feature information is output through the Mobile sub-module and serves as the input of the next Mobile-former module;
S4, after the expression features are extracted by a plurality of Mobile-former modules, the feature information is output to the classifier for classification through the Mobile sub-module in the last Mobile-former module.
2. The improved Mobile-former-based expression recognition method of claim 1, wherein: the stem part in the Mobile-former module is replaced by an ACmix module.
3. The improved Mobile-former-based expression recognition method of claim 1, wherein: the lightweight cross-attention module consists of two bidirectional bridges that connect the Mobile sub-module and the Former sub-module and fuse the local and global features in both directions.
4. The improved Mobile-former-based expression recognition method of claim 1, wherein: the point-wise convolutions in the Mobile sub-module are replaced with Ghost modules, and the SE attention mechanism in the Mobile sub-module is removed.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements an improved Mobile-former based expression recognition method as claimed in any one of claims 1 to 4 when the computer program is executed by the processor.
6. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements an improved Mobile-former based expression recognition method as claimed in any one of claims 1 to 4.
CN202310289138.7A 2023-03-23 2023-03-23 Expression recognition method based on improved Mobile-former Pending CN116311455A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310289138.7A CN116311455A (en) 2023-03-23 2023-03-23 Expression recognition method based on improved Mobile-former

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310289138.7A CN116311455A (en) 2023-03-23 2023-03-23 Expression recognition method based on improved Mobile-former

Publications (1)

Publication Number Publication Date
CN116311455A true CN116311455A (en) 2023-06-23

Family

ID=86799361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310289138.7A Pending CN116311455A (en) 2023-03-23 2023-03-23 Expression recognition method based on improved Mobile-former

Country Status (1)

Country Link
CN (1) CN116311455A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721351A (en) * 2023-07-06 2023-09-08 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 Remote sensing intelligent extraction method for road environment characteristics in overhead line channel

Similar Documents

Publication Publication Date Title
Gao et al. A mutually supervised graph attention network for few-shot segmentation: The perspective of fully utilizing limited samples
Gao et al. MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
Liu et al. Face parsing via recurrent propagation
Zhang et al. Lightweight and efficient asymmetric network design for real-time semantic segmentation
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
Dandıl et al. Real-time facial emotion classification using deep learning
CN110852295B (en) Video behavior recognition method based on multitasking supervised learning
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114863539A (en) Portrait key point detection method and system based on feature fusion
CN116311455A (en) Expression recognition method based on improved Mobile-former
Yi et al. Elanet: effective lightweight attention-guided network for real-time semantic segmentation
CN113657272B (en) Micro video classification method and system based on missing data completion
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
Gao et al. Generalized pyramid co-attention with learnable aggregation net for video question answering
Xie et al. Facial expression recognition through multi-level features extraction and fusion
Zhao et al. Position fusing and refining for clear salient object detection
Jiang et al. Transformer-Based Fused Attention Combined with CNNs for Image Classification
Luo et al. Temporal-aware mechanism with bidirectional complementarity for video q&a
Tan et al. PPEDNet: Pyramid pooling encoder-decoder network for real-time semantic segmentation
Xu et al. A facial expression recognition method based on residual separable convolutional neural network
Liu et al. Face expression recognition based on improved convolutional neural network
Putro et al. A Fast Real-time Facial Expression Classifier Deep Learning-based for Human-robot Interaction
Qin et al. Research on Semantic Segmentation Algorithm for Autonomous Driving Based on Improved DeepLabv3+
Rui et al. Fast Real-time Semantic Segmentation Network with an Asymmetric Encoder-Decoder Structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination