CN114429659A - Self-attention-based stroke patient expression recognition method and system - Google Patents

Self-attention-based stroke patient expression recognition method and system

Info

Publication number
CN114429659A
Authority
CN
China
Prior art keywords
expression
data set
vit
stroke patient
training
Prior art date
Legal status
Pending
Application number
CN202210087537.0A
Other languages
Chinese (zh)
Inventor
朱啸宇
陆小锋
曾凤珍
宋海洋
朱民耀
王鹤玮
贾杰
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202210087537.0A
Publication of CN114429659A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a self-attention-based stroke patient expression recognition method and system, comprising the following steps: acquiring a stroke patient expression data set, constructing a ViT-based expression recognition model, training the ViT-based expression recognition model with a facial expression data set and the stroke patient expression data set, and recognizing the category of a stroke patient's expression with the trained ViT-based expression recognition model. According to the invention, a ViT-based expression recognition model is constructed, a pre-training model is obtained by pre-training with a facial expression data set as input, and the pre-training model is then trained with the enhanced stroke patient expression data set as input, so that the ViT-based expression recognition model can recognize stroke patient expressions in a lightweight manner.

Description

Self-attention-based stroke patient expression recognition method and system
Technical Field
The invention relates to the field of artificial intelligence, in particular to a stroke patient expression recognition method and system based on self-attention.
Background
Facial expressions are the most important way for humans to express their emotions and play an important role in interpersonal interaction. Facial Expression Recognition (FER) is therefore a mainstream emotion perception method at present; compared with emotion perception methods based on measuring physiological indicators such as heart rate and blood pressure, facial expression recognition is non-contact, simple to measure and highly accurate, and it is one of the research hotspots in the field of human-computer interaction. As early as the 1970s, Ekman and Friesen defined 6 basic emotions on the basis of cross-cultural studies: anger, disgust, happiness, sadness, fear and surprise; later work added a neutral expression on this basis, forming the 7 basic expression classes that are mainstream today, and the main task of facial expression recognition is to classify expression pictures into these 7 basic expressions. Expression recognition technology is widely applied, for example in fatigue-driving detection, human-computer interaction, personalized recommendation and online teaching, where recognizing and analysing the user's emotional information from expressions makes more personalized products possible. In the field of rehabilitation medicine, however, the application of expression recognition is still at an early stage, and problems such as the low expression recognition rate for patients and equipment limitations hinder the further popularization of expression recognition technology in the medical field.
As a deep learning model based on the self-attention mechanism, the Transformer has achieved great success in natural language processing (NLP), and this class of models has since become a hotspot in machine learning. Because of its excellent robustness and interpretability, the Transformer is also widely used in other fields such as computer vision (CV), audio processing and the life sciences. In the CV field, Transformer-based research has made major breakthroughs in image classification, object detection, semantic segmentation, image super-resolution reconstruction and other vision tasks. In particular, the classification accuracy of ViT (Vision Transformer) exceeds that of mainstream convolutional neural network (CNN) models after pre-training on large data sets, and research has found that the ViT model has strong anti-occlusion capability and shape interpretation capability. However, ViT interprets texture information more weakly than CNNs, has a large number of model parameters and requires pre-training on large data sets, which makes it difficult to apply ViT to the field of expression recognition.
Disclosure of Invention
The invention aims to provide a self-attention-based stroke patient expression recognition method and system which can recognize stroke patient expressions in a lightweight manner.
In order to achieve the purpose, the invention provides the following scheme:
a stroke patient expression recognition method based on self-attention, the method comprising:
acquiring an expression data set of a stroke patient;
constructing a ViT-based expression recognition model;
training the ViT-based expression recognition model through a facial expression data set and the stroke patient expression data set;
identifying the category of the stroke patient's expression through the trained ViT-based expression recognition model.
Optionally, the obtaining of the expression dataset of the stroke patient specifically includes:
acquiring an expression video of a stroke patient;
labeling the expression video to obtain segmented videos of various expressions;
performing picture processing on each segmented video to obtain picture groups of various expressions; the pictures in the picture group comprise an expression starting state, an expression ending state and an expression peak value state;
framing the faces in the picture groups of the various expressions to obtain face pictures;
and zooming the face picture to obtain the stroke patient expression data set.
Optionally, the ViT-based expression recognition model includes: a convolution embedding module, a Transformer module, a multi-channel feature extraction module and an output module which are connected in sequence.
Optionally, the Transformer module includes: encoder and block integration networks.
Optionally, the training of the ViT-based expression recognition model through the facial expression dataset and the stroke patient expression dataset specifically includes:
pre-training the ViT-based expression recognition model by taking a facial expression data set as input and the confidence degrees of various expressions as output to obtain a pre-training model; the expression category with the highest confidence coefficient is a prediction result of the pre-training model;
enhancing the expression data set of the stroke patient to obtain an enhanced expression data set;
training the pre-training model by taking the enhanced expression data set as input and the confidence degrees of various expressions as output to obtain a trained ViT-based expression recognition model; the expression class with the highest confidence is the prediction result of the trained ViT-based expression recognition model.
A self-attention based stroke patient expression recognition system, the system comprising:
a data set acquisition unit, used for acquiring an expression data set of a stroke patient;
a construction unit, used for constructing a ViT-based expression recognition model;
a training unit for training the ViT-based expression recognition model by a facial expression dataset and the stroke patient expression dataset;
and the identification unit is used for identifying the category of the expression of the stroke patient through a trained ViT-based expression identification model.
Optionally, the data set obtaining unit includes:
the expression video acquisition subunit is used for acquiring an expression video of the stroke patient;
the labeling subunit is used for labeling the expression videos to obtain segmented videos of various expressions;
the picture processing subunit is used for carrying out picture processing on each segmented video to obtain picture groups with various expressions; the pictures in the picture group comprise an expression starting state, an expression ending state and an expression peak value state;
the face picture determining subunit is used for framing the faces in the picture groups with various expressions to obtain a face picture;
and the data set determining subunit is used for scaling the facial picture to obtain the stroke patient expression data set.
Optionally, the ViT-based expression recognition model includes: a convolution embedding module, a Transformer module, a multi-channel feature extraction module and an output module which are connected in sequence.
Optionally, the Transformer module includes: encoder and block integration networks.
Optionally, the training unit includes:
the pre-training model determining subunit is used for pre-training the ViT-based expression recognition model by taking the facial expression data set as input and the confidence degrees of various expressions as output to obtain a pre-training model; the expression category with the highest confidence coefficient is a prediction result of the pre-training model;
the enhanced data set subunit is used for enhancing the stroke patient expression data set to obtain an enhanced expression data set;
an expression recognition model determining subunit, configured to train the pre-training model by using the enhanced expression data set as an input and using confidence degrees of various expressions as an output, so as to obtain a trained ViT-based expression recognition model; and the expression category with the highest confidence coefficient is the prediction result of the trained ViT-based expression recognition model.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a stroke patient expression recognition method and system based on self-attention, which comprises the following steps: the method comprises the steps of obtaining an expression data set of a stroke patient, constructing an ViT-based expression recognition model, training the ViT-based expression recognition model through a facial expression data set and the stroke patient expression data set, and recognizing the category of the stroke patient expression through the trained ViT-based expression recognition model. According to the invention, an ViT-based expression recognition model is constructed, a pre-training model is obtained by pre-training with a facial expression data set as input, and then the pre-training model is trained with a stroke patient expression data set enhanced data set as input, so that the ViT-based expression recognition model can recognize the expression of a light-weight stroke patient.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the self-attention-based stroke patient expression recognition method of the present invention;
FIG. 2 is a histogram of the categories and numbers of samples in the stroke patient expression data set of the present invention;
FIG. 3 is a diagram of the ViT-based expression recognition model of the present invention;
FIG. 4 is a structural diagram of the block integration network of the present invention;
Fig. 5 is a block diagram of the self-attention-based stroke patient expression recognition system of the present invention;
FIG. 6 is a confusion matrix of the ViT-based expression recognition model of the present invention on RAF-DB;
FIG. 7 is a confusion matrix of ViT on RAF-DB.
Description of the symbols:
data set acquisition unit-1; construction unit-2; training unit-3; recognition unit-4.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a self-attention-based stroke patient expression recognition method and system which can recognize stroke patient expressions in a lightweight manner.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the method for recognizing an expression of a stroke patient based on self-attention provided by the invention comprises the following steps:
step S1: an expression data set of a stroke patient is obtained.
Step S1 specifically includes:
step S11: the expression video of the stroke patient is obtained.
Specifically, the stroke patient expression data set is obtained by collecting expression data from 37 real stroke patients in the rehabilitation department of a rehabilitation hospital. The acquisition process is as follows: each subject is asked to imitate the 7 basic expression types according to software prompts; the imitated basic expressions are recorded, and the subject's expression changes during the imitation are also captured, yielding 568 expression video segments with a resolution of 1280 × 720 and a frame rate of 20 FPS.
Step S12: and labeling the expression video to obtain segmented videos of various expressions.
Specifically, the basic expression videos are labeled by five coders trained in expression recognition coding; videos of the same expression type are given the same label, and the videos of each expression type of each stroke patient form the segmented videos of that expression type. In addition, two further expression types, forceful exertion and pain, are captured from the patients' real rehabilitation training video segments, giving nine types of expression video segments in total.
Step S13: performing picture processing on each segmented video to obtain picture groups of various expressions; the pictures in the picture group comprise an expression starting state, an expression ending state and an expression peak value state.
Specifically, each segmented video is processed into pictures to obtain a picture group for each expression of each stroke patient; 3-5 representative pictures are selected from each video segment, and their contents include the expression starting state, the expression ending state and the expression peak value state.
Step S14: and framing the faces in the picture group with various expressions to obtain a face picture.
Specifically, the face in the picture is cut out separately to obtain a face picture.
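For illustration only, a minimal sketch of the face-cropping step is given below; the disclosure does not specify which face detector is used, so the OpenCV Haar-cascade detector and all names here are assumptions rather than the disclosed implementation.

```python
import cv2

# Illustrative stand-in face detector (the patent does not name one).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(frame_bgr):
    """Crop the face region from one representative frame (sketch)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                        # no face found in this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])     # keep the largest detection
    return frame_bgr[y:y + h, x:x + w]                     # cropped face picture
```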
Step S15: and zooming the face picture to obtain an expression data set of the stroke patient.
Specifically, the size of the face picture is uniformly scaled to 128 × 128, 1143 expression pictures are obtained as an expression data set of the stroke patient, and the expression data set of the stroke patient is as follows: the scale of 1 is randomly divided into a training set and a validation set. Further, the category and number of the expression data sets of the stroke patient are shown in fig. 2.
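A minimal PyTorch/torchvision sketch of the scaling and splitting step follows; the folder layout, the split size and all names are assumptions made for illustration and are not part of the disclosed implementation.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Scale every cropped face picture to 128 x 128 and convert it to a tensor.
preprocess = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

# Hypothetical folder layout: stroke_faces/<expression_class>/<image>.png
full_set = datasets.ImageFolder("stroke_faces", transform=preprocess)

# Random split into training and validation subsets (split sizes are illustrative only).
val_len = len(full_set) // 10
train_set, val_set = random_split(full_set, [len(full_set) - val_len, val_len])
```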
Step S2: a ViT-based expression recognition model is constructed.
Specifically, as shown in fig. 3, the ViT-based expression recognition model includes a convolution embedding module, a Transformer module, a multi-channel feature extraction module and an output module which are connected in sequence.
Further, the convolution embedding module specifically includes:
the model input is a three channel gray scale image with 128 x 128 pixels, i.e., 3 x 128 gray scale image. The image is converted into Block feature vectors required by a transform network by convolution through a convolution Embedding module (CE Block). The convolution embedding module is divided into a characteristic convolution part and a block embedding part.
The Feature convolution (Feature Conv) performs convolution with a step size of 2 by a convolution kernel of 64 × 7, and outputs an image Feature map of 64 × 64 by a pooling layer of 2 × 2.
Block Embedding (Patch Embedding) performs a convolution with step size 2 using multiple convolution kernels 64 x 2, compressing the input into a Patch input TF network with sequence length N of 16 x 16.
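A minimal PyTorch sketch of a convolution embedding block of this kind follows. The exact kernel sizes, channel counts and strides are only partially recoverable from the text, so the values below (a 7 × 7, stride-2 feature convolution with 64 channels, 2 × 2 pooling, and a 2 × 2, stride-2 patch-embedding convolution yielding a 16 × 16 patch grid) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvEmbedding(nn.Module):
    """Sketch of a CE Block: feature convolution followed by patch embedding."""

    def __init__(self, in_ch=3, feat_ch=64, embed_dim=128):
        super().__init__()
        # Feature Conv: 7x7 convolution with stride 2, then 2x2 pooling.
        self.feature_conv = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(feat_ch),
            nn.GELU(),
            nn.MaxPool2d(kernel_size=2),
        )
        # Patch Embedding: stride-2 convolution that turns the feature map into patch tokens.
        self.patch_embed = nn.Conv2d(feat_ch, embed_dim, kernel_size=2, stride=2)

    def forward(self, x):                       # x: (B, 3, 128, 128)
        x = self.feature_conv(x)                # (B, 64, 32, 32)
        x = self.patch_embed(x)                 # (B, embed_dim, 16, 16)
        return x.flatten(2).transpose(1, 2)     # (B, N = 16 * 16, embed_dim) token sequence
```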
Further, the Transformer module specifically includes:
and a transform module (TF Block) is a backbone network based on the expression recognition model of ViT and is used for processing the feature vectors output by the convolution embedding module. The Transformer module is divided into two parts of Transformer Encoder and block integration (batch Aggregate).
The Transformer Encoder converts the input feature blocks into 3 × k × d feature vectors through a linear transformation and then splits them, by dimension transformation, into the feature matrices Q (queries), K (keys) and V (values), where d is the number of parameters of each attention head, thereby implementing the Transformer's self-attention mechanism. The self-attention module is computed as
A = softmax(Q·K^T / √D_K), with module output A·V,
where D_K is the dimension of K, k is the set number of attention heads, and A is the self-attention matrix.
By analysing this matrix, the regions that the model attends to can be analysed, and outputting the self-attention matrix makes it possible to visualize the model's attention. The output of the Transformer Encoder is integrated through a feed-forward network composed of 1 × 1 convolutions and the Gaussian error linear unit (GELU) activation and passed to the block integration structure.
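The self-attention computation described above can be sketched as a generic multi-head self-attention layer following A = softmax(Q·K^T / √D_K) with output A·V; the head count and embedding dimension below are assumed values, and returning the attention matrix A is included because the description relies on it for visualization.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Generic multi-head self-attention: A = softmax(Q K^T / sqrt(D_K)), output = A V."""

    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads            # D_K, the per-head dimension
        self.qkv = nn.Linear(dim, dim * 3)          # linear map producing Q, K and V
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, return_attn=False):        # x: (B, N, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.softmax(dim=-1)                 # the self-attention matrix A
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        out = self.proj(out)
        return (out, attn) if return_attn else out
```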
As shown in fig. 4, the block integration network (Patch Aggregate) compresses the volume of the block feature data; the network includes a 3 × 3 convolution with a stride of 2, a BN layer and a 2 × 2 pooling layer, which extract the vector features and accelerate model training. After two rounds of TF modules, the output feature vector is 576 × 4, where d_i is the number of input channels, d_o is the number of output channels, h is the length of the feature block and w is the width of the feature block.
The computational complexity and memory footprint of the self-attention matrix grow with the square of the input sequence length N, i.e. as N², so the computation of the self-attention matrix A becomes the bottleneck for using the model. The designed block integration network reduces the complexity and memory footprint of the self-attention matrix computation, which speeds up the operation and improves the fitting speed.
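A block integration (Patch Aggregate) step of this kind, interpreted as reshaping the token sequence back onto its patch grid and compressing it with a stride-2 3 × 3 convolution, batch normalization and 2 × 2 pooling, might be sketched as follows; the channel counts and the grid-reshaping convention are assumptions.

```python
import torch
import torch.nn as nn

class PatchAggregate(nn.Module):
    """Sketch: spatially compress the patch tokens so the sequence length N shrinks."""

    def __init__(self, in_dim=128, out_dim=256):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_dim),
            nn.MaxPool2d(kernel_size=2),
        )

    def forward(self, tokens, grid):                            # tokens: (B, N, C), N = grid * grid
        B, N, C = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, C, grid, grid)    # back to a 2-D patch grid
        x = self.reduce(x)                                      # stride-2 conv + BN + 2x2 pooling
        return x.flatten(2).transpose(1, 2)                     # shorter token sequence
```

Because the stride-2 convolution and the pooling shrink the patch grid, the quadratic cost of the following self-attention layer drops accordingly.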
Further, the multi-channel feature extraction module specifically includes:
the Multi-channel Feature extraction module (MCFE Block) performs channel extraction on the processed Feature vectors, thereby reducing data loss during Feature extraction. The multi-Channel Feature extraction module comprises block Reconstruction (Patch Reconstruction) and Channel Feature Pooling (CFP).
Wherein the block reconstruction upsamples the feature vector to 4 × 48 spatial dimensions, and performs convolution with step size 14 for each layer of features separately using convolution kernel 12 × 12, resulting in a feature map of 4 × 10.
The channel feature pooling outputs a 1 × 10 channel feature vector. Its calculation combines the channel-wise average and the channel-wise maximum at each position (i, j), and the weight a applied to the maximum can be trained by the model.
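Reading channel feature pooling as a learnable blend of channel-wise maximum and average maps, a minimal sketch is given below; the convex-combination form and the initial value of the weight a are assumptions made for illustration, not the formula of the disclosure.

```python
import torch
import torch.nn as nn

class ChannelFeaturePooling(nn.Module):
    """Sketch: blend channel-wise max and average maps with a learnable weight a."""

    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.5))        # weight of the maximum, trained with the model

    def forward(self, x):                               # x: (B, C, H, W)
        max_map = x.max(dim=1, keepdim=True).values     # channel-wise maximum at each position (i, j)
        avg_map = x.mean(dim=1, keepdim=True)           # channel-wise average at each position (i, j)
        return self.a * max_map + (1.0 - self.a) * avg_map   # (B, 1, H, W)
```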
Further, the output module specifically includes:
the output module recombines the 10 × 10 characteristic values into one dimension, and the confidence coefficients of 7 types of basic expressions are obtained through full connection, normalization and Sigmoid functions, and the type with the highest confidence coefficient is the model prediction result. The output module is also called a linear classifier.
Step S3: the ViT-based expression recognition model is trained with a facial expression data set and the stroke patient expression data set.
Step S3 specifically includes:
step S31: pre-training an ViT-based expression recognition model by taking a facial expression data set as input and confidence degrees of various expressions as output to obtain a pre-training model; and the expression category with the highest confidence coefficient is the prediction result of the pre-training model.
Specifically, the ViT-based expression recognition model is pre-trained using the Fer2013 data set. This facial expression data set includes 7 basic expression classes and 35886 facial expression pictures; 28708 pictures are used for model training and 3589 pictures for model validation. Each picture is a 48 × 48 pixel grayscale image, and the faces are not normalized. During training, each image is converted into a three-channel image of 224 × 224 pixels, and pixel values are normalized from [0, 255] to the [0, 1] interval according to the mean and variance of all images in the data set. The model is pre-trained for 100 rounds with the batch size set to 120 and a learning rate Lr of 0.0003; the loss is computed with a cross-entropy loss function and optimized with the Adam optimizer.
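A minimal pre-training loop consistent with these settings (cross-entropy loss, Adam, learning rate 0.0003, batch size 120, 100 rounds) might look as follows; `model` and `fer2013_train` are placeholders for the ViT-based model and the pre-processed Fer2013 training set.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def pretrain(model, fer2013_train, epochs=100, batch_size=120, lr=3e-4):
    """Sketch of the pre-training stage on the facial expression data set."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    loader = DataLoader(fer2013_train, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()                        # cross-entropy loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam optimizer, Lr = 0.0003
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            scores = model(images)                           # class scores for the 7 expressions
            loss = criterion(scores, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```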
The facial expression data set has 7 expression labels, with the following correspondence:
0: anger; 1: disgust; 2: fear; 3: happy; 4: normal; 5: sad; 6: surprise.
Step S32: and enhancing the expression data set of the stroke patient to obtain the enhanced expression data set.
Specifically, the ViT-based expression recognition model has strong interpretation capability and therefore easily overfits on a small data set, so the data set needs to be enhanced to relieve overfitting, giving the enhanced expression data set. The data set is enhanced by: horizontally flipping 50% of the pictures, randomly occluding 2%-10% of each picture's area, and adding Gaussian white noise with a mean of 0 and a variance of 30. These operations effectively reduce model overfitting.
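The three enhancement operations can be sketched with torchvision transforms as below; the occlusion probability and the assumption that the variance-30 noise is expressed on the 0-255 pixel scale are illustrative choices, and the Gaussian-noise transform is a small custom class because torchvision has no built-in additive-noise transform parameterized by variance.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add white Gaussian noise with mean 0 and a given variance (assumed on the 0-255 scale)."""

    def __init__(self, mean=0.0, var=30.0):
        self.mean, self.std = mean, var ** 0.5

    def __call__(self, tensor):                              # tensor with values in [0, 1]
        noise = torch.randn_like(tensor) * (self.std / 255.0) + self.mean
        return (tensor + noise).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                  # flip 50% of the pictures
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.10)),     # occlude 2%-10% of the area (p assumed)
    AddGaussianNoise(mean=0.0, var=30.0),                    # mean-0, variance-30 white noise
])
```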
Step S33: training the pre-training model by taking the enhanced expression data set as input and the confidence degrees of various expressions as output to obtain a trained ViT-based expression recognition model; the expression class with the highest confidence coefficient is the prediction result of the trained ViT-based expression recognition model.
Specifically, transfer learning is performed on the pre-training model to optimize its parameters: the model parameters other than those of the output module are locked, and the enhanced expression data set is used for transfer learning of the pre-training model. Pixel values are normalized from [0, 255] to the [0, 1] interval according to the mean and variance computed over the whole data set before being input to the model. The cross-entropy loss is computed from the model output and the correct result, and the network is iteratively optimized with the Adam optimizer.
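The parameter-locking step can be sketched as freezing everything except the output module before fine-tuning on the enhanced data set; the attribute name `output_head` is a hypothetical name for the output module rather than one taken from the disclosure.

```python
import torch

def prepare_transfer_learning(model, lr=3e-4):
    """Sketch: lock all parameters except the output module, then build the optimizer."""
    for param in model.parameters():
        param.requires_grad = False                 # lock the pre-trained backbone
    for param in model.output_head.parameters():    # 'output_head' is a hypothetical attribute name
        param.requires_grad = True                  # only the output module keeps learning
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)       # optimizer over the unlocked parameters only
```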
The self-attention matrix A of the ViT-based expression recognition model is output during training, and the attention heat map of the model is calculated and drawn so that the model's learning of the expression recognition task can be understood. The number of model blocks and the model depth are adjusted and training is repeated until the desired model accuracy is reached.
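One simple way to turn an exported self-attention matrix A into a heat map over the input picture is sketched below, assuming A has the shape produced by the SelfAttention sketch above; averaging over heads and query positions and using bilinear upsampling are illustrative choices.

```python
import torch
import torch.nn.functional as F

def attention_heatmap(attn, grid, image_size=128):
    """Sketch: average the self-attention matrix A over heads and queries, then upsample."""
    # attn: (B, heads, N, N) with N = grid * grid
    weights = attn.mean(dim=1).mean(dim=1)                   # (B, N): mean attention received per patch
    heatmap = weights.reshape(-1, 1, grid, grid)             # back onto the patch grid
    heatmap = F.interpolate(heatmap, size=(image_size, image_size),
                            mode="bilinear", align_corners=False)
    return heatmap / heatmap.amax(dim=(-2, -1), keepdim=True)  # normalize each map to [0, 1]
```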
Step S4: the category of the stroke patient expression is identified through a trained ViT-based expression recognition model. Specifically, the expression category with the highest confidence coefficient is the prediction result of the trained ViT-based expression recognition model.
As shown in fig. 5, the system for recognizing expressions of stroke patients based on self-attention provided by the present invention comprises: a data set acquisition unit 1, a construction unit 2, a training unit 3 and a recognition unit 4.
Specifically, the data set obtaining unit 1 is used for obtaining an expression data set of a stroke patient.
The construction unit 2 is used for constructing a ViT-based expression recognition model.
And the training unit 3 is used for training the ViT-based expression recognition model through the facial expression data set and the stroke patient expression data set.
And the recognition unit 4 is used for recognizing the category of the expression of the stroke patient through a trained ViT-based expression recognition model.
The data set obtaining unit 1 specifically includes:
and the expression video acquisition subunit is used for acquiring the expression video of the stroke patient.
And the labeling subunit is used for labeling the expression videos to obtain segmented videos of various expressions.
The picture processing subunit is used for carrying out picture processing on each segmented video to obtain picture groups with various expressions; the pictures in the picture group comprise an expression starting state, an expression ending state and an expression peak value state.
And the face picture determining subunit is used for framing faces in the picture groups with various expressions to obtain the face pictures.
And the data set determining subunit is used for zooming the face picture to obtain an expression data set of the stroke patient.
The ViT-based expression recognition model specifically includes:
a convolution embedding module, a Transformer module, a multi-channel feature extraction module and an output module which are connected in sequence. Specifically, the Transformer module includes: Encoder and block integration networks.
Wherein the training unit 3 comprises:
the pre-training model determining subunit is used for pre-training the ViT-based expression recognition model by taking the facial expression data set as input and the confidence degrees of various expressions as output to obtain a pre-training model; and the expression category with the highest confidence coefficient is the prediction result of the pre-training model.
And the enhancement data set subunit is used for enhancing the expression data set of the stroke patient to obtain an enhanced expression data set.
The expression recognition model determining subunit is used for training the pre-training model by taking the enhanced expression data set as input and the confidence degrees of various expressions as output, so as to obtain a trained ViT-based expression recognition model; the expression category with the highest confidence coefficient is the prediction result of the trained ViT-based expression recognition model.
The present application tests the proposed model on different mainstream facial expression data sets, RAF-DB and FERPlus, and compares it with mainstream expression recognition models in terms of parameter count and accuracy. The performance of the model is analysed by methods such as analysing the confusion matrix and the model attention map, and the advantage of the model in recognizing stroke patients is then verified by training on the self-built stroke patient expression data set.
First, the present application compares with the convolution-based neural networks that are mainstream in current expression recognition. In the comparison, the model is trained from scratch without pre-training, using the same data enhancement methods as the other models; the loss is computed with a cross-entropy loss function and optimized with the Adam optimizer at a learning rate (Lr) of 0.0003. Because the number of model parameters is optimized, the batch size during training can be raised from 20 for ViT to 120, which greatly improves the training speed. The accuracy of the model on the RAF-DB data set is shown in table 2, where Ours denotes the ViT-based expression recognition model provided herein. As can be seen from table 2, even without pre-training, the ViT-based expression recognition model provided by the present application has strong interpretation capability, its overall accuracy even exceeds that of the pre-trained ViT, and it has high interpretation capability for the normal (Neutral) expression, which helps to correctly interpret the expressions of stroke patients.
TABLE 2 Accuracy comparison of the ViT-based expression recognition model on the RAF-DB data set
The comparison of model parameter counts and accuracy on the FERplus data set is shown in table 3, where Ours in table 3 is the ViT-based expression recognition model provided herein and M denotes 10^6. As can be seen from table 3, compared with ViT, the ViT-based expression recognition model provided by the present application is better suited to the expression recognition task and achieves performance far exceeding ViT with the parameter count reduced by nearly 8 times. Meanwhile, the model is less sensitive to the learning rate (Lr), and a good result is easier to train.
TABLE 3 Parameter counts and accuracy on the FERplus data set
As shown in fig. 6 and 7, comparing the confusion matrices of the expression recognition models on the RAF-DB data set shows that the ViT-based expression recognition model provided by the present application exceeds ViT in overall accuracy and reaches 99% in recognizing the normal expression, which can effectively help to identify the emotional baseline of stroke patients.
The ViT-based expression recognition model provided by the present application uses more convolutions, so the feature pixels no longer correspond one-to-one to the picture pixels; however, the visualization of model attention shows that the model still retains ViT's self-attention mechanism and assigns larger discrimination weights to the mouth corners and the eye regions.
Considering that patients' emotions are simple during rehabilitation, five expression types (three basic expressions and two special expressions) are selected as recognition targets: 1. happy, 2. sad, 3. normal, 4. forceful, 5. painful. Tests show that the ViT-based expression recognition model provided by the present application reaches an accuracy of 83.88% on the private data set, an improvement over the 80.26% of ResNet18.
A ViT-based expression recognition model is designed for the expression characteristics of stroke patients, and a regional attention mechanism is added to reduce the influence of the hemiplegic region on recognition. While retaining ViT's high interpretation capability through the self-attention mechanism, the model designs a convolution-based block embedding structure (Conv Patch Embedding) and a block integration structure (Patch Aggregate), realizing a ViT backbone network with a pyramid structure. In the expression recognition task, compared with the original ViT, the number of parameters is reduced from 86,095,879 to 12,576,579, the computation is reduced by nearly 8 times, and the convolution-based block embedding structure improves the model's interpretation of texture to suit the expression recognition task. Finally, without pre-training, the accuracy on RAF-DB reaches 88.17% (higher than the 86.14% of the original ViT), and the model is insensitive to the learning rate (Lr) during training and is easier to train.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A stroke patient expression recognition method based on self-attention, characterized in that the method comprises the following steps:
acquiring an expression data set of a stroke patient;
constructing a ViT-based expression recognition model;
training the ViT-based expression recognition model through a facial expression data set and the stroke patient expression data set;
identifying the category of the stroke patient's expression through the trained ViT-based expression recognition model.
2. The stroke patient expression recognition method based on self-attention as claimed in claim 1, wherein the acquiring stroke patient expression data set specifically comprises:
acquiring an expression video of a stroke patient;
labeling the expression video to obtain segmented videos of various expressions;
performing picture processing on each segmented video to obtain picture groups of various expressions; the pictures in the picture group comprise an expression starting state, an expression ending state and an expression peak value state;
framing the faces in the picture groups with various expressions to obtain face pictures;
and zooming the face picture to obtain the stroke patient expression data set.
3. The stroke patient expression recognition method based on self-attention as claimed in claim 1, wherein the ViT-based expression recognition model comprises: a convolution embedding module, a Transformer module, a multi-channel feature extraction module and an output module which are connected in sequence.
4. The stroke patient expression recognition method based on self-attention as claimed in claim 3, wherein the Transformer module comprises: encoder and block integration networks.
5. The stroke patient expression recognition method based on self-attention as claimed in claim 3, wherein the training of the ViT-based expression recognition model through the facial expression dataset and the stroke patient expression dataset specifically comprises:
pre-training the ViT-based expression recognition model by taking a facial expression data set as input and the confidence degrees of various expressions as output to obtain a pre-training model; the expression category with the highest confidence coefficient is a prediction result of the pre-training model;
enhancing the expression data set of the stroke patient to obtain an enhanced expression data set;
training the pre-training model by taking the enhanced expression data set as input and the confidence degrees of various expressions as output to obtain a trained ViT-based expression recognition model; the expression class with the highest confidence is the prediction result of the trained ViT-based expression recognition model.
6. A self-attention based stroke patient expression recognition system, the system comprising:
a data set acquisition unit, used for acquiring an expression data set of a stroke patient;
a construction unit, used for constructing a ViT-based expression recognition model;
a training unit, configured to train the ViT-based expression recognition model through a facial expression dataset and the stroke patient expression dataset;
and the recognition unit is used for recognizing the category of the expression of the stroke patient through a trained ViT-based expression recognition model.
7. The system for stroke patient expression recognition based on self-attention as claimed in claim 6, wherein the data set obtaining unit comprises:
the expression video acquisition subunit is used for acquiring an expression video of the stroke patient;
the labeling subunit is used for labeling the expression videos to obtain segmented videos of various expressions;
the picture processing subunit is used for carrying out picture processing on each segmented video to obtain picture groups of various expressions; the pictures in the picture group comprise an expression starting state, an expression ending state and an expression peak value state;
the face picture determining subunit is used for framing the faces in the picture groups with various expressions to obtain a face picture;
and the data set determining subunit is used for zooming the face picture to obtain the stroke patient expression data set.
8. The system of claim 6, wherein the ViT-based expression recognition model comprises: a convolution embedding module, a Transformer module, a multi-channel feature extraction module and an output module which are connected in sequence.
9. The system of claim 8, wherein the Transformer module comprises: encoder and block integration networks.
10. The system of claim 8, wherein the training unit comprises:
the pre-training model determining subunit is used for pre-training the ViT-based expression recognition model by taking the facial expression data set as input and the confidence degrees of various expressions as output to obtain a pre-training model; the expression category with the highest confidence coefficient is a prediction result of the pre-training model;
the enhanced data set subunit is used for enhancing the stroke patient expression data set to obtain an enhanced expression data set;
an expression recognition model determining subunit, configured to train the pre-training model by using the enhanced expression data set as an input and using confidence degrees of various expressions as an output, so as to obtain a trained ViT-based expression recognition model; and the expression category with the highest confidence coefficient is the prediction result of the trained ViT-based expression recognition model.
CN202210087537.0A 2022-01-25 2022-01-25 Self-attention-based stroke patient expression recognition method and system Pending CN114429659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210087537.0A CN114429659A (en) 2022-01-25 2022-01-25 Self-attention-based stroke patient expression recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210087537.0A CN114429659A (en) 2022-01-25 2022-01-25 Self-attention-based stroke patient expression recognition method and system

Publications (1)

Publication Number Publication Date
CN114429659A true CN114429659A (en) 2022-05-03

Family

ID=81312529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210087537.0A Pending CN114429659A (en) 2022-01-25 2022-01-25 Self-attention-based stroke patient expression recognition method and system

Country Status (1)

Country Link
CN (1) CN114429659A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943997A (en) * 2022-05-18 2022-08-26 上海大学 Cerebral apoplexy patient expression classification algorithm and system based on attention and neural network
CN114943997B (en) * 2022-05-18 2024-09-17 上海大学 Expression classification algorithm and system for cerebral apoplexy patients based on attention and neural network
CN115146762A (en) * 2022-06-14 2022-10-04 兰州理工大学 Method for enhancing ViT model robustness based on SE module
CN117058405A (en) * 2023-07-04 2023-11-14 首都医科大学附属北京朝阳医院 Image-based emotion recognition method, system, storage medium and terminal
CN117058405B (en) * 2023-07-04 2024-05-17 首都医科大学附属北京朝阳医院 Image-based emotion recognition method, system, storage medium and terminal
CN118379781A (en) * 2024-06-26 2024-07-23 南昌大学第二附属医院 Damping-off face recognition method and device based on damping-off face recognition model
CN118379781B (en) * 2024-06-26 2024-09-06 南昌大学第二附属医院 Damping-off face recognition method and device based on damping-off face recognition model

Similar Documents

Publication Publication Date Title
CN114429659A (en) Self-attention-based stroke patient expression recognition method and system
WO2022111236A1 (en) Facial expression recognition method and system combined with attention mechanism
CN110287805B (en) Micro-expression identification method and system based on three-stream convolutional neural network
CN112560810B (en) Micro-expression recognition method based on multi-scale space-time characteristic neural network
CN112784763B (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN111523462B (en) Video sequence expression recognition system and method based on self-attention enhanced CNN
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN109948691A (en) Iamge description generation method and device based on depth residual error network and attention
CN108647599B (en) Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network
CN114550057A (en) Video emotion recognition method based on multi-modal representation learning
CN115761757A (en) Multi-mode text page classification method based on decoupling feature guidance
CN116311483B (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
CN113782190A (en) Depression diagnosis method based on multi-stage space-time characteristics and mixed attention network
CN116758621B (en) Self-attention mechanism-based face expression depth convolution identification method for shielding people
CN113792607A (en) Neural network sign language classification and identification method based on Transformer
Rawf et al. Effective Kurdish sign language detection and classification using convolutional neural networks
CN113936317A (en) Priori knowledge-based facial expression recognition method
Thakar et al. Sign Language to Text Conversion in Real Time using Transfer Learning
Hu et al. Multi-level feature fusion facial expression recognition network
CN117496567A (en) Facial expression recognition method and system based on feature enhancement
CN114882553B (en) Micro-expression recognition method and system based on deep learning
CN116311472A (en) Micro-expression recognition method and device based on multi-level graph convolution network
CN114944002B (en) Text description-assisted gesture-aware facial expression recognition method
CN113673303B (en) Intensity regression method, device and medium for face action unit
CN113780350B (en) ViLBERT and BiLSTM-based image description method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination