CN116580440A - Lightweight lip language identification method based on visual transformer - Google Patents

Lightweight lip language identification method based on visual transformer

Info

Publication number
CN116580440A
CN116580440A (application CN202310592100.7A)
Authority
CN
China
Prior art keywords
lip
layer
inputting
lip language
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310592100.7A
Other languages
Chinese (zh)
Other versions
CN116580440B (en)
Inventor
王慧娟
袁全波
邢艺兰
谢佳飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Institute of Aerospace Engineering
Original Assignee
North China Institute of Aerospace Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Institute of Aerospace Engineering filed Critical North China Institute of Aerospace Engineering
Priority to CN202310592100.7A priority Critical patent/CN116580440B/en
Publication of CN116580440A publication Critical patent/CN116580440A/en
Application granted granted Critical
Publication of CN116580440B publication Critical patent/CN116580440B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a lightweight lip language recognition method based on a vision transformer, which comprises the following steps: acquiring a lip language data set and preprocessing it to obtain lip recognition images; constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images; inputting the spatio-temporal features into an improved convolutional vision transformer network to extract local and global spatial features of the lip recognition images; inputting the local and global spatial features into a bidirectional gated recurrent unit to extract long- and short-term feature sequences of the lip recognition images; and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for each category, from which recognition probability values are obtained through a cross entropy loss function with a label smoothing mechanism, completing lightweight lip language recognition. The application addresses the problems of large model parameter counts, long training and inference times, and performance degradation caused by model compression.

Description

Lightweight lip language identification method based on visual transformer
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to a lightweight lip language recognition method based on a vision transformer.
Background
Lip recognition, also called visual speech recognition, refers to judging the speaking content from the lip movements of a speaker; research on it involves technologies such as computer vision and natural language processing. Lip language recognition is widely applied in identity authentication, speech recognition, speaker face synthesis, improving communication for people with hearing and speech impairments, public safety, and other areas. According to the literature, several problems in the field of lip recognition require attention. First, recognition focuses mainly on the lips and their immediate surroundings, which makes fine-grained feature extraction by the network model particularly important. Second, lip language recognition must process the temporal and spatial information between adjacent frames of continuous video, so recognition is difficult; at the same time, the convolution and fully connected layers of current deep learning models greatly increase the number of parameters and place high demands on computer hardware. Furthermore, as the network goes deeper, the decreasing resolution leads to information loss. Because of these problems and the complexity of lip reading, the accuracy of the lip reading task has remained low, model parameters are large, and training and inference times are long. Therefore, there is a need for a lightweight lip language recognition method based on a vision transformer.
Disclosure of Invention
In order to solve the above technical problems, the application provides a lightweight lip language recognition method based on a vision transformer, which effectively extracts high-dimensional features of the image sequence and enhances the semantic representation among video key frames, thereby reducing the loss caused by global averaging of the image sequence; it has obvious advantages in parameter count and computation, and addresses the sharp drop in recognition accuracy caused by model compression.
In order to achieve the above purpose, the application provides a lightweight lip language recognition method based on a vision transformer, which comprises the following steps:
acquiring a lip language data set and preprocessing it to obtain lip recognition images;
constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images;
inputting the spatio-temporal features into an improved convolutional vision transformer network to obtain local and global spatial features of the lip recognition images;
inputting the local and global spatial features into a bidirectional gated recurrent unit and extracting long- and short-term feature sequences of the lip recognition images;
and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for each category, and obtaining recognition probability values through a cross entropy loss function with a label smoothing mechanism based on these confidence scores, completing lightweight lip language recognition.
Optionally, the method for preprocessing the lip language data set to obtain the lip recognition images comprises:
adjusting the video frames in the lip language data set to a first preset size, cropping them to a second preset size, performing data augmentation and expansion with data augmentation techniques, flipping each video frame with a preset probability, converting the frames to grayscale images, and normalizing the grayscale images to obtain the lip recognition images.
Optionally, the method for constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images comprises:
setting one 3D convolutional layer with kernel size (5, 7, 7), stride (1, 2, 2) and padding (2, 3, 3), followed by batch normalization, an activation function, and finally a max pooling layer with kernel size (1, 3, 3) and stride (1, 2, 2).
Optionally, a 3D convolutional neural network is constructed, and the calculation for extracting spatiotemporal features from the lip recognition image is as follows:
v_ij^xyz = ReLU(b_ij + Σ_m Σ_{p=0..P_i−1} Σ_{q=0..Q_i−1} Σ_{r=0..R_i−1} w_ijm^pqr · v_(i−1)m^((x+p)(y+q)(z+r)))
where v_ij^xyz is the value at position (x, y, z) of the j-th feature map in the i-th layer, ReLU is the activation function, b is the bias, m indexes the feature maps of layer i−1 connected to the current feature map, w_ijm^pqr is the kernel weight, and P_i, Q_i, R_i are the width, height and temporal extent of the convolution kernel.
Optionally, the improved convolutional vision transformer network comprises an SE-Conv-Embedding layer and three convolutional vision transformer modules.
Optionally, the method for inputting the spatio-temporal features into the improved convolutional vision transformer network and obtaining the local and global spatial features of the lip recognition images comprises:
inputting the spatio-temporal features into the SE-Conv-Embedding layer, compressing them with a squeeze function to obtain per-channel statistics, inputting the statistics into an excitation function to obtain the channel correlations, and inputting the correlations into a scale function to obtain a new feature map;
and inputting the new feature map into a convolutional vision transformer module, obtaining local context information through the convolutional projection layer of the module, feeding it into the multi-head attention layer of the module with layer normalization, and inputting the normalized result into the multi-layer perceptron (MLP) layer of the module to obtain the local and global spatial features of the lip recognition images.
Optionally, the method for obtaining the new feature map is as follows:
x̃ = F_scale(F_ex(F_sq(x)), x)
where x is the input feature map, F_sq is the squeeze (compression) function, F_ex is the excitation function, F_scale is the scale function, and x̃ is the new feature map.
Optionally, the local context information is obtained through the convolutional projection layer of the convolutional vision transformer module as:
x_i^{Q/K/V} = Flatten(Conv2d(Reshape2D(x_i), s))
where x_i^{Q/K/V} is the token input to the Q/K/V matrices of the i-th layer, x_i is the output of the SE-Conv-Embedding layer, and s is the convolution kernel size.
Optionally, the method for inputting the local and global spatial features into the bidirectional gated recurrent unit and extracting the long- and short-term feature sequences of the lip recognition images comprises:
setting the input dimension to 512, the hidden layer dimension to 1024, 3 layers in total, and the output dimension to 2048;
the gated recurrent unit is computed as:
z_t = σ(W_z x_t + U_z h_{t−1}), r_t = σ(W_r x_t + U_r h_{t−1}), h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1})), h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
where z is the update gate, r is the reset gate, h̃ is the candidate hidden state, h is the hidden state, and W and U are the input and hidden weight matrices, respectively.
Optionally, the method for inputting the long- and short-term feature sequences into a multi-layer perceptron and obtaining the confidence scores of each category comprises:
inputting the extracted long- and short-term feature sequences into the multi-layer perceptron, which receives them flattened into one-dimensional tensors and multiplies them by a weight matrix to generate the output features, yielding the confidence scores of each category.
The application has the following technical effects: the disclosed lightweight lip language recognition method based on a vision transformer makes improvements in the data preprocessing, data augmentation and feature extraction stages, raising the recognition accuracy of the model. Weight sharing is combined with weight distillation, and transformation after weight sharing is used to reduce model instability and performance degradation; Prediction-Logit distillation is taken as the main component, and weighted Self-Attention distillation loss and Hidden-State distillation loss are added to reduce model parameters, thereby improving training and testing speed, lowering memory consumption, improving model convergence, and accelerating model operation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
fig. 1 is a schematic flow chart of the lightweight lip language recognition method based on a vision transformer according to an embodiment of the application;
FIG. 2 is a block diagram of feature extraction and recognition by the 3D convolutional network and vision transformer applied to the input images according to an embodiment of the application;
FIG. 3 is a basic architecture diagram of the CvT module according to an embodiment of the application;
FIG. 4 is a diagram of the improved Transformer Block structure according to an embodiment of the application;
FIG. 5 is a diagram of the overall structure of Mini-3DCvT according to an embodiment of the present application;
FIG. 6 is a schematic diagram of weight transformation in the attention layer and feed forward network according to an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in fig. 1, the lightweight lip language recognition method based on a vision transformer in this embodiment includes the following steps:
acquiring a lip language data set and preprocessing it to obtain lip recognition images;
constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images;
inputting the spatio-temporal features into an improved convolutional vision transformer network to obtain local and global spatial features of the lip recognition images;
inputting the local and global spatial features into a bidirectional gated recurrent unit and extracting long- and short-term feature sequences of the lip recognition images;
and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for each category, obtaining recognition probability values through a cross entropy loss function with a label smoothing mechanism based on these confidence scores, and completing lightweight lip language recognition.
As shown in fig. 2, the input size of the front-end network is (256×1×88×88), the input data being obtained by preprocessing the original data set. The data first pass through the 3D convolution layer and then enter the Transformer Blocks. The parameter settings of the Transformer Block differ across the three stages, and the parameter values of each stage can be adjusted to fit different input data. After the front-end network has extracted the feature information, word boundary information is added and the data size becomes (256×513×5×5). To better capture the global correlations of the feature sequence and identify key information, the features are sent to the back-end Bi-GRU network, which concatenates the feature vectors output by the front-end network with their direction-reversed counterparts along the channel dimension; finally, a dropout layer effectively alleviates the vanishing-gradient problem while preventing the model from over-fitting.
The lip language recognition method based on deep convolution and an attention mechanism preprocesses the large lip reading data sets; the lip recognition images are obtained as follows:
A unified preprocessing strategy is adopted for the LRW and LRW-1000 data sets. We resize the initial input video frames to 96×96, then crop them to 88×88, and use both the Mixup and CutMix data augmentation techniques for expansion as the final input. A batch of 256 video frames is selected in each training epoch. Each video frame is then flipped horizontally with probability 0.5 and converted to a grayscale image, and finally normalized to [0, 1]. Furthermore, before the extracted feature information enters the back end, the feature dimension is expanded from 512 to 513 by adding word boundaries, which provide context and environmental information to aid lip reading classification. In addition, the label values of the data samples are smoothed, and the smoothed labels are added to the loss function to guide the training of the network model, reducing over-fitting.
The lip language recognition method based on deep convolution and an attention mechanism uses the following 3D convolution front end: one 3D convolutional layer with kernel size (5, 7, 7), stride (1, 2, 2) and padding (2, 3, 3), followed by batch normalization, an activation function, and finally a max pooling layer with kernel size (1, 3, 3) and stride (1, 2, 2). The 3D convolution is computed as follows:
v_ij^xyz = ReLU(b_ij + Σ_m Σ_{p=0..P_i−1} Σ_{q=0..Q_i−1} Σ_{r=0..R_i−1} w_ijm^pqr · v_(i−1)m^((x+p)(y+q)(z+r)))
where v_ij^xyz is the value at position (x, y, z) of the j-th feature map in the i-th layer, ReLU is the activation function, b is the bias, m indexes the feature maps of layer i−1 connected to the current feature map, w_ijm^pqr is the kernel weight, and P_i, Q_i, R_i are the width, height and temporal extent of the convolution kernel. The output is then processed by the convolutional token embedding layer, whose embedding kernel size is (7, 7), stride is (2, 2), and number of kernels is 128.
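For illustration, a PyTorch-style sketch of this 3D front end under the parameter values just listed; the output channel count (64) and the pooling padding are assumptions not stated in the text.

import torch
import torch.nn as nn

class FrontEnd3D(nn.Module):
    """Sketch of the 3D convolutional front end: Conv3d + BatchNorm + ReLU + MaxPool3d."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(1, out_channels,
                              kernel_size=(5, 7, 7),
                              stride=(1, 2, 2),
                              padding=(2, 3, 3),
                              bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool3d(kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2),
                                 padding=(0, 1, 1))   # padding is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, 88, 88) -> (batch, C, frames, 22, 22)
        return self.pool(self.act(self.bn(self.conv(x))))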
The spatio-temporal feature vectors generated in the previous step are input into the improved lightweight three-stage convolutional vision transformer, which consists of three Transformer stages; the specific architecture is shown in fig. 3.
Each module consists of a convolutional projection layer, a multi-head attention layer and a fully connected layer, stacked in three stages. In the first stage, the convolutional projection layer has kernel size (3, 3), 128 kernels, 3 attention heads and depth 128; in the second stage, kernel size (3, 3), 256 kernels, 12 attention heads and depth 256; in the third stage, kernel size (3, 3), 512 kernels, 16 attention heads and depth 512.
In order to improve the recognition performance of the algorithm, the following modifications are made:
(1) Stacking Transformer blocks and deepening the network reduce the image resolution and weaken the network's ability to capture spatial feature information. To solve this problem, the convolutional token embedding layer of the Transformer block is improved by adding a squeeze-and-excitation (SE) structure, and the new network layer is defined as SE-Conv-Embedding. The specific structure of SE-Conv-Embedding is shown in fig. 4: the feature map is input into the Squeeze operation to obtain per-channel statistics, the statistics are input into the Excitation operation to obtain the channel correlations, and finally the correlations are input into the Scale operation to obtain a new feature map, as shown in formula (2).
x̃ = F_scale(F_ex(F_sq(x)), x)  (2)
where x is the input feature map, F_sq is the squeeze function, F_ex is the excitation function, F_scale is the scale function, and x̃ is the new feature map. This layer can increase the feature dimension of the tokens by changing parameters and reduce the length of the token sequence, thereby compensating for the information loss caused by deepening the network and reducing the resolution.
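A minimal squeeze-and-excitation block matching formula (2) could look as follows; the reduction ratio of 16 is an assumption.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """F_sq pools each channel, F_ex models channel correlations, F_scale reweights the input."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                 # F_sq
        self.excite = nn.Sequential(                           # F_ex
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.squeeze(x).view(b, c)
        w = self.excite(s).view(b, c, 1, 1)
        return x * w                                           # F_scale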
The mapped SE-Conv-Embedding features are then sent to the convolutional projection (Conv-Projection) layer in the Transformer Block, whose purpose is to better capture local context information. The specific formula is as follows:
x_i^{Q/K/V} = Flatten(Conv2d(Reshape2D(x_i), s))
where x_i^{Q/K/V} is the token input to the Q/K/V matrices of the i-th layer, x_i is the output of the SE-Conv-Embedding layer, and s is the convolution kernel size. The feature matrix is then passed from the Conv-Projection layer into the multi-head attention layer. The number of heads increases at each stage; the values are normalized by layer normalization, and finally the output is obtained through the multi-layer perceptron (MLP) layer; this result, as the output of the front-end network, is fed into the back-end network for sequence modelling. These operations allow features to be extracted well at different scales. Meanwhile, the ReLU activation function reduces the difference between the output value and the expected value. As shown in the improved Transformer block of fig. 4, the convolution and attention mechanisms extract and model the feature information of the input feature map well and better learn the regular patterns among the data.
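A sketch of such a convolutional projection is given below, using a depthwise 3×3 convolution with batch normalization as in CvT; the stride-1 choice and the batch-first token layout are assumptions.

import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Implements x_i^{Q/K/V} = Flatten(Conv2d(Reshape2D(x_i), s)) for one of Q, K or V."""
    def __init__(self, dim: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, stride,
                      padding=kernel_size // 2, groups=dim, bias=False),  # depthwise conv
            nn.BatchNorm2d(dim),
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim) -> reshape to a 2D map, convolve, flatten back to tokens
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, h, w)    # Reshape2D
        x = self.proj(x)                                   # Conv2d(., s)
        return x.flatten(2).transpose(1, 2)                # Flatten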
(2) The module is modified for light weight as follows. The vision transformer is processed in two steps on the basis of the base model:
the first step generates a compact architecture with weight conversion. Given a pre-trained large visual transducer model, parameters are first shared between every k adjacent transducer layers except LayerNorm. Each layer is then weight transformed by inserting a tiny linear layer before and after the softmax layer. In addition, a deep convolutional layer for MLP is introduced. These linear layers and conversion blocks are not shared.
In the second step, weight-distillation training is performed on the compressed model. The target loss function is defined as the sum of several proposed loss functions, and knowledge is transferred from the large pre-trained model to the small model with the proposed weight-distillation method. By distilling inside the Transformer modules, the student can imitate the behaviour of the teacher network and extract more useful knowledge from the large-scale pre-trained model. Note that this is performed only when both the teacher and the student are Transformer architectures; when the student and teacher architectures are heterogeneous, only the full Prediction-Logit distillation is kept, with weighted Self-Attention distillation loss and Hidden-State distillation loss added.
The overall structure of the resulting Mini-3DCvT is shown in FIG. 5; it extracts lip movement features by a three-dimensional convolution and a lightweight, modified vision transformer. The extracted feature information is then processed by a bidirectional gated recurrent unit (BiGRU) and a fully connected layer (FC) for sequence modelling and classification.
The two-step process of compressing the vision transformer blocks is described in detail below. The first step applies the weight transformation to both the multi-head self-attention (MSA) blocks and the multi-layer perceptrons (MLP). This transformation allows each layer to be different, improving parameter diversity and model representation capability. As shown in fig. 6, the parameters of the transformation kernels are not shared across layers, while all other blocks in the original Transformer except LayerNorm are shared. Since the shared blocks account for most of the model parameters, the model size increases only slightly after weight transformation.
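The following sketch illustrates only the sharing scheme (layer counts and dimensions are placeholders): every k adjacent layers point at one shared block, while LayerNorm and the small per-layer transformation kernels stay private; how those kernels are wired into attention is shown in the next sketch.

import torch.nn as nn

def build_weight_shared_stack(num_layers: int = 12, k: int = 2,
                              dim: int = 256, heads: int = 4):
    """Return (shared_block, private_parts) pairs; layer i reuses the weights of group i // k."""
    num_groups = (num_layers + k - 1) // k
    shared_blocks = nn.ModuleList([
        nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        for _ in range(num_groups)
    ])
    per_layer = nn.ModuleList([
        nn.ModuleDict({
            "norm": nn.LayerNorm(dim),                          # LayerNorm is never shared
            "pre_softmax": nn.Linear(heads, heads, bias=False),  # F(1), private to the layer
            "post_softmax": nn.Linear(heads, heads, bias=False), # F(2), private to the layer
        })
        for _ in range(num_layers)
    ])
    return [(shared_blocks[i // k], per_layer[i]) for i in range(num_layers)]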
Attention layer weight transformation
The multi-head self-attention (MSA) layer in the Transformer Block uses weight transformation, with layer normalization and residual connections applied before and after each block. The MSA layer is elaborated as follows:
let M be the number of heads in the MSA, also referred to as the self-attention module. Given the input sequence Z0 ε RNxD, at the firstOf the k heads, query, key and value are generated by linear projection, respectively using Q k 、K k and Vk E RN x d, where N is the number of tags. The dimensions of the Conv-project input and Q-K-V matrices are D and D, respectively. Next, a weighted average of all values for each location is calculated. For similarity calculation between different elements, attention matrix A is obtained, wherein A is a weight matrix and is obtained by h k To represent the output of the attention layer, i.e. formulas (4) and (5):
h_k = A_k V_k  (4)
A_k = Softmax(Q_k K_k^⊤ / √d)  (5)
where the softmax operation is performed on each row of the input matrix. Finally, the outputs of all heads are concatenated and passed through one fully connected layer.
In order to improve the diversity of parameters, the application inserts two linear transformations before and after the softmax of the self-attention module, defined as follows:
A'_k = Σ_{m=1..M} F^(2)_{k,m} · Softmax(Σ_{m'=1..M} F^(1)_{m,m'} · Q_{m'} K_{m'}^⊤ / √d)
where F^(1), F^(2) ∈ R^{M×M} are the linear transformation kernels before and after the softmax, acting across the M heads. Such linear transformations make each attention matrix A'_k different while combining information across attention heads, increasing parameter diversity.
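The sketch below wires such M×M kernels around the softmax of a standard multi-head attention layer; the einsum mixing over the head dimension and the identity initialisation are assumptions about how the transformation could be implemented.

import math
import torch
import torch.nn as nn

class TransformedAttention(nn.Module):
    """Multi-head attention with small head-mixing kernels F1 (pre-softmax) and F2 (post-softmax)."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.f1 = nn.Parameter(torch.eye(heads))   # F(1), initialised to identity
        self.f2 = nn.Parameter(torch.eye(heads))   # F(2)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.qkv(x).reshape(b, n, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.d)      # (b, heads, n, n)
        logits = torch.einsum("km,bmij->bkij", self.f1, logits)   # mix heads before softmax
        attn = logits.softmax(dim=-1)
        attn = torch.einsum("km,bmij->bkij", self.f2, attn)       # mix heads after softmax
        out = (attn @ v).transpose(1, 2).reshape(b, n, self.h * self.d)
        return self.out(out)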
Multi-layer perceptron weight transformation
The multi-layer perceptron (MLP) consists of two fully connected layers with an activation function σ, typically GELU. Let Y ∈ R^{N×d} be the input to the MLP. The output of the MLP is expressed as:
H = σ(YW^(1) + b^(1))W^(2) + b^(2)  (8)
where W^(1) ∈ R^{d×d'}, b^(1) ∈ R^{d'}, W^(2) ∈ R^{d'×d} and b^(2) ∈ R^d are the weights and biases of the first and second layers, respectively. Note that d' is usually set larger than d (d' > d).
The MLP is then further transformed for light weight to improve parameter diversity. Specifically, let the input be Y = [y_1, …, y_d], where y_l ∈ R^N collects the l-th dimension of all token embedding vectors. Then d linear transformations are introduced to convert Y into Y' = [C^(1)y_1, …, C^(d)y_d], where C^(1), …, C^(d) ∈ R^{N×N} are independent weight matrices of the linear layers. Formula (8) is then rewritten as:
H = σ(Y'W^(1) + b^(1))W^(2) + b^(2)  (9)
To reduce the number of parameters and introduce locality into the transformation, depthwise convolution is employed [41] to sparsify and share the weights within each weight matrix, yielding only K²·d parameters instead of N²·d (K << N), where K is the kernel size of the convolution. After the transformation, the output of the MLP is more diversified and the parameter efficiency is improved.
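The MLP transformation can be sketched as below, with the per-dimension token mixing realised as a depthwise K×K convolution over the token grid; the hidden width 4×dim and the token-grid layout are assumptions.

import torch
import torch.nn as nn

class TransformedMLP(nn.Module):
    """Sparse token mixing (stand-in for the dense C(l) matrices) followed by the two-layer MLP."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.token_mix = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)  # K*K*d parameters
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, d = tokens.shape                       # n == h * w tokens
        y = tokens.transpose(1, 2).reshape(b, d, h, w)
        y = self.token_mix(y)                        # Y -> Y' (per-dimension token mixing)
        y = y.flatten(2).transpose(1, 2)
        return self.mlp(y)                           # H = sigma(Y'W1 + b1)W2 + b2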
Through these transformations, the weight-sharing layers can recover the behaviour of the pre-trained model, similar to a multiplexing process, which alleviates the problems of unstable training and degraded performance and avoids the drawbacks of the weight-sharing method.
To compress the pre-trained large model and address the performance degradation caused by weight sharing, weight distillation is further used to transfer knowledge from the large model to the small, compact model. Three types of distillation are applied to the Transformer blocks, namely Prediction-Logit distillation, Self-Attention distillation and Hidden-State distillation.
Prediction-Logit distillation
Hinton et al. first demonstrated that a deep learning model can achieve better performance by mimicking the output behaviour of a well-performing teacher model during training. Following this idea, the prediction loss is introduced as follows:
L_pred = CE(z_s / T, z_t / T)
where z_s and z_t are the prediction logits of the student model and the teacher model, respectively, T is a temperature value controlling the smoothness of the logits, and CE denotes the cross entropy loss. In the experiments, T = 1 is used.
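A minimal sketch of this prediction loss, assuming the usual softened-softmax cross entropy form (the T² factor is the conventional scaling and has no effect at T = 1):

import torch
import torch.nn.functional as F

def prediction_logit_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """Cross entropy of the student prediction under the temperature-scaled teacher distribution."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean() * temperature ** 2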
Self-Attention distillation
It is beneficial to use the attention maps in the Transformer layers to guide the training of the student model. To solve the dimension mismatch between the student and teacher models caused by their different numbers of heads, the cross entropy loss is applied to the relations between the queries, keys and values in the MSA. The matrices of all heads are first concatenated; for example, Q = [Q_1, …, Q_M] ∈ R^{N×Md}, and K, V ∈ R^{N×Md} likewise. For simplicity of notation, Q, K and V are denoted by S_1, S_2 and S_3, respectively. Then 9 different relation matrices can be generated, defined as R_{i,j} = Softmax(S_i S_j^⊤ / √(Md)), i, j ∈ {1, 2, 3}. Note that R_{1,2} is the attention matrix A, and the Self-Attention distillation loss can be expressed as:
L_attn = (1/(9N)) Σ_{i=1..3} Σ_{j=1..3} Σ_{n=1..N} CE(R^S_{i,j,n}, R^T_{i,j,n})
where R_{i,j,n} denotes the n-th row of R_{i,j}, and the superscripts S and T refer to the student and the teacher, respectively.
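The relation-matrix distillation above can be sketched as follows; the √(Md) scaling and the use of the teacher rows as soft targets are assumptions consistent with the description.

import math
import torch
import torch.nn.functional as F

def relation_matrices(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """Build the 9 relation matrices R_ij = softmax(S_i S_j^T / sqrt(M*d)) from head-concatenated Q, K, V."""
    s = [q, k, v]                      # each of shape (N, M*d), or batched (B, N, M*d)
    d_total = q.shape[-1]
    return [F.softmax(si @ sj.transpose(-2, -1) / math.sqrt(d_total), dim=-1)
            for si in s for sj in s]

def self_attention_distill_loss(student_qkv, teacher_qkv) -> torch.Tensor:
    """Row-wise cross entropy between student and teacher relation matrices, averaged over 9 relations."""
    loss = 0.0
    for r_s, r_t in zip(relation_matrices(*student_qkv), relation_matrices(*teacher_qkv)):
        loss = loss + (-(r_t * torch.log(r_s + 1e-8)).sum(dim=-1).mean())
    return loss / 9.0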
Hidden-State distillation
Similarly, a relation matrix can be generated for the hidden states, i.e. the features output by the MLP [45]. Let H ∈ R^{N×d} denote the hidden state of the Transformer layer. The Hidden-State distillation loss based on the relation matrix is defined as:
L_hidden = (1/N) Σ_{n=1..N} CE(R^S_{H,n}, R^T_{H,n})
where R_{H,n} is the n-th row of R_H, computed as R_H = Softmax(H H^⊤ / √d), and the superscripts S and T again denote the student and the teacher.
On the basis of the Prediction-Logit distillation, the application adds a weighted mixture of the other two knowledge distillation terms as the final loss, so the final distillation target loss function is expressed as:
L_total = L_pred + λ_1 · L_attn + λ_2 · L_hidden
where λ_1 and λ_2 are the weights of the Self-Attention and Hidden-State distillation losses.
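A compact sketch of the remaining distillation term and the weighted total loss is given below; the relation definition R_H = softmax(H H^T / √d) follows the text, while the weight values are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

def hidden_state_distill_loss(h_student: torch.Tensor,
                              h_teacher: torch.Tensor) -> torch.Tensor:
    """Row-wise cross entropy between teacher and student hidden-state relation matrices (H of shape (N, d))."""
    def relation(h):
        return F.softmax(h @ h.transpose(-2, -1) / math.sqrt(h.shape[-1]), dim=-1)
    r_s, r_t = relation(h_student), relation(h_teacher)
    return -(r_t * torch.log(r_s + 1e-8)).sum(dim=-1).mean()

def total_distillation_loss(l_pred, l_attn, l_hidden,
                            w_attn: float = 1.0, w_hidden: float = 1.0):
    """Weighted mixture used as the final distillation target; the weights are assumptions."""
    return l_pred + w_attn * l_attn + w_hidden * l_hidden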
The extracted spatial features are input into the bidirectional gated recurrent unit to extract the long- and short-term feature sequences:
The bidirectional gated recurrent unit is configured as follows: the input dimension is set to 512, the hidden layer dimension to 1024, with 3 layers in total, and the output dimension is 2048. The gated recurrent unit is computed as:
z_t = σ(W_z x_t + U_z h_{t−1}), r_t = σ(W_r x_t + U_r h_{t−1}), h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1})), h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
where z is the update gate, r is the reset gate, h̃ is the candidate hidden state, h is the hidden state, and W and U are the input and hidden weight matrices, respectively.
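A sketch of this back end with the stated dimensions (input 512, hidden 1024, 3 bidirectional layers, 2048-dimensional output per time step); the dropout rate is an assumption.

import torch
import torch.nn as nn

class BackEndBiGRU(nn.Module):
    """Bidirectional GRU back end: (batch, frames, 512) -> (batch, frames, 2*1024 = 2048)."""
    def __init__(self, input_size: int = 512, hidden_size: int = 1024,
                 num_layers: int = 3, dropout: float = 0.2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, bidirectional=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.gru(x)          # forward and backward features concatenated per time step
        return self.dropout(out)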
The extracted long- and short-term feature sequences are input into the multi-layer perceptron to obtain the confidence scores of each category, specifically as follows: the multi-layer perceptron has input dimension 2048 and output dimension 1000; it receives the sequences flattened into one-dimensional tensors and multiplies them by a weight matrix to generate the output features, yielding the confidence scores of each category. Based on these confidence scores, the recognition probability values are output through a cross entropy loss function with a label smoothing mechanism: the obtained output features and the true labels are fed into the cross entropy loss function with label smoothing to output the recognition probability values.
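A sketch of the classification head and the label-smoothed cross entropy described above; the temporal averaging before the linear layer and the smoothing value 0.1 are assumptions.

import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Linear head from 2048-d features to 1000 class scores, with label-smoothed cross entropy."""
    def __init__(self, in_dim: int = 2048, num_classes: int = 1000):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)
        self.criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    def forward(self, features: torch.Tensor, labels: torch.Tensor = None):
        # features: (batch, frames, 2048) -> average over time -> class scores
        scores = self.fc(features.mean(dim=1))
        probs = scores.softmax(dim=-1)            # recognition probability values
        if labels is None:
            return probs
        return probs, self.criterion(scores, labels)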
The application provides a lightweight lip language recognition method based on a vision transformer that makes improvements in the data preprocessing, data augmentation and feature extraction stages, raising the recognition accuracy of the model. Weight sharing is combined with weight distillation, and transformation after weight sharing is used to reduce model instability and performance degradation; Prediction-Logit distillation is taken as the main component, and weighted Self-Attention distillation loss and Hidden-State distillation loss are added to reduce model parameters, thereby improving training and testing speed, lowering memory consumption, improving model convergence, and accelerating model operation. Experiments show that the proposed Mini-3DCvT achieves high accuracy while greatly reducing the number of model parameters, the computation and the training time, performing well in improving model accuracy, acceleration and compression. Meanwhile, beyond the recognition-accuracy requirements of lip reading methods, the light weight of the model has gradually become an important issue restricting the development of the field.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. The lightweight lip language identification method based on a vision transformer is characterized by comprising the following steps:
acquiring a lip language data set and preprocessing it to obtain lip recognition images;
constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images;
inputting the spatio-temporal features into an improved convolutional vision transformer network to obtain local and global spatial features of the lip recognition images;
inputting the local and global spatial features into a bidirectional gated recurrent unit and extracting long- and short-term feature sequences of the lip recognition images;
and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for each category, and obtaining recognition probability values through a cross entropy loss function with a label smoothing mechanism based on these confidence scores, completing lightweight lip language recognition.
2. The vision transformer-based lightweight lip language identification method as defined in claim 1, wherein the lip language data set is preprocessed and the lip recognition images are acquired as follows:
adjusting the video frames in the lip language data set to a first preset size, cropping them to a second preset size, performing data augmentation and expansion with data augmentation techniques, flipping each video frame with a preset probability, converting the frames to grayscale images, and normalizing the grayscale images to obtain the lip recognition images.
3. The vision transformer-based lightweight lip language identification method of claim 1, wherein constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images comprises:
setting one 3D convolutional layer with kernel size (5, 7, 7), stride (1, 2, 2) and padding (2, 3, 3), followed by batch normalization, an activation function, and finally a max pooling layer with kernel size (1, 3, 3) and stride (1, 2, 2).
4. The vision transformer-based lightweight lip language identification method of claim 1, wherein the 3D convolution used to extract spatio-temporal features from the lip recognition images is computed as:
v_ij^xyz = ReLU(b_ij + Σ_m Σ_{p=0..P_i−1} Σ_{q=0..Q_i−1} Σ_{r=0..R_i−1} w_ijm^pqr · v_(i−1)m^((x+p)(y+q)(z+r)))
where v_ij^xyz is the value at position (x, y, z) of the j-th feature map in the i-th layer, ReLU is the activation function, b is the bias, m indexes the feature maps of layer i−1 connected to the current feature map, w_ijm^pqr is the kernel weight, and P_i, Q_i, R_i are the width, height and temporal extent of the convolution kernel.
5. The vision transformer-based lightweight lip language identification method of claim 1, wherein the improved convolutional vision transformer network comprises an SE-Conv-Embedding layer and three convolutional vision transformer modules.
6. The vision transformer-based lightweight lip language identification method of claim 5, wherein inputting the spatio-temporal features into the improved convolutional vision transformer network and obtaining the local and global spatial features of the lip recognition images comprises:
inputting the spatio-temporal features into the SE-Conv-Embedding layer, compressing them with a squeeze function to obtain per-channel statistics, inputting the statistics into an excitation function to obtain the channel correlations, and inputting the correlations into a scale function to obtain a new feature map;
and inputting the new feature map into a convolutional vision transformer module, obtaining local context information through the convolutional projection layer of the module, feeding it into the multi-head attention layer of the module with layer normalization, and inputting the normalized result into the multi-layer perceptron (MLP) layer of the module to obtain the local and global spatial features of the lip recognition images.
7. The vision transformer-based lightweight lip language identification method of claim 6, wherein the new feature map is obtained as:
x̃ = F_scale(F_ex(F_sq(x)), x)
where x is the input feature map, F_sq is the squeeze (compression) function, F_ex is the excitation function, F_scale is the scale function, and x̃ is the new feature map.
8. The vision transformer-based lightweight lip language identification method of claim 6, wherein the local context information is obtained through the convolutional projection layer in the convolutional vision transformer module as:
x_i^{Q/K/V} = Flatten(Conv2d(Reshape2D(x_i), s))
where x_i^{Q/K/V} is the token input to the Q/K/V matrices of the i-th layer, x_i is the output of the SE-Conv-Embedding layer, and s is the convolution kernel size.
9. The vision transformer-based lightweight lip language identification method of claim 1, wherein inputting the local and global spatial features into the bidirectional gated recurrent unit and extracting the long- and short-term feature sequences of the lip recognition images comprises:
setting the input dimension to 512, the hidden layer dimension to 1024, 3 layers in total, and the output dimension to 2048;
the gated recurrent unit is computed as:
z_t = σ(W_z x_t + U_z h_{t−1}), r_t = σ(W_r x_t + U_r h_{t−1}), h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1})), h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
where z is the update gate, r is the reset gate, h̃ is the candidate hidden state, h is the hidden state, and W and U are the input and hidden weight matrices, respectively.
10. The vision transformer-based lightweight lip language identification method of claim 1, wherein the long- and short-term feature sequences are input into a multi-layer perceptron to obtain the confidence scores of each category, specifically comprising:
inputting the extracted long- and short-term feature sequences into the multi-layer perceptron, which receives them flattened into one-dimensional tensors and multiplies them by a weight matrix to generate the output features, yielding the confidence scores of each category.
CN202310592100.7A 2023-05-24 2023-05-24 Lightweight lip language identification method based on visual transformer Active CN116580440B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310592100.7A CN116580440B (en) Lightweight lip language identification method based on visual transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310592100.7A CN116580440B (en) Lightweight lip language identification method based on visual transformer

Publications (2)

Publication Number Publication Date
CN116580440A true CN116580440A (en) 2023-08-11
CN116580440B CN116580440B (en) 2024-01-26

Family

ID=87541097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310592100.7A Active CN116580440B (en) Lightweight lip language identification method based on visual transformer

Country Status (1)

Country Link
CN (1) CN116580440B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237856A (en) * 2023-11-13 2023-12-15 腾讯科技(深圳)有限公司 Image recognition method, device, computer equipment and storage medium
CN117689044A (en) * 2024-02-01 2024-03-12 厦门大学 Quantification method suitable for vision self-attention model
CN117952869A (en) * 2024-03-27 2024-04-30 西南石油大学 Drilling fluid rock debris counting method based on weak light image enhancement

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN113343937A (en) * 2021-07-15 2021-09-03 北华航天工业学院 Lip language identification method based on deep convolution and attention mechanism
US20210280191A1 (en) * 2018-01-02 2021-09-09 Boe Technology Group Co., Ltd. Lip language recognition method and mobile terminal
CN114359786A (en) * 2021-12-07 2022-04-15 重庆邮电大学 Lip language identification method based on improved space-time convolutional network
CN114973412A (en) * 2022-05-31 2022-08-30 华中科技大学 Lip language identification method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210280191A1 (en) * 2018-01-02 2021-09-09 Boe Technology Group Co., Ltd. Lip language recognition method and mobile terminal
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network
CN113343937A (en) * 2021-07-15 2021-09-03 北华航天工业学院 Lip language identification method based on deep convolution and attention mechanism
CN114359786A (en) * 2021-12-07 2022-04-15 重庆邮电大学 Lip language identification method based on improved space-time convolutional network
CN114973412A (en) * 2022-05-31 2022-08-30 华中科技大学 Lip language identification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
任玉强 (REN Yuqiang) et al., "Research on lip-reading recognition algorithm in a high-security face recognition system", Application Research of Computers, vol. 34, no. 4

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237856A (en) * 2023-11-13 2023-12-15 腾讯科技(深圳)有限公司 Image recognition method, device, computer equipment and storage medium
CN117237856B (en) * 2023-11-13 2024-03-01 腾讯科技(深圳)有限公司 Image recognition method, device, computer equipment and storage medium
CN117689044A (en) * 2024-02-01 2024-03-12 厦门大学 Quantification method suitable for vision self-attention model
CN117952869A (en) * 2024-03-27 2024-04-30 西南石油大学 Drilling fluid rock debris counting method based on weak light image enhancement

Also Published As

Publication number Publication date
CN116580440B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN116580440B (en) Lightweight lip language identification method based on visual transformer
CN111985369B (en) Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN113887610B (en) Pollen image classification method based on cross-attention distillation transformer
CN112507898B (en) Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
CN113343937B (en) Lip language identification method based on deep convolution and attention mechanism
CN112328767A (en) Question-answer matching method based on BERT model and comparative aggregation framework
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
Jiang et al. Cross-level reinforced attention network for person re-identification
CN117238019A (en) Video facial expression category identification method and system based on space-time relative transformation
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN116246305A (en) Pedestrian retrieval method based on hybrid component transformation network
CN116543289A (en) Image description method based on encoder-decoder and Bi-LSTM attention model
CN115527064A (en) Toxic mushroom fine-grained image classification method based on multi-stage ViT and contrast learning
CN116167014A (en) Multi-mode associated emotion recognition method and system based on vision and voice
CN115188079A (en) Continuous sign language identification method based on self-adaptive multi-scale attention time sequence network
CN113128456B (en) Pedestrian re-identification method based on combined picture generation
CN113761106B (en) Self-attention-strengthening bond transaction intention recognition system
CN117831138B (en) Multi-mode biological feature recognition method based on third-order knowledge distillation
CN116229234A (en) Image recognition method based on fusion attention mechanism
CN116665192A (en) Fatigue detection method based on LSTM and SSD lightweight network combination
Wang et al. Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer
CN118172705A (en) Cross-architecture video action recognition method and device based on knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant