CN118072395A - Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight - Google Patents

Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Info

Publication number
CN118072395A
CN118072395A (application CN202410253805.0A)
Authority
CN
China
Prior art keywords
dynamic gesture
layer
frame
attention
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410253805.0A
Other languages
Chinese (zh)
Inventor
张小瑞
曾祥龙
孙伟
陈春辉
黄志文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202410253805.0A
Publication of CN118072395A
Legal status: Pending


Classifications

    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space
    • G06V 10/806: Fusion of extracted features at sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a dynamic gesture recognition method combining multi-modal inter-frame motion and shared attention weights, which aims to solve problems of the prior art such as redundant information in video and the difficulty of accurately capturing moving-hand features. The method comprises: acquiring a dynamic gesture video; preprocessing the dynamic gesture video to obtain a dynamic gesture video frame sequence; and recognizing the dynamic gesture from the video frame sequence with a pre-trained dynamic gesture recognition model to obtain the dynamic gesture meaning category. The dynamic gesture recognition model comprises an embedding module, a feature extraction module, an inter-frame motion attention module, an adaptive fusion downsampling module and a fully connected layer connected in sequence. The method reduces the spatio-temporal search area to the region related to the hand, and improves dynamic gesture recognition accuracy while reducing the amount of computation.

Description

Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight
Technical Field
The invention relates to a dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight, and belongs to the technical field of gesture recognition.
Background
In many scenarios gestures are a basic form of communication, such as everyday greetings, the command gestures of traffic police and, most typically, the sign language used by deaf-mute people. Gestures are essentially a language: in real life they are, like spoken expression, continuous and dynamic, their meaning conveyed by hand postures that change and move over time. Gesture recognition lets a computer understand the meaning of a target gesture. Dynamic gesture recognition based on computer vision analyzes the video captured by a camera with a specific algorithm in order to classify the gesture, and therefore has many application scenarios in human-computer interaction, such as virtual reality, sign language translation and clinical medicine.
In dynamic gesture recognition the complex background is usually static or moves only slightly; that is, in gesture data the hand is the main moving object. This distinctive motion characteristic can help the model exclude redundant information unrelated to the hand, so that moving-hand features are extracted accurately and gesture recognition accuracy improves.
With the rise of deep learning, more and more neural networks for video understanding have appeared. Compared with other video understanding tasks, however, dynamic gesture recognition focuses more on hand actions: a gesture is judged mainly from the hand information within each frame and the hand motion between frames. A video frame sequence nevertheless contains a great deal of redundant information that disturbs the model's attention to the hand. On the one hand, every frame also records the human body and the complex background behind it, and information outside the hand often interferes with localizing the hand. On the other hand, because of the limited quality of the capture device, an excessive frame rate and similar causes, the data contain blurred and repeated frames, and these redundant frames likewise hamper the extraction of moving-hand features by the model.
Attention mechanisms are the most common way to address such problems: a specific neural network structure automatically learns and computes how much different parts of the data contribute to the recognition result, so that the model focuses more on the moving-hand region. Existing work often feeds the data directly into an attention module and searches for effective features over a wide range of spatio-temporal features; although the model can indeed attend to the hand, the large search range makes accurate capture of moving-hand features difficult.
To extract moving-hand features more completely, some researchers adopt multi-modal fusion, which applies an attention mechanism to the data of each modality separately and finally fuses the features extracted from the different modalities to obtain richer hand features. However, directly fusing the per-modality features lacks interaction between modalities: a single modality cannot make full use of the feature information of the other modalities when locating the hand, so the features finally extracted from each modality are inaccurate.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a dynamic gesture recognition method combining multi-modal inter-frame motion and shared attention weights, which reduces the spatio-temporal search area to the region related to the hand and improves dynamic gesture recognition accuracy while reducing the amount of computation.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme: a method of dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weights, comprising:
Acquiring a dynamic gesture video;
Preprocessing the dynamic gesture video to obtain a dynamic gesture video frame sequence;
According to the dynamic gesture video frame sequence, based on a pre-trained dynamic gesture recognition model, recognizing the dynamic gesture to obtain a dynamic gesture meaning category;
the dynamic gesture recognition model comprises an embedding module, a feature extraction module, an inter-frame motion attention module, a self-adaptive fusion downsampling module and a full-connection layer which are connected in sequence;
the embedding module is used for mapping the dynamic gesture video frame sequence to a high-dimensional vector space by utilizing block embedding operation, and carrying out position coding on the high-dimensional vector space to obtain initial characteristics;
the feature extraction module is used for extracting the local mode of the initial feature to obtain the feature containing the local mode;
The inter-frame motion attention module is used for carrying out inter-frame motion attention calculation on the features containing the local pattern to obtain the features of the hand region of interest;
The self-adaptive fusion downsampling module is used for carrying out self-adaptive fusion downsampling operation on the characteristics of the hand region of interest;
The full connection layer is used for outputting gesture meaning categories corresponding to the gesture video frame sequences.
Further, preprocessing the dynamic gesture video to obtain a dynamic gesture video frame sequence, including: processing the dynamic gesture video into multi-frame images;
and cutting the multi-frame image into a size of 224 multiplied by 224 pixels to obtain a dynamic gesture video frame sequence.
Furthermore, the embedding module is composed of a 3D convolution layer with a stride of (2, 4) and a position coding layer. The 3D convolution layer extracts local features of the dynamic gesture video data and expands the channel dimension at the same time, so that the video frame sequence is mapped to a high-dimensional vector space. The position coding layer adopts a learnable parameter matrix, which converts the dimension of the video frame sequence from B×3×L×H×W to B×C×L×H×W, where B is the batch size, C is the embedding dimension, L is the number of frames, H is the frame height, and W is the frame width, giving the initial features.
Further, the feature extraction module consists of a 3D sliding window layer with a stride of (2, h, w), 3 convolution layers, 1 residual connection layer and 2 linear layers connected in sequence, wherein h and w are the height and width of the sliding window;
The 3D sliding window layer divides the same area of two adjacent frames of the initial feature Xraw into one block and adjusts the tensor shape; the expression is as follows:
Xf,Xl=reshape(window(Xraw))
wherein, X f is the feature of the previous frame of all blocks, X l is the feature of the next frame of all blocks, reshape is the operation of adjusting the tensor shape, and window is the sliding window operation;
The 3 convolution layers are, in order, a 1×1 convolution layer with 64 output channels, a 3×3 convolution layer with 64 output channels and a 1×1 convolution layer whose number of output channels equals the embedding dimension C; the first 1×1 convolution layer adjusts the channel dimension of the frame image, the 3×3 convolution layer extracts the local pattern, and the last 1×1 convolution layer optimizes the local pattern and restores the channel dimension of the frame image;
The residual connection layer is used to alleviate the vanishing-gradient phenomenon during training, and its expression is as follows:
Pf=Xf+Conv(Xf)
Pl=Xl+Conv(Xl)
wherein Pf and Pl are the features of the previous frame and the next frame containing the local pattern, respectively, and Conv represents the convolution module composed of the three convolution layers above;
The 2 linear layers E f、El are used to optimize the local patterns in P f、Pl, respectively, and their expressions are as follows:
U=Pf(Af)T+bf
G=Pl(Al)T+bl
wherein U and G are the optimized features containing the local pattern obtained by processing Pf and Pl with the linear layers Ef and El, Af and Al are the weight matrices of the linear layers Ef and El, T denotes matrix transposition, and bf and bl are the biases of the linear layers Ef and El.
Further, the inter-frame motion attention module is configured to perform inter-frame motion attention calculation on the features containing the local pattern to obtain the features of the hand region of interest, and the method includes:
Taking the sum of products of a certain point of two adjacent frames in the characteristic containing the local mode and all points in the other frame as a similarity value of the point, and adopting matrix parallel calculation to obtain a similarity matrix M S, wherein the expression is as follows:
MS=U×GT
the similarity matrix main diagonal element is set to 0 through mask operation, and the expression is as follows:
Mm=Mask(MS)
Wherein M m is a similarity matrix after mask operation;
Summing the similarity matrix along the row and column dimensions respectively to obtain two similarity vectors corresponding to the two adjacent frames, namely Sum(Mm, -1) and Sum(Mm, -2);
The similarity vector is processed through a Softmax function to obtain a weight vector, and the expression is as follows:
Atten1=Softmax(Sum(Mm,-1))*Scale
Atten2=Softmax(Sum(Mm,-2))*Scale
Wherein Atten 1 is the weight vector of the similarity vector Sum (M m, -1), atten 2 is the weight vector of the similarity vector Sum (M m, -2), scale is a trainable parameter;
repeating the two weight vectors in the channel dimension for a plurality of times to obtain a weight matrix, wherein the repetition number is equal to the embedding dimension C, and applying the weight matrix to the characteristics comprising the local mode to obtain the characteristics of the hand region of interest, and the expression is as follows:
Xout=[Xf,Xl]×[Atten1,Atten2]
where X out is a feature of the hand region of interest.
Further, the adaptive fusion downsampling module is configured to perform adaptive fusion downsampling operation on features of a hand region of interest, and includes:
And sequentially performing adaptive space, time and multi-mode downsampling operations on the features, wherein each downsampling operation comprises two steps of sliding window division and adaptive downsampling.
Further, the sliding window dividing includes:
the self-adaptive space downsampling divides the characteristics of the hand region of interest into sliding windows with the size of 3 multiplied by 3 and the step length of 2;
the self-adaptive time downsampling is carried out on the characteristics of the hand region concerned by cutting a sliding window with the size of 3 multiplied by 1 and the step length of 1;
The self-adaptive multi-mode downsampling divides each mode t moment and frames adjacent to the t moment in the characteristics of the concerned hand area into a block;
the adaptive downsampling includes:
The method comprises the steps of respectively extracting the modes of the features through three linear layers E q、Ek、Ev to obtain three feature matrixes, wherein the expression is as follows:
Q=Xout(Aq)T+bq
K=Xout(Ak)T+bk
V=Xout(Av)T+bv
Wherein Q, K, V is a feature matrix obtained by processing input by the linear layer E q、Ek、Ev, A q、Ak、Av is a weight matrix of the linear layer E q、Ek、Ev, and b q、bk、bv is a bias of the linear layer E q、Ek、Ev;
Matrix multiplication is carried out on the feature matrix Q, K, and a scaling factor is multiplied to obtain a correlation matrix;
carrying out average pooling operation on each row of the correlation matrix, and inputting the average pooling operation into a Softmax function to obtain a weight vector, wherein the expression is as follows:
A(Q,K)=Softmax(Avg(Q×KT/√d))
wherein A(Q,K) is the weight vector of the features, Q×KT/√d is the correlation matrix, Avg denotes the average pooling over each row, 1/√d is the scaling factor, and d is the number of neurons of the linear layer Eq;
V is multiplied by A(Q,K) after matrix transposition, and a linear layer Er is used to optimize the features to obtain the tensor R;
R=VTA(Q,K)(Ar)T+br
Where A r is the weight matrix of the linear layer E r and b r is the bias of the linear layer E r.
Further, the full connection layer is configured to output a gesture category corresponding to the gesture video frame sequence, and includes:
And classifying R to obtain vectors with the length of gesture category number, wherein each element in the vectors corresponds to one gesture meaning category, the value of each element is 0-1, the element represents the probability of the corresponding gesture category, and the gesture meaning category corresponding to the maximum value is the gesture meaning category corresponding to the dynamic gesture video frame sequence.
Further, the training method of the dynamic gesture recognition model comprises the following steps:
a. Dividing the dynamic gesture video frame sequence data into a training set and a testing set;
b. Training the dynamic gesture recognition model by taking training set data as input and recognized gesture meaning categories as output;
c. during training, using a combination of the inter-modal attention weight loss and the cross-entropy loss as the loss function to adjust the model;
d. testing the dynamic gesture recognition model with the test set data as input to obtain a test result and judging whether the test result reaches the convergence condition; if so, the pre-trained dynamic gesture recognition model is obtained, and if not, a-d are repeated until the test result reaches the convergence condition, giving the pre-trained dynamic gesture recognition model.
Further, the expression of the loss function is:
L = LCE + λ·Latt, with LCE = −Σi Yi·log(yi) and Latt = Σi=1..nΣj=1..n αij(M1ij−M2ij)²
wherein L is the loss function, LCE is the cross-entropy loss function, λ is a model hyper-parameter, Latt is the inter-modal attention weight loss function, n is the number of rows and columns, Yi is the true gesture meaning category, yi is the gesture meaning category identified by the model, Avg denotes a 4×4 average pooling layer, m1 is the attention weight output during the RGB-modality inter-frame motion attention calculation, m2 is the attention weight output during the depth-modality inter-frame motion attention calculation, α=Avg(m1)⊙Avg(m2) is the attention effective weight value, αij is the value of α at row i and column j, and M1ij=Avg(m1)ij and M2ij=Avg(m2)ij are the values of the RGB-modality and depth-modality attention weights at row i and column j.
Compared with the prior art, the invention has the beneficial effects that:
The dynamic gesture recognition method provided by the invention combines feature extraction with inter-frame motion attention to reduce the spatio-temporal search area to the region related to the hand; the adaptive downsampling over different dimensions preserves the effectiveness of the features while reducing the overall amount of computation of the model;
In the training phase, the dynamic gesture recognition model of the invention allows different modalities to share attention weights through the inter-modal attention weight loss, so that each modality uses the attention weights of the other modalities to adjust its own, thereby improving the accuracy of hand feature extraction in each modality.
Drawings
FIG. 1 is a flow chart of a method for dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weights according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
As shown in fig. 1, the method for identifying a dynamic gesture combining multi-mode inter-frame motion and sharing attention weight according to the embodiment of the present invention includes:
First, the dynamic gesture video is acquired; since the method performs isolated gesture recognition, the acquired dynamic gesture video contains only one gesture.
The dynamic gesture video is then preprocessed: the video is decoded into a sequence of frames, and each frame is cropped to 224×224 pixels to obtain the dynamic gesture video frame sequence.
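For illustration, a minimal preprocessing sketch is given below, assuming OpenCV for decoding; the function name load_gesture_clip, the fixed frame count num_frames and the center-crop strategy are assumptions of this sketch and are not prescribed by the embodiment, which only specifies decoding the video into frames and cropping them to 224×224 pixels.

```python
import cv2
import numpy as np

def load_gesture_clip(video_path: str, num_frames: int = 32) -> np.ndarray:
    """Decode a dynamic gesture video, crop every frame to 224x224 and sample num_frames frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        side = min(h, w)                                  # center-crop the shorter side
        top, left = (h - side) // 2, (w - side) // 2
        frame = frame[top:top + side, left:left + side]
        frames.append(cv2.resize(frame, (224, 224)))      # 224 x 224 pixels per frame
    cap.release()
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)   # uniform temporal sampling
    return np.stack([frames[i] for i in idx])             # (num_frames, 224, 224, 3)
```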
A dynamic gesture recognition model is constructed, comprising an embedding module, a feature extraction module, an inter-frame motion attention module, an adaptive fusion downsampling module and a fully connected layer connected in sequence.
The embedding module consists of a 3D convolution layer with a stride of (2, 4) and a position coding layer; the position coding layer adopts a learnable parameter matrix. The module maps the dynamic gesture video frame sequence to a higher-dimensional vector space through a block embedding operation and applies position coding, specifically as follows:
The 3D convolution layer performs local feature extraction on the dynamic gesture video data and expands the channel dimension at the same time, so that the video frame sequence is mapped to a high-dimensional vector space. The learnable parameter matrix converts the dimension of the video frame sequence from B×3×L×H×W to B×C×L×H×W, where B is the batch size, C is the embedding dimension, L is the number of frames, H is the frame height, and W is the frame width, giving the initial feature Xraw.
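As an illustration of this step, the sketch below models the embedding module in PyTorch. It assumes that the 3D convolution uses kernel and stride (2, 4, 4) over (frames, height, width), so the stride also sets the temporal and spatial downsampling, and that the position code is a learnable parameter added to the block embeddings; the exact kernel size and the default embed_dim value are assumptions of this sketch rather than values stated in the embodiment.

```python
import torch
import torch.nn as nn

class EmbeddingModule(nn.Module):
    """Block embedding + learnable position code; kernel/stride (2, 4, 4) is an assumption."""
    def __init__(self, embed_dim: int = 96, frames: int = 32, size: int = 224):
        super().__init__()
        # block embedding: B x 3 x L x H x W -> B x C x L/2 x H/4 x W/4
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=(2, 4, 4), stride=(2, 4, 4))
        # learnable position code matching the embedded shape
        self.pos = nn.Parameter(torch.zeros(1, embed_dim, frames // 2, size // 4, size // 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, L, H, W) dynamic gesture video frame sequence
        return self.proj(x) + self.pos          # initial feature X_raw
```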
The feature extraction module consists of a 3D sliding window layer with a stride of (2, h, w), 3 convolution layers, 1 residual connection layer and 2 linear layers connected in sequence, where h and w are the height and width of the sliding window. It extracts local patterns from the initial feature to obtain the features containing the local pattern; the processing is as follows:
The 3D sliding window layer divides the same area of two adjacent frames of the initial feature Xraw into one block and adjusts the tensor shape; the expression is as follows:
Xf,Xl=reshape(window(Xraw))
where Xf is the feature of the previous frame of all blocks, Xl is the feature of the next frame of all blocks, reshape is the operation of adjusting the tensor shape, and window is the sliding window operation.
The 3 convolution layers are, in order, a 1×1 convolution layer with 64 output channels, a 3×3 convolution layer with 64 output channels and a 1×1 convolution layer whose number of output channels equals the embedding dimension C. The first 1×1 convolution layer adjusts the channel dimension of the frame image, the 3×3 convolution layer then extracts the local pattern, and the last 1×1 convolution layer optimizes the local pattern and restores the channel dimension of the frame image.
Finally, the residual connection layer is used to alleviate the vanishing-gradient phenomenon during training; the expression is as follows:
Pf=Xf+Conv(Xf)
Pl=Xl+Conv(Xl)
wherein Pf and Pl are the features of the previous frame and the next frame containing the local pattern, respectively, and Conv represents the convolution module composed of the three convolution layers above.
To further enhance the expressive power of the features, the local patterns in Pf and Pl are refined by the two linear layers Ef and El respectively, expressed as follows:
U=Pf(Af)T+bf
G=Pl(Al)T+bl
wherein U and G are the optimized features containing the local pattern obtained by processing Pf and Pl with the linear layers Ef and El, Af and Al are the weight matrices of the linear layers Ef and El, T denotes matrix transposition, and bf and bl are the biases of the linear layers Ef and El.
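The following PyTorch sketch shows one possible reading of the feature extraction module. The 1×1/3×3/1×1 bottleneck with 64 intermediate channels, the residual connection and the two linear layers Ef and El follow the description above; applying the bottleneck per frame with 2D convolutions and deferring the (h, w) sliding-window partition to the attention stage are simplifications assumed for this sketch.

```python
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """Bottleneck convolutions + residual connection + linear layers E_f / E_l, applied per frame."""
    def __init__(self, embed_dim: int = 96):
        super().__init__()
        self.conv = nn.Sequential(                     # 1x1 -> 3x3 -> 1x1 bottleneck
            nn.Conv2d(embed_dim, 64, kernel_size=1),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.Conv2d(64, embed_dim, kernel_size=1))
        self.e_f = nn.Linear(embed_dim, embed_dim)     # E_f refines the previous frames
        self.e_l = nn.Linear(embed_dim, embed_dim)     # E_l refines the next frames

    def _residual(self, v: torch.Tensor) -> torch.Tensor:
        # v: (B, T, C, H, W) -> P = X + Conv(X), convolution applied frame by frame
        b, t, c, h, w = v.shape
        out = self.conv(v.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return v + out

    def forward(self, x_raw: torch.Tensor):
        # x_raw: (B, C, L, H, W); adjacent frames form (previous, next) pairs
        x_f = x_raw[:, :, 0::2].transpose(1, 2)        # (B, L/2, C, H, W) previous frames
        x_l = x_raw[:, :, 1::2].transpose(1, 2)        # (B, L/2, C, H, W) next frames
        p_f, p_l = self._residual(x_f), self._residual(x_l)
        u = self.e_f(p_f.permute(0, 1, 3, 4, 2))       # U: (B, L/2, H, W, C)
        g = self.e_l(p_l.permute(0, 1, 3, 4, 2))       # G: (B, L/2, H, W, C)
        return u, g, x_f, x_l
```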
The inter-frame motion attention module performs inter-frame motion attention calculation on the features containing the local pattern to obtain the features of the hand region of interest; the calculation process is as follows:
The sum of the products between a point of one of two adjacent frames in the features containing the local pattern and all points of the other frame is taken as the similarity value of that point, and this value indicates whether the point is moving. To improve efficiency, this embodiment uses matrix parallel computation to obtain the similarity matrix MS; the expression is as follows:
MS=U×GT
To prevent the high similarity of points at the same position in two adjacent frames from affecting model convergence, the main diagonal elements of the similarity matrix are set to 0 by a mask operation; the expression is as follows:
Mm=Mask(MS)
wherein M m is a similarity matrix after mask operation.
The similarity matrix is then summed along the row and column dimensions respectively to obtain the similarity vectors of the two corresponding frames; the weight vectors are obtained through the Softmax function and multiplied by a trainable parameter Scale to prevent Softmax from weakening the diversity of the features. The expression is as follows:
Atten1=Softmax(Sum(Mm,-1))*Scale
Atten2=Softmax(Sum(Mm,-2))*Scale
wherein Atten1 is the weight vector of the similarity vector Sum(Mm, -1), Atten2 is the weight vector of the similarity vector Sum(Mm, -2), and Scale is a trainable parameter.
The weight vectors Atten1 and Atten2 are repeated along the channel dimension (the number of repetitions equals the embedding dimension C) to obtain weight matrices, which are applied to the features containing the local pattern to obtain the features of the hand region of interest; the expression is as follows:
Xout=[Xf,Xl]×[Atten1,Atten2]
where Xout is the feature of the hand region of interest.
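A sketch of the inter-frame motion attention is given below. The similarity matrix, the diagonal mask, the row and column sums, the Softmax and the trainable Scale follow the description; the (N, P, C) token layout per window and the concatenation of the re-weighted previous and next frames at the end are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class InterFrameMotionAttention(nn.Module):
    """Inter-frame motion attention for a batch of windows of paired adjacent frames."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))       # trainable parameter Scale

    def forward(self, u, g, x_f, x_l):
        # u, g, x_f, x_l: (N, P, C) where N = number of windows, P = points per window
        m_s = u @ g.transpose(-1, -2)                  # similarity matrix M_S: (N, P, P)
        eye = torch.eye(m_s.size(-1), device=m_s.device, dtype=torch.bool)
        m_m = m_s.masked_fill(eye, 0.0)                # mask the main diagonal
        atten1 = torch.softmax(m_m.sum(dim=-1), dim=-1) * self.scale   # weights for the previous frame
        atten2 = torch.softmax(m_m.sum(dim=-2), dim=-1) * self.scale   # weights for the next frame
        # broadcast the weights over the channel dimension and re-weight the features
        x_out_f = x_f * atten1.unsqueeze(-1)
        x_out_l = x_l * atten2.unsqueeze(-1)
        return torch.cat([x_out_f, x_out_l], dim=1)    # X_out: features of the hand region of interest
```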
The adaptive fusion downsampling module performs an adaptive fusion downsampling operation on the features of the hand region of interest in the different modalities through an adaptive downsampling algorithm. Specifically, adaptive spatial, temporal and multi-modal downsampling operations are performed on the features in sequence, and each downsampling operation comprises two steps: sliding window division and adaptive downsampling.
The sliding window division includes:
Adaptive spatial downsampling divides the features of the hand region of interest with sliding windows of size 3×3 and stride 2;
Adaptive temporal downsampling divides the features of the hand region of interest with sliding windows of size 3×1 and stride 1;
Adaptive multi-modal downsampling groups, for each modality, time t and the frames adjacent to t in the features of the hand region of interest into one block.
Adaptive downsampling includes:
Patterns are extracted from the features by the three linear layers Eq, Ek, Ev respectively, giving three feature matrices; the expression is as follows:
Q=Xout(Aq)T+bq
K=Xout(Ak)T+bk
V=Xout(Av)T+bv
wherein Q, K, V are the feature matrices obtained by processing the input with the linear layers Eq, Ek, Ev, Aq, Ak, Av are the weight matrices of the linear layers Eq, Ek, Ev, and bq, bk, bv are the biases of the linear layers Eq, Ek, Ev.
To obtain a self-attention-based weight vector, Q and K are matrix-multiplied and scaled by a scaling factor to obtain a correlation matrix; the scaling factor prevents the gradients from being pushed into an extremely small region after the Softmax function.
Before the correlation matrix is fed into the Softmax function, an average pooling operation is applied to each of its rows, which yields the importance of the different points within the same sliding window; this is finally input into the Softmax function to obtain the weight vector. The expression is as follows:
A(Q,K)=Softmax(Avg(Q×KT/√d))
wherein A(Q,K) is the weight vector of the features, Avg(Q×KT/√d) is the pooled correlation matrix, 1/√d is the scaling factor, and d is the number of neurons of the linear layer Eq;
V is transposed and multiplied by A(Q,K), which fuses and downsamples the features according to their importance; the result is then optimized by the linear layer Er to obtain the tensor R, whose expression is as follows:
R=VTA(Q,K)(Ar)T+br
where Ar is the weight matrix of the linear layer Er and br is the bias of the linear layer Er.
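The adaptive downsampling step can be sketched as follows. The linear layers Eq, Ek, Ev and Er, the row-wise average pooling of the correlation matrix and the importance-weighted fusion of V by A(Q,K) follow the description; the 1/√d scaling and the assumption that the caller has already grouped the tokens into windows (spatial 3×3 with stride 2, temporal 3×1 with stride 1, or per-modality blocks) are choices made for this sketch.

```python
import torch
import torch.nn as nn

class AdaptiveDownsample(nn.Module):
    """One adaptive downsampling step: every window of P points is fused into a single token."""
    def __init__(self, dim: int = 96):
        super().__init__()
        self.e_q = nn.Linear(dim, dim)                 # E_q
        self.e_k = nn.Linear(dim, dim)                 # E_k
        self.e_v = nn.Linear(dim, dim)                 # E_v
        self.e_r = nn.Linear(dim, dim)                 # E_r
        self.d = dim                                   # number of neurons of E_q

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, P, C) tokens grouped by sliding window (spatial, temporal or multi-modal)
        q, k, v = self.e_q(x), self.e_k(x), self.e_v(x)
        corr = q @ k.transpose(-1, -2) / self.d ** 0.5          # scaled correlation matrix
        weights = torch.softmax(corr.mean(dim=-1), dim=-1)      # average-pool each row, then Softmax
        fused = torch.einsum('npc,np->nc', v, weights)          # importance-weighted fusion of V
        return self.e_r(fused)                                  # tensor R: one token per window
```

Each call therefore reduces every window of P tokens to one fused token, which is how the successive spatial, temporal and multi-modal downsampling steps shrink the feature while keeping its most important points.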
The fully connected layer outputs the gesture meaning category corresponding to the gesture video frame sequence. Specifically, R is classified into a vector whose length equals the number of gesture categories; each element of the vector corresponds to one gesture meaning category, takes a value between 0 and 1 and represents the probability of that category, and the gesture meaning category with the maximum value is the one corresponding to the dynamic gesture video frame sequence.
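A minimal sketch of the classification head follows: the fully connected layer maps the fused feature to a vector whose length equals the number of gesture categories, and the category with the maximum probability is returned; the class count of 27 and the pooling of R into a single vector per clip are hypothetical choices for this sketch.

```python
import torch
import torch.nn as nn

num_classes = 27                                       # hypothetical number of gesture categories
head = nn.Sequential(nn.Linear(96, num_classes),       # fully connected layer
                     nn.Softmax(dim=-1))               # per-category probabilities in [0, 1]

r = torch.randn(1, 96)                                 # tensor R, assumed pooled to one vector per clip
probs = head(r)
pred = probs.argmax(dim=-1)                            # index of the recognized gesture meaning category
```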
The dynamic gesture recognition model needs to be pre-trained after being built, and the pre-training method comprises the following steps:
a. Dividing the dynamic gesture video frame sequence data into a training set and a testing set;
b. Training the dynamic gesture recognition model by taking training set data as input and recognized gesture meaning categories as output;
c. During training, a combination of the inter-modal attention weight loss and the cross-entropy loss is used as the loss function to adjust the model; the specific process is as follows:
First, the effectiveness of the weight of each region is measured. Before the distance between modalities is computed, the weight matrix output by the inter-frame motion attention module is processed with a 4×4 average pooling layer; the averaging invariance and feature-summarizing ability of average pooling yield a local representative weight value for each modality:
M1=Avg(m1)
M2=Avg(m2)
wherein M1 and M2 are the local representative weight values of the RGB modality and the depth modality, m1 and m2 are the attention weights output during the RGB-modality and depth-modality inter-frame motion attention calculations, and Avg is a 4×4 average pooling layer. To measure the effectiveness of the attention weights of the two modalities in the same local area simultaneously, the local representative weight values of the two modalities are combined by a Hadamard product into the attention effective weight value; the expression is as follows:
α=M1⊙M2
where α is the attention effective weight value and ⊙ denotes the Hadamard product.
During training the distance between the attention weights of the different modalities is reduced to a certain extent. The square of the weight difference is used as the distance measure, and to avoid negative effects caused by erroneous weights the distance is multiplied by the attention effective weight value α; the inter-modal attention weight loss function is therefore expressed as follows:
Latt = Σi=1..nΣj=1..n αij(M1ij−M2ij)²
where n is the number of rows and columns, αij is the value of the attention effective weight α at row i and column j, and M1ij and M2ij are the values of the RGB-modality and depth-modality attention weights (after 4×4 average pooling) at row i and column j.
The cross entropy loss function is expressed as:
LCE = −Σi Yi·log(yi)
where Yi is the true label and yi is the model prediction.
Finally, the two loss functions are summed with a weighting factor to form the final loss function; the expression is as follows:
L = LCE + λ·Latt
where λ is a model hyper-parameter that controls the proportion of the inter-modal attention weight loss in the final loss value.
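The combined loss can be sketched as follows in PyTorch; the value of λ (lam), the shape assumed for the attention weight maps and the unnormalised sum in the inter-modal term are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def shared_attention_loss(logits, target, m1, m2, lam=0.1):
    """Combined loss: cross-entropy plus the inter-modal attention weight loss.
    m1 / m2 are the attention weight maps of the RGB and depth branches; the value of
    lam (lambda), the map shapes and the unnormalised sum are assumptions of this sketch."""
    ce = F.cross_entropy(logits, target)        # L_CE
    m1p = F.avg_pool2d(m1, kernel_size=4)       # M1 = Avg(m1), 4x4 average pooling
    m2p = F.avg_pool2d(m2, kernel_size=4)       # M2 = Avg(m2)
    alpha = m1p * m2p                           # attention effective weight (Hadamard product)
    att = (alpha * (m1p - m2p) ** 2).sum()      # inter-modal attention weight loss L_att
    return ce + lam * att

# usage with dummy shapes: logits (B, classes), targets (B,), weight maps (B, 1, H, W)
loss = shared_attention_loss(torch.randn(4, 27), torch.randint(0, 27, (4,)),
                             torch.rand(4, 1, 16, 16), torch.rand(4, 1, 16, 16))
```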
d. The dynamic gesture recognition model is tested with the test set data as input to obtain a test result, and whether the test result reaches the convergence condition is judged. If so, the pre-trained dynamic gesture recognition model is obtained; if not, steps a-d are repeated until the test result reaches the convergence condition, giving the pre-trained dynamic gesture recognition model.
The obtained dynamic gesture frame sequence is input into the pre-trained dynamic gesture recognition model, which outputs a vector whose length equals the number of gesture categories; each element corresponds to one gesture meaning category, takes a value between 0 and 1 and represents the probability of that category, and the gesture meaning category with the maximum value is the one corresponding to the dynamic gesture video frame sequence.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A method of dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weights, comprising:
Acquiring a dynamic gesture video;
Preprocessing the dynamic gesture video to obtain a dynamic gesture video frame sequence;
According to the dynamic gesture video frame sequence, based on a pre-trained dynamic gesture recognition model, recognizing the dynamic gesture to obtain a dynamic gesture meaning category;
the dynamic gesture recognition model comprises an embedding module, a feature extraction module, an inter-frame motion attention module, a self-adaptive fusion downsampling module and a full-connection layer which are connected in sequence;
the embedding module is used for mapping the dynamic gesture video frame sequence to a high-dimensional vector space by utilizing block embedding operation, and carrying out position coding on the high-dimensional vector space to obtain initial characteristics;
the feature extraction module is used for extracting the local mode of the initial feature to obtain the feature containing the local mode;
The inter-frame motion attention module is used for carrying out inter-frame motion attention calculation on the features containing the local pattern to obtain the features of the hand region of interest;
The self-adaptive fusion downsampling module is used for carrying out self-adaptive fusion downsampling operation on the characteristics of the hand region of interest;
The full connection layer is used for outputting gesture meaning categories corresponding to the gesture video frame sequences.
2. The method for dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weight of claim 1, wherein preprocessing the dynamic gesture video to obtain a sequence of dynamic gesture video frames comprises: processing the dynamic gesture video into multi-frame images;
and cutting the multi-frame image into a size of 224 multiplied by 224 pixels to obtain a dynamic gesture video frame sequence.
3. The method according to claim 1, wherein the embedding module is composed of a 3D convolution layer with step size (2, 4) and a position coding layer, the 3D convolution layer is used for extracting local features of the dynamic gesture video data and expanding channel dimensions so that the video frame sequence is mapped to a high-dimensional vector space, the position coding layer adopts a learnable parameter matrix for converting dimensions of the video frame sequence from b×3×l×h×w to b×c×l×h×w, wherein B is a batch number, C is an embedding dimension, L is a frame number, H is a frame height, and W is a frame width, and initial features X raw are obtained.
4. The method for dynamic gesture recognition combining multi-mode inter-frame motion and shared attention weight according to claim 1, wherein the feature extraction module consists of a 3D sliding window layer with step sizes of (2, h, w), 3 convolution layers, 1 residual connecting layer connected end to end and 2 linear layers which are connected in sequence, wherein h and w are the height and width of the sliding window;
The 3D sliding window layer divides the same area of two adjacent frames of the initial feature Xraw into one block and adjusts the tensor shape; the expression is as follows:
Xf,Xl=reshape(window(Xraw))
wherein, X f is the feature of the previous frame of all blocks, X l is the feature of the next frame of all blocks, reshape is the operation of adjusting the tensor shape, and window is the sliding window operation;
The 3 convolution layers sequentially comprise a convolution layer with the size of 1 multiplied by 1, the number of output channels of 64, a convolution layer with the size of 3 multiplied by 3, the number of output channels of 64 and a convolution layer with the size of 1 multiplied by 1, the number of output channels of embedded dimension C, wherein the convolution layer with the size of 1 multiplied by 1, the number of output channels of 64 is used for adjusting the dimension of a frame image channel, the size of 3 multiplied by 3, the number of output channels of 64 is used for extracting a local mode, the size of 1 multiplied by 1, the number of output channels of the convolution layer with the embedded dimension C is used for optimizing the local mode and adjusting the dimension of the frame image channel back;
The residual connecting layer is used for relieving the gradient disappearance phenomenon generated in the training process, and the expression is as follows:
Pf=Xf+Conv(Xf)
Pl=Xl+Conv(Xl)
wherein Pf and Pl are the features of the previous frame and the next frame containing the local pattern, respectively, and Conv represents the convolution module composed of the three convolution layers above;
The 2 linear layers E f、El are used to optimize the local patterns in P f、Pl, respectively, and their expressions are as follows:
U=Pf(Af)T+bf
G=Pl(Al)T+bl
wherein U and G are the optimized features containing the local pattern obtained by processing Pf and Pl with the linear layers Ef and El, Af and Al are the weight matrices of the linear layers Ef and El, T denotes matrix transposition, and bf and bl are the biases of the linear layers Ef and El.
5. The method of claim 1, wherein the inter-frame motion attention module is configured to perform inter-frame motion attention computation on an initial feature comprising a local pattern to obtain a feature of a hand region of interest, and wherein the method comprises:
Taking the sum of products of a certain point of two adjacent frames in the characteristic containing the local mode and all points in the other frame as a similarity value of the point, and adopting matrix parallel calculation to obtain a similarity matrix M S, wherein the expression is as follows:
MS=U×GT
the similarity matrix main diagonal element is set to 0 through mask operation, and the expression is as follows:
Mm=Mask(MS)
Wherein M m is a similarity matrix after mask operation;
Summing the similarity matrix along the row and column dimensions respectively to obtain two similarity vectors corresponding to the two adjacent frames, namely Sum(Mm, -1) and Sum(Mm, -2);
The similarity vector is processed through a Softmax function to obtain a weight vector, and the expression is as follows:
Atten1=Softmax(Sum(Mm,-1))*Scale
Atten2=Softmax(Sum(Mm,-2))*Scale
Wherein Atten 1 is the weight vector of the similarity vector Sum (M m, -1), atten 2 is the weight vector of the similarity vector Sum (M m, -2), scale is a trainable parameter;
repeating the two weight vectors in the channel dimension for a plurality of times to obtain a weight matrix, wherein the repetition number is equal to the embedding dimension C, and applying the weight matrix to the characteristics comprising the local mode to obtain the characteristics of the hand region of interest, and the expression is as follows:
Xout=[Xf,Xl]×[Atten1,Atten2]
where X out is a feature of the hand region of interest.
6. The method for dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weights of claim 1, wherein the adaptive fusion downsampling module is configured to perform an adaptive fusion downsampling operation on features of a hand region of interest, comprising:
And sequentially performing adaptive space, time and multi-mode downsampling operations on the features, wherein each downsampling operation comprises two steps of sliding window division and adaptive downsampling.
7. The method of claim 6, wherein the sliding window partitioning comprises:
the self-adaptive space downsampling divides the characteristics of the hand region of interest into sliding windows with the size of 3 multiplied by 3 and the step length of 2;
the self-adaptive time downsampling is carried out on the characteristics of the hand region concerned by cutting a sliding window with the size of 3 multiplied by 1 and the step length of 1;
The self-adaptive multi-mode downsampling divides each mode t moment and frames adjacent to the t moment in the characteristics of the concerned hand area into a block;
the adaptive downsampling includes:
The method comprises the steps of respectively extracting the modes of the features through three linear layers E q、Ek、Ev to obtain three feature matrixes, wherein the expression is as follows:
Q=Xout(Aq)T+bq
K=Xout(Ak)T+bk
V=Xout(Av)T+bv
Wherein Q, K, V is a feature matrix obtained by processing input by the linear layer E q、Ek、Ev, A q、Ak、Av is a weight matrix of the linear layer E q、Ek、Ev, and b q、bk、bv is a bias of the linear layer E q、Ek、Ev;
Matrix multiplication is carried out on Q, K, and a scaling factor is multiplied to obtain a correlation matrix;
carrying out average pooling operation on each row of the correlation matrix, and inputting the average pooling operation into a Softmax function to obtain a weight vector, wherein the expression is as follows:
A(Q,K)=Softmax(Avg(Q×KT/√d))
wherein A(Q,K) is the weight vector of the features, Avg(Q×KT/√d) is the pooled correlation matrix, 1/√d is the scaling factor, and d is the number of neurons of the linear layer Eq;
V is multiplied by A (Q, K) after matrix transposition, and a linear layer E r is used for optimizing the characteristics to obtain tensor R;
R=VTA(Q,K)(Ar)T+br
Where A r is the weight matrix of the linear layer E r and b r is the bias of the linear layer E r.
8. The method for dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weight of claim 1, wherein the fully connected layer is configured to output a gesture category corresponding to a gesture video frame sequence, and the method comprises:
And classifying R to obtain vectors with the length of gesture category number, wherein each element in the vectors corresponds to one gesture meaning category, the value of each element is 0-1, the element represents the probability of the corresponding gesture category, and the gesture meaning category corresponding to the maximum value is the gesture meaning category corresponding to the dynamic gesture video frame sequence.
9. The method of claim 1, wherein the training method of the dynamic gesture recognition model comprises:
a. Dividing the dynamic gesture video frame sequence data into a training set and a testing set;
b. Training the dynamic gesture recognition model by taking training set data as input and recognized gesture meaning categories as output;
c. during training, using a combination of the inter-modal attention weight loss and the cross-entropy loss as the loss function to adjust the model;
d. testing the dynamic gesture recognition model with the test set data as input to obtain a test result and judging whether the test result reaches the convergence condition; if so, the pre-trained dynamic gesture recognition model is obtained, and if not, a-d are repeated until the test result reaches the convergence condition, giving the pre-trained dynamic gesture recognition model.
10. The method of claim 9, wherein the expression of the loss function is:
L = LCE + λ·Latt, with LCE = −Σi Yi·log(yi) and Latt = Σi=1..nΣj=1..n αij(M1ij−M2ij)²
wherein L is the loss function, LCE is the cross-entropy loss function, λ is a model hyper-parameter, Latt is the inter-modal attention weight loss function, n is the number of rows and columns, Yi is the true gesture meaning category, yi is the gesture meaning category identified by the model, Avg denotes a 4×4 average pooling layer, m1 is the attention weight output during the RGB-modality inter-frame motion attention calculation, m2 is the attention weight output during the depth-modality inter-frame motion attention calculation, α=Avg(m1)⊙Avg(m2) is the attention effective weight value, αij is the value of α at row i and column j, and M1ij=Avg(m1)ij and M2ij=Avg(m2)ij are the values of the RGB-modality and depth-modality attention weights at row i and column j.
Application CN202410253805.0A, priority date 2024-03-06, filing date 2024-03-06: Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight. Status: Pending. Published as CN118072395A.

Priority Applications (1)

Application Number: CN202410253805.0A (published as CN118072395A)
Priority Date: 2024-03-06
Filing Date: 2024-03-06
Title: Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Applications Claiming Priority (1)

Application Number: CN202410253805.0A (published as CN118072395A)
Priority Date: 2024-03-06
Filing Date: 2024-03-06
Title: Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Publications (1)

Publication Number: CN118072395A
Publication Date: 2024-05-24

Family

ID=91107211

Family Applications (1)

Application Number: CN202410253805.0A (CN118072395A)
Priority Date: 2024-03-06
Filing Date: 2024-03-06
Title: Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Country Status (1)

Country Link
CN (1) CN118072395A (en)

Similar Documents

Publication Publication Date Title
CN108491835B (en) Two-channel convolutional neural network for facial expression recognition
CN111639692B (en) Shadow detection method based on attention mechanism
US11967175B2 (en) Facial expression recognition method and system combined with attention mechanism
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN113496217B (en) Method for identifying human face micro expression in video image sequence
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
CN111652903B (en) Pedestrian target tracking method based on convolution association network in automatic driving scene
CN109886225A (en) A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN112070768B (en) Anchor-Free based real-time instance segmentation method
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN112818764A (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN114360067A (en) Dynamic gesture recognition method based on deep learning
CN116403152A (en) Crowd density estimation method based on spatial context learning network
CN115953736A (en) Crowd density estimation method based on video monitoring and deep neural network
CN114724185A (en) Light-weight multi-person posture tracking method
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN114913342A (en) Motion blurred image line segment detection method and system fusing event and image
CN111242003A (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN117975565A (en) Action recognition system and method based on space-time diffusion and parallel convertors
CN114898464B (en) Lightweight accurate finger language intelligent algorithm identification method based on machine vision
CN112528077A (en) Video face retrieval method and system based on video embedding
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination