CN118072395A - Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight - Google Patents

Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Info

Publication number
CN118072395A
CN118072395A (application CN202410253805.0A)
Authority
CN
China
Prior art keywords
dynamic gesture
layer
frame
attention
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410253805.0A
Other languages
Chinese (zh)
Inventor
张小瑞
曾祥龙
孙伟
陈春辉
黄志文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202410253805.0A
Publication of CN118072395A
Legal status: Pending


Classifications

    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space
    • G06V 10/806: Fusion of extracted features at sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a dynamic gesture recognition method combining multi-modal inter-frame motion and shared attention weights, which aims to solve problems of the prior art such as redundant information in video and the difficulty of accurately capturing moving-hand features. The method comprises: acquiring a dynamic gesture video; preprocessing the dynamic gesture video to obtain a dynamic gesture video frame sequence; and recognizing the dynamic gesture from the video frame sequence with a pre-trained dynamic gesture recognition model to obtain the dynamic gesture meaning category. The dynamic gesture recognition model comprises an embedding module, a feature extraction module, an inter-frame motion attention module, an adaptive fusion downsampling module and a fully connected layer connected in sequence. The method reduces the spatio-temporal search area to the region related to the hand, and improves dynamic gesture recognition accuracy while reducing the amount of computation.

Description

Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight
Technical Field
The invention relates to a dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight, and belongs to the technical field of gesture recognition.
Background
In many scenarios gestures are a basic form of communication, such as everyday greetings, the command gestures of traffic police and, most typically, the sign language used by deaf-mute people. Gestures are essentially a language: in real life they are, like spoken expression, continuous and dynamic, their meaning conveyed by hand postures that change and move over time. Gesture recognition lets a computer understand the meaning of a target gesture. Dynamic gesture recognition based on computer vision analyzes the video captured by a camera with a specific algorithm in order to classify the gesture, and therefore has many application scenarios in human-computer interaction, such as virtual reality, sign language translation and clinical medicine.
In dynamic gesture recognition the complex background is usually static or moves only slightly; that is, in gesture data the hand is the main moving object. This distinctive motion characteristic can help the model exclude redundant information unrelated to the hand, so that moving-hand features are extracted accurately and gesture recognition accuracy improves.
With the rise of deep learning, more and more neural networks for video understanding have appeared. Compared with other video understanding tasks, however, dynamic gesture recognition focuses more on hand actions: a gesture is judged mainly from the hand information within each frame and the hand motion between frames. A video frame sequence nevertheless contains a great deal of redundant information that disturbs the model's attention to the hand. On the one hand, every frame also records the human body and the complex background behind it, and information outside the hand often interferes with localizing the hand. On the other hand, because of the limited quality of the capture device, an excessive frame rate and similar causes, the data contain blurred and repeated frames, and these redundant frames likewise hamper the extraction of moving-hand features by the model.
Attention mechanisms are the most common way to address such problems: a specific neural network structure automatically learns and computes how much different parts of the data contribute to the recognition result, so that the model focuses more on the moving-hand region. Existing work often feeds the data directly into an attention module and searches for effective features over a wide range of spatio-temporal features; although the model can indeed attend to the hand, the large search range makes accurate capture of moving-hand features difficult.
To extract moving-hand features more completely, some researchers adopt multi-modal fusion, which applies an attention mechanism to the data of each modality separately and finally fuses the features extracted from the different modalities to obtain richer hand features. However, directly fusing the per-modality features lacks interaction between modalities: a single modality cannot make full use of the feature information of the other modalities when locating the hand, so the features finally extracted from each modality are inaccurate.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a dynamic gesture recognition method combining multi-modal inter-frame motion and shared attention weights, which reduces the spatio-temporal search area to the region related to the hand and improves dynamic gesture recognition accuracy while reducing the amount of computation.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme: a method of dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weights, comprising:
Acquiring a dynamic gesture video;
Preprocessing the dynamic gesture video to obtain a dynamic gesture video frame sequence;
According to the dynamic gesture video frame sequence, based on a pre-trained dynamic gesture recognition model, recognizing the dynamic gesture to obtain a dynamic gesture meaning category;
the dynamic gesture recognition model comprises an embedding module, a feature extraction module, an inter-frame motion attention module, a self-adaptive fusion downsampling module and a full-connection layer which are connected in sequence;
the embedding module is used for mapping the dynamic gesture video frame sequence to a high-dimensional vector space by utilizing block embedding operation, and carrying out position coding on the high-dimensional vector space to obtain initial characteristics;
the feature extraction module is used for extracting the local mode of the initial feature to obtain the feature containing the local mode;
The inter-frame motion attention module is used for carrying out inter-frame motion attention calculation on the features containing the local pattern to obtain the features of the hand region of interest;
The self-adaptive fusion downsampling module is used for carrying out self-adaptive fusion downsampling operation on the characteristics of the hand region of interest;
The full connection layer is used for outputting gesture meaning categories corresponding to the gesture video frame sequences.
Further, preprocessing the dynamic gesture video to obtain a dynamic gesture video frame sequence, including: processing the dynamic gesture video into multi-frame images;
and cutting the multi-frame image into a size of 224 multiplied by 224 pixels to obtain a dynamic gesture video frame sequence.
Furthermore, the embedding module is composed of a 3D convolution layer with a stride of (2, 4) and a position coding layer. The 3D convolution layer extracts local features of the dynamic gesture video data and expands the channel dimension at the same time, so that the video frame sequence is mapped to a high-dimensional vector space. The position coding layer adopts a learnable parameter matrix, which converts the dimension of the video frame sequence from B×3×L×H×W to B×C×L×H×W, where B is the batch size, C is the embedding dimension, L is the number of frames, H is the frame height, and W is the frame width, giving the initial features.
Further, the feature extraction module consists of a 3D sliding window layer with a stride of (2, h, w), 3 convolution layers, 1 residual connection layer and 2 linear layers connected in sequence, wherein h and w are the height and width of the sliding window;
The 3D sliding window layer divides the same area of two adjacent frames of the initial feature Xraw into one block and adjusts the tensor shape; the expression is as follows:
Xf,Xl=reshape(window(Xraw))
wherein, X f is the feature of the previous frame of all blocks, X l is the feature of the next frame of all blocks, reshape is the operation of adjusting the tensor shape, and window is the sliding window operation;
The 3 convolution layers are, in order, a 1×1 convolution layer with 64 output channels, a 3×3 convolution layer with 64 output channels and a 1×1 convolution layer whose number of output channels equals the embedding dimension C; the first 1×1 convolution layer adjusts the channel dimension of the frame image, the 3×3 convolution layer extracts the local pattern, and the last 1×1 convolution layer optimizes the local pattern and restores the channel dimension of the frame image;
The residual connection layer is used to alleviate the vanishing-gradient phenomenon during training, and its expression is as follows:
Pf=Xf+Conv(Xf)
Pl=Xl+Conv(Xl)
wherein Pf and Pl are the features of the previous frame and the next frame containing the local pattern, respectively, and Conv represents the convolution module composed of the three convolution layers above;
The 2 linear layers E f、El are used to optimize the local patterns in P f、Pl, respectively, and their expressions are as follows:
U=Pf(Af)T+bf
G=Pl(Al)T+bl
wherein U and G are the optimized features containing the local pattern obtained by processing Pf and Pl with the linear layers Ef and El, Af and Al are the weight matrices of the linear layers Ef and El, T denotes matrix transposition, and bf and bl are the biases of the linear layers Ef and El.
Further, the inter-frame motion attention module is configured to perform inter-frame motion attention calculation on the features containing the local pattern to obtain the features of the hand region of interest, and the method includes:
Taking the sum of products of a certain point of two adjacent frames in the characteristic containing the local mode and all points in the other frame as a similarity value of the point, and adopting matrix parallel calculation to obtain a similarity matrix M S, wherein the expression is as follows:
MS=U×GT
the similarity matrix main diagonal element is set to 0 through mask operation, and the expression is as follows:
Mm=Mask(MS)
Wherein M m is a similarity matrix after mask operation;
Summing the similarity matrix along the row and column dimensions respectively to obtain two similarity vectors corresponding to the two adjacent frames, namely Sum(Mm, -1) and Sum(Mm, -2);
The similarity vector is processed through a Softmax function to obtain a weight vector, and the expression is as follows:
Atten1=Softmax(Sum(Mm,-1))*Scale
Atten2=Softmax(Sum(Mm,-2))*Scale
Wherein Atten 1 is the weight vector of the similarity vector Sum (M m, -1), atten 2 is the weight vector of the similarity vector Sum (M m, -2), scale is a trainable parameter;
repeating the two weight vectors in the channel dimension for a plurality of times to obtain a weight matrix, wherein the repetition number is equal to the embedding dimension C, and applying the weight matrix to the characteristics comprising the local mode to obtain the characteristics of the hand region of interest, and the expression is as follows:
Xout=[Xf,Xl]×[Atten1,Atten2]
where X out is a feature of the hand region of interest.
Further, the adaptive fusion downsampling module is configured to perform adaptive fusion downsampling operation on features of a hand region of interest, and includes:
And sequentially performing adaptive space, time and multi-mode downsampling operations on the features, wherein each downsampling operation comprises two steps of sliding window division and adaptive downsampling.
Further, the sliding window dividing includes:
the self-adaptive space downsampling divides the characteristics of the hand region of interest into sliding windows with the size of 3 multiplied by 3 and the step length of 2;
the self-adaptive time downsampling is carried out on the characteristics of the hand region concerned by cutting a sliding window with the size of 3 multiplied by 1 and the step length of 1;
The self-adaptive multi-mode downsampling divides each mode t moment and frames adjacent to the t moment in the characteristics of the concerned hand area into a block;
the adaptive downsampling includes:
The method comprises the steps of respectively extracting the modes of the features through three linear layers E q、Ek、Ev to obtain three feature matrixes, wherein the expression is as follows:
Q=Xout(Aq)T+bq
K=Xout(Ak)T+bk
V=Xout(Av)T+bv
Wherein Q, K, V is a feature matrix obtained by processing input by the linear layer E q、Ek、Ev, A q、Ak、Av is a weight matrix of the linear layer E q、Ek、Ev, and b q、bk、bv is a bias of the linear layer E q、Ek、Ev;
Matrix multiplication is carried out on the feature matrix Q, K, and a scaling factor is multiplied to obtain a correlation matrix;
carrying out average pooling operation on each row of the correlation matrix, and inputting the average pooling operation into a Softmax function to obtain a weight vector, wherein the expression is as follows:
A(Q,K)=Softmax(Avg(Q×KT/√d))
wherein A(Q,K) is the weight vector of the features, Q×KT/√d is the correlation matrix, Avg denotes the average pooling over each row, 1/√d is the scaling factor, and d is the number of neurons of the linear layer Eq;
V is multiplied by A(Q,K) after matrix transposition, and a linear layer Er is used to optimize the features to obtain the tensor R;
R=VTA(Q,K)(Ar)T+br
Where A r is the weight matrix of the linear layer E r and b r is the bias of the linear layer E r.
Further, the full connection layer is configured to output a gesture category corresponding to the gesture video frame sequence, and includes:
And classifying R to obtain vectors with the length of gesture category number, wherein each element in the vectors corresponds to one gesture meaning category, the value of each element is 0-1, the element represents the probability of the corresponding gesture category, and the gesture meaning category corresponding to the maximum value is the gesture meaning category corresponding to the dynamic gesture video frame sequence.
Further, the training method of the dynamic gesture recognition model comprises the following steps:
a. Dividing the dynamic gesture video frame sequence data into a training set and a testing set;
b. Training the dynamic gesture recognition model by taking training set data as input and recognized gesture meaning categories as output;
c. during training, using a combination of the inter-modal attention weight loss and the cross-entropy loss as the loss function to adjust the model;
d. testing the dynamic gesture recognition model with the test set data as input to obtain a test result and judging whether the test result reaches the convergence condition; if so, the pre-trained dynamic gesture recognition model is obtained, and if not, a-d are repeated until the test result reaches the convergence condition, giving the pre-trained dynamic gesture recognition model.
Further, the expression of the loss function is:
L = LCE + λ·Latt, with LCE = −Σi Yi·log(yi) and Latt = Σi=1..nΣj=1..n αij(M1ij−M2ij)²
wherein L is the loss function, LCE is the cross-entropy loss function, λ is a model hyper-parameter, Latt is the inter-modal attention weight loss function, n is the number of rows and columns, Yi is the true gesture meaning category, yi is the gesture meaning category identified by the model, Avg denotes a 4×4 average pooling layer, m1 is the attention weight output during the RGB-modality inter-frame motion attention calculation, m2 is the attention weight output during the depth-modality inter-frame motion attention calculation, α=Avg(m1)⊙Avg(m2) is the attention effective weight value, αij is the value of α at row i and column j, and M1ij=Avg(m1)ij and M2ij=Avg(m2)ij are the values of the RGB-modality and depth-modality attention weights at row i and column j.
Compared with the prior art, the invention has the beneficial effects that:
The dynamic gesture recognition method provided by the invention combines feature extraction with inter-frame motion attention to reduce the spatio-temporal search area to the region related to the hand; the adaptive downsampling over different dimensions preserves the effectiveness of the features while reducing the overall amount of computation of the model;
In the training phase, the dynamic gesture recognition model of the invention allows different modalities to share attention weights through the inter-modal attention weight loss, so that each modality uses the attention weights of the other modalities to adjust its own, thereby improving the accuracy of hand feature extraction in each modality.
Drawings
FIG. 1 is a flow chart of a method for dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weights according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
As shown in fig. 1, the method for identifying a dynamic gesture combining multi-mode inter-frame motion and sharing attention weight according to the embodiment of the present invention includes:
First, the dynamic gesture video is acquired; since the method performs isolated gesture recognition, the acquired dynamic gesture video contains only one gesture.
The dynamic gesture video is then preprocessed: the video is decoded into a sequence of frames, and each frame is cropped to 224×224 pixels to obtain the dynamic gesture video frame sequence.
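For illustration, a minimal preprocessing sketch is given below, assuming OpenCV for decoding; the function name load_gesture_clip, the fixed frame count num_frames and the center-crop strategy are assumptions of this sketch and are not prescribed by the embodiment, which only specifies decoding the video into frames and cropping them to 224×224 pixels.

```python
import cv2
import numpy as np

def load_gesture_clip(video_path: str, num_frames: int = 32) -> np.ndarray:
    """Decode a dynamic gesture video, crop every frame to 224x224 and sample num_frames frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        side = min(h, w)                                  # center-crop the shorter side
        top, left = (h - side) // 2, (w - side) // 2
        frame = frame[top:top + side, left:left + side]
        frames.append(cv2.resize(frame, (224, 224)))      # 224 x 224 pixels per frame
    cap.release()
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)   # uniform temporal sampling
    return np.stack([frames[i] for i in idx])             # (num_frames, 224, 224, 3)
```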
A dynamic gesture recognition model is constructed, comprising an embedding module, a feature extraction module, an inter-frame motion attention module, an adaptive fusion downsampling module and a fully connected layer connected in sequence.
The embedding module consists of a 3D convolution layer with a stride of (2, 4) and a position coding layer; the position coding layer adopts a learnable parameter matrix. The module maps the dynamic gesture video frame sequence to a higher-dimensional vector space through a block embedding operation and applies position coding, specifically as follows:
The 3D convolution layer performs local feature extraction on the dynamic gesture video data and expands the channel dimension at the same time, so that the video frame sequence is mapped to a high-dimensional vector space. The learnable parameter matrix converts the dimension of the video frame sequence from B×3×L×H×W to B×C×L×H×W, where B is the batch size, C is the embedding dimension, L is the number of frames, H is the frame height, and W is the frame width, giving the initial feature Xraw.
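As an illustration of this step, the sketch below models the embedding module in PyTorch. It assumes that the 3D convolution uses kernel and stride (2, 4, 4) over (frames, height, width), so the stride also sets the temporal and spatial downsampling, and that the position code is a learnable parameter added to the block embeddings; the exact kernel size and the default embed_dim value are assumptions of this sketch rather than values stated in the embodiment.

```python
import torch
import torch.nn as nn

class EmbeddingModule(nn.Module):
    """Block embedding + learnable position code; kernel/stride (2, 4, 4) is an assumption."""
    def __init__(self, embed_dim: int = 96, frames: int = 32, size: int = 224):
        super().__init__()
        # block embedding: B x 3 x L x H x W -> B x C x L/2 x H/4 x W/4
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=(2, 4, 4), stride=(2, 4, 4))
        # learnable position code matching the embedded shape
        self.pos = nn.Parameter(torch.zeros(1, embed_dim, frames // 2, size // 4, size // 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, L, H, W) dynamic gesture video frame sequence
        return self.proj(x) + self.pos          # initial feature X_raw
```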
The feature extraction module consists of a 3D sliding window layer with a stride of (2, h, w), 3 convolution layers, 1 residual connection layer and 2 linear layers connected in sequence, where h and w are the height and width of the sliding window. It extracts local patterns from the initial feature to obtain the features containing the local pattern; the processing is as follows:
The 3D sliding window layer divides the same area of two adjacent frames of the initial feature Xraw into one block and adjusts the tensor shape; the expression is as follows:
Xf,Xl=reshape(window(Xraw))
where Xf is the feature of the previous frame of all blocks, Xl is the feature of the next frame of all blocks, reshape is the operation of adjusting the tensor shape, and window is the sliding window operation.
The 3 convolution layers are, in order, a 1×1 convolution layer with 64 output channels, a 3×3 convolution layer with 64 output channels and a 1×1 convolution layer whose number of output channels equals the embedding dimension C. The first 1×1 convolution layer adjusts the channel dimension of the frame image, the 3×3 convolution layer then extracts the local pattern, and the last 1×1 convolution layer optimizes the local pattern and restores the channel dimension of the frame image.
Finally, the residual connection layer is used to alleviate the vanishing-gradient phenomenon during training; the expression is as follows:
Pf=Xf+Conv(Xf)
Pl=Xl+Conv(Xl)
wherein Pf and Pl are the features of the previous frame and the next frame containing the local pattern, respectively, and Conv represents the convolution module composed of the three convolution layers above.
To further enhance the expressive power of the features, the local patterns in Pf and Pl are refined by the two linear layers Ef and El respectively, expressed as follows:
U=Pf(Af)T+bf
G=Pl(Al)T+bl
wherein U and G are the optimized features containing the local pattern obtained by processing Pf and Pl with the linear layers Ef and El, Af and Al are the weight matrices of the linear layers Ef and El, T denotes matrix transposition, and bf and bl are the biases of the linear layers Ef and El.
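The following PyTorch sketch shows one possible reading of the feature extraction module. The 1×1/3×3/1×1 bottleneck with 64 intermediate channels, the residual connection and the two linear layers Ef and El follow the description above; applying the bottleneck per frame with 2D convolutions and deferring the (h, w) sliding-window partition to the attention stage are simplifications assumed for this sketch.

```python
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """Bottleneck convolutions + residual connection + linear layers E_f / E_l, applied per frame."""
    def __init__(self, embed_dim: int = 96):
        super().__init__()
        self.conv = nn.Sequential(                     # 1x1 -> 3x3 -> 1x1 bottleneck
            nn.Conv2d(embed_dim, 64, kernel_size=1),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.Conv2d(64, embed_dim, kernel_size=1))
        self.e_f = nn.Linear(embed_dim, embed_dim)     # E_f refines the previous frames
        self.e_l = nn.Linear(embed_dim, embed_dim)     # E_l refines the next frames

    def _residual(self, v: torch.Tensor) -> torch.Tensor:
        # v: (B, T, C, H, W) -> P = X + Conv(X), convolution applied frame by frame
        b, t, c, h, w = v.shape
        out = self.conv(v.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return v + out

    def forward(self, x_raw: torch.Tensor):
        # x_raw: (B, C, L, H, W); adjacent frames form (previous, next) pairs
        x_f = x_raw[:, :, 0::2].transpose(1, 2)        # (B, L/2, C, H, W) previous frames
        x_l = x_raw[:, :, 1::2].transpose(1, 2)        # (B, L/2, C, H, W) next frames
        p_f, p_l = self._residual(x_f), self._residual(x_l)
        u = self.e_f(p_f.permute(0, 1, 3, 4, 2))       # U: (B, L/2, H, W, C)
        g = self.e_l(p_l.permute(0, 1, 3, 4, 2))       # G: (B, L/2, H, W, C)
        return u, g, x_f, x_l
```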
The inter-frame motion attention module performs inter-frame motion attention calculation on the features containing the local pattern to obtain the features of the hand region of interest; the calculation process is as follows:
The sum of the products between a point of one of two adjacent frames in the features containing the local pattern and all points of the other frame is taken as the similarity value of that point, and this value indicates whether the point is moving. To improve efficiency, this embodiment uses matrix parallel computation to obtain the similarity matrix MS; the expression is as follows:
MS=U×GT
To prevent the high similarity of points at the same position in two adjacent frames from affecting model convergence, the main diagonal elements of the similarity matrix are set to 0 by a mask operation; the expression is as follows:
Mm=Mask(MS)
wherein M m is a similarity matrix after mask operation.
The similarity matrix is then summed along the row and column dimensions respectively to obtain the similarity vectors of the two corresponding frames; the weight vectors are obtained through the Softmax function and multiplied by a trainable parameter Scale to prevent Softmax from weakening the diversity of the features. The expression is as follows:
Atten1=Softmax(Sum(Mm,-1))*Scale
Atten2=Softmax(Sum(Mm,-2))*Scale
wherein Atten1 is the weight vector of the similarity vector Sum(Mm, -1), Atten2 is the weight vector of the similarity vector Sum(Mm, -2), and Scale is a trainable parameter.
The weight vectors Atten1 and Atten2 are repeated along the channel dimension (the number of repetitions equals the embedding dimension C) to obtain weight matrices, which are applied to the features containing the local pattern to obtain the features of the hand region of interest; the expression is as follows:
Xout=[Xf,Xl]×[Atten1,Atten2]
where Xout is the feature of the hand region of interest.
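A sketch of the inter-frame motion attention is given below. The similarity matrix, the diagonal mask, the row and column sums, the Softmax and the trainable Scale follow the description; the (N, P, C) token layout per window and the concatenation of the re-weighted previous and next frames at the end are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class InterFrameMotionAttention(nn.Module):
    """Inter-frame motion attention for a batch of windows of paired adjacent frames."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))       # trainable parameter Scale

    def forward(self, u, g, x_f, x_l):
        # u, g, x_f, x_l: (N, P, C) where N = number of windows, P = points per window
        m_s = u @ g.transpose(-1, -2)                  # similarity matrix M_S: (N, P, P)
        eye = torch.eye(m_s.size(-1), device=m_s.device, dtype=torch.bool)
        m_m = m_s.masked_fill(eye, 0.0)                # mask the main diagonal
        atten1 = torch.softmax(m_m.sum(dim=-1), dim=-1) * self.scale   # weights for the previous frame
        atten2 = torch.softmax(m_m.sum(dim=-2), dim=-1) * self.scale   # weights for the next frame
        # broadcast the weights over the channel dimension and re-weight the features
        x_out_f = x_f * atten1.unsqueeze(-1)
        x_out_l = x_l * atten2.unsqueeze(-1)
        return torch.cat([x_out_f, x_out_l], dim=1)    # X_out: features of the hand region of interest
```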
The adaptive fusion downsampling module performs an adaptive fusion downsampling operation on the features of the hand region of interest in the different modalities through an adaptive downsampling algorithm. Specifically, adaptive spatial, temporal and multi-modal downsampling operations are performed on the features in sequence, and each downsampling operation comprises two steps: sliding window division and adaptive downsampling.
The sliding window division includes:
Adaptive spatial downsampling divides the features of the hand region of interest with sliding windows of size 3×3 and stride 2;
Adaptive temporal downsampling divides the features of the hand region of interest with sliding windows of size 3×1 and stride 1;
Adaptive multi-modal downsampling groups, for each modality, time t and the frames adjacent to t in the features of the hand region of interest into one block.
Adaptive downsampling includes:
Patterns are extracted from the features by the three linear layers Eq, Ek, Ev respectively, giving three feature matrices; the expression is as follows:
Q=Xout(Aq)T+bq
K=Xout(Ak)T+bk
V=Xout(Av)T+bv
wherein Q, K, V are the feature matrices obtained by processing the input with the linear layers Eq, Ek, Ev, Aq, Ak, Av are the weight matrices of the linear layers Eq, Ek, Ev, and bq, bk, bv are the biases of the linear layers Eq, Ek, Ev.
To obtain a self-attention-based weight vector, Q and K are matrix-multiplied and scaled by a scaling factor to obtain a correlation matrix; the scaling factor prevents the gradients from being pushed into an extremely small region after the Softmax function.
Before the correlation matrix is fed into the Softmax function, an average pooling operation is applied to each of its rows, which yields the importance of the different points within the same sliding window; this is finally input into the Softmax function to obtain the weight vector. The expression is as follows:
A(Q,K)=Softmax(Avg(Q×KT/√d))
wherein A(Q,K) is the weight vector of the features, Avg(Q×KT/√d) is the pooled correlation matrix, 1/√d is the scaling factor, and d is the number of neurons of the linear layer Eq;
V is transposed and multiplied by A(Q,K), which fuses and downsamples the features according to their importance; the result is then optimized by the linear layer Er to obtain the tensor R, whose expression is as follows:
R=VTA(Q,K)(Ar)T+br
where Ar is the weight matrix of the linear layer Er and br is the bias of the linear layer Er.
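The adaptive downsampling step can be sketched as follows. The linear layers Eq, Ek, Ev and Er, the row-wise average pooling of the correlation matrix and the importance-weighted fusion of V by A(Q,K) follow the description; the 1/√d scaling and the assumption that the caller has already grouped the tokens into windows (spatial 3×3 with stride 2, temporal 3×1 with stride 1, or per-modality blocks) are choices made for this sketch.

```python
import torch
import torch.nn as nn

class AdaptiveDownsample(nn.Module):
    """One adaptive downsampling step: every window of P points is fused into a single token."""
    def __init__(self, dim: int = 96):
        super().__init__()
        self.e_q = nn.Linear(dim, dim)                 # E_q
        self.e_k = nn.Linear(dim, dim)                 # E_k
        self.e_v = nn.Linear(dim, dim)                 # E_v
        self.e_r = nn.Linear(dim, dim)                 # E_r
        self.d = dim                                   # number of neurons of E_q

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, P, C) tokens grouped by sliding window (spatial, temporal or multi-modal)
        q, k, v = self.e_q(x), self.e_k(x), self.e_v(x)
        corr = q @ k.transpose(-1, -2) / self.d ** 0.5          # scaled correlation matrix
        weights = torch.softmax(corr.mean(dim=-1), dim=-1)      # average-pool each row, then Softmax
        fused = torch.einsum('npc,np->nc', v, weights)          # importance-weighted fusion of V
        return self.e_r(fused)                                  # tensor R: one token per window
```

Each call therefore reduces every window of P tokens to one fused token, which is how the successive spatial, temporal and multi-modal downsampling steps shrink the feature while keeping its most important points.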
The fully connected layer outputs the gesture meaning category corresponding to the gesture video frame sequence. Specifically, R is classified into a vector whose length equals the number of gesture categories; each element of the vector corresponds to one gesture meaning category, takes a value between 0 and 1 and represents the probability of that category, and the gesture meaning category with the maximum value is the one corresponding to the dynamic gesture video frame sequence.
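A minimal sketch of the classification head follows: the fully connected layer maps the fused feature to a vector whose length equals the number of gesture categories, and the category with the maximum probability is returned; the class count of 27 and the pooling of R into a single vector per clip are hypothetical choices for this sketch.

```python
import torch
import torch.nn as nn

num_classes = 27                                       # hypothetical number of gesture categories
head = nn.Sequential(nn.Linear(96, num_classes),       # fully connected layer
                     nn.Softmax(dim=-1))               # per-category probabilities in [0, 1]

r = torch.randn(1, 96)                                 # tensor R, assumed pooled to one vector per clip
probs = head(r)
pred = probs.argmax(dim=-1)                            # index of the recognized gesture meaning category
```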
The dynamic gesture recognition model needs to be pre-trained after being built, and the pre-training method comprises the following steps:
a. Dividing the dynamic gesture video frame sequence data into a training set and a testing set;
b. Training the dynamic gesture recognition model by taking training set data as input and recognized gesture meaning categories as output;
c. During training, a combination of the inter-modal attention weight loss and the cross-entropy loss is used as the loss function to adjust the model; the specific process is as follows:
First, the effectiveness of the weight of each region is measured. Before the distance between modalities is computed, the weight matrix output by the inter-frame motion attention module is processed with a 4×4 average pooling layer; the averaging invariance and feature-summarizing ability of average pooling yield a local representative weight value for each modality:
M1=Avg(m1)
M2=Avg(m2)
wherein M1 and M2 are the local representative weight values of the RGB modality and the depth modality, m1 and m2 are the attention weights output during the RGB-modality and depth-modality inter-frame motion attention calculations, and Avg is a 4×4 average pooling layer. To measure the effectiveness of the attention weights of the two modalities in the same local area simultaneously, the local representative weight values of the two modalities are combined by a Hadamard product into the attention effective weight value; the expression is as follows:
α=M1⊙M2
where α is the attention effective weight value and ⊙ denotes the Hadamard product.
During training the distance between the attention weights of the different modalities is reduced to a certain extent. The square of the weight difference is used as the distance measure, and to avoid negative effects caused by erroneous weights the distance is multiplied by the attention effective weight value α; the inter-modal attention weight loss function is therefore expressed as follows:
Latt = Σi=1..nΣj=1..n αij(M1ij−M2ij)²
where n is the number of rows and columns, αij is the value of the attention effective weight α at row i and column j, and M1ij and M2ij are the values of the RGB-modality and depth-modality attention weights (after 4×4 average pooling) at row i and column j.
The cross entropy loss function is expressed as:
LCE = −Σi Yi·log(yi)
where Yi is the true label and yi is the model prediction.
Finally, the two loss functions are summed with a weighting factor to form the final loss function; the expression is as follows:
L = LCE + λ·Latt
where λ is a model hyper-parameter that controls the proportion of the inter-modal attention weight loss in the final loss value.
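The combined loss can be sketched as follows in PyTorch; the value of λ (lam), the shape assumed for the attention weight maps and the unnormalised sum in the inter-modal term are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def shared_attention_loss(logits, target, m1, m2, lam=0.1):
    """Combined loss: cross-entropy plus the inter-modal attention weight loss.
    m1 / m2 are the attention weight maps of the RGB and depth branches; the value of
    lam (lambda), the map shapes and the unnormalised sum are assumptions of this sketch."""
    ce = F.cross_entropy(logits, target)        # L_CE
    m1p = F.avg_pool2d(m1, kernel_size=4)       # M1 = Avg(m1), 4x4 average pooling
    m2p = F.avg_pool2d(m2, kernel_size=4)       # M2 = Avg(m2)
    alpha = m1p * m2p                           # attention effective weight (Hadamard product)
    att = (alpha * (m1p - m2p) ** 2).sum()      # inter-modal attention weight loss L_att
    return ce + lam * att

# usage with dummy shapes: logits (B, classes), targets (B,), weight maps (B, 1, H, W)
loss = shared_attention_loss(torch.randn(4, 27), torch.randint(0, 27, (4,)),
                             torch.rand(4, 1, 16, 16), torch.rand(4, 1, 16, 16))
```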
d. The dynamic gesture recognition model is tested with the test set data as input to obtain a test result, and whether the test result reaches the convergence condition is judged. If so, the pre-trained dynamic gesture recognition model is obtained; if not, steps a-d are repeated until the test result reaches the convergence condition, giving the pre-trained dynamic gesture recognition model.
The obtained dynamic gesture frame sequence is input into the pre-trained dynamic gesture recognition model, which outputs a vector whose length equals the number of gesture categories; each element corresponds to one gesture meaning category, takes a value between 0 and 1 and represents the probability of that category, and the gesture meaning category with the maximum value is the one corresponding to the dynamic gesture video frame sequence.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A method of dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weights, comprising:
Acquiring a dynamic gesture video;
Preprocessing the dynamic gesture video to obtain a dynamic gesture video frame sequence;
According to the dynamic gesture video frame sequence, based on a pre-trained dynamic gesture recognition model, recognizing the dynamic gesture to obtain a dynamic gesture meaning category;
the dynamic gesture recognition model comprises an embedding module, a feature extraction module, an inter-frame motion attention module, a self-adaptive fusion downsampling module and a full-connection layer which are connected in sequence;
the embedding module is used for mapping the dynamic gesture video frame sequence to a high-dimensional vector space by utilizing block embedding operation, and carrying out position coding on the high-dimensional vector space to obtain initial characteristics;
the feature extraction module is used for extracting the local mode of the initial feature to obtain the feature containing the local mode;
The inter-frame motion attention module is used for carrying out inter-frame motion attention calculation on the features containing the local pattern to obtain the features of the hand region of interest;
The self-adaptive fusion downsampling module is used for carrying out self-adaptive fusion downsampling operation on the characteristics of the hand region of interest;
The full connection layer is used for outputting gesture meaning categories corresponding to the gesture video frame sequences.
2. The method for dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weight of claim 1, wherein preprocessing the dynamic gesture video to obtain a sequence of dynamic gesture video frames comprises: processing the dynamic gesture video into multi-frame images;
and cutting the multi-frame image into a size of 224 multiplied by 224 pixels to obtain a dynamic gesture video frame sequence.
3. The method according to claim 1, wherein the embedding module is composed of a 3D convolution layer with step size (2, 4) and a position coding layer, the 3D convolution layer is used for extracting local features of the dynamic gesture video data and expanding channel dimensions so that the video frame sequence is mapped to a high-dimensional vector space, the position coding layer adopts a learnable parameter matrix for converting dimensions of the video frame sequence from b×3×l×h×w to b×c×l×h×w, wherein B is a batch number, C is an embedding dimension, L is a frame number, H is a frame height, and W is a frame width, and initial features X raw are obtained.
4. The method for dynamic gesture recognition combining multi-mode inter-frame motion and shared attention weight according to claim 1, wherein the feature extraction module consists of a 3D sliding window layer with step sizes of (2, h, w), 3 convolution layers, 1 residual connecting layer connected end to end and 2 linear layers which are connected in sequence, wherein h and w are the height and width of the sliding window;
The 3D sliding window layer divides the same area of two adjacent frames of the initial feature Xraw into one block and adjusts the tensor shape; the expression is as follows:
Xf,Xl=reshape(window(Xraw))
wherein, X f is the feature of the previous frame of all blocks, X l is the feature of the next frame of all blocks, reshape is the operation of adjusting the tensor shape, and window is the sliding window operation;
The 3 convolution layers sequentially comprise a convolution layer with the size of 1 multiplied by 1, the number of output channels of 64, a convolution layer with the size of 3 multiplied by 3, the number of output channels of 64 and a convolution layer with the size of 1 multiplied by 1, the number of output channels of embedded dimension C, wherein the convolution layer with the size of 1 multiplied by 1, the number of output channels of 64 is used for adjusting the dimension of a frame image channel, the size of 3 multiplied by 3, the number of output channels of 64 is used for extracting a local mode, the size of 1 multiplied by 1, the number of output channels of the convolution layer with the embedded dimension C is used for optimizing the local mode and adjusting the dimension of the frame image channel back;
The residual connecting layer is used for relieving the gradient disappearance phenomenon generated in the training process, and the expression is as follows:
Pf=Xf+Conv(Xf)
Pl=Xl+Conv(Xl)
wherein Pf and Pl are the features of the previous frame and the next frame containing the local pattern, respectively, and Conv represents the convolution module composed of the three convolution layers above;
The 2 linear layers E f、El are used to optimize the local patterns in P f、Pl, respectively, and their expressions are as follows:
U=Pf(Af)T+bf
G=Pl(Al)T+bl
wherein U and G are the optimized features containing the local pattern obtained by processing Pf and Pl with the linear layers Ef and El, Af and Al are the weight matrices of the linear layers Ef and El, T denotes matrix transposition, and bf and bl are the biases of the linear layers Ef and El.
5. The method of claim 1, wherein the inter-frame motion attention module is configured to perform inter-frame motion attention computation on an initial feature comprising a local pattern to obtain a feature of a hand region of interest, and wherein the method comprises:
Taking the sum of products of a certain point of two adjacent frames in the characteristic containing the local mode and all points in the other frame as a similarity value of the point, and adopting matrix parallel calculation to obtain a similarity matrix M S, wherein the expression is as follows:
MS=U×GT
the similarity matrix main diagonal element is set to 0 through mask operation, and the expression is as follows:
Mm=Mask(MS)
Wherein M m is a similarity matrix after mask operation;
Summing the similarity matrix along the row and column dimensions respectively to obtain two similarity vectors corresponding to the two adjacent frames, namely Sum(Mm, -1) and Sum(Mm, -2);
The similarity vector is processed through a Softmax function to obtain a weight vector, and the expression is as follows:
Atten1=Softmax(Sum(Mm,-1))*Scale
Atten2=Softmax(Sum(Mm,-2))*Scale
Wherein Atten 1 is the weight vector of the similarity vector Sum (M m, -1), atten 2 is the weight vector of the similarity vector Sum (M m, -2), scale is a trainable parameter;
repeating the two weight vectors in the channel dimension for a plurality of times to obtain a weight matrix, wherein the repetition number is equal to the embedding dimension C, and applying the weight matrix to the characteristics comprising the local mode to obtain the characteristics of the hand region of interest, and the expression is as follows:
Xout=[Xf,Xl]×[Atten1,Atten2]
where X out is a feature of the hand region of interest.
6. The method for dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weights of claim 1, wherein the adaptive fusion downsampling module is configured to perform an adaptive fusion downsampling operation on features of a hand region of interest, comprising:
And sequentially performing adaptive space, time and multi-mode downsampling operations on the features, wherein each downsampling operation comprises two steps of sliding window division and adaptive downsampling.
7. The method of claim 6, wherein the sliding window partitioning comprises:
the self-adaptive space downsampling divides the characteristics of the hand region of interest into sliding windows with the size of 3 multiplied by 3 and the step length of 2;
the self-adaptive time downsampling is carried out on the characteristics of the hand region concerned by cutting a sliding window with the size of 3 multiplied by 1 and the step length of 1;
The self-adaptive multi-mode downsampling divides each mode t moment and frames adjacent to the t moment in the characteristics of the concerned hand area into a block;
the adaptive downsampling includes:
The method comprises the steps of respectively extracting the modes of the features through three linear layers E q、Ek、Ev to obtain three feature matrixes, wherein the expression is as follows:
Q=Xout(Aq)T+bq
K=Xout(Ak)T+bk
V=Xout(Av)T+bv
Wherein Q, K, V is a feature matrix obtained by processing input by the linear layer E q、Ek、Ev, A q、Ak、Av is a weight matrix of the linear layer E q、Ek、Ev, and b q、bk、bv is a bias of the linear layer E q、Ek、Ev;
Matrix multiplication is carried out on Q, K, and a scaling factor is multiplied to obtain a correlation matrix;
carrying out average pooling operation on each row of the correlation matrix, and inputting the average pooling operation into a Softmax function to obtain a weight vector, wherein the expression is as follows:
A(Q,K)=Softmax(Avg(Q×KT/√d))
wherein A(Q,K) is the weight vector of the features, Avg(Q×KT/√d) is the pooled correlation matrix, 1/√d is the scaling factor, and d is the number of neurons of the linear layer Eq;
V is multiplied by A (Q, K) after matrix transposition, and a linear layer E r is used for optimizing the characteristics to obtain tensor R;
R=VTA(Q,K)(Ar)T+br
Where A r is the weight matrix of the linear layer E r and b r is the bias of the linear layer E r.
8. The method for dynamic gesture recognition combining multi-modal inter-frame motion and shared attention weight of claim 1, wherein the fully connected layer is configured to output a gesture category corresponding to a gesture video frame sequence, and the method comprises:
And classifying R to obtain vectors with the length of gesture category number, wherein each element in the vectors corresponds to one gesture meaning category, the value of each element is 0-1, the element represents the probability of the corresponding gesture category, and the gesture meaning category corresponding to the maximum value is the gesture meaning category corresponding to the dynamic gesture video frame sequence.
9. The method of claim 1, wherein the training method of the dynamic gesture recognition model comprises:
a. Dividing the dynamic gesture video frame sequence data into a training set and a testing set;
b. Training the dynamic gesture recognition model by taking training set data as input and recognized gesture meaning categories as output;
c. during training, using a combination of the inter-modal attention weight loss and the cross-entropy loss as the loss function to adjust the model;
d. testing the dynamic gesture recognition model with the test set data as input to obtain a test result and judging whether the test result reaches the convergence condition; if so, the pre-trained dynamic gesture recognition model is obtained, and if not, a-d are repeated until the test result reaches the convergence condition, giving the pre-trained dynamic gesture recognition model.
10. The method of claim 9, wherein the expression of the loss function is:
L = LCE + λ·Latt, with LCE = −Σi Yi·log(yi) and Latt = Σi=1..nΣj=1..n αij(M1ij−M2ij)²
wherein L is the loss function, LCE is the cross-entropy loss function, λ is a model hyper-parameter, Latt is the inter-modal attention weight loss function, n is the number of rows and columns, Yi is the true gesture meaning category, yi is the gesture meaning category identified by the model, Avg denotes a 4×4 average pooling layer, m1 is the attention weight output during the RGB-modality inter-frame motion attention calculation, m2 is the attention weight output during the depth-modality inter-frame motion attention calculation, α=Avg(m1)⊙Avg(m2) is the attention effective weight value, αij is the value of α at row i and column j, and M1ij=Avg(m1)ij and M2ij=Avg(m2)ij are the values of the RGB-modality and depth-modality attention weights at row i and column j.
Application CN202410253805.0A, priority date 2024-03-06, filing date 2024-03-06: Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight. Status: Pending. Published as CN118072395A.

Priority Applications (1)

Application Number: CN202410253805.0A (published as CN118072395A)
Priority Date: 2024-03-06
Filing Date: 2024-03-06
Title: Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Applications Claiming Priority (1)

Application Number: CN202410253805.0A (published as CN118072395A)
Priority Date: 2024-03-06
Filing Date: 2024-03-06
Title: Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Publications (1)

Publication Number: CN118072395A
Publication Date: 2024-05-24

Family

ID=91107211

Family Applications (1)

Application Number: CN202410253805.0A (CN118072395A)
Priority Date: 2024-03-06
Filing Date: 2024-03-06
Title: Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Country Status (1)

Country Link
CN (1) CN118072395A (en)

Similar Documents

Publication Publication Date Title
CN108491835B (en) Two-channel convolutional neural network for facial expression recognition
CN111639692B (en) Shadow detection method based on attention mechanism
US11967175B2 (en) Facial expression recognition method and system combined with attention mechanism
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN113496217B (en) Method for identifying human face micro expression in video image sequence
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
CN111652903B (en) Pedestrian target tracking method based on convolution association network in automatic driving scene
CN109886225A (en) A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN112070768B (en) Anchor-Free based real-time instance segmentation method
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN112818764A (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN114360067A (en) Dynamic gesture recognition method based on deep learning
CN116403152A (en) Crowd density estimation method based on spatial context learning network
CN115953736A (en) Crowd density estimation method based on video monitoring and deep neural network
CN114724185A (en) Light-weight multi-person posture tracking method
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN114913342A (en) Motion blurred image line segment detection method and system fusing event and image
CN111242003A (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN117975565A (en) Action recognition system and method based on space-time diffusion and parallel convertors
CN114898464B (en) Lightweight accurate finger language intelligent algorithm identification method based on machine vision
CN112528077A (en) Video face retrieval method and system based on video embedding
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination