CN112507898B - Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN - Google Patents

Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN

Info

Publication number
CN112507898B
CN112507898B (application CN202011467797.8A, publication CN112507898A)
Authority
CN
China
Prior art keywords
network
convolution
sequence
lightweight
rgb
Prior art date
Legal status
Active
Application number
CN202011467797.8A
Other languages
Chinese (zh)
Other versions
CN112507898A (en)
Inventor
唐贤伦
闫振甫
李洁
彭德光
彭江平
郝博慧
朱楚洪
李鹏华
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011467797.8A
Publication of CN112507898A
Application granted
Publication of CN112507898B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN. First, the original videos in the data set are sampled, and the resulting frames are sorted and stored in temporal order. Next, the lightweight 3D residual network is pre-trained on a large public gesture recognition data set and the model weight file is saved. Then, with the RGB-D image sequence as input and the lightweight 3D residual network and the time convolution network as base models, long- and short-term spatio-temporal features are extracted, and the information from the multiple modalities is fused by weighting with an attention mechanism; the RGB and Depth sequences are respectively fed into two branches with the same network structure. Finally, classification is performed with a fully connected layer, the loss value is computed with a cross-entropy loss function, and accuracy and F1 score are used as evaluation indexes of the network model. The invention achieves high classification accuracy with a low parameter count.

Description

Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
Technical Field
The invention belongs to the technical field of video spatio-temporal feature extraction and classification methods, and particularly relates to a lightweight heterogeneous structure for dynamic gesture spatio-temporal feature extraction that reduces model parameters while maintaining model performance.
Background
Gestures are a common form of human communication, and gesture recognition enables human-computer interaction in a natural way. Gesture recognition aims to understand human actions by extracting features from images or videos and then classifying or recognizing each sample as a specific label. Traditional gesture recognition is mainly based on hand-crafted features; although this approach can achieve good recognition results, it depends on the experience of researchers to design the features, and hand-crafted features adapt poorly to dynamic gestures.
With the development of deep learning, end-to-end gesture recognition has become increasingly feasible, and more and more researchers are attempting gesture recognition based on deep learning models. The two-stream network was a pioneering attempt in dynamic gesture recognition research. The two-stream model was first proposed to address the fact that conventional convolutional neural networks (CNNs) cannot handle temporal information well in action recognition; its main idea is to use two independent CNNs to extract spatial features from images and temporal information from optical flow data, respectively. However, optical flow is computed from continuous video input, which requires a large amount of computation and greatly reduces the overall speed of the two-stream model. 3D CNNs can directly learn spatio-temporal features and have achieved breakthroughs in various computer-vision analysis tasks. 3D CNNs mainly introduce a time dimension on top of the 2D convolution kernel, so that spatial and temporal features can be extracted simultaneously. Based on 3D CNNs, researchers have proposed a number of deep network models with outstanding performance, such as 3D-ResNet, I3D, and S3D. However, 3D convolution has far more parameters than 2D convolution, and training such models often takes a long time. Moreover, each 3D convolution typically only processes a small time window rather than the entire video. Therefore, 3D CNNs cannot efficiently encode long-term spatio-temporal information in dynamic gesture video, which hinders their use in video tasks.
Recurrent neural networks (RNNs) and their variant Long Short-Term Memory (LSTM) are deep learning models that take sequence data as input and are commonly used to encode the long-term spatio-temporal features of dynamic gestures. LSTM integrates information over time by learning how to store, modify and access internal states through memory cells, which enables it to discover the long- and short-term temporal relationships of videos. However, because the memory cell uses full connections in the input-to-state and state-to-state transitions, no spatial correlation information is encoded. Unlike traditional LSTM, Convolutional Long Short-Term Memory (ConvLSTM) explicitly assumes that the input is a sequence of images and replaces the vector multiplications in the LSTM gates with convolution operations, so that the intermediate representation of the image retains spatially relevant information during recursion. Among dynamic gesture recognition methods, cascading 3D CNNs with ConvLSTM is currently the most widely used approach. However, it requires more memory and more computation when training the model.
Therefore, a lightweight deep network model that can still guarantee model performance is needed. Separated convolution can greatly reduce the parameter count of 3D convolution while preserving model performance. The temporal convolutional network (TCN) is a newer type of model for sequence prediction with relatively low computational complexity. Combining a lightweight 3D residual network with a TCN is expected to address the generally high complexity of existing methods, while the weighted fusion of multi-modal features can further improve classification accuracy.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN is provided that balances model performance against model parameter count. The technical scheme of the invention is as follows:
A multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN comprises the following steps:
step 1: sampling each gesture video in the original data set according to its frame rate to generate a number of pictures corresponding to the video frame rate, sorting and storing the pictures in temporal order, and unifying the picture sequences generated by sampling in the time dimension by using a window sliding method;
step 2: using the unified picture sequence from step 1 as input, pre-training the lightweight 3D residual network, and saving the lightweight 3D residual network model weight file in h5 format, wherein the weight file stores the structure of the model, the weights of the model, the training configuration and the state of the optimizer, so that training can resume from where it was last interrupted;
step 3: loading the weight file from step 2, taking the training set and verification set of the RGB-D picture sequences as input, and learning the short-term spatio-temporal features of the gestures in the video with the lightweight 3D residual network;
step 4: inputting the feature maps output in step 3 into a time convolution network, and encoding the long-term spatio-temporal features of the dynamic gesture with the time convolution network;
step 5: weighting and fusing the spatio-temporal feature information of the RGB and Depth branch networks by using an attention mechanism;
step 6: classifying the feature vectors output in step 5 with a fully connected layer, and mapping the classification result to a probability value of the gesture class through Softmax.
Further, the step 1, sampling the original data set videos, sorting and storing them in temporal order, and unifying the picture sequences generated by sampling in the time dimension using a window sliding method, specifically comprises the following steps:
each gesture video in the data set is sampled according to its frame rate to generate a corresponding number of pictures, and the picture sequences are sorted and stored in temporal order. To ensure that the input data have the same dimensionality, a window sliding method is used to set the input reference frame number for each gesture video; the reference frame number is set to 32. For videos above 32 frames, irrelevant images at both ends are deleted and the key frames in the middle are retained; for videos of less than 32 frames, some frames are repeated in a certain proportion, and this process is executed in a loop until the sample exceeds 32 frames. Finally, each frame is randomly cropped to 224 × 224 and resized to 112 × 112 pixels.
Further, the step 2, using the unified picture sequence in step 1 as input, pre-training the lightweight 3D residual network, and saving the lightweight 3D residual network model weight file in h5 form, specifically comprises the following steps:
adopting the idea of transfer learning, the lightweight 3D residual network is pre-trained with the Jester public data set; the pre-training process is divided into a feature extraction part and a feature classification part, where the feature extraction part is the lightweight 3D residual network and the feature classification part is a fully connected layer and a Softmax layer, and during pre-training the weights of the model are saved in h5 form.
Further, the step 3, loading the weight file in step 2, taking a training set and a verification set of an RGB-D picture sequence as input, and learning the short-term spatio-temporal features of gestures in the video by using the lightweight 3D residual network, specifically comprises the following steps:
dividing the data set into a training set, a verification set and a test set in a 3:1:1 ratio, where the training set is used to train the network model, the verification set is used to verify the performance of the model during training (parameters are tuned according to the model's performance on the verification set), and the test set is used to evaluate the generalization ability of the model; the training sets and verification sets of the RGB and Depth picture sequences preprocessed in step 1 are respectively input into two identical lightweight 3D residual networks for short-term spatio-temporal feature extraction, where the lightweight 3D residual network replaces the original 3 × 3 × 3 3D convolution kernels with separated convolutions on the basis of 3D-ResNet, and the separated convolution splits each 3D convolution kernel into a 1 × 3 × 3 convolution kernel and a 3 × 1 × 1 convolution kernel.
Further, the step 4, inputting the feature maps output in step 3 into a time convolution network and encoding the long-term spatio-temporal features of the dynamic gesture by using the time convolution network, specifically comprises the following steps:
encoding the feature maps output in step 3 by using a time convolution network to capture the relevant information between video frames in the dynamic gesture, wherein the time convolution network uses causal convolution and maps the input sequence to an output sequence of the same length, and in addition dilated convolution and residual connections are used to train a deeper network;
assuming the input sequence of the time convolution network is X = [x_1, ..., x_T] and the output is S = [s_1, ..., s_T], the output s_t at time t depends only on [x_1, ..., x_t] with t ≤ T, because the dilated convolution is calculated as follows:
s_t = (X *_d h)(t) = Σ_m h_m · x_{t − d·m}
where *_d is the operator of the dilated convolution, d is the dilation coefficient, h is the impulse response of the filter, m indexes the positions of the convolution kernel, and h_m represents the impulse response of the filter at position m. For a TCN with L layers, the output of the last layer s_L is used for sequence classification. The class label ŷ of the sequence is assigned by a fully connected layer with a Softmax activation function:
ŷ = Softmax(W · s_L + b_o)
where b_o represents a bias term.
Further, the step 5, weighted fusion of the spatio-temporal feature information of the RGB and Depth branch networks using an attention mechanism, specifically comprises the following steps:
the Depth image contains motion information and three-dimensional structure information from the depth channel and is insensitive to illumination changes, clothing, skin color and other external factors; the fusion of RGB data and Depth data is therefore used to represent the characteristics of the gesture accurately. A non-linear combination scheme is provided by introducing an attention mechanism, which enables the network to dynamically select the corresponding information throughout the feature extraction process and realizes the strategy of weighted fusion of RGB and Depth data. Assuming the feature-map sequence output by the RGB branch is S_rgb, the feature-map sequence output by the Depth branch is S_depth, and the fused feature-map sequence is z, the weighted summation of the two branches is:
z = α_rgb · S_rgb + α_depth · S_depth
where α = [α_rgb, α_depth] is the fusion coefficient, calculated as follows:
α = F_fc(δ(β(Conv(AvgPool(S_rgb + S_depth))))) = W_1 · δ(β(W_0 · AvgPool(S_rgb + S_depth)))
where AvgPool denotes average pooling, F_fc and Conv denote the fully connected layer and the convolution layer respectively, S_rgb and S_depth denote the feature-map sequences output by the RGB branch and the Depth branch respectively, W_0 and W_1 denote the 1 × 1 × 1 convolution weights of Conv and the weights of the fully connected layer F_fc respectively, β denotes batch normalization, and δ denotes the ReLU activation function.
Further, the step 6, classifying the feature vectors output in step 5 by using a fully connected layer and mapping the classification result into a probability value of the gesture class through Softmax, is specifically as follows:
the fused information z = [z_1, ..., z_T] weighted in step 5 is input into a fully connected layer, which multiplies the weight matrix by the input vector and adds a bias, outputting n scores that take values in (−∞, +∞); Softmax then maps the n scores into a probability y in (0, 1), and the calculation formula is as follows:
y = Softmax(z) = Softmax(W^T z + b) (6)
where W represents the weights and b represents the bias term, and Softmax is calculated as follows:
y_i = e^{z_i} / Σ_{j=1}^{n} e^{z_j}
where z_i denotes the i-th of the n scores. The fully connected layer classifies the gesture, obtaining a score for each category, and softmax then maps the scores to the interval (0, 1), i.e. generates the probability value of the gesture category.
The invention has the following advantages and beneficial effects:
the invention provides a multi-modal dynamic gesture recognition depth network model based on a lightweight 3D residual network and TCN (traffic control network). in the structure, a 3D convolution kernel is optimized by utilizing the idea of separating convolution, long-term space-time characteristics are coded by the TCN, and multi-modal information is weighted and fused by adopting an attention mechanism. Compared with the existing method, on one hand, the idea of separating convolution and the joint use of TCN can greatly reduce the complexity of the model and improve the speed of model recognition, thereby being expected to realize real-time gesture recognition. The idea of the separate convolution is to split the three-dimensional convolution kernel into a one-dimensional convolution kernel and a two-dimensional convolution kernel, and connect the two convolutions in a serial manner. The method is equivalent to extracting the time dimension characteristics first and then extracting the space dimension characteristics, and meets the requirement of learning the dynamic gesture spatiotemporal characteristics. Moreover, the sum of the parameter quantities of the one-dimensional convolution kernel and the two-dimensional convolution kernel is far smaller than the parameter quantity of the three-dimensional convolution kernel; in addition, TCN is simpler in model and consumes less memory than LSTM. On the other hand, most of the existing methods directly fuse multi-modal information in a linear mode, which causes redundancy of the information to a certain extent, and the model has poor expandability. The information of the multiple modes is weighted and fused by using an attention mechanism, so that the redundancy of the information can be reduced, the weight of the long and short space-time characteristics can be automatically adjusted by the model according to the stimulation of the neurons, the performance of the model is improved, and the model has self-adaptability so as to be applied to other types of video learning tasks in an expanded mode.
Drawings
Fig. 1 is a flow chart of a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN according to a preferred embodiment of the present invention.
Fig. 2 is an architecture diagram of a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN.
Fig. 3 is a comparison graph of the convolution kernel of the 3D residual network and the convolution kernel of the lightweight 3D residual network.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, the multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN provided by this embodiment includes the following steps:
step 1: sampling each gesture video according to the frame rate of the gesture video in the original data set, generating pictures with the number corresponding to the frame rate of the video, and sorting and storing the pictures according to the time sequence. In order to ensure that input data have the same dimension, a window sliding method is used to set an input reference frame number for each gesture video. The value is set to 32, and for videos above 32 frames, irrelevant images at both ends are deleted, and the key frame in the middle is reserved. For video less than 32 frames, we repeat some frames at a certain rate. The process will loop until the samples exceed 32 frames. Finally, we randomly crop each frame to 224 × 224 and resize it to 112 × 112 pixels.
Step 2: Because 3D convolutions contain a large number of parameters, the model converges slowly during training and is prone to overfitting. To solve these problems, the invention adopts the idea of transfer learning and uses the Jester data set to pre-train the lightweight 3D residual network. The pre-training process is divided into two parts, feature extraction and feature classification, where the feature extraction part is the lightweight 3D residual network and the feature classification part is a fully connected layer and a Softmax layer. During pre-training, the weights of the model are saved in h5 format.
Step 3: The data set is divided into three parts, namely a training set, a verification set and a test set, in a 3:1:1 ratio. The training set is mainly used to train the network model; the verification set is mainly used to check the performance of the model during training, and parameters are generally tuned according to the model's performance on the verification set; the test set is used to evaluate the generalization ability of the model. In the invention, the training sets and verification sets of the RGB and Depth picture sequences preprocessed in step 1 are taken as input and respectively fed into two identical lightweight 3D residual networks for short-term spatio-temporal feature extraction. The lightweight 3D residual network takes 3D-ResNet as its basis and replaces the original 3 × 3 × 3 convolution kernels with separated convolutions. The separated convolution splits each 3D convolution kernel into a 1 × 3 × 3 kernel and a 3 × 1 × 1 kernel, which preserves the performance of the 3D convolution while reducing its parameter count.
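A minimal sketch of the separated convolution and a residual block built from it is given below, assuming PyTorch; the temporal-first ordering (3 × 1 × 1 followed by 1 × 3 × 3) follows the description above, while the layer names, the single BatchNorm/ReLU after the pair and the channel handling are illustrative assumptions rather than the exact architecture of the lightweight 3D residual network.

import torch
import torch.nn as nn

class SeparatedConv3d(nn.Module):
    # Replaces a dense 3 x 3 x 3 convolution with a temporal 3 x 1 x 1 convolution
    # followed in series by a spatial 1 x 3 x 3 convolution.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.temporal = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 1, 1),
                                  stride=(stride, 1, 1), padding=(1, 0, 0), bias=False)
        self.spatial = nn.Conv3d(out_ch, out_ch, kernel_size=(1, 3, 3),
                                 stride=(1, stride, stride), padding=(0, 1, 1), bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (N, C, T, H, W)
        return self.relu(self.bn(self.spatial(self.temporal(x))))

class LightweightResidualBlock(nn.Module):
    # Basic residual block using the separated convolution twice.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = SeparatedConv3d(channels, channels)
        self.conv2 = SeparatedConv3d(channels, channels)

    def forward(self, x):
        return torch.relu(x + self.conv2(self.conv1(x)))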
Step 4: Dynamic gesture recognition identifies a specific gesture class from a series of gesture actions, so features in the time dimension are very important. In the invention, a time convolution network is used to encode the feature-map sequence output in step 3 so as to capture the relevant information among the video frames of the dynamic gesture. The time convolution network is a newer type of model for sequence prediction; its main feature is the use of causal convolution, mapping the input sequence to an output sequence of the same length. In addition, to take inputs far in the past into account, the model uses dilated convolution and residual connections, which enlarge the receptive field and allow a deeper network to be trained.
Assume that the input sequence of the time convolution network is X = [x_1, ..., x_T] and the output is S = [s_1, ..., s_T]. Because the convolution is causal, the output s_t at time t depends only on [x_1, ..., x_t] with t ≤ T; the dilated convolution is calculated as follows:
s_t = (X *_d h)(t) = Σ_m h_m · x_{t − d·m}
where *_d is the operator of the dilated convolution, d is the dilation coefficient, h is the impulse response of the filter, m indexes the positions of the convolution kernel, and h_m represents the impulse response of the filter at position m. For a TCN with L layers, the output of the last layer s_L is used for sequence classification. The class label ŷ of the sequence is assigned through the fully connected layer with the Softmax activation function:
ŷ = Softmax(W · s_L + b_o)
where b_o represents a bias term.
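The following is a minimal sketch of a dilated causal convolution block in the spirit of the time convolution network described above, assuming PyTorch; the kernel size, the two-convolution residual layout and the exponentially growing dilation schedule are illustrative assumptions.

import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    # 1D convolution over the time axis that only looks at past frames (causal),
    # with dilation d; left padding keeps the output length equal to the input length.
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                       # x: (N, C, T)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return self.conv(x)

class TCNBlock(nn.Module):
    # Two dilated causal convolutions with a residual connection.
    def __init__(self, channels, dilation):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, channels, dilation=dilation), nn.ReLU(),
            CausalConv1d(channels, channels, dilation=dilation), nn.ReLU(),
        )

    def forward(self, x):
        return torch.relu(x + self.net(x))

# Stacking blocks with exponentially growing dilation enlarges the receptive field.
tcn = nn.Sequential(*[TCNBlock(channels=256, dilation=2 ** i) for i in range(4)])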
Step 5: The Depth image contains motion information and three-dimensional structure information from the depth channel and is insensitive to illumination variations, clothing, skin tone and other external factors, so it is an important complement to the original RGB image. Fusing the RGB data and the Depth data can represent the characteristics of the gesture more accurately and thereby improve the accuracy of gesture recognition. Selecting an appropriate fusion strategy is also important: simple linear aggregation may not give the neurons strong adaptability and may also produce redundant information. To solve this problem, the invention introduces an attention mechanism to provide a non-linear combination scheme. This scheme allows the network to dynamically select the corresponding information throughout the feature extraction process and realizes a strategy of weighted fusion of the RGB and Depth data. Assume that the feature-map sequence output by the RGB branch is S_rgb, the feature-map sequence output by the Depth branch is S_depth, and the fused feature-map sequence is z; the weighted summation of the two branches is as follows:
z = α_rgb · S_rgb + α_depth · S_depth
where α = [α_rgb, α_depth] is the fusion coefficient, calculated as follows:
α = F_fc(δ(β(Conv(AvgPool(S_rgb + S_depth))))) = W_1 · δ(β(W_0 · AvgPool(S_rgb + S_depth)))
where AvgPool denotes average pooling, F_fc and Conv denote the fully connected layer and the convolution layer respectively, S_rgb and S_depth denote the feature-map sequences output by the RGB branch and the Depth branch respectively, W_0 and W_1 denote the 1 × 1 × 1 convolution weights of Conv and the weights of the fully connected layer F_fc respectively, β denotes batch normalization, and δ denotes the ReLU activation function.
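A sketch of the attention-based weighted fusion is shown below, assuming PyTorch; it composes the listed components (average pooling, a 1 × 1 × 1 convolution W_0, batch normalization β, ReLU δ and a fully connected layer W_1) in a squeeze-and-excitation style and normalizes the two coefficients with Softmax, which is an assumption where the text does not fix the exact composition.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                   # AvgPool
        self.conv = nn.Conv3d(channels, hidden, kernel_size=1, bias=False)  # W_0 (1x1x1)
        self.bn = nn.BatchNorm3d(hidden)                      # beta
        self.relu = nn.ReLU(inplace=True)                     # delta
        self.fc = nn.Linear(hidden, 2)                        # W_1: one coefficient per modality
        self.softmax = nn.Softmax(dim=1)

    def forward(self, s_rgb, s_depth):                        # each (N, C, T, H, W)
        g = self.pool(s_rgb + s_depth)                        # global descriptor of both branches
        g = self.relu(self.bn(self.conv(g))).flatten(1)
        alpha = self.softmax(self.fc(g))                      # [alpha_rgb, alpha_depth]
        a_rgb = alpha[:, 0].view(-1, 1, 1, 1, 1)
        a_depth = alpha[:, 1].view(-1, 1, 1, 1, 1)
        return a_rgb * s_rgb + a_depth * s_depth              # fused feature map z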
Step 6: The fused information z = [z_1, ..., z_T] from step 5 is input into the fully connected layer, which multiplies the weight matrix by the input vector and adds a bias, outputting n scores that take values in (−∞, +∞). Softmax maps the n scores to probabilities y in (0, 1). The calculation formula is as follows:
y = Softmax(z) = Softmax(W^T z + b) (13)
where W represents the weights and b represents the bias term. Softmax is calculated as follows:
y_i = e^{z_i} / Σ_{j=1}^{n} e^{z_j}
where z_i denotes the i-th of the n scores.
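A minimal sketch of this classification head, assuming PyTorch; the temporal averaging of the fused sequence, the class count and the use of cross-entropy on the raw scores (the loss mentioned in the abstract) are illustrative choices.

import torch
import torch.nn as nn

class GestureClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)            # scores = W^T z + b

    def forward(self, z):                                     # z: (N, T, feat_dim)
        return self.fc(z.mean(dim=1))                         # n raw scores per sample

model = GestureClassifier(feat_dim=256, num_classes=27)       # class count is illustrative
criterion = nn.CrossEntropyLoss()                             # applied to the raw scores
# torch.softmax(model(clip_features), dim=1) maps the scores to probabilities in (0, 1)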
the method illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (6)

1. A multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN, characterized by comprising the following steps:
step 1: sampling each gesture video in the original data set according to its frame rate, generating pictures with the number corresponding to the frame rate of the video, sorting and storing the pictures in temporal order, and unifying the picture sequence generated by sampling in the time dimension by using a window sliding method;
step 2: pre-training the lightweight 3D residual network by using the unified picture sequence in step 1 as input, and saving a lightweight 3D residual network model weight file in h5 form, wherein the weight file saves the structure of the model, the weights of the model, the training configuration and the state of the optimizer so that training can resume from where it was last interrupted;
step 3: loading the weight file in step 2, taking a training set and a verification set of an RGB-D picture sequence as input, and learning short-term spatio-temporal features of the gestures in the video by using the lightweight 3D residual network;
step 4: inputting the feature maps output in step 3 into a time convolution network, and encoding long-term spatio-temporal features of the dynamic gesture by using the time convolution network;
step 5: weighting and fusing the spatio-temporal feature information of the RGB and Depth branch networks by using an attention mechanism;
step 6: classifying the feature vectors output in step 5 by using a fully connected layer, and mapping the classification result into a probability value of the gesture class through Softmax;
the step 5, weighting and fusing the spatio-temporal feature information of the RGB and Depth branch networks by using an attention mechanism, specifically comprises the following steps:
the Depth image contains motion information and three-dimensional structure information from the depth channel and is insensitive to illumination changes, clothing, skin color and other external factors; the fusion of RGB data and Depth data is used to represent the characteristics of the gesture accurately; a non-linear combination scheme is provided by introducing an attention mechanism, which enables the network to dynamically select the corresponding information throughout the feature extraction process and realizes the strategy of weighted fusion of RGB and Depth data; assuming the feature-map sequence output by the RGB branch is S_rgb, the feature-map sequence output by the Depth branch is S_depth, and the fused feature-map sequence is z, the weighted summation of the two branches is:
z = α_rgb · S_rgb + α_depth · S_depth
wherein α = [α_rgb, α_depth] is the fusion coefficient, calculated as follows:
α = F_fc(δ(β(Conv(AvgPool(S_rgb + S_depth))))) = W_1 · δ(β(W_0 · AvgPool(S_rgb + S_depth)))
wherein AvgPool denotes average pooling, F_fc and Conv denote the fully connected layer and the convolution layer respectively, S_rgb and S_depth denote the feature-map sequences output by the RGB branch and the Depth branch respectively, W_0 and W_1 denote the 1 × 1 × 1 convolution weights of Conv and the weights of the fully connected layer F_fc respectively, β denotes batch normalization, and δ denotes the ReLU activation function.
2. The method according to claim 1, wherein the step 1, sampling the original data set videos, sorting and storing them in temporal order, and unifying the picture sequence generated by sampling in the time dimension by using a window sliding method, specifically comprises the following steps:
sampling each gesture video in the data set according to the frame rate of the video to generate a corresponding number of picture sequences, and sorting and storing the picture sequences in temporal order; to ensure that the input data have the same dimension, a window sliding method is used to set the input reference frame number of each gesture video, and the reference frame number is set to 32; for videos above 32 frames, irrelevant images at both ends are deleted and the key frames in the middle are kept; for videos of less than 32 frames, some frames are repeated in a certain proportion, and this process is executed in a loop until the sample exceeds 32 frames; finally, each frame is randomly cropped to 224 × 224 and resized to 112 × 112 pixels.
3. The method according to claim 2, wherein the step 2, using the unified picture sequence in step 1 as input, pre-training the lightweight 3D residual network, and saving the lightweight 3D residual network model weight file in h5 form, specifically comprises the steps of:
adopting the idea of transfer learning, pre-training the lightweight 3D residual network with the Jester public data set; the pre-training process is divided into two parts, feature extraction and feature classification, wherein the feature extraction part is the lightweight 3D residual network and the feature classification part is a fully connected layer and a Softmax layer, and during pre-training the weights of the model are saved in h5 form.
4. The method according to claim 3, wherein the step 3, loading the weight file in step 2, taking a training set and a verification set of an RGB-D picture sequence as input, and learning short-term spatio-temporal features of gestures in the video by using the lightweight 3D residual network, specifically comprises the following steps:
dividing the data set into a training set, a verification set and a test set in a 3:1:1 ratio, wherein the training set is used for training the network model, the verification set is used for verifying the performance of the model during training and parameters are tuned according to the model's performance on the verification set, and the test set is used for evaluating the generalization ability of the model; inputting the training sets and verification sets of the RGB and Depth picture sequences preprocessed in step 1 respectively into two identical lightweight 3D residual networks for short-term spatio-temporal feature extraction, wherein the lightweight 3D residual network replaces the original 3 × 3 × 3 3D convolution kernels with separated convolutions on the basis of 3D-ResNet, and the separated convolution splits each 3D convolution kernel into a 1 × 3 × 3 convolution kernel and a 3 × 1 × 1 convolution kernel.
5. The method according to claim 4, wherein the step 4, inputting the feature maps output in step 3 into a time convolution network and encoding the long-term spatio-temporal features of the dynamic gestures by using the time convolution network, specifically comprises the following steps:
encoding the feature maps output in step 3 by using a time convolution network to capture the relevant information between video frames in the dynamic gesture, wherein the time convolution network uses causal convolution and maps the input sequence to an output sequence of the same length, and in addition dilated convolution and residual connections are used to train a deeper network;
assuming the input sequence of the time convolution network is X = [x_1, ..., x_T] and the output is S = [s_1, ..., s_T], the output s_t at time t depends only on [x_1, ..., x_t] with t ≤ T, since the dilated convolution is calculated as follows:
s_t = (X *_d h)(t) = Σ_m h_m · x_{t − d·m}
wherein *_d is the operator of the dilated convolution, d is the dilation coefficient, h is the impulse response of the filter, m indexes the positions of the convolution kernel, and h_m represents the impulse response of the filter at position m; for a TCN with L layers, the output of the last layer s_L is used for sequence classification, and the class label ŷ of the sequence is assigned by a fully connected layer with a Softmax activation function:
ŷ = Softmax(W · s_L + b_o)
wherein b_o represents a bias term.
6. The method according to claim 5, wherein the step 6, classifying the feature vectors output in step 5 by using a fully connected layer and mapping the classification result into a probability value of the gesture class through Softmax, is specifically as follows:
inputting the fused information z = [z_1, ..., z_T] weighted in step 5 into the fully connected layer, which multiplies the weight matrix by the input vector and adds a bias, outputting n scores that take values in (−∞, +∞); Softmax maps the n scores into a probability y in (0, 1), calculated as follows:
y = Softmax(z) = Softmax(W^T z + b) (6)
wherein W represents the weight and b represents the bias term, and Softmax is calculated as follows:
y_i = e^{z_i} / Σ_{j=1}^{n} e^{z_j}
wherein z_i denotes the i-th of the n scores; the fully connected layer classifies the gesture, obtaining a score for each category, and softmax then maps the scores to the interval (0, 1), i.e. generates the probability value of the gesture category.
CN202011467797.8A 2020-12-14 2020-12-14 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN Active CN112507898B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011467797.8A CN112507898B (en) 2020-12-14 2020-12-14 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011467797.8A CN112507898B (en) 2020-12-14 2020-12-14 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN

Publications (2)

Publication Number Publication Date
CN112507898A CN112507898A (en) 2021-03-16
CN112507898B true CN112507898B (en) 2022-07-01

Family

ID=74972911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011467797.8A Active CN112507898B (en) 2020-12-14 2020-12-14 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN

Country Status (1)

Country Link
CN (1) CN112507898B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673280A (en) * 2020-05-14 2021-11-19 索尼公司 Image processing apparatus, image processing method, and computer-readable storage medium
CN113065451B (en) * 2021-03-29 2022-08-09 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113095386B (en) * 2021-03-31 2023-10-13 华南师范大学 Gesture recognition method and system based on triaxial acceleration space-time feature fusion
CN113178073A (en) * 2021-04-25 2021-07-27 南京工业大学 Traffic flow short-term prediction optimization application method based on time convolution network
CN112926557B (en) * 2021-05-11 2021-09-10 北京的卢深视科技有限公司 Method for training multi-mode face recognition model and multi-mode face recognition method
CN113239824B (en) * 2021-05-19 2024-04-05 北京工业大学 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module
CN113297955B (en) * 2021-05-21 2022-03-18 中国矿业大学 Sign language word recognition method based on multi-mode hierarchical information fusion
CN113343198B (en) * 2021-06-23 2022-12-16 华南理工大学 Video-based random gesture authentication method and system
CN113435340B (en) * 2021-06-29 2022-06-10 福州大学 Real-time gesture recognition method based on improved Resnet
CN113361655B (en) * 2021-07-12 2022-09-27 武汉智目智能技术合伙企业(有限合伙) Differential fiber classification method based on residual error network and characteristic difference fitting
CN113609923B (en) * 2021-07-13 2022-05-13 中国矿业大学 Attention-based continuous sign language sentence recognition method
CN113449682B (en) * 2021-07-15 2023-08-08 四川九洲电器集团有限责任公司 Method for identifying radio frequency fingerprints in civil aviation field based on dynamic fusion model
CN115578683B (en) * 2022-12-08 2023-04-28 中国海洋大学 Construction method of dynamic gesture recognition model and dynamic gesture recognition method
CN115862144B (en) * 2022-12-23 2023-06-23 杭州晨安科技股份有限公司 Gesture recognition method for camera
CN115953839B (en) * 2022-12-26 2024-04-12 广州紫为云科技有限公司 Real-time 2D gesture estimation method based on loop architecture and key point regression
CN117218716B (en) * 2023-08-10 2024-04-09 中国矿业大学 DVS-based automobile cabin gesture recognition system and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298361B (en) * 2019-05-22 2021-05-04 杭州未名信科科技有限公司 Semantic segmentation method and system for RGB-D image
CN111985369B (en) * 2020-08-07 2021-09-17 西北工业大学 Course field multi-modal document classification method based on cross-modal attention convolution neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism

Also Published As

Publication number Publication date
CN112507898A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112507898B (en) Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN109543714B (en) Data feature acquisition method and device, electronic equipment and storage medium
CN112396002A (en) Lightweight remote sensing target detection method based on SE-YOLOv3
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
Mungra et al. PRATIT: a CNN-based emotion recognition system using histogram equalization and data augmentation
WO2021057056A1 (en) Neural architecture search method, image processing method and device, and storage medium
CN113255443B (en) Graph annotation meaning network time sequence action positioning method based on pyramid structure
CN111639544A (en) Expression recognition method based on multi-branch cross-connection convolutional neural network
US11908457B2 (en) Orthogonally constrained multi-head attention for speech tasks
CN109885709A (en) A kind of image search method, device and storage medium based on from the pre- dimensionality reduction of coding
Sharma et al. Deep eigen space based ASL recognition system
CN112307982A (en) Human behavior recognition method based on staggered attention-enhancing network
US20220101539A1 (en) Sparse optical flow estimation
Li et al. Robustness comparison between the capsule network and the convolutional network for facial expression recognition
CN113158815A (en) Unsupervised pedestrian re-identification method, system and computer readable medium
Wang et al. A pseudoinverse incremental algorithm for fast training deep neural networks with application to spectra pattern recognition
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
Gkalelis et al. Objectgraphs: Using objects and a graph convolutional network for the bottom-up recognition and explanation of events in video
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
CN112016592B (en) Domain adaptive semantic segmentation method and device based on cross domain category perception
CN111079900B (en) Image processing method and device based on self-adaptive connection neural network
CN110347853B (en) Image hash code generation method based on recurrent neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant