CN112507898B - Multi-modal dynamic gesture recognition method based on lightweight 3D residual network and TCN - Google Patents
- Publication number
- CN112507898B (application CN202011467797.8A)
- Authority
- CN
- China
- Prior art keywords
- network
- convolution
- sequence
- lightweight
- rgb
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN (temporal convolutional network). First, the original videos in the data set are sampled, and the resulting frames are sorted and stored in temporal order. Next, the lightweight 3D residual network is pre-trained on a large public gesture recognition data set, and the model's weight file is saved. Then, taking the RGB-D image sequence as input and the lightweight 3D residual network plus the temporal convolutional network as base models, long- and short-term spatiotemporal features are extracted, and the information of the two modalities is fused by weighting with an attention mechanism; the RGB and Depth sequences are each fed into an identical network structure. Finally, a fully connected layer performs classification, a cross-entropy loss function computes the loss value, and accuracy and F1 score serve as evaluation metrics for the network model. The invention achieves high classification accuracy with a low parameter count.
Description
Technical Field
The invention belongs to the technical field of video spatiotemporal feature extraction and classification, and particularly relates to a lightweight heterogeneous structure for dynamic-gesture spatiotemporal feature extraction that reduces model parameters while preserving model performance.
Background
Gestures are a common form of human communication, and gesture recognition enables human-computer interaction in a natural way. Gesture recognition aims to understand human actions by extracting features from images or videos and then classifying or recognizing each sample as a specific label. Traditional gesture recognition relies mainly on hand-crafted features; although this approach can achieve good recognition results, it depends on the experience of researchers to design the features, and hand-crafted features adapt poorly to dynamic gestures.
With the development of deep learning, end-to-end gesture recognition has become increasingly feasible, and more and more researchers attempt gesture recognition with deep learning models. The two-stream network was a pioneering attempt in dynamic gesture recognition research. It was first proposed to address the inability of conventional convolutional neural networks (CNNs) to handle temporal information in motion recognition; its main idea is to use two independent CNNs to extract spatial features from images and temporal information from optical-flow data, respectively. However, optical flow is computed from continuous video input, which requires a large amount of computation and greatly reduces the overall speed of the two-stream model. 3D CNNs can learn spatiotemporal features directly and have achieved breakthroughs in various computer-vision analysis tasks. They introduce a time dimension into the 2D convolution kernel, so spatial and temporal features can be extracted simultaneously. Based on 3D CNNs, researchers have proposed a number of deep network models with outstanding performance, such as 3D-ResNet, I3D, and S3D. However, compared with 2D convolution, 3D convolution has a very large number of parameters, and training such a model often takes a long time. Moreover, each 3D convolution typically processes only a small time window rather than the entire video. Therefore, 3D CNNs cannot efficiently encode long-term spatiotemporal information in dynamic gesture video, which hinders their use in video tasks.
Recurrent neural networks (RNNs) and their variant, long short-term memory (LSTM), are deep learning models that take sequence data as input and are commonly used to encode the long-term spatiotemporal features of dynamic gestures. LSTM integrates information over time by learning how to store, modify, and access an internal state through its memory cells, which lets it discover both long-term and short-term temporal relationships in video. However, because the memory cells use full connections in the input-to-state and state-to-state transitions, no spatial correlation information is encoded. Unlike traditional LSTM, convolutional LSTM (ConvLSTM) explicitly assumes that the input is a sequence of images and replaces the vector multiplications in the LSTM gates with convolution operations, so the intermediate image representation retains spatially relevant information during recursion. Cascading 3D CNNs with ConvLSTM is currently the most widely used approach for dynamic gesture recognition tasks. However, this method requires more memory and more computation when the model is trained.
Therefore, a lightweight deep network model that still guarantees model performance is needed. Separable convolution can greatly reduce the parameter count of 3D convolution while preserving model performance. The temporal convolutional network (TCN) is a recent architecture for time-series prediction with relatively low computational complexity. Combining a lightweight 3D residual network with a TCN is expected to address the generally high complexity of existing methods, while weighted fusion of multi-modal features can further improve classification accuracy.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. It provides a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN that balances model performance against model parameter count. The technical scheme of the invention is as follows:
A multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN comprises the following steps:
Step 1: sample each gesture video in the original data set according to its frame rate to generate a corresponding number of pictures, sort and store the pictures in temporal order, and unify the sampled picture sequences in the time dimension using a sliding-window method;
Step 2: using the unified picture sequence from step 1 as input, pre-train the lightweight 3D residual network and save the model weight file in HDF5 (.h5) format; the weight file stores the model structure, the model weights, the training configuration, and the optimizer state, so that training can resume from where it was last interrupted;
Step 3: load the weight file from step 2, take the training and validation sets of the RGB-D picture sequences as input, and learn short-term spatiotemporal features of the gestures in the video with the lightweight 3D residual network;
Step 4: feed the feature maps output in step 3 into a temporal convolutional network, which encodes the long-term spatiotemporal features of the dynamic gesture;
Step 5: use an attention mechanism to weight and fuse the spatiotemporal feature information of the RGB and Depth branch networks;
Step 6: classify the feature vectors output in step 5 with a fully connected layer, and map the classification result to gesture-class probability values through Softmax.
Further, step 1 (sampling the original data set videos, sorting and storing in temporal order, and unifying the sampled picture sequences in the time dimension with a sliding-window method) specifically comprises:
Sample each gesture video in the data set according to its frame rate, generating a corresponding number of pictures, and sort and store them in temporal order. To ensure that the input data have the same dimensionality, a sliding-window method sets a reference number of input frames for each gesture video; the reference frame number is set to 32. For videos longer than 32 frames, irrelevant frames at both ends are deleted and the key frames in the middle are retained; for videos shorter than 32 frames, some frames are repeated at a fixed rate, and this process loops until the sample reaches 32 frames. Finally, each frame is randomly cropped to 224 × 224 and resized to 112 × 112 pixels.
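As a concrete illustration, the 32-frame unification above can be sketched in pure Python. This is a minimal sketch under stated assumptions: the exact trimming and repetition rule of the invention is not given in detail, so the centered window and the even-repetition indexing below are illustrative choices, and `unify_frames` is a hypothetical name.

```python
def unify_frames(frames, target=32):
    """Map an arbitrary-length frame sequence to exactly `target` frames.

    Longer videos: drop frames at both ends and keep the middle key frames.
    Shorter videos: repeat frames at an even rate until the target is reached.
    """
    n = len(frames)
    if n >= target:
        start = (n - target) // 2            # trim irrelevant frames at both ends
        return frames[start:start + target]  # keep the middle window
    # output index i maps back to input frame i*n//target, repeating evenly
    return [frames[i * n // target] for i in range(target)]
```

A 40-frame clip keeps frames 4 through 35, while a 10-frame clip repeats each frame roughly three times and preserves temporal order.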
Further, step 2 (pre-training the lightweight 3D residual network with the unified picture sequence from step 1 as input, and saving the model weight file in .h5 form) specifically comprises:
Adopting the idea of transfer learning, pre-train the lightweight 3D residual network on the public Jester data set. Pre-training is divided into a feature-extraction part and a feature-classification part: the feature-extraction part is the lightweight 3D residual network, and the feature-classification part is a fully connected layer and a Softmax layer. During pre-training, the model weights are saved in .h5 form.
Further, step 3 (loading the weight file from step 2, taking the training and validation sets of the RGB-D picture sequences as input, and learning short-term spatiotemporal features of the gestures in the video with the lightweight 3D residual network) specifically comprises:
Divide the data set into a training set, a validation set, and a test set in the ratio 3:1:1; the training set trains the network model, the validation set measures model performance during training (parameters are tuned according to performance on the validation set), and the test set evaluates the generalization ability of the model. Feed the training and validation sets of the RGB and Depth picture sequences preprocessed in step 1 into two identical lightweight 3D residual networks for short-term spatiotemporal feature extraction. The lightweight 3D residual network is built on 3D-ResNet and replaces the original 3 × 3 × 3 convolution kernel with a separable convolution, i.e., the 3D kernel is split into a 3 × 1 × 1 temporal kernel and a 1 × 3 × 3 spatial kernel.
Further, step 4 (feeding the feature maps output in step 3 into a temporal convolutional network, which encodes the long-term spatiotemporal features of the dynamic gesture) specifically comprises:
Encode the feature maps output in step 3 with a temporal convolutional network to capture the information correlating the video frames within the dynamic gesture. The temporal convolutional network uses causal convolution and maps the input sequence to an output sequence of the same length; in addition, dilated convolution and residual connections are used so that a deeper network can be trained.
Assume the input sequence of the temporal convolutional network is X = [x_1, ..., x_T] and the output sequence is S = [s_1, ..., s_T]; causality means each output s_t depends only on [x_1, ..., x_t], never on future inputs. The dilated convolution is computed as:

F_d(x_t) = (x *_d h)(t) = Σ_{i=0}^{m−1} h_i · x_{t − d·i} + b_o

where *_d is the dilated convolution operator, d is the dilation coefficient, h is the impulse response of the filter, m denotes the convolution kernel size, h_i is the impulse response of the filter at tap i, and b_o is a bias term. For a TCN with L layers, the output of the last layer s_L is used for sequence classification; the class label of the sequence is assigned by a fully connected layer with a Softmax activation function.
Further, step 5 (weighting and fusing the spatiotemporal feature information of the RGB and Depth branch networks with an attention mechanism) specifically comprises:
The Depth image contains motion information and three-dimensional structure information from the depth channel and is insensitive to illumination changes, clothing, skin color, and other external factors, so fusing the RGB and Depth data represents the characteristics of gestures more accurately. By introducing an attention mechanism, a non-linear combination is provided that lets the network dynamically select the relevant information throughout feature extraction, realizing a strategy of weighted fusion of the RGB and Depth data. Let S_rgb be the feature-map sequence output by the RGB branch, S_depth the feature-map sequence output by the Depth branch, and z the fused feature-map sequence; the weighted sum of the two branches is:

z = α_rgb · S_rgb + α_depth · S_depth

where α = [α_rgb, α_depth] are the fusion coefficients, computed as:

α = Softmax(W_1 · δ(β(W_0 * AvgPool([S_rgb, S_depth]))))

where AvgPool denotes average pooling, F_fc and Conv denote the fully connected layer and the convolution layer, W_0 and W_1 denote the 1 × 1 × 1 convolution weights and the fully connected layer weights, respectively, β denotes batch normalization, and δ denotes the ReLU activation function.
Further, step 6 (classifying the feature vectors output in step 5 with a fully connected layer and mapping the classification result to gesture-class probability values through Softmax) specifically comprises:
Feed the fused information z = [z_1, ..., z_T] from step 5 into a fully connected layer, which multiplies a weight matrix by the input vector and adds a bias, outputting n scores, each of which is an unbounded real value. Softmax then maps the n scores into probabilities y in (0, 1):

y = Softmax(z) = Softmax(W^T z + b)    (6)

where W represents the weights and b represents the bias term. Softmax itself is computed as:

Softmax(z_i) = e^{z_i} / Σ_j e^{z_j}

The fully connected layer thus scores each gesture category, and Softmax maps each score into the interval 0-1, generating the probability value of each gesture class.
The invention has the following advantages and beneficial effects:
the invention provides a multi-modal dynamic gesture recognition depth network model based on a lightweight 3D residual network and TCN (traffic control network). in the structure, a 3D convolution kernel is optimized by utilizing the idea of separating convolution, long-term space-time characteristics are coded by the TCN, and multi-modal information is weighted and fused by adopting an attention mechanism. Compared with the existing method, on one hand, the idea of separating convolution and the joint use of TCN can greatly reduce the complexity of the model and improve the speed of model recognition, thereby being expected to realize real-time gesture recognition. The idea of the separate convolution is to split the three-dimensional convolution kernel into a one-dimensional convolution kernel and a two-dimensional convolution kernel, and connect the two convolutions in a serial manner. The method is equivalent to extracting the time dimension characteristics first and then extracting the space dimension characteristics, and meets the requirement of learning the dynamic gesture spatiotemporal characteristics. Moreover, the sum of the parameter quantities of the one-dimensional convolution kernel and the two-dimensional convolution kernel is far smaller than the parameter quantity of the three-dimensional convolution kernel; in addition, TCN is simpler in model and consumes less memory than LSTM. On the other hand, most of the existing methods directly fuse multi-modal information in a linear mode, which causes redundancy of the information to a certain extent, and the model has poor expandability. 
The information of the multiple modes is weighted and fused by using an attention mechanism, so that the redundancy of the information can be reduced, the weight of the long and short space-time characteristics can be automatically adjusted by the model according to the stimulation of the neurons, the performance of the model is improved, and the model has self-adaptability so as to be applied to other types of video learning tasks in an expanded mode.
Drawings
Fig. 1 is a flow chart of a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN according to a preferred embodiment of the present invention.
Fig. 2 is an architecture diagram of a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN.
Fig. 3 is a comparison graph of the convolution kernel of the 3D residual network and the convolution kernel of the lightweight 3D residual network.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, the multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN provided by this embodiment includes the following steps:
step 1: sampling each gesture video according to the frame rate of the gesture video in the original data set, generating pictures with the number corresponding to the frame rate of the video, and sorting and storing the pictures according to the time sequence. In order to ensure that input data have the same dimension, a window sliding method is used to set an input reference frame number for each gesture video. The value is set to 32, and for videos above 32 frames, irrelevant images at both ends are deleted, and the key frame in the middle is reserved. For video less than 32 frames, we repeat some frames at a certain rate. The process will loop until the samples exceed 32 frames. Finally, we randomly crop each frame to 224 × 224 and resize it to 112 × 112 pixels.
Step 2: because the 3D convolution has more parameter quantity, the convergence speed is low in model training, and the overfitting phenomenon is easy to occur. In order to solve the problems, the invention adopts the idea of transfer learning and uses a Jester data set to pre-train a lightweight 3D residual error network. The pre-training process is divided into two parts of feature extraction and feature classification, wherein the feature extraction part is a lightweight 3D residual error network, and the feature classification part is a full connection layer and a Softmax layer. During pre-training, the weights for the model are saved in the form of h 5.
Step 3: divide the data set into three parts, a training set, a validation set, and a test set, in the ratio 3:1:1. The training set is used to train the network model; the validation set measures model performance during training, and parameters are generally tuned according to performance on the validation set; the test set evaluates the generalization ability of the model. The training and validation sets of the RGB and Depth picture sequences preprocessed in step 1 are fed into two identical lightweight 3D residual networks for short-term spatiotemporal feature extraction. The lightweight 3D residual network takes 3D-ResNet as its basis and replaces the original 3 × 3 × 3 convolution kernel with a separable convolution that splits the 3D kernel into a 3 × 1 × 1 temporal kernel and a 1 × 3 × 3 spatial kernel, preserving the performance of 3D convolution while reducing its parameter count.
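The parameter saving from this kernel split can be checked with simple arithmetic. The sketch assumes a 3 × 3 × 3 kernel is replaced by a 3 × 1 × 1 kernel followed by a 1 × 3 × 3 kernel with unchanged channel widths; the actual layer widths in the network may differ.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, bias=True):
    """Parameters of one 3D convolution layer: c_out filters of size
    c_in x kt x kh x kw, plus one bias per output channel."""
    return c_out * (c_in * kt * kh * kw) + (c_out if bias else 0)

def full_3x3x3(c_in, c_out):
    """A single full 3x3x3 convolution."""
    return conv3d_params(c_in, c_out, 3, 3, 3)

def separable_3x3x3(c_in, c_out):
    """Separable variant: 3x1x1 temporal convolution, then 1x3x3 spatial."""
    return conv3d_params(c_in, c_out, 3, 1, 1) + conv3d_params(c_out, c_out, 1, 3, 3)
```

For 64 input and 64 output channels this gives 110,656 parameters for the full kernel versus 49,280 for the separable pair, well under half, since 3 + 9 = 12 weights per channel pair replace the original 27.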
Step 4: dynamic gesture recognition identifies a specific gesture class from a series of gesture actions, so features in the time dimension are very important. The invention uses a temporal convolutional network to encode the feature-map sequence output in step 3 and capture the information correlating the video frames within the dynamic gesture. The temporal convolutional network is a recent architecture for time-series prediction; its main features are the use of causal convolution and the mapping of the input sequence to an output sequence of the same length. In addition, to cover inputs far back in the sequence, the model uses dilated convolutions and residual connections, which enlarge the receptive field and allow a deeper network to be trained.
Assume the input sequence of the temporal convolutional network is X = [x_1, ..., x_T] and the output sequence is S = [s_1, ..., s_T]; causality means each output s_t depends only on [x_1, ..., x_t], never on future inputs. The dilated convolution is computed as:

F_d(x_t) = (x *_d h)(t) = Σ_{i=0}^{m−1} h_i · x_{t − d·i} + b_o

where *_d is the dilated convolution operator, d is the dilation coefficient, h is the impulse response of the filter, m denotes the convolution kernel size, h_i is the impulse response of the filter at tap i, and b_o is a bias term. For a TCN with L layers, the output of the last layer s_L is used for sequence classification; the class label of the sequence is assigned through a fully connected layer with a Softmax activation function.
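The causal dilated convolution can be written directly in NumPy as a naive reference implementation (not the invention's optimized network code); zero padding on the left keeps the output the same length as the input.

```python
import numpy as np

def causal_dilated_conv(x, h, d=1, b=0.0):
    """y[t] = sum_i h[i] * x[t - d*i] + b, with x treated as zero for
    negative indices, so y[t] never depends on future inputs."""
    T, m = len(x), len(h)
    y = np.full(T, b, dtype=float)
    for t in range(T):
        for i in range(m):
            if t - d * i >= 0:          # skip taps that fall before the sequence
                y[t] += h[i] * x[t - d * i]
    return y
```

With kernel h = [1, 1] and dilation d = 2, each output sums the current sample and the one two steps back, and editing a later input never changes an earlier output, which demonstrates causality.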
And 5: the Depth image contains motion information and three-dimensional structure information from the Depth channel and is insensitive to illumination variations, clothing, skin tone and other external factors. Therefore, it can be an important complement to the original RGB image. The fusion of the RGB data and the Depth data can more accurately represent the characteristics of the gesture, so that the accuracy of gesture recognition is improved. In addition, it is also important to select an appropriate fusion strategy. The approach of linear aggregation may not be sufficient to provide a strong adaptability of the neurons, and may also produce redundant information. In order to solve the problem, the invention provides a nonlinear combination mode by introducing an attention mechanism. The method can enable the network to dynamically select corresponding information in the whole feature extraction process, and realizes a strategy of weighting and fusing RGB and Depth data. Assume that the sequence of the profile of the output of the RGB branch is SrgbThe output characteristic diagram sequence of the Depth branch is SdepthAnd the sequence of the feature map after the fusion of the two is z, the weighted summation of the two branches is as follows:
wherein α ═ αrgb,αdepth]For the fusion coefficient, the calculation formula is as follows:
whereinDenotes average pooling, FfcConv and AvgPool respectively represent the total connection layer, the convolution layer, the average pooling, Srgb、SdepthSequence of feature maps representing the output of RGB branches, sequence of output feature maps of Depth branches, W, respectively0And W1The convolution weight and full-link layer weight of 1 × 1 × 1 are respectively expressed, β represents batch normalization, and δ represents the ReLU activation function.
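A small NumPy sketch of the attention-weighted fusion follows. The projection matrices `W0` and `W1` and the per-branch score pipeline are illustrative assumptions standing in for the 1 × 1 × 1 convolution and fully connected layer; the invention's exact attention layout may differ.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # shift by max for numerical stability
    return e / e.sum()

def fuse_rgb_depth(S_rgb, S_depth, W0, W1):
    """Average-pool each branch's feature sequence over time, score it via
    ReLU(W0 @ pooled) followed by W1, and fuse the two branches with
    Softmax-normalized coefficients (a convex combination)."""
    scores = []
    for S in (S_rgb, S_depth):
        pooled = S.mean(axis=0)                            # AvgPool over the sequence
        scores.append(W1 @ np.maximum(W0 @ pooled, 0.0))   # projection -> ReLU -> FC score
    alpha = softmax(np.array(scores))                      # alpha_rgb + alpha_depth = 1
    return alpha[0] * S_rgb + alpha[1] * S_depth, alpha
```

Because the coefficients come from a Softmax they always sum to one, so the fused sequence keeps the same shape as each branch and the branch with the stronger pooled response receives the larger weight.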
And 6: weighting the fused information z ═ z in the step 51,...,zT]The input is the full connection layer, the full connection layer multiplies the weight matrix and the input vector, and the offset is added, and n (plus infinity, minus infinity) fractions are output. Softmax maps the n (positive infinity, negative infinity) scores to a probability y of (0, 1). The calculation formula is as follows:
y=Softmax(z)=Softmax(WTz+b) (13)
w represents a weight, b represents a bias term. Softmax is calculated as follows:
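The classification step reduces to one matrix product and a Softmax, as in this minimal sketch; the max-subtraction is a standard numerical-stability trick not stated in the text.

```python
import numpy as np

def classify(z, W, b):
    """Fully connected layer y = Softmax(W^T z + b): maps a fused feature
    vector z to n class probabilities in (0, 1) that sum to 1."""
    scores = W.T @ z + b                  # n unbounded real-valued scores
    e = np.exp(scores - np.max(scores))   # subtract max before exponentiating
    return e / e.sum()                    # normalize to probabilities
```

The predicted gesture class is then simply the index of the largest probability.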
the method illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.
Claims (6)
1. A multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN, characterized by comprising the following steps:
Step 1: sampling each gesture video in the original data set according to its frame rate, generating a corresponding number of pictures, sorting and storing the pictures in temporal order, and unifying the sampled picture sequences in the time dimension with a sliding-window method;
Step 2: pre-training the lightweight 3D residual network with the unified picture sequence from step 1 as input, and saving the model weight file in .h5 form, wherein the weight file saves the model structure, the model weights, the training configuration, and the optimizer state, so that training can resume from where it was last interrupted;
Step 3: loading the weight file from step 2, taking the training and validation sets of the RGB-D picture sequences as input, and learning short-term spatiotemporal features of the gestures in the video with the lightweight 3D residual network;
Step 4: feeding the feature maps output in step 3 into a temporal convolutional network, which encodes the long-term spatiotemporal features of the dynamic gesture;
Step 5: weighting and fusing the spatiotemporal feature information of the RGB and Depth branch networks with an attention mechanism;
Step 6: classifying the feature vectors output in step 5 with a fully connected layer, and mapping the classification result to gesture-class probability values through Softmax;
the step 5: the method for weighting and fusing the space-time characteristic information of the RGB and Depth branch networks by using the attention mechanism specifically comprises the following steps:
the Depth image contains motion information and three-dimensional structure information from a Depth channel and is insensitive to illumination change, clothes, skin color and other external factors, the fusion of RGB data and Depth data is used for accurately representing the characteristics of gestures, a non-linear combination mode is provided by introducing an attention mechanism, the mode can enable a network to dynamically select corresponding information in the whole characteristic extraction process, the strategy of weighting and fusing RGB and Depth data is realized, and the characteristic diagram sequence of the output of RGB branches is assumed as SrgbThe output characteristic diagram sequence of the Depth branch is SdepthAnd the sequence of the feature map after the fusion of the two is z, the weighted summation of the two branches is as follows:
wherein α ═ αrgb,αdepth]For the fusion coefficient, the calculation formula is as follows:
whereinDenotes average pooling, FfcConv and AvgPool respectively represent the total connection layer, the convolution layer, the average pooling, Srgb、SdepthSequence of profiles representing the output of RGB branches, sequence of output profiles of Depth branches, W, respectively0And W1Convolution weights and full-connected layer weights of 1 × 1, respectively, are indicated, β indicates batch normalization, and δ indicates the ReLU activation function.
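The weighted fusion of this claim can be sketched numerically as follows; this is a minimal NumPy illustration, assuming Softmax-normalized fusion coefficients and using a single scoring vector per branch in place of the full AvgPool/Conv/BN/FC stack (function names, shapes, and channel counts are illustrative, not from the patent):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(s_rgb, s_depth, w_rgb, w_depth):
    """Weighted fusion z = a_rgb * S_rgb + a_depth * S_depth.

    s_rgb, s_depth: branch feature maps of shape (C, H, W).
    w_rgb, w_depth: illustrative scoring vectors standing in for the
    1x1-conv / BN / ReLU / FC stack that produces the fusion scores.
    """
    g_rgb = s_rgb.mean(axis=(1, 2))            # AvgPool over H, W -> (C,)
    g_depth = s_depth.mean(axis=(1, 2))
    a_rgb, a_depth = softmax(np.array([w_rgb @ g_rgb, w_depth @ g_depth]))
    return a_rgb * s_rgb + a_depth * s_depth   # convex combination of branches

rng = np.random.default_rng(0)
s_rgb = rng.standard_normal((8, 4, 4))
s_depth = rng.standard_normal((8, 4, 4))
w = rng.standard_normal(8)
z = attention_fuse(s_rgb, s_depth, w, w)
print(z.shape)                                 # (8, 4, 4)
```

Because α_rgb + α_depth = 1, the fused map is a convex combination of the two branch maps, so neither modality can be amplified beyond its own response.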
2. The method according to claim 1, wherein the step 1 of sampling the videos of the original data set, sorting and storing them in temporal order, and unifying the sampled picture sequences in the time dimension by using a window sliding method specifically comprises the following steps:
sampling each gesture video in the data set according to its frame rate to generate a corresponding number of pictures, and sorting and storing the picture sequences in temporal order; to ensure that the input data have the same dimensions, a window sliding method is used to set the input reference frame number of each gesture video, with the reference frame number set to 32: for videos with more than 32 frames, the irrelevant images at both ends are deleted and the key frames in the middle are kept; for videos with fewer than 32 frames, some frames are repeated in a certain proportion, and this process is executed cyclically until the sample reaches 32 frames; finally, each frame is randomly cropped to 224 × 224 and resized to 112 × 112 pixels.
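The window-sliding normalization above can be sketched in a few lines of Python; the function name and the choice of a centred window are our assumptions, while the 32-frame reference number comes from the claim:

```python
def normalize_length(frames, target=32):
    """Unify a sampled frame sequence to a fixed temporal length.

    For clips longer than `target`, keep the central window (dropping
    frames at both ends); for shorter clips, repeat the clip cyclically
    until the target length is reached.
    """
    n = len(frames)
    if n >= target:
        start = (n - target) // 2          # centred sliding window
        return frames[start:start + target]
    reps = -(-target // n)                 # ceil division
    return (frames * reps)[:target]        # loop the clip, then truncate

clip_long = list(range(40))                # stand-in for 40 decoded frames
clip_short = list(range(10))
print(len(normalize_length(clip_long)))    # 32
print(normalize_length(clip_long)[0])      # 4  (centre window of 40 frames)
print(len(normalize_length(clip_short)))   # 32
```

The same helper would run once per video before the random 224 × 224 crop and 112 × 112 resize.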
3. The method according to claim 2, wherein the step 2 of pre-training the lightweight 3D residual network using the normalized picture sequences from step 1 as input and saving the lightweight 3D residual network model weight file in h5 format specifically comprises the following steps:
adopting the idea of transfer learning, the lightweight 3D residual network is pre-trained on the public Jester data set; the pre-training process is divided into two parts, feature extraction and feature classification, wherein the feature extraction part is the lightweight 3D residual network and the feature classification part consists of a fully connected layer and a Softmax layer; during pre-training, the weights of the model are saved in h5 format.
4. The method according to claim 3, wherein the step 3 of loading the weight file from step 2, taking the training set and validation set of the RGB-D picture sequences as input, and learning short-term spatio-temporal features of gestures in the video using the lightweight 3D residual network specifically comprises the following steps:
dividing the data set into a training set, a validation set, and a test set in the ratio 3:1:1, wherein the training set is used to train the network model, the validation set is used to verify the performance of the model during training, with parameters tuned according to the model's performance on the validation set, and the test set is used to evaluate the generalization ability of the model; the training sets and validation sets of the RGB and Depth picture sequences preprocessed in step 1 are respectively input into two identical lightweight 3D residual networks for short-term spatio-temporal feature extraction, wherein the lightweight 3D residual network, on the basis of 3D-ResNet, replaces the original 3 × 3 × 3 3D convolution kernel with a separable convolution that splits the 3D convolution kernel into a 1 × 3 × 3 kernel and a 3 × 1 × 1 kernel.
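As a rough illustration of why this separable convolution makes the network lightweight, reading the split as a 1 × 3 × 3 spatial kernel followed by a 3 × 1 × 1 temporal kernel, a simple parameter count (channel sizes are illustrative, not from the patent):

```python
# Compare parameters of a full 3D kernel against the separable
# (spatial + temporal) factorisation, for illustrative channel counts.
c_in, c_out = 64, 64

full_3d = c_in * c_out * 3 * 3 * 3                  # one 3x3x3 kernel
separable = (c_in * c_out * 1 * 3 * 3               # 1x3x3 spatial kernel
             + c_out * c_out * 3 * 1 * 1)           # 3x1x1 temporal kernel

print(full_3d)              # 110592
print(separable)            # 49152
print(separable < full_3d)  # True: fewer parameters per block
```

With equal channel widths the factorised block uses less than half the parameters of the full 3D kernel, which is what makes the residual network "lightweight" in this design.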
5. The method according to claim 4, wherein the step 4 of inputting the feature maps output in step 3 into a temporal convolutional network and encoding the long-term spatio-temporal features of the dynamic gesture using the temporal convolutional network specifically comprises the following steps:
encoding the feature maps output in step 3 by using a temporal convolutional network to capture the relevant information between video frames in the dynamic gesture, wherein the temporal convolutional network uses causal convolution and maps the input sequence to an output sequence of the same length, and in addition dilated convolutions and residual connections are used to train a deeper network;
assuming the input sequence of the temporal convolutional network is X = [x_1, ..., x_T] and the output is S = [s_1, ..., s_T], then s_t depends only on [x_1, ..., x_t], t ≤ T, since the dilated convolution is calculated as s_t = (x *_d h)(t) = Σ_m h_m · x_{t−d·m},
where *_d denotes the dilated convolution operator, d is the dilation coefficient, h is the impulse response of the filter, m denotes the convolution kernel size, and h_m represents the impulse response of the filter at the m-th tap of the kernel; for a TCN with L layers, the output of the last layer s_L is used for sequence classification, and the class label of the sequence is assigned by a fully connected layer with a Softmax activation function;
where b_o represents a bias term.
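The causal dilated convolution of this claim can be sketched as follows; a minimal NumPy version assuming zero padding for out-of-range taps (the function name is ours):

```python
import numpy as np

def causal_dilated_conv(x, h, d):
    """1-D causal dilated convolution: y[t] = sum_m h[m] * x[t - d*m].

    Each output y[t] depends only on inputs at or before t, matching the
    causality constraint of the TCN; taps before the start of the
    sequence are treated as zero (zero padding).
    """
    T, m = len(x), len(h)
    y = np.zeros(T)
    for t in range(T):
        for k in range(m):
            j = t - d * k                 # dilated tap position
            if j >= 0:
                y[t] += h[k] * x[j]
    return y

x = np.arange(1.0, 9.0)                   # input sequence [1..8]
h = np.array([1.0, 1.0])                  # kernel of size m = 2
print(causal_dilated_conv(x, h, d=1))     # each y[t] = x[t] + x[t-1]
print(causal_dilated_conv(x, h, d=2))     # each y[t] = x[t] + x[t-2]
```

Increasing `d` per layer widens the receptive field exponentially while keeping the output the same length as the input, which is how the TCN captures long-term dependencies.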
6. The method according to claim 5, wherein the step 6 of classifying the feature vectors output in step 5 using a fully connected layer and mapping the classification result to probability values of the gesture classes through Softmax specifically comprises the following steps:
inputting the weighted-fusion information z = [z_1, ..., z_T] from step 5 into a fully connected layer, wherein the fully connected layer multiplies its weight matrix by the input vector and adds a bias, outputting n scores in (−∞, +∞); Softmax maps the n scores to probabilities y in (0, 1), calculated as follows:
y = Softmax(W^T z + b) (6)
where W represents the weight and b represents the bias term, and Softmax is calculated as Softmax(s_i) = e^{s_i} / Σ_{j=1}^{n} e^{s_j};
the fully connected layer classifies the gesture and obtains a score for each category, and Softmax then maps each score to the interval (0, 1), that is, generates the probability value of each gesture category.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011467797.8A CN112507898B (en) | 2020-12-14 | 2020-12-14 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112507898A CN112507898A (en) | 2021-03-16 |
CN112507898B true CN112507898B (en) | 2022-07-01 |
Family
ID=74972911
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011467797.8A Active CN112507898B (en) | 2020-12-14 | 2020-12-14 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112507898B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673280A (en) * | 2020-05-14 | 2021-11-19 | 索尼公司 | Image processing apparatus, image processing method, and computer-readable storage medium |
CN113065451B (en) * | 2021-03-29 | 2022-08-09 | 四川翼飞视科技有限公司 | Multi-mode fused action recognition device and method and storage medium |
CN113095386B (en) * | 2021-03-31 | 2023-10-13 | 华南师范大学 | Gesture recognition method and system based on triaxial acceleration space-time feature fusion |
CN113178073A (en) * | 2021-04-25 | 2021-07-27 | 南京工业大学 | Traffic flow short-term prediction optimization application method based on time convolution network |
CN112926557B (en) * | 2021-05-11 | 2021-09-10 | 北京的卢深视科技有限公司 | Method for training multi-mode face recognition model and multi-mode face recognition method |
CN113239824B (en) * | 2021-05-19 | 2024-04-05 | 北京工业大学 | Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module |
CN113297955B (en) * | 2021-05-21 | 2022-03-18 | 中国矿业大学 | Sign language word recognition method based on multi-mode hierarchical information fusion |
CN113343198B (en) * | 2021-06-23 | 2022-12-16 | 华南理工大学 | Video-based random gesture authentication method and system |
CN113435340B (en) * | 2021-06-29 | 2022-06-10 | 福州大学 | Real-time gesture recognition method based on improved Resnet |
CN113361655B (en) * | 2021-07-12 | 2022-09-27 | 武汉智目智能技术合伙企业(有限合伙) | Differential fiber classification method based on residual error network and characteristic difference fitting |
CN113609923B (en) * | 2021-07-13 | 2022-05-13 | 中国矿业大学 | Attention-based continuous sign language sentence recognition method |
CN113449682B (en) * | 2021-07-15 | 2023-08-08 | 四川九洲电器集团有限责任公司 | Method for identifying radio frequency fingerprints in civil aviation field based on dynamic fusion model |
CN115578683B (en) * | 2022-12-08 | 2023-04-28 | 中国海洋大学 | Construction method of dynamic gesture recognition model and dynamic gesture recognition method |
CN115862144B (en) * | 2022-12-23 | 2023-06-23 | 杭州晨安科技股份有限公司 | Gesture recognition method for camera |
CN115953839B (en) * | 2022-12-26 | 2024-04-12 | 广州紫为云科技有限公司 | Real-time 2D gesture estimation method based on loop architecture and key point regression |
CN117218716B (en) * | 2023-08-10 | 2024-04-09 | 中国矿业大学 | DVS-based automobile cabin gesture recognition system and method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
WO2020181685A1 (en) * | 2019-03-12 | 2020-09-17 | 南京邮电大学 | Vehicle-mounted video target detection method based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298361B (en) * | 2019-05-22 | 2021-05-04 | 杭州未名信科科技有限公司 | Semantic segmentation method and system for RGB-D image |
CN111985369B (en) * | 2020-08-07 | 2021-09-17 | 西北工业大学 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
- 2020-12-14: CN application CN202011467797.8A, patent CN112507898B (status: active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112507898B (en) | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN | |
CN110866140B (en) | Image feature extraction model training method, image searching method and computer equipment | |
CN111210443B (en) | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance | |
CN109543714B (en) | Data feature acquisition method and device, electronic equipment and storage medium | |
CN112396002A (en) | Lightweight remote sensing target detection method based on SE-YOLOv3 | |
WO2021022521A1 (en) | Method for processing data, and method and device for training neural network model | |
Mungra et al. | PRATIT: a CNN-based emotion recognition system using histogram equalization and data augmentation | |
WO2021057056A1 (en) | Neural architecture search method, image processing method and device, and storage medium | |
CN113255443B (en) | Graph annotation meaning network time sequence action positioning method based on pyramid structure | |
CN111639544A (en) | Expression recognition method based on multi-branch cross-connection convolutional neural network | |
US11908457B2 (en) | Orthogonally constrained multi-head attention for speech tasks | |
CN109885709A (en) | A kind of image search method, device and storage medium based on from the pre- dimensionality reduction of coding | |
Sharma et al. | Deep eigen space based ASL recognition system | |
CN112307982A (en) | Human behavior recognition method based on staggered attention-enhancing network | |
US20220101539A1 (en) | Sparse optical flow estimation | |
Li et al. | Robustness comparison between the capsule network and the convolutional network for facial expression recognition | |
CN113158815A (en) | Unsupervised pedestrian re-identification method, system and computer readable medium | |
Wang et al. | A pseudoinverse incremental algorithm for fast training deep neural networks with application to spectra pattern recognition | |
CN113780249B (en) | Expression recognition model processing method, device, equipment, medium and program product | |
Gkalelis et al. | Objectgraphs: Using objects and a graph convolutional network for the bottom-up recognition and explanation of events in video | |
CN116863194A (en) | Foot ulcer image classification method, system, equipment and medium | |
US20230072445A1 (en) | Self-supervised video representation learning by exploring spatiotemporal continuity | |
CN112016592B (en) | Domain adaptive semantic segmentation method and device based on cross domain category perception | |
CN111079900B (en) | Image processing method and device based on self-adaptive connection neural network | |
CN110347853B (en) | Image hash code generation method based on recurrent neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||