CN112507898B - Multi-modal dynamic gesture recognition method based on lightweight 3D residual network and TCN - Google Patents
- Publication number
- CN112507898B (application CN202011467797.8A)
- Authority
- CN
- China
- Prior art keywords
- network
- convolution
- sequence
- lightweight
- rgb
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN (temporal convolutional network). First, the original videos in the data set are sampled, and the resulting frames are sorted and stored in temporal order. Next, the lightweight 3D residual network is pre-trained on a large public gesture recognition data set, and the model's weight file is saved. Then, taking the RGB-D image sequence as input and the lightweight 3D residual network plus the temporal convolutional network as base models, long- and short-term spatiotemporal features are extracted, and the information of the two modalities is fused by weighting with an attention mechanism; the RGB and Depth sequences are each fed into an identical network structure. Finally, a fully connected layer performs classification, a cross-entropy loss function computes the loss value, and accuracy and F1 score serve as evaluation metrics for the network model. The invention achieves high classification accuracy with a low parameter count.
Description
Technical Field
The invention belongs to the technical field of video spatiotemporal feature extraction and classification, and particularly relates to a lightweight heterogeneous structure for dynamic-gesture spatiotemporal feature extraction that reduces model parameters while preserving model performance.
Background
Gestures are a common form of human communication, and gesture recognition enables human-computer interaction in a natural way. Gesture recognition aims to understand human actions by extracting features from images or videos and then classifying or recognizing each sample as a specific label. Traditional gesture recognition relies mainly on hand-crafted features; although this approach can achieve good recognition results, it depends on the experience of researchers to design the features, and hand-crafted features adapt poorly to dynamic gestures.
With the development of deep learning, end-to-end gesture recognition has become increasingly feasible, and more and more researchers attempt gesture recognition with deep learning models. The two-stream network was a pioneering attempt in dynamic gesture recognition research. It was first proposed to address the inability of conventional convolutional neural networks (CNNs) to handle temporal information in motion recognition; its main idea is to use two independent CNNs to extract spatial features from images and temporal information from optical-flow data, respectively. However, optical flow is computed from continuous video input, which requires a large amount of computation and greatly reduces the overall speed of the two-stream model. 3D CNNs can learn spatiotemporal features directly and have achieved breakthroughs in various computer-vision analysis tasks. They introduce a time dimension into the 2D convolution kernel, so spatial and temporal features can be extracted simultaneously. Based on 3D CNNs, researchers have proposed a number of deep network models with outstanding performance, such as 3D-ResNet, I3D, and S3D. However, compared with 2D convolution, 3D convolution has a very large number of parameters, and training such a model often takes a long time. Moreover, each 3D convolution typically processes only a small time window rather than the entire video. Therefore, 3D CNNs cannot efficiently encode long-term spatiotemporal information in dynamic gesture video, which hinders their use in video tasks.
Recurrent neural networks (RNNs) and their variant, long short-term memory (LSTM), are deep learning models that take sequence data as input and are commonly used to encode the long-term spatiotemporal features of dynamic gestures. LSTM integrates information over time by learning how to store, modify, and access an internal state through its memory cells, which lets it discover both long-term and short-term temporal relationships in video. However, because the memory cells use full connections in the input-to-state and state-to-state transitions, no spatial correlation information is encoded. Unlike traditional LSTM, convolutional LSTM (ConvLSTM) explicitly assumes that the input is a sequence of images and replaces the vector multiplications in the LSTM gates with convolution operations, so the intermediate image representation retains spatially relevant information during recursion. Cascading 3D CNNs with ConvLSTM is currently the most widely used approach for dynamic gesture recognition tasks. However, this method requires more memory and more computation when the model is trained.
Therefore, a lightweight deep network model that still guarantees model performance is needed. Separable convolution can greatly reduce the parameter count of 3D convolution while preserving model performance. The temporal convolutional network (TCN) is a recent architecture for time-series prediction with relatively low computational complexity. Combining a lightweight 3D residual network with a TCN is expected to address the generally high complexity of existing methods, while weighted fusion of multi-modal features can further improve classification accuracy.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. It provides a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN that balances model performance against model parameter count. The technical scheme of the invention is as follows:
A multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN comprises the following steps:
Step 1: sample each gesture video in the original data set according to its frame rate to generate a corresponding number of pictures, sort and store the pictures in temporal order, and unify the sampled picture sequences in the time dimension using a sliding-window method;
Step 2: using the unified picture sequence from step 1 as input, pre-train the lightweight 3D residual network and save the model weight file in HDF5 (.h5) format; the weight file stores the model structure, the model weights, the training configuration, and the optimizer state, so that training can resume from where it was last interrupted;
Step 3: load the weight file from step 2, take the training and validation sets of the RGB-D picture sequences as input, and learn short-term spatiotemporal features of the gestures in the video with the lightweight 3D residual network;
Step 4: feed the feature maps output in step 3 into a temporal convolutional network, which encodes the long-term spatiotemporal features of the dynamic gesture;
Step 5: use an attention mechanism to weight and fuse the spatiotemporal feature information of the RGB and Depth branch networks;
Step 6: classify the feature vectors output in step 5 with a fully connected layer, and map the classification result to gesture-class probability values through Softmax.
Further, step 1 (sampling the original data set videos, sorting and storing in temporal order, and unifying the sampled picture sequences in the time dimension with a sliding-window method) specifically comprises:
Sample each gesture video in the data set according to its frame rate, generating a corresponding number of pictures, and sort and store them in temporal order. To ensure that the input data have the same dimensionality, a sliding-window method sets a reference number of input frames for each gesture video; the reference frame number is set to 32. For videos longer than 32 frames, irrelevant frames at both ends are deleted and the key frames in the middle are retained; for videos shorter than 32 frames, some frames are repeated at a fixed rate, and this process loops until the sample reaches 32 frames. Finally, each frame is randomly cropped to 224 × 224 and resized to 112 × 112 pixels.
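As a concrete illustration, the 32-frame unification above can be sketched in pure Python. This is a minimal sketch under stated assumptions: the exact trimming and repetition rule of the invention is not given in detail, so the centered window and the even-repetition indexing below are illustrative choices, and `unify_frames` is a hypothetical name.

```python
def unify_frames(frames, target=32):
    """Map an arbitrary-length frame sequence to exactly `target` frames.

    Longer videos: drop frames at both ends and keep the middle key frames.
    Shorter videos: repeat frames at an even rate until the target is reached.
    """
    n = len(frames)
    if n >= target:
        start = (n - target) // 2            # trim irrelevant frames at both ends
        return frames[start:start + target]  # keep the middle window
    # output index i maps back to input frame i*n//target, repeating evenly
    return [frames[i * n // target] for i in range(target)]
```

A 40-frame clip keeps frames 4 through 35, while a 10-frame clip repeats each frame roughly three times and preserves temporal order.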
Further, step 2 (pre-training the lightweight 3D residual network with the unified picture sequence from step 1 as input, and saving the model weight file in .h5 form) specifically comprises:
Adopting the idea of transfer learning, pre-train the lightweight 3D residual network on the public Jester data set. Pre-training is divided into a feature-extraction part and a feature-classification part: the feature-extraction part is the lightweight 3D residual network, and the feature-classification part is a fully connected layer and a Softmax layer. During pre-training, the model weights are saved in .h5 form.
Further, step 3 (loading the weight file from step 2, taking the training and validation sets of the RGB-D picture sequences as input, and learning short-term spatiotemporal features of the gestures in the video with the lightweight 3D residual network) specifically comprises:
Divide the data set into a training set, a validation set, and a test set in the ratio 3:1:1; the training set trains the network model, the validation set measures model performance during training (parameters are tuned according to performance on the validation set), and the test set evaluates the generalization ability of the model. Feed the training and validation sets of the RGB and Depth picture sequences preprocessed in step 1 into two identical lightweight 3D residual networks for short-term spatiotemporal feature extraction. The lightweight 3D residual network is built on 3D-ResNet and replaces the original 3 × 3 × 3 convolution kernel with a separable convolution, i.e., the 3D kernel is split into a 3 × 1 × 1 temporal kernel and a 1 × 3 × 3 spatial kernel.
Further, step 4 (feeding the feature maps output in step 3 into a temporal convolutional network, which encodes the long-term spatiotemporal features of the dynamic gesture) specifically comprises:
Encode the feature maps output in step 3 with a temporal convolutional network to capture the information correlating the video frames within the dynamic gesture. The temporal convolutional network uses causal convolution and maps the input sequence to an output sequence of the same length; in addition, dilated convolution and residual connections are used so that a deeper network can be trained.
Assume the input sequence of the temporal convolutional network is X = [x_1, ..., x_T] and the output sequence is S = [s_1, ..., s_T]; causality means each output s_t depends only on [x_1, ..., x_t], never on future inputs. The dilated convolution is computed as:

F_d(x_t) = (x *_d h)(t) = Σ_{i=0}^{m−1} h_i · x_{t − d·i} + b_o

where *_d is the dilated convolution operator, d is the dilation coefficient, h is the impulse response of the filter, m denotes the convolution kernel size, h_i is the impulse response of the filter at tap i, and b_o is a bias term. For a TCN with L layers, the output of the last layer s_L is used for sequence classification; the class label of the sequence is assigned by a fully connected layer with a Softmax activation function.
Further, step 5 (weighting and fusing the spatiotemporal feature information of the RGB and Depth branch networks with an attention mechanism) specifically comprises:
The Depth image contains motion information and three-dimensional structure information from the depth channel and is insensitive to illumination changes, clothing, skin color, and other external factors, so fusing the RGB and Depth data represents the characteristics of gestures more accurately. By introducing an attention mechanism, a non-linear combination is provided that lets the network dynamically select the relevant information throughout feature extraction, realizing a strategy of weighted fusion of the RGB and Depth data. Let S_rgb be the feature-map sequence output by the RGB branch, S_depth the feature-map sequence output by the Depth branch, and z the fused feature-map sequence; the weighted sum of the two branches is:

z = α_rgb · S_rgb + α_depth · S_depth

where α = [α_rgb, α_depth] are the fusion coefficients, computed as:

α = Softmax(W_1 · δ(β(W_0 * AvgPool([S_rgb, S_depth]))))

where AvgPool denotes average pooling, F_fc and Conv denote the fully connected layer and the convolution layer, W_0 and W_1 denote the 1 × 1 × 1 convolution weights and the fully connected layer weights, respectively, β denotes batch normalization, and δ denotes the ReLU activation function.
Further, step 6 (classifying the feature vectors output in step 5 with a fully connected layer and mapping the classification result to gesture-class probability values through Softmax) specifically comprises:
Feed the fused information z = [z_1, ..., z_T] from step 5 into a fully connected layer, which multiplies a weight matrix by the input vector and adds a bias, outputting n scores, each of which is an unbounded real value. Softmax then maps the n scores into probabilities y in (0, 1):

y = Softmax(z) = Softmax(W^T z + b)    (6)

where W represents the weights and b represents the bias term. Softmax itself is computed as:

Softmax(z_i) = e^{z_i} / Σ_j e^{z_j}

The fully connected layer thus scores each gesture category, and Softmax maps each score into the interval 0-1, generating the probability value of each gesture class.
The invention has the following advantages and beneficial effects:
the invention provides a multi-modal dynamic gesture recognition depth network model based on a lightweight 3D residual network and TCN (traffic control network). in the structure, a 3D convolution kernel is optimized by utilizing the idea of separating convolution, long-term space-time characteristics are coded by the TCN, and multi-modal information is weighted and fused by adopting an attention mechanism. Compared with the existing method, on one hand, the idea of separating convolution and the joint use of TCN can greatly reduce the complexity of the model and improve the speed of model recognition, thereby being expected to realize real-time gesture recognition. The idea of the separate convolution is to split the three-dimensional convolution kernel into a one-dimensional convolution kernel and a two-dimensional convolution kernel, and connect the two convolutions in a serial manner. The method is equivalent to extracting the time dimension characteristics first and then extracting the space dimension characteristics, and meets the requirement of learning the dynamic gesture spatiotemporal characteristics. Moreover, the sum of the parameter quantities of the one-dimensional convolution kernel and the two-dimensional convolution kernel is far smaller than the parameter quantity of the three-dimensional convolution kernel; in addition, TCN is simpler in model and consumes less memory than LSTM. On the other hand, most of the existing methods directly fuse multi-modal information in a linear mode, which causes redundancy of the information to a certain extent, and the model has poor expandability. 
The information of the multiple modes is weighted and fused by using an attention mechanism, so that the redundancy of the information can be reduced, the weight of the long and short space-time characteristics can be automatically adjusted by the model according to the stimulation of the neurons, the performance of the model is improved, and the model has self-adaptability so as to be applied to other types of video learning tasks in an expanded mode.
Drawings
Fig. 1 is a flow chart of a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN according to a preferred embodiment of the present invention.
Fig. 2 is an architecture diagram of a multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN.
Fig. 3 is a comparison graph of the convolution kernel of the 3D residual network and the convolution kernel of the lightweight 3D residual network.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, the multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN provided by this embodiment includes the following steps:
step 1: sampling each gesture video according to the frame rate of the gesture video in the original data set, generating pictures with the number corresponding to the frame rate of the video, and sorting and storing the pictures according to the time sequence. In order to ensure that input data have the same dimension, a window sliding method is used to set an input reference frame number for each gesture video. The value is set to 32, and for videos above 32 frames, irrelevant images at both ends are deleted, and the key frame in the middle is reserved. For video less than 32 frames, we repeat some frames at a certain rate. The process will loop until the samples exceed 32 frames. Finally, we randomly crop each frame to 224 × 224 and resize it to 112 × 112 pixels.
Step 2: because the 3D convolution has more parameter quantity, the convergence speed is low in model training, and the overfitting phenomenon is easy to occur. In order to solve the problems, the invention adopts the idea of transfer learning and uses a Jester data set to pre-train a lightweight 3D residual error network. The pre-training process is divided into two parts of feature extraction and feature classification, wherein the feature extraction part is a lightweight 3D residual error network, and the feature classification part is a full connection layer and a Softmax layer. During pre-training, the weights for the model are saved in the form of h 5.
Step 3: divide the data set into three parts, a training set, a validation set, and a test set, in the ratio 3:1:1. The training set is used to train the network model; the validation set measures model performance during training, and parameters are generally tuned according to performance on the validation set; the test set evaluates the generalization ability of the model. The training and validation sets of the RGB and Depth picture sequences preprocessed in step 1 are fed into two identical lightweight 3D residual networks for short-term spatiotemporal feature extraction. The lightweight 3D residual network takes 3D-ResNet as its basis and replaces the original 3 × 3 × 3 convolution kernel with a separable convolution that splits the 3D kernel into a 3 × 1 × 1 temporal kernel and a 1 × 3 × 3 spatial kernel, preserving the performance of 3D convolution while reducing its parameter count.
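The parameter saving from this kernel split can be checked with simple arithmetic. The sketch assumes a 3 × 3 × 3 kernel is replaced by a 3 × 1 × 1 kernel followed by a 1 × 3 × 3 kernel with unchanged channel widths; the actual layer widths in the network may differ.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, bias=True):
    """Parameters of one 3D convolution layer: c_out filters of size
    c_in x kt x kh x kw, plus one bias per output channel."""
    return c_out * (c_in * kt * kh * kw) + (c_out if bias else 0)

def full_3x3x3(c_in, c_out):
    """A single full 3x3x3 convolution."""
    return conv3d_params(c_in, c_out, 3, 3, 3)

def separable_3x3x3(c_in, c_out):
    """Separable variant: 3x1x1 temporal convolution, then 1x3x3 spatial."""
    return conv3d_params(c_in, c_out, 3, 1, 1) + conv3d_params(c_out, c_out, 1, 3, 3)
```

For 64 input and 64 output channels this gives 110,656 parameters for the full kernel versus 49,280 for the separable pair, well under half, since 3 + 9 = 12 weights per channel pair replace the original 27.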
Step 4: dynamic gesture recognition identifies a specific gesture class from a series of gesture actions, so features in the time dimension are very important. The invention uses a temporal convolutional network to encode the feature-map sequence output in step 3 and capture the information correlating the video frames within the dynamic gesture. The temporal convolutional network is a recent architecture for time-series prediction; its main features are the use of causal convolution and the mapping of the input sequence to an output sequence of the same length. In addition, to cover inputs far back in the sequence, the model uses dilated convolutions and residual connections, which enlarge the receptive field and allow a deeper network to be trained.
Assume the input sequence of the temporal convolutional network is X = [x_1, ..., x_T] and the output sequence is S = [s_1, ..., s_T]; causality means each output s_t depends only on [x_1, ..., x_t], never on future inputs. The dilated convolution is computed as:

F_d(x_t) = (x *_d h)(t) = Σ_{i=0}^{m−1} h_i · x_{t − d·i} + b_o

where *_d is the dilated convolution operator, d is the dilation coefficient, h is the impulse response of the filter, m denotes the convolution kernel size, h_i is the impulse response of the filter at tap i, and b_o is a bias term. For a TCN with L layers, the output of the last layer s_L is used for sequence classification; the class label of the sequence is assigned through a fully connected layer with a Softmax activation function.
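The causal dilated convolution can be written directly in NumPy as a naive reference implementation (not the invention's optimized network code); zero padding on the left keeps the output the same length as the input.

```python
import numpy as np

def causal_dilated_conv(x, h, d=1, b=0.0):
    """y[t] = sum_i h[i] * x[t - d*i] + b, with x treated as zero for
    negative indices, so y[t] never depends on future inputs."""
    T, m = len(x), len(h)
    y = np.full(T, b, dtype=float)
    for t in range(T):
        for i in range(m):
            if t - d * i >= 0:          # skip taps that fall before the sequence
                y[t] += h[i] * x[t - d * i]
    return y
```

With kernel h = [1, 1] and dilation d = 2, each output sums the current sample and the one two steps back, and editing a later input never changes an earlier output, which demonstrates causality.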
And 5: the Depth image contains motion information and three-dimensional structure information from the Depth channel and is insensitive to illumination variations, clothing, skin tone and other external factors. Therefore, it can be an important complement to the original RGB image. The fusion of the RGB data and the Depth data can more accurately represent the characteristics of the gesture, so that the accuracy of gesture recognition is improved. In addition, it is also important to select an appropriate fusion strategy. The approach of linear aggregation may not be sufficient to provide a strong adaptability of the neurons, and may also produce redundant information. In order to solve the problem, the invention provides a nonlinear combination mode by introducing an attention mechanism. The method can enable the network to dynamically select corresponding information in the whole feature extraction process, and realizes a strategy of weighting and fusing RGB and Depth data. Assume that the sequence of the profile of the output of the RGB branch is SrgbThe output characteristic diagram sequence of the Depth branch is SdepthAnd the sequence of the feature map after the fusion of the two is z, the weighted summation of the two branches is as follows:
wherein α ═ αrgb,αdepth]For the fusion coefficient, the calculation formula is as follows:
whereinDenotes average pooling, FfcConv and AvgPool respectively represent the total connection layer, the convolution layer, the average pooling, Srgb、SdepthSequence of feature maps representing the output of RGB branches, sequence of output feature maps of Depth branches, W, respectively0And W1The convolution weight and full-link layer weight of 1 × 1 × 1 are respectively expressed, β represents batch normalization, and δ represents the ReLU activation function.
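A small NumPy sketch of the attention-weighted fusion follows. The projection matrices `W0` and `W1` and the per-branch score pipeline are illustrative assumptions standing in for the 1 × 1 × 1 convolution and fully connected layer; the invention's exact attention layout may differ.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # shift by max for numerical stability
    return e / e.sum()

def fuse_rgb_depth(S_rgb, S_depth, W0, W1):
    """Average-pool each branch's feature sequence over time, score it via
    ReLU(W0 @ pooled) followed by W1, and fuse the two branches with
    Softmax-normalized coefficients (a convex combination)."""
    scores = []
    for S in (S_rgb, S_depth):
        pooled = S.mean(axis=0)                            # AvgPool over the sequence
        scores.append(W1 @ np.maximum(W0 @ pooled, 0.0))   # projection -> ReLU -> FC score
    alpha = softmax(np.array(scores))                      # alpha_rgb + alpha_depth = 1
    return alpha[0] * S_rgb + alpha[1] * S_depth, alpha
```

Because the coefficients come from a Softmax they always sum to one, so the fused sequence keeps the same shape as each branch and the branch with the stronger pooled response receives the larger weight.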
And 6: weighting the fused information z ═ z in the step 51,...,zT]The input is the full connection layer, the full connection layer multiplies the weight matrix and the input vector, and the offset is added, and n (plus infinity, minus infinity) fractions are output. Softmax maps the n (positive infinity, negative infinity) scores to a probability y of (0, 1). The calculation formula is as follows:
y=Softmax(z)=Softmax(WTz+b) (13)
w represents a weight, b represents a bias term. Softmax is calculated as follows:
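The classification step reduces to one matrix product and a Softmax, as in this minimal sketch; the max-subtraction is a standard numerical-stability trick not stated in the text.

```python
import numpy as np

def classify(z, W, b):
    """Fully connected layer y = Softmax(W^T z + b): maps a fused feature
    vector z to n class probabilities in (0, 1) that sum to 1."""
    scores = W.T @ z + b                  # n unbounded real-valued scores
    e = np.exp(scores - np.max(scores))   # subtract max before exponentiating
    return e / e.sum()                    # normalize to probabilities
```

The predicted gesture class is then simply the index of the largest probability.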
the method illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.
Claims (6)
1. A multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN, characterized by comprising the following steps:
Step 1: sampling each gesture video in the original data set according to its frame rate, generating a corresponding number of pictures, sorting and storing the pictures in temporal order, and unifying the sampled picture sequences in the time dimension with a sliding-window method;
Step 2: pre-training the lightweight 3D residual network with the unified picture sequence from step 1 as input, and saving the model weight file in .h5 form, wherein the weight file saves the model structure, the model weights, the training configuration, and the optimizer state, so that training can resume from where it was last interrupted;
Step 3: loading the weight file from step 2, taking the training and validation sets of the RGB-D picture sequences as input, and learning short-term spatiotemporal features of the gestures in the video with the lightweight 3D residual network;
Step 4: feeding the feature maps output in step 3 into a temporal convolutional network, which encodes the long-term spatiotemporal features of the dynamic gesture;
Step 5: weighting and fusing the spatiotemporal feature information of the RGB and Depth branch networks with an attention mechanism;
Step 6: classifying the feature vectors output in step 5 with a fully connected layer, and mapping the classification result to gesture-class probability values through Softmax;
the step 5: the method for weighting and fusing the space-time characteristic information of the RGB and Depth branch networks by using the attention mechanism specifically comprises the following steps:
the Depth image contains motion information and three-dimensional structure information from a Depth channel and is insensitive to illumination change, clothes, skin color and other external factors, the fusion of RGB data and Depth data is used for accurately representing the characteristics of gestures, a non-linear combination mode is provided by introducing an attention mechanism, the mode can enable a network to dynamically select corresponding information in the whole characteristic extraction process, the strategy of weighting and fusing RGB and Depth data is realized, and the characteristic diagram sequence of the output of RGB branches is assumed as SrgbThe output characteristic diagram sequence of the Depth branch is SdepthAnd the sequence of the feature map after the fusion of the two is z, the weighted summation of the two branches is as follows:
wherein α ═ αrgb,αdepth]For the fusion coefficient, the calculation formula is as follows:
whereinDenotes average pooling, FfcConv and AvgPool respectively represent the total connection layer, the convolution layer, the average pooling, Srgb、SdepthSequence of profiles representing the output of RGB branches, sequence of output profiles of Depth branches, W, respectively0And W1Convolution weights and full-connected layer weights of 1 × 1, respectively, are indicated, β indicates batch normalization, and δ indicates the ReLU activation function.
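The weighted fusion of this claim can be sketched numerically as follows; this is a minimal NumPy illustration, assuming Softmax-normalized fusion coefficients and using a single scoring vector per branch in place of the full AvgPool/Conv/BN/FC stack (function names, shapes, and channel counts are illustrative, not from the patent):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(s_rgb, s_depth, w_rgb, w_depth):
    """Weighted fusion z = a_rgb * S_rgb + a_depth * S_depth.

    s_rgb, s_depth: branch feature maps of shape (C, H, W).
    w_rgb, w_depth: illustrative scoring vectors standing in for the
    1x1-conv / BN / ReLU / FC stack that produces the fusion scores.
    """
    g_rgb = s_rgb.mean(axis=(1, 2))            # AvgPool over H, W -> (C,)
    g_depth = s_depth.mean(axis=(1, 2))
    a_rgb, a_depth = softmax(np.array([w_rgb @ g_rgb, w_depth @ g_depth]))
    return a_rgb * s_rgb + a_depth * s_depth   # convex combination of branches

rng = np.random.default_rng(0)
s_rgb = rng.standard_normal((8, 4, 4))
s_depth = rng.standard_normal((8, 4, 4))
w = rng.standard_normal(8)
z = attention_fuse(s_rgb, s_depth, w, w)
print(z.shape)                                 # (8, 4, 4)
```

Because α_rgb + α_depth = 1, the fused map is a convex combination of the two branch maps, so neither modality can be amplified beyond its own response.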
2. The method according to claim 1, wherein the step 1 of sampling the videos of the original data set, sorting and storing them in temporal order, and unifying the sampled picture sequences in the time dimension by using a window sliding method specifically comprises the following steps:
sampling each gesture video in the data set according to its frame rate to generate a corresponding number of pictures, and sorting and storing the picture sequences in temporal order; to ensure that the input data have the same dimensions, a window sliding method is used to set the input reference frame number of each gesture video, with the reference frame number set to 32: for videos with more than 32 frames, the irrelevant images at both ends are deleted and the key frames in the middle are kept; for videos with fewer than 32 frames, some frames are repeated in a certain proportion, and this process is executed cyclically until the sample reaches 32 frames; finally, each frame is randomly cropped to 224 × 224 and resized to 112 × 112 pixels.
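The window-sliding normalization above can be sketched in a few lines of Python; the function name and the choice of a centred window are our assumptions, while the 32-frame reference number comes from the claim:

```python
def normalize_length(frames, target=32):
    """Unify a sampled frame sequence to a fixed temporal length.

    For clips longer than `target`, keep the central window (dropping
    frames at both ends); for shorter clips, repeat the clip cyclically
    until the target length is reached.
    """
    n = len(frames)
    if n >= target:
        start = (n - target) // 2          # centred sliding window
        return frames[start:start + target]
    reps = -(-target // n)                 # ceil division
    return (frames * reps)[:target]        # loop the clip, then truncate

clip_long = list(range(40))                # stand-in for 40 decoded frames
clip_short = list(range(10))
print(len(normalize_length(clip_long)))    # 32
print(normalize_length(clip_long)[0])      # 4  (centre window of 40 frames)
print(len(normalize_length(clip_short)))   # 32
```

The same helper would run once per video before the random 224 × 224 crop and 112 × 112 resize.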
3. The method according to claim 2, wherein the step 2 of pre-training the lightweight 3D residual network using the normalized picture sequences from step 1 as input and saving the lightweight 3D residual network model weight file in h5 format specifically comprises the following steps:
adopting the idea of transfer learning, the lightweight 3D residual network is pre-trained on the public Jester data set; the pre-training process is divided into two parts, feature extraction and feature classification, wherein the feature extraction part is the lightweight 3D residual network and the feature classification part consists of a fully connected layer and a Softmax layer; during pre-training, the weights of the model are saved in h5 format.
4. The method according to claim 3, wherein the step 3 of loading the weight file from step 2, taking the training set and validation set of the RGB-D picture sequences as input, and learning short-term spatio-temporal features of gestures in the video using the lightweight 3D residual network specifically comprises the following steps:
dividing the data set into a training set, a validation set, and a test set in the ratio 3:1:1, wherein the training set is used to train the network model, the validation set is used to verify the performance of the model during training, with parameters tuned according to the model's performance on the validation set, and the test set is used to evaluate the generalization ability of the model; the training sets and validation sets of the RGB and Depth picture sequences preprocessed in step 1 are respectively input into two identical lightweight 3D residual networks for short-term spatio-temporal feature extraction, wherein the lightweight 3D residual network, on the basis of 3D-ResNet, replaces the original 3 × 3 × 3 3D convolution kernel with a separable convolution that splits the 3D convolution kernel into a 1 × 3 × 3 kernel and a 3 × 1 × 1 kernel.
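As a rough illustration of why this separable convolution makes the network lightweight, reading the split as a 1 × 3 × 3 spatial kernel followed by a 3 × 1 × 1 temporal kernel, a simple parameter count (channel sizes are illustrative, not from the patent):

```python
# Compare parameters of a full 3D kernel against the separable
# (spatial + temporal) factorisation, for illustrative channel counts.
c_in, c_out = 64, 64

full_3d = c_in * c_out * 3 * 3 * 3                  # one 3x3x3 kernel
separable = (c_in * c_out * 1 * 3 * 3               # 1x3x3 spatial kernel
             + c_out * c_out * 3 * 1 * 1)           # 3x1x1 temporal kernel

print(full_3d)              # 110592
print(separable)            # 49152
print(separable < full_3d)  # True: fewer parameters per block
```

With equal channel widths the factorised block uses less than half the parameters of the full 3D kernel, which is what makes the residual network "lightweight" in this design.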
5. The method according to claim 4, wherein the step 4 of inputting the feature maps output in step 3 into a temporal convolutional network and encoding the long-term spatio-temporal features of the dynamic gesture using the temporal convolutional network specifically comprises the following steps:
encoding the feature maps output in step 3 by using a temporal convolutional network to capture the relevant information between video frames in the dynamic gesture, wherein the temporal convolutional network uses causal convolution and maps the input sequence to an output sequence of the same length, and in addition dilated convolutions and residual connections are used to train a deeper network;
assuming the input sequence of the temporal convolutional network is X = [x_1, ..., x_T] and the output is S = [s_1, ..., s_T], then s_t depends only on [x_1, ..., x_t], t ≤ T, since the dilated convolution is calculated as s_t = (x *_d h)(t) = Σ_m h_m · x_{t−d·m},
where *_d denotes the dilated convolution operator, d is the dilation coefficient, h is the impulse response of the filter, m denotes the convolution kernel size, and h_m represents the impulse response of the filter at the m-th tap of the kernel; for a TCN with L layers, the output of the last layer s_L is used for sequence classification, and the class label of the sequence is assigned by a fully connected layer with a Softmax activation function;
where b_o represents a bias term.
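The causal dilated convolution of this claim can be sketched as follows; a minimal NumPy version assuming zero padding for out-of-range taps (the function name is ours):

```python
import numpy as np

def causal_dilated_conv(x, h, d):
    """1-D causal dilated convolution: y[t] = sum_m h[m] * x[t - d*m].

    Each output y[t] depends only on inputs at or before t, matching the
    causality constraint of the TCN; taps before the start of the
    sequence are treated as zero (zero padding).
    """
    T, m = len(x), len(h)
    y = np.zeros(T)
    for t in range(T):
        for k in range(m):
            j = t - d * k                 # dilated tap position
            if j >= 0:
                y[t] += h[k] * x[j]
    return y

x = np.arange(1.0, 9.0)                   # input sequence [1..8]
h = np.array([1.0, 1.0])                  # kernel of size m = 2
print(causal_dilated_conv(x, h, d=1))     # each y[t] = x[t] + x[t-1]
print(causal_dilated_conv(x, h, d=2))     # each y[t] = x[t] + x[t-2]
```

Increasing `d` per layer widens the receptive field exponentially while keeping the output the same length as the input, which is how the TCN captures long-term dependencies.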
6. The method according to claim 5, wherein the step 6 of classifying the feature vectors output in step 5 using a fully connected layer and mapping the classification result to probability values of the gesture classes through Softmax specifically comprises the following steps:
inputting the weighted-fusion information z = [z_1, ..., z_T] from step 5 into a fully connected layer, wherein the fully connected layer multiplies its weight matrix by the input vector and adds a bias, outputting n scores in (−∞, +∞); Softmax maps the n scores to probabilities y in (0, 1), calculated as follows:
y = Softmax(W^T z + b) (6)
where W represents the weight and b represents the bias term, and Softmax is calculated as Softmax(s_i) = e^{s_i} / Σ_{j=1}^{n} e^{s_j};
the fully connected layer classifies the gesture and obtains a score for each category, and Softmax then maps each score to the interval (0, 1), that is, generates the probability value of each gesture category.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011467797.8A CN112507898B (en) | 2020-12-14 | 2020-12-14 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112507898A CN112507898A (en) | 2021-03-16 |
CN112507898B true CN112507898B (en) | 2022-07-01 |
Family
ID=74972911
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011467797.8A Active CN112507898B (en) | 2020-12-14 | 2020-12-14 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112507898B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673280A (en) * | 2020-05-14 | 2021-11-19 | 索尼公司 | Image processing apparatus, image processing method, and computer-readable storage medium |
CN113065451B (en) * | 2021-03-29 | 2022-08-09 | 四川翼飞视科技有限公司 | Multi-mode fused action recognition device and method and storage medium |
CN113095386B (en) * | 2021-03-31 | 2023-10-13 | 华南师范大学 | Gesture recognition method and system based on triaxial acceleration space-time feature fusion |
CN113178073A (en) * | 2021-04-25 | 2021-07-27 | 南京工业大学 | Traffic flow short-term prediction optimization application method based on time convolution network |
CN112926557B (en) * | 2021-05-11 | 2021-09-10 | 北京的卢深视科技有限公司 | Method for training multi-mode face recognition model and multi-mode face recognition method |
CN113239824B (en) * | 2021-05-19 | 2024-04-05 | 北京工业大学 | Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module |
CN113297955B (en) * | 2021-05-21 | 2022-03-18 | 中国矿业大学 | Sign language word recognition method based on multi-mode hierarchical information fusion |
CN113343198B (en) * | 2021-06-23 | 2022-12-16 | 华南理工大学 | Video-based random gesture authentication method and system |
CN113435340B (en) * | 2021-06-29 | 2022-06-10 | 福州大学 | Real-time gesture recognition method based on improved Resnet |
CN113361655B (en) * | 2021-07-12 | 2022-09-27 | 武汉智目智能技术合伙企业(有限合伙) | Differential fiber classification method based on residual error network and characteristic difference fitting |
CN113609923B (en) * | 2021-07-13 | 2022-05-13 | 中国矿业大学 | Attention-based continuous sign language sentence recognition method |
CN113449682B (en) * | 2021-07-15 | 2023-08-08 | 四川九洲电器集团有限责任公司 | Method for identifying radio frequency fingerprints in civil aviation field based on dynamic fusion model |
CN115578683B (en) * | 2022-12-08 | 2023-04-28 | 中国海洋大学 | Construction method of dynamic gesture recognition model and dynamic gesture recognition method |
CN115862144B (en) * | 2022-12-23 | 2023-06-23 | 杭州晨安科技股份有限公司 | Gesture recognition method for camera |
CN115953839B (en) * | 2022-12-26 | 2024-04-12 | 广州紫为云科技有限公司 | Real-time 2D gesture estimation method based on loop architecture and key point regression |
CN117218716B (en) * | 2023-08-10 | 2024-04-09 | 中国矿业大学 | DVS-based automobile cabin gesture recognition system and method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
WO2020181685A1 (en) * | 2019-03-12 | 2020-09-17 | 南京邮电大学 | Vehicle-mounted video target detection method based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298361B (en) * | 2019-05-22 | 2021-05-04 | 杭州未名信科科技有限公司 | Semantic segmentation method and system for RGB-D image |
CN111985369B (en) * | 2020-08-07 | 2021-09-17 | 西北工业大学 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
- 2020-12-14: CN application CN202011467797.8A, patent CN112507898B (status: active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112507898B (en) | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN | |
CN110866140B (en) | Image feature extraction model training method, image searching method and computer equipment | |
CN111210443B (en) | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance | |
CN109543714B (en) | Data feature acquisition method and device, electronic equipment and storage medium | |
CN112396002A (en) | Lightweight remote sensing target detection method based on SE-YOLOv3 | |
WO2021022521A1 (en) | Method for processing data, and method and device for training neural network model | |
Mungra et al. | PRATIT: a CNN-based emotion recognition system using histogram equalization and data augmentation | |
WO2021057056A1 (en) | Neural architecture search method, image processing method and device, and storage medium | |
CN113255443B (en) | Graph annotation meaning network time sequence action positioning method based on pyramid structure | |
CN111639544A (en) | Expression recognition method based on multi-branch cross-connection convolutional neural network | |
US11908457B2 (en) | Orthogonally constrained multi-head attention for speech tasks | |
CN109885709A (en) | A kind of image search method, device and storage medium based on from the pre- dimensionality reduction of coding | |
Sharma et al. | Deep eigen space based ASL recognition system | |
CN112307982A (en) | Human behavior recognition method based on staggered attention-enhancing network | |
US20220101539A1 (en) | Sparse optical flow estimation | |
Li et al. | Robustness comparison between the capsule network and the convolutional network for facial expression recognition | |
CN113158815A (en) | Unsupervised pedestrian re-identification method, system and computer readable medium | |
Wang et al. | A pseudoinverse incremental algorithm for fast training deep neural networks with application to spectra pattern recognition | |
CN113780249B (en) | Expression recognition model processing method, device, equipment, medium and program product | |
Gkalelis et al. | Objectgraphs: Using objects and a graph convolutional network for the bottom-up recognition and explanation of events in video | |
CN116863194A (en) | Foot ulcer image classification method, system, equipment and medium | |
US20230072445A1 (en) | Self-supervised video representation learning by exploring spatiotemporal continuity | |
CN112016592B (en) | Domain adaptive semantic segmentation method and device based on cross domain category perception | |
CN111079900B (en) | Image processing method and device based on self-adaptive connection neural network | |
CN110347853B (en) | Image hash code generation method based on recurrent neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||