CN111104929B - Multi-modal dynamic gesture recognition method based on 3D convolution and SPP - Google Patents


Info

Publication number
CN111104929B
CN111104929B (application CN201911423353.1A)
Authority
CN
China
Prior art keywords
sequence sample
convolution
sequence
sample
optical flow
Prior art date
Legal status
Active
Application number
CN201911423353.1A
Other languages
Chinese (zh)
Other versions
CN111104929A (en
Inventor
彭永坚
汪壮雄
许冰媛
周智恒
彭明
朱湘军
Current Assignee
GUANGZHOU VIDEO-STAR ELECTRONICS CO LTD
South China University of Technology SCUT
Original Assignee
GUANGZHOU VIDEO-STAR ELECTRONICS CO LTD
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by GUANGZHOU VIDEO-STAR ELECTRONICS CO LTD and South China University of Technology SCUT
Priority to CN201911423353.1A
Publication of CN111104929A
Application granted
Publication of CN111104929B

Classifications

    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06T7/269: Analysis of motion using gradient-based methods
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]


Abstract

The invention discloses a multi-modal dynamic gesture recognition method based on 3D convolution and SPP, which comprises the following steps: data preprocessing, in which optical flow features and grayscale features are extracted from RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and each optical flow, grayscale and depth sequence sample is normalized to 32 frames, giving each sample a dimension of 32×112×112; data augmentation, in which the sequence sample dataset is enlarged by translation, flipping, noise addition and affine transformation; neural network training, in which the grayscale, optical flow and depth sequence samples are fed into the same network structure and three networks are trained separately to classify gestures; and model integration, in which the classification results of the three networks on a sequence sample are fused to obtain the final decision. The technical scheme of the invention improves the accuracy of gesture recognition.

Description

Multi-modal dynamic gesture recognition method based on 3D convolution and SPP
Technical Field
The invention relates to the technical field of image recognition, and in particular to a multi-modal dynamic gesture recognition method based on 3D convolution and SPP.
Background
Gestures are an important mode of human-computer interaction, and gesture recognition uses a computer to recognize the gestures a person makes. Gesture recognition comprises static gesture recognition and dynamic gesture recognition. Static gesture recognition focuses on the hand shape in a single frame of an image and is relatively simple; dynamic gesture recognition focuses not only on hand shape but also on the trajectory and shape changes of the gesture in the spatio-temporal dimensions. Because of the diversity and variability of dynamic gestures, their recognition accuracy is still low, making this a challenging research direction in the field of artificial intelligence.
With the development of deep learning, dynamic gesture recognition with deep convolutional neural networks has attracted increasing attention from researchers. However, when an ordinary 2D convolutional neural network processes a video image sequence, information about the target in the time dimension is easily lost, and the changes of the target across the spatio-temporal dimensions cannot be extracted effectively, which limits the recognition accuracy of the network. Feature learning in the spatio-temporal dimensions of video is therefore the key to recognizing human dynamic gestures.
Disclosure of Invention
In order to solve the above technical problems, an embodiment of the present invention provides a multi-modal dynamic gesture recognition method based on 3D convolution and SPP, comprising:
a data preprocessing step: extracting optical flow features and grayscale features from RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and normalizing each optical flow, grayscale and depth sequence sample to 32 frames, so that each sample has dimension 32×112×112;
a data augmentation step: enlarging the sequence sample dataset by translation, flipping, noise addition and affine transformation;
a neural network training step: feeding the grayscale, optical flow and depth sequence samples into the same network structure and training three networks separately to classify gestures;
and a model integration step: fusing the classification results of the three networks on a sequence sample to obtain the final classification result.
Preferably, the data preprocessing step comprises the following steps:
extracting optical flow features from the 1080 RGB video sequences in the SKIG dataset using the iDT algorithm, obtaining 1080 optical flow sequence samples;
converting each frame of the RGB video sequences to grayscale, obtaining 1080 grayscale sequence samples;
since different gesture sequence samples have different durations, normalizing each sequence sample to a fixed 32 frames by nearest-neighbour frame repetition or frame dropping, each frame having dimension 112×112, as the input to the neural network.
Preferably, the iDT algorithm is as follows:
the iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, the later frame being obtained by projectively transforming the earlier frame;
feature matching between adjacent frames combines SURF features with dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
Preferably, the data augmentation step is as follows:
the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way, the transformations comprising:
a translation operation: the pixel (x, y) on each channel of each sequence sample is translated by Δx units along the x-axis and Δy units along the y-axis, i.e. (x', y') = (x+Δx, y+Δy), where Δx is any integer in [−0.1×w, 0.1×w], Δy is any integer in [−0.1×h, 0.1×h], w is the width of each frame image, and h is the height of each frame image;
a flipping operation: the data of each channel of each sequence sample is mirrored horizontally and mirrored vertically;
a noise-addition operation: white Gaussian noise following a Gaussian distribution with mean 0 and variance 0.1 is added to the data of each channel of each sequence sample;
an affine transformation: the data of each channel of each sequence sample is rotated by a set angle from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
Preferably, the neural network training step comprises the following steps:
the grayscale, optical flow and depth sequence samples corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify gestures; specifically, the optical flow sequence samples train a first neural network, the grayscale sequence samples train a second neural network, and the depth sequence samples train a third neural network;
each neural network consists of a 3D convolutional neural network, an SPP layer and fully-connected layers: the 3D convolutional neural network extracts the spatio-temporal features of the gesture, the SPP layer then extracts global and local features, and two fully-connected layers plus softmax produce the gesture classification scores.
Preferably, the 3D convolutional neural network comprises 5 convolution layers;
each convolution layer comprises a convolution operation followed by a pooling operation; the convolution operations use 3×3×3 kernels with stride 1×1×1;
the first, second and third convolution operations comprise 64, 128 and 256 kernels respectively, each followed by a BN layer and a ReLU activation; the first pooling operation uses a 1×2×2 window with stride 2×2×2, and the second and third pooling operations use 2×2×2 windows with stride 2×2×2;
the fourth and fifth convolution operations each comprise 512 kernels; the fourth and fifth pooling operations use 2×2×2 windows with stride 2×1×1; and the first to fifth pooling operations all use mean pooling.
As a preferred scheme, the SPP network applies spatial pyramid pooling at different scales to the feature maps produced by the 3D convolutional neural network, yielding a (16+4+1)×512-dimensional feature vector; this vector is fed into two fully-connected layers of 1024 neurons each, and the result is passed to a softmax layer to obtain the scores of the 10 gesture classes.
Preferably, the model integration multiplies element-wise the gesture classification scores produced by the three networks for a sequence sample, and assigns the sample to the gesture class with the highest score.
Compared with the prior art, the invention has the following advantages and effects:
(1) The method performs data augmentation with translation, flipping, noise addition and affine transformation, which improves the generalization ability of the gesture classification model;
(2) The method feeds sequence samples into a 3D convolutional neural network to extract spatio-temporal features simultaneously, and uses the SPP network to extract local and global features, achieving dynamic gesture recognition with high accuracy;
(3) The method takes multi-modal sequence samples as input, trains three gesture classifiers separately, and improves the recognition accuracy of the gesture recognition system through model integration.
Drawings
FIG. 1 is a general flow chart of the disclosed multi-modal dynamic gesture recognition method based on 3D convolution and SPP;
FIG. 2 is a schematic diagram of the neural network structure in the disclosed multi-modal dynamic gesture recognition method based on 3D convolution and SPP.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from the embodiments of the invention without inventive effort fall within the scope of the invention.
Embodiment one:
the dataset SKIG used in this embodiment contains 2160 gesture video sequences, of which there are 1080 RGB video sequences and 1080 depth video sequences, all captured simultaneously by the Kinect sensor, containing 10 types of gestures.
As shown in FIG. 1, the multi-modal dynamic gesture recognition method based on 3D convolution and SPP comprises, in order, the following steps: a data preprocessing step, a data augmentation step, a neural network training step and a model integration step.
In the data preprocessing step, optical flow features are extracted from the 1080 RGB video sequences in the SKIG dataset using the iDT algorithm, yielding 1080 optical flow sequence samples. Each frame of the RGB video sequences is converted to grayscale, yielding 1080 grayscale sequence samples. Different gesture sequence samples have different durations, so each sequence sample is normalized to a fixed 32 frames by nearest-neighbour frame repetition or frame dropping; each frame has dimension 112×112, so every sequence sample has dimension 32×112×112 and serves as the input to the neural network.
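A minimal NumPy sketch of this nearest-neighbour temporal normalization (the function name and the demo clip are illustrative, not from the patent):

```python
import numpy as np

def normalize_sequence(frames: np.ndarray, target_len: int = 32) -> np.ndarray:
    """Resample a (T, 112, 112) sequence to exactly target_len frames by
    nearest-neighbour index mapping: short clips repeat frames, long
    clips drop frames."""
    t = frames.shape[0]
    # Map each of the target positions to its nearest source frame index.
    idx = np.round(np.linspace(0, t - 1, target_len)).astype(int)
    return frames[idx]

# Example: a 47-frame grayscale clip becomes a 32x112x112 sample.
clip = np.random.rand(47, 112, 112).astype(np.float32)
assert normalize_sequence(clip).shape == (32, 112, 112)
```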
The iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, i.e. the later frame can be obtained by projectively transforming the earlier frame; this exploits the fact that the change between two adjacent frames is relatively small. Feature matching between adjacent frames combines SURF features with dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
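A sketch of this matching-and-estimation step with OpenCV is given below. It is an illustration under assumptions, not the patent's implementation: ORB stands in for SURF (which requires the non-free opencv-contrib build), Farneback stands in for iDT's dense flow, and all parameter values are arbitrary.

```python
import cv2
import numpy as np

def estimate_projective_transform(prev_gray: np.ndarray,
                                  curr_gray: np.ndarray) -> np.ndarray:
    """Match keypoints between two adjacent frames and estimate the 3x3
    projective transformation (homography) with RANSAC."""
    detector = cv2.ORB_create(nfeatures=1000)   # stand-in for SURF
    kp1, des1 = detector.detectAndCompute(prev_gray, None)
    kp2, des2 = detector.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

def dense_flow(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Dense optical flow between adjacent frames (Farneback variant)."""
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```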
In the data augmentation step, the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way to enlarge the sequence sample dataset. The transformations are as follows (a combined code sketch is given after this list):
the translation operation is as follows:
the pixel point (x, y) on each channel of each sequence sample is translated by Δx units along the x-axis and by Δy units along the y-axis, i.e., (x ', y') = (x+Δx, y+Δy). Wherein Deltax is any integer of [ -0.1 xw, 0.1 xw ], deltay is any integer of [ -0.1 xh, 0.1 xh ], w is the corresponding width of each frame image, and h is the corresponding length of each frame image.
The flipping operation is as follows:
the data of each channel of each sequence sample is mirrored horizontally and mirrored vertically.
The noise-addition operation is as follows:
white Gaussian noise following a Gaussian distribution with mean 0 and variance 0.1 is added to the data of each channel of each sequence sample.
The affine transformation operation is as follows:
the data of each channel of each sequence sample is rotated by a set angle from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
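A NumPy/SciPy sketch of the four augmentations applied to one sequence sample; the function name and the random-parameter handling are illustrative assumptions, and in practice the same parameters must be reused for the optical flow, grayscale and depth samples of one gesture:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(seq: np.ndarray, rng: np.random.Generator) -> list:
    """Return augmented copies of a (32, H, W) sequence sample."""
    _, h, w = seq.shape
    out = []
    # Translation: dx in [-0.1*w, 0.1*w], dy in [-0.1*h, 0.1*h].
    dx = int(rng.integers(int(-0.1 * w), int(0.1 * w) + 1))
    dy = int(rng.integers(int(-0.1 * h), int(0.1 * h) + 1))
    out.append(shift(seq, (0, dy, dx), order=0, mode='constant'))
    # Flips: horizontal and vertical mirror.
    out.append(seq[:, :, ::-1].copy())
    out.append(seq[:, ::-1, :].copy())
    # Additive white Gaussian noise with mean 0 and variance 0.1.
    out.append(seq + rng.normal(0.0, np.sqrt(0.1), seq.shape))
    # Rotations by the set angles (0 degrees is the identity, so skipped).
    for angle in (45, 90, 135, 180, 225, 270, 315):
        out.append(rotate(seq, angle, axes=(1, 2), reshape=False, order=1))
    return out

samples = augment(np.random.rand(32, 112, 112), np.random.default_rng(0))
```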
In the neural network training step, the grayscale, optical flow and depth sequence samples corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify gestures. Specifically, the optical flow sequence samples train a first neural network, the grayscale sequence samples train a second neural network, and the depth sequence samples train a third neural network.
In the model integration step, the gesture classification scores produced by the three networks for a sequence sample are multiplied element-wise, and the sample is assigned to the gesture class with the highest fused score.
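In code, the integration step reduces to an element-wise product followed by an argmax; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def fuse_and_classify(p_flow: np.ndarray, p_gray: np.ndarray,
                      p_depth: np.ndarray) -> int:
    """Multiply the three networks' 10-way softmax scores element-wise
    and return the index of the gesture class with the highest score."""
    fused = p_flow * p_gray * p_depth
    return int(np.argmax(fused))
```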
As shown in FIG. 2, the neural network consists of a 3D convolutional neural network, an SPP layer and fully-connected layers: the 3D convolutional neural network extracts the spatio-temporal features of the gesture, the SPP layer then extracts global and local features, and two fully-connected layers plus softmax produce the gesture classification scores.
The 3D convolutional neural network comprises 5 convolution layers; each convolution layer consists of a convolution operation followed by a pooling operation, the convolution operations using 3×3×3 kernels with stride 1×1×1.
The first convolution operation C1, the second convolution operation C2 and the third convolution operation C3 comprise 64, 128 and 256 kernels respectively, each followed by a BN layer and a ReLU activation; the first pooling operation P1 uses a 1×2×2 window with stride 2×2×2, and the second and third pooling operations P2 and P3 use 2×2×2 windows with stride 2×2×2.
The fourth convolution operation C4 and the fifth convolution operation C5 each comprise 512 kernels; the fourth and fifth pooling operations P4 and P5 use 2×2×2 windows with stride 2×1×1; and the pooling operations P1 to P5 all use mean pooling.
The SPP network applies spatial pyramid pooling at different scales to the feature maps produced by the 3D convolutional neural network, yielding a (16+4+1)×512-dimensional feature vector. This vector is fed into two fully-connected layers of 1024 neurons each, and the result is passed to a softmax layer to obtain the scores of the 10 gesture classes.
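A PyTorch sketch of this architecture follows. It is a reconstruction under assumptions: the translated text is ambiguous about the third dimension of the pooling windows and strides, so the values below follow the A×B×B pattern used elsewhere in the document; single-channel input is assumed (an optical-flow network would use more channels); and the pyramid collapses time via adaptive pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv3DSPPNet(nn.Module):
    """5-layer 3D CNN + spatial pyramid pooling + 2 FC layers + softmax."""

    def __init__(self, num_classes: int = 10, in_channels: int = 1):
        super().__init__()
        chans = [in_channels, 64, 128, 256, 512, 512]
        # (pooling window, pooling stride) per layer; reconstructed values.
        pools = [((1, 2, 2), (2, 2, 2)),
                 ((2, 2, 2), (2, 2, 2)),
                 ((2, 2, 2), (2, 2, 2)),
                 ((2, 2, 2), (2, 1, 1)),
                 ((2, 2, 2), (2, 1, 1))]
        layers = []
        for (cin, cout), (win, stride) in zip(zip(chans, chans[1:]), pools):
            layers += [nn.Conv3d(cin, cout, kernel_size=3, stride=1, padding=1),
                       nn.BatchNorm3d(cout), nn.ReLU(inplace=True),
                       nn.AvgPool3d(win, stride)]      # mean pooling
        self.features = nn.Sequential(*layers)
        self.fc = nn.Sequential(
            nn.Linear((16 + 4 + 1) * 512, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes))

    def spp(self, x: torch.Tensor) -> torch.Tensor:
        # Pyramid of 4x4, 2x2 and 1x1 spatial bins with time collapsed,
        # giving (16 + 4 + 1) * 512 features per sample.
        feats = [F.adaptive_avg_pool3d(x, (1, k, k)).flatten(1)
                 for k in (4, 2, 1)]
        return torch.cat(feats, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)            # (N, 512, T', H', W')
        x = self.spp(x)                 # (N, 21 * 512)
        return F.softmax(self.fc(x), dim=1)

# One sample: batch 1, 1 channel, 32 frames of 112x112 -> 10 class scores.
scores = Conv3DSPPNet()(torch.randn(1, 1, 32, 112, 112))
assert scores.shape == (1, 10)
```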
In summary, this embodiment discloses a multi-modal dynamic gesture recognition method based on 3D convolution and SPP. Data augmentation by translation, flipping, noise addition and affine transformation improves the generalization ability of the gesture classification model. Feeding the sequence samples into a 3D convolutional neural network extracts spatio-temporal features simultaneously, and the SPP network extracts local and global features, achieving dynamic gesture recognition with high accuracy. In addition, the method takes multi-modal sequence samples as input, trains three gesture classifiers separately, and improves the recognition accuracy of the gesture recognition system through model integration.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (5)

1. A multi-modal dynamic gesture recognition method based on 3D convolution and SPP, characterized by comprising the following steps:
a data preprocessing step: extracting optical flow features and grayscale features from RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and normalizing each optical flow, grayscale and depth sequence sample to 32 frames so that each sample has dimension 32×112×112, specifically: extracting optical flow features from the 1080 RGB video sequences in the SKIG dataset using the iDT algorithm to obtain 1080 optical flow sequence samples; converting each frame of the RGB video sequences to grayscale to obtain 1080 grayscale sequence samples; and, since different gesture sequence samples have different durations, normalizing each sequence sample to a fixed 32 frames by nearest-neighbour frame repetition or frame dropping, each frame having dimension 112×112, as the input to the neural network;
wherein the iDT algorithm is as follows: the iDT algorithm assumes that the relationship between two adjacent frames is described by a projective transformation matrix, the later frame being obtained by projectively transforming the earlier frame; feature matching between adjacent frames combines SURF features with dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm;
a data augmentation step: enlarging the sequence sample dataset by translation, flipping, noise addition and affine transformation;
a neural network training step: feeding the grayscale, optical flow and depth sequence samples into the same network structure and training three networks separately to classify gestures, specifically: feeding the grayscale, optical flow and depth sequence samples corresponding to the same gesture into the same network structure and training three neural networks separately, the optical flow sequence samples training a first neural network, the grayscale sequence samples training a second neural network, and the depth sequence samples training a third neural network; each neural network consisting of a 3D convolutional neural network, an SPP layer and fully-connected layers, the 3D convolutional neural network extracting the spatio-temporal features of the gesture, the SPP layer then extracting global and local features, and two fully-connected layers plus softmax producing the gesture classification scores;
and a model integration step: fusing the classification results of the three networks on a sequence sample to obtain the final classification result.
2. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP of claim 1, characterized in that the data augmentation step is performed as follows:
the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way, the transformations comprising:
a translation operation: the pixel (x, y) on each channel of each sequence sample is translated by Δx units along the x-axis and Δy units along the y-axis, i.e. (x', y') = (x+Δx, y+Δy), wherein Δx is any integer in [−0.1×w, 0.1×w], Δy is any integer in [−0.1×h, 0.1×h], w is the width of each frame image, and h is the height of each frame image;
a flipping operation: the data of each channel of each sequence sample is mirrored horizontally and mirrored vertically;
a noise-addition operation: white Gaussian noise following a Gaussian distribution with mean 0 and variance 0.1 is added to the data of each channel of each sequence sample;
an affine transformation: the data of each channel of each sequence sample is rotated by a set angle from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
3. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP of claim 1, characterized in that the 3D convolutional neural network comprises 5 convolution layers;
each convolution layer comprises a convolution operation followed by a pooling operation, the convolution operations using 3×3×3 kernels with stride 1×1×1;
the first, second and third convolution operations comprise 64, 128 and 256 kernels respectively, each followed by a BN layer and a ReLU activation; the first pooling operation uses a 1×2×2 window with stride 2×2×2, and the second and third pooling operations use 2×2×2 windows with stride 2×2×2;
the fourth and fifth convolution operations each comprise 512 kernels; the fourth and fifth pooling operations use 2×2×2 windows with stride 2×1×1; and the first to fifth pooling operations all use mean pooling.
4. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP of claim 1, characterized in that the SPP network applies spatial pyramid pooling at different scales to the feature maps produced by the 3D convolutional neural network to obtain a (16+4+1)×512-dimensional feature vector, the feature vector is fed into two fully-connected layers of 1024 neurons each, and the result is passed to a softmax layer to obtain the scores of the 10 gesture classes.
5. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP of claim 1, characterized in that the model integration multiplies element-wise the gesture classification scores produced by the three networks for a sequence sample and assigns the sample to the gesture class with the highest score.
CN201911423353.1A 2019-12-31 2019-12-31 Multi-modal dynamic gesture recognition method based on 3D convolution and SPP Active CN111104929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911423353.1A CN111104929B (en) 2019-12-31 2019-12-31 Multi-modal dynamic gesture recognition method based on 3D convolution and SPP


Publications (2)

Publication Number | Publication Date
CN111104929A (en) | 2020-05-05
CN111104929B (en) | 2023-05-09

Family

ID=70426599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911423353.1A Active CN111104929B (en) 2019-12-31 2019-12-31 Multi-modal dynamic gesture recognition method based on 3D convolution and SPP

Country Status (1)

Country Link
CN (1) CN111104929B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239824B (en) * 2021-05-19 2024-04-05 北京工业大学 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8873813B2 (en) * 2012-09-17 2014-10-28 Z Advanced Computing, Inc. Application of Z-webs and Z-factors to analytics, search engine, learning, recognition, natural language, and other utilities
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN109871781A (en) * 2019-01-28 2019-06-11 山东大学 Dynamic gesture identification method and system based on multi-modal 3D convolutional neural networks
CN109919057A (en) * 2019-02-26 2019-06-21 北京理工大学 A kind of multi-modal fusion gesture identification method based on efficient convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯翔; 吴瀚; 司冰灵; 季超. Gesture image recognition method based on a convolutional neural network with an embedded-network fusion structure. Biomedical Engineering Research, 2019(04): 410-414, 425. *
曹钰. A review of deep convolutional neural networks based on region information. Electronics World, 2017(06): 32, 36. *

Also Published As

Publication number Publication date
CN111104929A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
Taskiran et al. A real-time system for recognition of American sign language by using deep learning
CN107038448B (en) Target detection model construction method
Yeh et al. Multi-scale deep residual learning-based single image haze removal via image decomposition
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
CN110796080B (en) Multi-pose pedestrian image synthesis algorithm based on generation countermeasure network
WO2019071433A1 (en) Method, system and apparatus for pattern recognition
CN110766020A (en) System and method for detecting and identifying multi-language natural scene text
Cao et al. Robust vehicle detection by combining deep features with exemplar classification
US20210256707A1 (en) Learning to Segment via Cut-and-Paste
Liang et al. MAFNet: Multi-style attention fusion network for salient object detection
CN110969089A (en) Lightweight face recognition system and recognition method under noise environment
CN114821764A (en) Gesture image recognition method and system based on KCF tracking detection
Avola et al. 3D hand pose and shape estimation from RGB images for keypoint-based hand gesture recognition
CN111260577B (en) Face image restoration system based on multi-guide image and self-adaptive feature fusion
Mali et al. Indian sign language recognition using SVM classifier
CN111104929B (en) Multi-modal dynamic gesture recognition method based on 3D convolution and SPP
Zhang et al. Infrared ship target segmentation based on adversarial domain adaptation
Rao et al. Sign Language Recognition using LSTM and Media Pipe
CN116935044A (en) Endoscopic polyp segmentation method with multi-scale guidance and multi-level supervision
Ma et al. LAYN: Lightweight Multi-Scale Attention YOLOv8 Network for Small Object Detection
Kralevska et al. Real-time Macedonian Sign Language Recognition System by using Transfer Learning
Assaleh et al. Recognition of handwritten Arabic alphabet via hand motion tracking
CN111160078B (en) Human interaction behavior recognition method, system and device based on video image
Kumar et al. Computer vision based Hand gesture recognition system
Deshpande et al. Sign Language Recognition System using CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant