CN111104929A - Multi-modal dynamic gesture recognition method based on 3D convolution and spatial pyramid pooling (SPP) - Google Patents

Multi-modal dynamic gesture recognition method based on 3D convolution and spatial pyramid pooling (SPP)

Info

Publication number
CN111104929A
CN111104929A (application CN201911423353.1A)
Authority
CN
China
Prior art keywords
sequence
sequence sample
sample
optical flow
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911423353.1A
Other languages
Chinese (zh)
Other versions
CN111104929B (en)
Inventor
彭永坚
汪壮雄
许冰媛
周智恒
彭明
朱湘军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Video Star Intelligent Technology Co ltd
South China University of Technology SCUT
Original Assignee
Guangzhou Video Star Intelligent Technology Co ltd
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Video Star Intelligent Technology Co ltd, South China University of Technology SCUT filed Critical Guangzhou Video Star Intelligent Technology Co ltd
Priority to CN201911423353.1A priority Critical patent/CN111104929B/en
Publication of CN111104929A publication Critical patent/CN111104929A/en
Application granted granted Critical
Publication of CN111104929B publication Critical patent/CN111104929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal dynamic gesture recognition method based on 3D convolution and SPP, which comprises the following steps: data preprocessing, in which optical flow features and grayscale features are extracted from the RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and each optical flow, grayscale and depth sequence sample is normalized to 32 frames, so that the dimensionality of each sample is 32 × 112 × 112; data enhancement, in which the sequence sample data set is enlarged through translation, flipping, noise addition and affine transformation; neural network training, in which the grayscale, optical flow and depth sequence samples are fed into the same network structure and three networks are trained separately to classify gestures; and model integration, in which the three networks' classification results for a sequence sample are fused to obtain the final decision. The technical scheme of the invention can improve the accuracy of gesture recognition.

Description

Multi-modal dynamic gesture recognition method based on 3D convolution and SPP
Technical Field
The invention relates to the technical field of image recognition, and in particular to a multi-modal dynamic gesture recognition method based on 3D convolution and SPP.
Background
Gestures are one of the important modes of human-computer interaction, and gesture recognition uses a computer to recognize the gesture actions a person makes. Gesture recognition comprises static gesture recognition and dynamic gesture recognition. Static gesture recognition focuses on the hand shape in a single frame and is relatively simple. Dynamic gesture recognition focuses not only on hand shape but also on the trajectory and shape changes of the gesture in the spatiotemporal dimensions. Because dynamic gestures are diverse and vary across people, their recognition accuracy is still low, making this a challenging research direction in the field of artificial intelligence.
With the development of deep learning, dynamic gesture recognition using deep convolutional neural networks has attracted the attention of researchers. However, when an ordinary 2D convolutional neural network processes a video image sequence, information about the target in the time dimension is easily lost, and the target's variation in the spatiotemporal dimensions cannot be extracted effectively, which harms the network's recognition accuracy. Feature learning over the video's spatiotemporal dimensions is therefore the key to dynamic gesture recognition.
Disclosure of Invention
In order to solve the above technical problem, an embodiment of the present invention provides a multi-modal dynamic gesture recognition method based on 3D convolution and SPP, comprising:
a data preprocessing step, in which optical flow features and grayscale features are extracted from the RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and each optical flow, grayscale and depth sequence sample is normalized to 32 frames, so that the dimensionality of each sample is 32 × 112 × 112;
a data enhancement step, in which the sequence sample data set is enlarged through translation, flipping, noise addition and affine transformation;
a neural network training step, in which the grayscale sequence samples, optical flow sequence samples and depth sequence samples are fed into the same network structure and three networks are trained separately to classify gestures;
and a model integration step, in which the three networks' classification results for a sequence sample are fused to obtain the final decision.
Preferably, the data preprocessing step proceeds as follows:
optical flow features are extracted from the 1080 RGB video sequences in the SKIG data set using the improved dense trajectories (iDT) algorithm, yielding 1080 optical flow sequence samples;
each frame of the RGB video sequences is converted to grayscale, yielding 1080 grayscale sequence samples;
because different gesture sequence samples have different durations, each sequence sample is normalized to a fixed 32 frames by repeating or discarding nearest-neighbour frames, and each frame is resized to 112 × 112 to serve as the input to the neural network.
Preferably, the iDT algorithm is as follows:
the iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, so that the later frame is obtained from the earlier frame by a projective transformation;
feature matching between two adjacent frames is performed using SURF features and dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
Preferably, the data enhancement step process is as follows:
and carrying out transformation in the same way on the optical flow sequence sample, the gray level sequence sample and the depth sequence sample corresponding to the same gesture, wherein the transformation way comprises the following steps:
the shift operation is to shift the pixel point (x, y) on each channel of each sequence sample by Δ x units along the x-axis and Δ y units along the y-axis, i.e., (x ', y') (x + Δ x, y + Δ y). Wherein Δ x is any integer of [ -0.1 × w,0.1 × w ], Δ y is any integer of [ -0.1 × h,0.1 × h ], w is the corresponding width of each frame of image, and h is the corresponding length of each frame of image;
the turning operation comprises the following steps of carrying out mirror image horizontal turning and mirror image up-down turning on the data of each channel of each sequence sample;
adding white Gaussian noise to the data of each channel of each sequence sample, wherein the added noise follows Gaussian distribution with the mean value of 0 and the variance of 0.1;
the affine transformation operates by performing a set angular rotation of the data for each channel of each sequence sample, including 0 °, 45 °, 90 °, 135 °, 180 °, 225 °, 270 °, 315 °.
Preferably, the neural network training step process is as follows:
the grayscale sequence sample, optical flow sequence sample and depth sequence sample corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify the gesture; specifically, a first neural network is trained on the optical flow sequence samples, a second neural network on the grayscale sequence samples, and a third neural network on the depth sequence samples;
the neural network consists of a 3D convolutional neural network, an SPP module and fully connected layers; the 3D convolutional neural network extracts the spatial and temporal features of the gesture simultaneously, the SPP module then extracts global and local features, and the gesture classification scores are obtained by passing the result through two fully connected layers and a softmax.
Preferably, the 3D convolutional neural network comprises 5 convolutional layers;
each convolutional layer comprises a convolution operation and a pooling operation; the convolution operations use 3 × 3 × 3 kernels with stride 1 × 1 × 1;
the first, second and third convolution operations comprise 64, 128 and 256 convolution kernels respectively, each followed by a BN layer and a ReLU activation function; the pooling window of the first pooling operation is 1 × 2 × 2 with stride 2 × 2 × 2, and the pooling windows of the second and third pooling operations are 2 × 2 × 2 with stride 2 × 2 × 2;
the fourth and fifth convolution operations both comprise 512 convolution kernels; the pooling windows of the fourth and fifth pooling operations are 2 × 2 × 2 with stride 2 × 1 × 1; the first through fifth pooling operations all use average (mean) pooling.
As a preferred scheme, the SPP network performs spatial pyramid pooling at different scales on the feature map produced by the 3D convolutional neural network to obtain a (16+4+1) × 512-dimensional feature vector; this vector is fed into two fully connected layers with 1024 neurons, and the result is input to the softmax layer to obtain the scores of the 10 gesture classes.
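The pyramid pooling step can be sketched as follows in PyTorch. The 4 × 4, 2 × 2 and 1 × 1 bin sizes are inferred from the (16+4+1) × 512 figure above, and averaging over the remaining time axis before pooling is an assumption of this sketch rather than something stated here.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SpatialPyramidPool(nn.Module):
    """Pool the final 3D feature map on 4x4, 2x2 and 1x1 spatial grids."""

    def __init__(self, bins=(4, 2, 1)):
        super().__init__()
        self.bins = bins

    def forward(self, x):               # x: (B, 512, T, H, W)
        x = x.mean(dim=2)               # collapse the residual time axis (assumption)
        # average pooling at each pyramid scale, then flatten and concatenate
        feats = [F.adaptive_avg_pool2d(x, b).flatten(1) for b in self.bins]
        return torch.cat(feats, dim=1)  # (B, (16 + 4 + 1) * 512)
```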
As a preferred scheme, the model integration multiplies, class by class, the gesture classification scores produced by the three networks for a sequence sample and assigns the sample to the gesture class with the highest combined score.
Compared with the prior art, the invention has the following advantages and effects:
(1) the invention performs data augmentation using translation, flipping, noise addition and affine transformation, which improves the generalization ability of the gesture classification model;
(2) the invention feeds sequence samples into a 3D convolutional neural network to extract spatial and temporal features simultaneously, and uses the SPP network to extract local and global features, achieving high-accuracy dynamic gesture recognition;
(3) the method takes multi-modal sequence samples as input, trains three gesture classifiers separately, and improves the recognition accuracy of the gesture recognition system through model integration.
Drawings
FIG. 1 is a general flow diagram of the disclosed multi-modal dynamic gesture recognition method based on 3D convolution and SPP;
FIG. 2 is a schematic diagram of the neural network structure in the multi-modal dynamic gesture recognition method based on 3D convolution and SPP.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment is as follows:
the data set, skeg, used in this embodiment contains 2160 gesture video sequences, among which 1080 RGB video sequences and 1080 depth video sequences, all of which are captured by the Kinect sensor at the same time, including 10 types of gestures.
As shown in FIG. 1, the multi-modal dynamic gesture recognition method based on 3D convolution and SPP comprises the following steps in order: data preprocessing, data enhancement, neural network training and model integration.
In the data preprocessing step, optical flow features are extracted from the 1080 RGB video sequences in the SKIG data set using the iDT algorithm, yielding 1080 optical flow sequence samples. Each frame of the RGB video sequences is converted to grayscale, yielding 1080 grayscale sequence samples. Because different gesture sequence samples have different durations, each sequence sample is normalized to a fixed 32 frames by repeating or discarding nearest-neighbour frames, and each frame is resized to 112 × 112, so that each sequence sample has dimensions 32 × 112 × 112 and serves as the input to the neural network.
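A minimal sketch of this preprocessing in Python with OpenCV and NumPy is given below. The nearest-neighbour resampling indices implement the repeat-or-discard rule; reading frames as BGR images (OpenCV's default) is an assumption of the sketch.

```python
import numpy as np
import cv2  # OpenCV, assumed available for graying and resizing

TARGET_FRAMES = 32
TARGET_SIZE = (112, 112)  # (width, height)

def normalize_sequence(frames):
    """Resample a list of RGB frames to 32 grayscale frames of 112 x 112.

    Nearest-neighbour resampling over the time axis repeats frames of
    short clips and drops frames of long clips, matching the
    repeat-or-discard rule described above.
    """
    t = len(frames)
    # nearest-neighbour indices into the original sequence
    idx = np.round(np.linspace(0, t - 1, TARGET_FRAMES)).astype(int)
    out = np.empty((TARGET_FRAMES, TARGET_SIZE[1], TARGET_SIZE[0]), dtype=np.float32)
    for i, j in enumerate(idx):
        gray = cv2.cvtColor(frames[j], cv2.COLOR_BGR2GRAY)
        out[i] = cv2.resize(gray, TARGET_SIZE)
    return out  # shape (32, 112, 112)
```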
The iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, so that the later frame can be obtained from the earlier frame by a projective transformation; this assumption is reasonable because the change between two adjacent frames is small. Feature matching between two adjacent frames is performed using SURF features and dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
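The following sketch illustrates the idea of camera-motion compensation before computing dense optical flow. ORB keypoints and Farneback flow stand in for SURF and the full iDT pipeline (SURF is only available in opencv-contrib), so this should be read as an illustrative approximation rather than the exact iDT implementation.

```python
import cv2
import numpy as np

def compensated_flow(prev_gray, next_gray):
    """Rough sketch of camera-motion-compensated dense optical flow."""
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(prev_gray, None)
    k2, d2 = orb.detectAndCompute(next_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # projective transformation estimated robustly with RANSAC
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    # warp the previous frame so that camera motion is cancelled out
    h, w = prev_gray.shape
    prev_warped = cv2.warpPerspective(prev_gray, H, (w, h))
    # dense Farneback flow on the stabilized pair
    return cv2.calcOpticalFlowFarneback(prev_warped, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```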
In the data enhancement step, the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way to enlarge the sequence sample data set; the transformations are as follows (a code sketch follows the affine transformation description below):
the translation operation is as follows:
the pixel point (x, y) on each channel of each sequence sample is shifted by Δ x units along the x-axis and Δ y units along the y-axis, i.e., (x ', y') (x + Δ x, y + Δ y). Where Δ x is any integer of [ -0.1 × w,0.1 × w ], Δ y is any integer of [ -0.1 × h,0.1 × h ], w is the corresponding width of each frame of image, and h is the corresponding length of each frame of image.
The flipping operation is as follows:
The data of each channel of each sequence sample are mirrored horizontally and mirrored vertically.
The noise addition operation is as follows:
White Gaussian noise is added to the data of each channel of each sequence sample; the added noise follows a Gaussian distribution with mean 0 and variance 0.1.
The affine transformation operation is as follows:
The data of each channel of each sequence sample are rotated by a set angle chosen from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
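A sketch of these augmentations, applied with identical parameters to the grayscale, optical flow and depth sequences of one gesture, is shown below (Python with OpenCV and NumPy). Each transform produces a separate augmented copy; the per-sequence array layout (T, H, W) and the independent noise draw per modality are assumptions of the sketch.

```python
import numpy as np
import cv2

ANGLES = (45, 90, 135, 180, 225, 270, 315)   # 0 deg corresponds to the original clip

def translate(seq, dx, dy):
    """Shift every frame of a (T, H, W) sequence by (dx, dy) pixels."""
    t, h, w = seq.shape
    m = np.float32([[1, 0, dx], [0, 1, dy]])
    return np.stack([cv2.warpAffine(f, m, (w, h)) for f in seq.astype(np.float32)])

def flip(seq, axis):
    """axis=1: horizontal mirror, axis=0: vertical mirror."""
    return np.stack([cv2.flip(f, axis) for f in seq.astype(np.float32)])

def add_noise(seq, rng):
    """Additive white Gaussian noise with mean 0 and variance 0.1."""
    return seq.astype(np.float32) + rng.normal(0.0, np.sqrt(0.1), seq.shape).astype(np.float32)

def rotate(seq, angle):
    """Rotate every frame about the image centre by a fixed angle."""
    t, h, w = seq.shape
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    return np.stack([cv2.warpAffine(f, m, (w, h)) for f in seq.astype(np.float32)])

def augment(gray, flow, depth, rng=None):
    """Yield augmented (gray, flow, depth) triples with shared parameters."""
    rng = rng or np.random.default_rng(0)
    t, h, w = gray.shape
    dx = int(rng.integers(int(-0.1 * w), int(0.1 * w) + 1))
    dy = int(rng.integers(int(-0.1 * h), int(0.1 * h) + 1))
    transforms = [lambda s: translate(s, dx, dy),
                  lambda s: flip(s, 1),
                  lambda s: flip(s, 0),
                  lambda s: add_noise(s, rng)]
    transforms += [lambda s, a=a: rotate(s, a) for a in ANGLES]
    for fn in transforms:
        yield fn(gray), fn(flow), fn(depth)
```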
In the neural network training step, the grayscale sequence sample, optical flow sequence sample and depth sequence sample corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify the gesture. Specifically, a first neural network is trained on the optical flow sequence samples, a second neural network on the grayscale sequence samples, and a third neural network on the depth sequence samples.
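A sketch of the per-modality training loop in PyTorch is shown below. The optimizer, learning rate and epoch count are illustrative assumptions, and the network is assumed to return raw class logits (softmax being applied only at inference, as in the integration step that follows); the architecture itself is sketched after the network description below.

```python
import torch
from torch import nn, optim

def train_modality_net(make_net, loader, epochs=30, lr=1e-3, device="cpu"):
    """Train one classifier on a single modality (grayscale, flow or depth).

    `make_net` builds the shared 3D-CNN + SPP architecture; `loader`
    yields (clip, label) batches with clip shaped (B, 1, 32, 112, 112).
    Hyperparameters here are illustrative assumptions, not values from
    this description.
    """
    net = make_net().to(device)
    opt = optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clip, label in loader:
            clip, label = clip.to(device), label.to(device)
            opt.zero_grad()
            loss = loss_fn(net(clip), label)
            loss.backward()
            opt.step()
    return net

# one network per modality, identical structure, separate weights:
# nets = [train_modality_net(make_net, ld) for ld in (flow_loader, gray_loader, depth_loader)]
```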
In the model integration step, the gesture classification scores produced by the three networks for a sequence sample are multiplied class by class, and the sample is assigned to the gesture class with the highest combined score.
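The score fusion itself can be sketched as follows, assuming each trained network returns logits for a single preprocessed clip of its own modality and that networks and clips live on the same device.

```python
import torch

def ensemble_predict(nets, clips, num_classes=10):
    """Multiply the per-class softmax scores of the three modality
    networks and return the index of the highest-scoring gesture class."""
    scores = torch.ones(1, num_classes)
    for net, clip in zip(nets, clips):            # e.g. (flow_net, gray_net, depth_net)
        net.eval()
        with torch.no_grad():
            scores = scores * torch.softmax(net(clip), dim=1).cpu()
    return int(scores.argmax(dim=1))
```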
As shown in FIG. 2, the neural network consists of a 3D convolutional neural network, an SPP module and fully connected layers; the 3D convolutional neural network extracts the spatial and temporal features of the gesture simultaneously, the SPP module then extracts global and local features, and the gesture classification scores are obtained by passing the result through two fully connected layers and a softmax.
The 3D convolutional neural network comprises 5 convolutional layers; each convolutional layer comprises a convolution operation and a pooling operation, and the convolution operations use 3 × 3 × 3 kernels with stride 1 × 1 × 1.
The first convolution operation C1, second convolution operation C2 and third convolution operation C3 comprise 64, 128 and 256 convolution kernels respectively, each followed by a BN layer and a ReLU activation function; the pooling window of the first pooling operation P1 is 1 × 2 × 2 with stride 2 × 2 × 2, and the pooling windows of the second pooling operation P2 and third pooling operation P3 are both 2 × 2 × 2 with stride 2 × 2 × 2;
the fourth convolution operation C4 and the fifth convolution operation C5 each contain 512 convolution kernels, and the pooling windows of the fourth pooling operation P4 and the fifth pooling operation P5 are 2 × 2 × 2 and the step size is 2 × 1 × 1, wherein the first pooling operation P1, the second pooling operation P2, the third pooling operation P3, the fourth pooling operation P4 and the fifth pooling operation P5 all employ a mean pooling method.
The SPP network performs spatial pyramid pooling at different scales on the feature map produced by the 3D convolutional neural network to obtain a (16+4+1) × 512-dimensional feature vector. This vector is fed into two fully connected layers with 1024 neurons, and the result is input into a softmax layer to obtain the scores of the 10 gesture classes.
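Putting the backbone, the pyramid pooling and the classifier head together, a runnable PyTorch sketch of the whole network is given below. The convolution padding, the absence of BN after C4/C5, the exact shape of the fully connected head and the 4 × 4 / 2 × 2 / 1 × 1 pyramid bins are assumptions made to fill gaps in the description; forward() returns logits, with softmax applied at inference.

```python
import torch
from torch import nn
import torch.nn.functional as F

def conv_block(c_in, c_out, pool_k, pool_s, use_bn=True):
    """3x3x3 conv (stride 1, padding 1 assumed), optional BN, ReLU, mean pooling."""
    layers = [nn.Conv3d(c_in, c_out, kernel_size=3, stride=1, padding=1)]
    if use_bn:
        layers.append(nn.BatchNorm3d(c_out))
    layers += [nn.ReLU(inplace=True), nn.AvgPool3d(kernel_size=pool_k, stride=pool_s)]
    return nn.Sequential(*layers)

class GestureNet3D(nn.Module):
    """Sketch of the 3D-CNN + SPP gesture classifier described above."""

    def __init__(self, in_channels=1, num_classes=10, bins=(4, 2, 1)):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels, 64, (1, 2, 2), (2, 2, 2)),                 # C1 / P1
            conv_block(64, 128, (2, 2, 2), (2, 2, 2)),                         # C2 / P2
            conv_block(128, 256, (2, 2, 2), (2, 2, 2)),                        # C3 / P3
            conv_block(256, 512, (2, 2, 2), (2, 1, 1), use_bn=False),          # C4 / P4
            conv_block(512, 512, (2, 2, 2), (2, 1, 1), use_bn=False),          # C5 / P5
        )
        self.bins = bins
        spp_dim = sum(b * b for b in bins) * 512   # (16 + 4 + 1) * 512
        self.fc1 = nn.Linear(spp_dim, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.out = nn.Linear(1024, num_classes)

    def forward(self, x):                          # x: (B, 1, 32, 112, 112)
        x = self.features(x)                       # -> (B, 512, T', H', W')
        x = x.mean(dim=2)                          # collapse the residual time axis
        x = torch.cat([F.adaptive_avg_pool2d(x, b).flatten(1) for b in self.bins], dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)                         # logits; softmax at inference

# quick shape check with one grayscale clip:
# logits = GestureNet3D()(torch.randn(1, 1, 32, 112, 112))   # -> torch.Size([1, 10])
```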
In summary, this embodiment discloses a multi-modal dynamic gesture recognition method based on 3D convolution and SPP. Data augmentation through translation, flipping, noise addition and affine transformation improves the generalization ability of the gesture classification model. The sequence samples are fed into a 3D convolutional neural network to extract spatial and temporal features simultaneously, and the SPP network extracts local and global features, achieving high-accuracy dynamic gesture recognition. In addition, the method takes multi-modal sequence samples as input, trains three gesture classifiers separately, and improves the recognition accuracy of the gesture recognition system through model integration.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A multi-modal dynamic gesture recognition method based on 3D convolution and SPP, comprising:
a data preprocessing step, in which optical flow features and grayscale features are extracted from the RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and each optical flow, grayscale and depth sequence sample is normalized to 32 frames, so that the dimensionality of each sample is 32 × 112 × 112;
a data enhancement step, in which the sequence sample data set is enlarged through translation, flipping, noise addition and affine transformation;
a neural network training step, in which the grayscale sequence samples, optical flow sequence samples and depth sequence samples are fed into the same network structure and three networks are trained separately to classify gestures;
and a model integration step, in which the three networks' classification results for a sequence sample are fused to obtain the final decision.
2. The method of claim 1, wherein the data preprocessing step is performed as follows:
optical flow features are extracted from the 1080 RGB video sequences in the SKIG data set using the iDT algorithm, yielding 1080 optical flow sequence samples;
each frame of the RGB video sequences is converted to grayscale, yielding 1080 grayscale sequence samples;
because different gesture sequence samples have different durations, each sequence sample is normalized to a fixed 32 frames by repeating or discarding nearest-neighbour frames, and each frame is resized to 112 × 112 to serve as the input to the neural network.
3. The method of claim 2, wherein the iDT algorithm is as follows:
the iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, so that the later frame is obtained from the earlier frame by a projective transformation;
feature matching between two adjacent frames is performed using SURF features and dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
4. The method of claim 1, wherein the data enhancement step is performed by the following steps:
the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way, where the transformations are as follows:
the translation operation shifts each pixel (x, y) on each channel of each sequence sample by Δx units along the x-axis and Δy units along the y-axis, i.e. (x', y') = (x + Δx, y + Δy), where Δx is any integer in [-0.1w, 0.1w], Δy is any integer in [-0.1h, 0.1h], w is the width of each frame and h is the height of each frame;
the flipping operation mirrors the data of each channel of each sequence sample horizontally and vertically;
the noise addition operation adds white Gaussian noise to the data of each channel of each sequence sample, where the added noise follows a Gaussian distribution with mean 0 and variance 0.1;
the affine transformation operation rotates the data of each channel of each sequence sample by a set angle chosen from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
5. The method of claim 1, wherein the neural network training step is performed as follows:
the grayscale sequence sample, optical flow sequence sample and depth sequence sample corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify the gesture; specifically, a first neural network is trained on the optical flow sequence samples, a second neural network on the grayscale sequence samples, and a third neural network on the depth sequence samples;
the neural network consists of a 3D convolutional neural network, an SPP module and fully connected layers; the 3D convolutional neural network extracts the spatial and temporal features of the gesture simultaneously, the SPP module then extracts global and local features, and the gesture classification scores are obtained by passing the result through two fully connected layers and a softmax.
6. The method of claim 5, wherein the 3D convolutional neural network comprises 5 convolutional layers;
each convolutional layer comprises a convolution operation and a pooling operation; the convolution operations use 3 × 3 × 3 kernels with stride 1 × 1 × 1;
the first, second and third convolution operations comprise 64, 128 and 256 convolution kernels respectively, each followed by a BN layer and a ReLU activation function; the pooling window of the first pooling operation is 1 × 2 × 2 with stride 2 × 2 × 2, and the pooling windows of the second and third pooling operations are 2 × 2 × 2 with stride 2 × 2 × 2;
the fourth and fifth convolution operations both comprise 512 convolution kernels; the pooling windows of the fourth and fifth pooling operations are 2 × 2 × 2 with stride 2 × 1 × 1; the first through fifth pooling operations all use average (mean) pooling.
7. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP according to claim 5, wherein the SPP network performs spatial pyramid pooling at different scales on the feature map produced by the 3D convolutional neural network to obtain a (16+4+1) × 512-dimensional feature vector; the feature vector is fed into two fully connected layers with 1024 neurons, and the result is input into a softmax layer to obtain the scores of the 10 gesture classes.
8. The method of claim 1, wherein in the model integration the gesture classification scores produced by the three networks for a sequence sample are multiplied class by class, and the sample is assigned to the gesture class with the highest combined score.
CN201911423353.1A 2019-12-31 2019-12-31 Multi-mode dynamic gesture recognition method based on 3D convolution and SPP Active CN111104929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911423353.1A CN111104929B (en) 2019-12-31 2019-12-31 Multi-mode dynamic gesture recognition method based on 3D convolution and SPP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911423353.1A CN111104929B (en) 2019-12-31 2019-12-31 Multi-mode dynamic gesture recognition method based on 3D convolution and SPP

Publications (2)

Publication Number Publication Date
CN111104929A true CN111104929A (en) 2020-05-05
CN111104929B CN111104929B (en) 2023-05-09

Family

ID=70426599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911423353.1A Active CN111104929B (en) 2019-12-31 2019-12-31 Multi-mode dynamic gesture recognition method based on 3D convolution and SPP

Country Status (1)

Country Link
CN (1) CN111104929B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN109871781A (en) * 2019-01-28 2019-06-11 山东大学 Dynamic gesture identification method and system based on multi-modal 3D convolutional neural networks
CN109919057A (en) * 2019-02-26 2019-06-21 北京理工大学 A kind of multi-modal fusion gesture identification method based on efficient convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯翔; 吴瀚; 司冰灵; 季超: "基于嵌网融合结构的卷积神经网络手势图像识别方法" [Gesture image recognition method using a convolutional neural network with an embedded-net fusion structure] *
曹钰: "基于区域信息的深度卷积神经网络研究综述" [A survey of deep convolutional neural networks based on region information] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239824A (en) * 2021-05-19 2021-08-10 北京工业大学 Dynamic gesture recognition method for multi-modal training single-modal test based on 3D-Ghost module
CN113239824B (en) * 2021-05-19 2024-04-05 北京工业大学 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module
CN117711016A (en) * 2023-11-29 2024-03-15 亿慧云智能科技(深圳)股份有限公司 Gesture recognition method and system based on terminal equipment

Also Published As

Publication number Publication date
CN111104929B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN107038448B (en) Target detection model construction method
CN110796080B (en) Multi-pose pedestrian image synthesis algorithm based on generation countermeasure network
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
US11755889B2 (en) Method, system and apparatus for pattern recognition
Yan et al. Combining the best of convolutional layers and recurrent layers: A hybrid network for semantic segmentation
Chadha et al. iSeeBetter: Spatio-temporal video super-resolution using recurrent generative back-projection networks
CN110969089A (en) Lightweight face recognition system and recognition method under noise environment
CN110399882A (en) A kind of character detecting method based on deformable convolutional neural networks
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
CN111104929A (en) Multi-modal dynamic gesture recognition method based on 3D convolution and SPP
JP2017182438A (en) Image processing device, semiconductor device, image recognition device, mobile device, and image processing method
Tarchoun et al. Hand-Crafted Features vs Deep Learning for Pedestrian Detection in Moving Camera.
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN113763417A (en) Target tracking method based on twin network and residual error structure
CN115147932A (en) Static gesture recognition method and system based on deep learning
US11436432B2 (en) Method and apparatus for artificial neural network
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
CN113628349B (en) AR navigation method, device and readable storage medium based on scene content adaptation
CN115410133A (en) Video dense prediction method and device
CN115115860A (en) Image feature point detection matching network based on deep learning
WO2020196917A1 (en) Image recognition device and image recognition program
Venkatesan et al. Advanced classification using genetic algorithm and image segmentation for Improved FD
Zhou et al. Attentive Multimodal Fusion for Optical and Scene Flow
US20240193866A1 (en) Methods and systems for 3d hand pose estimation from rgb images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant