CN113052112B - Gesture motion recognition interaction system and method based on hybrid neural network - Google Patents

Gesture motion recognition interaction system and method based on hybrid neural network

Info

Publication number
CN113052112B
CN113052112B (application CN202110361015.0A; also published as CN113052112A)
Authority
CN
China
Prior art keywords
gesture
neural network
video
cnn
data set
Prior art date
Legal status
Active
Application number
CN202110361015.0A
Other languages
Chinese (zh)
Other versions
CN113052112A (en)
Inventor
王立军 (Wang Lijun)
于霄洋 (Yu Xiaoyang)
李争平 (Li Zhengping)
Current Assignee
North China University of Technology
Original Assignee
North China University of Technology
Priority date
Filing date
Publication date
Application filed by North China University of Technology
Priority to CN202110361015.0A
Publication of CN113052112A
Application granted
Publication of CN113052112B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a projection gesture motion recognition interaction method and system based on a 3D CNN and RNN hybrid neural network. The invention obtains depth information of the hand, improves recognition accuracy, and achieves state-of-the-art performance on a self-built data set; by combining a 3D CNN with an RNN, the fused network substantially outperforms the previous CNN+RNN approach.

Description

Gesture motion recognition interaction system and method based on hybrid neural network
Technical Field
The invention belongs to the technical field of image recognition, and relates to a gesture motion recognition interaction system and method based on a hybrid neural network.
Background
In recent years, with the rise of artificial intelligence, machine learning and deep learning have swept through computing, and human-machine interaction has become a major research topic in machine vision. Intelligent devices with human-machine interaction functions are developing rapidly in the market, and gestures, the form of interaction people use most in daily life, have already been applied in many smart devices.
Gestures and hand postures are a common form of human communication, so it is natural for humans to use the same form to interact with machines. For example, gesture-based interaction can improve the comfort and safety of automobiles; simple gesture interaction makes controlling a smart home more convenient; and gesture recognition with high accuracy lets VR/AR applications run more smoothly.
Gesture recognition divides into static and dynamic gesture recognition. Static gesture recognition is trained on still pictures, while dynamic gesture recognition is trained on dynamic hand motion, i.e., movements of the hand detected in real-time video; the task of gesture recognition is to interpret the meaning of the hand's actions. In today's gesture recognition systems, many researchers have proposed techniques based on depth cameras, color cameras, distance sensors, wearable inertial sensors and other sensor modalities. However, much computer-vision-based gesture recognition handles only static gestures, which makes the interaction unnatural. In real human-computer interaction systems, automatic detection and classification of dynamic gestures is challenging because (1) people vary greatly in how they perform gestures, which complicates detection and classification, and (2) the system must work online to avoid a significant delay between performing a gesture and its classification.
Disclosure of Invention
To solve the above problems, the invention provides a projection gesture motion recognition interaction method and system based on a 3D CNN and RNN hybrid neural network. First, depth image videos, color image videos and infrared image videos of the hand are acquired with a depth camera and converted to a uniform format; the video files are then grouped and fed into a 3D CNN (three-dimensional convolutional neural network) to learn the motion in the videos and output image features; the features are then trained cyclically with an RNN (recurrent neural network); finally, the recognition result is output.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a projection gesture motion recognition method based on a 3D CNN and RNN hybrid neural network comprises the following steps:
step one, image video dataset acquisition
Collecting hand data by using a depth camera, and creating a data set;
at model input, the three-channel RGB input is converted into a six-channel RGB+HSV input, where H, S and V denote hue, saturation and value (brightness), computed as follows:
max=max(R/255,G/255,B/255) (1)
min=min(R/255,G/255,B/255) (2)
$$H=\begin{cases}0^{\circ}, & \max=\min\\ 60^{\circ}\times\frac{G/255-B/255}{\max-\min}\ \bmod\ 360^{\circ}, & \max=R/255\\ 60^{\circ}\times\frac{B/255-R/255}{\max-\min}+120^{\circ}, & \max=G/255\\ 60^{\circ}\times\frac{R/255-G/255}{\max-\min}+240^{\circ}, & \max=B/255\end{cases}\qquad(3)$$

$$S=\begin{cases}0, & \max=0\\ \frac{\max-\min}{\max}, & \text{otherwise}\end{cases}\qquad(4)$$
V=max (5)
wherein R, G and B are the red, green and blue component values of each frame of image;
and secondly, performing video learning on video data in the data set by adopting a three-dimensional convolutional neural network, and outputting image characteristics.
And thirdly, performing cyclic training on the image features output in the second step by adopting a recurrent neural network.
Further, step one comprises the following sub-steps:
1) Using a depth camera, shoot 10 segments each of depth video, color video and infrared video per gesture scene; 10 gesture operations are preset in the data set: gesture A, gesture B, gesture C, gesture D, gesture E, gesture F, gesture G, gesture H, gesture I and gesture J;
2) Adjust the video sizes so that they share a uniform size;
3) Put the videos obtained in the previous step into separate folders and generate gesture label files;
4) Integrate the folders to complete creation of the data set.
Further, the three-dimensional convolutional neural network in step two performs the following operations:
it samples frames from the video, extracting 7 frames per second as network input; 5 channels of information are extracted from each frame, where the gray, gradient-x and gradient-y channels are obtained directly from each individual frame, and the optflow-x and optflow-y channels are extracted using the information of two consecutive frames;
taking the output of the previous layer as input, convolution is performed on the 5 input channels with 3D convolution kernels of size 7×7×3, this layer using two different 3D convolution kernels;
a max pooling operation is performed, the number of feature maps after downsampling remaining unchanged;
the two previously divided groups of feature maps are each convolved with 7×6×3 kernels, the 3D CNN applying three different convolution kernels to each group in order to increase the number of feature maps;
sampling is then carried out: each feature map is downsampled with a 3×3 kernel and convolved with a 7×4 2D convolution kernel.
The projection gesture motion recognition system based on the 3D CNN and RNN hybrid neural network comprises an image video data set acquisition module, a three-dimensional convolutional neural network and a recurrent neural network. The image video data set acquisition module collects hand data with a depth camera; the three-dimensional convolutional neural network performs video learning on the video data in the data set and outputs image features; the recurrent neural network performs cyclic training on the image features output by the three-dimensional convolutional neural network.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The invention collects hand data with a TOF depth camera; compared with the RGB video used by most common gesture data sets, the depth images and IR video also capture the depth of the hand. The invention introduces a new, challenging multi-modal dynamic gesture data set captured with depth, color and stereo infrared sensors, and converts the three-channel RGB model input into a six-channel RGB+HSV input, which improves recognition accuracy and underpins the gesture recognition control scheme. The invention achieves state-of-the-art performance on the self-built data set: the fused 3D CNN and RNN hybrid network substantially outperforms the previous CNN+RNN approach. Through simple interaction operations, the invention makes gesture recognition more effective and simpler, with a clearly visible recognition improvement.
Drawings
Fig. 1 is a schematic flow chart of a gesture motion recognition interaction method based on a hybrid neural network.
Fig. 2 is a schematic diagram of an image video dataset acquisition step.
Fig. 3 is a schematic diagram of a convolution operation performed by the 3D CNN on an image sequence (video) using a 3D convolution kernel.
Fig. 4 is a schematic diagram of a 3D CNN architecture.
Fig. 5 is a schematic diagram of a simple recurrent neural network structure.
Fig. 6 is a schematic diagram of the input-output principle of the recurrent neural network.
Detailed Description
The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.
The gesture motion recognition interaction method based on the hybrid neural network provided by the invention, shown in fig. 1, comprises the following steps:
step one, image video dataset acquisition
Compared with the RGB video used by most common gesture data sets, depth images and IR video also capture the depth of the hand, so a TOF depth camera is used to collect the hand data. The acquisition procedure, shown in fig. 2, comprises the following steps:
1) A depth camera is used to capture 10 segments each of depth video, color video and infrared video per gesture scene. 10 gesture operations are preset in the data set: gesture A, gesture B, gesture C, gesture D, gesture E, gesture F, gesture G, gesture H, gesture I and gesture J.
2) The video sizes are adjusted so that they share a uniform size, in this example 640×420.
3) The videos obtained in the previous step are put into separate folders and gesture label files are generated.
4) The folders are integrated to complete creation of the data set (a hypothetical layout is sketched below).
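As an illustration of steps 3) and 4), the following minimal Python sketch walks a hypothetical folder-per-gesture layout and generates a label file; the directory name gesture_dataset, the .avi extension, the modality-in-filename convention and the labels.csv output are all assumptions made here, not details from the patent:

```python
import csv
from pathlib import Path

# Hypothetical layout: one folder per gesture (gesture_A ... gesture_J),
# each holding the depth/color/IR clips for that gesture.
DATASET_ROOT = Path("gesture_dataset")
GESTURES = [f"gesture_{c}" for c in "ABCDEFGHIJ"]  # the 10 preset gestures

with open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_path", "modality", "label"])
    for label, gesture in enumerate(GESTURES):
        for video in sorted((DATASET_ROOT / gesture).glob("*.avi")):
            # modality (depth/color/ir) assumed encoded in the file name
            modality = video.stem.split("_")[0]
            writer.writerow([str(video), modality, label])
```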
To enhance model recognition accuracy, the invention converts the three-channel RGB model input into a six-channel RGB+HSV input. HSV stands for Hue, Saturation and Value, computed as follows:
max=max(R/255,G/255,B/255) (1)
min=min(R/255,G/255,B/255) (2)
$$H=\begin{cases}0^{\circ}, & \max=\min\\ 60^{\circ}\times\frac{G/255-B/255}{\max-\min}\ \bmod\ 360^{\circ}, & \max=R/255\\ 60^{\circ}\times\frac{B/255-R/255}{\max-\min}+120^{\circ}, & \max=G/255\\ 60^{\circ}\times\frac{R/255-G/255}{\max-\min}+240^{\circ}, & \max=B/255\end{cases}\qquad(3)$$

$$S=\begin{cases}0, & \max=0\\ \frac{\max-\min}{\max}, & \text{otherwise}\end{cases}\qquad(4)$$
V=max (5)
Where R, G and B are the red, green and blue component values of each frame of image.
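For reference, a minimal numpy sketch of equations (1)-(5) follows, stacking the HSV channels onto the RGB channels to form the six-channel input; the function name and the scaling of hue to [0, 1] are choices made here, not specified by the patent:

```python
import numpy as np

def rgb_to_rgbhsv(frame: np.ndarray) -> np.ndarray:
    """Turn an H x W x 3 uint8 RGB frame into the six-channel RGB+HSV input,
    following equations (1)-(5) above."""
    rgb = frame.astype(np.float32) / 255.0            # R/255, G/255, B/255
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx = rgb.max(axis=-1)                             # eq. (1)
    mn = rgb.min(axis=-1)                             # eq. (2)
    delta = mx - mn

    # Hue, eq. (3): piecewise on which channel attains the maximum.
    h = np.zeros_like(mx)
    gray = delta == 0
    r_max = ~gray & (mx == r)
    g_max = ~gray & (mx == g) & ~r_max
    b_max = ~gray & ~r_max & ~g_max
    h[r_max] = (60.0 * (g - b)[r_max] / delta[r_max]) % 360.0
    h[g_max] = 60.0 * (b - r)[g_max] / delta[g_max] + 120.0
    h[b_max] = 60.0 * (r - g)[b_max] / delta[b_max] + 240.0

    # Saturation, eq. (4), and value, eq. (5).
    s = np.where(mx > 0.0, delta / np.where(mx > 0.0, mx, 1.0), 0.0)
    v = mx

    hsv = np.stack([h / 360.0, s, v], axis=-1)        # hue scaled to [0, 1]
    return np.concatenate([rgb, hsv], axis=-1)        # H x W x 6 float array

# Example: convert one random 60x40 frame.
frame = np.random.randint(0, 256, (60, 40, 3), np.uint8)
print(rgb_to_rgbhsv(frame).shape)                     # (60, 40, 6)
```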
Step two, video learning on the video data in the data set with a 3D CNN (three-dimensional convolutional neural network)
A conventional 2D CNN recognizes each video frame independently, applying 2D convolution kernels without considering inter-frame motion along the time dimension. A 3D CNN better captures spatial and temporal feature information in video; as shown in fig. 3, it convolves the image sequence (video) with 3D convolution kernels.
In fig. 3 the temporal extent of the convolution is 3, i.e., the convolution operates on three consecutive frames: 3D convolution stacks several consecutive frames into a cube and applies a 3D kernel within that cube. In this structure, each feature map in a convolutional layer is connected to multiple adjacent consecutive frames of the previous layer, thereby capturing motion information.
3D CNNs are well suited to spatiotemporal feature learning. Compared with 2D CNNs, they model temporal information better through 3D convolution and 3D pooling: in a 3D CNN these operations act over space and time, whereas in a 2D CNN they act over space only, so 3D convolution preserves the temporal information of the input signal. The 3D CNN structure is shown in fig. 4.
The 3D CNN samples the video frames, extracting 7 frames of 60×40 images per second as network input. From each frame, 5 channels of information are extracted: gray, gradient-x and gradient-y are obtained directly from each individual frame, while the two optical-flow channels (optflow-x and optflow-y) require the information of two consecutive frames. The H1 layer therefore has 7+7+7+6+6 = 33 feature maps, each still of size 60×40.
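A small OpenCV sketch of these five hardwired channels for one frame pair follows; the patent does not name specific gradient or optical-flow operators, so Sobel gradients and Farneback optical flow are illustrative assumptions:

```python
import cv2
import numpy as np

# Sketch of the five hardwired H1 channels for one (previous, current) pair.
prev = np.random.randint(0, 256, (60, 40), np.uint8)   # previous 60x40 frame
cur = np.random.randint(0, 256, (60, 40), np.uint8)    # current 60x40 frame

gray = cur.astype(np.float32)                # channel 1: gray
grad_x = cv2.Sobel(cur, cv2.CV_32F, 1, 0)    # channel 2: gradient-x
grad_y = cv2.Sobel(cur, cv2.CV_32F, 0, 1)    # channel 3: gradient-y
flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
optflow_x, optflow_y = flow[..., 0], flow[..., 1]      # channels 4 and 5
print(gray.shape, grad_x.shape, flow.shape)            # (60, 40) ... (60, 40, 2)
```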
The output of that layer is then taken as input, and each of the 5 input channels is convolved with 3D convolution kernels of size 7×7×3 (7×7 spatial, 3 temporal). To increase the number of feature maps, two different 3D kernels are used at this layer, so the C2 layer has ((7−3)+1)×3 + ((6−3)+1)×2 = 23 feature maps per kernel, i.e., 23×2 in total, each of size ((60−7)+1)×((40−7)+1) = 54×34.
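This shape arithmetic can be checked with a short PyTorch sketch (the framework is an assumption; the patent names none). Conv3d takes input of shape (batch, channels, frames, height, width), so the 7×7×3 kernel of the text corresponds to kernel_size=(3, 7, 7) here; the 2×2 spatial max pooling described next is previewed as well:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 1, 7, 60, 40)              # 7 frames of 60x40, one channel
c2 = nn.Conv3d(1, 2, kernel_size=(3, 7, 7))      # two different 3D kernels
s3 = nn.MaxPool3d(kernel_size=(1, 2, 2))         # downsample space, keep time
x = c2(clip)
print(x.shape)      # torch.Size([1, 2, 5, 54, 34]): (7-3+1) frames of 54x34
print(s3(x).shape)  # torch.Size([1, 2, 5, 27, 17]): 27x17 after 2x2 pooling
```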
next, a max mapping operation is performed, and the number of characteristic maps after downsampling remains unchanged, so that the number of characteristic maps of the S3 layer remains as follows: 23 x 2, the size of the feature maps is: ((54/2) × (34/2) =27×17.
Next, the two previously divided groups of feature maps are each convolved with 7×6×3 kernels; to increase the number of feature maps, the 3D CNN applies three different convolution kernels to each group. The C4 layer thus has 13×3×2 = 13×6 feature maps, each of size ((27−7)+1)×((17−6)+1) = 21×12.
Next comes the sampling step: a 3×3 kernel downsamples each feature map to size 7×4. Each feature map is then convolved with a 7×4 2D convolution kernel, reducing every map to size 1×1.
The present invention proposes a network combining the 3D CNN with connectionist temporal classification (CTC). CTC enables gesture classification based on the core phase of the gesture without explicit pre-segmentation, addressing the problems of low gesture-detection accuracy and severe delay, which are key elements of gesture interaction.
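As a hedged sketch of how CTC-based classification without pre-segmentation can look, the following uses torch.nn.CTCLoss with class 0 as the blank ("no gesture") label; the blank index, sizes and framework are standard illustrative choices, not details taken from the patent:

```python
import torch
import torch.nn as nn

# Per-timestep class scores over a gesture stream; the blank class lets the
# loss align the gesture label with its core phase, no pre-segmentation needed.
T, B, C = 20, 1, 11                        # timesteps, batch, 10 gestures + blank
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(2)
targets = torch.tensor([[3]])              # one gesture (class 3) in the stream
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.tensor([T]),
                           target_lengths=torch.tensor([1]))
loss.backward()                            # gradients flow back to the scores
print(float(loss))
```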
And outputting image characteristics after the action learning of the video is performed through the steps.
Step three, cyclic training of the image features output in step two with an RNN (recurrent neural network)
An RNN (Recurrent Neural Network) is a type of neural network for processing sequence data. Time-series data are data collected at different points in time, reflecting how the state of something changes over time.
A simple recurrent neural network, shown in fig. 5, consists of an input layer, a hidden layer and an output layer. In the figure, x is a vector holding the values of the input layer; s is a vector holding the values of the hidden layer; U is the weight matrix from the input layer to the hidden layer; o is a vector holding the values of the output layer; and V is the weight matrix from the hidden layer to the output layer.
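In the notation of fig. 5 (with a hidden-to-hidden weight matrix W implied by the recurrence), a minimal numpy sketch of one pass over a sequence might look as follows; all sizes are illustrative:

```python
import numpy as np

# Simple recurrent network of fig. 5:
#   s_t = tanh(U x_t + W s_{t-1}),  o = softmax(V s_T)
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 32, 64, 10                  # feature, hidden, class sizes
U = rng.normal(scale=0.1, size=(n_hid, n_in))    # input -> hidden weights
W = rng.normal(scale=0.1, size=(n_hid, n_hid))   # hidden -> hidden weights
V = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(n_hid)                              # initial hidden state
for x in rng.normal(size=(7, n_in)):             # a 7-step feature sequence
    s = np.tanh(U @ x + W @ s)                   # hidden state carries history
o = softmax(V @ s)                               # class scores from final state
print(o.shape, o.sum())                          # (10,) 1.0
```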
Fig. 6 illustrates the input-output principle of the recurrent neural network: X is the data input, and the hidden state h extracts features and produces the output y, which is passed on to the next step, so that every earlier step is represented in the later ones. The recurrent network processes a sequence by traversing its elements while keeping a state that encodes what has been seen so far: in effect, the RNN is a for loop that reuses the results of the previous iteration of the loop. This structure lets us process the sequence data extracted by the 3D CNN efficiently.
In this step, the RNN performs the cyclic training and finally outputs the recognition result.
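Putting steps two and three together, a minimal PyTorch sketch of a hybrid 3D CNN + RNN classifier over the six-channel clips might look as follows; the layer sizes and the use of nn.RNN are illustrative assumptions under the structure described above, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class Hybrid3DCNNRNN(nn.Module):
    """Minimal sketch of the hybrid pipeline: a small 3D CNN extracts
    spatiotemporal features from each clip of frames, and an RNN aggregates
    the clip features over time before classifying into the 10 gestures."""

    def __init__(self, in_channels=6, num_classes=10, hidden=128):
        super().__init__()
        self.cnn3d = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=(3, 7, 7)), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=(3, 5, 5)), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                   # -> (N, 32, 1, 1, 1)
        )
        self.rnn = nn.RNN(input_size=32, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clips):
        # clips: (batch, steps, channels, frames, height, width)
        b, t = clips.shape[:2]
        feats = self.cnn3d(clips.flatten(0, 1)).flatten(1)   # (b*t, 32)
        out, _ = self.rnn(feats.view(b, t, -1))              # (b, t, hidden)
        return self.fc(out[:, -1])                           # gesture logits

model = Hybrid3DCNNRNN()
logits = model(torch.randn(2, 4, 6, 7, 60, 40))   # 2 samples, 4 clips of 7 frames
print(logits.shape)                               # torch.Size([2, 10])
```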
The invention also provides a gesture motion recognition interaction system based on the hybrid neural network, comprising an image video data set acquisition module, a three-dimensional convolutional neural network and a recurrent neural network. The acquisition module collects hand data with a depth camera, implementing step one above; the three-dimensional convolutional neural network performs video learning on the video data in the data set and outputs image features, implementing step two; and the recurrent neural network performs cyclic training on those features, implementing step three.
The technical means disclosed by the invention are not limited to those of the embodiment above, but include any technical scheme formed by combining the disclosed technical features. It should be noted that modifications and adaptations apparent to those skilled in the art without departing from the principles of the invention also fall within its scope.

Claims (3)

1. A projection gesture motion recognition method based on a 3D CNN and RNN hybrid neural network, characterized by comprising the following steps:
step one, image video dataset acquisition
Collecting hand data by using a depth camera, and creating a data set;
at model input, the three-channel RGB input is converted into a six-channel RGB+HSV input, where H, S and V denote hue, saturation and value (brightness), computed as follows:
max = max(R/255,G/255,B/255) (1)
min = min(R/255,G/255,B/255) (2)
$$H=\begin{cases}0^{\circ}, & \max=\min\\ 60^{\circ}\times\frac{G/255-B/255}{\max-\min}\ \bmod\ 360^{\circ}, & \max=R/255\\ 60^{\circ}\times\frac{B/255-R/255}{\max-\min}+120^{\circ}, & \max=G/255\\ 60^{\circ}\times\frac{R/255-G/255}{\max-\min}+240^{\circ}, & \max=B/255\end{cases}\qquad(3)$$

$$S=\begin{cases}0, & \max=0\\ \frac{\max-\min}{\max}, & \text{otherwise}\end{cases}\qquad(4)$$
V = max (5)
wherein R, G and B are the red, green and blue component values of each frame of image;
step two, video learning is carried out on the video data in the gesture action data set with a three-dimensional convolutional neural network, and image features are output;
the three-dimensional convolutional neural network performs the following operations:
it samples frames from the video, extracting 7 frames per second as network input; 5 channels of information are extracted from each frame, wherein the gray, gradient-x and gradient-y channels are obtained directly from each individual frame, and the optflow-x and optflow-y channels are extracted using the information of two consecutive frames;
taking the output of the previous layer as input, convolution is performed on the 5 input channels with 3D convolution kernels of size 7×7×3, this layer using two different 3D convolution kernels;
a max pooling operation is performed, the number of feature maps after downsampling remaining unchanged;
the two previously divided groups of feature maps are each convolved with 7×6×3 kernels, the 3D CNN applying three different convolution kernels to each group in order to increase the number of feature maps;
sampling is carried out: each feature map is downsampled with a 3×3 kernel and then convolved with a 7×4 2D convolution kernel;
and thirdly, performing cyclic training on the image features output in the second step by adopting a recurrent neural network, and finally outputting gesture motion recognition results.
2. The projection gesture motion recognition method based on the 3D CNN and RNN hybrid neural network according to claim 1, wherein step one comprises the following sub-steps:
1) using a depth camera, shoot 10 segments each of depth video, color video and infrared video per gesture scene; 10 gesture operations are preset in the data set: gesture A, gesture B, gesture C, gesture D, gesture E, gesture F, gesture G, gesture H, gesture I and gesture J;
2) adjusting the video sizes so that they share a uniform size;
3) putting the videos obtained in the previous step into separate folders and generating gesture label files;
4) integrating the folders to complete creation of the data set.
3. A projection gesture motion recognition system based on the 3D CNN and RNN hybrid neural network, characterized by comprising an image video data set acquisition module, a three-dimensional convolutional neural network and a recurrent neural network, the system implementing the projection gesture motion recognition method based on the 3D CNN and RNN hybrid neural network according to any one of claims 1-2; the image video data set acquisition module collects hand data with a depth camera; the three-dimensional convolutional neural network performs video learning on the video data in the data set and outputs image features; and the recurrent neural network performs cyclic training on the image features output by the three-dimensional convolutional neural network.
CN202110361015.0A 2021-04-02 2021-04-02 Gesture motion recognition interaction system and method based on hybrid neural network Active CN113052112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110361015.0A CN113052112B (en) 2021-04-02 2021-04-02 Gesture motion recognition interaction system and method based on hybrid neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110361015.0A CN113052112B (en) 2021-04-02 2021-04-02 Gesture motion recognition interaction system and method based on hybrid neural network

Publications (2)

Publication Number Publication Date
CN113052112A CN113052112A (en) 2021-06-29
CN113052112B (en) 2023-06-02

Family

ID=76517207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110361015.0A Active CN113052112B (en) 2021-04-02 2021-04-02 Gesture motion recognition interaction system and method based on hybrid neural network

Country Status (1)

Country Link
CN (1) CN113052112B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11126835B2 (en) * 2019-02-21 2021-09-21 Tata Consultancy Services Limited Hand detection in first person view

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010144050A1 (en) * 2009-06-08 2010-12-16 Agency For Science, Technology And Research Method and system for gesture based manipulation of a 3-dimensional image of object
CN107590432A (en) * 2017-07-27 2018-01-16 Beijing Union University Gesture recognition method based on a recurrent three-dimensional convolutional neural network
CN107679491A (en) * 2017-09-29 2018-02-09 Central China Normal University 3D convolutional neural network sign language recognition method fusing multi-modal data
CN108334814A (en) * 2018-01-11 2018-07-27 Zhejiang University of Technology AR system gesture recognition method based on a convolutional neural network combined with analysis of users' habitual behaviour
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN109344701A (en) * 2018-08-23 2019-02-15 Wuhan Chang'e Medical Anti-Aging Robot Co., Ltd. Kinect-based dynamic gesture recognition method
CN110532912A (en) * 2019-08-19 2019-12-03 Hefei University Sign language translation implementation method and device
CN211293894U (en) * 2019-11-27 2020-08-18 South China University of Technology In-air handwriting interaction device
CN111079581A (en) * 2019-12-03 2020-04-28 Guangzhou Jiubang Century Technology Co., Ltd. Method and device for identifying human skin
CN111079641A (en) * 2019-12-13 2020-04-28 iFlytek Co., Ltd. Answering content identification method, related device and readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
3D Convolutional Neural Networks for Human Action Recognition; Shuiwang Ji et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231 *
Aviation Medical Simulation Training Based on Interactive Technology; Lin Jiang et al.; 2021 IEEE 4th International Conference on Computer and Communication Engineering Technology (CCET), pp. 387-391 *
DeepFinger: A Cascade Convolutional Neuron Network Approach to Finger Key Point Detection in Egocentric Vision with Mobile Camera; Yichao Huang et al.; 2015 IEEE International Conference on Systems, Man, and Cybernetics, pp. 2944-2949 *
Research on Video Gesture Recognition Based on Improved CNN+RNN; Ding Xiaoxue; China Masters' Theses Full-text Database, Information Science and Technology, no. 07, I138-1139 *

Also Published As

Publication number Publication date
CN113052112A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN109344701B (en) Kinect-based dynamic gesture recognition method
Anwar et al. Image colorization: A survey and dataset
WO2021098261A1 (en) Target detection method and apparatus
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
Agarwal et al. Anubhav: recognizing emotions through facial expression
CN108363973B (en) Unconstrained 3D expression migration method
CN113343950B (en) Video behavior identification method based on multi-feature fusion
CN110032932B (en) Human body posture identification method based on video processing and decision tree set threshold
CN112487981A (en) MA-YOLO dynamic gesture rapid recognition method based on two-way segmentation
CN110942037A (en) Action recognition method for video analysis
CN113673584A (en) Image detection method and related device
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN114821764A (en) Gesture image recognition method and system based on KCF tracking detection
TW202145065A (en) Image processing method, electronic device and computer-readable storage medium
CN115328319A (en) Intelligent control method and device based on light-weight gesture recognition
CN104408444A (en) Human body action recognition method and device
Wani et al. Deep learning-based video action recognition: a review
US20240161461A1 (en) Object detection method, object detection apparatus, and object detection system
Le et al. Facial detection in low light environments using OpenCV
CN111881803B (en) Face recognition method based on improved YOLOv3
CN113052112B (en) Gesture motion recognition interaction system and method based on hybrid neural network
Howe et al. Comparison of hand segmentation methodologies for hand gesture recognition
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function
Meshram et al. Convolution Neural Network based Hand Gesture Recognition System
CN115423982A (en) Desktop curling three-dimensional detection method based on image and depth

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant