CN113269089A - Real-time gesture recognition method and system based on deep learning - Google Patents

Real-time gesture recognition method and system based on deep learning

Info

Publication number
CN113269089A
Authority
CN
China
Prior art keywords
hand
gesture
network
frame
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110574202.7A
Other languages
Chinese (zh)
Other versions
CN113269089B (en)
Inventor
宋海涛
盛斌
王资凯
王天逸
谭峰
李佳佳
赵亦博
鞠睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Artificial Intelligence Research Institute Co ltd
Original Assignee
Shanghai Artificial Intelligence Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Artificial Intelligence Research Institute Co ltd filed Critical Shanghai Artificial Intelligence Research Institute Co ltd
Priority to CN202110574202.7A priority Critical patent/CN113269089B/en
Publication of CN113269089A publication Critical patent/CN113269089A/en
Application granted granted Critical
Publication of CN113269089B publication Critical patent/CN113269089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time gesture recognition method and system based on deep learning, comprising the following steps: collecting an image and extracting the hand depth image from it using a target detection network; converting the hand depth image into 3D voxelized data and inputting the data into a V2V-PoseNet network to obtain hand key point data, the V2V-PoseNet network being a pruned V2V-PoseNet network; and preprocessing the hand key point data and inputting it into a classification network that classifies the gesture action to obtain a gesture category. The method combines advanced learning models, avoids introducing manually defined features, and has strong generalization and expression capability as well as good extensibility. The existing models used in the system are pruned and optimized according to the task requirements, increasing model speed without affecting accuracy. Key point detection and action classification achieve good results on the MSRA Hand dataset.

Description

Real-time gesture recognition method and system based on deep learning
Technical Field
The invention relates to the technical field of deep learning and gesture recognition, in particular to a real-time gesture recognition method and system based on deep learning.
Background
Today, science and technology are developing rapidly, human-computer interaction technology is widely used in people's daily lives, and ever richer applications are continuously being developed. Human-computer interaction technology enables people to communicate with machines in many ways and languages, including gesture language. Gestures are a natural and intuitive mode of interpersonal communication in everyday life, and as human-computer interaction gradually becomes human-centered, gesture recognition has become a research hotspot. It provides a means for users to interact naturally with virtual environments and is one of the most popular human-computer interface technologies. However, because of the diversity and ambiguity of gestures, their variation in time and space, and the ill-posed nature of visually observing the complexly deforming human hand, vision-based gesture recognition is a highly challenging, multidisciplinary research topic. With the continuous development of related technologies such as image processing and pattern recognition, and the wide application of natural human-computer interaction, gesture recognition technology has attracted increasing research attention.
Existing gesture recognition techniques suffer from certain drawbacks and deficiencies, including low precision, slow speed, high power consumption, and opaque algorithms. In addition, some methods have a limited range of application. For example, the template matching method often used in static gesture recognition is fast, but it can only process static gestures and cannot recognize continuous gesture actions composed of multiple video frames.
Disclosure of Invention
In view of the above, the present invention provides a real-time gesture recognition method and system based on deep learning, which can accurately recognize gesture actions in real time with high speed and high precision.
In order to achieve the purpose, the invention provides the following technical scheme:
the invention provides a real-time gesture recognition method based on deep learning, which comprises the following steps:
collecting an image and extracting a hand depth image in the image by using a target detection network;
converting the hand depth image into 3D voxelized data, and inputting the data into a V2V-PoseNet network to obtain hand key point data; the V2V-PoseNet network is a pruned V2V-PoseNet network;
and preprocessing the hand key point data, inputting the preprocessed hand key point data into a classification network, and classifying the gesture actions to obtain a gesture classification.
Further, the hand depth image is obtained by:
acquiring a depth image and an RGB image;
inputting the RGB image into a YOLOv3 network to obtain a hand bounding box;
and aligning the depth image with the RGB image, cropping the depth image according to the coordinates of the hand bounding box, and separating the hand region from the background region to obtain the hand depth image.
Further, the hand key point data is obtained by the following steps:
the 3D voxelization is performed as follows: converting the depth image into a 3D volumetric form, re-projecting the points into 3D space, discretizing the continuous space, and setting the voxel values of the discrete space according to the voxel spatial positions and the target object;
and the 3D voxelized data is used as the input of the V2V-PoseNet network, the likelihood that each key point belongs to each voxel is calculated, the position corresponding to the highest likelihood of each key point is identified, and that position is converted into real-world coordinates to become the hand key point data.
Further, the preprocessing comprises the following steps:
determining an initial position: taking a palm root point of the first frame of hand image as a reference point;
determining the size of the hand: adjusting the average distance from the palm root to the five finger roots of the hand image to a preset value, and scaling all coordinates proportionally according to the following formula:
y_ij = (x_ij - x_00) / ( (1/5) Σ_{t=1..5} || x_{0,f_t} - x_00 || )
where y_ij is the coordinate of the j-th joint point in the i-th frame after adjustment, x_ij is the coordinate of the j-th joint point in the i-th frame before adjustment, x_00 is the coordinate of the palm root in frame 0, and f_t is the index of the root of the t-th finger.
Further, the gesture motion classification is performed according to the following steps:
predefining gesture actions as static gestures and dynamic gestures according to the hand key point data;
establishing a static gesture classification network and a dynamic gesture classification network;
and selecting a corresponding classification network according to the static gesture and the dynamic gesture to classify the gesture.
Further, the static gesture classification network is a fully connected network; the dynamic gesture classification network is a spatio-temporal graph convolutional network model; the spatio-temporal graph convolutional network model performs classification according to the following steps: establishing a multi-frame hand joint point spatio-temporal graph and inputting it into the spatio-temporal graph convolutional network model to obtain a whole-graph feature vector; and obtaining the classification result using a fully connected network.
Further, the multi-frame hand joint point spatio-temporal graph is established according to the following steps:
acquiring continuous T frames of gesture images, wherein each gesture image has N key points;
merging and simplifying the spatio-temporal graph formed by the multi-frame hand joint points, merging the node information through a certain correspondence, and calculating the merged node value according to the following formula:
y_ij = Σ_{α ∈ A_j} w_α x_{iα}
where y_ij is the feature vector of the j-th joint point of the i-th frame after merging, A_j is the set of indices of the pre-merge joint points corresponding to the j-th merged joint point, and w_α is the coefficient corresponding to that class.
Wherein the value of each node is calculated according to the following formula:
y_ij = Σ_{t=0..H} Σ_{α ∈ A_ijt} w_jt x_α
where y_ij is the feature vector of the j-th joint point of the i-th frame in the next layer, A_ijt is the set of indices of the points in the spatio-temporal skeleton graph at distance t from the j-th joint point of the i-th frame, w_jt is the corresponding coefficient, and H is the pre-specified maximum range.
The invention also provides a real-time gesture recognition system based on deep learning, which comprises a hand depth image extraction unit, a hand key point detection unit and a gesture action classifier;
the hand depth image extraction unit is used for acquiring an image and extracting the hand depth image from it using a target detection network; the hand depth image is obtained by: acquiring a depth image and an RGB image; inputting the RGB image into a YOLOv3 network to obtain a hand bounding box; and aligning the depth image with the RGB image, cropping the depth image according to the coordinates of the hand bounding box, and separating the hand region from the background region to obtain the hand depth image;
the hand key point detection unit is used for converting the hand depth image into 3D voxelized data and inputting the data into a V2V-PoseNet network to obtain hand key point data; the hand key point data is obtained by the following steps: the 3D voxelization is performed as follows: converting the depth image into a 3D volumetric form, re-projecting the points into 3D space, discretizing the continuous space, and setting the voxel values of the discrete space according to the voxel spatial positions and the target object; the 3D voxelized data is used as the input of the V2V-PoseNet network, the likelihood that each key point belongs to each voxel is calculated, the position corresponding to the highest likelihood of each key point is identified, and that position is converted into real-world coordinates to become the hand key point data;
the gesture action classifier is used for preprocessing the hand key point data and inputting the preprocessed data into a classification network that classifies the gesture action to obtain a gesture category; the gesture action classification is performed according to the following steps: predefining gesture actions as static gestures and dynamic gestures according to the hand key point data; establishing a static gesture classification network and a dynamic gesture classification network; and selecting the corresponding classification network according to whether the gesture is static or dynamic to classify it; the static gesture classification network is a fully connected network; the dynamic gesture classification network is a spatio-temporal graph convolutional network model; the spatio-temporal graph convolutional network model performs classification according to the following steps: establishing a multi-frame hand joint point spatio-temporal graph and inputting it into the spatio-temporal graph convolutional network model to obtain a whole-graph feature vector; and obtaining the classification result using a fully connected network.
Further, the preprocessing comprises the following steps:
determining an initial position: taking a palm root point of the first frame of hand image as a reference point;
determining the size of the hand: adjusting the average distance from the palm root to the five finger roots of the hand image to a preset value, and scaling all coordinates proportionally according to the following formula:
y_ij = (x_ij - x_00) / ( (1/5) Σ_{t=1..5} || x_{0,f_t} - x_00 || )
where y_ij is the coordinate of the j-th joint point in the i-th frame after adjustment, x_ij is the coordinate of the j-th joint point in the i-th frame before adjustment, x_00 is the coordinate of the palm root in frame 0, and f_t is the index of the root of the t-th finger.
Further, the multi-frame hand joint point spatio-temporal graph is established according to the following steps:
acquiring continuous T frames of gesture images, wherein each gesture image has N key points;
merging and simplifying the spatio-temporal graph formed by the multi-frame hand joint points, merging the node information through a certain correspondence, and calculating the merged node value according to the following formula:
y_ij = Σ_{α ∈ A_j} w_α x_{iα}
where y_ij is the feature vector of the j-th joint point of the i-th frame after merging, A_j is the set of indices of the pre-merge joint points corresponding to the j-th merged joint point, and w_α is the coefficient corresponding to that class.
Wherein the value of each node is calculated according to the following formula:
y_ij = Σ_{t=0..H} Σ_{α ∈ A_ijt} w_jt x_α
where y_ij is the feature vector of the j-th joint point of the i-th frame in the next layer, A_ijt is the set of indices of the points in the spatio-temporal skeleton graph at distance t from the j-th joint point of the i-th frame, w_jt is the corresponding coefficient, and H is the pre-specified maximum range.
The invention has the beneficial effects that:
the invention provides a real-time gesture recognition method and a system based on a deep learning method, which are mainly divided into three parts: firstly, extracting a hand surrounding frame based on an RGB image by using a depth learning target detection method for a color and depth (RGBD) image acquired by a depth camera, and separating a hand depth image based on the surrounding frame and the depth image; secondly, converting the depth information into voxel representation, and detecting the positions of key points of the hand by using a three-dimensional convolution network; and finally, according to the positions of the key points, a classification method for using different network models for static gesture actions and dynamic gesture actions is respectively provided. The method provided by the invention combines the advanced learning model, avoids introducing artificial definition characteristics, and has strong generalization capability and expression capability and good expansibility. The existing model used in the system is pruned and optimized according to the task requirements, and the speed of the model is improved on the premise of not influencing the precision. The detection of key points and the classification of actions on the data set MSRAHandd have good effect.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
In order to make the object, technical scheme and beneficial effect of the invention more clear, the invention provides the following drawings for explanation:
FIG. 1 shows the system framework diagram, the YOLOv3 structure diagram, the V2V network structure diagram, and the ST-GCN network structure diagram.
Fig. 2 is a schematic view of a hand detection process.
FIG. 3 is a diagram illustrating recognition results of single-frame and multi-frame gesture actions.
Fig. 4 is a time-space diagram formed by multiple frames of hand joint points.
FIG. 5 is a schematic view of a joint fusion procedure.
Fig. 6 is a schematic diagram illustrating an example of an object detection extraction hand bounding box.
Fig. 7 is a schematic diagram of the test on subset 0 of the MSRA data set.
FIG. 8 is a diagram of a multithreaded pipeline accelerated processing architecture.
In the figure, 1 represents hand depth image extraction, 2 represents hand key point detection, and 3 represents gesture motion classification based on key point positions; the input of hand detection is denoted by 4, the processing of hand detection is denoted by 5, the output of hand detection is denoted by 6, and the hand bounding box is denoted by 7.
Detailed Description
The present invention is further described with reference to the following drawings and specific examples so that those skilled in the art can better understand the present invention and can practice the present invention, but the examples are not intended to limit the present invention.
Example 1
As shown in fig. 1, which includes the system framework diagram, the YOLOv3 structure diagram, the V2V network structure diagram, and the ST-GCN structure diagram, the gesture recognition process in the system provided by this embodiment can be divided into three main stages: hand depth image extraction, hand key point detection, and gesture classification based on key point positions.
First, a depth image and an RGB image that can be aligned with each other are obtained from a depth camera. The RGB image is input into a YOLOv3 network, a hand bounding box is extracted from the RGB image using a deep learning target detection method, and the hand depth image is separated based on the bounding box and the depth image.
Then, the depth information is converted into a voxel representation, and the positions of the hand key points are detected using a three-dimensional convolutional network.
Specifically, the depth image is aligned with the RGB image, the depth image is cropped according to the bounding box coordinates, the hand region is separated from the background region using a thresholding method, and the resulting hand depth image is used as the input for the key point detection of the next stage. In the key point detection stage, the 2D depth map is re-projected and voxelized in 3D, the positions of the hand key points in the depth map are predicted using the V2V network, and the three-dimensional coordinates of the predicted hand key points are output.
Finally, in the classification stage, classification methods using different network models for static and dynamic gesture actions are provided according to the key point positions: the key point coordinates predicted by the V2V network are input into a fully connected network or an ST-GCN classification network, and the specific category of the gesture is output.
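To make the data flow of the three stages concrete, the following minimal sketch outlines one possible per-frame processing loop; every function name in it (detect_hand_bbox, crop_hand_depth, voxelize, v2v_predict, classify_gesture) is a placeholder standing in for the corresponding network described above rather than an identifier defined by the invention.

    # Minimal sketch of the three-stage recognition loop described in this embodiment.
    # All stage functions are placeholders standing in for the networks discussed above.

    def recognize_frame(rgb, depth, frame_buffer, T=32):
        bbox = detect_hand_bbox(rgb)                 # stage 1: YOLOv3 hand bounding box
        hand_depth = crop_hand_depth(depth, bbox)    # stage 1: aligned depth crop + background removal
        voxels = voxelize(hand_depth)                # stage 2: 2D depth map -> 3D occupancy grid
        keypoints = v2v_predict(voxels)              # stage 2: pruned V2V-PoseNet, (21, 3) world coords
        frame_buffer.append(keypoints)
        if len(frame_buffer) >= T:                   # stage 3: dynamic gestures need T consecutive frames
            return classify_gesture(frame_buffer[-T:])   # ST-GCN (dynamic) or fully connected net (static)
        return None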
The system used in this embodiment combines advanced learning models, avoids introducing manually defined features, and has strong generalization and expression capability as well as good extensibility. The existing models used in the system are pruned and optimized according to the task requirements, increasing model speed without affecting accuracy. Key point detection and action classification achieve good results on the MSRA Hand dataset.
With this structure, the three parts are relatively independent: each module can be optimized separately, and improvements to the way the models are connected in series can also be considered.
In this embodiment, the hand bounding box is obtained using the target detection model YOLOv3, the hand key point positions are detected using the 3D-convolution-based V2V network, and the concept of the spatio-temporal graph convolutional network (ST-GCN) is applied to gesture classification. The models are pruned and optimized for the scenario at hand, improving recognition speed without affecting accuracy. The model in this embodiment therefore has the advantages of real-time performance, high precision, and good stability.
Target detection models can be divided into two types. One is the two-stage model, which first generates target candidate boxes (i.e., target positions) with an algorithm and then classifies and regresses the candidate boxes; the other is the one-stage model, which directly predicts the classes and positions of different objects using a single convolutional neural network (CNN). The first category of methods has high accuracy but is slow; the second is fast but less accurate. The first type of model is based on R-CNN, proposed by Ross Girshick et al.; Ross Girshick later improved it and proposed Fast R-CNN, which is faster than R-CNN, and Mask R-CNN was proposed by Kaiming He et al. on the basis of R-CNN with higher accuracy. The second class of models is represented primarily by SSD, by Wei Liu et al., and YOLO, by Joseph Redmon et al.
Depth-based 3D hand key point recognition: hand pose estimation methods can be classified into generative methods, discriminative methods, and hybrid methods. Generative methods assume a predefined hand model and fit it to the input depth image by minimizing a hand-generation cost function; particle swarm optimization (PSO), iterative closest point (ICP), and their combinations are commonly used algorithms for obtaining the optimal hand pose. Discriminative methods locate the hand joints directly from the input depth map. Random-forest-based methods provide fast and accurate performance, but because they require manual features, they have been replaced by more recent CNN-based methods. Tompson et al. first located hand key points with a CNN by estimating a two-dimensional heat map for each hand joint. Ge et al. extended this approach by using a multi-view CNN to estimate a two-dimensional heat map for each view, and also converted the two-dimensional input depth map into a three-dimensional form and directly estimated three-dimensional coordinates through a three-dimensional CNN. Guo et al. proposed a region ensemble network to accurately estimate the three-dimensional coordinates of hand key points, and Chen et al. improved that network by iteratively refining the estimated pose. Hybrid methods combine the generative and discriminative approaches. Oberweger et al. trained a discriminative CNN and a generative CNN through a feedback loop. Zhou et al. predefined a hand model and estimated the parameters of the model rather than predicting the three-dimensional coordinates directly. Ye et al. utilized a spatial attention mechanism and hierarchical PSO.
Classification algorithms: classification means training a classifier on a set of samples whose class labels are known, so that the classifier can classify unknown samples. The naive Bayes classification method proposed by Qiong Wang et al. has simple logic and is easy to implement. Kotsiantis et al. used a support vector machine (SVM) for classification, which can effectively handle high-dimensional problems. Classification using decision trees is applicable to nonlinear problems. With the development of deep learning, classification with artificial neural networks has become a research focus. For classifying human actions from key point skeleton sequences, both hand-crafted-feature methods and deep learning methods can be used: the first describes the dynamics of joint motion manually, while the second uses deep learning to build an end-to-end action recognition model, in which each part of the human body needs to be modeled to improve precision. Treating the skeleton sequence as a spatio-temporal graph was then proposed, so that features are automatically extracted and classified with graph convolution, without introducing manually defined traversal rules or body part definitions, making full use of the characteristics of the skeleton sequence.
Example 2
The gesture recognition system based on deep learning provided by the embodiment comprises a hand depth image extraction unit, a hand key point detection unit and a gesture action classifier, and specifically comprises the following components:
a hand depth image extraction unit: the method comprises the steps of performing gesture/gesture detection by adopting a depth motion sensing camera such as Microsoft Kinect, separating a human body/hand from a background through a segmentation strategy, identifying a specific part through a machine learning model, generating a skeleton model consisting of key points, and finishing the identification of actions based on the skeleton.
This embodiment adopts the RGBD-image-based pipeline of hand extraction, key point recognition, and action recognition. However, for extracting and separating the hand to be detected in the first step, this embodiment uses a target detection method instead of a segmentation strategy.
The specific process is as follows: inputting an RGB picture, extracting a bounding box of a hand by using a machine learning model for target detection, clipping a depth map corresponding to the RGB picture by using a bounding box range, separating the hand from a background region by using a threshold method for the clipped depth map, and acquiring depth information directly related to the hand.
The reasons for adopting the target detection method include: (1) detecting a bounding box costs less than pixel-level segmentation, and current machine-learning-based target detection methods are efficient and stable; (2) regarding separating the background from the hand in depth, due to the particularity of the gesture recognition task, namely that the hand is usually closest to the camera while background objects are usually far from the hand, in most practical applications a depth-threshold filtering method can be used directly, or the OTSU method can be used to binarize the depth image into a hand mask, which removes the background and retains the depth information related to the hand. Moreover, in the key point detection method used in this embodiment, the cropped hand depth map and mask can be used directly as input.
This embodiment adopts the target detection network YOLOv3, which offers both detection speed and high precision and is one of the most widely used target detection models. YOLOv3 predicts at three scales and performs particularly well for small-scale object detection. To further improve detection speed, a channel-pruned network structure is used; it is pre-trained on the open-source Oxford Hand dataset and can perform single-target detection of the hand in an RGB picture. In the experiments, the pre-trained model was found to have high precision for most tasks and to extract the bounding box correctly. When multiple bounding boxes are detected, the content of the largest bounding box is retained by default. The model is insensitive only to the small fraction of cases where the hand is very close to the camera, i.e., the hand occupies most of the picture; for such pictures, bounding box cropping can be omitted, and the depth map is directly depth-filtered to remove the irrelevant background and then used as the input of the next stage.
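As a concrete illustration of this stage, the following minimal sketch crops an aligned depth map with a detected bounding box and removes the background by depth-threshold filtering; it assumes the hand is the object closest to the camera, and the 200 mm margin is only an illustrative value, not one specified by the patent.

    import numpy as np

    def extract_hand_depth(depth, bbox, margin_mm=200):
        """Crop an aligned depth map with the detected hand bounding box and
        remove the background by depth-threshold filtering.

        depth : (H, W) depth map in millimetres, already aligned to the RGB frame
        bbox  : (x1, y1, x2, y2) hand bounding box predicted by YOLOv3 on the RGB frame
        margin_mm : pixels farther than nearest-hand-depth + margin are treated as
                    background (an illustrative value, not taken from the patent)
        """
        x1, y1, x2, y2 = [int(v) for v in bbox]
        crop = depth[y1:y2, x1:x2].astype(np.float32)

        valid = crop[crop > 0]                      # 0 means "no depth reading"
        if valid.size == 0:
            return crop, np.zeros_like(crop, dtype=bool)

        near = valid.min()                          # the hand is assumed closest to the camera
        mask = (crop > 0) & (crop < near + margin_mm)
        hand_depth = np.where(mask, crop, 0.0)      # background pixels set to 0
        return hand_depth, mask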
As shown in fig. 2, fig. 2 is a hand detection process, after a cut-out hand depth image is obtained, hand keypoint detection can be performed by a hand keypoint detection unit, the reference method is V2V-PoseNet, and the keypoint detection process specifically includes: first, the 2D depth image is converted into a 3D volumetric form, points are re-projected into 3D space, and the continuous space is discretized. After the 2D image is voxelized, the V2V network takes the 3D voxelized data as input and estimates the likelihood that each keypoint belongs to each voxel. And then, identifying the position corresponding to the highest likelihood of each key point, converting the position into real world coordinates, and outputting the final result.
The overall architecture of the model V2V network provided by this embodiment is as follows:
generation of input data, input data generation process of V2V network: the V2V network takes 3D voxelized data as input to the model. A voxel is the smallest unit of digital data divided in three dimensions, conceptually similar to the smallest unit pixel in two dimensions. A pixel may have its coordinates represented by a two-dimensional vector and thus a voxel may also have its spatial location represented by a three-dimensional vector.
The pixels of the two-dimensional hand depth map are re-projected into three-dimensional space according to the camera parameters, and the three-dimensional space is divided into discrete cells according to a predefined voxel size. The input value of a voxel is set to 1 if its spatial position is occupied by the target object, and to 0 otherwise. After this step, every voxel is assigned 0 or 1 to indicate whether its three-dimensional position is occupied by the target object, and these voxel values are used as the input of the V2V network to predict the key point coordinates of the hand. Note that the input of the V2V network is three-dimensional voxel data, whereas most existing deep learning models for gesture recognition take two-dimensional depth images as input. In fact, directly regressing three-dimensional key point coordinates from a two-dimensional depth map has two significant drawbacks: first, the two-dimensional depth image suffers from perspective distortion, so the neural network sees a distorted hand when the depth image is fed in directly; second, there is a highly nonlinear mapping between the two-dimensional depth image and the three-dimensional key point coordinates, which prevents the network from predicting the coordinates accurately. Converting the two-dimensional depth image into three-dimensional voxel data and using it as the model input avoids both problems: the voxel data, similar to a point cloud, represents the hand without perspective distortion, and the correspondence between three-dimensional voxels and three-dimensional key point coordinates is simpler, since the dimensions are consistent and the degree of nonlinearity is lower than between a two-dimensional depth image and three-dimensional coordinates, so the model is easier to train.
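The re-projection and discretization just described can be sketched as follows. This is a minimal illustration: the intrinsics fx, fy, cx, cy come from the depth camera, and the grid size of 88 voxels and 300 mm cube side are illustrative settings rather than values fixed by the patent.

    import numpy as np

    def depth_to_points(hand_depth, fx, fy, cx, cy):
        """Re-project the cropped 2D depth map into 3D camera space (pinhole model).
        fx, fy, cx, cy are the depth camera intrinsics (assumed known)."""
        v, u = np.nonzero(hand_depth)                # pixel coordinates of valid hand pixels
        z = hand_depth[v, u]                         # depth in mm
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.stack([x, y, z], axis=1)           # (N, 3) points in mm

    def voxelize(points, grid=88, cube_mm=300.0):
        """Discretize the points into a cubic occupancy grid centred on the hand."""
        center = points.mean(axis=0)
        voxel_size = cube_mm / grid
        idx = np.floor((points - center) / voxel_size + grid / 2).astype(int)
        idx = idx[((idx >= 0) & (idx < grid)).all(axis=1)]
        occupancy = np.zeros((grid, grid, grid), dtype=np.float32)
        occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0   # 1 = occupied by the hand, 0 = empty
        return occupancy, center, voxel_size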
Building the V2V network requires four components. The first is the volumetric basic block, consisting of a three-dimensional convolution, normalization, and an activation function, located in the first and last parts of the network; the second is the volumetric residual block, derived from the two-dimensional residual block; the third is the volumetric downsampling block, which is the same as a volumetric max pooling layer; the last is the volumetric upsampling block, which consists of a volumetric deconvolution layer, a volumetric batch normalization layer, and an activation function. Adding the batch normalization layer and the activation function to the deconvolution layer helps simplify the learning process.
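The four building blocks described above can be assembled from standard PyTorch 3D layers. The following is a minimal sketch rather than the exact implementation used by the invention; channel counts and kernel sizes are left as parameters.

    import torch.nn as nn

    class VolumetricBasicBlock(nn.Sequential):
        """3D convolution + batch normalization + activation (first and last parts of the network)."""
        def __init__(self, in_ch, out_ch, kernel):
            super().__init__(
                nn.Conv3d(in_ch, out_ch, kernel, padding=kernel // 2),
                nn.BatchNorm3d(out_ch),
                nn.ReLU(inplace=True),
            )

    class VolumetricResidualBlock(nn.Module):
        """3D analogue of the 2D residual block."""
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv3d(ch, ch, 3, padding=1), nn.BatchNorm3d(ch), nn.ReLU(inplace=True),
                nn.Conv3d(ch, ch, 3, padding=1), nn.BatchNorm3d(ch),
            )
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.act(x + self.body(x))

    # Volumetric downsampling block: identical to 3D max pooling.
    downsample = nn.MaxPool3d(2)

    class VolumetricUpsamplingBlock(nn.Sequential):
        """3D deconvolution + batch normalization + activation."""
        def __init__(self, in_ch, out_ch):
            super().__init__(
                nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2),
                nn.BatchNorm3d(out_ch),
                nn.ReLU(inplace=True),
            )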
The V2V network performs voxel-to-voxel prediction. It is therefore based on a 3D convolutional network architecture that handles the Z axis as an additional spatial axis. As shown in fig. 1, the V2V network starts with a 7×7×7 volumetric basic block and a downsampling block. After downsampling the feature map, useful local features are extracted by three consecutive residual blocks. The output of the volumetric residual blocks then passes through an encoder and a decoder in sequence. The network feeds the output of the encoder-decoder into two 1×1×1 volumetric basic blocks and one 1×1×1 volumetric convolutional layer, finally obtaining the likelihood that each key point belongs to each voxel, which is the output of the network and the final target of the model.
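The last step, turning the per-keypoint likelihood volumes into real-world coordinates, can be sketched as follows; it assumes the same cube centre and voxel size used when the input was voxelized (as in the voxelization sketch above), and the grid size of 88 is again only an illustrative value.

    import numpy as np

    def decode_keypoints(likelihood, center, voxel_size, grid=88):
        """Convert the per-keypoint likelihood volumes output by the V2V network into
        real-world coordinates.

        likelihood : (K, grid, grid, grid) array, one likelihood volume per keypoint
        center, voxel_size : the cube centre and voxel size used during voxelization
        """
        K = likelihood.shape[0]
        coords = np.zeros((K, 3), dtype=np.float32)
        for k in range(K):
            flat = np.argmax(likelihood[k])                     # voxel with the highest likelihood
            i, j, l = np.unravel_index(flat, likelihood[k].shape)
            # voxel index -> metric offset from the cube centre -> world coordinates (mm)
            coords[k] = (np.array([i, j, l]) - grid / 2 + 0.5) * voxel_size + center
        return coords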
Pruning of the V2V network: in this embodiment, the V2V network is used to predict the key point locations. Although the V2V network has high accuracy and a small average key point error, it uses 3D convolution, which results in a long prediction time and makes real-time gesture recognition difficult. Therefore, when using the V2V network to predict key point positions, this embodiment prunes the V2V network to a certain extent, simplifying the network model and increasing its computation speed. The V2V-PoseNet network provided in this embodiment is thus a pruned V2V-PoseNet network. The pruning consists of setting the output dimension of the encoder of the V2V-PoseNet network lower than the originally set value: specifically, the output dimension of the encoder is reduced from the original 128 dimensions to 96 dimensions. Correspondingly, the input dimension of the decoder is reduced from 128 to 96 dimensions, and the output dimension of the decoder can be determined according to the actual task.
Through testing, the modified model improves the operation speed on the premise of ensuring the precision.
Gesture action classifier: after the hand key points are acquired, gestures can be classified into predefined action semantics based on the hand key points. Gestures can be divided into two types. One type is static gestures, where a single frame forms a gesture containing semantics; as shown in fig. 3, a schematic diagram of single-frame and multi-frame gesture action recognition results, any single frame taken from gestures (a) and (b) can represent '1' and '2', respectively. The other type is dynamic gestures, where a sequence of continuous frames forms a gesture containing semantics while a single frame is meaningless; for example, (c) and (d) in fig. 3 constitute a leftward waving motion and a rightward waving motion, respectively, but any single frame taken from them carries no meaning. When classifying the two types of gestures, different classification approaches are adopted in view of their respective characteristics. The left column of fig. 3 shows gestures whose semantics are formed by a single frame; the right column shows gestures whose semantics are formed by multiple frames, for which a single frame is meaningless.
Pretreatment: in the gesture recognition problem, the initial position and size of the hand should not affect the gesture semantics, but in reality these data may be very different, thus requiring preprocessing of the hand key point data.
To handle the initial position, the palm root point of the first frame of the gesture (the only frame, if the gesture is static) is taken as the reference point, and the input coordinates of all joint points of this frame and subsequent frames are the actual coordinates relative to the reference point. To handle the hand size, the average distance from the palm root to the five finger roots is adjusted to 1, and all coordinates are scaled proportionally, as shown in equation (1).
y_ij = (x_ij - x_00) / ( (1/5) Σ_{t=1..5} || x_{0,f_t} - x_00 || )    (1)
where y_ij is the coordinate of the j-th joint point in the i-th frame after adjustment, x_ij is the coordinate of the j-th joint point in the i-th frame before adjustment, x_00 is the coordinate of the palm root in frame 0, and f_t is the index of the root of the t-th finger.
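A minimal sketch of this preprocessing, applied to a sequence of predicted joint coordinates, is given below; the joint indices chosen for the palm root and the five finger roots are assumptions that depend on the skeleton layout and are not fixed by the patent.

    import numpy as np

    def normalize_keypoints(seq, palm_root=0, finger_roots=(1, 5, 9, 13, 17)):
        """Apply the preprocessing of equation (1) to a gesture sequence.

        seq : (T, N, 3) array of predicted joint coordinates, frame 0 first.
        palm_root / finger_roots : joint indices for the palm root and the five
            finger roots; the values here are illustrative and depend on the skeleton layout.
        """
        origin = seq[0, palm_root]                                   # x_00, the reference point
        scale = np.mean(np.linalg.norm(seq[0, list(finger_roots)] - origin, axis=1))
        return (seq - origin) / scale                                # y_ij = (x_ij - x_00) / scale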
Classification models: for static gestures, a fully connected network is used for classification.
For dynamic gestures, the input scale is large: with an input skeleton of 256 frames, each containing the three-dimensional coordinates of the 21 hand joint points, the input size is about 16000. Using full connections would generate a large number of parameters, seriously affecting efficiency and making training difficult. The graph convolution approach is therefore applied to the gesture classification problem.
The continuous T frames are regarded as one gesture, each frame having N key points, and the multi-frame hand joint point spatio-temporal graph is established according to the following rules: each joint of each frame is a node, T×N in total; the same joint points of adjacent frames are connected by edges; within the same frame, joint points that are physically adjacent on the hand are connected by edges. The resulting graph structure is shown in fig. 4, the spatio-temporal graph formed by multiple frames of hand joint points.
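The three construction rules can be sketched as an adjacency matrix over the T×N nodes, as below; the intra-frame edge list shown is only an example layout for a 21-joint hand, not the exact skeleton defined by the invention.

    import numpy as np

    # Example intra-frame hand edges (pairs of joint indices) for a 21-joint hand;
    # the real edge set depends on the skeleton layout used.
    HAND_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4),        # thumb chain
                  (0, 5), (5, 6), (6, 7), (7, 8),        # index finger
                  (0, 9), (9, 10), (10, 11), (11, 12),   # middle finger
                  (0, 13), (13, 14), (14, 15), (15, 16), # ring finger
                  (0, 17), (17, 18), (18, 19), (19, 20)] # little finger

    def build_spatiotemporal_adjacency(T, N=21, edges=HAND_EDGES):
        """Adjacency matrix of the T*N-node spatio-temporal graph: spatial edges inside
        each frame, temporal edges between the same joint in adjacent frames."""
        A = np.zeros((T * N, T * N), dtype=np.float32)
        for t in range(T):
            for i, j in edges:                       # physically adjacent joints in one frame
                A[t * N + i, t * N + j] = A[t * N + j, t * N + i] = 1.0
            if t + 1 < T:
                for n in range(N):                   # same joint in adjacent frames
                    A[t * N + n, (t + 1) * N + n] = A[(t + 1) * N + n, t * N + n] = 1.0
        return A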
Conventional graph convolution methods generally do not change the number of nodes or the topology of the graph. However, when the number of layers is large, this can cause serious interference from irrelevant information after repeated convolutions; for example, in the gesture recognition problem, the coordinate of the index fingertip should not be directly related to the coordinate of the little finger root, yet after information diffuses between nodes over many convolutions, a node's information becomes recorded in irrelevant nodes. Therefore, the graph is merged and simplified, providing the model with a correlation reference based on real-world prior knowledge. Specifically, a graph simpler than the original one is constructed: it is generated according to the real connection relationships but has fewer nodes than the total number of joints. During convolution, the node information is merged through a certain correspondence, and the merged node value is given by formula (2):
y_ij = Σ_{α ∈ A_j} w_α x_{iα}    (2)
where y_ij is the feature vector of the j-th joint point of the i-th frame after merging, A_j is the set of indices of the pre-merge joint points corresponding to the j-th merged joint point, provided by prior knowledge, and w_α is the coefficient corresponding to that class. The merged graph used here is shown in fig. 5: the white, red, green, yellow, blue, and black nodes in fig. 5(b) contain the information of the white, red, green, yellow, blue, and black nodes in fig. 5(a), respectively. Fig. 5 is a schematic diagram of the joint merging method. On the merged graph, the graph convolution method can be used, and the value of each node is calculated according to formula (3):
y_ij = Σ_{t=0..H} Σ_{α ∈ A_ijt} w_jt x_α    (3)
where y_ij is the feature vector of the j-th joint point of the i-th frame in the next layer, A_ijt is the set of indices of the points in the spatio-temporal skeleton graph at distance t from the j-th joint point of the i-th frame, w_jt is the corresponding coefficient, and H is the pre-specified maximum range. After several graph convolution layers, a feature vector for the whole graph is obtained, and finally a classification result is obtained using a fully connected network.
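The two formulas can be illustrated with the following minimal sketch: merge_joints implements the weighted merging of equation (2), and st_graph_conv implements the neighborhood sum of equation (3). Modelling the per-joint coefficient w_jt as one weight matrix shared per distance t, and the particular merge groups and weights, are simplifying assumptions of this sketch rather than details fixed by the patent.

    import numpy as np

    def merge_joints(x, groups, weights):
        """Equation (2): y_ij = sum over alpha in A_j of w_alpha * x_{i,alpha}.

        x       : (T, N, C) per-frame joint features
        groups  : list of index lists; groups[j] is A_j, the original joints merged into node j
        weights : list of per-group coefficient arrays; weights[j][k] multiplies groups[j][k]
        """
        T, _, C = x.shape
        y = np.zeros((T, len(groups), C), dtype=x.dtype)
        for j, (A_j, w) in enumerate(zip(groups, weights)):
            y[:, j] = np.tensordot(np.asarray(w), x[:, A_j], axes=([0], [1]))
        return y

    def st_graph_conv(x, A, W, H=2):
        """Equation (3): for each node, sum the features of nodes at graph distance t <= H,
        each distance weighted by a matrix W[t] (shared across joints in this sketch).

        x : (T*N, C_in) node features, A : (T*N, T*N) adjacency, W : list of (C_in, C_out)
        """
        n = A.shape[0]
        within = np.eye(n, dtype=bool)         # nodes reachable within t hops (t = 0 initially)
        out = x @ W[0]                         # t = 0: the node itself
        for t in range(1, H + 1):
            new_within = within | ((within.astype(np.float32) @ A) > 0)
            mask = new_within & ~within        # nodes at distance exactly t
            out += mask.astype(np.float32) @ (x @ W[t])
            within = new_within
        return out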
In this example, hand detection was implemented in PyTorch, and the following experimental results were all obtained on an Nvidia 1080 Ti GPU. The pruned YOLOv3 achieves a target detection mAP of 0.76 on the test set of the original dataset. The model can also correctly recognize imperfect RGB pictures taken by the depth camera; because real labels (ground truth) are lacking in these tests, only a detection example is shown, with the example RGB picture in fig. 6 selected from the NYU Hand dataset. In the tests, the detection time with CUDA acceleration stays around 30 ms. Fig. 6 is a schematic diagram of target detection extracting the hand bounding box.
In testing the accuracy of the key point locations predicted by the V2V network, this embodiment uses the MSRA gesture dataset, which contains 9 subsets, each with 17 gesture categories and 21 labeled key points per hand. The performance indicator used to evaluate the accuracy of the predicted key point locations is the average key point error, i.e., the distance in millimeters (mm) between the key point position predicted by the model and the true key point position.
The following is a comparison of the test results of the V2V network against other currently better hand keypoint prediction models on the MSRA gesture data set.
Table 1: average keypoint error contrast on MSRA gesture datasets for V2V networks and other models
Among the compared models, Multiview, Occlusion, and Crossing Nets can be collectively referred to as deep learning image algorithm models, and DeepPrior++ is a deep 3D hand gesture recognition model.
The V2V network performs well on the MSRA gesture dataset, with an average key point error of only 7.49 mm, far smaller than the other hand key point prediction models. In the experiment in this example, an even better result is obtained when subset 0 is used as the test set and subsets 1-8 as the training set: the average key point error is only 7.38 mm. Fig. 7 shows, on subset 0 of the dataset, the relationship between the allowed spatial error and the proportion of key points falling within the prediction range; fig. 7 is the test on subset 0 of the MSRA dataset.
In addition, in the embodiment, the V2V network is pruned, so that the calculation speed of the model is increased on the premise of not influencing the accuracy. In order to verify the changes of the model precision and speed before and after pruning, in this embodiment, a test is performed on the 3 rd subset of the MSRA gesture data set, and the average key point error and the average time of the model are explored. Experiments show that the average key point error before pruning is 10.64mm, and the average key point error after pruning is 11.02 mm; the average detection time before pruning is 0.485s, and the average detection time after pruning is 0.32 s. From the experimental results, the accuracy of the model is not reduced too much through pruning operation, and the detection speed is greatly improved.
Gesture action classification: the classifier was trained and tested on the MSRA dataset, with the accuracy shown in Table 2. In Table 2, continuous sampling means taking T consecutive frames from the 500 frames in the dataset, and random sampling means randomly selecting T frames from the dataset and concatenating them as one gesture.
Table 2: single-frame and multi-frame gesture classification precision
The MSRA dataset consists of single-frame gestures, so the single-frame accuracy is high. With continuous sampling, the model can use multiple frames to further confirm the gesture type, so the accuracy of continuously sampled multi-frame input is even higher. Because the frames in the dataset are independent of each other, randomly sampled sequences are chaotic and may produce unexpected gesture categories (for example, '1' and '2' gestures moving in the same direction may be regarded as the same gesture), so the multi-frame classification accuracy drops under random sampling. It should further be noted that the classification accuracy depends heavily on the coordinate-regression accuracy, so the regressed joint coordinates should be used as input during testing.
In this embodiment, a real-time gesture recognition system based on deep learning is designed and implemented, the conventional process is partially improved, and new effective ideas are provided, for example, switching the hand extraction method from segmentation to target detection, converting the 2D depth map into 3D voxels before processing, and performing feature extraction with spatio-temporal graph convolution. The problem is divided reasonably so that each sub-problem can be optimized separately as needed, and the current state-of-the-art deep learning methods for the related sub-problems are combined into one system; the model therefore has strong generalization and expression capability and good extensibility. In the specific implementation, the existing networks are pruned and their usage optimized according to the task requirements, improving the model's efficiency and precision in the gesture recognition problem, and thereby its real-time performance and accuracy.
In the system, the input data of the next module is from the output of the previous module, so that the precision of the next module depends on the previous module. To address this problem, in order to enhance the robustness of the system, the improvement considered in the present embodiment is to adopt a network structure that is input for multiple purposes. For example, the depth information generated by the first partial hand detection may be reused as supplementary information when the third partial motion is classified.
In the experiments, the classifier model is fast while key point detection takes a long time; multithreading is expected to improve the processing speed of the system and enhance real-time performance. In addition, for gestures composed of multiple frames, the classification network needs to process the multiple frames as one group. To address system real-time performance and the continuous processing of multi-frame pictures, the pipeline working mode shown in fig. 8 is designed. For example, assuming that the pre-classification processing time is 3 times the acquisition time of each picture, 4 threads take turns performing the preprocessing up to key point detection, as shown in fig. 8(a), where, from left to right, busy time, idle time, busy time, and idle time alternate and a certain idle margin is kept to prevent collisions; the processed key point skeletons are then stacked in time order in a unified area to generate time-series data, as shown in fig. 8(b), and finally input into the classification model. This method can improve throughput when memory is sufficient and the preprocessing speed is relatively low. Fig. 8 is a diagram of the multithreaded pipeline accelerated processing architecture.
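One possible realization of this pipeline, sketched with Python threads and queues, is shown below; the number of worker threads, the queue sizes, and the functions detect_keypoints and classify_gesture are placeholders standing in for the stages described above rather than parts of the actual implementation.

    import queue, threading

    frame_q = queue.Queue(maxsize=8)      # (index, rgb, depth) tuples from the capture thread
    result_q = queue.Queue()              # (index, keypoints) tuples, possibly out of order

    def keypoint_worker():
        # One of several workers running the slow stages (detection + V2V) in parallel.
        while True:
            idx, rgb, depth = frame_q.get()
            keypoints = detect_keypoints(rgb, depth)     # placeholder for stages 1 and 2
            result_q.put((idx, keypoints))
            frame_q.task_done()

    def classifier_loop(T=32):
        # Restore frame order, stack T consecutive skeletons, and classify each group.
        pending, expected, buffer = {}, 0, []
        while True:
            idx, kp = result_q.get()
            pending[idx] = kp
            while expected in pending:                   # re-order results by frame index
                buffer.append(pending.pop(expected))
                expected += 1
            if len(buffer) >= T:
                label = classify_gesture(buffer[-T:])    # placeholder for the ST-GCN classifier
                print(label)

    for _ in range(4):                                   # 4 preprocessing threads, as in the example above
        threading.Thread(target=keypoint_worker, daemon=True).start()
    threading.Thread(target=classifier_loop, daemon=True).start()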
The above-mentioned embodiments are merely preferred embodiments for fully illustrating the present invention, and the scope of the present invention is not limited thereto. The equivalent substitution or change made by the technical personnel in the technical field on the basis of the invention is all within the protection scope of the invention. The protection scope of the invention is subject to the claims.

Claims (10)

1. A real-time gesture recognition method based on deep learning is characterized in that: the method comprises the following steps:
collecting an image and extracting a hand depth image in the image by using a target detection network;
converting the hand depth image into 3D voxelized data, and inputting the data into a V2V-PoseNet network to obtain hand key point data; the V2V-PoseNet network is a pruned V2V-PoseNet network;
and preprocessing the hand key point data, inputting the preprocessed hand key point data into a classification network, and classifying the gesture actions to obtain a gesture classification.
2. The deep learning based real-time gesture recognition method of claim 1, characterized in that: the hand depth image is obtained by:
acquiring a depth image and an RGB image;
inputting the RGB image into a YOLOv3 network to obtain a hand bounding box;
and aligning the depth image with the RGB image, cropping the depth image according to the coordinates of the hand bounding box, and separating the hand region from the background region to obtain the hand depth image.
3. The deep learning based real-time gesture recognition method of claim 1, characterized in that: the hand key point data is obtained by the following steps:
the 3D voxelization is performed as follows: converting the depth image into a 3D volumetric form, re-projecting the points into 3D space, discretizing the continuous space, and setting the voxel values of the discrete space according to the voxel spatial positions and the target object;
and the 3D voxelized data is used as the input of the V2V-PoseNet network, the likelihood that each key point belongs to each voxel is calculated, the position corresponding to the highest likelihood of each key point is identified, and that position is converted into real-world coordinates to become the hand key point data.
4. The deep learning based real-time gesture recognition method of claim 1, characterized in that: the preprocessing comprises the following steps:
determining an initial position: taking a palm root point of the first frame of hand image as a reference point;
determining the size of the hand: adjusting the average distance from the palm root to the five finger roots of the hand image to a preset value, and scaling all coordinates proportionally according to the following formula:
y_ij = (x_ij - x_00) / ( (1/5) Σ_{t=1..5} || x_{0,f_t} - x_00 || )
where y_ij is the coordinate of the j-th joint point in the i-th frame after adjustment, x_ij is the coordinate of the j-th joint point in the i-th frame before adjustment, x_00 is the coordinate of the palm root in frame 0, and f_t is the index of the root of the t-th finger.
5. The deep learning based real-time gesture recognition method of claim 1, characterized in that: the gesture motion classification is carried out according to the following steps:
predefining gesture actions as static gestures and dynamic gestures according to the hand key point data;
establishing a static gesture classification network and a dynamic gesture classification network;
and selecting a corresponding classification network according to the static gesture and the dynamic gesture to classify the gesture.
6. The deep learning based real-time gesture recognition method of claim 5, characterized in that: the static gesture classification network is a fully connected network; the dynamic gesture classification network is a spatio-temporal graph convolutional network model; the spatio-temporal graph convolutional network model performs classification according to the following steps: establishing a multi-frame hand joint point spatio-temporal graph and inputting it into the spatio-temporal graph convolutional network model to obtain a whole-graph feature vector; and obtaining the classification result using a fully connected network.
7. The deep learning based real-time gesture recognition method of claim 6, characterized in that: the multi-frame hand joint point spatio-temporal graph is established according to the following steps:
acquiring continuous T frames of gesture images, wherein each gesture image has N key points;
merging and simplifying the spatio-temporal graph formed by the multi-frame hand joint points, merging the node information through a certain correspondence, and calculating the merged node value according to the following formula:
y_{ij} = \sum_{\alpha \in A_{j}} w_{\alpha}\, x_{i\alpha}
wherein y_{ij} is the feature vector of the j-th merged joint point in the i-th frame, x_{i\alpha} is the feature vector of pre-merge joint point \alpha in the i-th frame, A_{j} is the set of indices of the pre-merge joint points corresponding to the j-th merged joint point, and w_{\alpha} is the corresponding coefficient.
Wherein the value of each node is calculated according to the following formula:
y_{ij} = \sum_{t=0}^{H} \sum_{k \in A_{ijt}} w_{jt}\, x_{k}
wherein y_{ij} is the feature vector of the j-th joint point of the i-th frame in the next layer, x_{k} is the feature vector of point k in the current layer, A_{ijt} is the set of indices of points in the space-time skeleton graph whose distance from the j-th joint point of the i-th frame is t, w_{jt} is the corresponding coefficient, and H is the pre-specified maximum range.
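The two formulas of this claim can be sketched in Python as follows; the merge groups, coefficient arrays and neighbour lists are illustrative inputs supplied by the caller, not structures defined by the patent.

import numpy as np

def merge_nodes(x, merge_groups, weights):
    """x: (T, N, C) node features; merge_groups[j] plays the role of A_j, weights[a] of w_a."""
    T, _, C = x.shape
    y = np.zeros((T, len(merge_groups), C), dtype=x.dtype)
    for j, group in enumerate(merge_groups):
        for a in group:
            y[:, j] += weights[a] * x[:, a]     # y_ij = sum over A_j of w_a * x_ia
    return y

def st_graph_conv(x, neighbors, w):
    """x: (T, N, C); neighbors[i][j][t] lists (frame, joint) pairs at distance t
    from joint j of frame i; w[j][t] is the coefficient for that distance."""
    T, N, _ = x.shape
    y = np.zeros_like(x)
    for i in range(T):
        for j in range(N):
            for t, idx_set in enumerate(neighbors[i][j]):       # t = 0 .. H
                for (p, q) in idx_set:
                    y[i, j] += w[j][t] * x[p, q]
    return y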
8. A real-time gesture recognition system based on deep learning, characterized in that: the system comprises a hand depth image extraction unit, a hand key point detection unit and a gesture action classifier;
the hand depth image extraction unit is used for acquiring an image and extracting a hand depth image from the image by using a target detection network; the hand depth image is obtained by: acquiring a depth image and an RGB image; inputting the RGB image into a YOLOv3 network to obtain a hand bounding box; aligning the depth image with the RGB image, cropping the depth image according to the coordinates of the hand bounding box, and separating the hand region from the background region to obtain the hand depth image;
the hand key point detection unit is used for converting the hand depth image into 3D voxelized data and inputting the data into a V2V-PoseNet network to obtain hand key point data; the hand key point data is obtained by the following steps: the 3D voxelization is performed as follows: converting the depth image into a 3D volumetric form by re-projecting its points into 3D space, discretizing the continuous space, and setting the value of each voxel according to its spatial position relative to the target object; the 3D voxelized data is then used as the input of the V2V-PoseNet network, the likelihood that each key point lies in each voxel is calculated, the position with the highest likelihood is identified for each key point, and that position is converted into real-world coordinates to obtain the hand key point data;
the gesture action classifier is used for preprocessing the hand key point data and inputting the preprocessed hand key point data into a classification network to classify the gesture actions and obtain a gesture class; the gesture motion classification is carried out according to the following steps: predefining gesture actions as static gestures and dynamic gestures according to the hand key point data; establishing a static gesture classification network and a dynamic gesture classification network; selecting the corresponding classification network according to whether the gesture is static or dynamic and classifying the gesture with it; the static gesture classification network is a fully connected network; the dynamic gesture classification network is a space-time graph convolution network model; the space-time graph convolution network model performs classification according to the following steps: establishing a multi-frame hand joint point space-time graph and inputting it into the space-time graph convolution network model to obtain a full-graph feature vector; the classification result is then obtained using a fully connected network.
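An end-to-end sketch of how the three units of the system could be chained is given below, reusing the illustrative helpers sketched earlier in this document (extract_hand_depth, depth_to_points, voxelize, heatmaps_to_keypoints, normalize_sequence, classify_gesture). The window length, the handling of the cube centre and the omission of crop-offset correction during re-projection are simplifications of this example, not features of the claimed system.

import numpy as np

class GestureRecognitionSystem:
    """Toy wiring of the three units; every component here is an assumed stand-in."""

    def __init__(self, pose_net, static_net, dynamic_net, window=16):
        self.pose_net = pose_net          # assumed: callable returning (1, K, R, R, R) likelihood volumes
        self.static_net = static_net
        self.dynamic_net = dynamic_net
        self.window = window
        self.buffer = []

    def process_frame(self, rgb, depth, intrinsics, cube_center):
        # Unit 1: hand depth image extraction (crop offset ignored in this sketch).
        hand_depth = extract_hand_depth(rgb, depth)
        # Unit 2: voxelization and key point read-out in the style of V2V-PoseNet.
        points = depth_to_points(hand_depth, *intrinsics)
        grid = voxelize(points, cube_center)
        heatmaps = self.pose_net(grid[None, None])
        keypoints = heatmaps_to_keypoints(np.asarray(heatmaps)[0], cube_center)
        # Unit 3: buffer a window of frames, preprocess, and classify.
        self.buffer.append(keypoints)
        if len(self.buffer) < self.window:
            return None                   # still filling the sliding window
        seq = normalize_sequence(np.stack(self.buffer[-self.window:]))
        return classify_gesture(seq, self.static_net, self.dynamic_net)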
9. The deep learning based real-time gesture recognition system of claim 8, wherein: the preprocessing comprises the following steps:
determining an initial position: taking a palm root point of the first frame of hand image as a reference point;
determining the size of the hand: scaling the hand image so that the average distance from the palm root to the five finger roots equals a preset value, with all coordinates subjected to an equal-scale transformation according to the following formula:
y_{ij} = \frac{d}{\frac{1}{5}\sum_{t=1}^{5}\left\lVert x_{0 f_t}-x_{00}\right\rVert}\left(x_{ij}-x_{00}\right)
wherein y_{ij} is the coordinate of the j-th joint point in the i-th frame after adjustment, x_{ij} is the coordinate of the j-th joint point in the i-th frame before adjustment, x_{00} is the coordinate of the palm root in frame 0, d is the preset value, and f_t is the index of the base of the t-th finger.
10. The deep learning based real-time gesture recognition system of claim 8, wherein: the multi-frame hand joint point space-time graph is established according to the following steps:
acquiring T consecutive frames of gesture images, wherein each gesture image has N key points;
merging and simplifying the space-time graph formed by the multi-frame hand joint points, combining node information according to a predefined correspondence, and calculating each merged node value according to the following formula:
y_{ij} = \sum_{\alpha \in A_{j}} w_{\alpha}\, x_{i\alpha}
wherein y_{ij} is the feature vector of the j-th merged joint point in the i-th frame, x_{i\alpha} is the feature vector of pre-merge joint point \alpha in the i-th frame, A_{j} is the set of indices of the pre-merge joint points corresponding to the j-th merged joint point, and w_{\alpha} is the corresponding coefficient.
Wherein the value of each node is calculated according to the following formula:
y_{ij} = \sum_{t=0}^{H} \sum_{k \in A_{ijt}} w_{jt}\, x_{k}
wherein y_{ij} is the feature vector of the j-th joint point of the i-th frame in the next layer, x_{k} is the feature vector of point k in the current layer, A_{ijt} is the set of indices of points in the space-time skeleton graph whose distance from the j-th joint point of the i-th frame is t, w_{jt} is the corresponding coefficient, and H is the pre-specified maximum range.
CN202110574202.7A 2021-05-25 2021-05-25 Real-time gesture recognition method and system based on deep learning Active CN113269089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110574202.7A CN113269089B (en) 2021-05-25 2021-05-25 Real-time gesture recognition method and system based on deep learning

Publications (2)

Publication Number Publication Date
CN113269089A true CN113269089A (en) 2021-08-17
CN113269089B CN113269089B (en) 2023-07-18

Family

ID=77232935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110574202.7A Active CN113269089B (en) 2021-05-25 2021-05-25 Real-time gesture recognition method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN113269089B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636725A (en) * 2015-02-04 2015-05-20 华中科技大学 Gesture recognition method based on depth image and gesture recognition system based on depth images
CN104899600A (en) * 2015-05-28 2015-09-09 北京工业大学 Depth map based hand feature point detection method
CN106845335A (en) * 2016-11-29 2017-06-13 歌尔科技有限公司 Gesture identification method, device and virtual reality device for virtual reality device
CN107423698A (en) * 2017-07-14 2017-12-01 华中科技大学 A kind of gesture method of estimation based on convolutional neural networks in parallel
CN107688391A (en) * 2017-09-01 2018-02-13 广州大学 A kind of gesture identification method and device based on monocular vision
CN110333783A (en) * 2019-07-10 2019-10-15 中国科学技术大学 A kind of unrelated gesture processing method and system for robust myoelectric control
CN110852311A (en) * 2020-01-14 2020-02-28 长沙小钴科技有限公司 Three-dimensional human hand key point positioning method and device
CN111209861A (en) * 2020-01-06 2020-05-29 浙江工业大学 Dynamic gesture action recognition method based on deep learning
CN112183198A (en) * 2020-08-21 2021-01-05 北京工业大学 Gesture recognition method for fusing body skeleton and head and hand part profiles
CN112241204A (en) * 2020-12-17 2021-01-19 宁波均联智行科技有限公司 Gesture interaction method and system of vehicle-mounted AR-HUD

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GYEONGSIK MOON ET AL: "V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation From a Single Depth Map", arXiv:1711.07399v3 [cs.CV], 16 August 2018, pages 1-14 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926905A (en) * 2022-05-31 2022-08-19 江苏濠汉信息技术有限公司 Cable accessory process distinguishing method and system based on gesture recognition with gloves
CN114926905B (en) * 2022-05-31 2023-12-26 江苏濠汉信息技术有限公司 Cable accessory procedure discriminating method and system based on gesture recognition with glove
CN115220636A (en) * 2022-07-14 2022-10-21 维沃移动通信有限公司 Virtual operation method and device, electronic equipment and readable storage medium
CN115220636B (en) * 2022-07-14 2024-04-26 维沃移动通信有限公司 Virtual operation method, virtual operation device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113269089B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
Bao et al. Monofenet: Monocular 3d object detection with feature enhancement networks
Han et al. Dynamic scene semantics SLAM based on semantic segmentation
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
CN107871106B (en) Face detection method and device
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
Cong et al. Global-and-local collaborative learning for co-salient object detection
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Wu et al. Simultaneous face detection and pose estimation using convolutional neural network cascade
CN112818862A (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN111612008A (en) Image segmentation method based on convolution network
CN110674741A (en) Machine vision gesture recognition method based on dual-channel feature fusion
CN112800937A (en) Intelligent face recognition method
Raut Facial emotion recognition using machine learning
CN112101208A (en) Feature series fusion gesture recognition method and device for elderly people
CN113269089B (en) Real-time gesture recognition method and system based on deep learning
EP1801731B1 (en) Adaptive scene dependent filters in online learning environments
CN115661246A (en) Attitude estimation method based on self-supervision learning
CN113435319A (en) Classification method combining multi-target tracking and pedestrian angle identification
CN112767478B (en) Appearance guidance-based six-degree-of-freedom pose estimation method
Cho et al. Learning local attention with guidance map for pose robust facial expression recognition
CN113052156B (en) Optical character recognition method, device, electronic equipment and storage medium
Wang et al. Research on gesture recognition and classification based on attention mechanism
Zhou et al. Research on recognition and application of hand gesture based on skin color and SVM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant