CN110046544A - Digital gesture identification method based on convolutional neural networks - Google Patents
Digital gesture identification method based on convolutional neural networks
- Publication number
- CN110046544A (application CN201910147442.1A)
- Authority
- CN
- China
- Prior art keywords
- gesture
- convolutional neural
- image
- neural network
- layer
- Prior art date: 2019-02-27
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F18/00—Pattern recognition > G06F18/20—Analysing > G06F18/24—Classification techniques > G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING > G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data > G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands > G06V40/107—Static hand or arm
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING > G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data > G06V40/20—Movements or behaviour, e.g. gesture recognition > G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Abstract
The invention relates to a digital gesture recognition method based on a convolutional neural network, comprising the following steps: using a Kinect depth camera to collect gesture images of the ten digit classes and filtering the gesture images; using the filtered images to build a sample set for each digital gesture representation, as follows: performing morphological preprocessing on the filtered gesture images, classifying and labelling the image information, obtaining a sample set for each digital gesture representation, and dividing the samples into a training set and a test set; constructing a convolutional neural network (CNN); inputting the training sample set, extracting image features, and performing classification training; and using the trained convolutional neural network to recognize the images in the test data set.
Description
Technical Field
The invention relates to the fields of deep learning and image processing, and in particular to gesture recognition based on a convolutional neural network (CNN).
Background Art
Gesture recognition has long been a popular research topic. Digital gesture recognition must address many problems, including data acquisition, image processing and selection, the choice of input sample representation, the choice of pattern-recognition classifier, and the supervised training of the recognizer on a sample set.
Gestures are an inseparable part of human-to-human communication, and gesture recognition technology has opened up a new mode of interaction between humans and machines, devices, and computers. With the development of technology, gesture recognition has progressed from the era of data gloves, which relied on external auxiliary devices, to pattern classification based on computer vision. Popular vision-based gesture recognition is divided into three stages: segmentation, feature extraction, and recognition. Gesture segmentation is the foundation of gesture recognition; its goal is to separate the gesture from an image with a complex background. Because skin color exhibits clustering characteristics in color space, the vast majority of current gesture segmentation methods rely on color features of skin (YUV, HSV, YCbCr, etc.) or geometric features (such as ellipse models or graph models). The main direction of current research is as follows: gesture detection and recognition are currently carried out separately, and, building on continually refined recognition technology, the question is how to apply new techniques from mathematical morphology, neural network algorithms, and genetic algorithms to gesture recognition.
The greatest difficulty in gesture recognition research lies in the data processing stage: the system separates the video captured by the camera into frames, isolates a single gesture image from each video frame, and preprocesses the data by smoothing, sharpening, and similar operations. It then detects whether a gesture image is present and, if so, separates the gesture from the background. In the gesture analysis stage, gesture features are detected, and a selected gesture model is used to estimate the corresponding feature parameters. In the recognition and classification stage, through feature extraction and model-parameter estimation, classification algorithms assign points or trajectories in the parameter space to different subspaces, and the recognition result is finally converted into a specific meaning for practical use. Illumination, pixel resolution, and similar factors all degrade the accuracy of a recognition system to varying degrees. The Kinect depth image, by contrast, is unaffected by ambient lighting and shadows, and its pixels clearly express the surface geometry of the scene. The Kinect depth camera is a motion-sensing input device created by Microsoft for its Xbox 360 game console and Windows PCs. As a somatosensory peripheral, it is in fact a 3D camera based on a spatial positioning technology called Light Coding. Kinect has three lenses: the middle lens is an RGB color camera, while the left and right lenses form a 3D depth sensor consisting of an infrared emitter and an infrared CMOS camera. With real-time motion capture, image recognition, microphone input, speech recognition, community interaction, and other features, it allows players to interact with the Xbox 360 through body gestures and voice commands via a natural user interface.
As an important component of intelligent computer interfaces, digital gesture recognition is of great significance; maturing this technology can greatly improve the efficiency of computer use, making it an ideal future input method for office automation, smart homes, interactive robot control, and other fields. At present, the problems of gesture recognition fall into three areas: 1) collection of the data set; 2) preprocessing of gesture images, where accurately separating the detected gesture pose from the picture is the main aspect of the image detection problem; and 3) combining digital gesture recognition with a neural network so that recognition performance is optimal.
Summary of the Invention
The purpose of the present invention is to provide an automatic digital gesture recognition method based on a convolutional neural network with improved recognition performance. The technical solution is as follows:
A digital gesture recognition method based on a convolutional neural network, comprising the following steps:
(1) Use a Kinect depth camera to collect gesture images of the ten digit classes and filter the gesture images. Use the filtered images to build a sample set for each digital gesture representation, as follows: perform morphological preprocessing on the filtered gesture images; classify and label the image information to obtain a sample set for each digital gesture representation; and divide the samples into a training set and a test set.
(2) Construct a convolutional neural network (CNN):
(2a) Import the digital gesture images of each category into the convolutional neural network as the input layer, whose size is [320, 320, 3, 59];
(2b) Construct an 8-layer convolutional neural network that applies convolution, downsampling, pooling, and similar operations to every pixel of the input image, yielding the feature maps of each layer;
(2c) Use the output of each layer as the input of the next layer; after all 8 layers, the features converge in the fully connected (fc) layer, and the result is output through the softmax classifier of the output layer;
(3) Input the training sample set, extract image features, and perform classification training:
(3a) Use a softmax classifier to classify the image feature vectors;
(3b) Use the convolutional neural network algorithm to train on the training sample set, obtaining the trained model as a .mat file;
(4) Use the trained convolutional neural network to recognize the images in the test data set.
Preferably, the filtering method of step (1) is as follows: using a depth-image filtering algorithm based on a joint bilateral filter, take the depth image and the color image of the gesture captured by the Kinect lens at the same moment as input; compute the spatial-distance weights of the depth image and the grayscale weights of the RGB color image with a Gaussian kernel; multiply these two weights to obtain the joint filtering weights and design the joint bilateral filter; the filtering result of this filter is then convolved with the noisy image to accomplish Kinect depth-image filtering.
The CNN-based gesture recognition of the present invention can efficiently acquire depth images of gesture features, denoise them, and automatically recognize and output digital gesture feature images, with a recognition accuracy of about 93%. The algorithm is robust to illumination changes, simple geometric deformations, and additive noise, and can be used in fields related to digital gesture feature recognition; with extension, it can also be applied to automatic recognition of other gesture features.
Brief Description of the Drawings
Figure 1 is the algorithm flowchart of the present invention.
Figure 2 shows the preprocessed data set.
Figure 3 shows the overall training diagram of the network.
Figure 4 shows the recognition results on the test-set images.
Detailed Description of the Embodiments
The purpose of the present invention is to collect a data set with a Kinect depth camera, denoise it through morphological image preprocessing, and then realize automatic recognition of digital gestures with the constructed convolutional neural network, so as to meet practical requirements. The method mainly comprises the following steps:
(1) Obtain a sample set for each digital gesture representation:
(1a) Collect a representation data set for the ten digit classes;
(1b) Perform morphological preprocessing on the collected images;
(1c) Classify and label the image information, dividing the collected digital gesture data set into a training set and a test set;
(2) Construct a convolutional neural network (CNN):
(2a) Import the digital gesture images of each category into the convolutional neural network as the input layer, whose size is [320, 320, 3, 59];
(2b) Construct an 8-layer convolutional neural network that applies convolution, downsampling, pooling, and similar operations to every pixel of the input image, yielding the feature maps of each layer;
(2c) Use the output of each layer as the input of the next layer; after all 8 layers, the features converge in the fully connected (fc) layer, and the result is output through the softmax classifier of the output layer;
(3) Input the training sample set, extract image features, and perform classification training:
(3a) Use a softmax classifier to classify the image feature vectors;
(3b) Use the convolutional neural network algorithm to train on the training sample set, obtaining the trained model as a .mat file;
(4) Use the trained convolutional neural network to automatically recognize the pictures in the test data set.
The test sample set is input into the trained convolutional neural network; that is, the trained model .mat file is loaded and the test sample set is evaluated against it, so that each digital gesture picture is recognized automatically and the test results are output.
The specific steps of the present invention are described below with reference to Figure 1:
(1) Obtain a sample set for each digital gesture feature.
Ten classes of images containing digital gesture features were collected as the data set, including 1050 images for each of the digital gestures representing 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, for a total of 10500 images. The digital gesture feature images of each category were captured with the Kinect depth camera from different angles and under different lighting. Because depth images captured by the Kinect lens generally suffer from noise and black holes, using them directly for recognition gives poor results. We therefore use a depth-image filtering algorithm based on a joint bilateral filter, taking the depth image and the color image captured by the Kinect lens at the same moment as input. First, a Gaussian kernel is used to compute the spatial-distance weights of the depth image and the grayscale weights of the RGB color image; these two weights are then multiplied to obtain the joint filtering weights, and a fast Gauss transform replaces the Gaussian kernel in the design of the joint bilateral filter. Finally, the filtering result of this filter is convolved with the noisy image to accomplish Kinect depth-image filtering. Then 10000 images are randomly selected as the training sample set and manually labelled by class, with the remaining 500 serving as the test sample set; in the end, 10000 training samples and 500 test samples are obtained.
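As an illustration of this filtering step, the following is a minimal Python/NumPy sketch of a joint bilateral filter, assuming a registered depth/grayscale image pair. It applies plain Gaussian kernels rather than the fast Gauss transform mentioned above, omits the final convolution step, and uses illustrative parameter values; it is not the patent's exact implementation.

```python
import numpy as np

def joint_bilateral_filter(depth, gray, radius=3, sigma_s=2.0, sigma_r=10.0):
    """Smooth a noisy Kinect depth image using edge information from the
    registered color image: spatial weights are a Gaussian on pixel distance,
    range weights are a Gaussian on grayscale differences in the color image,
    and their product is the joint filtering weight."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))  # spatial-distance weights
    d = np.pad(depth.astype(np.float64), radius, mode='edge')
    g = np.pad(gray.astype(np.float64), radius, mode='edge')
    out = np.empty_like(depth, dtype=np.float64)
    H, W = depth.shape
    for i in range(H):
        for j in range(W):
            dw = d[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            gw = g[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # grayscale (range) weights taken from the color image, not the depth
            rng = np.exp(-(gw - float(gray[i, j]))**2 / (2 * sigma_r**2))
            w = spatial * rng                # joint filtering weights
            out[i, j] = (w * dw).sum() / w.sum()
    return out
```

Taking the range weights from the color image rather than the noisy depth map is what distinguishes the joint bilateral filter from an ordinary bilateral filter: depth edges are preserved wherever the color image shows an edge, while noise and black holes in the depth channel are averaged out.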
(2) Construct the convolutional neural network (CNN).
The digital gesture images of each category are imported into the convolutional neural network as the input layer, whose size is [320, 320, 3, 59]. An 8-layer convolutional neural network is constructed that applies convolution, downsampling, pooling, and similar operations to every pixel of the input image, yielding the feature maps of each layer. The output of each layer serves as the input of the next layer; after all 8 layers, the features converge in the fully connected (fc) layer, and the result is output through the softmax classifier of the output layer.
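The patent does not give kernel sizes or channel counts, so the following PyTorch sketch of an 8-layer network (five convolutional layers plus three fully connected layers, ending in a softmax over the ten digit classes) should be read as one plausible instantiation rather than the exact architecture:

```python
import torch.nn as nn

class GestureCNN(nn.Module):
    """Hypothetical 8-layer CNN for 320x320x3 inputs (batch size 59 in the patent)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # five conv layers, each followed by ReLU and 2x2 max pooling:
            # spatial resolution 320 -> 160 -> 80 -> 40 -> 20 -> 10
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # layer 1
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # layer 2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # layer 3
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # layer 4
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # layer 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 10 * 10, 4096), nn.ReLU(),  # layer 6 (fc)
            nn.Linear(4096, 4096), nn.ReLU(),           # layer 7 (fc)
            nn.Linear(4096, num_classes),               # layer 8 (fc); softmax is applied
        )                                               # by the loss or at inference

    def forward(self, x):
        return self.classifier(self.features(x))       # logits for the softmax classifier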
The training sample set is input, image features are extracted, and classification training is carried out; a softmax classifier classifies the image feature vectors; the convolutional neural network algorithm trains on the training sample set, yielding the trained model as a .mat file.
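A minimal training loop for the network sketched above could look like the following. The optimizer choice, epoch count, and placeholder data are assumptions; only the learning rate θ = 0.001 and the batch shape are taken from the text, and the saved .pt file stands in for the patent's .mat model file.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder training data in the patent's input shape (in practice, the
# 10000 labelled Kinect training images from step (1) would be loaded here).
train_set = TensorDataset(torch.randn(590, 3, 320, 320), torch.randint(0, 10, (590,)))

model = GestureCNN(num_classes=10)
criterion = nn.CrossEntropyLoss()                          # log-softmax + NLL
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # theta = 0.001 as in the text

for epoch in range(10):                                    # epoch count is an assumption
    for images, labels in DataLoader(train_set, batch_size=59, shuffle=True):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                                    # error backpropagation
        optimizer.step()                                   # gradient-descent weight update

torch.save(model.state_dict(), 'gesture_cnn.pt')           # stand-in for the .mat file
```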
The basic flow of the convolutional neural network algorithm is as follows: randomly initialize the network weights and neuron thresholds, then perform forward propagation according to formula (1), computing the inputs and outputs of the hidden-layer neurons and output neurons layer by layer, where E denotes the output error, d the target value, and w_jk and v_ij the weights and thresholds of each layer, respectively.
Error backpropagation is performed according to formula (2):
where θ is the learning-rate parameter of the backpropagation algorithm (θ = 0.001 in the present invention), n denotes the number of input vectors (n = 320×320×3×59 in the present invention), m denotes the number of hidden-layer output vectors (m changes with the output vector of each convolutional layer), and l denotes the number of output-layer output vectors (l = 1×1×4096×59 in the present invention). The negative sign in the formula indicates gradient descent in weight space, i.e., the weights change in the direction that decreases E. The weights and thresholds are updated by these formulas until the termination condition is met.
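Formulas (1) and (2) are referenced but not reproduced in this text (they appear as images in the original filing). From the surrounding definitions, the standard forward pass and gradient-descent update they describe have the following form; this is a reconstruction, treating v_ij and w_jk as the weights of successive layers, not the patent's exact notation:

```latex
% Formula (1), forward propagation (reconstructed):
% hidden outputs y_j, network outputs o_k, and output error E
y_j = f\Big(\sum_{i=1}^{n} v_{ij}\,x_i\Big), \qquad
o_k = f\Big(\sum_{j=1}^{m} w_{jk}\,y_j\Big), \qquad
E = \frac{1}{2}\sum_{k=1}^{l}\big(d_k - o_k\big)^2

% Formula (2), error backpropagation (reconstructed):
% gradient descent with learning rate \theta
\Delta w_{jk} = -\theta\,\frac{\partial E}{\partial w_{jk}}, \qquad
\Delta v_{ij} = -\theta\,\frac{\partial E}{\partial v_{ij}}
```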
(3) Use the trained convolutional neural network to automatically recognize the digital gesture feature set.
The test sample set is input into the trained convolutional neural network; that is, the trained model .mat file is loaded and the test sample set is evaluated against it, so that each digital gesture picture is recognized automatically and the test results are output.
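Continuing the PyTorch sketch above (a stand-in for the patent's .mat workflow, which is not specified in detail), this testing step could look as follows; the placeholder tensors and the 'gesture_cnn.pt' filename are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder test data in the patent's input shape (in practice, the 500
# labelled Kinect test images from step (1) would be loaded here).
test_set = TensorDataset(torch.randn(500, 3, 320, 320), torch.randint(0, 10, (500,)))

model = GestureCNN(num_classes=10)
model.load_state_dict(torch.load('gesture_cnn.pt'))  # saved by the training sketch above
model.eval()

correct = total = 0
with torch.no_grad():
    for images, labels in DataLoader(test_set, batch_size=59):
        preds = model(images).argmax(dim=1)  # argmax over logits == argmax over softmax
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f'test accuracy: {correct / total:.3f}')  # the patent reports about 93%
```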
Compared with the prior art, the present invention has the following features and advantages:
First, the present invention applies a convolutional neural network to digital gesture feature recognition. The data set comprises 10 classes of digital gesture feature images, including 1050 images for each of the digital gestures representing 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, for a total of 10500 training images. The digital gesture feature images of each category were captured with the Kinect depth camera from different angles and under different lighting, after which the 10500 collected pictures were denoised by morphological filtering. Test results on the test set show that most gesture feature images are recognized correctly, as shown in Figure 4. Table 2 compares the recognition accuracy of current traditional algorithms with the method of the present invention; as the table shows, the recognition accuracy of the present method is comparatively good.
Second, the present invention constructs a parallel pooling layer. The benefit of this structure is that, when outputs of the same dimension are produced during training, it can effectively reduce the top-1 error (whether the single highest-probability answer is correct) and the top-5 error (whether the correct answer is among the five highest-probability answers). In the structure of a CNN, the feature extraction layer connects the input of each neuron to the local receptive field of the previous layer and extracts the local features of that layer. Once a local feature has been extracted, its positional relationship to the other feature vectors is also determined, which aids feature-vector extraction.
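To make the top-1/top-5 metrics concrete, here is a small generic helper (not code from the patent) that computes top-k accuracy from a batch of class scores:

```python
import torch

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices             # (N, k) best class ids per sample
    hits = (topk == labels.unsqueeze(1)).any(dim=1)  # is the true label in the top k?
    return hits.float().mean().item()

# scores = model(images)                  # (N, 10) class scores
# top1 = topk_accuracy(scores, labels, k=1)
# top5 = topk_accuracy(scores, labels, k=5)
```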
Third, when collecting the digital gesture images, the present invention uses a Kinect depth-image filtering algorithm based on a joint bilateral filter, which preserves the relevant features of the original image well and helps improve recognition accuracy.
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910147442.1A CN110046544A (en) | 2019-02-27 | 2019-02-27 | Digital gesture identification method based on convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110046544A (en) | 2019-07-23 |
Family
ID=67274219
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910147442.1A Pending CN110046544A (en) | 2019-02-27 | 2019-02-27 | Digital gesture identification method based on convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110046544A (en) |
- 2019-02-27: CN CN201910147442.1A patent/CN110046544A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106296667A (en) * | 2016-08-01 | 2017-01-04 | 乐视控股(北京)有限公司 | Hand detection method and system |
CN107358576A (en) * | 2017-06-24 | 2017-11-17 | 天津大学 | Depth map super resolution ratio reconstruction method based on convolutional neural networks |
CN107742095A (en) * | 2017-09-23 | 2018-02-27 | 天津大学 | Chinese sign language recognition method based on convolutional neural network |
CN109344701A (en) * | 2018-08-23 | 2019-02-15 | 武汉嫦娥医学抗衰机器人股份有限公司 | A kind of dynamic gesture identification method based on Kinect |
Non-Patent Citations (5)
Title |
---|
GONGFA LI et al.: "Hand gesture recognition based on convolution neural network", Cluster Computing *
Li Zhifei: "Kinect depth image filtering algorithm based on joint bilateral filters", Journal of Computer Applications *
Yang Wenbin et al.: "Gesture recognition method based on convolutional neural network", Journal of Anhui Polytechnic University *
Hu Ming: "Application of CNN-based hand pose estimation in gesture recognition", China Masters' Theses Full-text Database, Information Science and Technology *
Chen Zuxue: "Research on gesture recognition based on deep convolutional neural networks", China Masters' Theses Full-text Database, Information Science and Technology *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569852A (en) * | 2019-09-10 | 2019-12-13 | 瑞森网安(福建)信息科技有限公司 | Image identification method based on convolutional neural network |
CN110569852B (en) * | 2019-09-10 | 2021-10-15 | 瑞森网安(福建)信息科技有限公司 | Image identification method based on convolutional neural network |
CN110795990A (en) * | 2019-09-11 | 2020-02-14 | 中国海洋大学 | Gesture recognition method for underwater equipment |
CN110795990B (en) * | 2019-09-11 | 2022-04-29 | 中国海洋大学 | A gesture recognition method for underwater equipment |
CN111142399A (en) * | 2020-01-09 | 2020-05-12 | 四川轻化工大学 | A computer-based embedded intelligent home automation control test system |
TWI760769B (en) * | 2020-06-12 | 2022-04-11 | 國立中央大學 | Computing device and method for generating a hand gesture recognition model, and hand gesture recognition device |
CN111767860A (en) * | 2020-06-30 | 2020-10-13 | 阳光学院 | A method and terminal for realizing image recognition through convolutional neural network |
CN113792573A (en) * | 2021-07-13 | 2021-12-14 | 浙江理工大学 | Static gesture recognition method for wavelet transformation low-frequency information and Xception network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190723 |