CN108509839A

CN108509839A - An efficient gesture detection and recognition method based on region convolutional neural networks

Info

Publication number: CN108509839A
Application number: CN201810105589.XA
Authority: CN
Inventors: 张勋; 陈亮; 朱雪婷
Original assignee: Donghua University
Current assignee: Donghua University; National Dong Hwa University
Priority date: 2018-02-02
Filing date: 2018-02-02
Publication date: 2018-09-07

Abstract

The present invention relates to an efficient gesture detection and recognition method based on region convolutional neural networks, comprising the following steps: preprocessing sample images of Chinese manual alphabet gestures; constructing and augmenting a gesture image data set; performing gesture detection and recognition with the region-based convolutional neural network Faster R-CNN, in which a feature extraction network first extracts gesture features and the extracted feature map is split into two parts: the first part goes directly into the Fast R-CNN network for deeper convolution, while the second part enters the RPN, which generates region proposals that are input to the Fast R-CNN network and, together with the feature map obtained from the first part, enter the RoI pooling layer; the fully connected layers then output bounding-box regression results and gesture classification scores, thereby achieving gesture detection and recognition; and training the network model to detect and recognize Chinese manual alphabet gestures. The present invention improves both recognition speed and accuracy.

Description

An efficient gesture detection and recognition method based on region convolutional neural networks

Technical field

The present invention relates to the technical field of gesture detection and recognition, and in particular to an efficient gesture detection and recognition method based on region convolutional neural networks.

Background art

In recent years gesture recognition has found wide application, for example in translating the sign language of deaf-mute people, gesture-triggered photography, robot control, and smart-home control of doors, windows, and household appliances. By acquisition mode, gesture recognition falls into two categories: methods based on wearable devices and methods based on machine vision. Wearable-device methods offer accurate gesture localization, relatively simple data, and fast response, but these advantages cannot offset their drawbacks: high cost, inconvenience, high learning cost, limited operating distance, and restricted usage scenarios. These drawbacks make wearable-device methods difficult to popularize, so machine-vision-based gesture recognition is the more desirable approach. The core of a machine-vision-based gesture recognition method is the computer's gesture detection and recognition algorithm.

Traditional gesture detection and recognition algorithms generally comprise three steps: gesture segmentation, feature extraction, and recognition. Segmentation is usually done with models based on motion information, motion templates, or skin color information; features are then extracted from the segmented gesture with algorithms such as HOG, LBP, or Fourier transform methods; finally, algorithms such as SVM, AdaBoost, or MLP perform classification. Traditional algorithms cannot avoid the shortcomings of manually designed gesture features, so the resulting models adapt poorly.

The convolutional neural network (CNN) is a very important algorithm in deep learning (DL) theory. It overcomes the drawbacks of manually defining, describing, and selecting target features: through its strong self-learning ability it automatically extracts deeper-level features of the target in the input image and classifies it.

In 2014, R. Girshick proposed the region-based convolutional neural network model R-CNN, which generates candidate regions with methods such as Selective Search or Edge Boxes and then extracts features from each candidate region with a convolutional neural network. Although R-CNN suffers from insufficient precision and a restriction on input image size, it laid the foundation for the region proposal + CNN approach to object detection. In 2015, R. Girshick proposed the Fast R-CNN model, which introduces the Region of Interest (RoI) pooling layer and remedies shortcomings of R-CNN. However, its candidate regions are still produced by a hand-designed method whose computation runs only on the CPU, so limited accuracy and long candidate-region computation time remain drawbacks of that network. To further improve recognition efficiency beyond R-CNN and Fast R-CNN, Shaoqing Ren et al. of Microsoft proposed the Faster R-CNN model in 2015. It generates proposals with a region proposal network, replacing methods such as Selective Search and Edge Boxes, and shares convolutional features with the detection network, which greatly shortens region proposal computation time.

Summary of the invention

The technical problem to be solved by the present invention is to provide an efficient gesture detection and recognition method based on region convolutional neural networks that improves both recognition speed and accuracy.

The technical solution adopted by the present invention to solve this technical problem is to provide an efficient gesture detection and recognition method based on region convolutional neural networks, comprising the following steps:

(1) preprocessing sample images of Chinese manual alphabet gestures;

(2) constructing and augmenting a gesture image data set, and dividing it into a training set and a test set;

(3) performing gesture detection and recognition with the region-based convolutional neural network Faster R-CNN, the network comprising a feature extraction network, a region proposal network (RPN), and a Fast R-CNN network; the feature extraction network extracts gesture features, and the extracted feature map is split into a first part and a second part; the first part goes directly into the Fast R-CNN network for deeper convolution; the second part enters the RPN, which generates region proposals that are input to the Fast R-CNN network and, together with the feature map obtained from the first part, enter the RoI pooling layer; the fully connected layers then output bounding-box regression results and gesture classification scores, thereby achieving gesture detection and recognition;

(4) training the network model: the network is trained on the Chinese manual alphabet training set to obtain the network parameters; finally, the test set or gesture video captured in real time is input to the trained network to detect and recognize Chinese manual alphabet gestures.

Step (1) is specifically: recording video of Chinese manual alphabet gestures, extracting frames from the video as images, removing images with severe motion blur or severe occlusion, and enhancing the images by high-pass filtering.

The gesture image data set constructed in step (2) comprises the original sample images and the label images obtained by manually annotating them, wherein the bounding boxes recorded in the annotation information correspond one-to-one with the original images; the original images are mirrored and the mirrored images are re-annotated, thereby augmenting the static sign language data set.

The feature extraction network in step (3) is the 13-layer convolutional part of VGG16, that is, VGG16 with its 3 fully connected layers removed.
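
As a rough illustration of what this feature extractor produces, the sketch below computes the spatial size of the shared feature map for the 640*480 images used in the embodiment. It assumes, as in common Faster R-CNN setups and not stated in this document, that the convolutions preserve spatial size and that only the first four 2x2 max-pooling layers of VGG16 are kept, giving a total stride of 16. The function name is hypothetical.

```python
# Sketch: spatial size of the shared feature map for a 640x480 input.
# Assumption (not stated in the patent): the 13 convolutions are
# size-preserving (3x3, padding 1) and only the first four 2x2 max-pooling
# layers are kept, so the total stride is 16.

def vgg16_feature_map_size(width, height, num_pools=4):
    """Return (width, height) of the last convolutional feature map."""
    for _ in range(num_pools):
        # 2x2 max pooling with stride 2 halves each side (floor division).
        width, height = width // 2, height // 2
    return width, height


fm_w, fm_h = vgg16_feature_map_size(640, 480)
total_stride = 640 // fm_w
print(fm_w, fm_h, total_stride)
```

Under these assumptions each cell of the feature map corresponds to a 16-pixel step in the input image, which is the stride used by the other sketches in this description.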

The region proposal network (RPN) in step (3) generates region proposals directly with a convolutional neural network: a sliding window passes once over the last convolutional layer to obtain region proposals at multiple scales and aspect ratios, from which detection regions are extracted. The RPN is also trained end-to-end by backpropagation and stochastic gradient descent.

The RPN slides a convolution kernel over the feature map produced by the last convolutional layer. At each position the kernel is fully connected to a window of the feature map and yields a low-dimensional vector, which is fed to two fully connected layers: the box regression layer and the classification layer. The box regression layer predicts the coordinates of the proposal corresponding to each anchor, and the classification layer determines whether the proposal is a target or background.

The loss function of the RPN is

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*)$$

where $p_i$ is the probability that the $i$-th anchor box is a target and $p_i^*$ is the sample label; $t_i$ denotes the predicted parametrized box coordinates and $t_i^*$ the parametrized coordinates of the positive sample; $N_{cls}$ is the mini-batch size of images input to the network and $N_{reg}$ is the total number of anchor locations; $\lambda$ balances the two terms; $L_{cls}$ is the classification loss function; and $L_{reg}$ is the regression loss function.

In step (3), the RPN and the Fast R-CNN network use a feature-sharing mechanism: their convolutional layer features are shared through stage-wise alternating training.

Advantageous effects

By adopting the above technical solution, the present invention has the following advantages and positive effects compared with the prior art:

The present invention uses the region-based convolutional neural network Faster R-CNN to detect and recognize static Chinese manual alphabet gestures. The VGG16 network performs feature extraction, the region proposal network (RPN) generates region proposals, and the proposals then enter the Fast R-CNN network for gesture detection and classification. Because the input is a gesture image and the output is the recognized gesture image, the framework is end-to-end. These characteristics not only increase the speed of gesture detection and recognition but also greatly improve recognition accuracy.

Brief description of the drawings

Fig. 1 is a schematic diagram of gesture detection and recognition based on region convolutional neural networks according to the present invention;

Fig. 2 is a schematic diagram of the structure of the region proposal network (RPN);

Fig. 3 is a flow chart of network training according to the present invention;

Fig. 4 shows experimental results of gesture detection and recognition according to the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to specific embodiments. It should be understood that these embodiments merely illustrate the present invention and do not limit its scope. It should further be understood that, after reading the teachings of the present invention, those skilled in the art may make various changes or modifications to it, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.

Embodiments of the present invention relate to a real-time static sign language detection and recognition method based on region convolutional neural networks which, as shown in Fig. 1, comprises the following steps: preprocessing sample images of Chinese manual alphabet gestures; constructing and augmenting a gesture image data set and dividing it into a training set and a test set; and performing gesture detection and recognition with the region-based convolutional neural network Faster R-CNN. The network has three parts: a feature extraction network, a region proposal network (RPN), and a Fast R-CNN network. The feature extraction network extracts gesture features, and the extracted feature map is split into two parts: the first part goes directly into the Fast R-CNN network for deeper convolution, while the second part enters the RPN, which generates region proposals that are input to the Fast R-CNN network and, together with the feature map obtained from the first part, enter the RoI pooling layer; the fully connected layers then output bounding-box regression results and gesture classification scores, thereby achieving gesture detection and recognition. The network model is then trained: the network is trained on the Chinese manual alphabet training set to obtain the network parameters. Finally, the test set or gesture video captured in real time by a camera is input to the trained network to detect and recognize Chinese manual alphabet gestures. The details are as follows:

Step 1: Preprocess the sample images of Chinese manual alphabet gestures. The experimental data were captured with a high-definition monocular camera. For the static sign language recognition experiments, five representative letters were chosen from the 26 Chinese letter gestures: A, B, C, D, and E. The data were recorded by 8 people, each of whom recorded a separate video for each letter. A Matlab frame-extraction program then split the videos into frames; images with severe motion blur or severe occlusion were removed manually, and images with poor display quality were enhanced by high-pass filtering to facilitate target recognition. This yielded the preliminary data set, with an image size of 640*480.
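
The high-pass enhancement in Step 1 could be sketched as follows. The document does not specify which filter is used, so this is only one possible reading: the high-pass component is taken as the image minus a box-blurred copy and is added back to the image. The function names, the 5x5 window, and the `amount` parameter are illustrative assumptions, and a random array stands in for a real 640*480 frame.

```python
import numpy as np


def box_blur(img, k=5):
    """Mean filter with a k x k window (edge-padded); the low-pass part."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


def high_pass_enhance(img, amount=1.0, k=5):
    """Add the high-pass residual (image minus its blur) back to the image."""
    img = img.astype(np.float64)
    high_pass = img - box_blur(img, k)
    return np.clip(img + amount * high_pass, 0, 255).astype(np.uint8)


# Stand-in for one grayscale 640*480 frame extracted from a gesture video.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(480, 640)).astype(np.uint8)
enhanced = high_pass_enhance(frame)
print(enhanced.shape, enhanced.dtype)
```

A flat region has no high-frequency content, so it passes through unchanged, while edges such as finger contours are accentuated.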

Step 2: Construct and augment the gesture image data set, and divide it into a training set and a test set. The constructed Chinese manual alphabet data set contains the original sample images and the label images obtained by manually annotating them; the bounding boxes recorded in the annotation information correspond one-to-one with the original images. The original images are mirrored and the mirrored images are re-annotated, thereby augmenting the static sign language data set. The final data set is shown in Table 1: each letter has 2500 training images, for 15000 images in total, and the test set has 500 images per letter, 2500 in total. Manual annotation was done with the LabelImg program to obtain the ground-truth label files.

Table 1: Static sign language data set
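
The mirroring step can be illustrated with a short sketch. It assumes boxes are stored as inclusive pixel coordinates (xmin, ymin, xmax, ymax), which is a common convention for LabelImg-style annotations but is not stated in this document; the function name is hypothetical.

```python
import numpy as np


def mirror_sample(image, boxes):
    """Flip an image left-right and mirror its bounding boxes.

    image: array of shape (H, W) or (H, W, C)
    boxes: array of shape (N, 4) as inclusive (xmin, ymin, xmax, ymax)
    """
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    boxes = np.asarray(boxes, dtype=np.float64)
    mirrored = boxes.copy()
    # The old right edge becomes the new left edge and vice versa.
    mirrored[:, 0] = width - 1 - boxes[:, 2]
    mirrored[:, 2] = width - 1 - boxes[:, 0]
    return flipped, mirrored


image = np.zeros((480, 640), dtype=np.uint8)
image[100:200, 50:150] = 255            # a bright stand-in for the hand
boxes = np.array([[50, 100, 149, 199]])
flipped, mirrored = mirror_sample(image, boxes)
print(mirrored)
```

Only the x coordinates change; the y coordinates and the class label of each box are kept, so every original image yields one additional labeled sample.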

Step 3: Perform gesture detection and recognition with the region-based convolutional neural network Faster R-CNN. The core of the network has three parts: a feature extraction network, a region proposal network (RPN), and a Fast R-CNN network. The principle of the network can be summarized as follows: the feature extraction network VGG16 first extracts gesture features, and the resulting feature map is split into two parts. The first part goes directly into the Fast R-CNN network for deeper convolution; the second part enters the RPN, which generates region proposals that are input to the Fast R-CNN network and, together with the feature map obtained from the first part, enter the RoI pooling layer. The fully connected layers then output bounding-box regression results and gesture classification scores, thereby achieving gesture detection and recognition.
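
To make the role of the RoI pooling layer concrete, the following is a minimal NumPy sketch of RoI max pooling: a region proposal given in image pixels is projected onto the feature map and reduced to a fixed-size grid, so that the fully connected layers always receive an input of the same shape. The 7x7 output size and the stride of 16 are assumptions taken from common Faster R-CNN configurations, not from this document, and the bin-rounding scheme is one simple choice among several.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(7, 7), stride=16):
    """feature_map: (C, H, W); roi: (xmin, ymin, xmax, ymax) in image pixels."""
    c, fh, fw = feature_map.shape
    # Project the RoI onto the feature map and clamp it to valid cells.
    x0, y0, x1, y1 = [int(round(v / stride)) for v in roi]
    x0, y0 = min(max(x0, 0), fw - 1), min(max(y0, 0), fh - 1)
    x1, y1 = min(max(x1, x0), fw - 1), min(max(y1, y0), fh - 1)
    out_h, out_w = output_size
    # Bin edges over the RoI; every bin covers at least one cell.
    ys = np.linspace(y0, y1 + 1, out_h + 1)
    xs = np.linspace(x0, x1 + 1, out_w + 1)
    out = np.empty((c, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        ya = int(np.floor(ys[i]))
        yb = max(int(np.ceil(ys[i + 1])), ya + 1)
        for j in range(out_w):
            xa = int(np.floor(xs[j]))
            xb = max(int(np.ceil(xs[j + 1])), xa + 1)
            out[:, i, j] = feature_map[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out


rng = np.random.default_rng(0)
feature_map = rng.random((2, 30, 40))      # 2 channels for brevity
pooled = roi_max_pool(feature_map, roi=(160, 96, 479, 319))
print(pooled.shape)
```

Whatever the size of the proposal, the pooled output has the same shape, which is what allows proposals of different sizes to share the same fully connected layers.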

The region proposal network (RPN) was proposed to solve the following problem: in Fast R-CNN, candidate regions are generated by selective search, which is computationally expensive and greatly reduces the speed of the algorithm. The RPN is shown in Fig. 2 and works as follows. A small convolution kernel (usually 3*3) slides over the feature map produced by the last convolutional layer. At each position the kernel is fully connected to an n*n window of the feature map (for the VGG16 used in the present invention, this corresponds to a receptive field of 228 pixels) and yields a low-dimensional vector (512-d for VGG16). This vector is finally fed to two fully connected layers, the box regression layer (reg layer) and the classification layer (cls layer). The box regression layer predicts the coordinates of the proposal corresponding to each anchor, and the classification layer determines whether the proposal is a target or background.
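
The anchors that the reg and cls layers refer to can be generated as in the sketch below. The document only says that proposals cover multiple scales and aspect ratios; the three scales (128, 256, 512 pixels), three height-to-width ratios (0.5, 1, 2), and stride of 16 used here are assumptions borrowed from common Faster R-CNN settings, and the function names are hypothetical.

```python
import numpy as np


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors centred on the origin as (xmin, ymin, xmax, ymax); ratio = h / w."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # keeps the area equal to s * s
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)


def all_anchors(fm_w, fm_h, stride=16):
    """Place every base anchor at the centre of each feature-map cell."""
    xs = (np.arange(fm_w) + 0.5) * stride
    ys = (np.arange(fm_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base_anchors()[None]).reshape(-1, 4)


anchors = all_anchors(40, 30)   # 40 x 30 feature map for a 640*480 image
print(anchors.shape)            # one row per anchor
```

With these settings each sliding-window position carries 9 anchors, so the cls layer outputs 9 target/background scores and the reg layer 9 sets of box coordinates per position.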

The loss function of the RPN consists of two parts: 1) the classification loss $L_{cls}$, which describes whether an image region is a target; 2) the regression loss $L_{reg}$, which describes the gap between a region proposal (RP) and the ground truth. The total loss is:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*) \tag{1-1}$$

where $p_i$ is the probability that the $i$-th anchor box is a target and $p_i^*$ is the sample label (1 if the anchor box is a target, 0 otherwise); $t_i$ denotes the predicted parametrized box coordinates, a four-dimensional vector, and $t_i^*$ denotes the parametrized coordinates of the positive sample, as given in formula (1-4). $N_{cls}$ is the mini-batch size of images input to the network and $N_{reg}$ is the total number of anchor locations; both are normalization weights. $\lambda$ balances the two terms of the formula. The box classification loss $L_{cls}$ is the log loss. The box regression loss $L_{reg}$ is computed as follows:

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*) \tag{1-2}$$

$$R(x) = \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{1-3}$$
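
The total RPN loss described above can be evaluated numerically as in this sketch. It assumes the regression term uses the smooth L1 function, as in the original Faster R-CNN formulation, and uses an illustrative balancing weight of 10; neither value is stated in this document. The function names are hypothetical.

```python
import numpy as np


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)


def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=10.0):
    """Multi-task RPN loss.

    p:      (N,)   predicted probability that each anchor is a target
    p_star: (N,)   labels, 1 for positive anchors and 0 for negative ones
    t:      (N, 4) predicted parametrized box coordinates
    t_star: (N, 4) parametrized ground-truth coordinates
    """
    eps = 1e-12  # avoids log(0)
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # p_star switches the regression term on for positive anchors only.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg


p = np.array([0.9, 0.2])
p_star = np.array([1.0, 0.0])
t = np.array([[0.1, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
t_star = np.zeros((2, 4))
loss = rpn_loss(p, p_star, t, t_star, n_cls=2, n_reg=2)
print(loss)
```

In this toy example the second anchor is negative, so its large coordinate error contributes nothing to the loss; only its classification term counts.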

From formula (1-1), the box regression loss is activated only when the sample is positive, i.e., when $p_i^* = 1$. Box regression corrects the coordinates of the anchor box toward the ground-truth box so that the two become closer. It is computed with parametrized coordinates:

$$\begin{aligned} t_x &= (x - x_a)/w_a, & t_y &= (y - y_a)/h_a, & t_w &= \log(w/w_a), & t_h &= \log(h/h_a), \\ t_x^* &= (x^* - x_a)/w_a, & t_y^* &= (y^* - y_a)/h_a, & t_w^* &= \log(w^*/w_a), & t_h^* &= \log(h^*/h_a) \end{aligned} \tag{1-4}$$

In the formula, $x, y, w, h$ denote the center coordinates, width, and height of the predicted box; $x_a, y_a, w_a, h_a$ denote the center coordinates, width, and height of the candidate (anchor) box; $x^*, y^*, w^*, h^*$ denote the center coordinates, width, and height of the ground-truth box. $t_x, t_y, t_w, t_h$ and $t_x^*, t_y^*, t_w^*, t_h^*$ are used to compute the regression loss, that is, the regression from the proposal box to the nearby ground-truth box.
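
The parametrization can be checked with a small encode/decode pair. This is a sketch that assumes boxes are given in center format (x, y, w, h); the function names are hypothetical.

```python
import numpy as np


def encode(box, anchor):
    """Map a box (x, y, w, h) to (t_x, t_y, t_w, t_h) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])


def decode(t, anchor):
    """Inverse of encode: recover (x, y, w, h) from the parametrized values."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([xa + tx * wa, ya + ty * ha,
                     wa * np.exp(tw), ha * np.exp(th)])


anchor = (320.0, 240.0, 128.0, 128.0)        # anchor centre, width, height
ground_truth = (330.0, 250.0, 160.0, 100.0)  # ground-truth box
t_star = encode(ground_truth, anchor)
recovered = decode(t_star, anchor)
print(t_star, recovered)
```

Offsets are normalized by the anchor size and sizes are compared on a log scale, so an anchor that already matches the ground truth encodes to all zeros regardless of its scale.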

Given the multi-task loss function defined above, the present invention uses SGD as the optimization algorithm to find the optimal weight parameters.

The RPN is trained end-to-end by backpropagation (BP) and stochastic gradient descent (SGD).

In the present invention, the RPN and Fast R-CNN use a feature-sharing mechanism, that is, they share the 13 convolutional layers of VGG. The convolutional layer features are shared through stage-wise alternating training, which avoids having to learn two separate networks within Faster R-CNN.

Step 4: The region convolutional neural network of step 3 is trained on the gesture training set of step 2 using stage-wise training. The numbers of iterations of the four stages are set to 40k, 20k, 40k, and 40k, respectively. Each stage uses a fixed learning rate of 0.001 and is optimized by stochastic gradient descent. Fig. 3 shows the network training flow. After repeated fine-tuning of the network, the set of model parameters with the best results is selected as the final model for experimental testing.
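
The four-stage schedule can be written down as a small configuration, sketched below. The iteration counts and the fixed learning rate come from Step 4; the stage names are assumptions that follow the usual order of alternating training (RPN, then Fast R-CNN, then each again with shared convolutional layers frozen), since Fig. 3 is not reproduced here. `Stage` and `run` are hypothetical names, and `train_step` stands for whatever per-iteration SGD update the training framework provides.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    iterations: int
    learning_rate: float = 0.001  # fixed for every stage, per Step 4


# Stage names are assumed; only the iteration counts come from the text.
SCHEDULE = [
    Stage("train RPN", 40_000),
    Stage("train Fast R-CNN on RPN proposals", 20_000),
    Stage("fine-tune RPN, shared conv layers frozen", 40_000),
    Stage("fine-tune Fast R-CNN, shared conv layers frozen", 40_000),
]


def run(schedule, train_step):
    """Call train_step(stage, iteration) for every iteration of every stage."""
    total = 0
    for stage in schedule:
        for iteration in range(stage.iterations):
            train_step(stage, iteration)
            total += 1
    return total


total_iterations = sum(stage.iterations for stage in SCHEDULE)
print(total_iterations)
```

Keeping the schedule as data makes it easy to try other iteration counts during the repeated fine-tuning mentioned above.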

Fig. 4 shows experimental results of gesture detection and recognition according to the present invention. A subset of the test results was selected at random; the recognition result in each image includes the gesture class label and its probability. It can be seen that the region-convolutional-neural-network-based method of this embodiment is highly effective for gesture detection and recognition.

It is readily seen that the present invention does not require hand-designed descriptions of Chinese manual alphabet gesture features: the convolutional neural network it uses obtains deeper feature information, so the model adapts well. Using the RPN for region proposals allows the full-image convolutional features to be shared with the entire detection network, so region proposals take less time, which improves the speed of the algorithm. The Fast R-CNN network performs the final detection and recognition of the gesture target. Together, these characteristics give the present solution good recognition speed and, in particular, a considerable improvement in gesture detection and recognition accuracy.
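
The class label and probability attached to each detection can be obtained from the classification scores as in the sketch below. It assumes a softmax over six outputs (background plus the five letters A to E used in the experiments); the document does not describe the output layer in this detail, and the names are hypothetical.

```python
import numpy as np

# Assumed class layout: background plus the five letters of the experiments.
CLASSES = ["background", "A", "B", "C", "D", "E"]


def label_and_probability(scores):
    """Turn raw classification scores into (class label, probability)."""
    scores = np.asarray(scores, dtype=np.float64)
    exp = np.exp(scores - scores.max())   # subtract the max for stability
    probs = exp / exp.sum()
    k = int(np.argmax(probs))
    return CLASSES[k], float(probs[k])


label, prob = label_and_probability([0.0, 5.0, 1.0, 0.0, 0.0, 0.0])
print(label, prob)
```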

Claims

1. An efficient gesture detection and recognition method based on region convolutional neural networks, characterized by comprising the following steps:

(1) preprocessing sample images of Chinese manual alphabet gestures;

(2) constructing and augmenting a gesture image data set, and dividing it into a training set and a test set;

(3) performing gesture detection and recognition with the region-based convolutional neural network Faster R-CNN, the network comprising a feature extraction network, a region proposal network (RPN), and a Fast R-CNN network; the feature extraction network extracts gesture features, and the extracted feature map is split into a first part and a second part; the first part goes directly into the Fast R-CNN network for deeper convolution; the second part enters the RPN, which generates region proposals that are input to the Fast R-CNN network and, together with the feature map obtained from the first part, enter the RoI pooling layer; the fully connected layers then output bounding-box regression results and gesture classification scores, thereby achieving gesture detection and recognition;

(4) training the network model: the network is trained on the Chinese manual alphabet training set to obtain the network parameters; finally, the test set or gesture video captured in real time is input to the trained network to detect and recognize Chinese manual alphabet gestures.

2. The efficient gesture detection and recognition method based on region convolutional neural networks according to claim 1, characterized in that step (1) is specifically: recording video of Chinese manual alphabet gestures, extracting frames from the video as images, removing images with severe motion blur or severe occlusion, and enhancing the images by high-pass filtering.

3. The efficient gesture detection and recognition method based on region convolutional neural networks according to claim 1, characterized in that the gesture image data set constructed in step (2) comprises the original sample images and the label images obtained by manually annotating them, wherein the bounding boxes recorded in the annotation information correspond one-to-one with the original images; the original images are mirrored and the mirrored images are re-annotated, thereby augmenting the static sign language data set.

4. The efficient gesture detection and recognition method based on region convolutional neural networks according to claim 1, characterized in that the feature extraction network in step (3) is the 13-layer convolutional part of VGG16, that is, VGG16 with its 3 fully connected layers removed.

5. The efficient gesture detection and recognition method based on region convolutional neural networks according to claim 1, characterized in that the region proposal network (RPN) in step (3) generates region proposals directly with a convolutional neural network: a sliding window passes once over the last convolutional layer to obtain region proposals at multiple scales and aspect ratios, from which detection regions are extracted; the RPN is also trained end-to-end by backpropagation and stochastic gradient descent.

6. The efficient gesture detection and recognition method based on region convolutional neural networks according to claim 5, characterized in that the RPN slides a convolution kernel over the feature map produced by the last convolutional layer; at each position the kernel is fully connected to a window of the feature map and yields a low-dimensional vector, which is fed to two fully connected layers, namely the box regression layer and the classification layer; the box regression layer predicts the coordinates of the proposal corresponding to each anchor, and the classification layer determines whether the proposal is a target or background.

7. The efficient gesture detection and recognition method based on region convolutional neural networks according to claim 5, characterized in that the loss function of the RPN is

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*)$$

where $p_i$ is the probability that the $i$-th anchor box is a target and $p_i^*$ is the sample label; $t_i$ denotes the predicted parametrized box coordinates and $t_i^*$ the parametrized coordinates of the positive sample; $N_{cls}$ is the mini-batch size of images input to the network and $N_{reg}$ is the total number of anchor locations; $\lambda$ balances the two terms; $L_{cls}$ is the classification loss function; and $L_{reg}$ is the regression loss function.

8. The efficient gesture detection and recognition method based on region convolutional neural networks according to claim 1, characterized in that, in step (3), the RPN and the Fast R-CNN network use a feature-sharing mechanism: their convolutional layer features are shared through stage-wise alternating training.