CN107808143A - Dynamic gesture identification method based on computer vision - Google Patents
Publication of application CN107808143A (application CN201711102008.9A); authority: CN (China). Legal status: Granted.
Principal classification: G06V40/28 — Recognition of hand or arm movements, e.g. recognition of deaf sign language.
Abstract
The invention discloses a dynamic gesture recognition method based on computer vision, which solves the problem of dynamically recognizing gestures against complex backgrounds. The implementation steps are: collect a gesture data set and annotate it manually; cluster the annotated ground-truth boxes to obtain the prior boxes used for training; build an end-to-end convolutional neural network that simultaneously predicts position, size and class; train the network to obtain its weights; load the weights into the network for recognition; input a gesture image; process the resulting position coordinates and class information with non-maximum suppression; obtain the final recognition-result image; and record the recognition information in real time to obtain the dynamic-gesture interpretation result. The invention overcomes the defect of prior-art gesture recognition, in which hand detection and classification are performed in separate steps; it greatly simplifies the recognition process, improves accuracy and recognition speed, strengthens the robustness of the recognition system, and realizes the interpretation of dynamic gestures.
Description
Technical field
The invention belongs to the technical field of image processing and further relates to target recognition in images; specifically, it is a dynamic gesture recognition method based on computer vision. It can be used to detect the position and recognize the state of gestures in images, so as to provide more accurate information for downstream applications of gesture recognition such as sign-language translation and game interaction.
Background technology
In recent years, with the development of computer vision, machine learning and related disciplines, human-computer interaction technology (human-computer interaction) has gradually shifted from being "computer-centered" to being "human-centered". Natural user interfaces that use the human body itself as the communication platform provide the operator with a more intuitive and comfortable interactive experience, and include face recognition, gesture recognition, body-posture recognition and the like. Among these, gesture, as a natural and intuitive means of communication in daily life, has excellent application prospects: controlling smart devices in virtual reality with predefined gestures; translating sign language to solve the communication problems of the deaf and mute; and automatic recognition of traffic-police gestures by driverless vehicles. Gesture recognition therefore has great research value and significance.
Gesture recognition mainly falls into two categories: recognition based on sensing equipment (e.g. a data glove plus a position tracker) and recognition based on vision. Because vision-based gesture recognition lets the operator interact in a more natural way and offers greater flexibility, it has attracted more research and attention. Most current vision-based methods detect and recognize the gesture in an image in two steps: first detect the hand position, then determine the gesture class.
The paper "Real-Time Hand Gesture Recognition Using Finger Segmentation" by Zhi-hua Chen et al. (The Scientific World Journal, 2014(3):267872) proposes a method based on hand detection and shape detection. The method first extracts the hand region with background subtraction and binarizes it, then segments the fingers and the palm, and finally classifies the gesture against the 13 original templates using the number of fingers and their content (content refers to the names of the fingers, e.g. thumb, forefinger, middle finger). However, this method places strict requirements on the image background and can segment the hand position only against a simple background. In addition, the gesture shapes it recognizes are limited, and its poor robustness makes it difficult to generalize.
The paper "A Real-time Hand Gesture Recognition and Human-Computer Interaction System" by Pei Xu (In CVPR, IEEE, 2017) proposes an algorithm based on hand detection and CNN recognition. The method uses elementary image-processing operations such as filtering and morphology to obtain a binary image containing only the hand, then feeds it into the convolutional neural network LeNet for feature extraction and recognition, improving accuracy. However, the method needs to pre-process the image, places high demands on the background color, and performs detection and recognition of the gesture in two steps (first obtaining the position of the gesture, then classifying the current gesture to obtain its state), so the recognition procedure is cumbersome and time-consuming.
The content of the invention
The object of the invention is to address the deficiencies of the prior art by proposing a more accurate and more efficient dynamic gesture recognition method based on computer vision.

The present invention is a dynamic gesture recognition method based on computer vision, characterized by comprising the following steps:
(1) Collect gesture images: divide the collected gesture images into a training set and a test set, annotate the gestures in each manually, and obtain the class and coordinate data of the ground-truth boxes;

(2) Cluster to obtain prior boxes: cluster the manually annotated ground-truth boxes, using the degree of overlap between box areas as the loss metric, to obtain several initial prior boxes;

(3) Build an end-to-end convolutional neural network that can simultaneously predict the position, size and class of the target gesture: with an improved GoogLeNet network as the framework, build the end-to-end convolutional neural network with a loss function that simultaneously constrains target position and class;
(4) Train the end-to-end network:

(4a) read in the gesture images of the training-set samples in batches;

(4b) randomly scale the images with bilinear interpolation, selecting a size that is a multiple of 32, to obtain the scaled versions of the read-in gesture images;

(4c) scale the input images again with bilinear interpolation to a fixed size, obtaining images that can be fed into the convolutional network;

(4d) train the convolutional neural network built in step (3) with the fixed-size images obtained in step (4c), obtaining the weights of the network;

(5) Load the weights: load the weights obtained in step (4d) into the convolutional neural network built in step (3);
(6) Predict the position and class of the gesture: read in the gesture image to be recognized and feed it into the convolutional neural network loaded with the weights, simultaneously obtaining the position coordinates and class information of the gesture target;

(7) Remove redundant prediction boxes: process the obtained position coordinates and class information with non-maximum suppression to obtain the final prediction boxes:

(7a) sort all prediction boxes by score in descending order and select the highest score and its corresponding box;

(7b) traverse the remaining boxes; if the overlap (IOU) with the current highest-scoring box exceeds a certain threshold, delete that box;

(7c) continue selecting the highest-scoring box among the unprocessed boxes and repeat the above procedure, i.e. steps (7a) to (7c), to obtain the retained prediction-box data;

(8) Visualize the prediction result: map the prediction-box data back onto the original image, draw the prediction boxes on it and mark the class labels of the gesture targets;

(9) Record and analyze: record the class and position information of the gesture in real time, analyze the resulting real-time data to interpret the dynamic gesture, and display the interpretation result directly on the screen.
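The non-maximum-suppression procedure of steps (7a)-(7c) can be sketched as follows. This is an illustrative NumPy sketch, not the patent's own code; the 0.5 IOU threshold is an assumed value, since the patent leaves the threshold unspecified:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the retained prediction boxes."""
    order = np.argsort(scores)[::-1]          # (7a) sort by score, descending
    keep = []
    while order.size > 0:
        best = order[0]                       # current highest-scoring box
        keep.append(int(best))
        rest = order[1:]
        # (7b) IOU of the best box against every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        # delete boxes overlapping the best one, then (7c) repeat on the rest
        order = rest[iou <= iou_thresh]
    return keep
```

Two boxes predicting the same gesture collapse to the higher-scoring one, while a distant box covering a second gesture survives.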
The invention recognizes gestures end to end with a deep convolutional neural network; it can not only recognize dynamic gestures in real time, but also maintain high accuracy against complex backgrounds.

Compared with the prior art, the present invention has the following advantages:

1. The invention recognizes gestures with a convolutional neural network, completing the position detection and classification of the gesture target in one step. The procedure is concise and recognition is fast, overcoming the defect of prior-art two-step processing (first detecting the hand position, then recognizing the gesture), which cannot guarantee real-time performance. At the same time the network extracts gesture-image features well and recognizes gestures at any angle with very high accuracy; it places no requirement on the image background and recognizes gestures accurately even against complex backgrounds, overcoming the prior-art defect of requiring a simple background.

2. When training the convolutional neural network, the invention randomly scales the gesture-image size, so that every few iterations gesture images of a changed size are fed into the network. Every 10 batches the algorithm randomly selects a new image dimension, which lets the network achieve good prediction at different input sizes; the same network can thus detect at different resolutions, and its robustness and generalization are stronger.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;

Fig. 2 shows the natural-scene gesture images used by the invention in the simulation experiment;

Fig. 3 shows the gesture-target recognition results obtained in the simulation experiment;

Fig. 4 shows recognition results of the present invention on a dynamic gesture, where Fig. 4(a) is one frame of the dynamic gesture whose sign-language meaning is "object", and Fig. 4(b) is one frame of the corresponding detection result;

Fig. 5 is a record of the gesture center-point coordinates during dynamic gesture recognition.
Embodiments
The present invention is described in detail below with reference to the accompanying drawings.
Embodiment 1
Gesture, as a natural and intuitive means of communication, has excellent application prospects: controlling smart devices in virtual reality with predefined gestures; translating sign language to solve the communication problems of the deaf and mute; automatic recognition of traffic-police gestures by driverless vehicles; and so on. Current vision-based gesture recognition technology mostly uses the conventional approach of first segmenting the gesture and then classifying it. This approach demands high image quality and has difficulty handling gestures against complex backgrounds, which limits the development of gesture-recognition applications. In view of this situation, the present invention, through research and innovation, proposes a dynamic gesture recognition method based on computer vision, referring to Fig. 1, comprising the following steps:
(1) Collect gesture images: divide the collected gesture images into a training set and a test set; the training set is used to train the convolutional neural network, and the test set is used to calculate the recognition accuracy of the network. Annotate the gesture in each collected image to obtain the size and center coordinates of the rectangle that most tightly encloses the gesture, together with the class of the gesture. This realizes the manual annotation of the gestures and yields the class and coordinate data of the ground-truth boxes.

(2) Cluster to obtain prior boxes: choose the number of cluster centers and cluster the manually annotated ground-truth boxes, using the degree of overlap between box areas as the loss metric, to obtain several initial prior boxes. In this example the number of cluster centers is set to 9; after clustering with the overlap degree as the loss metric, 9 initial prior boxes are obtained. Using these 9 boxes as the initial prediction boxes of the convolutional neural network shortens its convergence time. In general, the number of cluster centers depends on how densely targets appear in the images: the more targets per image, the more cluster centers should be set.
(3) Build an end-to-end convolutional neural network that can simultaneously predict the position, size and class of the target gesture: with an improved GoogLeNet network as the framework, build the end-to-end network with a loss function that simultaneously constrains target position, size and class, so that the network can predict all three at once. The network has a small computational cost, converges easily, and can classify up to 9000 target categories on the ImageNet data set.
(4) Train the end-to-end convolutional neural network: to strengthen the robustness of the network to image size, after reading in gesture images in batches, each image is scaled twice. The first scaling resizes the original input to a random size; the second resizes that random-size image to the specified size. The images scaled to the specified size are then fed into the convolutional neural network for training, yielding the trained weights. The concrete steps are:

(4a) read in the gesture images of the training-set samples in batches;

(4b) randomly scale the read-in gesture images with bilinear interpolation so that the scaled image size is a multiple of 32, obtaining the scaled versions of the read-in images. This diversifies the scale of the data, strengthens the robustness of the network, and thereby improves recognition accuracy;

(4c) scale the input images again with bilinear interpolation to a fixed size, obtaining images that can be fed into the convolutional network; in this example the fixed size is 672*672. The fixed size to which images are scaled is related to the structure of the convolutional neural network;

(4d) train the convolutional neural network built in step (3) with the fixed-size images obtained in step (4c), obtaining the weights of the network.

(5) Load the weights: load the network weights obtained in step (4d) into the convolutional neural network built in step (3); these weights are the network parameters required for prediction.
(6) Predict the position and class of the gesture: read in the gesture image to be recognized; the network first scales the input image to the size specified in (4c), then feeds it through the network loaded with the weights for recognition, simultaneously obtaining the position coordinates, size and class information of the gesture target.

(7) Remove redundant prediction boxes: process the obtained position coordinates and class information of the gesture with non-maximum suppression to obtain the final prediction boxes. Predictions for the same target are likely to yield multiple recognition boxes; non-maximum suppression removes the redundant boxes and retains the data of the box with the highest confidence. The concrete operations are:

(7a) sort all boxes by confidence score in descending order and select the box with the highest confidence;

(7b) traverse the remaining boxes; if the overlap (IOU) with the current highest-confidence box exceeds a certain threshold, delete that box;

(7c) continue selecting the highest-scoring box among the unprocessed boxes and repeat the above procedure, i.e. steps (7a) to (7c), to obtain the retained prediction-box data. The data of a prediction box include its position, size and class.

(8) Visualize the prediction result: the coordinate data and size of a predicted box are relative to the size specified in (4c), i.e. the fixed scaling size; map the prediction-box data from the fixed size back to the original image size (the size of the gesture image to be recognized), draw the prediction boxes on the original image, and mark the class labels of the gesture targets.

(9) Record and analyze: the invention needs only 0.02 s to recognize a single picture, meeting the requirements of real-time gesture recognition. The camera is called through OpenCV and, with the trained convolutional neural network, the class and position information of the gesture are recorded in real time; the resulting real-time data are analyzed to interpret the dynamic gesture, and the interpretation result is displayed directly on the screen.
The present invention builds an end-to-end convolutional neural network with a loss function that simultaneously constrains target position and class, and predicts position, size and class at the same time, thereby simplifying the gesture-recognition procedure and improving recognition speed. In the training stage, gesture images are randomly scaled before being fed into the network for training, which strengthens the robustness of the network and improves recognition accuracy.
Embodiment 2
The dynamic gesture recognition method based on computer vision is as in Embodiment 1. The clustering of the manually annotated ground-truth boxes in step (2) of the present invention specifically includes the following steps:

(2a) read the ground-truth box data manually annotated in the training-set and test-set samples;

(2b) set the number of cluster centers and cluster with the k-means algorithm, using the loss metric d(box, centroid) given by the following formula, to obtain the prior boxes:

d(box, centroid) = 1 - IOU(box, centroid)

where centroid denotes a randomly selected cluster-center box, box denotes one of the other ground-truth boxes, and IOU(box, centroid) measures the similarity between that box and the cluster-center box, i.e. the overlap ratio of the two boxes, computed as their intersection divided by their union.

Through clustering, the invention obtains the several prior boxes that are most representative of the manually annotated ground-truth boxes; the prior boxes serve as the initial boxes of the neural-network prediction. Determining the prior boxes narrows the prediction range of the convolutional neural network and accelerates its convergence.
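Steps (2a)-(2b) can be sketched as below. This is an illustrative NumPy sketch written for this description, not the patent's implementation; box sizes are compared as if sharing a common top-left corner, as is usual for prior-box (anchor) clustering:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (N, 2) box sizes and (K, 2) centroid sizes, boxes aligned at the origin."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_prior_boxes(boxes, k, iters=100, seed=0):
    """Cluster ground-truth box sizes into k prior boxes using d = 1 - IOU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        d = 1.0 - iou_wh(boxes, centroids)       # loss metric d(box, centroid)
        assign = d.argmin(axis=1)                # nearest cluster-center box
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break                                # cluster centers have converged
        centroids = new
    return centroids
```

With k set to 9 as in Embodiment 1 (or 5 as in Embodiment 5), the returned centroids are the initial prior boxes of the network.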
Embodiment 3
The dynamic gesture recognition method based on computer vision is as in Embodiments 1-2. Building the convolutional neural network in step (3) of the present invention includes the following steps:

(3a) based on the GoogLeNet convolutional neural network, using only simple 1*1 and 3*3 convolution kernels, build a convolutional neural network comprising G convolutional layers and 5 pooling layers; in this example G is taken as 25.

(3b) train the built convolutional network with the loss function given by the following formula:

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right]\\
&+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right]\\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}
+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2}\\
&+\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}
\end{aligned}
$$

where the first term of the loss function is the center-point coordinate loss of the predicted target box, in which λ_coord is the coordinate-loss coefficient, 1 ≤ λ_coord ≤ 5, taken as 3 in this example to ensure that the predicted gesture position is accurate; S² is the number of grid cells the picture is divided into, and B is the number of boxes predicted per cell; 1^obj_ij indicates whether, when a target is present, the j-th prediction box in the i-th cell is responsible for predicting that target; (x_i, y_i) are the center coordinates of the ground-truth box of the target, and (x̂_i, ŷ_i) the center coordinates of the prediction box. The second term is the width-height loss of the prediction box, where (w_i, h_i) are the width and height of the ground-truth box and (ŵ_i, ĥ_i) those of the prediction box. The third and fourth terms are the losses of the probability that a prediction box contains a target, where λ_noobj is the loss coefficient when no target is contained, 0.1 ≤ λ_noobj ≤ 1, taken as 1 in this example, to ensure that the convolutional neural network can distinguish target from background; 1^noobj_ij indicates whether, when no target is contained, the j-th prediction box in the i-th cell is responsible for the prediction; C_i is the true probability of containing a target and Ĉ_i the predicted probability. The fifth term is the class-probability loss, where 1^obj_i indicates that the i-th cell contains a target center; p_i(c) is the true target class and p̂_i(c) the predicted class; c ranges over the classes.
In the embodiment of the present invention, the position detection and classification of the gesture are completed in one step. A convolutional neural network extracts features from the original gesture image, and the network is trained by reducing the position loss and the classification loss together, so that it recognizes the gesture class at the same time as it detects the gesture position.
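The five-term loss of step (3b) can be rendered numerically as below. This is a hypothetical NumPy sketch for illustration (the patent's network evaluates this inside its training framework); predictions and ground truth are laid out per prediction box as (x, y, w, h, C, class probabilities), with a boolean mask marking the boxes responsible for a target:

```python
import numpy as np

def five_term_loss(pred, truth, obj_mask, lam_coord=5.0, lam_noobj=0.5):
    """pred, truth: (S*S*B, 5 + num_classes) arrays of x, y, w, h, C, class probs.
    obj_mask: (S*S*B,) bool, True where a box is responsible for a target."""
    o, n = obj_mask, ~obj_mask
    # term 1: center-point coordinate loss
    t1 = lam_coord * np.sum((pred[o, 0] - truth[o, 0]) ** 2 +
                            (pred[o, 1] - truth[o, 1]) ** 2)
    # term 2: width/height loss (square roots damp the influence of large boxes)
    t2 = lam_coord * np.sum((np.sqrt(pred[o, 2]) - np.sqrt(truth[o, 2])) ** 2 +
                            (np.sqrt(pred[o, 3]) - np.sqrt(truth[o, 3])) ** 2)
    # terms 3 and 4: target-containing probability loss, with and without a target
    t3 = np.sum((pred[o, 4] - truth[o, 4]) ** 2)
    t4 = lam_noobj * np.sum((pred[n, 4] - truth[n, 4]) ** 2)
    # term 5: class-probability loss over the responsible boxes
    t5 = np.sum((pred[o, 5:] - truth[o, 5:]) ** 2)
    return t1 + t2 + t3 + t4 + t5
```

A perfect prediction gives zero loss; any deviation in position, size, confidence or class increases it.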
Embodiment 4
The dynamic gesture recognition method based on computer vision is as in Embodiments 1-3. The random scaling of the images with bilinear interpolation described in step (4b) of the present invention, with the size chosen as a multiple of 32 to obtain the scaled input images, proceeds as follows:

4b1: read in a gesture image to be recognized.

4b2: randomly scale the image with bilinear interpolation, choosing a size that is a multiple of 32, to obtain the scaled input image.

The pending gesture image input in the embodiment of the present invention is shown in Fig. 2; the pixel range of the image is [600-1000]. After scaling, the image size is chosen among the multiples of 32 {480, 512, ... 832}, at minimum 480*480 and at maximum 832*832, yielding the scaled input image.

The invention randomly scales the gesture-image size when training the convolutional neural network to increase the network's robustness to image size. Every 10 batches the algorithm randomly rescales the gesture images, so that the network achieves good prediction at different input sizes and the same network can detect at different resolutions. The identical network can thus predict on gesture images of different resolutions, with stronger robustness and generalization.
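Steps 4b1-4b2 can be sketched as below. This is an illustrative NumPy sketch written for this description, not the patent's code: a training size is drawn from the multiples of 32 in [480, 832], and the image is resampled with bilinear interpolation:

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resample an (H, W, C) image to (out_h, out_w, C) with bilinear interpolation."""
    h, w = img.shape[:2]
    ys = (np.arange(out_h) + 0.5) * h / out_h - 0.5   # sample positions in source rows
    xs = (np.arange(out_w) + 0.5) * w / out_w - 0.5   # sample positions in source cols
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    dy = np.clip(ys - y0, 0.0, 1.0)[:, None, None]    # fractional row offsets
    dx = np.clip(xs - x0, 0.0, 1.0)[None, :, None]    # fractional col offsets
    p00 = img[y0[:, None], x0[None, :]]
    p01 = img[y0[:, None], x0[None, :] + 1]
    p10 = img[y0[:, None] + 1, x0[None, :]]
    p11 = img[y0[:, None] + 1, x0[None, :] + 1]
    return ((1 - dy) * (1 - dx) * p00 + (1 - dy) * dx * p01 +
            dy * (1 - dx) * p10 + dy * dx * p11)

def random_training_size(rng):
    """Pick one of the multiples of 32 between 480 and 832 (a new one every 10 batches)."""
    return int(rng.choice(np.arange(480, 833, 32)))
```

During training, each batch of gesture images would be resized to `random_training_size(rng)` on both sides before the second, fixed-size scaling of step (4c).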
Below, with reference to the accompanying drawings, a more complete example further describes the present invention.

Embodiment 5

The dynamic gesture recognition method based on computer vision is as in Embodiments 1-4. Referring to Fig. 1, the concrete implementation steps include:
Step 1: Collect gesture images. Gesture images are shot with a camera and include "stone", "scissors", "cloth", "stick", "OK", "love", etc., referring to Fig. 2(a)-(f). Fig. 2(a) is the fist gesture from front and back, Fig. 2(b) the "scissors" gesture from front and back, Fig. 2(c) the palm gesture from front and back, Fig. 2(d) the thumbs-up gesture, Fig. 2(e) the "OK" gesture, and Fig. 2(f) the "love" gesture. Each gesture image also contains some complex background, and the same gesture appears at a variety of rotation angles. The collected gesture images are divided into a training set and a test set, and the gestures in the collected images are manually annotated to obtain the class and coordinate data of the ground-truth boxes.

In this example the collected natural-scene gesture image set comprises 2500 images of 6 representative gestures, evenly divided into a training set of 2000 images and a test set of 500 images, referring to Fig. 2. The image set was shot with a 12-megapixel mobile-phone camera, and the shot images were screened and manually annotated.
Step 2: Cluster to obtain the prior boxes.

Read the ground-truth box data of the training-set and test-set samples.

In the present embodiment, the ground-truth boxes of the training-set and test-set samples are the manually annotated target-box coordinates and class information in the images.

Cluster with the k-means algorithm, using the loss metric d(box, centroid) given by the following formula, to obtain the prior boxes:

d(box, centroid) = 1 - IOU(box, centroid)

where centroid denotes a randomly selected cluster-center box, box denotes one of the other ground-truth boxes, and IOU(box, centroid) measures the similarity between that box and the cluster-center box, computed as the intersection of the two divided by their union. The number of cluster-center boxes chosen in this example is 5. IOU(box, centroid) is computed according to the following formula:

IOU(box, centroid) = (centroid ∩ box) / (centroid ∪ box)

where ∩ denotes the area of the intersection region of the two boxes centroid and box, and ∪ the area of their union region.
Step 3: Build the convolutional neural network.

Based on the GoogLeNet convolutional neural network, using only simple 1*1 and 3*3 convolution kernels, build a convolutional neural network comprising G convolutional layers and 5 pooling layers; in this example G is taken as 23.

Train the built convolutional network with the loss function given in step (3b) of Embodiment 3, where the first term of the loss function is the center-point coordinate loss of the predicted target box, with the coordinate-loss coefficient λ_coord taken as 5 in this example, and the third and fourth terms are the losses of the probability that a prediction box contains a target, with the no-target loss coefficient λ_noobj taken as 0.5 in this example.

Even for the same gesture, different shooting angles yield different images. Existing methods find it difficult to recognize the different angles of the same gesture stably, but the convolutional neural network built by the present invention overcomes the difficulty of recognizing the same gesture at multiple rotation angles and gives gesture recognition good stability.
Step 4: Train the network.

Read in the gesture images of the training-set samples in batches. In the present embodiment, the network reads 64 training-set images per batch.

Randomly scale the images with bilinear interpolation, choosing a scaled gesture-image size that is a multiple of 32, to obtain the scaled input images.

The pending gesture images input in the present embodiment are shown in Fig. 2; the pixel range of the gesture images is [500-800]. After scaling, the image size is chosen among the multiples of 32 {480, 512, ... 736}, at minimum 480*480 and at maximum 736*736, yielding the scaled gesture images.

Scale the scaled gesture images again to a fixed size with bilinear interpolation, obtaining images that can be fed into the convolutional network; in this example the fixed size the gesture images are scaled to is 608*608.

Feed the fixed-size gesture images into the built convolutional neural network for training and obtain the network weights; the weights are the parameters of the convolutional neural network and are used at test time. Training the network with the training-set samples for 20,000 iterations yields the weights, and training is complete.
Step 5: Load the network weights (i.e. the parameters) obtained in Step 4 into the convolutional neural network built in Step 3, in preparation for testing.
Step 6: Read in the gesture images to be recognised from the test set and input them into the weight-loaded network for recognition, obtaining the size, position coordinates and class information of the recognised gesture targets; see Fig. 3, where Fig. 3(a)-(f) are the recognition results of the present invention corresponding to Fig. 2(a)-(f).
Step 7: Process the obtained positions and class information with non-maximum suppression to obtain the final predicted boxes.
All predicted boxes are sorted in descending order of confidence score, and the top score and its corresponding box are selected;
the remaining predicted boxes are traversed, and any box whose overlap (IOU) with the current highest-confidence box exceeds a certain threshold is deleted;
the highest-scoring box is then selected from the boxes not yet processed and the above process is repeated, giving the retained predicted-box data.
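The three steps above can be sketched as a plain non-maximum-suppression routine (illustrative only; the box format and the threshold value are assumptions):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # current highest-confidence box
        keep.append(best)
        # Delete every remaining box whose IOU with it exceeds the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

The returned indices identify the retained predicted boxes; all duplicates of the same target are suppressed.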
Step 8: Map the predicted-box data back onto the original image to obtain the class and position information of the gestures; draw the predicted boxes on the original image and mark the class labels of the targets. Referring to Fig. 3(a)-3(f), the label in the upper-left corner of each predicted box in every figure is the predicted gesture class.
Step 9: Record the class and position information of the gestures in real time (see Fig. 4), analyse the resulting real-time data, interpret the dynamic gesture, and display the interpretation result directly on screen; see Table 1.
Table 1 Real-time test results of dynamic gesture recognition
Predicted gesture centre-point abscissa | Predicted gesture centre-point ordinate | Gesture class |
---|---|---|
1164 | 371 | Scissors |
318 | 372 | Scissors |
1152 | 373 | Scissors |
364 | 384 | Scissors |
1097 | 380 | Scissors |
388 | 388 | Scissors |
1061 | 381 | Scissors |
1027 | 383 | Scissors |
430 | 409 | Scissors |
452 | 395 | Scissors |
1001 | 380 | Scissors |
465 | 397 | Scissors |
989 | 381 | Scissors |
510 | 395 | Scissors |
960 | 381 | Scissors |
524 | 392 | Scissors |
951 | 384 | Scissors |
557 | 395 | Scissors |
918 | 394 | Scissors |
561 | 396 | Scissors |
The data in Table 1 are part of the records of the dynamic process, shown in Fig. 4, in which the two gestures move horizontally inwards from both sides. Fig. 4(a) is one frame of the dynamic sign-language gesture whose meaning is "object", and Fig. 4(b) is one frame of the detection result for that dynamic gesture process. From the data in Table 1, the gestures keep the "scissors" state unchanged. Visualising the coordinate data of Table 1 as a graph gives Fig. 5, in which the abscissa represents the abscissa of the gesture centre point in the current frame image and the ordinate represents its ordinate. The points in Fig. 5 are the coordinate records of the two "scissors" gestures moving dynamically from the outside inwards. As can be seen from Fig. 5, the centre-point ordinates of the dynamic gestures shown are essentially unchanged while the abscissas change considerably, indicating that the process is the two "scissors" gestures drawing together horizontally, corresponding to the meaning of "object" in sign language; see Fig. 4.
In the embodiment of the present invention, the distribution histogram of the motion trajectory is calculated to judge the motion of the gesture, and this is combined with the change of gesture state during the motion to judge the meaning the gesture expresses over the whole dynamic process, encompassing both static gesture recognition and dynamic gesture interpretation analysis.
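This kind of trajectory analysis can be sketched with simple span statistics on the recorded centre points (hypothetical helper functions; the threshold and the returned strings are assumptions, not taken from the patent):

```python
def motion_descriptor(points, move_thresh=50):
    """Judge the dominant motion of one hand's centre-point track.

    points: list of (x, y) per frame. Returns 'horizontal', 'vertical'
    or 'static' depending on which coordinate span dominates.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_span = max(xs) - min(xs)
    y_span = max(ys) - min(ys)
    if x_span < move_thresh and y_span < move_thresh:
        return "static"
    return "horizontal" if x_span >= y_span else "vertical"

def interpret_two_hands(left, right, move_thresh=50):
    """Two tracks moving horizontally towards each other, as in Fig. 5."""
    if motion_descriptor(left, move_thresh) == "horizontal" \
            and motion_descriptor(right, move_thresh) == "horizontal" \
            and left[-1][0] > left[0][0] and right[-1][0] < right[0][0]:
        return "hands converge horizontally"
    return "unrecognised motion"
```

Applied to the left-hand and right-hand centre-point records of Table 1, both tracks are classified as horizontal and the pair as converging.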
The technical effect of the present invention is explained again below in combination with simulations.
Embodiment 6
The dynamic gesture recognition method based on computer vision is as in Embodiments 1-5.
Simulation conditions:
The hardware platform of the simulation experiments of the present invention is a Dell computer with an Intel(R) Core i5 processor, a main frequency of 3.20 GHz and 64 GB of memory; the simulation software platform is Visual Studio (2015).
Simulation content and analysis of results:
The simulation of the present invention is divided into two experiments.
The position coordinates and class data of the data set are first collected by manual labelling and made into a data set in PASCAL VOC format, of which 80% serves as training-set samples and 20% as test-set samples.
Simulation experiment 1: comparison of the present invention with the prior art. The present invention, the prior-art method based on hand detection and shape detection, and the prior-art method based on hand detection and CNN recognition were each trained with the same training-set samples, and the methods were then evaluated with the same test-set samples. The evaluation results are shown in Table 2, where Alg1 denotes the method of the present invention, Alg2 the method based on hand detection and shape detection, and Alg3 the method based on hand detection and CNN recognition.
Table 2 Test-set accuracy of the three methods in the simulation experiment
Test image | Alg1 | Alg2 | Alg3 |
---|---|---|---|
Accuracy (%) | 98.0 | 31.3 | 78.6 |
Time per image (s) | 0.02 | 0.13 | 0.94 |
As can be seen from Table 2, compared with the method based on hand detection and shape detection and the method based on hand detection and CNN recognition, the present invention has a clear advantage in gesture-recognition accuracy, raising the recognition rate by nearly 67 and 20 percentage points respectively, and its recognition speed is about 6 and 47 times faster than the other two methods respectively. The recognition rate of the present invention is higher than that of the other two algorithms because the present invention maintains a very high recognition rate under complex backgrounds and over multiple gesture angles. Its recognition speed exceeds that of the other two algorithms because the present invention constructs an end-to-end convolutional neural network that predicts the position and class of a gesture simultaneously, rather than in two separate stages. The simulation results show that the present invention achieves a high recognition rate and fast speed in gesture target recognition, performing particularly well under complex background conditions.
Embodiment 7
The dynamic gesture recognition method based on computer vision is as in Embodiments 1-5, and the simulation conditions and content are as in Embodiment 6.
Simulation experiment 2: using the method of the present invention, gesture images scaled to different sizes were used on the test set as the input of the network; the test evaluation results are shown in Table 3.
Table 3 Recognition results for different network input sizes
It can be observed from Table 3 that once the input image is scaled to a certain size, the target-recognition accuracy of the present invention shows no significant change; therefore, considering recognition rate and recognition speed together, a fixed size of 608*608 is selected as the optimal gesture-image input size for the convolutional neural network.
The dynamic gesture recognition method based on computer vision proposed by the present invention achieves better recognition accuracy for gesture targets and can perform gesture recognition in real time.
In summary, the present invention discloses a dynamic gesture recognition method based on computer vision that solves the problem of dynamic recognition of gestures under complex backgrounds. Its steps are: collect a gesture data set and label it manually; cluster the ground-truth boxes of the labelled image set to obtain the prior boxes for training; build an end-to-end convolutional neural network that can simultaneously predict position, size and class; train the network to obtain the weights; load the weights into the network; input gesture images for recognition; process the obtained position coordinates and class information with non-maximum suppression; obtain the final recognition-result images; and record the recognition information in real time to obtain the dynamic gesture interpretation result. The present invention overcomes the defect of the prior art that hand detection and classification are carried out in separate steps, greatly simplifying the gesture-recognition process, improving recognition accuracy and speed, enhancing the robustness of the recognition system, and realising a dynamic gesture interpretation function. The present invention can be applied to fields such as human-computer interaction in virtual reality, sign-language translation, and automatic recognition of the gestures of unmanned traffic police.
Claims (4)
1. A dynamic gesture recognition method based on computer vision, characterised by comprising the following steps:
(1) collecting gesture images: dividing the collected gesture images into a training set and a test set, and manually labelling the gestures therein to obtain the classes and coordinate data of the ground-truth boxes;
(2) clustering to obtain prior boxes: clustering the manually labelled ground-truth boxes, with the degree of area overlap of the boxes as the loss metric, to obtain several initial prior boxes;
(3) building an end-to-end convolutional neural network capable of simultaneously predicting the position, size and class of the target gesture: taking an improved GoogLeNet network as the network framework, and building the end-to-end convolutional neural network with a loss function that simultaneously constrains target position and class;
(4) training the end-to-end convolutional neural network: in order to strengthen the robustness of the convolutional neural network to image size, after reading in gesture images in batches, scaling the read gesture images twice: the first scaling is a random scaling of the originally input gesture images to an arbitrary size, and the second scaling resizes the arbitrarily sized images to a specified size; finally, feeding the gesture images scaled to the specified size into the convolutional neural network for training to obtain the trained weights, specifically comprising the following steps:
(4a) reading in the gesture images of the training-set samples in batches;
(4b) scaling the images at random by bilinear interpolation, the size being chosen as a multiple of 32, to obtain the scaled read-in gesture images;
(4c) resizing the scaled gesture images obtained in step (4b) again by bilinear interpolation to a fixed size, to obtain images that can be fed into the convolutional network;
(4d) training the convolutional neural network built in step (3) with the fixed-size images obtained in step (4c), to obtain the corresponding weights of the convolutional neural network;
(5) loading the weights: loading the network weights obtained in step (4d) into the convolutional neural network built in step (3);
(6) predicting the position and class of the gesture: reading in the gesture images to be recognised and feeding them into the weight-loaded network for recognition, simultaneously obtaining the position coordinates and class information of the recognised gesture targets;
(7) removing redundant predicted boxes: processing the obtained position coordinates and class information with non-maximum suppression to obtain the final predicted boxes:
(7a) sorting all boxes in descending order of score, and selecting the top score and its corresponding box;
(7b) traversing the remaining predicted boxes, and deleting any box whose overlap (IOU) with the current highest-scoring box exceeds a certain threshold;
(7c) continuing to select the highest-scoring box from the boxes not yet processed and repeating the above process, to obtain the retained predicted-box data;
(8) visualising the prediction results: mapping the predicted-box data onto the original image, drawing the predicted boxes on the original image and marking the class labels of the gesture targets;
(9) recording and analysing: recording the class and position information of the gestures in real time, analysing the resulting real-time data, interpreting the dynamic gesture, and displaying the interpretation result directly on screen.
2. The dynamic gesture recognition method based on computer vision according to claim 1, characterised in that the clustering of the manually labelled ground-truth boxes in step (2) specifically comprises the following steps:
(2a) reading the ground-truth box data of the gesture-image training-set and test-set samples;
(2b) clustering with the k-means clustering algorithm according to the loss metric d(box, centroid) below, to obtain the prior boxes:
d(box, centroid) = 1 - IOU(box, centroid)
where centroid denotes the randomly selected cluster-centre box, box denotes the other ground-truth boxes apart from the centre box, and IOU(box, centroid) denotes the degree of similarity between the other boxes and the centre box, calculated as the intersection of the two divided by their union.
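The clustering in (2a)-(2b) can be sketched as k-means over ground-truth (width, height) pairs with the distance d = 1 - IOU, where the IOU is computed with both boxes anchored at a common corner (a minimal sketch; the function names and the anchoring convention are assumptions):

```python
import random

def wh_iou(a, b):
    """IOU of two boxes (w, h) anchored at a common top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_priors(boxes, k, iters=50, seed=0):
    """k-means over ground-truth (w, h) pairs with distance d = 1 - IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Assign each box to the centroid with the smallest 1 - IOU.
            j = min(range(k), key=lambda j: 1 - wh_iou(b, centroids[j]))
            clusters[j].append(b)
        new_centroids = []
        for j, c in enumerate(clusters):
            if c:  # the mean width/height of the cluster becomes the new prior
                new_centroids.append((sum(w for w, _ in c) / len(c),
                                      sum(h for _, h in c) / len(c)))
            else:  # keep an empty cluster's previous centroid
                new_centroids.append(centroids[j])
        centroids = new_centroids
    return centroids
```

Using 1 - IOU rather than Euclidean distance keeps large boxes from dominating the clustering, so the resulting priors reflect typical box shapes rather than sizes alone.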
3. The dynamic gesture recognition method based on computer vision according to claim 1, characterised in that the building of the convolutional neural network in step (3) comprises the following steps:
(3a) taking the GoogLeNet convolutional neural network as the basis and, with simple 1*1 and 3*3 convolution kernels, building a convolutional neural network comprising G convolutional layers and 5 pooling layers;
(3b) training the built convolutional network with the loss function below:
$$
\begin{aligned}
loss = {} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
& + \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
  + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} I_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$
where the first term of the loss function is the centre-point coordinate loss of the predicted target box, in which λcoord is the coordinate loss coefficient, taken as 5 in this example; S² denotes the number of grid cells the image is divided into, and B denotes the number of boxes predicted per cell; I_ij^obj indicates, when a target is present, whether the j-th predicted box in the i-th cell is responsible for predicting that target; (x_i, y_i) denote the centre-point coordinates of the ground-truth box, and (x̂_i, ŷ_i) the centre-point coordinates of the predicted box. The second term of the function is the width-height loss of the predicted box, where (w_i, h_i) denote the width and height of the ground-truth box and (ŵ_i, ĥ_i) the width and height of the predicted box. The third and fourth terms of the function are the losses on the probability that a predicted box contains a target, where λnoobj denotes the loss coefficient when no target is contained, taken here as 0.5; I_ij^noobj indicates, when no target is contained, whether the j-th predicted box in the i-th cell is responsible for that target; C_i denotes the true probability of containing a target, and Ĉ_i the predicted probability of containing a target. The fifth term of the function is the predicted class-probability loss, where I_i^obj indicates that the i-th cell contains a target centre point; p_i(c) denotes the true target class probability, p̂_i(c) the predicted target class probability, and c indexes the classes.
4. The dynamic gesture recognition method based on computer vision according to claim 1, wherein the random scaling of the images by bilinear interpolation in step (4b), with the gesture-image size chosen as a multiple of 32 to obtain the scaled input images, is carried out as follows:
4b1: reading in a gesture image to be identified;
4b2: scaling the gesture image at random by bilinear interpolation, the size being chosen as a multiple of 32, to obtain the scaled read-in gesture image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711102008.9A CN107808143B (en) | 2017-11-10 | 2017-11-10 | Dynamic gesture recognition method based on computer vision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711102008.9A CN107808143B (en) | 2017-11-10 | 2017-11-10 | Dynamic gesture recognition method based on computer vision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107808143A true CN107808143A (en) | 2018-03-16 |
CN107808143B CN107808143B (en) | 2021-06-01 |
Family
ID=61592035
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711102008.9A Active CN107808143B (en) | 2017-11-10 | 2017-11-10 | Dynamic gesture recognition method based on computer vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107808143B (en) |
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109117806A (en) * | 2018-08-22 | 2019-01-01 | 歌尔科技有限公司 | A kind of gesture identification method and device |
CN109145756A (en) * | 2018-07-24 | 2019-01-04 | 湖南万为智能机器人技术有限公司 | Object detection method based on machine vision and deep learning |
CN109165555A (en) * | 2018-07-24 | 2019-01-08 | 广东数相智能科技有限公司 | Man-machine finger-guessing game method, apparatus and storage medium based on image recognition |
CN109325454A (en) * | 2018-09-28 | 2019-02-12 | 合肥工业大学 | A kind of static gesture real-time identification method based on YOLOv3 |
CN109583456A (en) * | 2018-11-20 | 2019-04-05 | 西安电子科技大学 | Infrared surface object detection method based on Fusion Features and dense connection |
CN109697407A (en) * | 2018-11-13 | 2019-04-30 | 北京物灵智能科技有限公司 | A kind of image processing method and device |
CN109815876A (en) * | 2019-01-17 | 2019-05-28 | 西安电子科技大学 | Gesture identification method based on address events stream feature |
CN109934184A (en) * | 2019-03-19 | 2019-06-25 | 网易(杭州)网络有限公司 | Gesture identification method and device, storage medium, processor |
CN109948480A (en) * | 2019-03-05 | 2019-06-28 | 中国电子科技集团公司第二十八研究所 | A kind of non-maxima suppression method for arbitrary quadrilateral |
CN109948690A (en) * | 2019-03-14 | 2019-06-28 | 西南交通大学 | A kind of high-speed rail scene perception method based on deep learning and structural information |
CN110135398A (en) * | 2019-05-28 | 2019-08-16 | 厦门瑞为信息技术有限公司 | Both hands off-direction disk detection method based on computer vision |
CN110135408A (en) * | 2019-03-26 | 2019-08-16 | 北京捷通华声科技股份有限公司 | Text image detection method, network and equipment |
CN110135237A (en) * | 2019-03-24 | 2019-08-16 | 北京化工大学 | A kind of gesture identification method |
CN110163048A (en) * | 2018-07-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Identification model training method, recognition methods and the equipment of hand key point |
CN110287771A (en) * | 2019-05-10 | 2019-09-27 | 平安科技(深圳)有限公司 | Image palm area extracting method and device |
CN110348323A (en) * | 2019-06-19 | 2019-10-18 | 广东工业大学 | A kind of wearable device gesture identification method based on Neural Network Optimization |
CN110363158A (en) * | 2019-07-17 | 2019-10-22 | 浙江大学 | A kind of millimetre-wave radar neural network based cooperates with object detection and recognition method with vision |
WO2019201029A1 (en) * | 2018-04-19 | 2019-10-24 | 华为技术有限公司 | Candidate box update method and apparatus |
CN110414402A (en) * | 2019-07-22 | 2019-11-05 | 北京达佳互联信息技术有限公司 | A kind of gesture data mask method, device, electronic equipment and storage medium |
CN110458059A (en) * | 2019-07-30 | 2019-11-15 | 北京科技大学 | A kind of gesture identification method based on computer vision and identification device |
CN110796018A (en) * | 2019-09-30 | 2020-02-14 | 武汉科技大学 | Hand motion recognition method based on depth image and color image |
CN110837766A (en) * | 2018-08-17 | 2020-02-25 | 北京市商汤科技开发有限公司 | Gesture recognition method, gesture processing method and device |
CN111050266A (en) * | 2019-12-20 | 2020-04-21 | 朱凤邹 | Method and system for performing function control based on earphone detection action |
CN111061367A (en) * | 2019-12-05 | 2020-04-24 | 神思电子技术股份有限公司 | Method for realizing gesture mouse of self-service equipment |
CN111104820A (en) * | 2018-10-25 | 2020-05-05 | 中车株洲电力机车研究所有限公司 | Gesture recognition method based on deep learning |
CN111127457A (en) * | 2019-12-25 | 2020-05-08 | 上海找钢网信息科技股份有限公司 | Reinforcing steel bar number statistical model training method, statistical method, device and equipment |
CN111275010A (en) * | 2020-02-25 | 2020-06-12 | 福建师范大学 | Pedestrian re-identification method based on computer vision |
CN111310800A (en) * | 2020-01-20 | 2020-06-19 | 世纪龙信息网络有限责任公司 | Image classification model generation method and device, computer equipment and storage medium |
CN111382643A (en) * | 2018-12-30 | 2020-07-07 | 广州市百果园信息技术有限公司 | Gesture detection method, device, equipment and storage medium |
CN111476084A (en) * | 2020-02-25 | 2020-07-31 | 福建师范大学 | Deep learning-based parking lot dynamic parking space condition identification method |
CN111597888A (en) * | 2020-04-09 | 2020-08-28 | 上海容易网电子商务股份有限公司 | Gesture recognition method combining Gaussian mixture model and CNN |
CN111639740A (en) * | 2020-05-09 | 2020-09-08 | 武汉工程大学 | Steel bar counting method based on multi-scale convolution neural network |
CN111653103A (en) * | 2020-05-07 | 2020-09-11 | 浙江大华技术股份有限公司 | Target object identification method and device |
CN111880661A (en) * | 2020-07-31 | 2020-11-03 | Oppo广东移动通信有限公司 | Gesture recognition method and device |
CN112232282A (en) * | 2020-11-04 | 2021-01-15 | 苏州臻迪智能科技有限公司 | Gesture recognition method and device, storage medium and electronic equipment |
CN112416128A (en) * | 2020-11-23 | 2021-02-26 | 森思泰克河北科技有限公司 | Gesture recognition method and terminal equipment |
CN112464860A (en) * | 2020-12-10 | 2021-03-09 | 深圳市优必选科技股份有限公司 | Gesture recognition method and device, computer equipment and storage medium |
CN112487913A (en) * | 2020-11-24 | 2021-03-12 | 北京市地铁运营有限公司运营四分公司 | Labeling method and device based on neural network and electronic equipment |
CN113297956A (en) * | 2021-05-22 | 2021-08-24 | 温州大学 | Gesture recognition method and system based on vision |
CN113627265A (en) * | 2021-07-13 | 2021-11-09 | 深圳市创客火科技有限公司 | Unmanned aerial vehicle control method and device and computer readable storage medium |
CN114035687A (en) * | 2021-11-12 | 2022-02-11 | 郑州大学 | Gesture recognition method and system based on virtual reality |
CN114627561A (en) * | 2022-05-16 | 2022-06-14 | 南昌虚拟现实研究院股份有限公司 | Dynamic gesture recognition method and device, readable storage medium and electronic equipment |
WO2022120669A1 (en) * | 2020-12-10 | 2022-06-16 | 深圳市优必选科技股份有限公司 | Gesture recognition method, computer device and storage medium |
CN113312973B (en) * | 2021-04-25 | 2023-06-02 | 北京信息科技大学 | Gesture recognition key point feature extraction method and system |
CN117079493A (en) * | 2023-08-17 | 2023-11-17 | 深圳市盛世基业物联网有限公司 | Intelligent parking management method and system based on Internet of things |
US12051236B2 (en) | 2018-09-21 | 2024-07-30 | Bigo Technology Pte. Ltd. | Method for recognizing video action, and device and storage medium thereof |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160334975A1 (en) * | 2015-05-12 | 2016-11-17 | Konica Minolta, Inc. | Information processing device, non-transitory computer-readable recording medium storing an information processing program, and information processing method |
US20170046568A1 (en) * | 2012-04-18 | 2017-02-16 | Arb Labs Inc. | Systems and methods of identifying a gesture using gesture data compressed by principal joint variable analysis |
CN106960036A (en) * | 2017-03-09 | 2017-07-18 | 杭州电子科技大学 | A kind of database building method for gesture identification |
CN107168527A (en) * | 2017-04-25 | 2017-09-15 | 华南理工大学 | The first visual angle gesture identification and exchange method based on region convolutional neural networks |
- 2017-11-10 CN CN201711102008.9A patent/CN107808143B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170046568A1 (en) * | 2012-04-18 | 2017-02-16 | Arb Labs Inc. | Systems and methods of identifying a gesture using gesture data compressed by principal joint variable analysis |
US20160334975A1 (en) * | 2015-05-12 | 2016-11-17 | Konica Minolta, Inc. | Information processing device, non-transitory computer-readable recording medium storing an information processing program, and information processing method |
CN106960036A (en) * | 2017-03-09 | 2017-07-18 | 杭州电子科技大学 | A kind of database building method for gesture identification |
CN107168527A (en) * | 2017-04-25 | 2017-09-15 | 华南理工大学 | The first visual angle gesture identification and exchange method based on region convolutional neural networks |
Non-Patent Citations (2)
Title |
---|
XIAO JIANG ET AL.: "A Dynamic Gesture Recognition Method Based on Computer Vision", 《2013 6TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING (CISP)》 * |
关然 等: "基于计算机视觉的手势检测识别技术", 《计算机应用与软件》 * |
Cited By (64)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390344A (en) * | 2018-04-19 | 2019-10-29 | 华为技术有限公司 | Alternative frame update method and device |
WO2019201029A1 (en) * | 2018-04-19 | 2019-10-24 | 华为技术有限公司 | Candidate box update method and apparatus |
CN110163048A (en) * | 2018-07-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Identification model training method, recognition methods and the equipment of hand key point |
CN110163048B (en) * | 2018-07-10 | 2023-06-02 | 腾讯科技(深圳)有限公司 | Hand key point recognition model training method, hand key point recognition method and hand key point recognition equipment |
CN109145756A (en) * | 2018-07-24 | 2019-01-04 | 湖南万为智能机器人技术有限公司 | Object detection method based on machine vision and deep learning |
CN109165555A (en) * | 2018-07-24 | 2019-01-08 | 广东数相智能科技有限公司 | Man-machine finger-guessing game method, apparatus and storage medium based on image recognition |
CN110837766B (en) * | 2018-08-17 | 2023-05-05 | 北京市商汤科技开发有限公司 | Gesture recognition method, gesture processing method and device |
CN110837766A (en) * | 2018-08-17 | 2020-02-25 | 北京市商汤科技开发有限公司 | Gesture recognition method, gesture processing method and device |
CN109117806B (en) * | 2018-08-22 | 2020-11-27 | 歌尔科技有限公司 | Gesture recognition method and device |
CN109117806A (en) * | 2018-08-22 | 2019-01-01 | 歌尔科技有限公司 | A kind of gesture identification method and device |
US12051236B2 (en) | 2018-09-21 | 2024-07-30 | Bigo Technology Pte. Ltd. | Method for recognizing video action, and device and storage medium thereof |
CN109325454A (en) * | 2018-09-28 | 2019-02-12 | 合肥工业大学 | A kind of static gesture real-time identification method based on YOLOv3 |
CN109325454B (en) * | 2018-09-28 | 2020-05-22 | 合肥工业大学 | Static gesture real-time recognition method based on YOLOv3 |
CN111104820A (en) * | 2018-10-25 | 2020-05-05 | 中车株洲电力机车研究所有限公司 | Gesture recognition method based on deep learning |
CN109697407A (en) * | 2018-11-13 | 2019-04-30 | 北京物灵智能科技有限公司 | A kind of image processing method and device |
CN109583456B (en) * | 2018-11-20 | 2023-04-28 | 西安电子科技大学 | Infrared surface target detection method based on feature fusion and dense connection |
CN109583456A (en) * | 2018-11-20 | 2019-04-05 | 西安电子科技大学 | Infrared surface object detection method based on Fusion Features and dense connection |
CN111382643B (en) * | 2018-12-30 | 2023-04-14 | 广州市百果园信息技术有限公司 | Gesture detection method, device, equipment and storage medium |
CN111382643A (en) * | 2018-12-30 | 2020-07-07 | 广州市百果园信息技术有限公司 | Gesture detection method, device, equipment and storage medium |
CN109815876A (en) * | 2019-01-17 | 2019-05-28 | 西安电子科技大学 | Gesture recognition method based on address event stream features |
CN109948480A (en) * | 2019-03-05 | 2019-06-28 | 中国电子科技集团公司第二十八研究所 | Non-maximum suppression method for arbitrary quadrilaterals |
CN109948690A (en) * | 2019-03-14 | 2019-06-28 | 西南交通大学 | High-speed rail scene perception method based on deep learning and structural information |
CN109934184A (en) * | 2019-03-19 | 2019-06-25 | 网易(杭州)网络有限公司 | Gesture recognition method and device, storage medium, and processor |
CN110135237B (en) * | 2019-03-24 | 2021-11-26 | 北京化工大学 | Gesture recognition method |
CN110135237A (en) * | 2019-03-24 | 2019-08-16 | 北京化工大学 | Gesture recognition method |
CN110135408B (en) * | 2019-03-26 | 2021-02-19 | 北京捷通华声科技股份有限公司 | Text image detection method, network and equipment |
CN110135408A (en) * | 2019-03-26 | 2019-08-16 | 北京捷通华声科技股份有限公司 | Text image detection method, network and equipment |
CN110287771A (en) * | 2019-05-10 | 2019-09-27 | 平安科技(深圳)有限公司 | Image palm area extracting method and device |
CN110135398A (en) * | 2019-05-28 | 2019-08-16 | 厦门瑞为信息技术有限公司 | Computer-vision-based method for detecting both hands off the steering wheel |
CN110348323B (en) * | 2019-06-19 | 2022-12-16 | 广东工业大学 | Wearable device gesture recognition method based on neural network optimization |
CN110348323A (en) * | 2019-06-19 | 2019-10-18 | 广东工业大学 | Wearable device gesture recognition method based on neural network optimization |
CN110363158B (en) * | 2019-07-17 | 2021-05-25 | 浙江大学 | Millimeter wave radar and visual cooperative target detection and identification method based on neural network |
CN110363158A (en) * | 2019-07-17 | 2019-10-22 | 浙江大学 | Millimeter-wave radar and vision cooperative target detection and recognition method based on neural networks |
CN110414402A (en) * | 2019-07-22 | 2019-11-05 | 北京达佳互联信息技术有限公司 | Gesture data labeling method and device, electronic device, and storage medium |
CN110414402B (en) * | 2019-07-22 | 2022-03-25 | 北京达佳互联信息技术有限公司 | Gesture data labeling method and device, electronic equipment and storage medium |
CN110458059A (en) * | 2019-07-30 | 2019-11-15 | 北京科技大学 | Gesture recognition method and device based on computer vision |
CN110458059B (en) * | 2019-07-30 | 2022-02-08 | 北京科技大学 | Gesture recognition method and device based on computer vision |
CN110796018A (en) * | 2019-09-30 | 2020-02-14 | 武汉科技大学 | Hand motion recognition method based on depth image and color image |
CN111061367B (en) * | 2019-12-05 | 2023-04-07 | 神思电子技术股份有限公司 | Method for realizing gesture mouse of self-service equipment |
CN111061367A (en) * | 2019-12-05 | 2020-04-24 | 神思电子技术股份有限公司 | Method for realizing gesture mouse of self-service equipment |
CN111050266A (en) * | 2019-12-20 | 2020-04-21 | 朱凤邹 | Method and system for performing function control based on earphone detection action |
CN111127457A (en) * | 2019-12-25 | 2020-05-08 | 上海找钢网信息科技股份有限公司 | Reinforcing steel bar number statistical model training method, statistical method, device and equipment |
CN111310800B (en) * | 2020-01-20 | 2023-10-10 | 天翼数字生活科技有限公司 | Image classification model generation method, device, computer equipment and storage medium |
CN111310800A (en) * | 2020-01-20 | 2020-06-19 | 世纪龙信息网络有限责任公司 | Image classification model generation method and device, computer equipment and storage medium |
CN111476084A (en) * | 2020-02-25 | 2020-07-31 | 福建师范大学 | Deep learning-based parking lot dynamic parking space condition identification method |
CN111275010A (en) * | 2020-02-25 | 2020-06-12 | 福建师范大学 | Pedestrian re-identification method based on computer vision |
CN111597888A (en) * | 2020-04-09 | 2020-08-28 | 上海容易网电子商务股份有限公司 | Gesture recognition method combining Gaussian mixture model and CNN |
CN111653103A (en) * | 2020-05-07 | 2020-09-11 | 浙江大华技术股份有限公司 | Target object identification method and device |
CN111639740A (en) * | 2020-05-09 | 2020-09-08 | 武汉工程大学 | Steel bar counting method based on multi-scale convolution neural network |
CN111880661A (en) * | 2020-07-31 | 2020-11-03 | Oppo广东移动通信有限公司 | Gesture recognition method and device |
CN112232282A (en) * | 2020-11-04 | 2021-01-15 | 苏州臻迪智能科技有限公司 | Gesture recognition method and device, storage medium and electronic equipment |
CN112416128A (en) * | 2020-11-23 | 2021-02-26 | 森思泰克河北科技有限公司 | Gesture recognition method and terminal equipment |
CN112487913A (en) * | 2020-11-24 | 2021-03-12 | 北京市地铁运营有限公司运营四分公司 | Labeling method and device based on neural network and electronic equipment |
WO2022120669A1 (en) * | 2020-12-10 | 2022-06-16 | 深圳市优必选科技股份有限公司 | Gesture recognition method, computer device and storage medium |
CN112464860A (en) * | 2020-12-10 | 2021-03-09 | 深圳市优必选科技股份有限公司 | Gesture recognition method and device, computer equipment and storage medium |
CN113312973B (en) * | 2021-04-25 | 2023-06-02 | 北京信息科技大学 | Gesture recognition key point feature extraction method and system |
CN113297956A (en) * | 2021-05-22 | 2021-08-24 | 温州大学 | Gesture recognition method and system based on vision |
CN113297956B (en) * | 2021-05-22 | 2023-12-08 | 温州大学 | Gesture recognition method and system based on vision |
CN113627265A (en) * | 2021-07-13 | 2021-11-09 | 深圳市创客火科技有限公司 | Unmanned aerial vehicle control method and device and computer readable storage medium |
CN114035687A (en) * | 2021-11-12 | 2022-02-11 | 郑州大学 | Gesture recognition method and system based on virtual reality |
CN114035687B (en) * | 2021-11-12 | 2023-07-25 | 郑州大学 | Gesture recognition method and system based on virtual reality |
CN114627561A (en) * | 2022-05-16 | 2022-06-14 | 南昌虚拟现实研究院股份有限公司 | Dynamic gesture recognition method and device, readable storage medium and electronic equipment |
CN117079493A (en) * | 2023-08-17 | 2023-11-17 | 深圳市盛世基业物联网有限公司 | Intelligent parking management method and system based on Internet of things |
CN117079493B (en) * | 2023-08-17 | 2024-03-19 | 深圳市盛世基业物联网有限公司 | Intelligent parking management method and system based on Internet of things |
Also Published As
Publication number | Publication date |
---|---|
CN107808143B (en) | 2021-06-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107808143A (en) | Dynamic gesture identification method based on computer vision | |
CN107168527B (en) | First-person-view gesture recognition and interaction method based on region convolutional neural networks | |
CN109597485B (en) | Gesture interaction system based on two-finger-region features and working method thereof | |
CN108052946A (en) | Automatic identification method for high-voltage cabinet switches based on convolutional neural networks | |
CN107742107A (en) | Face image classification method, device, and server | |
CN109902798A (en) | Training method and device for deep neural networks | |
CN106650687A (en) | Posture correction method based on depth information and skeleton information | |
CN110059741A (en) | Image-recognizing method based on semantic capsule converged network | |
CN106650630A (en) | Target tracking method and electronic equipment | |
CN107145845A (en) | The pedestrian detection method merged based on deep learning and multi-characteristic points | |
CN105160310A (en) | 3D (three-dimensional) convolutional neural network based human body behavior recognition method | |
CN105536205A (en) | Upper limb training system based on monocular video human body action sensing | |
CN108416266A (en) | Fast video behavior recognition method using optical flow to extract moving targets | |
CN105975934A (en) | Dynamic gesture recognition method and system for augmented-reality-assisted maintenance | |
CN109558902A (en) | Fast target detection method | |
CN109410168A (en) | Modeling method for determining a convolutional neural network model for classifying sub-blocks of an image | |
CN107808376A (en) | Hand-raising detection method based on deep learning | |
CN103186775A (en) | Human body motion recognition method based on mixed descriptor | |
CN109684959A (en) | Video gesture recognition method and device based on skin color detection and deep learning | |
CN106600595A (en) | Automatic human body dimension measurement method based on an artificial intelligence algorithm | |
CN105069745A (en) | Face-changing system and method based on a common image sensor and augmented reality technology | |
CN111178170B (en) | Gesture recognition method and electronic equipment | |
CN109740454A (en) | Human posture recognition method based on YOLO-V3 | |
CN105912126A (en) | Method for adaptively adjusting the interface-mapped gain of gesture movement | |
CN109614990A (en) | Object detection device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||