CN107808143B - Dynamic gesture recognition method based on computer vision - Google Patents


Info

Publication number
CN107808143B
Authority
CN
China
Prior art keywords
gesture
image
frame
target
prediction
Prior art date
Legal status
Active
Application number
CN201711102008.9A
Other languages
Chinese (zh)
Other versions
CN107808143A (en)
Inventor
王爽
焦李成
方帅
王若静
杨孟然
权豆
孙莉
侯彪
马晶晶
刘飞航
Current Assignee
Xidian Univ
Original Assignee
Xidian Univ
Priority date
Filing date
Publication date
Application filed by Xidian Univ
Priority to CN201711102008.9A
Publication of CN107808143A
Application granted
Publication of CN107808143B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06K: RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 9/00: Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K 9/00335: Recognising movements or behaviour, e.g. recognition of gestures, dynamic facial expressions; lip-reading
    • G06K 9/00355: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06K 9/62: Methods or arrangements for recognition using electronic means
    • G06K 9/6218: Clustering techniques
    • G06K 9/6277: Classification techniques based on a parametric (probabilistic) model
    • G06N 3/02: Computer systems based on biological models using neural network models
    • G06N 3/0454: Architectures using a combination of multiple neural nets

Abstract

The invention discloses a dynamic gesture recognition method based on computer vision, which solves the problem of dynamically recognizing gestures against a complex background. The method is implemented in the following steps: acquire a gesture data set and label it manually; cluster the real frames of the labeled image set to obtain prior frames for training; construct an end-to-end convolutional neural network that can simultaneously predict the position, size and category of a target; train the network to obtain its weights; load the weights into the network and input a gesture image for recognition; process the obtained position coordinates and category information with a non-maximum suppression method to obtain the final recognition result image; and record the recognition information in real time to obtain a dynamic gesture interpretation result. The invention overcomes the defect of the prior art, in which hand detection and category identification in gesture recognition are carried out step by step; it greatly simplifies the gesture recognition process, improves recognition accuracy and speed, enhances the robustness of the recognition system, and realizes dynamic gesture interpretation.

Description

Dynamic gesture recognition method based on computer vision
Technical Field
The invention belongs to the technical field of image processing, and further relates to an image target recognition technology, in particular to a dynamic gesture recognition method based on computer vision. The gesture recognition method can be used for position detection and state recognition of gestures in the image, so that more accurate information can be provided for subsequent sign language translation, game interaction and other applications of gesture recognition.
Background
In recent years, with the development of related disciplines such as computer vision and machine learning, human-computer interaction has gradually shifted from being computer-centered to being human-centered. Natural user interfaces that use the human body as the communication medium, including face recognition, gesture recognition and posture recognition, offer operators a more intuitive and comfortable interactive experience. Gestures, as a natural and intuitive mode of everyday communication, have promising applications: controlling intelligent equipment in virtual reality with specified gestures; translating sign language to solve the communication problem of deaf-mute people; and automatically recognizing traffic-police gestures in driverless vehicles. Gesture recognition therefore has important research value and significance.
Gesture recognition focuses mainly on two approaches: recognition based on sensing devices (such as a data glove or a position tracker), and recognition based on vision. Vision-based gesture recognition has attracted most of the research attention because it lets operators interact with machines in a more natural way and with greater flexibility. At present, most gesture recognition is based on detecting and recognizing gestures in an image using a two-step method: first detect the hand position, then determine the gesture type.
A method based on hand detection and shape detection is proposed in the paper "Real-Time Hand Gesture Recognition Using Finger Segmentation" (The Scientific World Journal, 2014(3):267872) published by Zhi-hua Chen et al. The method first extracts the hand area with background subtraction and binarizes it, then segments the fingers and the palm, and classifies the gesture against the original 13 templates using the number and content of the fingers (content referring to the finger names, such as thumb, index finger and middle finger). However, this method imposes strict requirements on the image background: the hand can be segmented only against a plain background. In addition, the gestures it recognizes are limited in shape, its robustness is poor, and it is difficult to generalize.
An algorithm based on hand detection and CNN recognition is proposed in the paper "A Real-time Hand Gesture Recognition and Human-Computer Interaction System" (In CVPR, IEEE, 2017) published by Pei Xu. The method uses basic image-processing operations such as filtering and morphology to obtain a binary image containing only the hands, which is then fed into the convolutional neural network LeNet for feature extraction and identification to improve accuracy. However, the method needs to preprocess the image and places high demands on the background colour, and the detection and recognition of the gesture are performed in two steps (first locating the gesture, then classifying it to obtain its state), so the recognition procedure is cumbersome and time-consuming.
Disclosure of Invention
The invention aims to provide a dynamic gesture recognition method based on computer vision that addresses the defects of the prior art with higher accuracy and efficiency.
The invention relates to a dynamic gesture recognition method based on computer vision, which is characterized by comprising the following steps of:
(1) acquiring a gesture image: dividing the acquired gesture images into a training set and a testing set, and manually labeling the gestures in the training set and the testing set respectively to obtain the category and coordinate data of a real data frame;
(2) clustering to obtain a prior frame: clustering the manually marked real data frames, and taking the overlapping degree of the areas of the frames as loss measurement to obtain a plurality of preliminary test prior frames;
(3) constructing an end-to-end convolutional neural network capable of simultaneously predicting the position, size and category of a target gesture: constructing an end-to-end convolutional neural network by using an improved GoogLeNet network as a network framework and simultaneously constraining loss functions of target positions and classes;
(4) training the end-to-end network:
(4a) reading in gesture images of training set samples in batch;
(4b) randomly scaling the image with bilinear interpolation, the target size being chosen as a multiple of 32, to obtain a scaled version of the read-in gesture image;
(4c) scaling the image again with bilinear interpolation to a fixed size, to obtain an image that can be input to the convolutional network;
(4d) training the convolutional neural network constructed in the step (3) by using the fixed-size image obtained in the step (4c) to obtain the weight corresponding to the constructed convolutional neural network;
(5) loading weight: loading the weights corresponding to the convolutional neural network obtained in the step (4d) into the convolutional neural network constructed in the step (3);
(6) predicting the location and category of the gesture: reading a gesture image to be recognized, inputting the gesture image into a convolutional neural network loaded with weights for recognition, and simultaneously obtaining position coordinates and category information of the gesture target to be recognized;
(7) removing redundant prediction frames: processing the obtained position coordinates and category information with a non-maximum suppression method to obtain the final prediction frame:
(7a) sorting the scores of all prediction frames in descending order and selecting the highest score and its corresponding frame;
(7b) traversing the remaining frames and deleting any frame whose overlap (IOU) with the current highest-scoring frame exceeds a set threshold;
(7c) selecting the unprocessed frame with the highest score and repeating the process, i.e. executing (7a) to (7c), to obtain the retained prediction-frame data;
(8) visualization of prediction results: mapping the prediction frame data to an original image, drawing a prediction frame in the original image and marking a category label to which a gesture target belongs;
(9) recording and analysis: recording the category and position information of the gesture in real time, analyzing the obtained real-time data, interpreting the dynamic gesture, and directly displaying the interpreted result on a screen.
The invention utilizes the deep convolutional neural network to identify the gesture end to end, not only can identify the dynamic gesture in real time, but also can keep higher accuracy under a complex background.
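The redundant-frame removal of step (7) is standard non-maximum suppression. The following is an illustrative Python sketch, not code from the patent; the (x1, y1, x2, y2) frame format and the 0.5 threshold are assumptions chosen for the example:

```python
def iou(a, b):
    """Intersection-over-union of two frames given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring frame, drop overlapping frames, repeat (7a)-(7c)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # (7a) highest remaining score
        keep.append(best)
        order = [i for i in order    # (7b) drop frames overlapping the best one
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For instance, two heavily overlapping frames on the same hand collapse to the single higher-confidence frame, while a frame on a second hand elsewhere in the image is retained.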
Compared with the prior art, the invention has the following advantages:
1. the method uses a convolutional neural network to recognize gestures, completing position detection and recognition of the gesture target in the image in one step. The steps are simple and recognition is fast, overcoming the prior-art defect that real-time performance cannot be guaranteed when hand detection and gesture recognition are handled as two separate steps. At the same time, the network extracts the features of the gesture image well: it recognizes gestures at any angle with high accuracy and places no requirement on the image background, so gestures can be recognized accurately even against a complex background, overcoming the prior art's restriction to simple image backgrounds;
2. the invention randomly scales the gesture images when training the convolutional neural network, so the input size can change every few iterations. Every 10 batches the network randomly selects a new picture size, which lets it achieve a good prediction effect on different input sizes; the same network can therefore run detection at different resolutions, with stronger robustness and generalization.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of natural scene gestures used in simulation experiments with the present invention;
FIG. 3 is a diagram of gesture target recognition results obtained in a simulation experiment;
FIG. 4 shows the recognition result for a dynamic gesture according to the present invention, in which FIG. 4(a) is one frame of the dynamic sign-language gesture meaning "object" and FIG. 4(b) is one frame of the corresponding detection result;
FIG. 5 is a record of the coordinates of the gesture center point for the dynamic gesture recognition process.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
Example 1
Gestures are a natural and intuitive mode of communication with promising applications: controlling intelligent equipment in virtual reality with specified gestures; translating sign language to solve the communication problem of deaf-mute people; automatically recognizing traffic-police gestures in driverless vehicles; and so on. At present, vision-based gesture recognition generally adopts the traditional approach of first segmenting the gesture and then classifying it. This approach places high demands on photo quality and struggles with gestures against a complex background, which limits the development of gesture recognition applications. The invention addresses this situation with research and innovation, providing a dynamic gesture recognition method based on computer vision; the method is shown in figure 1 and comprises the following steps:
(1) acquiring gesture images: divide the acquired gesture images into a training set, used to train the convolutional neural network, and a test set, used to calculate the recognition accuracy of the network. Manually label the gesture in each acquired image to obtain the size and centre-point coordinates of the rectangular frame that fits the gesture most closely, together with the category of the corresponding gesture; this yields the category and coordinate data of the real data frames.
(2) Clustering to obtain prior frames: select the number of cluster centres and cluster the manually labeled real data frames, using the degree of overlap between frame areas as the loss metric, to obtain several initial prior frames. In this example the number of cluster centres is set to 9; clustering with the overlap degree as the loss metric yields 9 prior frames, which serve as the initial prediction frames of the convolutional neural network and shorten its convergence time. In general, the number of cluster centres depends on how densely packed the objects in the pictures are: the more objects per picture, the larger the number of cluster centres should be set.
(3) Constructing an end-to-end convolutional neural network that simultaneously predicts the position, size and category of a target gesture: take the improved GoogLeNet network as the framework and pair it with a loss function that simultaneously constrains the position, size and category of the target. Because the constructed network is trained with this joint loss, it predicts the position, size and category of the target in a single pass. The network is computationally inexpensive and easy to converge, and is able to classify 9000 target categories on the ImageNet dataset.
(4) Training the end-to-end convolutional neural network: to enhance the robustness of the network to image size, the gesture images are scaled twice after being read in batches. The first scaling takes the originally input gesture image to a random size; the second scaling takes that randomly sized image to a specified fixed size. The gesture images scaled to the specified size are then input into the convolutional neural network for training to obtain the trained weights, specifically as follows:
(4a) reading in gesture images of training set samples in batch;
(4b) Randomly scale the read-in gesture images by bilinear interpolation so that the scaled size is a multiple of 32, obtaining the scaled read-in images. The purpose is to increase the scale diversity of the data, enhance the robustness of the network, and thereby improve recognition accuracy.
(4c) Scale the input image by bilinear interpolation to a fixed size, obtaining an image that can be input into the convolutional network; the fixed size is 672 × 672 in this example. The fixed size an image is scaled to is tied to the structure of the convolutional neural network.
(4d) Train the convolutional neural network constructed in step (3) with the fixed-size images obtained in step (4c) to obtain the weights of the constructed convolutional neural network.
(5) Loading the weights: load the network weights obtained in step (4d) into the convolutional neural network constructed in step (3); the weights are the network parameters required for prediction.
(6) Predicting the location and category of the gesture: read in the gesture image to be recognized; the network first scales the input image to the size given in (4c), then inputs it into the network loaded with the weights for recognition, simultaneously obtaining the position coordinates, size and category information of the recognized gesture target.
(7) Removing redundant prediction frames: apply non-maximum suppression to the position coordinates and category information of the gesture obtained from the image, yielding the final prediction frame. The prediction step can produce several recognition frames for the same target; the non-maximum suppression algorithm removes the redundant frames and retains the data of the single frame with the highest confidence, as follows:
(7a) sort the confidence scores of all frames in descending order and select the frame with the highest confidence score;
(7b) traverse the remaining frames; if a frame's intersection-over-union (IOU) with the highest-confidence frame exceeds a set threshold, delete it;
(7c) continuing to select one of the unprocessed frames with the highest score, and repeating the processes, namely executing (7a) to (7c) to obtain the reserved prediction frame data; the data of the prediction box includes the position, size, category of the box.
(8) Visualization of prediction results: the coordinate data and size of the predicted recognition frame are expressed at the fixed scale of (4c). Map the prediction-frame data from that fixed scale back to the original image size, i.e. the size of the gesture image to be recognized, draw the prediction frame in the original image, and mark the category label of the gesture target.
(9) Recording and analysis: the invention needs only 0.02 seconds to recognize a single image, which meets the requirement of real-time gesture recognition. A camera is accessed through OpenCV, the trained convolutional neural network records the category and position information of the gesture in real time, the resulting real-time data are analyzed to interpret the dynamic gesture, and the interpreted result is displayed directly on the screen.
According to the method, an end-to-end convolutional neural network is constructed by using a loss function which simultaneously restrains the position and the type of the target, and the position, the size and the type of the target are simultaneously predicted, so that the gesture recognition step is simplified, and the recognition rate is improved; in the training stage, the gesture image to be recognized is randomly zoomed and sent to the convolutional neural network for training, so that the robustness of the network is enhanced, and the recognition accuracy is improved.
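The mapping of step (8), from the fixed network-input scale back to the original image, can be sketched as follows. This is an illustrative Python sketch; the (cx, cy, w, h) frame format and the independent per-axis scale factors are assumptions, not details from the patent:

```python
def map_frame_to_original(frame, net_size, orig_w, orig_h):
    """Map a predicted frame (cx, cy, w, h) from the fixed net_size x net_size
    network input resolution back to the original image resolution."""
    sx = orig_w / float(net_size)   # horizontal scale factor
    sy = orig_h / float(net_size)   # vertical scale factor
    cx, cy, w, h = frame
    return (cx * sx, cy * sy, w * sx, h * sy)
```

For a 1344 × 672 original image and a 672 × 672 network input, a centred 100 × 50 prediction maps back to a 200 × 50 frame at the centre of the original image.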
Example 2
Similar to embodiment 1, the dynamic gesture recognition method based on computer vision includes the following steps:
(2a) reading the manually marked real frame data of the training set and the test set samples;
(2b) setting the number of clustering centers, clustering by adopting a k-means clustering algorithm according to loss measurement d (box, centroid) of the following formula to obtain a prior frame:
d(box,centroid)=1-IOU(box,centroid)
where centroid denotes a randomly selected cluster-centre frame, box denotes any other real frame, and IOU(box, centroid) denotes the similarity between that frame and the centre frame, i.e. the proportion of their overlapping area, calculated as the intersection of the centre frame and the other frame divided by their union.
The invention can obtain a plurality of prior frames which are most representative of the manually collected real frames through clustering, wherein the prior frames are initial test frames of neural network prediction. The determination of the prior frame can reduce the prediction range of the convolutional neural network and accelerate the convergence of the network.
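The prior-frame clustering of this example can be sketched in Python as follows. This is an illustrative sketch, not the patent's code; the (w, h) frame representation, the centre-aligned IOU, and the toy frame sizes are assumptions chosen for the example:

```python
import random

def iou_wh(box, centroid):
    """IOU of two frames (w, h) aligned at a common centre."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_priors(boxes, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) frames with d = 1 - IOU to get k prior frames."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        # assignment step: minimal distance d = maximal IOU with a centroid
        clusters = [[] for _ in range(k)]
        for b in boxes:
            j = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[j].append(b)
        # update step: mean width/height of each cluster (keep empty clusters as-is)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids
```

Clustering six labeled frames of two distinct scales with k = 2, for example, recovers one small and one large prior frame.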
Example 3
The dynamic gesture recognition method based on computer vision is the same as the embodiment 1-2, and the method for constructing the convolutional neural network in the step (3) comprises the following steps:
(3a) based on the GoogLeNet convolutional neural network, a convolutional neural network containing G convolutional layers and 5 pooling layers was constructed using simple 1 × 1 and 3 × 3 convolution kernels; in this example G is 25.
(3b) Training the constructed convolutional network according to the loss function of the following formula:
L = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i - x̂_i)² + (y_i - ŷ_i)²]
  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i - √ŵ_i)² + (√h_i - √ĥ_i)²]
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i - Ĉ_i)²
  + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i - Ĉ_i)²
  + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c=1}^{C} (p_i(c) - p̂_i(c))²
wherein the first term of the loss function is the coordinate loss of the centre point of the predicted target frame; λ_coord is the coordinate loss coefficient, with 1 ≤ λ_coord ≤ 5 (3 in this example), chosen to keep the predicted gesture position accurate; S² is the number of grid cells the picture is divided into and B is the number of prediction frames per cell; 1_{ij}^{obj} indicates whether the j-th prediction frame in the i-th cell is responsible for predicting the target when a target is present; (x_i, y_i) are the centre-point coordinates of the target's real frame and (x̂_i, ŷ_i) those of the prediction frame. The second term is the width-height loss of the prediction frame, where (w_i, h_i) are the width and height of the real frame and (ŵ_i, ĥ_i) those of the prediction frame. The third and fourth terms are the loss on the probability that a frame contains the target; λ_noobj is the loss coefficient when no target is contained, with 0.1 ≤ λ_noobj ≤ 1 (1 in this example), chosen so that the convolutional neural network can distinguish target frames from background frames; 1_{ij}^{noobj} indicates whether the j-th prediction frame in the i-th cell is responsible when no target is contained; C_i is the true probability of containing the target and Ĉ_i the predicted probability. The fifth term is the predicted class-probability loss; 1_i^{obj} indicates that the i-th cell contains a target centre point; p_i(c) is the true target class, p̂_i(c) the predicted class, and C the number of classes.
The position detection and the category identification of the gesture are completed in one step in the embodiment of the invention. The method comprises the steps of extracting features of an original gesture image by adopting a convolutional neural network, and then training the network by reducing position loss and category loss to enable the network to identify gesture types while detecting gesture positions.
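As a sketch of how the five loss terms combine, the following illustrative Python reduces the loss to a single grid cell with one responsible prediction frame. The dict-based interface and the square-root width-height form are assumptions made for illustration; the full loss sums such terms over all S² cells and B frames:

```python
from math import sqrt

def cell_loss(truth, pred, lambda_coord=3.0, lambda_noobj=1.0):
    """Five-term loss for one grid cell / one responsible prediction frame.

    truth and pred are dicts with centre (x, y), size (w, h), the
    target-containment probability conf, and a class-probability list p.
    """
    # centre-point coordinate loss (first term), weighted by lambda_coord
    coord = lambda_coord * ((truth['x'] - pred['x']) ** 2
                            + (truth['y'] - pred['y']) ** 2)
    # width-height loss (second term); square roots damp the effect of large frames
    size = lambda_coord * ((sqrt(truth['w']) - sqrt(pred['w'])) ** 2
                           + (sqrt(truth['h']) - sqrt(pred['h'])) ** 2)
    # containment-probability loss (third/fourth terms): plain weight when the
    # cell holds a target, lambda_noobj when it does not
    conf_weight = 1.0 if truth['conf'] > 0 else lambda_noobj
    conf = conf_weight * (truth['conf'] - pred['conf']) ** 2
    # class-probability loss (fifth term)
    cls = sum((t - q) ** 2 for t, q in zip(truth['p'], pred['p']))
    return coord + size + conf + cls
```

A perfect prediction gives zero loss; shifting only the predicted centre increases just the first term, scaled by lambda_coord.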
Example 4
The dynamic gesture recognition method based on computer vision is the same as in embodiments 1-3. In step (4b) the image is randomly scaled by bilinear interpolation, with the size chosen as a multiple of 32, to obtain the scaled input image; this proceeds as follows:
4b 1: a gesture image to be recognized is read in.
4b 2: the image is randomly scaled by bilinear interpolation, with the size chosen as a multiple of 32, to obtain the scaled input image.
Referring to fig. 2, which shows the gesture images to be processed that are input in this embodiment of the present invention, the pixel range of the gesture images is [600-.
The invention randomly scales the gesture images when training the convolutional neural network in order to increase the network's robustness to image size. The algorithm picks a new random scale every 10 batches, so that the network achieves a good prediction effect across different input sizes; the same network can thus detect gesture images at different resolutions, with stronger robustness and generalization.
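The two resizing operations of this example can be sketched in pure Python. The bilinear formula is the standard one; the 320-608 multi-scale range is an assumption for illustration (the fixed sizes actually used in the examples are 672 × 672 and 608 × 608):

```python
import random

def random_training_size(lo=320, hi=608, step=32, rng=None):
    """Pick a random network input size that is a multiple of 32."""
    rng = rng or random.Random()
    return rng.randrange(lo, hi + 1, step)

def bilinear_resize(img, new_h, new_w):
    """Resize a 2-D list of grey values with bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        # map the output row back into source coordinates
        y = i * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y); y1 = min(y0 + 1, h - 1); fy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x); x1 = min(x0 + 1, w - 1); fx = x - x0
            # weighted average of the four surrounding source pixels
            row.append(img[y0][x0] * (1 - fy) * (1 - fx)
                       + img[y0][x1] * (1 - fy) * fx
                       + img[y1][x0] * fy * (1 - fx)
                       + img[y1][x1] * fy * fx)
        out.append(row)
    return out
```

Upscaling a 2 × 2 gradient to 3 × 3, for example, fills in the intermediate values linearly.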
Example 5
The dynamic gesture recognition method based on computer vision is the same as embodiments 1-4. Referring to fig. 1, the specific implementation steps include:
step 1: gather gesture image, shoot gesture image with the camera, including: "stone", "scissors", "cloth", "stick", "OK", "love", etc., see fig. 2(a) - (f). Fig. 2(a) is a front and back fist making gesture, fig. 2(b) is a front and back "scissors" gesture, fig. 2(c) is a front and back palm gesture, fig. 2(d) is a tree thumb gesture, fig. 2(e) is an "OK" gesture, and fig. 2(f) is a "love heart" gesture. Each gesture image also contains some complex backgrounds, and the same gesture has various rotation angles. And dividing the acquired gesture images into a training set and a testing set, and manually labeling the gestures in the acquired gesture images to obtain the category and coordinate data of the real frame.
The collected natural-scene gesture image set contains 2500 images; in this example six representative gestures were selected, divided into a training set of 2000 images and a test set of 500, see fig. 2. The image set was shot with a 12-megapixel mobile-phone camera, and the captured images were screened and manually labeled.
Step 2: and clustering to obtain a prior frame.
Real box data of training set and test set samples are read.
In this embodiment, the real frame of the training set and the test set sample is the coordinate and the category information of the manually labeled target frame in the image.
Clustering by adopting a k-means clustering algorithm according to loss measurement d (box, centroid) of the following formula to obtain a prior frame:
d(box,centroid)=1-IOU(box,centroid)
where centroid denotes a randomly selected cluster-centre frame, box denotes any other real frame, and IOU(box, centroid) denotes the degree of similarity between the two, calculated as the intersection of the two frames divided by their union.
The number of cluster-centre frames selected in this example is 5, and IOU(box, centroid) is calculated according to the following formula:
IOU(box, centroid) = |box ∩ centroid| / |box ∪ centroid|
where ∩ denotes the intersection area of the centroid and box frames, and ∪ denotes their union area.
Step 3: construct the convolutional neural network.
Based on the GoogLeNet convolutional neural network, a convolutional neural network containing G convolutional layers and 5 pooling layers was constructed using simple 1 × 1 and 3 × 3 convolution kernels; in this example G is 23.
The constructed convolutional network is trained with the same loss function as in step (3b) of Example 3, where λ_coord, the coordinate loss coefficient of the first term (the centre-point coordinate loss of the predicted target frame), is set to 5 in this example, and λ_noobj, the loss coefficient of the third and fourth terms (the probability loss for frames not containing a target), is set to 0.5 in this example.
Even for the same gesture, different shooting angles yield different images. Existing methods struggle to recognize different angles of the same gesture stably, but the convolutional neural network constructed by the invention overcomes the difficulty that one gesture appears at multiple rotation angles, giving good stability in gesture recognition.
Step 4: training the network.
Gesture images of the training set samples are read in batches. In this embodiment, the network reads 64 training set images per batch.
Each image is randomly scaled by bilinear interpolation, with the scaled gesture image size chosen as a multiple of 32, giving the scaled input image.
As shown in fig. 2, the pixel size of the gesture images lies in the range [500–…].
The scaled gesture image is then resized again by bilinear interpolation to a fixed size, giving an image that can be input into the convolutional network. In this example, the gesture image is scaled to a fixed size of 608 × 608.
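The two-stage scaling just described can be sketched as below: first a random scale to a side length that is a multiple of 32, then a second scale to the fixed 608 × 608 network input. The sketch assumes square outputs and a random side range of 320–608 pixels; the range and the names `bilinear_resize` and `two_stage_resize` are assumptions for illustration.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear interpolation for an (H, W) or (H, W, C) numpy image."""
    h, w = img.shape[:2]
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    if img.ndim == 3:          # broadcast weights over the channel axis
        wy = wy[..., None]; wx = wx[..., None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def two_stage_resize(img, rng, lo=320, hi=608, fixed=608):
    # first scaling: random side length that is a multiple of 32
    side = 32 * rng.integers(lo // 32, hi // 32 + 1)
    intermediate = bilinear_resize(img, side, side)
    # second scaling: to the fixed network input size (608 x 608 here)
    return bilinear_resize(intermediate, fixed, fixed)
```

The random first stage exposes the network to many input scales during training, which is the stated purpose of the double scaling.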
The fixed-size gesture images are input into the constructed convolutional neural network for training to obtain the network weights; the weights are the parameters of the convolutional neural network and are used at test time. The network is trained with the training set samples for 20,000 iterations to obtain the weights, completing training.
Step 5: the network weights, namely the parameters obtained in step 4, are loaded into the convolutional neural network constructed in step 3 in preparation for testing.
Step 6: the gesture images to be recognized in the test set are read in and input into the weight-loaded network for recognition, obtaining the size, position coordinates and category information of the recognized gesture targets; see fig. 3, where figs. 3(a)–(f) are the recognition results corresponding to figs. 2(a)–(f).
Step 7: the obtained position and category information is processed by a non-maximum suppression method to obtain the final prediction frames.
All prediction frames are sorted in descending order of confidence score, and the highest score and its corresponding frame are selected;
the remaining prediction frames are traversed, and any frame whose IOU (intersection over union) with the highest-scoring frame is larger than a certain threshold is deleted;
the frame with the highest score among the unprocessed frames is then selected and the above process repeated, yielding the retained prediction frame data.
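The suppression procedure of step 7 can be sketched as a minimal numpy implementation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the helper names `iou` and `nms` and the 0.45 threshold are illustrative, not from the patent.

```python
import numpy as np

def iou(box, boxes):
    # intersection over union of one (x1, y1, x2, y2) box against many
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh=0.45):
    """Greedy non-maximum suppression; returns indices of the kept frames."""
    order = np.argsort(scores)[::-1]   # descending by confidence score
    keep = []
    while order.size > 0:
        best = order[0]                # current highest-scoring frame
        keep.append(int(best))
        rest = order[1:]
        # drop remaining frames overlapping the best one above the threshold
        order = rest[iou(boxes[best], boxes[rest]) <= thresh]
    return keep
```

The kept indices correspond to the final prediction frames that are mapped back onto the original image in step 8.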
Step 8: the prediction frame data is mapped back onto the original image to obtain the category and position information of the gesture; the prediction frame is drawn in the original image and labeled with the category of the target, see figs. 3(a)–3(f), where the label at the upper left corner of each prediction frame is the predicted gesture category.
Step 9: the category and position information of the gesture is recorded in real time (see fig. 4), the resulting real-time data is analyzed to interpret the dynamic gesture, and the interpretation result is displayed directly on the screen, see table 1.
TABLE 1 dynamic gesture recognition real-time detection results
Predicted gesture center abscissa    Predicted gesture center ordinate    Gesture class
1164 371 Scissor
318 372 Scissor
1152 373 Scissor
364 384 Scissor
1097 380 Scissor
388 388 Scissor
1061 381 Scissor
1027 383 Scissor
430 409 Scissor
452 395 Scissor
1001 380 Scissor
465 397 Scissor
989 381 Scissor
510 395 Scissor
960 381 Scissor
524 392 Scissor
951 384 Scissor
557 395 Scissor
918 394 Scissor
561 396 Scissor
The data in table 1 are part of the log recorded by the invention for the dynamic process, shown in fig. 4, of two gestures moving horizontally inward from the two sides. Fig. 4(a) shows one frame of the dynamic gesture whose sign-language meaning is "object", and fig. 4(b) shows one frame of the detection result for that dynamic gesture process. Analysis of the data in table 1 shows that the gesture remains "scissors" throughout. Visualizing the coordinate data of table 1 gives fig. 5, in which the abscissa and ordinate of each point are the abscissa and ordinate of the gesture center point in the current frame image; the points record the coordinates of the two "scissors" gestures moving dynamically from the outside inward. As can be seen from fig. 5, the ordinate of the dynamic gesture center point is essentially unchanged while the abscissa changes greatly, indicating that the process is two "scissors" gestures approaching each other horizontally, corresponding to the sign-language meaning "object", see fig. 4.
In the embodiment of the invention, the motion of the gesture is judged by computing the distribution histogram of the motion trajectory, and the meaning expressed by the gesture over the whole dynamic process is judged by combining the change of gesture state during the motion, covering both static gesture recognition and dynamic gesture interpretation.
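As a toy illustration of this kind of trajectory analysis, the sketch below classifies a logged track of predicted center points, assuming a 1280 × 720 image: the ordinate histogram must stay concentrated in one bin (no vertical motion) while the two hands' abscissae converge. The decision rule and the name `interpret_track` are simplifications for illustration, not the patent's exact histogram method.

```python
import numpy as np

def interpret_track(centers, img_w=1280, img_h=720):
    """Classify a two-hand track as 'horizontal approach' or 'other'.

    centers: sequence of per-frame predicted gesture center points (x, y),
    with the two hands' detections interleaved as they are logged.
    """
    centers = np.asarray(centers, float)
    # split the interleaved log into left-hand and right-hand tracks
    left = centers[centers[:, 0] < img_w / 2]
    right = centers[centers[:, 0] >= img_w / 2]
    # ordinate histogram concentrated in one bin -> essentially no vertical motion
    y_hist, _ = np.histogram(centers[:, 1], bins=10, range=(0, img_h))
    vertical_still = y_hist.max() == len(centers)
    # the left hand's abscissa grows while the right hand's shrinks
    approaching = (np.all(np.diff(left[:, 0]) > 0) and
                   np.all(np.diff(right[:, 0]) < 0))
    return "horizontal approach" if vertical_still and approaching else "other"
```

Applied to the table 1 log, this rule reports a horizontal approach, matching the "object" interpretation of fig. 4.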
The technical effects of the present invention will be described with reference to the simulation.
Example 6
The dynamic gesture recognition method based on computer vision is the same as embodiments 1-5.
Simulation experiment conditions are as follows:
The hardware platform of the simulation experiments of the invention is a Dell computer with an Intel(R) Core i5 processor, a main frequency of 3.20 GHz and 64 GB of memory; the simulation software platform is Visual Studio 2015.
Simulation experiment content and result analysis:
the simulation experiment of the invention is divided into two simulation experiments.
The position coordinates and the category data of the collected data set are manually marked and made into a PASCAL VOC format data set, wherein 80% of the data set is used as a training set sample, and 20% of the data set is used as a test set sample.
Simulation experiment 1: comparison of the invention with the prior art. The invention, the prior-art method based on hand detection and shape detection, and the prior-art method based on hand detection and CNN recognition are each trained with the same training set samples, and all methods are then evaluated with the same test set samples. The results are shown in table 2, where Alg1 denotes the method of the invention, Alg2 the method based on hand detection and shape detection, and Alg3 the method based on hand detection and CNN recognition.
TABLE 2 test set accuracy of three methods simulation experiment
Test image           Alg1   Alg2   Alg3
Accuracy (%)         98.0   31.3   78.6
Time per image (s)   0.02   0.13   0.94
As can be seen from table 2, compared with the method based on hand detection and shape detection and the method based on hand detection and CNN recognition, the gesture recognition accuracy of the invention has a clear advantage, improving the recognition rate by roughly 67 and 20 percentage points respectively, and its recognition speed is roughly 6 and 47 times faster than the other two methods. The recognition rate of the invention is higher than those of the other two algorithms because it maintains a very high recognition rate under complex backgrounds and across gesture angles. Its recognition speed is higher because the invention constructs an end-to-end convolutional neural network that predicts the position and category of the gesture simultaneously, without a separate two-step procedure. The simulation results show that the method achieves high recognition rate and high speed when recognizing gesture targets, particularly under complex backgrounds.
Example 7
The dynamic gesture recognition method based on computer vision is the same as the embodiments 1-5, and the simulation conditions and contents are the same as the embodiment 6.
Simulation experiment 2: using the method of the invention, different input image scaling sizes are used as the network input on the test set; the test evaluation results are shown in table 3.
TABLE 3 recognition results for different network input sizes
As can be seen from table 3, once the input image is scaled above a certain size the target recognition accuracy no longer changes significantly, so weighing recognition accuracy against recognition speed, the 608 × 608 fixed-size gesture image is chosen as the optimal input size of the convolutional neural network.
The dynamic gesture recognition method based on computer vision provided by the invention can obtain better recognition accuracy for gesture target recognition, and can perform real-time gesture recognition.
In summary, the invention discloses a dynamic gesture recognition method based on computer vision that solves the problem of dynamically recognizing gestures against complex backgrounds. The method comprises: collecting a gesture data set and labeling it manually; clustering the real frames of the labeled image set to obtain the prior frames for training; constructing an end-to-end convolutional neural network that simultaneously predicts the position, size and category of a target; training the network to obtain weights; loading the weights into the network; inputting gesture images for recognition; processing the results by non-maximum suppression to obtain the position coordinates and category information; obtaining the final recognition result image; and recording the recognition information in real time to obtain the dynamic gesture interpretation result. The invention overcomes the drawback of prior-art gesture recognition that hand detection and category recognition are performed in separate steps, greatly simplifies the gesture recognition process, improves recognition accuracy and speed, enhances the robustness of the recognition system, and realizes dynamic gesture interpretation. The method can be applied to human-computer interaction in virtual reality, sign language translation, automatic recognition of traffic-police gestures by driverless vehicles, and other fields.

Claims (4)

1. A dynamic gesture recognition method based on computer vision is characterized by comprising the following steps:
(1) acquiring a gesture image: dividing the acquired gesture images into a training set and a testing set, and manually labeling the gestures in the training set and the testing set respectively to obtain the category and coordinate data of a real data frame;
(2) clustering to obtain a prior frame: clustering the manually marked real data frames, and taking the overlapping degree of the areas of the frames as loss measurement to obtain a plurality of preliminary test prior frames;
(3) constructing an end-to-end convolutional neural network capable of simultaneously predicting the position, size and category of a target gesture: constructing an end-to-end convolutional neural network by using an improved GoogLeNet network as a network framework and simultaneously constraining loss functions of target positions and classes;
(4) training an end-to-end convolutional neural network: in order to enhance the robustness of the convolutional neural network to image size, after gesture images are read in batches, the read gesture images are scaled twice: the first time, the originally input gesture image is randomly scaled to an arbitrary size; the second time, the arbitrarily sized scaled image is scaled again to a specified size; finally, the gesture image scaled to the specified size is input into the convolutional neural network for training to obtain the training weights, which specifically comprises the following steps:
(4a) reading in gesture images of training set samples in batch;
(4b) randomly zooming the image by a bilinear interpolation method, wherein the size is selected to be a multiple of 32, and obtaining a zoomed read-in gesture image;
(4c) carrying out size scaling on the scaled gesture image obtained in the step 4(b) again by adopting a bilinear interpolation method, and scaling to a fixed size to obtain an image which can be input into a convolution network;
(4d) training the convolutional neural network constructed in the step (3) by using the fixed-size image obtained in the step (4c) to obtain the weight corresponding to the convolutional neural network;
(5) loading weight: loading the network weight obtained in the step (4d) into the convolutional neural network constructed in the step (3);
(6) predicting the location and category of the gesture: reading a gesture image to be recognized, inputting the gesture image into a network loaded with weights for recognition, and simultaneously obtaining position coordinates and category information of gesture target recognition;
(7) removing redundant prediction blocks: processing the obtained position coordinates and the category information by adopting a non-maximum value inhibition method to obtain a final prediction frame:
(7a) sorting all prediction frames in descending order of score, and selecting the highest score and the frame corresponding to it;
(7b) traversing the remaining frames, and deleting any frame whose overlap IOU with the current highest-scoring frame is larger than a certain threshold;
(7c) continuing to select the frame with the highest score among the unprocessed frames, and repeating the above process, namely (7a) to (7c), to obtain the retained prediction frame data;
(8) visualization of prediction results: mapping the prediction frame data to an original image, drawing a prediction frame in the original image and marking a category label to which a gesture target belongs;
(9) recording and analysis: recording the category and position information of the gesture in real time, analyzing the obtained real-time data, interpreting the dynamic gesture, and directly displaying the interpreted result on a screen.
2. The method according to claim 1, wherein the step (2) of clustering the manually labeled real data frames comprises the following steps:
(2a) reading real frame data of a gesture image training set and a test set sample;
(2b) clustering by adopting a k-means clustering algorithm according to loss measurement d (box, centroid) of the following formula to obtain a prior frame:
d(box,centroid)=1-IOU(box,centroid)
the centroid represents a randomly selected cluster center frame, box represents a real frame other than the center frame, and IOU(box, centroid) represents the degree of similarity between the other frames and the center frame, calculated as the intersection area of the two frames divided by their union area.
3. The method according to claim 1, wherein the step (3) of constructing a convolutional neural network capable of predicting the position, size and type of the target gesture simultaneously from end to end comprises the following steps:
(3a) constructing a convolutional neural network comprising G convolutional layers and 5 pooling layers using simple 1 × 1 and 3 × 3 convolution kernels, based on the GoogLeNet convolutional neural network;
(3b) training the constructed convolutional network according to the loss function of the following formula:

Loss = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
     + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
     + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
     + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
     + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²

wherein the first term of the loss function is the coordinate loss of the center point of the predicted target frame, where λ_coord is the coordinate loss coefficient, taken as 5 herein; S² denotes the number of grids the picture is divided into, and B denotes the number of prediction frames per grid; 1_{ij}^{obj} indicates whether the jth prediction frame in the ith grid is responsible for predicting a target when a target is present; (x_i, y_i) are the center-point coordinates of the target's real frame and (x̂_i, ŷ_i) the center-point coordinates of the prediction frame; the second term is the width-height loss of the prediction frame, where (w_i, h_i) are the width and height of the real frame and (ŵ_i, ĥ_i) the width and height of the prediction frame; the third and fourth terms are the loss on the probability that a prediction frame contains a target, where λ_noobj is the loss coefficient when no target is contained, taken as 0.5 herein, 1_{ij}^{noobj} indicates whether the jth prediction frame in the ith grid is responsible for the prediction when no target is contained, and C_i and Ĉ_i are the true and predicted probabilities of containing a target; the fifth term is the class probability loss, where 1_i^{obj} indicates that the ith grid contains a target center point, p_i(c) is the real target class, p̂_i(c) the predicted target class, and C the number of classes.
4. The dynamic gesture recognition method based on computer vision according to claim 1, wherein in step (4b) the image is randomly scaled by bilinear interpolation with the gesture image size selected as a multiple of 32, and the scaled input image is obtained by the following steps:
4b 1: reading a gesture image to be recognized;
4b 2: and (4) randomly zooming the gesture image by adopting a bilinear interpolation method, wherein the size is selected to be a multiple of 32, and the zoomed read-in gesture image is obtained.
CN201711102008.9A 2017-11-10 2017-11-10 Dynamic gesture recognition method based on computer vision Active CN107808143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711102008.9A CN107808143B (en) 2017-11-10 2017-11-10 Dynamic gesture recognition method based on computer vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711102008.9A CN107808143B (en) 2017-11-10 2017-11-10 Dynamic gesture recognition method based on computer vision

Publications (2)

Publication Number Publication Date
CN107808143A CN107808143A (en) 2018-03-16
CN107808143B true CN107808143B (en) 2021-06-01

Family

ID=61592035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711102008.9A Active CN107808143B (en) 2017-11-10 2017-11-10 Dynamic gesture recognition method based on computer vision

Country Status (1)

Country Link
CN (1) CN107808143B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390344A (en) * 2018-04-19 2019-10-29 华为技术有限公司 Alternative frame update method and device
CN109145756A (en) * 2018-07-24 2019-01-04 湖南万为智能机器人技术有限公司 Object detection method based on machine vision and deep learning
CN109165555A (en) * 2018-07-24 2019-01-08 广东数相智能科技有限公司 Man-machine finger-guessing game method, apparatus and storage medium based on image recognition
CN109117806B (en) * 2018-08-22 2020-11-27 歌尔科技有限公司 Gesture recognition method and device
CN109325454B (en) * 2018-09-28 2020-05-22 合肥工业大学 Static gesture real-time recognition method based on YOLOv3
CN109697407A (en) * 2018-11-13 2019-04-30 北京物灵智能科技有限公司 A kind of image processing method and device
CN109815876B (en) * 2019-01-17 2021-01-05 西安电子科技大学 Gesture recognition method based on address event stream characteristics
CN109948480A (en) * 2019-03-05 2019-06-28 中国电子科技集团公司第二十八研究所 A kind of non-maxima suppression method for arbitrary quadrilateral
CN109934184A (en) * 2019-03-19 2019-06-25 网易(杭州)网络有限公司 Gesture identification method and device, storage medium, processor
CN110135237A (en) * 2019-03-24 2019-08-16 北京化工大学 A kind of gesture identification method
CN110135408B (en) * 2019-03-26 2021-02-19 北京捷通华声科技股份有限公司 Text image detection method, network and equipment
CN110135398A (en) * 2019-05-28 2019-08-16 厦门瑞为信息技术有限公司 Both hands off-direction disk detection method based on computer vision
CN110363158B (en) * 2019-07-17 2021-05-25 浙江大学 Millimeter wave radar and visual cooperative target detection and identification method based on neural network
CN110414402A (en) * 2019-07-22 2019-11-05 北京达佳互联信息技术有限公司 A kind of gesture data mask method, device, electronic equipment and storage medium
CN111050266B (en) * 2019-12-20 2021-07-30 朱凤邹 Method and system for performing function control based on earphone detection action

Citations (2)

Publication number Priority date Publication date Assignee Title
CN106960036A (en) * 2017-03-09 2017-07-18 杭州电子科技大学 A kind of database building method for gesture identification
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9448636B2 (en) * 2012-04-18 2016-09-20 Arb Labs Inc. Identifying gestures using gesture data compressed by PCA, principal joint variable analysis, and compressed feature matrices
EP3096216B1 (en) * 2015-05-12 2018-08-29 Konica Minolta, Inc. Information processing device, information processing program, and information processing method

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN106960036A (en) * 2017-03-09 2017-07-18 杭州电子科技大学 A kind of database building method for gesture identification
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks

Non-Patent Citations (2)

Title
A Dynamic Gesture Recognition Method Based on Computer Vision;Xiao Jiang et al.;《2013 6th International Congress on Image and Signal Processing (CISP)》;20140220;第646-650页 *
基于计算机视觉的手势检测识别技术;关然 等;《计算机应用与软件》;20130131;第155-164页 *

Also Published As

Publication number Publication date
CN107808143A (en) 2018-03-16

Similar Documents

Publication Publication Date Title
CN107808143B (en) Dynamic gesture recognition method based on computer vision
CN105718878B (en) The aerial hand-written and aerial exchange method in the first visual angle based on concatenated convolutional neural network
CN107168527B (en) The first visual angle gesture identification and exchange method based on region convolutional neural networks
Malima et al. A fast algorithm for vision-based hand gesture recognition for robot control
CN109359538B (en) Training method of convolutional neural network, gesture recognition method, device and equipment
CN106874826A (en) Face key point-tracking method and device
CN106384126B (en) Clothes fashion recognition methods based on contour curvature characteristic point and support vector machines
CN107272899B (en) VR (virtual reality) interaction method and device based on dynamic gestures and electronic equipment
CN108647625A (en) A kind of expression recognition method and device
CN107633205A (en) lip motion analysis method, device and storage medium
CN102346854A (en) Method and device for carrying out detection on foreground objects
CN103105924A (en) Man-machine interaction method and device
CN108459785A (en) A kind of video multi-scale visualization method and exchange method
CN109685045A (en) A kind of Moving Targets Based on Video Streams tracking and system
Liu et al. Object proposal on RGB-D images via elastic edge boxes
Chuang et al. Saliency-guided improvement for hand posture detection and recognition
CN111275082A (en) Indoor object target detection method based on improved end-to-end neural network
CN109522908A (en) Image significance detection method based on area label fusion
Mahmood et al. A Comparative study of a new hand recognition model based on line of features and other techniques
CN108108648A (en) A kind of new gesture recognition system device and method
CN109255324A (en) Gesture processing method, interaction control method and equipment
CN109343920A (en) A kind of image processing method and its device, equipment and storage medium
CN108648211A (en) A kind of small target detecting method, device, equipment and medium based on deep learning
Półrola et al. Real-time hand pose estimation using classifiers
WO2020182121A1 (en) Expression recognition method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant