CN107808143B - Dynamic gesture recognition method based on computer vision - Google Patents


Info

Publication number
CN107808143B
Authority
CN
China
Prior art keywords
gesture
image
frame
target
prediction
Prior art date
Legal status
Active
Application number
CN201711102008.9A
Other languages
Chinese (zh)
Other versions
CN107808143A (en)
Inventor
王爽
焦李成
方帅
王若静
杨孟然
权豆
孙莉
侯彪
马晶晶
刘飞航
Current Assignee
Xidian Univ
Original Assignee
Xidian Univ
Priority date
Filing date
Publication date
Application filed by Xidian Univ
Priority to CN201711102008.9A
Publication of CN107808143A
Application granted
Publication of CN107808143B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06K: RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 9/00: Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K 9/00335: Recognising movements or behaviour, e.g. recognition of gestures, dynamic facial expressions; lip-reading
    • G06K 9/00355: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06K 9/62: Methods or arrangements for recognition using electronic means
    • G06K 9/6218: Clustering techniques
    • G06K 9/6277: Classification techniques based on a parametric (probabilistic) model
    • G06N 3/02: Computer systems based on biological models using neural network models
    • G06N 3/0454: Architectures using a combination of multiple neural nets

Abstract

The invention discloses a dynamic gesture recognition method based on computer vision, which solves the problem of dynamically recognizing gestures against a complex background. The method is implemented in the following steps: acquire a gesture data set and label it manually; cluster the real frames of the labeled image set to obtain prior frames for training; construct an end-to-end convolutional neural network that can simultaneously predict the position, size and category of a target; train the network to obtain its weights; load the weights into the network and input a gesture image for recognition; process the obtained position coordinates and category information with a non-maximum suppression method to obtain the final recognition result image; and record the recognition information in real time to obtain a dynamic gesture interpretation result. The invention overcomes the defect of the prior art, in which hand detection and category identification in gesture recognition are carried out step by step; it greatly simplifies the gesture recognition process, improves recognition accuracy and speed, enhances the robustness of the recognition system, and realizes dynamic gesture interpretation.

Description

Dynamic gesture recognition method based on computer vision
Technical Field
The invention belongs to the technical field of image processing, and further relates to an image target recognition technology, in particular to a dynamic gesture recognition method based on computer vision. The gesture recognition method can be used for position detection and state recognition of gestures in the image, so that more accurate information can be provided for subsequent sign language translation, game interaction and other applications of gesture recognition.
Background
In recent years, with the development of related disciplines such as computer vision and machine learning, human-computer interaction has gradually shifted from being computer-centered to being human-centered. Natural user interfaces that use the human body as the communication medium, including face recognition, gesture recognition and posture recognition, offer operators a more intuitive and comfortable interactive experience. Gestures, as a natural and intuitive mode of everyday communication, have promising applications: controlling intelligent equipment in virtual reality with specified gestures; translating sign language to solve the communication problem of deaf-mute people; and automatically recognizing traffic-police gestures in driverless vehicles. Gesture recognition therefore has important research value and significance.
Gesture recognition focuses mainly on two approaches: recognition based on sensing devices (such as a data glove or a position tracker), and recognition based on vision. Vision-based gesture recognition has attracted most of the research attention because it lets operators interact with machines in a more natural way and with greater flexibility. At present, most gesture recognition is based on detecting and recognizing gestures in an image using a two-step method: first detect the hand position, then determine the gesture type.
A method based on hand detection and shape detection is proposed in the paper "Real-Time Hand Gesture Recognition Using Finger Segmentation" (The Scientific World Journal, 2014(3):267872) published by Zhi-hua Chen et al. The method first extracts the hand area with background subtraction and binarizes it, then segments the fingers and the palm, and classifies the gesture against the original 13 templates using the number and content of the fingers (content referring to the finger names, such as thumb, index finger and middle finger). However, this method imposes strict requirements on the image background: the hand can be segmented only against a plain background. In addition, the gestures it recognizes are limited in shape, its robustness is poor, and it is difficult to generalize.
An algorithm based on hand detection and CNN recognition is proposed in the paper "A Real-time Hand Gesture Recognition and Human-Computer Interaction System" (In CVPR, IEEE, 2017) published by Pei Xu. The method uses basic image-processing operations such as filtering and morphology to obtain a binary image containing only the hands, which is then fed into the convolutional neural network LeNet for feature extraction and identification to improve accuracy. However, the method needs to preprocess the image and places high demands on the background colour, and the detection and recognition of the gesture are performed in two steps (first locating the gesture, then classifying it to obtain its state), so the recognition procedure is cumbersome and time-consuming.
Disclosure of Invention
The invention aims to provide a dynamic gesture recognition method based on computer vision that addresses the defects of the prior art with higher accuracy and efficiency.
The invention relates to a dynamic gesture recognition method based on computer vision, which is characterized by comprising the following steps of:
(1) acquiring a gesture image: dividing the acquired gesture images into a training set and a testing set, and manually labeling the gestures in the training set and the testing set respectively to obtain the category and coordinate data of a real data frame;
(2) clustering to obtain a prior frame: clustering the manually marked real data frames, and taking the overlapping degree of the areas of the frames as loss measurement to obtain a plurality of preliminary test prior frames;
(3) constructing an end-to-end convolutional neural network capable of simultaneously predicting the position, size and category of a target gesture: constructing an end-to-end convolutional neural network by using an improved GoogLeNet network as a network framework and simultaneously constraining loss functions of target positions and classes;
(4) training the end-to-end network:
(4a) reading in gesture images of training set samples in batch;
(4b) randomly scaling the image with bilinear interpolation, the target size being chosen as a multiple of 32, to obtain a scaled version of the read-in gesture image;
(4c) scaling the image again with bilinear interpolation to a fixed size, to obtain an image that can be input to the convolutional network;
(4d) training the convolutional neural network constructed in the step (3) by using the fixed-size image obtained in the step (4c) to obtain the weight corresponding to the constructed convolutional neural network;
(5) loading weight: loading the weights corresponding to the convolutional neural network obtained in the step (4d) into the convolutional neural network constructed in the step (3);
(6) predicting the location and category of the gesture: reading a gesture image to be recognized, inputting the gesture image into a convolutional neural network loaded with weights for recognition, and simultaneously obtaining position coordinates and category information of the gesture target to be recognized;
(7) removing redundant prediction frames: processing the obtained position coordinates and category information with a non-maximum suppression method to obtain the final prediction frame:
(7a) sorting the scores of all prediction frames in descending order and selecting the highest score and its corresponding frame;
(7b) traversing the remaining frames and deleting any frame whose overlap (IOU) with the current highest-scoring frame exceeds a set threshold;
(7c) selecting the unprocessed frame with the highest score and repeating the process, i.e. executing (7a) to (7c), to obtain the retained prediction-frame data;
(8) visualization of prediction results: mapping the prediction frame data to an original image, drawing a prediction frame in the original image and marking a category label to which a gesture target belongs;
(9) recording and analysis: recording the category and position information of the gesture in real time, analyzing the obtained real-time data, interpreting the dynamic gesture, and directly displaying the interpreted result on a screen.
The invention utilizes the deep convolutional neural network to identify the gesture end to end, not only can identify the dynamic gesture in real time, but also can keep higher accuracy under a complex background.
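The redundant-frame removal of step (7) is standard non-maximum suppression. The following is an illustrative Python sketch, not code from the patent; the (x1, y1, x2, y2) frame format and the 0.5 threshold are assumptions chosen for the example:

```python
def iou(a, b):
    """Intersection-over-union of two frames given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring frame, drop overlapping frames, repeat (7a)-(7c)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # (7a) highest remaining score
        keep.append(best)
        order = [i for i in order    # (7b) drop frames overlapping the best one
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For instance, two heavily overlapping frames on the same hand collapse to the single higher-confidence frame, while a frame on a second hand elsewhere in the image is retained.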
Compared with the prior art, the invention has the following advantages:
1. the method uses a convolutional neural network to recognize gestures, completing position detection and recognition of the gesture target in the image in one step. The steps are simple and recognition is fast, overcoming the prior-art defect that real-time performance cannot be guaranteed when hand detection and gesture recognition are handled as two separate steps. At the same time, the network extracts the features of the gesture image well: it recognizes gestures at any angle with high accuracy and places no requirement on the image background, so gestures can be recognized accurately even against a complex background, overcoming the prior art's restriction to simple image backgrounds;
2. the invention randomly scales the gesture images when training the convolutional neural network, so the input size can change every few iterations. Every 10 batches the network randomly selects a new picture size, which lets it achieve a good prediction effect on different input sizes; the same network can therefore run detection at different resolutions, with stronger robustness and generalization.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of natural scene gestures used in simulation experiments with the present invention;
FIG. 3 is a diagram of gesture target recognition results obtained in a simulation experiment;
FIG. 4 shows the recognition result for a dynamic gesture according to the present invention, in which FIG. 4(a) is one frame of the dynamic sign-language gesture meaning "object" and FIG. 4(b) is one frame of the corresponding detection result;
FIG. 5 is a record of the coordinates of the gesture center point for the dynamic gesture recognition process.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
Example 1
Gestures are a natural and intuitive mode of communication with promising applications: controlling intelligent equipment in virtual reality with specified gestures; translating sign language to solve the communication problem of deaf-mute people; automatically recognizing traffic-police gestures in driverless vehicles; and so on. At present, vision-based gesture recognition generally adopts the traditional approach of first segmenting the gesture and then classifying it. This approach places high demands on photo quality and struggles with gestures against a complex background, which limits the development of gesture recognition applications. The invention addresses this situation with research and innovation, providing a dynamic gesture recognition method based on computer vision; the method is shown in figure 1 and comprises the following steps:
(1) acquiring gesture images: divide the acquired gesture images into a training set, used to train the convolutional neural network, and a test set, used to calculate the recognition accuracy of the network. Manually label the gesture in each acquired image to obtain the size and centre-point coordinates of the rectangular frame that fits the gesture most closely, together with the category of the corresponding gesture; this yields the category and coordinate data of the real data frames.
(2) Clustering to obtain prior frames: select the number of cluster centres and cluster the manually labeled real data frames, using the degree of overlap between frame areas as the loss metric, to obtain several initial prior frames. In this example the number of cluster centres is set to 9; clustering with the overlap degree as the loss metric yields 9 prior frames, which serve as the initial prediction frames of the convolutional neural network and shorten its convergence time. In general, the number of cluster centres depends on how densely packed the objects in the pictures are: the more objects per picture, the larger the number of cluster centres should be set.
(3) Constructing an end-to-end convolutional neural network that simultaneously predicts the position, size and category of a target gesture: take the improved GoogLeNet network as the framework and pair it with a loss function that simultaneously constrains the position, size and category of the target. Because the constructed network is trained with this joint loss, it predicts the position, size and category of the target in a single pass. The network is computationally inexpensive and easy to converge, and is able to classify 9000 target categories on the ImageNet dataset.
(4) Training the end-to-end convolutional neural network: to enhance the robustness of the network to image size, the gesture images are scaled twice after being read in batches. The first scaling takes the originally input gesture image to a random size; the second scaling takes that randomly sized image to a specified fixed size. The gesture images scaled to the specified size are then input into the convolutional neural network for training to obtain the trained weights, specifically as follows:
(4a) reading in gesture images of training set samples in batch;
(4b) Randomly scale the read-in gesture images by bilinear interpolation so that the scaled size is a multiple of 32, obtaining the scaled read-in images. The purpose is to increase the scale diversity of the data, enhance the robustness of the network, and thereby improve recognition accuracy.
(4c) Scale the input image by bilinear interpolation to a fixed size, obtaining an image that can be input into the convolutional network; the fixed size is 672 × 672 in this example. The fixed size an image is scaled to is tied to the structure of the convolutional neural network.
(4d) Train the convolutional neural network constructed in step (3) with the fixed-size images obtained in step (4c) to obtain the weights of the constructed convolutional neural network.
(5) Loading the weights: load the network weights obtained in step (4d) into the convolutional neural network constructed in step (3); the weights are the network parameters required for prediction.
(6) Predicting the location and category of the gesture: read in the gesture image to be recognized; the network first scales the input image to the size given in (4c), then inputs it into the network loaded with the weights for recognition, simultaneously obtaining the position coordinates, size and category information of the recognized gesture target.
(7) Removing redundant prediction frames: apply non-maximum suppression to the position coordinates and category information of the gesture obtained from the image, yielding the final prediction frame. The prediction step can produce several recognition frames for the same target; the non-maximum suppression algorithm removes the redundant frames and retains the data of the single frame with the highest confidence, as follows:
(7a) sort the confidence scores of all frames in descending order and select the frame with the highest confidence score;
(7b) traverse the remaining frames; if a frame's intersection-over-union (IOU) with the highest-confidence frame exceeds a set threshold, delete it;
(7c) continuing to select one of the unprocessed frames with the highest score, and repeating the processes, namely executing (7a) to (7c) to obtain the reserved prediction frame data; the data of the prediction box includes the position, size, category of the box.
(8) Visualization of prediction results: the coordinate data and size of the predicted recognition frame are expressed at the fixed scale of (4c). Map the prediction-frame data from that fixed scale back to the original image size, i.e. the size of the gesture image to be recognized, draw the prediction frame in the original image, and mark the category label of the gesture target.
(9) Recording and analysis: the invention needs only 0.02 seconds to recognize a single image, which meets the requirement of real-time gesture recognition. A camera is accessed through OpenCV, the trained convolutional neural network records the category and position information of the gesture in real time, the resulting real-time data are analyzed to interpret the dynamic gesture, and the interpreted result is displayed directly on the screen.
According to the method, an end-to-end convolutional neural network is constructed by using a loss function which simultaneously restrains the position and the type of the target, and the position, the size and the type of the target are simultaneously predicted, so that the gesture recognition step is simplified, and the recognition rate is improved; in the training stage, the gesture image to be recognized is randomly zoomed and sent to the convolutional neural network for training, so that the robustness of the network is enhanced, and the recognition accuracy is improved.
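The mapping of step (8), from the fixed network-input scale back to the original image, can be sketched as follows. This is an illustrative Python sketch; the (cx, cy, w, h) frame format and the independent per-axis scale factors are assumptions, not details from the patent:

```python
def map_frame_to_original(frame, net_size, orig_w, orig_h):
    """Map a predicted frame (cx, cy, w, h) from the fixed net_size x net_size
    network input resolution back to the original image resolution."""
    sx = orig_w / float(net_size)   # horizontal scale factor
    sy = orig_h / float(net_size)   # vertical scale factor
    cx, cy, w, h = frame
    return (cx * sx, cy * sy, w * sx, h * sy)
```

For a 1344 × 672 original image and a 672 × 672 network input, a centred 100 × 50 prediction maps back to a 200 × 50 frame at the centre of the original image.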
Example 2
Similar to embodiment 1, the dynamic gesture recognition method based on computer vision includes the following steps:
(2a) reading the manually marked real frame data of the training set and the test set samples;
(2b) setting the number of clustering centers, clustering by adopting a k-means clustering algorithm according to loss measurement d (box, centroid) of the following formula to obtain a prior frame:
d(box,centroid)=1-IOU(box,centroid)
where centroid denotes a randomly selected cluster-centre frame, box denotes any other real frame, and IOU(box, centroid) denotes the similarity between that frame and the centre frame, i.e. the proportion of their overlapping area, calculated as the intersection of the centre frame and the other frame divided by their union.
The invention can obtain a plurality of prior frames which are most representative of the manually collected real frames through clustering, wherein the prior frames are initial test frames of neural network prediction. The determination of the prior frame can reduce the prediction range of the convolutional neural network and accelerate the convergence of the network.
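The prior-frame clustering of this example can be sketched in Python as follows. This is an illustrative sketch, not the patent's code; the (w, h) frame representation, the centre-aligned IOU, and the toy frame sizes are assumptions chosen for the example:

```python
import random

def iou_wh(box, centroid):
    """IOU of two frames (w, h) aligned at a common centre."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_priors(boxes, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) frames with d = 1 - IOU to get k prior frames."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        # assignment step: minimal distance d = maximal IOU with a centroid
        clusters = [[] for _ in range(k)]
        for b in boxes:
            j = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[j].append(b)
        # update step: mean width/height of each cluster (keep empty clusters as-is)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids
```

Clustering six labeled frames of two distinct scales with k = 2, for example, recovers one small and one large prior frame.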
Example 3
The dynamic gesture recognition method based on computer vision is the same as the embodiment 1-2, and the method for constructing the convolutional neural network in the step (3) comprises the following steps:
(3a) based on the GoogLeNet convolutional neural network, a convolutional neural network containing G convolutional layers and 5 pooling layers was constructed using simple 1 × 1 and 3 × 3 convolution kernels; in this example G is 25.
(3b) Training the constructed convolutional network according to the loss function of the following formula:
L = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i - x̂_i)² + (y_i - ŷ_i)²]
  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i - √ŵ_i)² + (√h_i - √ĥ_i)²]
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i - Ĉ_i)²
  + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i - Ĉ_i)²
  + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c=1}^{C} (p_i(c) - p̂_i(c))²
wherein the first term of the loss function is the coordinate loss of the centre point of the predicted target frame; λ_coord is the coordinate loss coefficient, with 1 ≤ λ_coord ≤ 5 (3 in this example), chosen to keep the predicted gesture position accurate; S² is the number of grid cells the picture is divided into and B is the number of prediction frames per cell; 1_{ij}^{obj} indicates whether the j-th prediction frame in the i-th cell is responsible for predicting the target when a target is present; (x_i, y_i) are the centre-point coordinates of the target's real frame and (x̂_i, ŷ_i) those of the prediction frame. The second term is the width-height loss of the prediction frame, where (w_i, h_i) are the width and height of the real frame and (ŵ_i, ĥ_i) those of the prediction frame. The third and fourth terms are the loss on the probability that a frame contains the target; λ_noobj is the loss coefficient when no target is contained, with 0.1 ≤ λ_noobj ≤ 1 (1 in this example), chosen so that the convolutional neural network can distinguish target frames from background frames; 1_{ij}^{noobj} indicates whether the j-th prediction frame in the i-th cell is responsible when no target is contained; C_i is the true probability of containing the target and Ĉ_i the predicted probability. The fifth term is the predicted class-probability loss; 1_i^{obj} indicates that the i-th cell contains a target centre point; p_i(c) is the true target class, p̂_i(c) the predicted class, and C the number of classes.
The position detection and the category identification of the gesture are completed in one step in the embodiment of the invention. The method comprises the steps of extracting features of an original gesture image by adopting a convolutional neural network, and then training the network by reducing position loss and category loss to enable the network to identify gesture types while detecting gesture positions.
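As a sketch of how the five loss terms combine, the following illustrative Python reduces the loss to a single grid cell with one responsible prediction frame. The dict-based interface and the square-root width-height form are assumptions made for illustration; the full loss sums such terms over all S² cells and B frames:

```python
from math import sqrt

def cell_loss(truth, pred, lambda_coord=3.0, lambda_noobj=1.0):
    """Five-term loss for one grid cell / one responsible prediction frame.

    truth and pred are dicts with centre (x, y), size (w, h), the
    target-containment probability conf, and a class-probability list p.
    """
    # centre-point coordinate loss (first term), weighted by lambda_coord
    coord = lambda_coord * ((truth['x'] - pred['x']) ** 2
                            + (truth['y'] - pred['y']) ** 2)
    # width-height loss (second term); square roots damp the effect of large frames
    size = lambda_coord * ((sqrt(truth['w']) - sqrt(pred['w'])) ** 2
                           + (sqrt(truth['h']) - sqrt(pred['h'])) ** 2)
    # containment-probability loss (third/fourth terms): plain weight when the
    # cell holds a target, lambda_noobj when it does not
    conf_weight = 1.0 if truth['conf'] > 0 else lambda_noobj
    conf = conf_weight * (truth['conf'] - pred['conf']) ** 2
    # class-probability loss (fifth term)
    cls = sum((t - q) ** 2 for t, q in zip(truth['p'], pred['p']))
    return coord + size + conf + cls
```

A perfect prediction gives zero loss; shifting only the predicted centre increases just the first term, scaled by lambda_coord.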
Example 4
The dynamic gesture recognition method based on computer vision is the same as in embodiments 1-3. In step (4b) the image is randomly scaled by bilinear interpolation, with the size chosen as a multiple of 32, to obtain the scaled input image; this proceeds as follows:
4b 1: a gesture image to be recognized is read in.
4b 2: the image is randomly scaled by bilinear interpolation, with the size chosen as a multiple of 32, to obtain the scaled input image.
Referring to fig. 2, which shows the gesture images to be processed that are input in this embodiment of the present invention, the pixel range of the gesture images is [600-.
The invention randomly scales the gesture images when training the convolutional neural network in order to increase the network's robustness to image size. The algorithm picks a new random scale every 10 batches, so that the network achieves a good prediction effect across different input sizes; the same network can thus detect gesture images at different resolutions, with stronger robustness and generalization.
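The two resizing operations of this example can be sketched in pure Python. The bilinear formula is the standard one; the 320-608 multi-scale range is an assumption for illustration (the fixed sizes actually used in the examples are 672 × 672 and 608 × 608):

```python
import random

def random_training_size(lo=320, hi=608, step=32, rng=None):
    """Pick a random network input size that is a multiple of 32."""
    rng = rng or random.Random()
    return rng.randrange(lo, hi + 1, step)

def bilinear_resize(img, new_h, new_w):
    """Resize a 2-D list of grey values with bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        # map the output row back into source coordinates
        y = i * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y); y1 = min(y0 + 1, h - 1); fy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x); x1 = min(x0 + 1, w - 1); fx = x - x0
            # weighted average of the four surrounding source pixels
            row.append(img[y0][x0] * (1 - fy) * (1 - fx)
                       + img[y0][x1] * (1 - fy) * fx
                       + img[y1][x0] * fy * (1 - fx)
                       + img[y1][x1] * fy * fx)
        out.append(row)
    return out
```

Upscaling a 2 × 2 gradient to 3 × 3, for example, fills in the intermediate values linearly.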
Example 5
The dynamic gesture recognition method based on computer vision is the same as embodiments 1-4. Referring to fig. 1, the specific implementation steps include:
step 1: gather gesture image, shoot gesture image with the camera, including: "stone", "scissors", "cloth", "stick", "OK", "love", etc., see fig. 2(a) - (f). Fig. 2(a) is a front and back fist making gesture, fig. 2(b) is a front and back "scissors" gesture, fig. 2(c) is a front and back palm gesture, fig. 2(d) is a tree thumb gesture, fig. 2(e) is an "OK" gesture, and fig. 2(f) is a "love heart" gesture. Each gesture image also contains some complex backgrounds, and the same gesture has various rotation angles. And dividing the acquired gesture images into a training set and a testing set, and manually labeling the gestures in the acquired gesture images to obtain the category and coordinate data of the real frame.
The collected natural-scene gesture image set contains 2500 images; in this example six representative gestures were selected, divided into a training set of 2000 images and a test set of 500, see fig. 2. The image set was shot with a 12-megapixel mobile-phone camera, and the captured images were screened and manually labeled.
Step 2: and clustering to obtain a prior frame.
Real box data of training set and test set samples are read.
In this embodiment, the real frame of the training set and the test set sample is the coordinate and the category information of the manually labeled target frame in the image.
Clustering by adopting a k-means clustering algorithm according to loss measurement d (box, centroid) of the following formula to obtain a prior frame:
d(box,centroid)=1-IOU(box,centroid)
where centroid denotes a randomly selected cluster-centre frame, box denotes any other real frame, and IOU(box, centroid) denotes the degree of similarity between the two, calculated as the intersection of the two frames divided by their union.
The number of cluster-centre frames selected in this example is 5, and IOU(box, centroid) is calculated according to the following formula:
IOU(box, centroid) = |box ∩ centroid| / |box ∪ centroid|
where ∩ denotes the intersection area of the centroid and box frames, and ∪ denotes their union area.
Step 3: construct the convolutional neural network.
Based on the GoogLeNet convolutional neural network, a convolutional neural network containing G convolutional layers and 5 pooling layers was constructed using simple 1 × 1 and 3 × 3 convolution kernels; in this example G is 23.
The constructed convolutional network is trained with the same loss function as in step (3b) of Example 3, where λ_coord, the coordinate loss coefficient of the first term (the centre-point coordinate loss of the predicted target frame), is set to 5 in this example, and λ_noobj, the loss coefficient of the third and fourth terms (the probability loss for frames not containing a target), is set to 0.5 in this example.
Even for the same gesture, different shooting angles yield different images. Existing methods struggle to recognize different angles of the same gesture stably, but the convolutional neural network constructed by the invention overcomes the difficulty that one gesture appears at multiple rotation angles, giving good stability in gesture recognition.
Step 4: training the network.
Gesture images of the training set samples are read in batches. In this embodiment, the network reads 64 training set images per batch.
Each image is randomly scaled by bilinear interpolation, with the scaled gesture image size chosen as a multiple of 32, giving the scaled input image.
As shown in fig. 2, the pixel size of the gesture images lies in the range [500–…].
The scaled gesture image is then resized again by bilinear interpolation to a fixed size, giving an image that can be input into the convolutional network. In this example, the gesture image is scaled to a fixed size of 608 × 608.
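The two-stage scaling just described can be sketched as below: first a random scale to a side length that is a multiple of 32, then a second scale to the fixed 608 × 608 network input. The sketch assumes square outputs and a random side range of 320–608 pixels; the range and the names `bilinear_resize` and `two_stage_resize` are assumptions for illustration.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear interpolation for an (H, W) or (H, W, C) numpy image."""
    h, w = img.shape[:2]
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    if img.ndim == 3:          # broadcast weights over the channel axis
        wy = wy[..., None]; wx = wx[..., None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def two_stage_resize(img, rng, lo=320, hi=608, fixed=608):
    # first scaling: random side length that is a multiple of 32
    side = 32 * rng.integers(lo // 32, hi // 32 + 1)
    intermediate = bilinear_resize(img, side, side)
    # second scaling: to the fixed network input size (608 x 608 here)
    return bilinear_resize(intermediate, fixed, fixed)
```

The random first stage exposes the network to many input scales during training, which is the stated purpose of the double scaling.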
The fixed-size gesture images are input into the constructed convolutional neural network for training to obtain the network weights; the weights are the parameters of the convolutional neural network and are used at test time. The network is trained with the training set samples for 20,000 iterations to obtain the weights, completing training.
Step 5: the network weights, namely the parameters obtained in step 4, are loaded into the convolutional neural network constructed in step 3 in preparation for testing.
Step 6: the gesture images to be recognized in the test set are read in and input into the weight-loaded network for recognition, obtaining the size, position coordinates and category information of the recognized gesture targets; see fig. 3, where figs. 3(a)–(f) are the recognition results corresponding to figs. 2(a)–(f).
Step 7: the obtained position and category information is processed by a non-maximum suppression method to obtain the final prediction frames.
All prediction frames are sorted in descending order of confidence score, and the highest score and its corresponding frame are selected;
the remaining prediction frames are traversed, and any frame whose IOU (intersection over union) with the highest-scoring frame is larger than a certain threshold is deleted;
the frame with the highest score among the unprocessed frames is then selected and the above process repeated, yielding the retained prediction frame data.
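The suppression procedure of step 7 can be sketched as a minimal numpy implementation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the helper names `iou` and `nms` and the 0.45 threshold are illustrative, not from the patent.

```python
import numpy as np

def iou(box, boxes):
    # intersection over union of one (x1, y1, x2, y2) box against many
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh=0.45):
    """Greedy non-maximum suppression; returns indices of the kept frames."""
    order = np.argsort(scores)[::-1]   # descending by confidence score
    keep = []
    while order.size > 0:
        best = order[0]                # current highest-scoring frame
        keep.append(int(best))
        rest = order[1:]
        # drop remaining frames overlapping the best one above the threshold
        order = rest[iou(boxes[best], boxes[rest]) <= thresh]
    return keep
```

The kept indices correspond to the final prediction frames that are mapped back onto the original image in step 8.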
Step 8: the prediction frame data is mapped back onto the original image to obtain the category and position information of the gesture; the prediction frame is drawn in the original image and labeled with the category of the target, see figs. 3(a)–3(f), where the label at the upper left corner of each prediction frame is the predicted gesture category.
Step 9: the category and position information of the gesture is recorded in real time (see fig. 4), the resulting real-time data is analyzed to interpret the dynamic gesture, and the interpretation result is displayed directly on the screen, see table 1.
TABLE 1 dynamic gesture recognition real-time detection results
Predicted gesture center abscissa    Predicted gesture center ordinate    Gesture class
1164 371 Scissor
318 372 Scissor
1152 373 Scissor
364 384 Scissor
1097 380 Scissor
388 388 Scissor
1061 381 Scissor
1027 383 Scissor
430 409 Scissor
452 395 Scissor
1001 380 Scissor
465 397 Scissor
989 381 Scissor
510 395 Scissor
960 381 Scissor
524 392 Scissor
951 384 Scissor
557 395 Scissor
918 394 Scissor
561 396 Scissor
The data in table 1 are part of the log recorded by the invention for the dynamic process, shown in fig. 4, of two gestures moving horizontally inward from the two sides. Fig. 4(a) shows one frame of the dynamic gesture whose sign-language meaning is "object", and fig. 4(b) shows one frame of the detection result for that dynamic gesture process. Analysis of the data in table 1 shows that the gesture remains "scissors" throughout. Visualizing the coordinate data of table 1 gives fig. 5, in which the abscissa and ordinate of each point are the abscissa and ordinate of the gesture center point in the current frame image; the points record the coordinates of the two "scissors" gestures moving dynamically from the outside inward. As can be seen from fig. 5, the ordinate of the dynamic gesture center point is essentially unchanged while the abscissa changes greatly, indicating that the process is two "scissors" gestures approaching each other horizontally, corresponding to the sign-language meaning "object", see fig. 4.
In the embodiment of the invention, the motion of the gesture is judged by computing the distribution histogram of the motion trajectory, and the meaning expressed by the gesture over the whole dynamic process is judged by combining the change of gesture state during the motion, covering both static gesture recognition and dynamic gesture interpretation.
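As a toy illustration of this kind of trajectory analysis, the sketch below classifies a logged track of predicted center points, assuming a 1280 × 720 image: the ordinate histogram must stay concentrated in one bin (no vertical motion) while the two hands' abscissae converge. The decision rule and the name `interpret_track` are simplifications for illustration, not the patent's exact histogram method.

```python
import numpy as np

def interpret_track(centers, img_w=1280, img_h=720):
    """Classify a two-hand track as 'horizontal approach' or 'other'.

    centers: sequence of per-frame predicted gesture center points (x, y),
    with the two hands' detections interleaved as they are logged.
    """
    centers = np.asarray(centers, float)
    # split the interleaved log into left-hand and right-hand tracks
    left = centers[centers[:, 0] < img_w / 2]
    right = centers[centers[:, 0] >= img_w / 2]
    # ordinate histogram concentrated in one bin -> essentially no vertical motion
    y_hist, _ = np.histogram(centers[:, 1], bins=10, range=(0, img_h))
    vertical_still = y_hist.max() == len(centers)
    # the left hand's abscissa grows while the right hand's shrinks
    approaching = (np.all(np.diff(left[:, 0]) > 0) and
                   np.all(np.diff(right[:, 0]) < 0))
    return "horizontal approach" if vertical_still and approaching else "other"
```

Applied to the table 1 log, this rule reports a horizontal approach, matching the "object" interpretation of fig. 4.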
The technical effects of the present invention will be described with reference to the simulation.
Example 6
The dynamic gesture recognition method based on computer vision is the same as embodiments 1-5.
Simulation experiment conditions are as follows:
The hardware platform of the simulation experiments of the invention is a Dell computer with an Intel(R) Core i5 processor, a main frequency of 3.20 GHz and 64 GB of memory; the simulation software platform is Visual Studio 2015.
Simulation experiment content and result analysis:
the simulation experiment of the invention is divided into two simulation experiments.
The position coordinates and the category data of the collected data set are manually marked and made into a PASCAL VOC format data set, wherein 80% of the data set is used as a training set sample, and 20% of the data set is used as a test set sample.
Simulation experiment 1: comparison of the invention with the prior art. The invention, the prior-art method based on hand detection and shape detection, and the prior-art method based on hand detection and CNN recognition are each trained with the same training set samples, and all methods are then evaluated with the same test set samples. The results are shown in table 2, where Alg1 denotes the method of the invention, Alg2 the method based on hand detection and shape detection, and Alg3 the method based on hand detection and CNN recognition.
TABLE 2 test set accuracy of three methods simulation experiment
Test image           Alg1   Alg2   Alg3
Accuracy (%)         98.0   31.3   78.6
Time per image (s)   0.02   0.13   0.94
As can be seen from table 2, compared with the method based on hand detection and shape detection and the method based on hand detection and CNN recognition, the gesture recognition accuracy of the invention has a clear advantage, improving the recognition rate by roughly 67 and 20 percentage points respectively, and its recognition speed is roughly 6 and 47 times faster than the other two methods. The recognition rate of the invention is higher than those of the other two algorithms because it maintains a very high recognition rate under complex backgrounds and across gesture angles. Its recognition speed is higher because the invention constructs an end-to-end convolutional neural network that predicts the position and category of the gesture simultaneously, without a separate two-step procedure. The simulation results show that the method achieves high recognition rate and high speed when recognizing gesture targets, particularly under complex backgrounds.
Example 7
The dynamic gesture recognition method based on computer vision is the same as the embodiments 1-5, and the simulation conditions and contents are the same as the embodiment 6.
Simulation experiment 2: using the method of the invention, different input image scaling sizes are used as the network input on the test set; the test evaluation results are shown in table 3.
TABLE 3 recognition results for different network input sizes
As can be seen from table 3, once the input image is scaled above a certain size the target recognition accuracy no longer changes significantly, so weighing recognition accuracy against recognition speed, the 608 × 608 fixed-size gesture image is chosen as the optimal input size of the convolutional neural network.
The dynamic gesture recognition method based on computer vision provided by the invention can obtain better recognition accuracy for gesture target recognition, and can perform real-time gesture recognition.
In summary, the invention discloses a dynamic gesture recognition method based on computer vision that solves the problem of dynamically recognizing gestures against complex backgrounds. The method comprises: collecting a gesture data set and labeling it manually; clustering the real frames of the labeled image set to obtain the prior frames for training; constructing an end-to-end convolutional neural network that simultaneously predicts the position, size and category of a target; training the network to obtain weights; loading the weights into the network; inputting gesture images for recognition; processing the results by non-maximum suppression to obtain the position coordinates and category information; obtaining the final recognition result image; and recording the recognition information in real time to obtain the dynamic gesture interpretation result. The invention overcomes the drawback of prior-art gesture recognition that hand detection and category recognition are performed in separate steps, greatly simplifies the gesture recognition process, improves recognition accuracy and speed, enhances the robustness of the recognition system, and realizes dynamic gesture interpretation. The method can be applied to human-computer interaction in virtual reality, sign language translation, automatic recognition of traffic-police gestures by driverless vehicles, and other fields.

Claims (4)

1. A dynamic gesture recognition method based on computer vision is characterized by comprising the following steps:
(1) acquiring a gesture image: dividing the acquired gesture images into a training set and a testing set, and manually labeling the gestures in the training set and the testing set respectively to obtain the category and coordinate data of a real data frame;
(2) clustering to obtain a prior frame: clustering the manually marked real data frames, and taking the overlapping degree of the areas of the frames as loss measurement to obtain a plurality of preliminary test prior frames;
(3) constructing an end-to-end convolutional neural network capable of simultaneously predicting the position, size and category of a target gesture: constructing an end-to-end convolutional neural network by using an improved GoogLeNet network as a network framework and simultaneously constraining loss functions of target positions and classes;
(4) training an end-to-end convolutional neural network: in order to enhance the robustness of the convolutional neural network to image size, after gesture images are read in batches, the read gesture images are scaled twice: the first time, the originally input gesture image is randomly scaled to an arbitrary size; the second time, the arbitrarily sized scaled image is scaled again to a specified size; finally, the gesture image scaled to the specified size is input into the convolutional neural network for training to obtain the training weights, which specifically comprises the following steps:
(4a) reading in gesture images of training set samples in batch;
(4b) randomly zooming the image by a bilinear interpolation method, wherein the size is selected to be a multiple of 32, and obtaining a zoomed read-in gesture image;
(4c) carrying out size scaling on the scaled gesture image obtained in the step 4(b) again by adopting a bilinear interpolation method, and scaling to a fixed size to obtain an image which can be input into a convolution network;
(4d) training the convolutional neural network constructed in the step (3) by using the fixed-size image obtained in the step (4c) to obtain the weight corresponding to the convolutional neural network;
(5) loading weight: loading the network weight obtained in the step (4d) into the convolutional neural network constructed in the step (3);
(6) predicting the location and category of the gesture: reading a gesture image to be recognized, inputting the gesture image into a network loaded with weights for recognition, and simultaneously obtaining position coordinates and category information of gesture target recognition;
(7) removing redundant prediction blocks: processing the obtained position coordinates and the category information by adopting a non-maximum value inhibition method to obtain a final prediction frame:
(7a) sorting all prediction frames in descending order of score, and selecting the highest score and the frame corresponding to it;
(7b) traversing the remaining frames, and deleting any frame whose overlap IOU with the current highest-scoring frame is larger than a certain threshold;
(7c) continuing to select the frame with the highest score among the unprocessed frames, and repeating the above process, namely (7a) to (7c), to obtain the retained prediction frame data;
(8) visualization of prediction results: mapping the prediction frame data to an original image, drawing a prediction frame in the original image and marking a category label to which a gesture target belongs;
(9) recording and analysis: recording the category and position information of the gesture in real time, analyzing the obtained real-time data, interpreting the dynamic gesture, and directly displaying the interpreted result on a screen.
2. The method according to claim 1, wherein the step (2) of clustering the manually labeled real data frames comprises the following steps:
(2a) reading real frame data of a gesture image training set and a test set sample;
(2b) clustering by adopting a k-means clustering algorithm according to loss measurement d (box, centroid) of the following formula to obtain a prior frame:
d(box,centroid)=1-IOU(box,centroid)
the centroid represents a randomly selected cluster center frame, box represents a real frame other than the center frame, and IOU(box, centroid) represents the degree of similarity between the other frames and the center frame, calculated as the intersection area of the two frames divided by their union area.
3. The method according to claim 1, wherein the step (3) of constructing a convolutional neural network capable of predicting the position, size and type of the target gesture simultaneously from end to end comprises the following steps:
(3a) constructing a convolutional neural network comprising G convolutional layers and 5 pooling layers using simple 1 × 1 and 3 × 3 convolution kernels, based on the GoogLeNet convolutional neural network;
(3b) training the constructed convolutional network according to the loss function of the following formula:

Loss = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
     + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
     + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
     + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
     + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²

wherein the first term of the loss function is the coordinate loss of the center point of the predicted target frame, where λ_coord is the coordinate loss coefficient, taken as 5 herein; S² denotes the number of grids the picture is divided into, and B denotes the number of prediction frames per grid; 1_{ij}^{obj} indicates whether the jth prediction frame in the ith grid is responsible for predicting a target when a target is present; (x_i, y_i) are the center-point coordinates of the target's real frame and (x̂_i, ŷ_i) the center-point coordinates of the prediction frame; the second term is the width-height loss of the prediction frame, where (w_i, h_i) are the width and height of the real frame and (ŵ_i, ĥ_i) the width and height of the prediction frame; the third and fourth terms are the loss on the probability that a prediction frame contains a target, where λ_noobj is the loss coefficient when no target is contained, taken as 0.5 herein, 1_{ij}^{noobj} indicates whether the jth prediction frame in the ith grid is responsible for the prediction when no target is contained, and C_i and Ĉ_i are the true and predicted probabilities of containing a target; the fifth term is the class probability loss, where 1_i^{obj} indicates that the ith grid contains a target center point, p_i(c) is the real target class, p̂_i(c) the predicted target class, and C the number of classes.
4. The dynamic gesture recognition method based on computer vision according to claim 1, wherein in step (4b) the image is randomly scaled by bilinear interpolation with the gesture image size selected as a multiple of 32, and the scaled input image is obtained by the following steps:
4b 1: reading a gesture image to be recognized;
4b 2: and (4) randomly zooming the gesture image by adopting a bilinear interpolation method, wherein the size is selected to be a multiple of 32, and the zoomed read-in gesture image is obtained.
CN201711102008.9A 2017-11-10 2017-11-10 Dynamic gesture recognition method based on computer vision Active CN107808143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711102008.9A CN107808143B (en) 2017-11-10 2017-11-10 Dynamic gesture recognition method based on computer vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711102008.9A CN107808143B (en) 2017-11-10 2017-11-10 Dynamic gesture recognition method based on computer vision

Publications (2)

Publication Number Publication Date
CN107808143A CN107808143A (en) 2018-03-16
CN107808143B true CN107808143B (en) 2021-06-01

Family

ID=61592035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711102008.9A Active CN107808143B (en) 2017-11-10 2017-11-10 Dynamic gesture recognition method based on computer vision

Country Status (1)

Country Link
CN (1) CN107808143B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390344A (en) * 2018-04-19 2019-10-29 华为技术有限公司 Alternative frame update method and device
CN109145756A (en) * 2018-07-24 2019-01-04 湖南万为智能机器人技术有限公司 Object detection method based on machine vision and deep learning
CN109165555A (en) * 2018-07-24 2019-01-08 广东数相智能科技有限公司 Man-machine finger-guessing game method, apparatus and storage medium based on image recognition
CN109117806B (en) * 2018-08-22 2020-11-27 歌尔科技有限公司 Gesture recognition method and device
CN109325454B (en) * 2018-09-28 2020-05-22 合肥工业大学 Static gesture real-time recognition method based on YOLOv3
CN109697407A (en) * 2018-11-13 2019-04-30 北京物灵智能科技有限公司 A kind of image processing method and device
CN109815876B (en) * 2019-01-17 2021-01-05 西安电子科技大学 Gesture recognition method based on address event stream characteristics
CN109948480A (en) * 2019-03-05 2019-06-28 中国电子科技集团公司第二十八研究所 A kind of non-maxima suppression method for arbitrary quadrilateral
CN109934184A (en) * 2019-03-19 2019-06-25 网易(杭州)网络有限公司 Gesture identification method and device, storage medium, processor
CN110135237A (en) * 2019-03-24 2019-08-16 北京化工大学 A kind of gesture identification method
CN110135408B (en) * 2019-03-26 2021-02-19 北京捷通华声科技股份有限公司 Text image detection method, network and equipment
CN110135398A (en) * 2019-05-28 2019-08-16 厦门瑞为信息技术有限公司 Both hands off-direction disk detection method based on computer vision
CN110363158B (en) * 2019-07-17 2021-05-25 浙江大学 Millimeter wave radar and visual cooperative target detection and identification method based on neural network
CN110414402A (en) * 2019-07-22 2019-11-05 北京达佳互联信息技术有限公司 A kind of gesture data mask method, device, electronic equipment and storage medium
CN111050266B (en) * 2019-12-20 2021-07-30 朱凤邹 Method and system for performing function control based on earphone detection action

Citations (2)

Publication number Priority date Publication date Assignee Title
CN106960036A (en) * 2017-03-09 2017-07-18 杭州电子科技大学 A kind of database building method for gesture identification
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9448636B2 (en) * 2012-04-18 2016-09-20 Arb Labs Inc. Identifying gestures using gesture data compressed by PCA, principal joint variable analysis, and compressed feature matrices
EP3096216B1 (en) * 2015-05-12 2018-08-29 Konica Minolta, Inc. Information processing device, information processing program, and information processing method

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN106960036A (en) * 2017-03-09 2017-07-18 杭州电子科技大学 A kind of database building method for gesture identification
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks

Non-Patent Citations (2)

Title
A Dynamic Gesture Recognition Method Based on Computer Vision;Xiao Jiang et al.;《2013 6th International Congress on Image and Signal Processing (CISP)》;20140220;第646-650页 *
基于计算机视觉的手势检测识别技术;关然 等;《计算机应用与软件》;20130131;第155-164页 *

Also Published As

Publication number Publication date
CN107808143A (en) 2018-03-16

Similar Documents

Publication Publication Date Title
CN107808143B (en) Dynamic gesture recognition method based on computer vision
CN105718878B (en) The aerial hand-written and aerial exchange method in the first visual angle based on concatenated convolutional neural network
CN107168527B (en) The first visual angle gesture identification and exchange method based on region convolutional neural networks
Malima et al. A fast algorithm for vision-based hand gesture recognition for robot control
CN109359538B (en) Training method of convolutional neural network, gesture recognition method, device and equipment
CN106874826A (en) Face key point-tracking method and device
CN106384126B (en) Clothes fashion recognition methods based on contour curvature characteristic point and support vector machines
CN107272899B (en) VR (virtual reality) interaction method and device based on dynamic gestures and electronic equipment
CN108647625A (en) A kind of expression recognition method and device
CN107633205A (en) lip motion analysis method, device and storage medium
CN102346854A (en) Method and device for carrying out detection on foreground objects
CN103105924A (en) Man-machine interaction method and device
CN108459785A (en) A kind of video multi-scale visualization method and exchange method
CN109685045A (en) A kind of Moving Targets Based on Video Streams tracking and system
Liu et al. Object proposal on RGB-D images via elastic edge boxes
Chuang et al. Saliency-guided improvement for hand posture detection and recognition
CN111275082A (en) Indoor object target detection method based on improved end-to-end neural network
CN109522908A (en) Image significance detection method based on area label fusion
Mahmood et al. A Comparative study of a new hand recognition model based on line of features and other techniques
CN108108648A (en) A kind of new gesture recognition system device and method
CN109255324A (en) Gesture processing method, interaction control method and equipment
CN109343920A (en) A kind of image processing method and its device, equipment and storage medium
CN108648211A (en) A kind of small target detecting method, device, equipment and medium based on deep learning
Półrola et al. Real-time hand pose estimation using classifiers
WO2020182121A1 (en) Expression recognition method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant