CN113435340A - Real-time gesture recognition method based on improved Resnet - Google Patents
Real-time gesture recognition method based on improved Resnet
- Publication number
- CN113435340A (application CN202110722834.3A)
- Authority
- CN
- China
- Prior art keywords
- gesture
- classification
- network
- result
- sliding window
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention provides a real-time gesture recognition method based on an improved Resnet, which comprises the following steps: step S1: the video stream is fed through a sliding window into a gesture detection network, which outputs whether a gesture is detected; step S2: the detection result is passed through a filter, which combines historical information to output a final detection result; step S3: if the output of the filter indicates that a gesture is detected, the video stream in the sliding window is input into a gesture classification network, which outputs a classification result; step S4: the classification result is filtered, and only results meeting the condition are output. The method can effectively recognize gestures in video.
Description
Technical Field
The invention belongs to the technical field of pattern recognition and computer vision, and particularly relates to a real-time gesture recognition method based on improved Resnet.
Background
Gestures are a very natural way of human-computer interaction and are already applied in many areas of daily life, such as sign language recognition and device control. As neural network technology matures, computer-vision-based gesture recognition has therefore become a research hotspot. In practical applications, recognizing gestures from a video stream while balancing accuracy against the real-time requirements of the system adds further difficulty. Although gesture recognition technology has made great progress, many challenges remain in real environments: factors such as lighting and distance affect recognition performance.
Disclosure of Invention
To address this gap in the prior art, the invention provides a real-time gesture recognition method based on an improved Resnet, which comprises the following steps: step S1: the video stream is fed through a sliding window into a gesture detection network, which outputs whether a gesture is detected; step S2: the detection result is passed through a filter, which combines historical information to output a final detection result; step S3: if the output of the filter indicates that a gesture is detected, the video stream in the sliding window is input into a gesture classification network, which outputs a classification result; step S4: the classification result is filtered, and only results meeting the condition are output. The method can effectively recognize gestures in video.
The invention specifically adopts the following technical scheme:
a real-time gesture recognition method based on improved Resnet is characterized by comprising the following steps:
step S1: the video stream is fed through a sliding window into a gesture detection network, which outputs whether a gesture is detected;
step S2: the detection result is passed through a filter, which combines historical information to output a final detection result;
step S3: if the output of the filter indicates that a gesture is detected, the video stream in the sliding window is input into a gesture classification network, which outputs a classification result;
step S4: the classification result is filtered, and the classification result meeting the condition is output;
in step S1, the feature extraction network employed includes a first improved Resnet10 and a second improved Resnet10;
the first improved Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 5 x 5 with stride 1, and changes the stride of the first convolution in the third residual block to 1;
the second improved Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 9 x 9 with stride 4, changes the third residual block from a bottleneck residual block to a basic residual block, and changes the convolution kernels after the first convolution kernel in the whole network to 5 x 5 with stride 3;
the outputs of the first improved Resnet10 and the second improved Resnet10 are concatenated and passed through the basic residual blocks of the two improved Resnet10 networks and an average pooling layer with stride 2 to obtain the gesture feature t;
in step S3, the gesture classification network is built on the basis of Resnet101: the features extracted by the second bottleneck residual block of Resnet101 are connected with the features extracted by the gesture detection network to obtain the structure of the gesture classification network.
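The effect of the modified first convolutions on feature-map size can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the 112 x 112 input resolution and the padding values are assumptions, since the patent specifies only kernel sizes and strides.

```python
# Spatial output size of a convolution: floor((n + 2p - k) / s) + 1.
# The input size 112 and the paddings below are illustrative assumptions;
# the patent text specifies only the kernel sizes and strides.

def conv_out(n: int, k: int, s: int, p: int) -> int:
    """Output spatial size for input size n, kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

n = 112  # assumed input resolution

# Original Resnet10 first conv: 7x7, stride 2 (padding 3 halves the size).
print(conv_out(n, k=7, s=2, p=3))   # 56

# First improved Resnet10: 5x5, stride 1 -> full resolution preserved,
# which helps detect small (far-away) hands.
print(conv_out(n, k=5, s=1, p=2))   # 112

# Second improved Resnet10: 9x9, stride 4 -> aggressive downsampling
# and a large receptive field, suited to close-up hands.
print(conv_out(n, k=9, s=4, p=4))   # 28
```

The two branches thus see the same frames at very different effective scales, which is what the concatenation in the text exploits.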
Further, step S1 specifically includes the following steps:
step S11: selecting a gesture recognition training set Jester as a data set, and obtaining related labels of training data;
step S12: setting the length n of a sliding window of a gesture detection network to be 8, and performing gesture detection to obtain a gesture feature t;
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th frame in the sliding window; DET(·) is the gesture detection network model, and t = DET(D) is the feature corresponding to the video in the current sliding window. The feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score that no gesture is present and s_1 is the score that a gesture is present.
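The sliding-window mechanism of steps S12–S13 can be sketched with a fixed-length frame buffer. This is a minimal illustration: the frames are toy dictionaries and `detect` is a stub standing in for DET(·), not the real network.

```python
from collections import deque

# Minimal sketch of the detection-side sliding window (n = 8 frames).
# `detect` is a stub for DET(.); a real implementation would run the
# window through the detection network and its final FC layer W.

N = 8  # sliding window length n from step S12

def detect(window):
    """Stub for DET(.): returns (s0, s1) = (no-gesture score, gesture score)."""
    hand_frames = sum(1 for f in window if f.get("hand"))
    return (N - hand_frames, hand_frames)

window = deque(maxlen=N)  # oldest frames fall out automatically

stream = [{"hand": i >= 3} for i in range(12)]  # toy "video stream"
scores = []
for frame in stream:
    window.append(frame)
    if len(window) == N:          # only score full windows
        scores.append(detect(window))

print(scores[0])   # first full window covers frames 0..7, five with a hand
```

Using `deque(maxlen=N)` means each new frame silently evicts the oldest one, so the window always holds the n most recent frames.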
Further, the specific method of step S2 is:
Let s1_j and s0_j be the scores that a gesture is present and absent, respectively, in the j-th previous sliding window, and let w_j be the weight corresponding to the j-th previous window, computed by the filter's weighting formula; the filter records a fixed number of historical results and combines the weighted scores into the filter score s_f. If s_f > 3, the detector is considered to have detected a gesture.
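The exact weighting formula and score combination appear only as images in the original and are not recoverable from the text, so the sketch below substitutes an illustrative decaying weight w_j = 1/(j+1) and a weighted sum of (s1_j − s0_j); only the threshold s_f > 3 is taken from the description.

```python
from collections import deque

class DetectionFilter:
    """Illustrative smoothing filter over recent detection scores.

    The patent's weight formula and score combination are not recoverable
    from the text: w_j = 1 / (j + 1) and the weighted sum of (s1 - s0)
    below are assumptions. Only the threshold s_f > 3 comes from the text.
    """

    def __init__(self, history: int = 8, threshold: float = 3.0):
        self.scores = deque(maxlen=history)  # (s0, s1) pairs, newest first
        self.threshold = threshold

    def update(self, s0: float, s1: float) -> bool:
        self.scores.appendleft((s0, s1))
        s_f = sum((s1_j - s0_j) / (j + 1)
                  for j, (s0_j, s1_j) in enumerate(self.scores))
        return s_f > self.threshold  # True -> gesture considered detected

f = DetectionFilter()
print(f.update(0.0, 1.0))  # one weak positive stays below threshold -> False
for _ in range(6):
    f.update(0.0, 8.0)     # sustained strong positives push s_f up
print(f.update(0.0, 8.0))  # -> True
```

Because the score decays over several windows rather than dropping instantly, a brief exit of the hand from the frame does not immediately toggle the detection state, which is the robustness property the description claims.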
Further, step S3 specifically includes the following steps:
step S31: constructing a gesture classification network;
step S32: let m be the sliding window size of the gesture classification network and C = {d_1, d_2, …, d_m} the set of image frames in the sliding window of the gesture classification network; when the output of the filter indicates that a gesture is present, the data in the sliding window of the gesture classification network is input into the gesture classification network, where CLA(·) is the gesture classification network model and fea = CLA(C) is the feature extracted by the gesture classification network;
step S33: the features fea extracted by the gesture classification network are passed in turn through an average pooling layer and a fully connected layer FULL to obtain a score score_a for each category, where a denotes the gesture category and score_a is the score of the gesture of category a; the classification probabilities P_a are then obtained through a Softmax activation function, P_a = exp(score_a) / Σ_b exp(score_b), where the sum runs over the class gesture categories and class is the number of gesture categories; the category with the largest classification probability P_a is output as the prediction result.
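The Softmax step of S33 can be written out directly in pure Python. The per-category scores below are illustrative values, not outputs of the actual network.

```python
import math

def softmax(scores):
    """Softmax over per-category scores: P_a = exp(score_a) / sum_b exp(score_b)."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for class = 4 gesture categories (assumed values).
score = [2.0, 0.5, 0.1, 1.0]
P = softmax(score)

# The category with the largest probability is the prediction.
prediction = max(range(len(P)), key=lambda a: P[a])
print(prediction)                # 0
print(abs(sum(P) - 1.0) < 1e-9)  # True: probabilities sum to 1
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores, a standard trick when implementing Softmax.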
Further, step S4 specifically includes the following steps:
step S41: if the timestamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, the current result is taken as the final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the last classification result is less than 0.75 seconds, the difference conf = P_max - P_second between the largest and second-largest classification probabilities is calculated, where P_max is the largest classification probability and P_second is the second-largest classification probability; if conf > 0.15, the category with the largest classification probability is output as the classification result, and if conf < 0.15 no classification result is output.
Compared with the prior art, the invention and the preferred scheme thereof have the following beneficial effects:
1. The method can effectively recognize dynamic gestures in video and improves the accuracy of gesture recognition.
2. The method reduces vanishing gradients during model training, so that even with a deep network structure, training converges quickly and accurately.
3. Compared with the traditional Resnet network structure, the proposed network builds a multi-feature extraction network by varying the convolution kernel size, extracting features at different scales and improving recognition accuracy for hands at different distances.
4. To address the problem that a single gesture may cause the network to output multiple classification results, the invention post-processes the output of the classification network, ensuring that a large number of classification results are not output in a short time and making the network better suited to practical application.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic diagram of the overall process steps of an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, and components, and/or combinations thereof.
As shown in fig. 1, the present invention provides a real-time gesture recognition method based on improved Resnet, which includes the following steps:
step S1: the video stream is fed through a sliding window into a gesture detection network, which outputs whether a gesture is detected;
step S2: the detection result is passed through a filter, which combines historical information to output a final detection result;
step S3: if the output of the filter indicates that a gesture is detected, the video stream in the sliding window is input into a gesture classification network, which outputs a classification result;
step S4: the classification result is filtered, and only the classification result meeting the condition is output.
In this embodiment, step S1 specifically includes the following steps:
step S11: acquiring a public gesture recognition training set Jester data set from a network, and acquiring related labels of training data;
step S12: the length n of the sliding window of the gesture detection network is set to 8. The first feature extraction network is improved on the basis of Resnet10: the first 7 x 7 convolution kernel is changed to 5 x 5 with stride 1, and the stride of the first convolution in the third residual block is changed to 1, which improves detection accuracy for gestures at longer distances. The second feature extraction network is also improved on the basis of Resnet10: the first 7 x 7 convolution kernel is changed to 9 x 9 with stride 4, the third residual block is changed from a bottleneck residual block to a basic residual block, and the convolution kernels after the first convolution kernel in the whole network are changed to 5 x 5 with stride 3, which improves detection accuracy when the gesture is close. The outputs of the two networks are concatenated and passed through the basic residual blocks of the two Resnet10 networks and an average pooling layer with stride 2 to obtain the gesture feature t;
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th frame in the sliding window; DET(·) is the Resnet neural network model for detecting gestures, and t = DET(D) is the feature corresponding to the video in the current sliding window. The feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score that no gesture is present and s_1 is the score that a gesture is present.
In this embodiment, the specific method of step S2 is as follows:
Let s1_j and s0_j be the scores that a gesture is present and absent, respectively, in the j-th previous sliding window, and let w_j be the weight corresponding to the j-th previous window, computed by the filter's weighting formula; the filter records a fixed number of historical results and combines the weighted scores into the filter score s_f. If s_f > 3, the detector is considered to have detected a gesture. The advantage of this is that the reliability of the whole system is maintained when the gesture briefly leaves the picture.
In the present embodiment, step S3 includes the following steps:
step S31: on the basis of Resnet101, the gesture classification network connects the features extracted by the second bottleneck residual block of Resnet101 with the features extracted by the gesture detection network in step S1 to obtain the structure of the gesture classification network.
Step S32: let m be the sliding window size of the gesture classification network, and C ═ d1,d2,…,dmAnd when the output of the filter indicates that a gesture exists, inputting data in the sliding window of the gesture classification network into the gesture classification network, wherein CLA (·) is a Resnet neural network model for classifying the gesture, and fea ═ CLA (c) is a feature extracted by the gesture classification network.
Step S33: the features fea extracted by the gesture classification network sequentially pass through an average pooling layer and a FULL connection layer FULL to obtain a score of each categoryaA denotes the category of the gesture, scoreaRepresenting the score of a gesture of category a. Obtaining various classification probabilities P through a Softmax activation functionaThe calculation formula isWhere class represents the number of categories of the gesture.
In this embodiment, step S4 specifically includes the following steps:
step S41: if the timestamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, the current result is taken as the final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the last classification result is less than 0.75 seconds, the difference conf = P_max - P_second between the largest and second-largest classification probabilities is calculated, where P_max is the largest classification probability and P_second is the second-largest classification probability; if conf > 0.15, the category with the largest classification probability is output as the classification result, and if conf < 0.15 no classification result is output.
This patent is not limited to the above preferred embodiments; other variants of the real-time gesture recognition method based on the improved Resnet may be derived in light of this patent, and all equivalent changes and modifications within the scope of the claims of the present invention shall fall within the scope of protection of this patent.
Claims (5)
1. A real-time gesture recognition method based on improved Resnet is characterized by comprising the following steps:
step S1: the video stream is fed through a sliding window into a gesture detection network, which outputs whether a gesture is detected;
step S2: the detection result is passed through a filter, which combines historical information to output a final detection result;
step S3: if the output of the filter indicates that a gesture is detected, the video stream in the sliding window is input into a gesture classification network, which outputs a classification result;
step S4: the classification result is filtered, and the classification result meeting the condition is output;
in step S1, the feature extraction network employed includes a first improved Resnet10 and a second improved Resnet10;
the first improved Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 5 x 5 with stride 1, and changes the stride of the first convolution in the third residual block to 1;
the second improved Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 9 x 9 with stride 4, changes the third residual block from a bottleneck residual block to a basic residual block, and changes the convolution kernels after the first convolution kernel in the whole network to 5 x 5 with stride 3;
the outputs of the first improved Resnet10 and the second improved Resnet10 are concatenated and passed through the basic residual blocks of the two improved Resnet10 networks and an average pooling layer with stride 2 to obtain the gesture feature t;
in step S3, the gesture classification network is built on the basis of Resnet101: the features extracted by the second bottleneck residual block of Resnet101 are connected with the features extracted by the gesture detection network to obtain the structure of the gesture classification network.
2. The improved Resnet-based real-time gesture recognition method of claim 1, wherein:
step S1 specifically includes the following steps:
step S11: selecting a gesture recognition training set Jester as a data set, and obtaining related labels of training data;
step S12: setting the length n of a sliding window of a gesture detection network to be 8, and performing gesture detection to obtain a gesture feature t;
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th frame in the sliding window; DET(·) is the gesture detection network model, and t = DET(D) is the feature corresponding to the video in the current sliding window. The feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score that no gesture is present and s_1 is the score that a gesture is present.
3. The improved Resnet-based real-time gesture recognition method of claim 2, wherein:
the specific method of step S2 is:
Let s1_j and s0_j be the scores that a gesture is present and absent, respectively, in the j-th previous sliding window, and let w_j be the weight corresponding to the j-th previous window, computed by the filter's weighting formula; the filter records a fixed number of historical results and combines the weighted scores into the filter score s_f. If s_f > 3, the detector is considered to have detected a gesture.
4. The improved Resnet-based real-time gesture recognition method of claim 3, wherein:
step S3 specifically includes the following steps:
step S31: constructing a gesture classification network;
step S32: let m be the sliding window size of the gesture classification network and C = {d_1, d_2, …, d_m} the set of image frames in the sliding window of the gesture classification network; when the output of the filter indicates that a gesture is present, the data in the sliding window of the gesture classification network is input into the gesture classification network, where CLA(·) is the gesture classification network model and fea = CLA(C) is the feature extracted by the gesture classification network;
step S33: the features fea extracted by the gesture classification network are passed in turn through an average pooling layer and a fully connected layer FULL to obtain a score score_a for each category, where a denotes the gesture category and score_a is the score of the gesture of category a; the classification probabilities P_a are then obtained through a Softmax activation function, P_a = exp(score_a) / Σ_b exp(score_b), where the sum runs over the class gesture categories and class is the number of gesture categories; the category with the largest classification probability P_a is output as the prediction result.
5. The improved Resnet-based real-time gesture recognition method of claim 4, wherein:
step S4 specifically includes the following steps:
step S41: if the timestamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, the current result is taken as the final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the last classification result is less than 0.75 seconds, the difference conf = P_max - P_second between the largest and second-largest classification probabilities is calculated, where P_max is the largest classification probability and P_second is the second-largest classification probability; if conf > 0.15, the category with the largest classification probability is output as the classification result, and if conf < 0.15 no classification result is output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110722834.3A CN113435340B (en) | 2021-06-29 | 2021-06-29 | Real-time gesture recognition method based on improved Resnet |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110722834.3A CN113435340B (en) | 2021-06-29 | 2021-06-29 | Real-time gesture recognition method based on improved Resnet |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113435340A true CN113435340A (en) | 2021-09-24 |
CN113435340B CN113435340B (en) | 2022-06-10 |
Family
ID=77757385
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110722834.3A Active CN113435340B (en) | 2021-06-29 | 2021-06-29 | Real-time gesture recognition method based on improved Resnet |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113435340B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108052884A (en) * | 2017-12-01 | 2018-05-18 | 华南理工大学 | A kind of gesture identification method based on improvement residual error neutral net |
CN111209885A (en) * | 2020-01-13 | 2020-05-29 | 腾讯科技(深圳)有限公司 | Gesture information processing method and device, electronic equipment and storage medium |
WO2020244071A1 (en) * | 2019-06-06 | 2020-12-10 | 平安科技(深圳)有限公司 | Neural network-based gesture recognition method and apparatus, storage medium, and device |
CN112507898A (en) * | 2020-12-14 | 2021-03-16 | 重庆邮电大学 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108052884A (en) * | 2017-12-01 | 2018-05-18 | 华南理工大学 | A kind of gesture identification method based on improvement residual error neutral net |
WO2020244071A1 (en) * | 2019-06-06 | 2020-12-10 | 平安科技(深圳)有限公司 | Neural network-based gesture recognition method and apparatus, storage medium, and device |
CN111209885A (en) * | 2020-01-13 | 2020-05-29 | 腾讯科技(深圳)有限公司 | Gesture information processing method and device, electronic equipment and storage medium |
CN112507898A (en) * | 2020-12-14 | 2021-03-16 | 重庆邮电大学 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
Non-Patent Citations (3)
Title |
---|
LI, LIANWEI et al.: "Real-time one-shot learning gesture recognition based on lightweight 3D Inception-ResNet with separable convolutions", Pattern Analysis and Applications *
GUAN, WEI et al.: "Gesture recognition network based on convolutional neural networks", Journal of Xi'an University of Posts and Telecommunications *
XIONG, CAIHUA: "Research and application of gesture recognition algorithm based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *
Also Published As
Publication number | Publication date |
---|---|
CN113435340B (en) | 2022-06-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | On the integration of grounding language and learning objects | |
WO2017079522A1 (en) | Subcategory-aware convolutional neural networks for object detection | |
CN110308795B (en) | Dynamic gesture recognition method and system | |
CN105160318A (en) | Facial expression based lie detection method and system | |
JP2004054956A (en) | Face detection method and system using pattern sorter learned by face/analogous face image | |
CN107742095A (en) | Chinese sign Language Recognition Method based on convolutional neural networks | |
Haque et al. | Two-handed bangla sign language recognition using principal component analysis (PCA) and KNN algorithm | |
CN112801000B (en) | Household old man falling detection method and system based on multi-feature fusion | |
Harini et al. | Sign language translation | |
Shinde et al. | Real time two way communication approach for hearing impaired and dumb person based on image processing | |
CN109101108A (en) | Method and system based on three decision optimization intelligence cockpit human-computer interaction interfaces | |
CN111652017A (en) | Dynamic gesture recognition method and system | |
CN113111968A (en) | Image recognition model training method and device, electronic equipment and readable storage medium | |
Patel et al. | Hand gesture recognition system using convolutional neural networks | |
CN113255557A (en) | Video crowd emotion analysis method and system based on deep learning | |
CN108537109B (en) | OpenPose-based monocular camera sign language identification method | |
CN112926522A (en) | Behavior identification method based on skeleton attitude and space-time diagram convolutional network | |
Koli et al. | Human action recognition using deep neural networks | |
CN115797827A (en) | ViT human body behavior identification method based on double-current network architecture | |
CN116312512A (en) | Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device | |
Singh et al. | Feature based method for human facial emotion detection using optical flow based analysis | |
Gupta et al. | Progression modelling for online and early gesture detection | |
CN114419480A (en) | Multi-person identity and action association identification method and device and readable medium | |
CN111950452A (en) | Face recognition method | |
CN113435340B (en) | Real-time gesture recognition method based on improved Resnet |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||