CN113435340A - Real-time gesture recognition method based on improved Resnet - Google Patents

Real-time gesture recognition method based on improved Resnet

Info

Publication number
CN113435340A
CN113435340A
Authority
CN
China
Prior art keywords
gesture
classification
network
result
sliding window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110722834.3A
Other languages
Chinese (zh)
Other versions
CN113435340B (en)
Inventor
柯逍
卞永亨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202110722834.3A priority Critical patent/CN113435340B/en
Publication of CN113435340A publication Critical patent/CN113435340A/en
Application granted granted Critical
Publication of CN113435340B publication Critical patent/CN113435340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a real-time gesture recognition method based on improved Resnet, which comprises the following steps: step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network; step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information; step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network; step S4: and filtering the classification result, and outputting the classification result meeting the condition. The method can effectively identify the gestures in the video.

Description

Real-time gesture recognition method based on improved Resnet
Technical Field
The invention belongs to the technical field of pattern recognition and computer vision, and particularly relates to a real-time gesture recognition method based on improved Resnet.
Background
Gestures are a natural and comfortable mode of human-computer interaction, and are already applied in many areas of daily life, such as sign language recognition and device control. As neural network technology has matured, computer-vision-based gesture recognition has become a research hotspot. In practical applications, recognizing gestures from a video stream while balancing accuracy against the real-time requirements of the system makes real-time gesture recognition even harder. Although gesture recognition technology has made great progress, many challenges remain in real environments, where factors such as lighting and distance affect recognition performance.
Disclosure of Invention
To address this gap in the prior art, the invention provides a real-time gesture recognition method based on improved Resnet, which comprises the following steps: step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network; step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information; step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network; step S4: filtering the classification result, and outputting only the classification results meeting the condition. The method can effectively identify the gestures in the video.
The invention specifically adopts the following technical scheme:
a real-time gesture recognition method based on improved Resnet is characterized by comprising the following steps:
step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network;
step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information;
step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network;
step S4: filtering the classification result, and outputting the classification result meeting the condition;
in step S1, the feature extraction network employed includes a first modified Resnet10 and a second modified Resnet10;
said first modified Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 5 x 5, step size to 1, and the step size of the first convolution in the third residual block to 1;
the second improved Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 9 x 9, the step size to 4, changes the third residual block from a bottleneck type residual block to a basic residual block, changes the convolution kernels after the first convolution kernel in the whole network to 5 x 5, and changes the step size to 3;
the outputs of the first improved Resnet10 and the second improved Resnet10 are concatenated and passed through the basic residual blocks of the two improved Resnet10 networks and an average pooling layer with a step size of 2 to obtain the gesture feature t;
in step S3, the gesture classification network is built on the basis of Resnet101: the features extracted by the second bottleneck residual block of Resnet101 are connected with the features extracted by the gesture detection network to obtain the structure of the gesture classification network.
Further, step S1 specifically includes the following steps:
step S11: selecting a gesture recognition training set Jester as a data set, and obtaining related labels of training data;
step S12: setting the length n of a sliding window of a gesture detection network to be 8, and performing gesture detection to obtain a gesture feature t;
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th image frame in the sliding window; DET(·) is the gesture detection network model, and t = DET(D) is the feature corresponding to the video in the current sliding window; the feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score for the absence of a gesture and s_1 is the score for the presence of a gesture.
Further, the specific method of step S2 is:
Let s_1^j and s_0^j be the scores for the presence and absence of a gesture, respectively, in the j-th most recent sliding window, and let w_j be the weight corresponding to the j-th window, computed as

[w_j formula reproduced only as an image in the original]

The filter keeps a fixed number of history records, and its score s_f is the w_j-weighted combination of the recorded scores:

[s_f formula reproduced only as an image in the original]

If s_f > 3, the detector is considered to have detected a gesture.
Further, step S3 specifically includes the following steps:
step S31: constructing a gesture classification network;
step S32: let m be the sliding window length of the gesture classification network and C = {d_1, d_2, …, d_m} the set of image frames in the classification network's sliding window; when the output of the filter indicates that a gesture is present, the data in the sliding window of the gesture classification network are input into the gesture classification network; CLA(·) is the gesture classification network model, and fea = CLA(C) are the features extracted by the gesture classification network;
step S33: the features fea extracted by the gesture classification network pass in turn through an average pooling layer and a fully connected layer FULL to obtain a score score_a for each category, where a denotes the gesture category and score_a is the score of a gesture of category a; the classification probabilities P_a are obtained through a Softmax activation function, calculated as

P_a = e^(score_a) / Σ_{i=1}^{class} e^(score_i)

where class represents the number of gesture categories; the class with the largest classification probability P_a is output as the prediction result.
Further, step S4 specifically includes the following steps:
step S41: if the time stamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, taking the current result as a final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the previous classification result is less than 0.75 seconds, calculate the difference between the largest and second-largest classification probabilities, conf = P_max - P_second, where P_max is the largest classification probability and P_second is the second-largest; if conf > 0.15, the class with the largest classification probability is output as the classification result, and if conf < 0.15, no classification result is output.
Compared with the prior art, the invention and the preferred scheme thereof have the following beneficial effects:
1. The method can effectively recognize dynamic gestures in video and improves the accuracy of gesture recognition.
2. The method reduces gradient vanishing during model training, so that even with a deep network structure, training converges quickly and accurately.
3. Compared with the traditional Resnet structure, the proposed network builds a multi-feature extraction network by varying convolution kernel sizes, extracting features at different scales and improving recognition accuracy for hands at different distances.
4. To address the problem that a single gesture can cause the network to output multiple classification results, the invention post-processes the classifier output, ensuring that large numbers of classification results are not emitted in a short time and making the network better suited to practical applications.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic diagram of the overall process steps of an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present invention provides a real-time gesture recognition method based on improved Resnet, which includes the following steps:
step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network;
step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information;
step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network;
step S4: and filtering the classification result, and outputting only the classification result meeting the condition.
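To make the four steps concrete, the following is a minimal Python sketch of the overall recognition loop. The detector, det_filter, classifier, and post_filter objects are hypothetical and are sketched under the corresponding steps below; the classification window length m is not given in the text, so 16 is a placeholder.

```python
from collections import deque

def run_pipeline(stream, detector, det_filter, classifier, post_filter,
                 n=8, m=16):
    """stream yields (timestamp_seconds, frame) pairs."""
    det_window = deque(maxlen=n)   # sliding window for the detection network
    cls_window = deque(maxlen=m)   # sliding window for the classification network
    for ts, frame in stream:
        det_window.append(frame)
        cls_window.append(frame)
        if len(det_window) < n:
            continue                                  # window not yet full
        s0, s1 = detector(list(det_window))           # step S1: detection scores
        if det_filter.update(s0, s1):                 # step S2: history filter
            probs = classifier(list(cls_window))      # step S3: classification
            result = post_filter.update(probs, ts)    # step S4: post-processing
            if result is not None:
                yield ts, result
```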
In this embodiment, step S1 specifically includes the following steps:
step S11: acquire the public Jester gesture recognition dataset from the Internet as the training set, and obtain the corresponding labels of the training data;
step S12: the length n of the sliding window of the gesture detection network is set to 8. The first feature extraction network is improved on the basis of Resnet10: the first 7 x 7 convolution kernel is changed to 5 x 5 with a step size of 1, and the step size of the first convolution in the third residual block is changed to 1, which improves detection accuracy for gestures at longer distances. The second feature extraction network is also improved on the basis of Resnet10: the first 7 x 7 convolution kernel is changed to 9 x 9 with a step size of 4, the third residual block is changed from a bottleneck residual block to a basic residual block, and the convolution kernels after the first one are changed to 5 x 5 with a step size of 3, which improves detection accuracy when the gesture is close. The outputs of the two networks are concatenated and passed through the basic residual blocks of the two Resnet10 networks and an average pooling layer with a step size of 2 to obtain the gesture feature t;
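A simplified PyTorch sketch of this dual-branch feature extractor follows. It uses 2-D convolutions over single frames for brevity (the patent's detector consumes the whole 8-frame window), and the channel widths, stage layout, and the adaptive pooling used to align the two branches before concatenation are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Standard basic residual block: two convolutions plus a skip path."""
    def __init__(self, in_ch, out_ch, stride=1, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel, stride, pad, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel, 1, pad, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = None
        if stride != 1 or in_ch != out_ch:
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = self.down(x) if self.down is not None else x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

class DualBranchExtractor(nn.Module):
    """Two modified Resnet10 branches with different receptive fields:
    branch 1 (5x5 stem, stride 1) favours distant/small hands, branch 2
    (9x9 stem, stride 4, 5x5 body convolutions) favours close/large hands."""
    def __init__(self):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            BasicBlock(64, 128, stride=2),
            BasicBlock(128, 256, stride=1),   # third block: stride set to 1
            nn.AdaptiveAvgPool2d((8, 8)))
        self.branch2 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, stride=4, padding=4, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            BasicBlock(64, 128, stride=3, kernel=5),
            BasicBlock(128, 256, stride=1, kernel=5),  # basic, not bottleneck
            nn.AdaptiveAvgPool2d((8, 8)))
        self.fuse = nn.Sequential(
            BasicBlock(512, 512),                 # basic residual block after fusion
            nn.AvgPool2d(kernel_size=2, stride=2))  # average pooling, step size 2

    def forward(self, x):
        f = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return self.fuse(f)   # gesture feature t
```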
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th image frame in the sliding window; DET(·) is the Resnet neural network model used for gesture detection, and t = DET(D) is the feature corresponding to the video in the current sliding window; the feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score for the absence of a gesture and s_1 is the score for the presence of a gesture.
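A matching sketch of the final fully connected layer W that turns the gesture feature t into the two scores; the feature dimension follows the extractor sketch above and is an assumption.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Final fully connected layer W: gesture feature t -> (s_0, s_1)."""
    def __init__(self, feat_dim=512 * 4 * 4):  # matches the sketch above
        super().__init__()
        self.W = nn.Linear(feat_dim, 2)

    def forward(self, t):
        s = self.W(torch.flatten(t, start_dim=1))
        return s[:, 0], s[:, 1]  # s_0: no-gesture score, s_1: gesture score
```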
In this embodiment, the specific method of step S2 is as follows:
Let s_1^j and s_0^j be the scores for the presence and absence of a gesture, respectively, in the j-th most recent sliding window, and let w_j be the weight corresponding to the j-th window, computed as

[w_j formula reproduced only as an image in the original]

The filter keeps a fixed number of history records, and its score s_f is the w_j-weighted combination of the recorded scores:

[s_f formula reproduced only as an image in the original]

If s_f > 3, the detector is considered to have detected a gesture. The advantage of this design is that the reliability of the whole system is maintained when the gesture briefly leaves the frame.
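A sketch of the detection filter follows. The weight formula and the aggregation of historical scores appear only as images in the source, so the linearly increasing weights and the history length k below are assumptions that merely match the stated intent (recent windows combined with weighted history, threshold s_f > 3).

```python
from collections import deque

class DetectionFilter:
    """Smooths raw detector scores over the last k sliding windows."""
    def __init__(self, k=4, threshold=3.0):
        self.history = deque(maxlen=k)  # k: history length (assumed value)
        self.threshold = threshold      # the patent's stated threshold s_f > 3

    def update(self, s0, s1):
        self.history.append(s1)         # record the presence score s_1
        k = len(self.history)
        # Assumed weighting: linearly increasing, so recent windows count more.
        weights = [(j + 1) / k for j in range(k)]
        s_f = sum(w * s for w, s in zip(weights, self.history))
        return s_f > self.threshold     # True: the detector reports a gesture
```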
In the present embodiment, step S3 includes the following steps:
step S31: the gesture classification network is constructed on the basis of Resnet101: the features extracted by the second bottleneck residual block of Resnet101 are connected with the features extracted by the gesture detection network in step S1 to obtain the structure of the gesture classification network.
Step S32: let m be the sliding window size of the gesture classification network, and C ═ d1,d2,…,dmAnd when the output of the filter indicates that a gesture exists, inputting data in the sliding window of the gesture classification network into the gesture classification network, wherein CLA (·) is a Resnet neural network model for classifying the gesture, and fea ═ CLA (c) is a feature extracted by the gesture classification network.
Step S33: the features fea extracted by the gesture classification network sequentially pass through an average pooling layer and a FULL connection layer FULL to obtain a score of each categoryaA denotes the category of the gesture, scoreaRepresenting the score of a gesture of category a. Obtaining various classification probabilities P through a Softmax activation functionaThe calculation formula is
Figure BDA0003137210800000055
Where class represents the number of categories of the gesture.
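A sketch of the classification head of step S33: average pooling, the fully connected layer FULL, and Softmax giving P_a. The input channel width is a placeholder, while 27 matches the number of gesture classes in the Jester dataset.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Average pooling + fully connected layer FULL + Softmax (step S33)."""
    def __init__(self, feat_ch=1024, num_classes=27):
        # feat_ch is a placeholder for the channel width of fea;
        # 27 is the number of gesture classes in Jester.
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.full = nn.Linear(feat_ch, num_classes)

    def forward(self, fea):
        scores = self.full(torch.flatten(self.pool(fea), 1))  # score_a per class
        return torch.softmax(scores, dim=-1)                  # P_a per class
```

Calling `probs.argmax(dim=-1)` on the output then yields the category with the largest P_a as the prediction.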
In step S4, the method specifically includes the following steps:
step S41: if the time stamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, taking the current result as a final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the previous classification result is less than 0.75 seconds, calculate the difference between the largest and second-largest classification probabilities, conf = P_max - P_second, where P_max is the largest classification probability and P_second is the second-largest; if conf > 0.15, the class with the largest classification probability is output as the classification result, and if conf < 0.15, no classification result is output.
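The post-processing of steps S41 and S42 is fully specified and can be sketched directly; the behaviour at conf exactly equal to 0.15 is left undefined by the text, and the sketch suppresses the output in that case.

```python
class PostFilter:
    """Steps S41/S42: suppress rapid duplicate outputs of the classifier."""
    def __init__(self, min_interval=0.75, margin=0.15):
        self.min_interval = min_interval  # seconds between unconditional outputs
        self.margin = margin              # required gap P_max - P_second
        self.last_time = None             # timestamp of the last emitted result

    def update(self, probs, timestamp):
        best = max(range(len(probs)), key=probs.__getitem__)
        # Step S41: enough time has passed -> take the current result as final.
        if self.last_time is None or timestamp - self.last_time >= self.min_interval:
            self.last_time = timestamp
            return best
        # Step S42: within 0.75 s, emit only a sufficiently confident result.
        ranked = sorted(probs, reverse=True)
        conf = ranked[0] - ranked[1]      # P_max - P_second
        if conf > self.margin:
            self.last_time = timestamp    # assumption: only emitted results reset the clock
            return best
        return None                       # suppress the low-confidence repeat
```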
This patent is not limited to the above preferred embodiments; other real-time gesture recognition methods based on improved Resnet may be derived in light of this patent, and all equivalent changes and modifications made within the claims of the present invention shall fall within the protection scope of this patent.

Claims (5)

1. A real-time gesture recognition method based on improved Resnet is characterized by comprising the following steps:
step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network;
step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information;
step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network;
step S4: filtering the classification result, and outputting the classification result meeting the condition;
in step S1, the feature extraction network employed includes a first modified Resnet10 and a second modified Resnet10;
said first modified Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 5 x 5, step size to 1, and the step size of the first convolution in the third residual block to 1;
the second improved Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 9 x 9, the step size to 4, changes the third residual block from a bottleneck type residual block to a basic residual block, changes the convolution kernels after the first convolution kernel in the whole network to 5 x 5, and changes the step size to 3;
the outputs of the first improved Resnet10 and the second improved Resnet10 are concatenated and passed through the basic residual blocks of the two improved Resnet10 networks and an average pooling layer with a step size of 2 to obtain the gesture feature t;
in step S3, the gesture classification network is built on the basis of Resnet101: the features extracted by the second bottleneck residual block of Resnet101 are connected with the features extracted by the gesture detection network to obtain the structure of the gesture classification network.
2. The improved Resnet-based real-time gesture recognition method of claim 1, wherein:
step S1 specifically includes the following steps:
step S11: selecting a gesture recognition training set Jester as a data set, and obtaining related labels of training data;
step S12: setting the length n of a sliding window of a gesture detection network to be 8, and performing gesture detection to obtain a gesture feature t;
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th image frame in the sliding window; DET(·) is the gesture detection network model, and t = DET(D) is the feature corresponding to the video in the current sliding window; the feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score for the absence of a gesture and s_1 is the score for the presence of a gesture.
3. The improved Resnet-based real-time gesture recognition method of claim 2, wherein:
the specific method of step S2 is:
Let s_1^j and s_0^j be the scores for the presence and absence of a gesture, respectively, in the j-th most recent sliding window, and let w_j be the weight corresponding to the j-th window, computed as

[w_j formula reproduced only as an image in the original]

The filter keeps a fixed number of history records, and its score s_f is the w_j-weighted combination of the recorded scores:

[s_f formula reproduced only as an image in the original]

If s_f > 3, the detector is considered to have detected a gesture.
4. The improved Resnet-based real-time gesture recognition method of claim 3, wherein:
step S3 specifically includes the following steps:
step S31: constructing a gesture classification network;
step S32: let m be the sliding window length of the gesture classification network and C = {d_1, d_2, …, d_m} the set of image frames in the classification network's sliding window; when the output of the filter indicates that a gesture is present, the data in the sliding window of the gesture classification network are input into the gesture classification network; CLA(·) is the gesture classification network model, and fea = CLA(C) are the features extracted by the gesture classification network;
step S33: the features fea extracted by the gesture classification network pass in turn through an average pooling layer and a fully connected layer FULL to obtain a score score_a for each category, where a denotes the gesture category and score_a is the score of a gesture of category a; the classification probabilities P_a are obtained through a Softmax activation function, calculated as

P_a = e^(score_a) / Σ_{i=1}^{class} e^(score_i)

where class represents the number of gesture categories; the class with the largest classification probability P_a is output as the prediction result.
5. The improved Resnet-based real-time gesture recognition method of claim 4, wherein:
in step S4, the method specifically includes the following steps:
step S41: if the time stamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, taking the current result as a final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the previous classification result is less than 0.75 seconds, calculate the difference between the largest and second-largest classification probabilities, conf = P_max - P_second, where P_max is the largest classification probability and P_second is the second-largest; if conf > 0.15, the class with the largest classification probability is output as the classification result, and if conf < 0.15, no classification result is output.
CN202110722834.3A 2021-06-29 2021-06-29 Real-time gesture recognition method based on improved Resnet Active CN113435340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110722834.3A CN113435340B (en) 2021-06-29 2021-06-29 Real-time gesture recognition method based on improved Resnet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110722834.3A CN113435340B (en) 2021-06-29 2021-06-29 Real-time gesture recognition method based on improved Resnet

Publications (2)

Publication Number Publication Date
CN113435340A true CN113435340A (en) 2021-09-24
CN113435340B CN113435340B (en) 2022-06-10

Family

ID=77757385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110722834.3A Active CN113435340B (en) 2021-06-29 2021-06-29 Real-time gesture recognition method based on improved Resnet

Country Status (1)

Country Link
CN (1) CN113435340B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052884A (en) * 2017-12-01 2018-05-18 South China University of Technology A gesture recognition method based on an improved residual neural network
WO2020244071A1 (en) * 2019-06-06 2020-12-10 Ping An Technology (Shenzhen) Co., Ltd. Neural network-based gesture recognition method and apparatus, storage medium, and device
CN111209885A (en) * 2020-01-13 2020-05-29 Tencent Technology (Shenzhen) Co., Ltd. Gesture information processing method and device, electronic equipment and storage medium
CN112507898A (en) * 2020-12-14 2021-03-16 Chongqing University of Posts and Telecommunications Multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and TCN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li, Lianwei et al.: "Real-time one-shot learning gesture recognition based on lightweight 3D Inception-ResNet with separable convolutions", Pattern Analysis and Applications *
Guan Wei et al.: "Gesture recognition network based on convolutional neural networks", Journal of Xi'an University of Posts and Telecommunications *
Xiong Caihua: "Research and application of gesture recognition algorithms based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN113435340B (en) 2022-06-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant