CN113435340A - Real-time gesture recognition method based on improved Resnet - Google Patents

Real-time gesture recognition method based on improved Resnet

Info

Publication number
CN113435340A
CN113435340A
Authority
CN
China
Prior art keywords
gesture
classification
network
result
sliding window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110722834.3A
Other languages
Chinese (zh)
Other versions
CN113435340B (en)
Inventor
柯逍
卞永亨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202110722834.3A priority Critical patent/CN113435340B/en
Publication of CN113435340A publication Critical patent/CN113435340A/en
Application granted granted Critical
Publication of CN113435340B publication Critical patent/CN113435340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a real-time gesture recognition method based on improved Resnet, which comprises the following steps: step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network; step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information; step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network; step S4: and filtering the classification result, and outputting the classification result meeting the condition. The method can effectively identify the gestures in the video.

Description

Real-time gesture recognition method based on improved Resnet
Technical Field
The invention belongs to the technical field of pattern recognition and computer vision, and particularly relates to a real-time gesture recognition method based on improved Resnet.
Background
Gestures are a natural and comfortable mode of human-computer interaction, and are already applied in many areas of daily life, such as sign language recognition and device control. As neural network technology has matured, computer-vision-based gesture recognition has become a research hotspot. In practical applications, recognizing gestures from a video stream while balancing accuracy against the real-time requirements of the system makes real-time gesture recognition even harder. Although gesture recognition technology has made great progress, many challenges remain in real environments, where factors such as lighting and distance affect recognition performance.
Disclosure of Invention
To address this gap in the prior art, the invention provides a real-time gesture recognition method based on improved Resnet, which comprises the following steps: step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network; step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information; step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network; step S4: filtering the classification result, and outputting only the classification results meeting the condition. The method can effectively identify the gestures in the video.
The invention specifically adopts the following technical scheme:
a real-time gesture recognition method based on improved Resnet is characterized by comprising the following steps:
step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network;
step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information;
step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network;
step S4: filtering the classification result, and outputting the classification result meeting the condition;
in step S1, the feature extraction network employed includes a first modified Resnet10 and a second modified Resnet10;
said first modified Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 5 x 5, step size to 1, and the step size of the first convolution in the third residual block to 1;
the second improved Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 9 x 9, the step size to 4, changes the third residual block from a bottleneck type residual block to a basic residual block, changes the convolution kernels after the first convolution kernel in the whole network to 5 x 5, and changes the step size to 3;
the outputs of the first improved Resnet10 and the second improved Resnet10 are concatenated and passed through the basic residual blocks of the two improved Resnet10 networks and an average pooling layer with a step size of 2 to obtain the gesture feature t;
in step S3, the gesture classification network is built on the basis of Resnet101: the features extracted by the second bottleneck residual block of Resnet101 are connected with the features extracted by the gesture detection network to obtain the structure of the gesture classification network.
Further, step S1 specifically includes the following steps:
step S11: selecting a gesture recognition training set Jester as a data set, and obtaining related labels of training data;
step S12: setting the length n of a sliding window of a gesture detection network to be 8, and performing gesture detection to obtain a gesture feature t;
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th image frame in the sliding window; DET(·) is the gesture detection network model, and t = DET(D) is the feature corresponding to the video in the current sliding window; the feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score for the absence of a gesture and s_1 is the score for the presence of a gesture.
Further, the specific method of step S2 is:
Let s_1^j and s_0^j be the scores for the presence and absence of a gesture, respectively, in the j-th most recent sliding window, and let w_j be the weight corresponding to the j-th window, computed as

[w_j formula reproduced only as an image in the original]

The filter keeps a fixed number of history records, and its score s_f is the w_j-weighted combination of the recorded scores:

[s_f formula reproduced only as an image in the original]

If s_f > 3, the detector is considered to have detected a gesture.
Further, step S3 specifically includes the following steps:
step S31: constructing a gesture classification network;
step S32: let m be the sliding window length of the gesture classification network and C = {d_1, d_2, …, d_m} the set of image frames in the classification network's sliding window; when the output of the filter indicates that a gesture is present, the data in the sliding window of the gesture classification network are input into the gesture classification network; CLA(·) is the gesture classification network model, and fea = CLA(C) are the features extracted by the gesture classification network;
step S33: the features fea extracted by the gesture classification network pass in turn through an average pooling layer and a fully connected layer FULL to obtain a score score_a for each category, where a denotes the gesture category and score_a is the score of a gesture of category a; the classification probabilities P_a are obtained through a Softmax activation function, calculated as

P_a = e^(score_a) / Σ_{i=1}^{class} e^(score_i)

where class represents the number of gesture categories; the class with the largest classification probability P_a is output as the prediction result.
Further, step S4 specifically includes the following steps:
step S41: if the time stamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, taking the current result as a final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the previous classification result is less than 0.75 seconds, calculate the difference between the largest and second-largest classification probabilities, conf = P_max - P_second, where P_max is the largest classification probability and P_second is the second-largest; if conf > 0.15, the class with the largest classification probability is output as the classification result, and if conf < 0.15, no classification result is output.
Compared with the prior art, the invention and the preferred scheme thereof have the following beneficial effects:
1. The method can effectively recognize dynamic gestures in video and improves the accuracy of gesture recognition.
2. The method reduces gradient vanishing during model training, so that even with a deep network structure, training converges quickly and accurately.
3. Compared with the traditional Resnet structure, the proposed network builds a multi-feature extraction network by varying convolution kernel sizes, extracting features at different scales and improving recognition accuracy for hands at different distances.
4. To address the problem that a single gesture can cause the network to output multiple classification results, the invention post-processes the classifier output, ensuring that large numbers of classification results are not emitted in a short time and making the network better suited to practical applications.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic diagram of the overall process steps of an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present invention provides a real-time gesture recognition method based on improved Resnet, which includes the following steps:
step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network;
step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information;
step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network;
step S4: and filtering the classification result, and outputting only the classification result meeting the condition.
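To make the four steps concrete, the following is a minimal Python sketch of the overall recognition loop. The detector, det_filter, classifier, and post_filter objects are hypothetical and are sketched under the corresponding steps below; the classification window length m is not given in the text, so 16 is a placeholder.

```python
from collections import deque

def run_pipeline(stream, detector, det_filter, classifier, post_filter,
                 n=8, m=16):
    """stream yields (timestamp_seconds, frame) pairs."""
    det_window = deque(maxlen=n)   # sliding window for the detection network
    cls_window = deque(maxlen=m)   # sliding window for the classification network
    for ts, frame in stream:
        det_window.append(frame)
        cls_window.append(frame)
        if len(det_window) < n:
            continue                                  # window not yet full
        s0, s1 = detector(list(det_window))           # step S1: detection scores
        if det_filter.update(s0, s1):                 # step S2: history filter
            probs = classifier(list(cls_window))      # step S3: classification
            result = post_filter.update(probs, ts)    # step S4: post-processing
            if result is not None:
                yield ts, result
```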
In this embodiment, step S1 specifically includes the following steps:
step S11: acquire the public Jester gesture recognition dataset from the Internet as the training set, and obtain the corresponding labels of the training data;
step S12: the length n of the sliding window of the gesture detection network is set to 8. The first feature extraction network is improved on the basis of Resnet10: the first 7 x 7 convolution kernel is changed to 5 x 5 with a step size of 1, and the step size of the first convolution in the third residual block is changed to 1, which improves detection accuracy for gestures at longer distances. The second feature extraction network is also improved on the basis of Resnet10: the first 7 x 7 convolution kernel is changed to 9 x 9 with a step size of 4, the third residual block is changed from a bottleneck residual block to a basic residual block, and the convolution kernels after the first one are changed to 5 x 5 with a step size of 3, which improves detection accuracy when the gesture is close. The outputs of the two networks are concatenated and passed through the basic residual blocks of the two Resnet10 networks and an average pooling layer with a step size of 2 to obtain the gesture feature t;
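A simplified PyTorch sketch of this dual-branch feature extractor follows. It uses 2-D convolutions over single frames for brevity (the patent's detector consumes the whole 8-frame window), and the channel widths, stage layout, and the adaptive pooling used to align the two branches before concatenation are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Standard basic residual block: two convolutions plus a skip path."""
    def __init__(self, in_ch, out_ch, stride=1, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel, stride, pad, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel, 1, pad, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = None
        if stride != 1 or in_ch != out_ch:
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = self.down(x) if self.down is not None else x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

class DualBranchExtractor(nn.Module):
    """Two modified Resnet10 branches with different receptive fields:
    branch 1 (5x5 stem, stride 1) favours distant/small hands, branch 2
    (9x9 stem, stride 4, 5x5 body convolutions) favours close/large hands."""
    def __init__(self):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            BasicBlock(64, 128, stride=2),
            BasicBlock(128, 256, stride=1),   # third block: stride set to 1
            nn.AdaptiveAvgPool2d((8, 8)))
        self.branch2 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, stride=4, padding=4, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            BasicBlock(64, 128, stride=3, kernel=5),
            BasicBlock(128, 256, stride=1, kernel=5),  # basic, not bottleneck
            nn.AdaptiveAvgPool2d((8, 8)))
        self.fuse = nn.Sequential(
            BasicBlock(512, 512),                 # basic residual block after fusion
            nn.AvgPool2d(kernel_size=2, stride=2))  # average pooling, step size 2

    def forward(self, x):
        f = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return self.fuse(f)   # gesture feature t
```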
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th image frame in the sliding window; DET(·) is the Resnet neural network model used for gesture detection, and t = DET(D) is the feature corresponding to the video in the current sliding window; the feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score for the absence of a gesture and s_1 is the score for the presence of a gesture.
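A matching sketch of the final fully connected layer W that turns the gesture feature t into the two scores; the feature dimension follows the extractor sketch above and is an assumption.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Final fully connected layer W: gesture feature t -> (s_0, s_1)."""
    def __init__(self, feat_dim=512 * 4 * 4):  # matches the sketch above
        super().__init__()
        self.W = nn.Linear(feat_dim, 2)

    def forward(self, t):
        s = self.W(torch.flatten(t, start_dim=1))
        return s[:, 0], s[:, 1]  # s_0: no-gesture score, s_1: gesture score
```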
In this embodiment, the specific method of step S2 is as follows:
Let s_1^j and s_0^j be the scores for the presence and absence of a gesture, respectively, in the j-th most recent sliding window, and let w_j be the weight corresponding to the j-th window, computed as

[w_j formula reproduced only as an image in the original]

The filter keeps a fixed number of history records, and its score s_f is the w_j-weighted combination of the recorded scores:

[s_f formula reproduced only as an image in the original]

If s_f > 3, the detector is considered to have detected a gesture. The advantage of this design is that the reliability of the whole system is maintained when the gesture briefly leaves the frame.
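A sketch of the detection filter follows. The weight formula and the aggregation of historical scores appear only as images in the source, so the linearly increasing weights and the history length k below are assumptions that merely match the stated intent (recent windows combined with weighted history, threshold s_f > 3).

```python
from collections import deque

class DetectionFilter:
    """Smooths raw detector scores over the last k sliding windows."""
    def __init__(self, k=4, threshold=3.0):
        self.history = deque(maxlen=k)  # k: history length (assumed value)
        self.threshold = threshold      # the patent's stated threshold s_f > 3

    def update(self, s0, s1):
        self.history.append(s1)         # record the presence score s_1
        k = len(self.history)
        # Assumed weighting: linearly increasing, so recent windows count more.
        weights = [(j + 1) / k for j in range(k)]
        s_f = sum(w * s for w, s in zip(weights, self.history))
        return s_f > self.threshold     # True: the detector reports a gesture
```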
In the present embodiment, step S3 includes the following steps:
step S31: the gesture classification network is constructed on the basis of Resnet101: the features extracted by the second bottleneck residual block of Resnet101 are connected with the features extracted by the gesture detection network in step S1 to obtain the structure of the gesture classification network.
Step S32: let m be the sliding window size of the gesture classification network, and C ═ d1,d2,…,dmAnd when the output of the filter indicates that a gesture exists, inputting data in the sliding window of the gesture classification network into the gesture classification network, wherein CLA (·) is a Resnet neural network model for classifying the gesture, and fea ═ CLA (c) is a feature extracted by the gesture classification network.
Step S33: the features fea extracted by the gesture classification network sequentially pass through an average pooling layer and a FULL connection layer FULL to obtain a score of each categoryaA denotes the category of the gesture, scoreaRepresenting the score of a gesture of category a. Obtaining various classification probabilities P through a Softmax activation functionaThe calculation formula is
Figure BDA0003137210800000055
Where class represents the number of categories of the gesture.
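A sketch of the classification head of step S33: average pooling, the fully connected layer FULL, and Softmax giving P_a. The input channel width is a placeholder, while 27 matches the number of gesture classes in the Jester dataset.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Average pooling + fully connected layer FULL + Softmax (step S33)."""
    def __init__(self, feat_ch=1024, num_classes=27):
        # feat_ch is a placeholder for the channel width of fea;
        # 27 is the number of gesture classes in Jester.
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.full = nn.Linear(feat_ch, num_classes)

    def forward(self, fea):
        scores = self.full(torch.flatten(self.pool(fea), 1))  # score_a per class
        return torch.softmax(scores, dim=-1)                  # P_a per class
```

Calling `probs.argmax(dim=-1)` on the output then yields the category with the largest P_a as the prediction.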
In step S4, the method specifically includes the following steps:
step S41: if the time stamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, taking the current result as a final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the previous classification result is less than 0.75 seconds, calculate the difference between the largest and second-largest classification probabilities, conf = P_max - P_second, where P_max is the largest classification probability and P_second is the second-largest; if conf > 0.15, the class with the largest classification probability is output as the classification result, and if conf < 0.15, no classification result is output.
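The post-processing of steps S41 and S42 is fully specified and can be sketched directly; the behaviour at conf exactly equal to 0.15 is left undefined by the text, and the sketch suppresses the output in that case.

```python
class PostFilter:
    """Steps S41/S42: suppress rapid duplicate outputs of the classifier."""
    def __init__(self, min_interval=0.75, margin=0.15):
        self.min_interval = min_interval  # seconds between unconditional outputs
        self.margin = margin              # required gap P_max - P_second
        self.last_time = None             # timestamp of the last emitted result

    def update(self, probs, timestamp):
        best = max(range(len(probs)), key=probs.__getitem__)
        # Step S41: enough time has passed -> take the current result as final.
        if self.last_time is None or timestamp - self.last_time >= self.min_interval:
            self.last_time = timestamp
            return best
        # Step S42: within 0.75 s, emit only a sufficiently confident result.
        ranked = sorted(probs, reverse=True)
        conf = ranked[0] - ranked[1]      # P_max - P_second
        if conf > self.margin:
            self.last_time = timestamp    # assumption: only emitted results reset the clock
            return best
        return None                       # suppress the low-confidence repeat
```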
This patent is not limited to the above preferred embodiments; other real-time gesture recognition methods based on improved Resnet may be derived in light of this patent, and all equivalent changes and modifications made within the claims of the present invention shall fall within the protection scope of this patent.

Claims (5)

1. A real-time gesture recognition method based on improved Resnet is characterized by comprising the following steps:
step S1: the video stream is used as the input of a gesture detection network through a sliding window, and whether a gesture is detected or not is output by the gesture detection network;
step S2: the detection result passes through a filter, and the filter outputs a final detection result by combining historical information;
step S3: if the output of the filter indicates that the gesture is detected, inputting the video stream in the sliding window into a gesture classification network, and outputting a classification result by the gesture classification network;
step S4: filtering the classification result, and outputting the classification result meeting the condition;
in step S1, the feature extraction network employed includes a first modified Resnet10 and a second modified Resnet10;
said first modified Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 5 x 5, step size to 1, and the step size of the first convolution in the third residual block to 1;
the second improved Resnet10 changes the first 7 x 7 convolution kernel of Resnet10 to 9 x 9, the step size to 4, changes the third residual block from a bottleneck type residual block to a basic residual block, changes the convolution kernels after the first convolution kernel in the whole network to 5 x 5, and changes the step size to 3;
the outputs of the first improved Resnet10 and the second improved Resnet10 are concatenated and passed through the basic residual blocks of the two improved Resnet10 networks and an average pooling layer with a step size of 2 to obtain the gesture feature t;
in step S3, the gesture classification network is built on the basis of Resnet101: the features extracted by the second bottleneck residual block of Resnet101 are connected with the features extracted by the gesture detection network to obtain the structure of the gesture classification network.
2. The improved Resnet-based real-time gesture recognition method of claim 1, wherein:
step S1 specifically includes the following steps:
step S11: selecting a gesture recognition training set Jester as a data set, and obtaining related labels of training data;
step S12: setting the length n of a sliding window of a gesture detection network to be 8, and performing gesture detection to obtain a gesture feature t;
step S13: let D = {d_1, d_2, …, d_n} be the set of image frames in the sliding window of the gesture detection network, where d_i is the i-th image frame in the sliding window; DET(·) is the gesture detection network model, and t = DET(D) is the feature corresponding to the video in the current sliding window; the feature t is passed through the last fully connected layer W to obtain s_0 and s_1, where s_0 is the score for the absence of a gesture and s_1 is the score for the presence of a gesture.
3. The improved Resnet-based real-time gesture recognition method of claim 2, wherein:
the specific method of step S2 is:
Let s_1^j and s_0^j be the scores for the presence and absence of a gesture, respectively, in the j-th most recent sliding window, and let w_j be the weight corresponding to the j-th window, computed as

[w_j formula reproduced only as an image in the original]

The filter keeps a fixed number of history records, and its score s_f is the w_j-weighted combination of the recorded scores:

[s_f formula reproduced only as an image in the original]

If s_f > 3, the detector is considered to have detected a gesture.
4. The improved Resnet-based real-time gesture recognition method of claim 3, wherein:
step S3 specifically includes the following steps:
step S31: constructing a gesture classification network;
step S32: let m be the sliding window length of the gesture classification network and C = {d_1, d_2, …, d_m} the set of image frames in the classification network's sliding window; when the output of the filter indicates that a gesture is present, the data in the sliding window of the gesture classification network are input into the gesture classification network; CLA(·) is the gesture classification network model, and fea = CLA(C) are the features extracted by the gesture classification network;
step S33: the features fea extracted by the gesture classification network pass in turn through an average pooling layer and a fully connected layer FULL to obtain a score score_a for each category, where a denotes the gesture category and score_a is the score of a gesture of category a; the classification probabilities P_a are obtained through a Softmax activation function, calculated as

P_a = e^(score_a) / Σ_{i=1}^{class} e^(score_i)

where class represents the number of gesture categories; the class with the largest classification probability P_a is output as the prediction result.
5. The improved Resnet-based real-time gesture recognition method of claim 4, wherein:
in step S4, the method specifically includes the following steps:
step S41: if the time stamp interval between the output result of the current gesture classification network and the last classification result is greater than or equal to 0.75 seconds, taking the current result as a final result;
step S42: if the timestamp interval between the output result of the current gesture classification network and the previous classification result is less than 0.75 seconds, calculate the difference between the largest and second-largest classification probabilities, conf = P_max - P_second, where P_max is the largest classification probability and P_second is the second-largest; if conf > 0.15, the class with the largest classification probability is output as the classification result, and if conf < 0.15, no classification result is output.
CN202110722834.3A 2021-06-29 2021-06-29 Real-time gesture recognition method based on improved Resnet Active CN113435340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110722834.3A CN113435340B (en) 2021-06-29 2021-06-29 Real-time gesture recognition method based on improved Resnet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110722834.3A CN113435340B (en) 2021-06-29 2021-06-29 Real-time gesture recognition method based on improved Resnet

Publications (2)

Publication Number Publication Date
CN113435340A true CN113435340A (en) 2021-09-24
CN113435340B CN113435340B (en) 2022-06-10

Family

ID=77757385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110722834.3A Active CN113435340B (en) 2021-06-29 2021-06-29 Real-time gesture recognition method based on improved Resnet

Country Status (1)

Country Link
CN (1) CN113435340B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052884A (en) * 2017-12-01 2018-05-18 South China University of Technology A gesture recognition method based on an improved residual neural network
WO2020244071A1 (en) * 2019-06-06 2020-12-10 Ping An Technology (Shenzhen) Co., Ltd. Neural network-based gesture recognition method and apparatus, storage medium, and device
CN111209885A (en) * 2020-01-13 2020-05-29 Tencent Technology (Shenzhen) Co., Ltd. Gesture information processing method and device, electronic equipment and storage medium
CN112507898A (en) * 2020-12-14 2021-03-16 Chongqing University of Posts and Telecommunications Multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and TCN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li, Lianwei et al.: "Real-time one-shot learning gesture recognition based on lightweight 3D Inception-ResNet with separable convolutions", Pattern Analysis and Applications *
Guan Wei et al.: "Gesture recognition network based on convolutional neural networks", Journal of Xi'an University of Posts and Telecommunications *
Xiong Caihua: "Research and application of gesture recognition algorithms based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN113435340B (en) 2022-06-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant