CN111652017B - Dynamic gesture recognition method and system - Google Patents

Dynamic gesture recognition method and system

Info

Publication number: CN111652017B (granted; published from application CN111652017A)
Authority: CN (China)
Prior art keywords: frame, hand, hand type, image, images
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN201910240044.4A
Other languages: Chinese (zh)
Inventor: 熊杰成
Current and original assignee: Shanghai Re Sr Information Technology Co., Ltd.
Application filed by Shanghai Re Sr Information Technology Co., Ltd.


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy-efficient computing, e.g. low-power processors, power management or thermal management

Abstract

The invention relates to the field of recognition technology and discloses a dynamic gesture recognition method comprising the following steps: acquiring a hand-motion video as a group of images at a preset sampling frame rate; performing hand-type recognition on each frame with a preset deep learning model to obtain the hand-type category of each frame; and arranging all frames in time order and labeling the gesture with its gesture name when the number of consecutive frames whose hand-type category is the hand type of a given state exceeds a preset frame-count threshold. The invention also discloses a dynamic gesture recognition system. With the method and system, gesture recognition can be performed while consuming fewer computing resources.

Description

Dynamic gesture recognition method and system
Technical Field
The present invention relates to the field of recognition technologies, and in particular, to a dynamic gesture recognition method and system.
Background
With the rapid development of artificial intelligence, gestures have become an important input modality for human-computer interaction and augmented reality tasks. Accurate gesture recognition is therefore increasingly important and is attracting growing attention in computer vision.
Feature engineering is the process of converting raw data into model training data; its goal is to obtain better training features so that a machine learning model can approach its performance upper bound. Feature engineering can improve model performance, sometimes achieving good results even with a simple model. It plays a very important role in machine learning and generally comprises three parts: feature construction, feature extraction, and feature selection. Both feature extraction and feature selection aim to find the most effective features among the original ones. The difference is that feature extraction obtains a set of features with clear physical or statistical meaning through feature transformation, whereas feature selection picks a subset of such features from the original feature set. Both help reduce feature dimensionality and data redundancy. Feature extraction can sometimes discover more meaningful feature attributes, while the feature selection process often reveals how important each feature is to model construction.
For example, the dynamic gesture acquisition method of patent application No. 201810002966.7 obtains the user's face region within the viewfinder, generates a gesture recognition region outside the face region, obtains the motion trajectory of the user's hand within that region, and determines the user's dynamic gesture from the trajectory curve. Current 2D gesture recognition schemes of this kind rely on traditional feature engineering: the operators, descriptors, and algorithms are essentially predefined, and the system runs with fixed algorithm parameters. The drawback is that the resulting accuracy does not necessarily meet product requirements. Moreover, such feature engineering usually demands heavy computation, which places high resource requirements on embedded devices and raises cost.
Another 2D gesture recognition approach uses a deep learning network directly and regresses the corresponding hand shape. Its advantages are high accuracy and, with an embedded neural network, fast operation. However, most models used with deep learning frameworks perform static rather than dynamic gesture recognition, and static hand shapes are poorly expressive and do not meet the requirements of commercial-grade applications.
The invention therefore provides a 2D dynamic gesture recognition scheme that combines a traditional method with a deep learning network to recognize dynamic 2D gestures, solving the problems of the prior art.
Disclosure of Invention
The invention aims to provide a dynamic gesture recognition method and system that can recognize dynamic gestures quickly and accurately.
To achieve the above object, the invention provides a dynamic gesture recognition method comprising: acquiring a hand-motion video as a group of images at a preset sampling frame rate; performing hand-type recognition on each frame with a preset MobileNetV2-SSD deep learning model to obtain the hand-type category of each frame, the categories comprising a first-state hand type, an intermediate-state hand type, and a second-state hand type; and arranging all frames in time order and labeling the gesture with its gesture name when the number of consecutive frames of the intermediate-state hand type, the number of consecutive frames of the first-state hand type, and the number of consecutive frames of the second-state hand type each exceed a preset frame-count threshold. By combining a deep learning network with traditional feature engineering, the method recognizes 2D dynamic gestures quickly and accurately, with fast operation and high accuracy.
Optionally, step S1 comprises: setting the sampling frame rate to 25 frames per second. For hand-motion video captured at more than 25 frames per second, a downsampling coefficient is calculated according to the formula M=N/(N-25), where N is the original frame rate of the hand-motion video and M is the downsampling coefficient, rounded to an integer; one frame out of every M frames is then discarded, yielding hand-motion video sampled at 25 frames per second. Fixing the sampling rate of the input source strengthens the robustness of the algorithm while avoiding efficiency loss in its operation.
Optionally, step S2 comprises: performing image recognition on each frame with the MobileNetV2-SSD deep learning model, and obtaining the hand-type category, hand position information, and hand length and width information corresponding to each frame. The hand-type categories comprise a first-state hand type, an intermediate-state hand type, and a second-state hand type. The invention uses the MobileNetV2-SSD network for hand recognition; it runs fast, is highly accurate, and meets the requirements for running on embedded devices.
Optionally, before step S3, the method further comprises: calculating the Hamming distance between two adjacent frames, and, when the Hamming distance is smaller than a preset threshold, judging whether the hand-type categories of the two adjacent frames are consistent by the IoU method. The Hamming distance calculation specifically comprises: denoting the two adjacent frames as a first frame and a second frame; dividing each into a 64 x 64 grid of cells; accumulating the pixel values within each cell of the first frame to obtain a first feature-value block, and likewise accumulating the pixel values within each cell of the second frame to obtain a second feature-value block; subtracting the two feature-value blocks to obtain a new feature-value block, and summing its entries to obtain the Hamming distance between the two adjacent frames; and, when the Hamming distance is smaller than a set threshold, judging that the hand types of the first and second frames are similar. The ratio of the overlapping portion to the combined overlapping and non-overlapping portions of the content of the two frames is then calculated by the IoU algorithm; if the ratio is greater than or equal to a preset ratio threshold, the hand-type categories of the two frames are judged consistent; if it is smaller, they are judged inconsistent.
By combining the Hamming distance and IoU schemes, overlap and similarity judgments are applied to the continuously changing frame images to decide whether the hand-type categories of two adjacent frames are consistent. This further ensures the accuracy of subsequent gesture recognition, with fast recognition speed and strong algorithmic robustness.
Optionally, the frame number threshold is set to 5 frames.
The invention also provides a dynamic gesture recognition system comprising: a sampling module for acquiring a hand-motion video as a group of images at a preset sampling frame rate; a hand-type recognition module for performing hand-type recognition on each frame with a preset MobileNetV2-SSD deep learning model to obtain the hand-type category of each frame, the categories comprising a first-state hand type, an intermediate-state hand type, and a second-state hand type; and a gesture recognition module for arranging all frames in time order and labeling the gesture with its gesture name when the numbers of consecutive frames of the intermediate-state, first-state, and second-state hand types each exceed a preset frame-count threshold. By combining a deep learning network with traditional feature engineering, the system recognizes 2D dynamic gestures quickly and accurately.
Optionally, the system further comprises: a similarity module for calculating the Hamming distance between two adjacent frames; and a judging module for judging, when the Hamming distance is smaller than a preset threshold, whether the hand-type categories of the two adjacent frames are consistent by the IoU method.
Compared with the prior art, the dynamic gesture recognition method and system of the invention have the following beneficial effects. They combine a deep learning network with traditional feature engineering to recognize 2D dynamic gestures quickly and accurately. The MobileNetV2-SSD network used for hand recognition runs fast, is highly accurate, and meets the requirements for running on embedded devices. After hand recognition is complete, a traditional feature engineering method recognizes the corresponding gesture, realizing 2D dynamic gesture recognition and improving the recognition accuracy of complex dynamic gestures. The human cognitive process for dynamic gestures is introduced into artificial-intelligence gesture recognition, guiding the network to learn that process and achieving varied, highly complex, high-precision gesture recognition.
Drawings
FIG. 1 is a flow chart of a dynamic gesture recognition method according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of the composition of a dynamic gesture recognition system according to one embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the specific embodiments shown in the drawings, in which like reference numerals denote structural elements and components of like or similar structure or function. The dimensions and thicknesses of the components shown in the drawings are arbitrary; the invention is not limited to them, and thicknesses are exaggerated in places for clarity of illustration.
As shown in fig. 1, according to one embodiment of the present invention, a dynamic gesture recognition method includes:
s1, acquiring a hand-motion video as a group of images at a preset sampling frame rate;
s2, performing hand-type recognition on each frame with a preset MobileNetV2-SSD deep learning model and obtaining the hand-type category of each frame, the categories comprising a first-state hand type, an intermediate-state hand type, and a second-state hand type;
s3, arranging all frames in time order and labeling the gesture with its gesture name when the number of consecutive frames of the intermediate-state hand type, the number of consecutive frames of the first-state hand type, and the number of consecutive frames of the second-state hand type each exceed a preset frame-count threshold.
A hand-motion video is acquired as a group of images at the preset sampling frame rate: a camera captures the gesture and collects a hand-motion video, a dynamic gesture being a sequence of successive hand-motion image frames, and the collected video is decomposed into hand-motion images at the sampling frame rate. In one embodiment of the invention the sampling frame rate is set to 25 frames per second, i.e. 25 hand-motion images are acquired per second. For hand-motion video captured at more than 25 frames per second, a downsampling coefficient is calculated according to the following formula,
M=N/(N-25);
where N is the original frame rate of the hand-motion video and M is the downsampling coefficient, rounded to an integer. One frame out of every M frames is then discarded, finally yielding hand-motion video sampled at 25 frames per second.
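The fixed-rate sampling step above can be sketched as follows for one second's worth of frames. This is an illustrative sketch, not code from the patent; the function name and list-of-frames representation are assumptions.

```python
def downsample_to_25fps(frames, fps):
    """Drop one frame out of every M so a one-second clip at `fps` frames
    is reduced to 25 frames, using the patent's formula M = N/(N-25).

    Dropping every M-th frame keeps N*(M-1)/M = N - N/M = 25 frames.
    """
    if fps <= 25:
        return list(frames)          # nothing to drop (upsampling not shown)
    m = round(fps / (fps - 25))      # downsampling coefficient, rounded
    # keep frames whose 1-based index is not a multiple of m
    return [f for i, f in enumerate(frames, start=1) if i % m != 0]
```

For example, at 30 fps the coefficient is M = 30/(30-25) = 6, so every sixth frame is dropped and 25 of 30 frames remain.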
Because different devices sample video at different rates, and varying rates introduce uncertainty into the algorithm's processing, the invention adopts fixed-rate video sampling to solve this problem: video with too high a sampling rate is downsampled, and video with a lower sampling rate is upsampled. Fixing the sampling rate of the input source strengthens the robustness of the algorithm while avoiding efficiency loss in its operation.
Hand-type recognition is performed on each frame with the preset MobileNetV2-SSD deep learning model to obtain the hand-type category of each frame, the categories comprising a first-state hand type, an intermediate-state hand type, and a second-state hand type. MobileNetV2-SSD is a detection neural network model proposed by Google that uses MobileNet as the deep backbone to extract object features and SSD to detect object locations. Specifically, image recognition is performed on each frame by the MobileNetV2-SSD model, obtaining the hand-type category, hand position information, and hand length and width information for each frame. The hand position information is the coordinate position of the hand within the frame, and the hand length and width information is the length and width of the hand within the frame; the hand-type category is the hand-shape information.
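The per-frame detector output described above (category, position, length and width) can be represented as a small record type. The field names and the palm/half/fist labels are illustrative assumptions; the actual MobileNetV2-SSD integration is not shown.

```python
from dataclasses import dataclass
from enum import Enum

class HandType(Enum):
    FIRST = "palm"          # first-state hand type (e.g. an open palm)
    INTERMEDIATE = "half"   # intermediate shape between palm and fist
    SECOND = "fist"         # second-state hand type (e.g. a closed fist)

@dataclass
class HandDetection:
    hand_type: HandType     # hand-type category from the detector
    x: float                # hand position within the frame
    y: float
    width: float            # hand length and width information
    height: float
```

Each video frame would yield one such record, and the gesture logic then operates only on the sequence of `hand_type` values and boxes.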
One embodiment of the invention is the gesture of a pinching motion. In this embodiment the first-state hand type is an open palm, the intermediate-state hand type is a shape between palm and fist, and the second-state hand type is a fist; together, the palm, the intermediate shape, and the fist form the pinching gesture motion.
In a specific embodiment of the invention, to ensure the accuracy of the hand type in each frame, each frame is verified using the Hamming distance and IoU (intersection over union). Each frame is taken from the gesture video, so the frames form a time sequence. The Hamming distance between two adjacent frames is calculated, and when it is smaller than a preset threshold, whether the hand-type categories of the two adjacent frames are consistent is judged by the IoU method. That is, whether the hand types of the preceding and following frames are the same is checked along the time sequence using the Hamming distance and IoU methods; if the check passes, the hand-type categories of the two adjacent frames are judged consistent.
Specifically, the Hamming distance is used to measure the similarity of two adjacent frames, denoted the first frame and the second frame. Each frame is divided into a 64 x 64 grid of cells. The pixel values within each cell of the first frame are accumulated to obtain a first feature-value block, i.e. a 64 x 64 matrix of feature values; likewise, the pixel values within each cell of the second frame are accumulated to obtain a second feature-value block. Subtracting the two feature-value blocks yields a new feature-value block whose entries are the distances at each corresponding position, and summing these entries yields the Hamming distance. When the Hamming distance is smaller than a preset threshold, the hand types of the first and second frames are judged similar.
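A minimal sketch of this block-wise distance (which the patent calls a Hamming distance), assuming single-channel grayscale frames as NumPy arrays; absolute differences are taken so the result behaves as a distance. The function name is an assumption.

```python
import numpy as np

def block_distance(img_a, img_b, grid=64):
    """Divide each image into a grid x grid array of cells, sum the pixels
    in each cell, and total the absolute cell-wise differences."""
    def cell_sums(img):
        h, w = img.shape
        # trim so the image divides evenly into grid x grid cells
        img = img[: h - h % grid, : w - w % grid].astype(np.int64)
        ch, cw = img.shape[0] // grid, img.shape[1] // grid
        # (grid, ch, grid, cw) -> sum over each cell's ch x cw pixels
        return img.reshape(grid, ch, grid, cw).sum(axis=(1, 3))
    return int(np.abs(cell_sums(img_a) - cell_sums(img_b)).sum())
```

Two frames are then judged similar when this value falls below the preset threshold.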
In an embodiment of the invention, after the hand types of the first and second frames are judged similar, whether their hand-type categories are consistent is judged by the IoU method: the ratio of the overlapping portion to the combined overlapping and non-overlapping portions of the content of the two frames is calculated. If the ratio is greater than or equal to a preset ratio threshold, the hand-type categories of the first and second frames are consistent; if it is smaller, they are inconsistent. For example, the ratio threshold may be set to 0.5.
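The overlap ratio above is a standard intersection over union on the detected hand boxes of the two frames; the (x, y, w, h) box convention here is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) hand bounding boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # width/height of the overlapping region, clamped at zero
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter   # overlap counted once
    return inter / union if union else 0.0
```

With the ratio threshold of 0.5 mentioned in the text, two adjacent frames would be judged consistent when `iou(box_a, box_b) >= 0.5`.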
By combining the Hamming distance and IoU schemes, overlap and similarity judgments are applied to the continuously changing frame images to decide whether the hand types of two adjacent frames are consistent. This further ensures the accuracy of subsequent gesture recognition, with fast recognition speed and strong algorithmic robustness.
All frames are arranged in time order, and the gesture is labeled with its gesture name if the number of consecutive frames of the intermediate-state hand type, the number of consecutive frames of the first-state hand type, and the number of consecutive frames of the second-state hand type each exceed a preset frame-count threshold. Gesture recognition is thus recognition of frame images over a time sequence. For example, to recognize a pinching gesture, if images are detected that pass from the palm through the intermediate shape to the fist, the gesture can be determined to be a pinch.
In an embodiment of the invention, described in detail with a captured 25-frame sequence as an example, the frame-count threshold is set to 5 frames. The gesture is labeled with its name if the numbers of consecutive intermediate-state, first-state, and second-state frames each exceed 5. Taking the pinching gesture above as an example: if more than 5 consecutive frames show the shape between palm and fist, preceded by more than 5 frames of palm and followed by more than 5 frames of fist, the gesture is determined to be a pinch; otherwise it is not.
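The run-length check above can be sketched as follows. This is a minimal sketch assuming 'palm'/'half'/'fist' labels for the first-state, intermediate-state, and second-state hand types; the function name and label strings are illustrative.

```python
def detect_pinch(labels, threshold=5):
    """Return True when the time-ordered hand-type labels contain, in
    order, a run of more than `threshold` 'palm' frames, then more than
    `threshold` 'half' frames, then more than `threshold` 'fist' frames."""
    # collapse the label sequence into (label, run_length) pairs
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    wanted = ["palm", "half", "fist"]   # required order of hand-type runs
    stage = 0
    for lab, n in runs:
        if lab == wanted[stage] and n > threshold:
            stage += 1
            if stage == len(wanted):
                return True
    return False
```

With the threshold of 5 from the text, six palm frames, six intermediate frames, and six fist frames in order would be labeled a pinch, while a three-frame intermediate run would not.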
Through hand-type recognition of consecutive frames along the time sequence and gesture judgment based on those recognition results, the scheme realizes recognition of 2D dynamic gestures quickly and accurately, with fast operation, high accuracy, and strong robustness. The recognition accuracy of complex dynamic gestures is improved; the human cognitive process for dynamic gestures is introduced into artificial-intelligence gesture recognition, guiding the network to learn that process and achieving varied, highly complex, high-precision gesture recognition.
As shown in fig. 2, in one embodiment of the present invention, a dynamic gesture recognition system, the system includes:
the sampling module 20, configured to acquire a hand-motion video as a group of images at a preset sampling frame rate;
the hand-type recognition module 21, configured to perform hand-type recognition on each frame with a preset MobileNetV2-SSD deep learning model and obtain the hand-type category of each frame, the categories comprising a first-state hand type, an intermediate-state hand type, and a second-state hand type;
the gesture recognition module 22, configured to arrange all frames in time order and label the gesture with its gesture name when the numbers of consecutive frames of the intermediate-state, first-state, and second-state hand types each exceed a preset frame-count threshold.
The sampling module shoots a gesture according to a preset sampling frame number, and acquires a group of hand action videos of images with the sampling frame number. A dynamic gesture is a sequence of successive hand motion image frames. And decomposing the acquired hand motion video into hand motion images with the images of the sampling frame numbers. In one embodiment of the present invention, the number of sampling frames is set to 25 frames per second, that is, the number of frames of images of one second is 25 frames, and the number of images of hand motions acquired per second is 25 frames. Through the technical scheme of video fixed sampling, an input source is fixed, the robustness of an algorithm is enhanced, and meanwhile, the efficiency loss of algorithm operation is effectively avoided.
And the hand type recognition module performs hand type recognition on each frame of image according to a preset mobileNetv2-ssd deep learning model to obtain a hand type category corresponding to each frame of image, wherein the hand type category comprises a hand type in a first state, a hand type in an intermediate state and a hand type in a second state. And carrying out image recognition on each frame of image through the mobileNetv2-ssd deep learning model, and obtaining the hand type, the hand position information and the hand length and width information corresponding to each frame of image. The hand position information refers to position coordinate information of the hand in the frame image, and the hand length and width information refers to the length and width of the hand in the frame image. The hand type category is hand type information.
In one embodiment of the present invention, the system further includes a similarity module and a determination module. The similarity module calculates the hamming distance between the images of two adjacent frames according to the hamming distance. In the judging module, when the hamming distance is smaller than a preset threshold value, judging whether the hand type categories of the two adjacent frames of images are consistent according to a IoU method. Each frame of image is acquired from the gesture video, and thus each frame of image is time-sequential in time. And calculating the hamming distance between the images of the two adjacent frames according to the hamming distance, and judging whether the hand types of the images of the two adjacent frames are consistent according to a IoU method when the hamming distance is smaller than a preset threshold value. The technical scheme is that whether the hand types of the images on the same time sequence are the same or not is judged and carried out based on the images on the same time sequence, namely whether the hand types of the images of the front frame and the rear frame are the same or not is detected according to the time sequence, a hamming distance method and a IoU method are adopted for detection, and if the detection is passed, the hand types of the images of the adjacent two frames are judged to be consistent. According to the technical scheme, through the technical scheme of Hamming distance and IoU, the technical scheme of area coincidence and similarity judgment is used, the accuracy of subsequent gesture recognition is further guaranteed, the technical scheme is high in recognition speed, and algorithm robustness is high.
The gesture recognition module arranges all the frame images in time order and marks the gesture with its gesture name if the number of consecutive frames whose hand type category is the intermediate-state hand type is greater than a preset frame number threshold, the number of consecutive frames in the first state is greater than the threshold, and the number of consecutive frames in the second state is greater than the threshold. Gesture recognition thus operates on the frame images over the time sequence. For example, to recognize a pinch gesture: if the images are detected to transition among the palm, the fist, and the intermediate form between palm and fist, the gesture can be determined to be a pinch. In one embodiment of the present invention, the frame number threshold is set to 5 frames: the gesture name is marked if more than 5 consecutive frames show the intermediate-state hand type, more than 5 consecutive frames show the first-state hand type, and more than 5 consecutive frames show the second-state hand type. Taking the pinch gesture as an example, if more than 5 consecutive frames show the form between palm and fist, preceded by more than 5 consecutive palm frames and followed by more than 5 consecutive fist frames, the gesture is determined to be a pinch; otherwise it is not a pinch gesture.
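The frame-counting rule above can be sketched as a run-length check over the per-frame category sequence. The category labels (`"palm"`, `"intermediate"`, `"fist"`) are hypothetical examples for the pinch gesture; the decision of how to treat short noise runs between the three long runs is a design choice of this sketch, not specified by the patent.

```python
def run_lengths(categories):
    """Collapse a per-frame category sequence into (category, count) runs."""
    runs = []
    for c in categories:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs

def is_gesture(categories, first, mid, second, min_frames=5):
    """True if the sequence contains a run of `first`, then `mid`, then
    `second`, each strictly longer than min_frames (the patent's rule,
    with the threshold set to 5 frames in one embodiment)."""
    # Keep only runs long enough to count; shorter runs are treated as noise.
    long_runs = [c for c, n in run_lengths(categories) if n > min_frames]
    for i in range(len(long_runs) - 2):
        if long_runs[i:i + 3] == [first, mid, second]:
            return True
    return False
```

For the pinch example, a sequence of 6+ palm frames, then 6+ intermediate frames, then 6+ fist frames is accepted; any sequence missing one of the three runs is rejected.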
In this technical scheme, hand type recognition is performed on consecutive frame images along the time sequence, and the gesture is determined from the hand type recognition results, thereby realizing fast and accurate recognition of 2D dynamic gestures. The scheme offers high operation speed, high accuracy, and high robustness, and improves the recognition accuracy of complex dynamic gestures. By introducing the human cognitive process for dynamic gestures into artificial-intelligence gesture recognition, the network is made to learn this cognitive process, so that diverse, highly complex gestures can be recognized with high precision.
While the invention has been described in detail in the foregoing drawings and embodiments, such illustration and description are to be considered illustrative or exemplary and not restrictive. The invention is not limited to the disclosed embodiments. In the claims, the word "comprising" does not exclude other elements or steps, and the words "a" or "an" do not exclude a plurality and should be understood as "at least one". Any reference signs in the claims shall not be construed as limiting the scope. Other variations to the above-described embodiments can be understood and effected by those skilled in the art in light of the figures, the description, and the appended claims, without departing from the scope of the invention as defined in the claims.

Claims (7)

1. A method of dynamic gesture recognition, the method comprising the steps of:
S1, acquiring, according to a preset sampling frame number, a hand motion video consisting of a group of images at the sampling frame number;
S2, performing hand type recognition on each frame image according to a preset MobileNetV2-SSD deep learning model to obtain the hand type category corresponding to each frame image, wherein the hand type categories comprise a hand type in a first state, a hand type in an intermediate state, and a hand type in a second state;
S3, arranging all the frame images in time order, and marking the gesture name of the gesture when the number of consecutive frame images whose hand type category is the intermediate-state hand type is greater than a preset frame number threshold, the number of consecutive frame images in the first state is greater than the frame number threshold, and the number of consecutive frame images in the second state is greater than the frame number threshold;
the step S3 includes:
recording two adjacent frame images as a first frame image and a second frame image respectively;
dividing the first frame image and the second frame image each into a 64 × 64 grid of small squares;
accumulating the pixel values in each small square of the first frame image to obtain a first characteristic value block corresponding to the first frame image;
accumulating the pixel values in each small square of the second frame image to obtain a second characteristic value block corresponding to the second frame image;
subtracting the second characteristic value block from the first characteristic value block to obtain a new characteristic value block, and accumulating and summing the entries of the new characteristic value block to obtain the Hamming distance between the two adjacent frame images;
setting a threshold, and when the Hamming distance is smaller than the threshold, determining that the hand types of the first frame image and the second frame image are similar;
and after determining that the hand types of the first frame image and the second frame image are similar, determining whether the hand types of the first frame image and the second frame image are consistent according to an IoU method.
2. The dynamic gesture recognition method of claim 1, wherein the step S1 includes: setting the sampling frame number to 25 frames per second.
3. The dynamic gesture recognition method of claim 2, wherein the step S1 further comprises:
if the hand motion video has more than 25 image frames, calculating a down-sampling coefficient according to the following formula:
M = N / (N - 25);
wherein N is the original frame number of the hand motion video;
M is the down-sampling coefficient, rounded to an integer;
and discarding one frame for every M frames to obtain a hand motion video with a sampling frame number of 25 frames per second.
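A minimal sketch of this down-sampling step, assuming `frames` holds one second of video. Note that the claim's formula yields exactly 25 frames only when M divides N evenly (e.g. N = 30 or N = 50); for other N the result is approximately 25, and for N well above 50 the rounded M degenerates, a property of the formula as claimed rather than of this sketch.

```python
def downsample_to_25(frames):
    """Claim 3's down-sampling: for N frames with N > 25, compute
    M = N / (N - 25) rounded to an integer, then discard one frame
    for every M frames, leaving roughly 25 frames."""
    n = len(frames)
    if n <= 25:
        return list(frames)
    m = round(n / (n - 25))
    # Drop every M-th frame (1-indexed), i.e. one frame per group of M.
    return [f for i, f in enumerate(frames, start=1) if i % m != 0]
```

For example, with N = 30 the coefficient is M = 30 / 5 = 6, so every 6th frame is dropped, discarding 5 frames and leaving 25.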
4. The dynamic gesture recognition method of claim 1, wherein the step S2 includes:
performing image recognition on each frame image according to the MobileNetV2-SSD deep learning model, and obtaining the hand type category, the hand position information, and the hand length and width information corresponding to each frame image;
wherein the hand type categories include the first-state hand type, the intermediate-state hand type, and the second-state hand type.
5. The dynamic gesture recognition method of claim 4, wherein the calculation of the IoU method comprises:
calculating the ratio of the overlapping portion to the non-overlapping portion of the contents of the first frame image and the second frame image;
if the ratio is greater than or equal to a preset ratio threshold, the hand type of the first frame image is consistent with that of the second frame image;
and if the ratio is smaller than the ratio threshold, the hand type of the first frame image is inconsistent with that of the second frame image.
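The IoU comparison can be sketched over the (x, y, w, h) hand boxes returned by the detector. This sketch uses the standard intersection-over-union ratio; the claim as translated phrases the ratio as overlap versus non-overlap, so treat the exact denominator as an assumption of the sketch. The default threshold of 0.5 is likewise illustrative, not taken from the patent.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extents along each axis (zero if the boxes are disjoint).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def hand_types_consistent(box_a, box_b, ratio_threshold=0.5):
    # Consistent when the overlap ratio meets or exceeds the threshold.
    return iou(box_a, box_b) >= ratio_threshold
```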
6. The dynamic gesture recognition method of claim 5, wherein the frame number threshold is set to 5 frames.
7. A dynamic gesture recognition system, the system comprising:
the sampling module, used for acquiring, according to a preset sampling frame number, a hand motion video consisting of a group of images at the sampling frame number;
the hand type recognition module, used for performing hand type recognition on each frame image according to a preset MobileNetV2-SSD deep learning model to obtain the hand type category corresponding to each frame image, wherein the hand type categories comprise a hand type in a first state, a hand type in an intermediate state, and a hand type in a second state;
the gesture recognition module, used for arranging all the frame images in time order, and marking the gesture name of the gesture when the number of consecutive frame images whose hand type category is the intermediate-state hand type is greater than a preset frame number threshold, the number of consecutive frame images in the first state is greater than the frame number threshold, and the number of consecutive frame images in the second state is greater than the frame number threshold;
the similarity module, used for recording two adjacent frame images as a first frame image and a second frame image respectively, dividing the first frame image and the second frame image each into a 64 × 64 grid of small squares, accumulating the pixel values in each small square of the first frame image to obtain a first characteristic value block corresponding to the first frame image, accumulating the pixel values in each small square of the second frame image to obtain a second characteristic value block corresponding to the second frame image, subtracting the second characteristic value block from the first characteristic value block to obtain a new characteristic value block, accumulating and summing the entries of the new characteristic value block to obtain the Hamming distance between the two adjacent frame images, and setting a threshold, wherein when the Hamming distance is smaller than the threshold, the hand types of the first frame image and the second frame image are similar;
and the determination module, used for determining, when the Hamming distance is smaller than the preset threshold, whether the hand type categories of the two adjacent frame images are consistent according to an IoU method.
CN201910240044.4A 2019-03-27 2019-03-27 Dynamic gesture recognition method and system Active CN111652017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910240044.4A CN111652017B (en) 2019-03-27 2019-03-27 Dynamic gesture recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910240044.4A CN111652017B (en) 2019-03-27 2019-03-27 Dynamic gesture recognition method and system

Publications (2)

Publication Number Publication Date
CN111652017A CN111652017A (en) 2020-09-11
CN111652017B true CN111652017B (en) 2023-06-23

Family

ID=72344440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910240044.4A Active CN111652017B (en) 2019-03-27 2019-03-27 Dynamic gesture recognition method and system

Country Status (1)

Country Link
CN (1) CN111652017B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364799A (en) * 2020-11-18 2021-02-12 展讯通信(上海)有限公司 Gesture recognition method and device
CN112926454B (en) * 2021-02-26 2023-01-06 重庆长安汽车股份有限公司 Dynamic gesture recognition method
CN112686231B (en) * 2021-03-15 2021-06-01 南昌虚拟现实研究院股份有限公司 Dynamic gesture recognition method and device, readable storage medium and computer equipment
CN113377211B (en) * 2021-08-16 2022-02-01 北京亮亮视野科技有限公司 Gesture recognition method, device, equipment and medium
CN115022549B (en) * 2022-06-27 2024-04-16 影石创新科技股份有限公司 Shooting composition method, shooting composition device, computer equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117766A (en) * 2018-07-30 2019-01-01 上海斐讯数据通信技术有限公司 A kind of dynamic gesture identification method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2014117521A (en) * 2014-04-29 2015-11-10 ЭлЭсАй Корпорейшн RECOGNITION OF DYNAMIC GESTURES USING PROPERTIES RECEIVED FROM SEVERAL INTERVALS
CN106325485B (en) * 2015-06-30 2019-09-10 芋头科技(杭州)有限公司 A kind of gestures detection recognition methods and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117766A (en) * 2018-07-30 2019-01-01 上海斐讯数据通信技术有限公司 A kind of dynamic gesture identification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Meiyun; Guo Lei; Xu Mengzhu. Hand shape feature extraction algorithm based on the HOG operator. Computer Applications and Software. 2015, (12), full text. *

Also Published As

Publication number Publication date
CN111652017A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111652017B (en) Dynamic gesture recognition method and system
Jadooki et al. Fused features mining for depth-based hand gesture recognition to classify blind human communication
CN110458059B (en) Gesture recognition method and device based on computer vision
CN104573706B (en) A kind of subject image recognition methods and its system
CN107885327B (en) Fingertip detection method based on Kinect depth information
CN104036287A (en) Human movement significant trajectory-based video classification method
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
CN109815876B (en) Gesture recognition method based on address event stream characteristics
Wu et al. Graph2Net: Perceptually-enriched graph learning for skeleton-based action recognition
CN109858407B (en) Video behavior recognition method based on multiple information flow characteristics and asynchronous fusion
Lin et al. A temporal hand gesture recognition system based on hog and motion trajectory
Premaratne et al. Centroid tracking based dynamic hand gesture recognition using discrete Hidden Markov Models
Bagate et al. Human activity recognition using rgb-d sensors
CN110308795B (en) Dynamic gesture recognition method and system
CN113378770B (en) Gesture recognition method, device, equipment and storage medium
CN111523387B (en) Method and device for detecting key points of hands and computer device
Zhou et al. A study on attention-based LSTM for abnormal behavior recognition with variable pooling
Huu et al. Proposing recognition algorithms for hand gestures based on machine learning model
Cambuim et al. An efficient static gesture recognizer embedded system based on ELM pattern recognition algorithm
CA2806149A1 (en) Method and system for gesture-based human-machine interaction and computer-readable medium thereof
CN112215112A (en) Method and system for generating neural network model for hand motion recognition
CN111651038A (en) Gesture recognition control method based on ToF and control system thereof
Hoque et al. Computer vision based gesture recognition for desktop object manipulation
CN115393950A (en) Gesture segmentation network device and method based on multi-branch cascade Transformer
Bao et al. A hardware friendly algorithm for action recognition using spatio-temporal motion-field patches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant