CN112183424A - Real-time hand tracking method and system based on video - Google Patents

Real-time hand tracking method and system based on video

Info

Publication number
CN112183424A
Authority
CN
China
Prior art keywords
image
palm
model
training
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011074015.4A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huayan Mutual Entertainment Technology Co ltd
Original Assignee
Beijing Huayan Mutual Entertainment Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huayan Mutual Entertainment Technology Co ltd filed Critical Beijing Huayan Mutual Entertainment Technology Co ltd
Priority to CN202011074015.4A
Publication of CN112183424A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video-based real-time hand tracking method and system. The method comprises the following steps: inputting a video frame image; performing real-time palm detection on the video frame image through a palm detection model, and cropping the detected palm to obtain a palm image; performing finger keypoint localization on the palm image through a hand identification model to obtain the coordinate position of each finger keypoint on the palm image and mark those positions; and performing gesture recognition on the gesture image marked by the hand identification model through a gesture recognition model to obtain a real-time hand gesture recognition result. The invention thereby achieves real-time, effective tracking of the hand.

Description

Real-time hand tracking method and system based on video
Technical Field
The invention relates to the technical field of image recognition and animation, in particular to a real-time hand tracking method and system based on video.
Background
Today, millions of people communicate using sign language, yet progress in capturing complex gestures and translating them into spoken language has so far been limited. Hand motion is usually fast and subtle, the hand is often occluded during movement, and hand images typically lack high contrast against the background, so quickly identifying the hand in a video frame image is not easy. Even when multiple cameras capture the hand from several angles, or depth-sensing devices sense the hand region, dynamically tracking the hand in real time remains difficult.
Disclosure of Invention
The present invention provides a video-based real-time hand tracking method and system to solve the above problems.
To achieve this purpose, the invention adopts the following technical scheme:
A video-based real-time hand tracking method is provided, comprising the following steps:
inputting a video frame image;
performing real-time palm detection on the video frame image through a palm detection model, and cropping the detected palm to obtain a palm image;
performing finger keypoint localization on the palm image through a hand identification model to obtain the coordinate position of each finger keypoint on the palm image and mark those positions;
and performing gesture recognition on the gesture image marked by the hand identification model through a gesture recognition model to obtain a real-time hand gesture recognition result.
Preferably, the method of training the palm detection model comprises the following steps:
selecting 30000 video frame images containing a palm as training samples for the palm detection model;
inputting the video frame images serving as training samples into a deep learning network and training an initial palm detection model;
performing palm detection on the video frame images through the initial palm detection model and outputting the detection results;
manually checking the detection results output by the initial palm detection model to evaluate the model performance, then adjusting the training parameters of the deep learning network according to the evaluation result;
and iteratively updating the initial palm detection model with the adjusted training parameters, again using the video frame images as training samples, to finally obtain the palm detection model.
Preferably, the deep learning network is a neural network based on the RPN (Region Proposal Network) structure.
Preferably, the size of the video frame image is 256 × 256.
Preferably, the deep learning network comprises 5 sequentially cascaded convolutional layers: the first convolutional layer of the deep learning network extracts image features from the 256 × 256 video frame image and outputs a 128 × 128 feature map; the second convolutional layer extracts image features from the 128 × 128 feature map and outputs a 64 × 64 feature map; the third convolutional layer extracts image features from the 64 × 64 feature map and outputs a 32 × 32 feature map; the fourth convolutional layer extracts image features from the 32 × 32 feature map and outputs a 16 × 16 feature map; and the fifth convolutional layer extracts image features from the 16 × 16 feature map and outputs an 8 × 8 feature map.
Preferably, the finger keypoints comprise 21 finger keypoints with 3D coordinates that can characterize the shape of the palm.
Preferably, the method by which the gesture recognition model recognizes gestures comprises the following steps:
cropping the gesture image from the palm image at a preset size, based on the marked finger keypoints;
and matching the gesture image against the classification template images stored in an image database, each classification template image being associated with a gesture type; if the image matching succeeds, outputting the gesture type associated with the matched classification template image as the gesture recognition result for the gesture image.
The invention also provides a video-based real-time hand tracking system capable of implementing the above real-time hand tracking method, the system comprising:
an image input module for inputting video frame images;
a palm detection module, connected to the image input module, for performing real-time palm detection on the video frame image through a palm detection model and cropping the detected palm to obtain a palm image;
a hand identification module, connected to the palm detection module, for performing finger keypoint localization on the palm image through a hand identification model to obtain the coordinate position of each finger keypoint on the palm image and mark those positions;
and a gesture recognition module, connected to the hand identification module, for performing gesture recognition on the gesture image marked by the hand identification module through a gesture recognition model to obtain a real-time hand gesture recognition result.
Preferably, the real-time hand tracking system further comprises:
palm detection model training module connects palm detection module is used for the training palm detection model, specifically include in the palm detection model training module:
the sample marking unit is used for providing marking personnel with palm positions identified in the video frame images;
the sample selecting unit is connected with the sample labeling unit and used for selecting an image sample used for training the palm detection model from each video frame image labeled by the palm position;
the palm detection initial model training unit is connected with the sample selection unit and used for inputting each video frame image serving as a training sample into a deep learning network and training to form a palm detection initial model;
the model performance verification unit is connected with the palm detection initial model training unit and used for verifying the model performance of the palm detection initial model;
the model parameter adjusting unit is connected with the model performance verifying unit and used for providing a model training person with an adjusting model training parameter according to a model performance verifying result;
and the model iteration updating unit is respectively connected with the sample selecting unit and the model parameter adjusting unit and is used for performing iteration updating on the palm detection initial model according to the adjusted model training parameters and by taking each selected video frame image as a training sample, and finally training to form the palm detection model.
Preferably, the deep learning network comprises 5 sequentially cascaded convolutional layers: the first convolutional layer of the deep learning network extracts image features from the 256 × 256 video frame image and outputs a 128 × 128 feature map; the second convolutional layer extracts image features from the 128 × 128 feature map and outputs a 64 × 64 feature map; the third convolutional layer extracts image features from the 64 × 64 feature map and outputs a 32 × 32 feature map; the fourth convolutional layer extracts image features from the 32 × 32 feature map and outputs a 16 × 16 feature map; and the fifth convolutional layer extracts image features from the 16 × 16 feature map and outputs an 8 × 8 feature map.
The invention adopts deep learning technology: it first detects the most distinctive and reliable part of the hand, the palm, through a palm detection model; it then detects the finger keypoints of the detected palm to obtain hand posture information; and it finally recognizes the hand posture through gesture matching, thereby achieving real-time, effective tracking of the hand.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a diagram of the method steps of a video-based real-time hand tracking method according to an embodiment of the invention;
FIG. 2 is a diagram of method steps for training the palm detection model;
FIG. 3 is a diagram of method steps by which the gesture recognition model recognizes gestures;
FIG. 4 is a schematic diagram of a video-based real-time hand tracking system according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the internal structure of the palm detection model training module in the real-time hand tracking system;
fig. 6 is a network architecture diagram of the deep learning network.
Detailed Description
The technical scheme of the invention is further explained below through specific embodiments in combination with the accompanying drawings.
The drawings are for the purpose of illustration only, are shown in schematic rather than actual form, and are not to be construed as limiting this patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; and it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the present invention denote the same or similar components. In the description of the present invention, it should be understood that terms such as "upper", "lower", "left", "right", "inner" and "outer" indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience and simplification of description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Terms describing positional relationships in the drawings are therefore illustrative only, are not to be construed as limitations of this patent, and their specific meanings can be understood by those skilled in the art according to the specific situation.
In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" and the like, where it indicates a connection relationship between components, is to be understood broadly: for example, as a fixed, detachable or integral connection; as a mechanical or electrical connection; as a direct connection or an indirect connection through intervening media; or as a connection through one or more other components or an interactive relationship between components. The specific meanings of the above terms in the present invention can be understood by those skilled in the art in specific cases.
As shown in fig. 1, the video-based real-time hand tracking method according to an embodiment of the present invention comprises the following steps:
step S1, inputting a video frame image;
step S2, performing real-time palm detection on the video frame image through a palm detection model, and cropping the detected palm to obtain a palm image;
step S3, performing finger keypoint localization on the palm image through a hand identification model to obtain the coordinate position of each finger keypoint on the palm image and mark those positions;
and step S4, performing gesture recognition on the gesture image marked by the hand identification model through a gesture recognition model to obtain a real-time hand gesture recognition result.
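The following minimal Python sketch illustrates how the per-frame pipeline of steps S1 to S4 fits together. The three model functions are stand-ins for the patent's palm detection model, hand identification model and gesture recognition model; their names, signatures and dummy return values are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def detect_palm(frame: np.ndarray):
    """Stand-in for the palm detection model: returns a palm bounding box
    (x, y, w, h), or None if no palm is found in the frame."""
    return (64, 64, 128, 128)  # dummy box for illustration

def locate_keypoints(palm_image: np.ndarray) -> np.ndarray:
    """Stand-in for the hand identification model: returns the 21 (x, y, z)
    finger keypoint coordinates on the palm image."""
    return np.zeros((21, 3), dtype=np.float32)  # dummy keypoints

def recognize_gesture(gesture_image: np.ndarray) -> str:
    """Stand-in for the gesture recognition model: matches the gesture image
    against classification templates and returns a gesture type."""
    return "open_palm"  # dummy result

def track_frame(frame: np.ndarray):
    box = detect_palm(frame)                  # step S2: real-time palm detection
    if box is None:
        return None
    x, y, w, h = box
    palm_image = frame[y:y + h, x:x + w]      # step S2: crop the palm image
    keypoints = locate_keypoints(palm_image)  # step S3: 21 finger keypoints
    gesture = recognize_gesture(palm_image)   # step S4: gesture recognition
    return keypoints, gesture

if __name__ == "__main__":
    dummy_frame = np.zeros((256, 256, 3), dtype=np.uint8)  # step S1: input frame
    print(track_frame(dummy_frame))
```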
As shown in fig. 2, in step S2 the palm detection model is trained through the following steps:
step S21, selecting 30000 video frame images containing a palm as training samples for the palm detection model;
step S22, inputting the video frame images serving as training samples into a deep learning network and training an initial palm detection model;
step S23, performing palm detection on the video frame images through the initial palm detection model and outputting the detection results;
step S24, manually checking the detection results output by the initial palm detection model to evaluate the model performance, then adjusting the training parameters of the deep learning network according to the evaluation result;
and step S25, iteratively updating the initial palm detection model with the adjusted training parameters, again using the video frame images as training samples, to finally obtain the palm detection model.
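As a rough illustration of how steps S22 and S25 could be organized in code, the sketch below assumes PyTorch and a generic detection loss; the loss function, optimizer, learning rate and epoch count are placeholders rather than values specified by the patent. Steps S23 and S24 (detection, manual verification and parameter adjustment) happen between calls to this function.

```python
import torch
from torch import nn, optim

def train_palm_detector(model: nn.Module,
                        loader,             # yields (frames, labels) batches
                        lr: float = 1e-3,   # adjusted after each manual review (step S24)
                        epochs: int = 10) -> nn.Module:
    """Steps S22/S25: fit, or iteratively refine, the palm detection model."""
    criterion = nn.BCEWithLogitsLoss()      # stand-in for the real detection loss
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for frames, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
    return model

# Outer loop of steps S23-S25: run detection on the 30000 training frames,
# review the detection results by hand, adjust `lr` or other hyperparameters,
# and call train_palm_detector again until the model performance is acceptable.
```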
The deep learning network adopted in this embodiment is an improvement on the RPN (Region Proposal Network) neural network structure. Specifically, as shown in fig. 6, the network comprises 5 sequentially cascaded convolutional layers: the first convolutional layer extracts image features from the 256 × 256 video frame image and outputs a 128 × 128 feature map; the second convolutional layer extracts image features from the 128 × 128 feature map and outputs a 64 × 64 feature map; the third convolutional layer outputs a 32 × 32 feature map; the fourth outputs a 16 × 16 feature map; and the fifth outputs an 8 × 8 feature map. Repeated experiments show that a 64 × 64 feature map is sufficient to represent a palm with all five fingers spread, while an 8 × 8 feature map is sufficient to represent a clenched fist. The palm detection model therefore outputs the detected palm to the hand identification model at an image size of 64 × 64, 32 × 32, 16 × 16 or 8 × 8, according to how far the five fingers are spread, for further finger keypoint localization.
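A minimal PyTorch sketch of this five-layer backbone follows; the patent specifies only the feature-map resolutions, so the channel widths, kernel size, stride and activation are assumptions chosen to halve the resolution at each stage.

```python
import torch
from torch import nn

class PalmBackbone(nn.Module):
    """Five cascaded convolutional layers, 256 -> 128 -> 64 -> 32 -> 16 -> 8."""
    def __init__(self):
        super().__init__()
        channels = [3, 16, 32, 64, 128, 256]  # illustrative channel widths
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):
        feature_maps = []           # one entry per convolutional stage
        for layer in self.layers:
            x = layer(x)
            feature_maps.append(x)  # 128, 64, 32, 16, 8 resolutions in order
        return feature_maps

if __name__ == "__main__":
    maps = PalmBackbone()(torch.zeros(1, 3, 256, 256))
    print([m.shape[-1] for m in maps])  # -> [128, 64, 32, 16, 8]
```

Returning every intermediate feature map is what allows the model to pass the 64 × 64 map onward for a spread palm, or the 8 × 8 map for a clenched fist, as described above.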
The finger keypoints detected by the invention comprise 21 finger keypoints with 3D coordinates, which fully characterize the shape of the palm. These 21 palm-shape keypoints are established in prior-art research, so their specific positions on the palm are not explained here. After the 21 finger keypoints are detected and located, the position of the gesture image within the video frame image can be derived from their coordinate positions, and the gesture image is then cropped and output to the gesture recognition model.
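A simple way to realize this cropping is to take the 2D bounding box of the 21 keypoint coordinates, as in the sketch below; the padding margin is an illustrative assumption, not a value from the patent.

```python
import numpy as np

def crop_gesture_image(frame: np.ndarray, keypoints: np.ndarray,
                       pad: int = 16) -> np.ndarray:
    """keypoints: (21, 3) array of (x, y, z); the crop uses only x and y,
    expanded by `pad` pixels and clipped to the frame bounds."""
    xs, ys = keypoints[:, 0], keypoints[:, 1]
    h, w = frame.shape[:2]
    x0 = max(int(xs.min()) - pad, 0)
    y0 = max(int(ys.min()) - pad, 0)
    x1 = min(int(xs.max()) + pad, w)
    y1 = min(int(ys.max()) + pad, h)
    return frame[y0:y1, x0:x1]
```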
As shown in fig. 3, the gesture recognition model recognizes a gesture through the following steps:
step S41, cropping the gesture image from the palm image at a preset size, based on the marked finger keypoints;
and step S42, matching the gesture image against the classification template images stored in an image database, each classification template image being associated with a gesture type; if the image matching succeeds, outputting the gesture type associated with the matched classification template image as the gesture recognition result for the gesture image.
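The patent does not name a particular matching algorithm, so the sketch below uses OpenCV's normalized cross-correlation (cv2.matchTemplate) with a score threshold as one plausible realization of step S42; the threshold value is an assumption.

```python
from typing import Dict, Optional

import cv2
import numpy as np

def match_gesture(gesture_image: np.ndarray,
                  templates: Dict[str, np.ndarray],
                  threshold: float = 0.8) -> Optional[str]:
    """Step S42: `templates` maps each gesture type to its classification
    template image; returns the best-matching gesture type, or None."""
    best_type, best_score = None, threshold
    for gesture_type, template in templates.items():
        # Resize the gesture image to the template's preset size (step S41).
        resized = cv2.resize(gesture_image, template.shape[1::-1])
        score = cv2.matchTemplate(resized, template, cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_type, best_score = gesture_type, score
    return best_type  # None if no template matches well enough
```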
The present invention further provides a video-based real-time hand tracking system capable of implementing the above real-time hand tracking method. As shown in fig. 4, the system comprises:
an image input module 1 for inputting video frame images;
a palm detection module 2, connected to the image input module 1, for performing real-time palm detection on the video frame image through a palm detection model and cropping the detected palm to obtain a palm image;
a hand identification module 3, connected to the palm detection module 2, for performing finger keypoint localization on the palm image through a hand identification model to obtain the coordinate position of each finger keypoint on the palm image and mark those positions;
and a gesture recognition module 4, connected to the hand identification module 3, for performing gesture recognition on the gesture image marked by the hand identification module through a gesture recognition model to obtain a real-time hand gesture recognition result.
To enable training of the palm detection model, the real-time hand tracking system preferably further comprises:
a palm detection model training module, connected to the palm detection module, for training the palm detection model; as shown in fig. 5, the palm detection model training module specifically comprises:
a sample annotation unit 51 for annotators to mark the palm positions identified in the video frame images;
a sample selection unit 52, connected to the sample annotation unit 51, for selecting image samples for training the palm detection model from the video frame images annotated with palm positions;
an initial palm detection model training unit 53, connected to the sample selection unit 52, for inputting the video frame images serving as training samples into a deep learning network and training an initial palm detection model;
a model performance verification unit 54, connected to the initial palm detection model training unit 53, for verifying the model performance of the initial palm detection model;
a model parameter adjustment unit 55, connected to the model performance verification unit 54, for model trainers to adjust the model training parameters according to the model performance verification results;
and a model iteration update unit 56, connected to the sample selection unit 52 and the model parameter adjustment unit 55 respectively, for iteratively updating the initial palm detection model with the adjusted training parameters, using the selected video frame images as training samples, to finally obtain the palm detection model.
As a preferred scheme, the deep learning network adopted in this embodiment is an improvement on the RPN neural network structure. Specifically, the network comprises 5 sequentially cascaded convolutional layers: the first convolutional layer extracts image features from the 256 × 256 video frame image and outputs a 128 × 128 feature map; the second convolutional layer extracts image features from the 128 × 128 feature map and outputs a 64 × 64 feature map; the third convolutional layer outputs a 32 × 32 feature map; the fourth outputs a 16 × 16 feature map; and the fifth outputs an 8 × 8 feature map.
In conclusion, the invention achieves real-time tracking and detection of the hand.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention, illustrating the technical principles applied. Those skilled in the art will understand that various modifications, equivalents and changes can be made to the present invention; such variations fall within the scope of the invention as long as they do not depart from its spirit. In addition, certain terms used in the specification and claims of the present application are not limiting but are used merely for convenience of description.

Claims (10)

1. A video-based real-time hand tracking method, comprising:
inputting a video frame image;
performing real-time palm detection on the video frame image through a palm detection model, and cropping the detected palm to obtain a palm image;
performing finger keypoint localization on the palm image through a hand identification model to obtain the coordinate position of each finger keypoint on the palm image and mark those positions;
and performing gesture recognition on the gesture image marked by the hand identification model through a gesture recognition model to obtain a real-time hand gesture recognition result.
2. The video-based real-time hand tracking method of claim 1, wherein the method of training the palm detection model comprises the following steps:
selecting 30000 video frame images containing a palm as training samples for the palm detection model;
inputting the video frame images serving as training samples into a deep learning network and training an initial palm detection model;
performing palm detection on the video frame images through the initial palm detection model and outputting the detection results;
manually checking the detection results output by the initial palm detection model to evaluate the model performance, then adjusting the training parameters of the deep learning network according to the evaluation result;
and iteratively updating the initial palm detection model with the adjusted training parameters, again using the video frame images as training samples, to finally obtain the palm detection model.
3. The video-based real-time hand tracking method of claim 2, wherein the deep learning network is a neural network based on the RPN network structure.
4. The video-based real-time hand tracking method of claim 3, wherein the video frame image has a size of 256 x 256.
5. The video-based real-time hand tracking method of claim 4, wherein the deep learning network comprises 5 sequentially cascaded convolutional layers: the first convolutional layer of the deep learning network extracts image features from the 256 × 256 video frame image and outputs a 128 × 128 feature map; the second convolutional layer extracts image features from the 128 × 128 feature map and outputs a 64 × 64 feature map; the third convolutional layer extracts image features from the 64 × 64 feature map and outputs a 32 × 32 feature map; the fourth convolutional layer extracts image features from the 32 × 32 feature map and outputs a 16 × 16 feature map; and the fifth convolutional layer extracts image features from the 16 × 16 feature map and outputs an 8 × 8 feature map.
6. The video-based real-time hand tracking method of claim 1, wherein the finger keypoints comprise 21 finger keypoints with 3D coordinates that can characterize the shape of the palm.
7. The video-based real-time hand tracking method of claim 1, wherein the method by which the gesture recognition model recognizes gestures comprises the following steps:
cropping the gesture image from the palm image at a preset size, based on the marked finger keypoints;
and matching the gesture image against the classification template images stored in an image database, each classification template image being associated with a gesture type; if the image matching succeeds, outputting the gesture type associated with the matched classification template image as the gesture recognition result for the gesture image.
8. A video-based real-time hand tracking system that implements the real-time hand tracking method of any one of claims 1-7, comprising:
an image input module for inputting video frame images;
a palm detection module, connected to the image input module, for performing real-time palm detection on the video frame image through a palm detection model and cropping the detected palm to obtain a palm image;
a hand identification module, connected to the palm detection module, for performing finger keypoint localization on the palm image through a hand identification model to obtain the coordinate position of each finger keypoint on the palm image and mark those positions;
and a gesture recognition module, connected to the hand identification module, for performing gesture recognition on the gesture image marked by the hand identification module through a gesture recognition model to obtain a real-time hand gesture recognition result.
9. The video-based real-time hand tracking system of claim 8, further comprising:
a palm detection model training module, connected to the palm detection module, for training the palm detection model, the palm detection model training module specifically comprising:
a sample annotation unit for annotators to mark the palm positions identified in the video frame images;
a sample selection unit, connected to the sample annotation unit, for selecting image samples for training the palm detection model from the video frame images annotated with palm positions;
an initial palm detection model training unit, connected to the sample selection unit, for inputting the video frame images serving as training samples into a deep learning network and training an initial palm detection model;
a model performance verification unit, connected to the initial palm detection model training unit, for verifying the model performance of the initial palm detection model;
a model parameter adjustment unit, connected to the model performance verification unit, for model trainers to adjust the model training parameters according to the model performance verification results;
and a model iteration update unit, connected to the sample selection unit and the model parameter adjustment unit respectively, for iteratively updating the initial palm detection model with the adjusted training parameters, using the selected video frame images as training samples, to finally obtain the palm detection model.
10. The video-based real-time hand tracking system of claim 9, wherein the deep learning network comprises 5 sequentially cascaded convolutional layers: the first convolutional layer of the deep learning network extracts image features from the 256 × 256 video frame image and outputs a 128 × 128 feature map; the second convolutional layer extracts image features from the 128 × 128 feature map and outputs a 64 × 64 feature map; the third convolutional layer extracts image features from the 64 × 64 feature map and outputs a 32 × 32 feature map; the fourth convolutional layer extracts image features from the 32 × 32 feature map and outputs a 16 × 16 feature map; and the fifth convolutional layer extracts image features from the 16 × 16 feature map and outputs an 8 × 8 feature map.
CN202011074015.4A 2020-10-12 2020-10-12 Real-time hand tracking method and system based on video Pending CN112183424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011074015.4A CN112183424A (en) 2020-10-12 2020-10-12 Real-time hand tracking method and system based on video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011074015.4A CN112183424A (en) 2020-10-12 2020-10-12 Real-time hand tracking method and system based on video

Publications (1)

Publication Number Publication Date
CN112183424A true CN112183424A (en) 2021-01-05

Family

ID=73948657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011074015.4A Pending CN112183424A (en) 2020-10-12 2020-10-12 Real-time hand tracking method and system based on video

Country Status (1)

Country Link
CN (1) CN112183424A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113589928A (en) * 2021-07-27 2021-11-02 东莞理工学院 Gesture recognition method for smart television
US20220076433A1 (en) * 2019-12-10 2022-03-10 Google Llc Scalable Real-Time Hand Tracking
CN114581535A (en) * 2022-03-03 2022-06-03 北京深光科技有限公司 Method, device, storage medium and equipment for marking key points of user bones in image
WO2022166243A1 (en) * 2021-02-07 2022-08-11 青岛小鸟看看科技有限公司 Method, apparatus and system for detecting and identifying pinching gesture

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871047A (en) * 2017-11-21 2018-04-03 中国人民解放军战略支援部队航天工程大学 A kind of complex spatial system safety management parallel computing method
CN108171112A (en) * 2017-12-01 2018-06-15 西安电子科技大学 Vehicle identification and tracking based on convolutional neural networks
CN109255375A (en) * 2018-08-29 2019-01-22 长春博立电子科技有限公司 Panoramic picture method for checking object based on deep learning
CN109376674A (en) * 2018-10-31 2019-02-22 北京小米移动软件有限公司 Method for detecting human face, device and storage medium
CN109635630A (en) * 2018-10-23 2019-04-16 百度在线网络技术(北京)有限公司 Hand joint point detecting method, device and storage medium
CN109918635A (en) * 2017-12-12 2019-06-21 中兴通讯股份有限公司 A kind of contract text risk checking method, device, equipment and storage medium
KR20190132885A (en) * 2018-05-21 2019-11-29 주식회사 케이티 Apparatus, method and computer program for detecting hand from video
CN110781668A (en) * 2019-10-24 2020-02-11 腾讯科技(深圳)有限公司 Text information type identification method and device
CN111104820A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Gesture recognition method based on deep learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871047A (en) * 2017-11-21 2018-04-03 中国人民解放军战略支援部队航天工程大学 A kind of complex spatial system safety management parallel computing method
CN108171112A (en) * 2017-12-01 2018-06-15 西安电子科技大学 Vehicle identification and tracking based on convolutional neural networks
CN109918635A (en) * 2017-12-12 2019-06-21 中兴通讯股份有限公司 A kind of contract text risk checking method, device, equipment and storage medium
KR20190132885A (en) * 2018-05-21 2019-11-29 주식회사 케이티 Apparatus, method and computer program for detecting hand from video
CN109255375A (en) * 2018-08-29 2019-01-22 长春博立电子科技有限公司 Panoramic picture method for checking object based on deep learning
CN109635630A (en) * 2018-10-23 2019-04-16 百度在线网络技术(北京)有限公司 Hand joint point detecting method, device and storage medium
CN111104820A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Gesture recognition method based on deep learning
CN109376674A (en) * 2018-10-31 2019-02-22 北京小米移动软件有限公司 Method for detecting human face, device and storage medium
CN110781668A (en) * 2019-10-24 2020-02-11 腾讯科技(深圳)有限公司 Text information type identification method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220076433A1 (en) * 2019-12-10 2022-03-10 Google Llc Scalable Real-Time Hand Tracking
US11783496B2 (en) * 2019-12-10 2023-10-10 Google Llc Scalable real-time hand tracking
WO2022166243A1 (en) * 2021-02-07 2022-08-11 青岛小鸟看看科技有限公司 Method, apparatus and system for detecting and identifying pinching gesture
US11776322B2 (en) 2021-02-07 2023-10-03 Qingdao Pico Technology Co., Ltd. Pinch gesture detection and recognition method, device and system
CN113589928A (en) * 2021-07-27 2021-11-02 东莞理工学院 Gesture recognition method for smart television
CN113589928B (en) * 2021-07-27 2023-11-24 东莞理工学院 Gesture recognition method for intelligent television
CN114581535A (en) * 2022-03-03 2022-06-03 北京深光科技有限公司 Method, device, storage medium and equipment for marking key points of user bones in image

Similar Documents

Publication Publication Date Title
CN112183424A (en) Real-time hand tracking method and system based on video
CN108399367B (en) Hand motion recognition method and device, computer equipment and readable storage medium
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
CN107679446B (en) Human face posture detection method, device and storage medium
CN108898063B (en) Human body posture recognition device and method based on full convolution neural network
Raheja et al. Android based portable hand sign recognition system
CN111444488A (en) Identity authentication method based on dynamic gesture
CN111857334A (en) Human body gesture letter recognition method and device, computer equipment and storage medium
CN113420690A (en) Vein identification method, device and equipment based on region of interest and storage medium
Ben Jmaa et al. A new approach for hand gestures recognition based on depth map captured by rgb-d camera
Hu et al. Trajectory image based dynamic gesture recognition with convolutional neural networks
Ding et al. Designs of human–robot interaction using depth sensor-based hand gesture communication for smart material-handling robot operations
CN111160308B (en) Gesture recognition method, device, equipment and readable storage medium
CN112286360A (en) Method and apparatus for operating a mobile device
Cohen et al. Recognition of continuous sign language alphabet using leap motion controller
Zamora-Mora et al. Real-time hand detection using convolutional neural networks for costa rican sign language recognition
WO2020224127A1 (en) Video stream capturing method and apparatus, and storage medium
Kadhim et al. A multimodal biometric database and case study for face recognition based deep learning
Barhate et al. A Survey of fingertip character identification in open-air using Image Processing and HCI
CN113792569B (en) Object recognition method, device, electronic equipment and readable medium
Dhamanskar et al. Human computer interaction using hand gestures and voice
Karthik et al. Survey on Gestures Translation System for Hearing Impaired People in Emergency Situation using Deep Learning Approach
CN109508523A (en) A kind of social contact method based on recognition of face
Munir et al. Hand Gesture Recognition: A Review
Pradeep et al. Advancement Of Sign Language Recognition Through Technology Using Python And OpenCV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105