CN110569817B - System and method for realizing gesture recognition based on vision - Google Patents

System and method for realizing gesture recognition based on vision

Info

Publication number
CN110569817B
CN110569817B (application number CN201910865437.4A)
Authority
CN
China
Prior art keywords
finger
hand
coordinates
value
state
Prior art date
Legal status
Active
Application number
CN201910865437.4A
Other languages
Chinese (zh)
Other versions
CN110569817A (en)
Inventor
王敬宇
孙海峰
王晶
戚琦
黄伟亭
任鹏飞
穆正阳
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201910865437.4A priority Critical patent/CN110569817B/en
Publication of CN110569817A publication Critical patent/CN110569817A/en
Application granted granted Critical
Publication of CN110569817B publication Critical patent/CN110569817B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/113 Recognition of static hand signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/117 Biometrics derived from hands

Abstract

The system for realizing gesture recognition based on vision comprises the following modules: a hand detection module, a hand posture estimation module and a gesture recognition module. The method for realizing gesture recognition based on vision comprises the following operation steps: (1) inputting the aligned RGB picture into the hand detection module to obtain the bounding box of the hand; (2) the hand posture estimation module intercepts the corresponding hand part in the depth map to obtain the 3D coordinates of the key joint points of the hand; (3) inputting the 3D coordinates of the key joint points of the hand into the gesture recognition module to obtain a digital gesture code; (4) according to the digital gesture code, similarity measurement is carried out on gestures, so that gesture recognition is realized. The system and the method have good accuracy, real-time performance and robustness.

Description

System and method for realizing gesture recognition based on vision
Technical Field
The invention relates to a system and a method for realizing gesture recognition based on vision, which belong to the technical field of information, in particular to the technical field of computer vision.
Background
With the dramatic improvement of computer computing power and the vigorous development of deep learning, artificial intelligence has shown strong vitality and development prospects in recent years. The emergence of technologies such as face recognition and voice recognition has changed the mode of human-computer interaction, but people are still exploring more accurate and efficient interaction modes that better fit human habits. With the evolution of user interfaces, and especially the rapid development of virtual reality and augmented reality technology, realizing non-contact remote human-computer interaction through gestures is regarded as the most representative and innovative interaction mode of the next generation. In the smart home field, introducing gesture control makes it more convenient to control intelligent household appliances or robots; in the fields of virtual reality and augmented reality, a stronger sense of reality can be obtained through the more natural expression of gesture operation; in the fields of games and education, user experience can be greatly enhanced; and so on.
Early gesture recognition relied on hand-worn devices for information acquisition, such as data gloves and optical marker equipment, which directly detect the spatial positions of the main joints of the hand. The detection effect is good, but the equipment is expensive, so its cost-performance ratio in common application fields is low. Wearing additional equipment can guarantee the accuracy and stability of gesture recognition, but it obscures the natural way in which gestures are expressed and imposes an additional burden on the user.
In recent years, advances in depth imaging and deep learning have led to significant breakthroughs in hand pose estimation based on depth data. First, due to the widespread use of commercial depth cameras (e.g., Microsoft Kinect and Intel RealSense), hand pose estimation technology has almost entirely shifted to using only depth input, where depth information resolves the ambiguity inherent in monocular RGB input. Second, deep learning has fundamentally changed how tasks in the vision field are solved; in particular, the convolutional neural network (CNN) has become one of the most advanced learning frameworks in the image recognition field. The key to its success is the ability to learn the complex appearance of real-world objects from large amounts of labeled data, particularly common visual features that can be used in tasks such as object detection, semantic segmentation, human pose estimation, and many others.
However, most current hand posture estimation solutions based on depth data do not have a dedicated hand detection module. Instead, the point closest to the camera is found on the depth map by traditional image processing, and the area near that point is taken as the hand. Determining the position of the hand in this way gives poor resistance to interference and low robustness, limits the position of the hand to a certain extent, is not suitable for situations in which the front of the hand is occluded, and often yields inaccurate localization when the hand moves over a large range.
Hand detection and 3D hand posture estimation based on depth data, followed by gesture recognition, have therefore become technical problems to be solved urgently in the fields of computer vision, human-computer interaction, virtual technology, and the like.
Disclosure of Invention
In view of this, the present invention is directed to a system and a method for implementing gesture recognition based on computer vision, which achieve the purpose that a user directly interacts with a computer or a virtual scene without the help of a third-party tool, and thus obtain an immersive user experience.
In order to achieve the above object, the present invention provides a system for implementing gesture recognition based on vision, which includes the following modules:
the hand detection module: the function of the module is to obtain the bounding box of the hand from the input aligned RGB picture; the module is modified based on an SSD network and consists of three parts: a basic network sub-module, an additional layer sub-module and a prediction layer sub-module;
the basic network sub-module has the main functions of completing feature extraction, generating a feature map with a larger resolution, and generating a default bounding box with a set size and a set aspect ratio by using the feature map; the submodule is modified based on a VGG16 network, and specifically comprises the following steps: the submodule uses all convolutional layers of the VGG16 network, and two fully-connected layers of the VGG16 network are replaced by two common convolutional layers;
the additional layer submodule consists of a series of convolution layers, two convolution layers form a group, and the submodule has the main function of generating a feature map with smaller resolution and generating a default bounding box with set size and set aspect ratio by using the feature map;
the prediction layer submodule consists of convolution layers, and the submodule has the main function of performing two convolution filtering processes on each feature map, and predicting the position offset of a default boundary box on the feature map and the category confidence of the default boundary box respectively, namely the probability that the default boundary box contains a hand.
The convolution layer for predicting the position offset of the default bounding box consists of 4 × q convolution kernels with the size of 3 × 3 × p, wherein the parameter q is the number of default bounding boxes generated at each point of the feature map, and the parameter p is the number of channels of the feature map;
the convolutional layer predicting the confidence of the default bounding box class consists of c × q convolutional kernels of size 3 × 3 × p, where the parameter c is the total number of classes predicted.
A hand pose estimation module: the function of the module is: utilizing the bounding box of the hand obtained by the hand detection module to carry out data preprocessing on the depth map corresponding to the aligned RGB picture, intercepting the corresponding hand part in the depth map, and inputting it into a hand posture estimation network to obtain the 3D coordinates of the key joint points of the hand; the 3D coordinates of the key joint points are the positions of the joint points in the image coordinate system and can be converted into the camera coordinate system; after the camera is calibrated, the camera coordinate system is the world coordinate system; in order to improve robustness, when the hand detection module does not detect a hand, the hand posture estimation module intercepts points within a certain depth threshold of the depth map as the hand part; the hand pose estimation module directly uses the Resnet18 network to predict the 3D coordinates of the key joint points of the hand in the image coordinate system, i.e., (u, v, d);
a gesture recognition module: based on the output of the hand posture estimation module, namely the 3D coordinates of the key joint points of the hand in the image coordinate system, the module identifies the state of each single finger and the relationships between fingers and outputs a digital gesture code; according to the digital gesture code, similarity measurement is carried out on gestures, so that gesture recognition is realized; the recognition precision mainly depends on the precision of the hand posture estimation module, and the module is strongly robust to the angle between the palm and the camera, the size of the palm, and the like;
the specific content of how the hand detection module generates the feature maps and generates the default bounding boxes with the set sizes and set aspect ratios is as follows:
the hand detection module extracts a plurality of feature maps with different resolutions from each convolution layer, and generates q default bounding boxes with a set size and a set aspect ratio at each point of the feature maps by using the feature maps.
A lower-level feature map has a higher resolution, and the default bounding boxes generated on it are smaller and responsible for detecting small objects; a higher-level feature map has a lower resolution, and the default bounding boxes generated on it are larger and responsible for detecting large objects; combining default bounding boxes of various sizes improves the robustness of the system to the sizes of the detected objects;
the specific content of how the hand posture estimation module intercepts points within a certain depth threshold of the depth map as the hand part when the hand detection module does not detect the hand is as follows:
because the hand detection module may have incomplete hand detection or no hand detection, directly intercepting the depth value in the bounding box of the hand may cause the depth value of the hand to be seriously lost, so that the depth map needs to be preprocessed according to the bounding box of the hand, and a reasonable hand area is intercepted;
the specific method comprises the following steps:
when hand detection is incomplete, the coordinates (u_o, v_o) of the center point of the bounding box of the hand are calculated, and the average depth value d_o of the points in the depth map region corresponding to the bounding box of the hand is calculated, forming the point coordinates (u_o, v_o, d_o) in the image coordinate system; the point (u_o, v_o, d_o) is then converted from the image coordinate system into the camera coordinate system and used as the center of a cubic bounding box of fixed size; the hand posture estimation module intercepts the hand region with this cubic bounding box, keeps the points inside the box at their original values and sets the points outside the box as background points, and then converts the points inside the box back into the image coordinate system as the hand region for hand posture estimation; the size of the cubic bounding box can be set as required so as to suit hands of different shapes;
when the hand is not detected, sampling some points closest to the camera as the area of the hand to carry out hand posture estimation;
the specific content of how the gesture recognition module recognizes the state of each single finger and the interphalangeal relationships according to the 3D coordinates of the key joint points of the hand in the image coordinate system is as follows:
the state of a single finger is determined by the variance and relative values of the x, y, z coordinates of the key joint points on the finger, i.e.:
when the finger state is upward, the variance of the x and z coordinates of the key joint points on the finger is small, and the variance of the y coordinates is large;
when the finger state is bent, the variance of the x coordinates of the key joint points on the finger is small, and the variance of the y and z coordinates is large;
when the finger state is forward (for the four fingers other than the thumb), the variance of the x and y coordinates of the key joint points on the finger is small, and the variance of the z coordinates is large;
when the finger state is sideways (thumb only), the variance of the x and y coordinates of the key joint points on the finger is large, and the variance of the z coordinates is small;
when the finger state is semi-closed, the variances of the y and z coordinates of the key joint points on the finger are large and close to each other, and the variance of the x coordinates is small;
when the finger state is closed, the y coordinates of the key joint points on the finger, from the fingertip to the palm, are not monotonically increasing.
The interphalangeal relationship is determined by the states of the two fingers and the relative coordinates of their corresponding joint points, namely:
when the interphalangeal relationship of the two fingers is combined, the states of both fingers are upward and the difference of the x coordinate values of the corresponding joint points of the two fingers is small;
when the interphalangeal relationship of the two fingers is separated, the difference of the x coordinate values of the corresponding joint points of the two fingers is large;
when the interphalangeal relationship of the two fingers is crossed, the differences of the x coordinates of the corresponding joint points of the two fingers have both positive and negative values;
when the interphalangeal relationship of the two fingers is a loop, the x coordinates of the joint points at the fingertip and near the palm are close, and the x coordinates of the other corresponding joint points are farther apart.
The content of the digital gesture code is:
the digital gesture code is a vector consisting of 12 numbers, specifically: (f_1, f_2, f_3, f_4, f_5, f_12, f_13, f_14, f_15, f_23, f_34, f_45)^T, wherein the element f_i represents the state of a single finger, with subscript i ∈ {1, 2, 3, 4, 5}; specifically, f_1 represents the state of the thumb, f_2 the state of the index finger, f_3 the state of the middle finger, f_4 the state of the ring finger, and f_5 the state of the little finger; the element f_ij represents the relationship between two fingers, with subscripts i ∈ {1, 2, 3, 4, 5} and j ∈ {2, 3, 4, 5}; specifically, f_12 represents the interphalangeal relationship between the thumb and the index finger, f_13 between the thumb and the middle finger, f_14 between the thumb and the ring finger, f_15 between the thumb and the little finger, f_23 between the index finger and the middle finger, f_34 between the middle finger and the ring finger, and f_45 between the ring finger and the little finger;
the specific values are as follows:
f_i = 1 indicates that the finger state is upward; f_i = 2 indicates that the finger state is bent; f_i = 3 indicates that the finger state is forward; f_i = 4 indicates that the finger state is sideways; f_i = 5 indicates that the finger state is semi-closed; f_i = 6 indicates that the finger state is closed; f_i = 0 indicates undefined;
f_ij = 1 indicates that the interphalangeal relationship is separated; f_ij = 2 indicates that the interphalangeal relationship is combined; f_ij = 3 indicates that the interphalangeal relationship is crossed; f_ij = 4 indicates that the interphalangeal relationship is a loop; f_ij = 0 indicates undefined;
the invention also provides a method for realizing gesture recognition based on vision, which comprises the following operation steps:
(1) inputting the aligned RGB pictures into a hand detection module to obtain a hand boundary frame;
(2) the hand posture estimation module carries out data preprocessing on the depth map corresponding to the aligned RGB picture by using the bounding box of the hand, and intercepts the corresponding hand part in the depth map to obtain the 3D coordinates of the key joint points of the hand; in order to improve robustness, when the hand detection module does not detect the hand, points within a certain depth threshold of the depth map are intercepted as the hand part;
(3) inputting the 3D coordinates of the key joint points of the hand part into a gesture recognition module to obtain digital gesture codes;
(4) according to the digital gesture codes, carrying out similarity measurement on the gestures, thereby realizing gesture recognition.
The specific content of step (4) is as follows: similarity measurement is carried out on two gestures by calculating the L1-norm distance d between their two digital gesture codes, as shown in the following formula:
d = Σ_i |x_i − y_i|
in the above formula, x = (x_1, x_2, …, x_n)^T and y = (y_1, y_2, …, y_n)^T are the digital gesture codes of the two gestures, respectively; a smaller d indicates greater similarity between the two gestures, and when d is smaller than a set threshold, the two gestures are judged to be the same, thereby realizing gesture recognition.
In step (1), step (2) and step (3), the RGB picture that is the input of the system, the depth map corresponding to the RGB picture, and the 3D coordinates of the key joint points of the hand that are the output of the system all belong to the same coordinate system, i.e., the image coordinate system, which improves the recognition accuracy and maintains the stability of the accuracy; the 3D coordinates of the key joint points of the hand are the positions of the joint points in the image coordinate system and can be converted into the camera coordinate system; after the camera is calibrated, the camera coordinate system is the world coordinate system;
the invention has the beneficial effects that: by adding the real-time hand detection module, the problem that the hand detection is lost in a hand posture estimation scheme based on a depth map and the hand posture estimation scheme is difficult to apply to a complex scene is solved, particularly, the position of a hand can be accurately determined under the condition that the front of the hand is shielded, and the system has good accuracy, real-time performance and robustness; the invention also provides an implementation scheme of gesture coding, improves the practicability of gesture recognition, and provides a new idea for the landing application of gesture control.
Drawings
FIG. 1 is a block diagram of a system for visually recognizing gestures according to the present invention;
FIG. 2 is a network diagram of a hand detection module of the system for visually recognizing gestures according to the present invention;
FIG. 3 is a schematic diagram of 14 key joint points of a hand in an embodiment of the present invention;
FIG. 4 is a single finger state diagram;
FIG. 5 is a schematic diagram of the interphalangeal relationships;
FIG. 6 is a flow chart of a method for implementing gesture recognition based on vision according to the present invention;
FIG. 7 is a graph showing the results of an experiment according to an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
Referring to fig. 1, a system for visually recognizing gestures according to the present invention is described, the system comprising the following modules:
the hand detection module: the function of this module is to derive the hand bounding box from the input aligned RGB picture (color picture in RGB color space); the module is modified based on the SSD network (for SSD networks see the documents Liu Wei, Anguelov Dragomir, Erhan Dumitru, et al. SSD: Single Shot MultiBox Detector. in: ECCV.2016: 21-37.). Referring to fig. 2, the module consists of three parts: a basic network sub-module, an additional layer sub-module and a prediction layer sub-module;
the basic network sub-module has the main functions of completing feature extraction, generating feature maps with larger resolution, and generating default bounding boxes with set sizes and set aspect ratios from those feature maps; the submodule is modified based on a VGG16 network, and specifically: the submodule uses all the convolutional layers of the VGG16 network and replaces the two fully-connected layers of the VGG16 network with two common convolutional layers (i.e. the Conv6 layer and the Conv7 layer in fig. 2); for the VGG16 network, please refer to the document K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
The additional layer submodule consists of a series of convolution layers, two convolution layers form a group, and the submodule has the main function of generating a feature map with smaller resolution and generating a default bounding box with set size and set aspect ratio by using the feature map;
the prediction layer submodule consists of convolution layers, and the submodule has the main function of performing two convolution filtering processes on each feature map, and predicting the position offset of a default boundary box on the feature map and the category confidence of the default boundary box respectively, namely the probability that the default boundary box contains a hand.
The convolutional layer for predicting the position offset of the default bounding box is composed of 4 × q convolutional kernels with the size of 3 × 3 × p, wherein the parameter q is the number of default bounding boxes (4 or 6 in the embodiment) generated at each point of the feature map, and the parameter p is the number of channels of the feature map;
the convolutional layer predicting the confidence of default bounding box classes consists of c × q convolutional kernels of size 3 × 3 × p, where the parameter c is the total number of classes predicted (2 in the embodiment, i.e., hand and background classes).
A hand pose estimation module: the function of the module is: utilizing the bounding box of the hand obtained by the hand detection module to carry out data preprocessing on the depth map corresponding to the aligned RGB picture, and intercepting the corresponding hand part in the depth map to obtain the 3D coordinates of the key joint points of the hand; the 3D coordinates of the key joint points are the positions of the joint points in the image coordinate system and can be converted into the camera coordinate system; after the camera is calibrated, the camera coordinate system is the world coordinate system; in order to improve robustness, when the hand detection module does not detect a hand, the hand posture estimation module intercepts points within a certain depth threshold of the depth map as the hand part; the hand pose estimation module directly uses the Resnet18 network to predict the 3D coordinates of the key joint points of the hand in the image coordinate system, i.e., (u, v, d); for the Resnet18 network, please refer to the reference: He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep residual learning for image recognition. In: CVPR. 2016: 770-778.
Referring to fig. 3, in the embodiment there are 14 key joint points of the hand, as shown in fig. 3.
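To make the regression target of the hand pose estimation module concrete, the following is a minimal sketch, assuming a PyTorch/torchvision setup, of how a Resnet18 backbone could be adapted to output the 14 × 3 joint coordinates (u, v, d) from a cropped depth patch; the input size, single-channel first layer and other hyperparameters are illustrative assumptions, not the patent's actual implementation.

```python
# Hypothetical sketch: ResNet18 adapted to regress 14 hand joints as (u, v, d).
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_JOINTS = 14

model = resnet18(weights=None)
# The depth crop is a single-channel image, so the first convolution is replaced (assumption).
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# The classification head is replaced by a 14*3 regression head.
model.fc = nn.Linear(model.fc.in_features, NUM_JOINTS * 3)

depth_crop = torch.randn(1, 1, 128, 128)             # preprocessed hand depth patch (assumed size)
joints = model(depth_crop).view(-1, NUM_JOINTS, 3)   # (batch, 14, 3) -> (u, v, d) per joint
```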
A gesture recognition module: based on the output of the hand posture estimation module, namely the 3D coordinates of the key joint points of the hand in the image coordinate system, the module identifies the state of each single finger and the relationships between fingers and outputs a digital gesture code; according to the digital gesture code, similarity measurement is carried out on gestures, so that gesture recognition is realized; the recognition precision mainly depends on the precision of the hand posture estimation module, and the module is strongly robust to the angle between the palm and the camera, the size of the palm, and the like;
the specific content of how the hand detection module generates the feature maps and generates the default bounding boxes with the set sizes and set aspect ratios is as follows:
the hand detection module extracts a plurality of feature maps with different resolutions from each convolution layer, and generates q default bounding boxes with a set size and a set aspect ratio at each point of the feature maps by using the feature maps.
The resolution of the feature map of a lower level is higher, and the generated default bounding box is smaller and is responsible for detecting small objects; the resolution of the feature map of a higher level is smaller, the generated default bounding box is larger, and the default bounding box is responsible for detecting large objects and combining the default bounding boxes of various sizes so as to improve the robustness of the system to the sizes of the detected objects;
referring to fig. 2, in the embodiment six feature maps are extracted, i.e., Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2. For a network input image with a resolution of 300 × 300, the resolutions of the six feature maps are 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1 respectively. On each feature map cell, q default bounding boxes with different aspect ratios are generated; for the six feature maps, q is 4, 6, 6, 6, 4 and 4 respectively, and the aspect ratios are taken from α ∈ {1, 2, 1/2} (when q = 4) or α ∈ {1, 2, 3, 1/2, 1/3} (when q = 6). Each feature map then has its corresponding convolution kernels in the prediction layer sub-module; after the convolution operation is applied to the feature map, the position offset and the category confidence of each default bounding box are obtained.
At each point of the feature map, 2 square default bounding boxes (corresponding to the aspect ratio α = 1) and either 2 (corresponding to α ∈ {2, 1/2}) or 4 (corresponding to α ∈ {2, 3, 1/2, 1/3}) rectangular default bounding boxes are generated. The width of a default bounding box is s_k·√α and its height is s_k/√α, where s_k = s_min + (s_max − s_min)/(m − 1) × (k − 1), m is the total number of the feature maps (here m = 6), s_min = 0.2 and s_max = 0.9, i.e. s_k = 0.2 + 0.14 × (k − 1). When α = 1, one additional square bounding box with side length s'_k = √(s_k · s_{k+1}) is generated.
For a feature map with a resolution of n × n, q default bounding boxes are generated at each point, so one feature map generates n × n × q default bounding boxes; in this embodiment, the 6 feature maps generate 38 × 38 × 4 + 19 × 19 × 6 + 10 × 10 × 6 + 5 × 5 × 6 + 3 × 3 × 4 + 1 × 1 × 4 = 8732 default bounding boxes in total.
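The bookkeeping above can be checked with a short script; the following sketch (plain Python, not part of the patent) computes the scales s_k and the total number of default bounding boxes for the embodiment, with the handling of the last level's extra square scale stated as an assumption.

```python
# Sketch reproducing the default-box scales and counts of the embodiment.
import math

s_min, s_max, m = 0.2, 0.9, 6
feature_map_sizes = [38, 19, 10, 5, 3, 1]   # Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2
boxes_per_cell    = [4, 6, 6, 6, 4, 4]      # q for each feature map

def scale(k):
    # s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), k = 1..m
    return s_min + (s_max - s_min) / (m - 1) * (k - 1)

for k, (n, q) in enumerate(zip(feature_map_sizes, boxes_per_cell), start=1):
    s_k = scale(k)
    # width = s_k * sqrt(alpha), height = s_k / sqrt(alpha); alpha = 1 adds sqrt(s_k * s_{k+1})
    extra = math.sqrt(s_k * scale(k + 1)) if k < m else math.sqrt(s_k * 1.0)  # last level: assumption
    print(f"level {k}: s_k={s_k:.2f}, extra square scale={extra:.2f}, boxes={n * n * q}")

total = sum(n * n * q for n, q in zip(feature_map_sizes, boxes_per_cell))
print("total default boxes:", total)  # 8732
```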
The specific content of how the hand posture estimation module intercepts points within a certain depth threshold of the depth map as the hand part when the hand detection module does not detect the hand is as follows:
because the hand detection module may have a situation that the hand detection is incomplete (e.g., a part of the hand is missing, but the bounding box of the hand is approximately correct) or the hand cannot be detected, directly intercepting the depth value in the bounding box of the hand may cause the depth value of the hand to be seriously missing, so that the depth map needs to be preprocessed according to the bounding box of the hand, and a reasonable area of the hand is intercepted;
the specific method comprises the following steps:
when hand detection is incomplete, the coordinates (u_o, v_o) of the center point of the bounding box of the hand are calculated, and the average depth value d_o of the points in the depth map region corresponding to the bounding box of the hand is calculated, forming the point coordinates (u_o, v_o, d_o) in the image coordinate system; the point (u_o, v_o, d_o) is then converted from the image coordinate system into the camera coordinate system and used as the center of a cubic bounding box of fixed size; the hand posture estimation module intercepts the hand region with this cubic bounding box, keeps the points inside the box at their original values and sets the points outside the box as background points, and then converts the points inside the box back into the image coordinate system as the hand region for hand posture estimation; the size of the cubic bounding box can be set as required so as to suit hands of different shapes;
when the hand is not detected, sampling some points closest to the camera as the area of the hand to carry out hand posture estimation;
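A minimal sketch of the depth-map preprocessing described above is given below; it assumes a pinhole camera model with known intrinsics (fx, fy, cx, cy) and uses NumPy. The function names, cube size and fallback threshold are illustrative assumptions rather than values fixed by the patent.

```python
# Hypothetical sketch of cropping the hand region from the depth map with a fixed-size cube.
import numpy as np

def crop_hand_region(depth, bbox, fx, fy, cx, cy, cube_mm=250.0):
    """depth: depth map in mm; bbox: (u_min, v_min, u_max, v_max) from the hand detector."""
    u_min, v_min, u_max, v_max = bbox
    u_o, v_o = (u_min + u_max) / 2.0, (v_min + v_max) / 2.0
    patch = depth[int(v_min):int(v_max), int(u_min):int(u_max)]
    d_o = float(np.mean(patch[patch > 0]))          # average depth inside the bounding box

    # Center (u_o, v_o, d_o) converted to camera coordinates (pinhole model).
    x_o = (u_o - cx) * d_o / fx
    y_o = (v_o - cy) * d_o / fy
    half = cube_mm / 2.0

    # Keep points that fall inside the cube around the center; others become background.
    v_idx, u_idx = np.indices(depth.shape)
    x = (u_idx - cx) * depth / fx
    y = (v_idx - cy) * depth / fy
    inside = (np.abs(x - x_o) < half) & (np.abs(y - y_o) < half) \
             & (np.abs(depth - d_o) < half) & (depth > 0)
    return np.where(inside, depth, 0.0)

def fallback_nearest_points(depth, threshold_mm=150.0):
    """Fallback when no hand is detected: keep points within a depth band closest to the camera."""
    valid = depth[depth > 0]
    if valid.size == 0:
        return np.zeros_like(depth)
    d_near = valid.min()
    return np.where((depth > 0) & (depth < d_near + threshold_mm), depth, 0.0)
```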
the specific content of how the gesture recognition module recognizes the state of each single finger and the interphalangeal relationships according to the 3D coordinates of the key joint points of the hand in the image coordinate system is as follows:
referring to fig. 4, fig. 4 shows a single finger state, which sequentially from left to right: upward, forward, sideways, curved, semi-closed, and closed.
The state of a single finger is determined by the variance and relative values of the x, y, z coordinates of the key joint points on the finger, i.e.:
when the finger state is upward, the variance of the x and z coordinates of the key joint points on the finger is small, and the variance of the y coordinates is large;
when the finger state is bent, the variance of the x coordinates of the key joint points on the finger is small, and the variance of the y and z coordinates is large;
when the finger state is forward (for the four fingers other than the thumb), the variance of the x and y coordinates of the key joint points on the finger is small, and the variance of the z coordinates is large;
when the finger state is sideways (thumb only), the variance of the x and y coordinates of the key joint points on the finger is large, and the variance of the z coordinates is small;
when the finger state is semi-closed, the variances of the y and z coordinates of the key joint points on the finger are large and close to each other, and the variance of the x coordinates is small;
when the finger state is closed, the y coordinates of the key joint points on the finger, from the fingertip to the palm, are not monotonically increasing.
Referring to fig. 5, the interphalangeal relationships are shown, from left to right: combined, separated, crossed and loop.
The interphalangeal relationship is determined by the states of the two fingers and the relative coordinates of their corresponding joint points, namely:
when the interphalangeal relationship of the two fingers is combined, the states of both fingers are upward and the difference of the x coordinate values of the corresponding joint points of the two fingers is small;
when the interphalangeal relationship of the two fingers is separated, the difference of the x coordinate values of the corresponding joint points of the two fingers is large;
when the interphalangeal relationship of the two fingers is crossed, the differences of the x coordinates of the corresponding joint points of the two fingers have both positive and negative values;
when the interphalangeal relationship of the two fingers is a loop, the x coordinates of the joint points at the fingertip and near the palm are close, and the x coordinates of the other corresponding joint points are farther apart.
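The variance-based rules above can be expressed as a small rule-based classifier. The following is a minimal sketch that assumes each finger is given as an ordered array of its key joint 3D coordinates (from fingertip to palm) and uses illustrative thresholds that the patent does not specify; it shows the single-finger decision plus one example interphalangeal check.

```python
# Hypothetical sketch of the rule-based single-finger state decision described above.
import numpy as np

def finger_state(joints, is_thumb=False, close=1e-4, large=1e-2):
    """joints: (k, 3) array of (x, y, z) for one finger, ordered fingertip -> palm.
    The thresholds `close`/`large` are illustrative assumptions, not values from the patent."""
    var_x, var_y, var_z = np.var(joints, axis=0)
    y = joints[:, 1]
    if var_x < large and var_z < large and var_y > large:
        return "upward"                       # f_i = 1
    if var_x < large and var_y > large and var_z > large:
        return "bent"                         # f_i = 2
    if not is_thumb and var_x < large and var_y < large and var_z > large:
        return "forward"                      # f_i = 3
    if is_thumb and var_x > large and var_y > large and var_z < large:
        return "sideways"                     # f_i = 4
    if var_y > large and var_z > large and abs(var_y - var_z) < close and var_x < large:
        return "semi-closed"                  # f_i = 5
    if np.any(np.diff(y) < 0):                # y not monotonically increasing fingertip -> palm
        return "closed"                       # f_i = 6
    return "undefined"                        # f_i = 0

def fingers_combined(joints_a, joints_b, tol=10.0):
    """Illustrative check of the 'combined' relation: both fingers upward and the x coordinates
    of corresponding joints differ only slightly (tol is an assumed pixel tolerance)."""
    dx = np.abs(joints_a[:, 0] - joints_b[:, 0])
    return (finger_state(joints_a) == "upward"
            and finger_state(joints_b) == "upward"
            and np.all(dx < tol))
```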
Referring to Table 1, the digital gesture code is:
TABLE 1 (reproduced as an image in the original publication; its contents are restated below)
The digital gesture code is a vector consisting of 12 numbers, specifically: (f_1, f_2, f_3, f_4, f_5, f_12, f_13, f_14, f_15, f_23, f_34, f_45)^T, wherein the element f_i represents the state of a single finger, with subscript i ∈ {1, 2, 3, 4, 5}; specifically, f_1 represents the state of the thumb, f_2 the state of the index finger, f_3 the state of the middle finger, f_4 the state of the ring finger, and f_5 the state of the little finger; the element f_ij represents the relationship between two fingers, with subscripts i ∈ {1, 2, 3, 4, 5} and j ∈ {2, 3, 4, 5}; specifically, f_12 represents the interphalangeal relationship between the thumb and the index finger, f_13 between the thumb and the middle finger, f_14 between the thumb and the ring finger, f_15 between the thumb and the little finger, f_23 between the index finger and the middle finger, f_34 between the middle finger and the ring finger, and f_45 between the ring finger and the little finger;
the specific values are as follows:
f_i = 1 indicates that the finger state is upward; f_i = 2 indicates that the finger state is bent; f_i = 3 indicates that the finger state is forward; f_i = 4 indicates that the finger state is sideways; f_i = 5 indicates that the finger state is semi-closed; f_i = 6 indicates that the finger state is closed; f_i = 0 indicates undefined;
f_ij = 1 indicates that the interphalangeal relationship is separated; f_ij = 2 indicates that the interphalangeal relationship is combined; f_ij = 3 indicates that the interphalangeal relationship is crossed; f_ij = 4 indicates that the interphalangeal relationship is a loop; f_ij = 0 indicates undefined;
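To make the encoding concrete, the following sketch assembles the 12-element digital gesture code as a plain list following the element order and value conventions restated above; the example gesture (a "V"/victory sign) and the helper names are illustrative assumptions, not encodings taken from the patent.

```python
# Hypothetical sketch of assembling a digital gesture code.
# Order: (f1, f2, f3, f4, f5, f12, f13, f14, f15, f23, f34, f45)
FINGER_STATE = {"undefined": 0, "upward": 1, "bent": 2, "forward": 3,
                "sideways": 4, "semi-closed": 5, "closed": 6}
FINGER_RELATION = {"undefined": 0, "separated": 1, "combined": 2, "crossed": 3, "loop": 4}

def make_gesture_code(states, relations):
    """states: 5 single-finger states (thumb .. little finger);
    relations: 7 interphalangeal relations in the order (12, 13, 14, 15, 23, 34, 45)."""
    return [FINGER_STATE[s] for s in states] + [FINGER_RELATION[r] for r in relations]

# Example (assumed): a "V" gesture - index and middle finger upward and separated, others closed.
v_sign = make_gesture_code(
    ["closed", "upward", "upward", "closed", "closed"],
    ["separated", "separated", "undefined", "undefined", "separated", "separated", "undefined"],
)
print(v_sign)  # e.g. [6, 1, 1, 6, 6, 1, 1, 0, 0, 1, 1, 0]
```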
referring to fig. 6, a method for visually recognizing gestures according to the present invention is described, the method comprising the following steps:
(1) inputting the aligned RGB pictures into a hand detection module to obtain a hand boundary frame;
(2) the hand posture estimation module carries out data preprocessing on the depth map corresponding to the aligned RGB picture by using the bounding box of the hand, and intercepts the corresponding hand part in the depth map to obtain the 3D coordinates of the key joint points of the hand; in order to improve robustness, when the hand detection module does not detect the hand, points within a certain depth threshold of the depth map are intercepted as the hand part;
(3) inputting the 3D coordinates of the key joint points of the hand part into a gesture recognition module to obtain digital gesture codes;
(4) according to the digital gesture codes, carrying out similarity measurement on the gestures, thereby realizing gesture recognition.
The specific content of step (4) is as follows: similarity measurement is carried out on two gestures by calculating the L1-norm distance d between their two digital gesture codes, as shown in the following formula:
d = Σ_i |x_i − y_i|
in the above formula, x = (x_1, x_2, …, x_n)^T and y = (y_1, y_2, …, y_n)^T are the digital gesture codes of the two gestures, respectively; a smaller d indicates greater similarity between the two gestures, and when d is smaller than a set threshold, the two gestures are judged to be the same, thereby realizing gesture recognition.
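A sketch of this similarity measurement is given below, assuming the gesture codes are plain integer sequences; the threshold value is an illustrative assumption, since the patent only requires it to be "a set threshold".

```python
# Hypothetical sketch of gesture matching by L1-norm distance between two gesture codes.
def l1_distance(x, y):
    # d = sum_i |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

def same_gesture(x, y, threshold=2):
    # threshold is an assumed value; only "smaller than a set threshold" is specified above
    return l1_distance(x, y) < threshold
```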
In step (1), step (2) and step (3), the RGB picture that is the input of the system, the depth map corresponding to the RGB picture, and the 3D coordinates of the key joint points of the hand that are the output of the system all belong to the same coordinate system, i.e., the image coordinate system, which improves the recognition accuracy and maintains the stability of the accuracy; the 3D coordinates of the key joint points of the hand are the positions of the joint points in the image coordinate system and can be converted into the camera coordinate system; after the camera is calibrated, the camera coordinate system is the world coordinate system;
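For completeness, the image-to-camera conversion mentioned above can be written as the usual pinhole back-projection; the following sketch assumes known camera intrinsics (fx, fy, cx, cy) and is not taken from the patent.

```python
# Hypothetical sketch: converting a joint (u, v, d) from image coordinates to camera coordinates.
def image_to_camera(u, v, d, fx, fy, cx, cy):
    """(u, v) pixel coordinates, d depth; returns (X, Y, Z) in the camera coordinate system,
    which equals the world coordinate system once the camera is calibrated (as stated above)."""
    X = (u - cx) * d / fx
    Y = (v - cy) * d / fy
    Z = d
    return X, Y, Z
```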
the inventor carries out a large number of experiments on the proposed system and method of the present invention using EgoHands (see http:// vision. sic. indiana. edu/projects/EgoHands /) and NYU dataset (see https:// jonathantompson. githu. io/NYU _ Handhan _ Pose _ Dataset. htm), respectively, and finally carries out system tests on the real dataset collected by the Intel Realsense D415 camera, the experimental results are shown in FIG. 7, the first column is a bounding box output by the Hand detection module, the second column is a depth map of the Hand region obtained by the Hand Pose module after data preprocessing, the third column is positions of 14 key joint points of the Hand output by the Hand Pose module, the fourth column is corresponding digital gesture codes, experiments prove that the system and method of the present invention have high accuracy and strong robustness when the actual scene is applied, and the average processing time of Nvisla is 030.100. GPU, about 30 frames/second, substantially meeting the real-time requirements.

Claims (6)

1. The system for realizing gesture recognition based on vision is characterized in that: the system comprises the following modules:
the hand detection module: the function of the module is to obtain the bounding box of the hand from the input aligned RGB picture; the module is modified based on an SSD network and consists of three parts: a basic network sub-module, an additional layer sub-module and a prediction layer sub-module;
the basic network sub-module has the main functions of completing feature extraction, generating a feature map with a larger resolution, and generating a default bounding box with a set size and a set aspect ratio by using the feature map; the submodule is modified based on a VGG16 network, and specifically comprises the following steps: the submodule uses all convolutional layers of the VGG16 network, and two fully-connected layers of the VGG16 network are replaced by two common convolutional layers;
the additional layer submodule consists of a series of convolution layers, two convolution layers form a group, and the submodule has the main function of generating a feature map with smaller resolution and generating a default bounding box with set size and set aspect ratio by using the feature map;
the prediction layer submodule consists of convolution layers and has the main function of performing two convolution filtering treatments on each feature map and respectively predicting the position offset of a default boundary box on the feature map and the category confidence coefficient of the default boundary box, namely the probability that the default boundary box contains a hand;
the convolution layer for predicting the position offset of the default bounding box consists of 4 × q convolution kernels with the size of 3 × 3 × p, wherein the parameter q is the number of default bounding boxes generated at each point of the feature map, and the parameter p is the number of channels of the feature map;
the convolution layer predicting the category confidence of the default bounding box consists of c × q convolution kernels with the size of 3 × 3 × p, wherein the parameter c is the total number of the predicted categories;
a hand pose estimation module: the function of the module is: utilizing the bounding box of the hand obtained by the hand detection module to carry out data preprocessing on the depth map corresponding to the aligned RGB picture, and intercepting the corresponding hand part in the depth map to obtain the 3D coordinates of the key joint points of the hand; the 3D coordinates of the key joint points are the positions of the joint points in the image coordinate system and can be converted into the camera coordinate system; after the camera is calibrated, the camera coordinate system is the world coordinate system; in order to improve robustness, when the hand detection module does not detect a hand, the hand posture estimation module intercepts points within a certain depth threshold of the depth map as the hand part; the hand pose estimation module directly uses the Resnet18 network to predict the 3D coordinates of the key joint points of the hand in the image coordinate system, i.e., (u, v, d);
a gesture recognition module: based on the output of the hand posture estimation module, namely the 3D coordinates of the key joint points of the hand in the image coordinate system, the module identifies the state of each single finger and the relationships between fingers and outputs a digital gesture code; according to the digital gesture code, similarity measurement is carried out on gestures, so that gesture recognition is realized; the recognition precision mainly depends on the precision of the hand posture estimation module, and the module is strongly robust to the angle between the palm and the camera, the size of the palm, and the like;
the content of the digital gesture code is:
the digital gesture code is a vector consisting of 12 numbers, specifically: (f_1, f_2, f_3, f_4, f_5, f_12, f_13, f_14, f_15, f_23, f_34, f_45)^T, wherein the element f_i represents the state of a single finger, with subscript i ∈ {1, 2, 3, 4, 5}; specifically, f_1 represents the state of the thumb, f_2 the state of the index finger, f_3 the state of the middle finger, f_4 the state of the ring finger, and f_5 the state of the little finger; the element f_ij represents the relationship between two fingers, with subscripts i ∈ {1, 2, 3, 4, 5} and j ∈ {2, 3, 4, 5}; specifically, f_12 represents the interphalangeal relationship between the thumb and the index finger, f_13 between the thumb and the middle finger, f_14 between the thumb and the ring finger, f_15 between the thumb and the little finger, f_23 between the index finger and the middle finger, f_34 between the middle finger and the ring finger, and f_45 between the ring finger and the little finger;
the specific values are as follows:
f_i = 1 indicates that the finger state is upward; f_i = 2 indicates that the finger state is bent; f_i = 3 indicates that the finger state is forward; f_i = 4 indicates that the finger state is sideways; f_i = 5 indicates that the finger state is semi-closed; f_i = 6 indicates that the finger state is closed; f_i = 0 indicates undefined;
f_ij = 1 indicates that the interphalangeal relationship is separated; f_ij = 2 indicates that the interphalangeal relationship is combined; f_ij = 3 indicates that the interphalangeal relationship is crossed; f_ij = 4 indicates that the interphalangeal relationship is a loop; f_ij = 0 indicates undefined.
2. The system for realizing gesture recognition based on vision according to claim 1, wherein: the specific content of how the hand detection module generates the feature maps and generates the default bounding boxes with the set sizes and set aspect ratios is as follows:
the hand detection module extracts a plurality of feature maps with different resolutions from each convolution layer, and generates q default bounding boxes with set size and set aspect ratio at each point of the feature maps by using the feature maps;
a lower-level feature map has a higher resolution, and the default bounding boxes generated on it are smaller and responsible for detecting small objects; a higher-level feature map has a lower resolution, and the default bounding boxes generated on it are larger and responsible for detecting large objects; combining default bounding boxes of various sizes improves the robustness of the system to the sizes of the detected objects.
3. The system for realizing gesture recognition based on vision according to claim 1, wherein: the specific content of how the hand posture estimation module intercepts points within a certain depth threshold of the depth map as the hand part when the hand detection module does not detect the hand is as follows:
because the hand detection module may have incomplete hand detection or no hand detection, directly intercepting the depth value in the bounding box of the hand may cause the depth value of the hand to be seriously lost, so that the depth map needs to be preprocessed according to the bounding box of the hand, and a reasonable hand area is intercepted;
the specific method comprises the following steps:
when hand detection is incomplete, the coordinates (u_o, v_o) of the center point of the bounding box of the hand are calculated, and the average depth value d_o of the points in the depth map region corresponding to the bounding box of the hand is calculated, forming the point coordinates (u_o, v_o, d_o) in the image coordinate system; the point (u_o, v_o, d_o) is then converted from the image coordinate system into the camera coordinate system and used as the center of a cubic bounding box of fixed size; the hand posture estimation module intercepts the hand region with this cubic bounding box, keeps the points inside the box at their original values and sets the points outside the box as background points, and then converts the points inside the box back into the image coordinate system as the hand region for hand posture estimation; the size of the cubic bounding box can be set as required so as to suit hands of different shapes;
when the hands are not detected, some points closest to the camera are sampled to be used as the areas of the hands for hand posture estimation.
4. The system for realizing gesture recognition based on vision according to claim 1, wherein: the specific content of how the gesture recognition module recognizes the state of each single finger and the interphalangeal relationships according to the 3D coordinates of the key joint points of the hand in the image coordinate system is as follows:
the state of a single finger is determined by the variances and relative values of the x, y and z coordinates of the key joint points on the finger, i.e.:
when the finger state is upward, the variance of the x and z coordinates of the key joint points on the finger is small, and the variance of the y coordinates is large;
when the finger state is bent, the variance of the x coordinates of the key joint points on the finger is small, and the variance of the y and z coordinates is large;
when the finger state is forward, which applies to the four fingers other than the thumb, the variance of the x and y coordinates of the key joint points on the finger is small, and the variance of the z coordinates is large;
when the finger state is sideways, which applies only to the thumb, the variance of the x and y coordinates of the key joint points on the finger is large, and the variance of the z coordinates is small;
when the finger state is semi-closed, the variances of the y and z coordinates of the key joint points on the finger are large and close to each other, and the variance of the x coordinates is small;
when the finger state is closed, the y coordinates of the key joint points on the finger, from the fingertip to the palm, are not monotonically increasing;
the interphalangeal relationship is determined by the states of the two fingers and the relative coordinates of their corresponding joint points, namely:
when the interphalangeal relationship of the two fingers is combined, the states of both fingers are upward and the difference of the x coordinate values of the corresponding joint points of the two fingers is small;
when the interphalangeal relationship of the two fingers is separated, the difference of the x coordinate values of the corresponding joint points of the two fingers is large;
when the interphalangeal relationship of the two fingers is crossed, the differences of the x coordinates of the corresponding joint points of the two fingers have both positive and negative values;
when the interphalangeal relationship of the two fingers is a loop, the x coordinates of the joint points at the fingertip and near the palm are close, and the x coordinates of the other corresponding joint points are farther apart.
5. The method for realizing gesture recognition based on vision, characterized in that the method comprises the following operation steps:
(1) inputting the aligned RGB pictures into a hand detection module to obtain a hand boundary frame;
(2) the hand posture estimation module carries out data preprocessing on the depth map corresponding to the aligned RGB picture by using the bounding box of the hand, and intercepts the corresponding hand part in the depth map to obtain the 3D coordinates of the key joint points of the hand; in order to improve robustness, when the hand detection module does not detect the hand, points within a certain depth threshold of the depth map are intercepted as the hand part;
(3) inputting the 3D coordinates of the key joint points of the hand part into a gesture recognition module to obtain digital gesture codes;
(4) according to the digital gesture codes, similarity measurement is carried out on the gestures, so that gesture recognition is realized;
the specific content is as follows: similarity measurement is carried out on the two gestures by calculating L1 paradigm distance d of the two digital gesture codes, and the calculation method is shown as the following formula:
d=∑i|xi-yi|
in the above formula, x ═ x1,x2,...,xn)T,y=(y1,y2,...,yn)TDigital gesture codes that are two gestures, respectively; the smaller d represents the greater similarity of the two gestures, and when the d is smaller than a set threshold value, the two gestures are judged to be the same, so that gesture recognition is realized;
the digital gesture code is a digital vector consisting of 12 numbers, and the digital gesture code specifically comprises the following steps: (f)1,f2,f3,f4,f5,f12,f13,f14,f15,f23,f34,f45)TWherein the element fiThe index i belongs to {1, 2, 3, 4, 5}, and specifically: f. of1Representing the state of a single finger of the thumb, f2Representing the state of a single index finger, f3Representing the state of a single finger of the middle finger, f4Representing the state of a single ring finger, f5Representing the state of a single little finger; element fijThe relation between the fingers is shown, the subscript i belongs to {1, 2, 3, 4, 5}, the subscript j belongs to {2, 3, 4, 5}, and specifically: f. of12Indicating the interphalangeal relationship between the thumb and index finger, f13Indicating the interphalangeal relationship between the thumb and middle finger, f14Indicating the interphalangeal relationship between the thumb and ring finger, f15Representing the interphalangeal relationship between the thumb and the little finger, f23Indicating the interphalangeal relationship between the index finger and the middle finger, f34Indicating the interphalangeal relationship between the middle finger and ring finger, f45Representing the interphalangeal relationship between the ring finger and the little finger;
the specific values are as follows:
f_i = 1 indicates that the finger state is upward; f_i = 2 indicates that the finger state is bent; f_i = 3 indicates that the finger state is forward; f_i = 4 indicates that the finger state is sideways; f_i = 5 indicates that the finger state is semi-closed; f_i = 6 indicates that the finger state is closed; f_i = 0 indicates undefined;
f_ij = 1 indicates that the interphalangeal relationship is separated; f_ij = 2 indicates that the interphalangeal relationship is combined; f_ij = 3 indicates that the interphalangeal relationship is crossed; f_ij = 4 indicates that the interphalangeal relationship is a loop; f_ij = 0 indicates undefined.
6. The method for realizing gesture recognition based on vision according to claim 5, wherein: in step (1), step (2) and step (3), the RGB picture that is the input of the system, the depth map corresponding to the RGB picture, and the 3D coordinates of the key joint points of the hand that are the output of the system all belong to the same coordinate system, i.e., the image coordinate system, which improves the recognition accuracy and maintains the stability of the accuracy; the 3D coordinates of the key joint points of the hand are the positions of the joint points in the image coordinate system and can be converted into the camera coordinate system; after the camera is calibrated, the camera coordinate system is the world coordinate system.
CN201910865437.4A 2019-09-12 2019-09-12 System and method for realizing gesture recognition based on vision Active CN110569817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910865437.4A CN110569817B (en) 2019-09-12 2019-09-12 System and method for realizing gesture recognition based on vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910865437.4A CN110569817B (en) 2019-09-12 2019-09-12 System and method for realizing gesture recognition based on vision

Publications (2)

Publication Number Publication Date
CN110569817A CN110569817A (en) 2019-12-13
CN110569817B true CN110569817B (en) 2021-11-02

Family

ID=68779780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910865437.4A Active CN110569817B (en) 2019-09-12 2019-09-12 System and method for realizing gesture recognition based on vision

Country Status (1)

Country Link
CN (1) CN110569817B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368714A (en) * 2020-03-02 2020-07-03 北京华捷艾米科技有限公司 Gesture recognition method and device
CN111414837A (en) * 2020-03-16 2020-07-14 苏州交驰人工智能研究院有限公司 Gesture recognition method and device, computer equipment and storage medium
CN111523435A (en) * 2020-04-20 2020-08-11 安徽中科首脑智能医疗研究院有限公司 Finger detection method, system and storage medium based on target detection SSD
CN112089595A (en) * 2020-05-22 2020-12-18 未来穿戴技术有限公司 Login method of neck massager, neck massager and storage medium
CN111898524A (en) * 2020-07-29 2020-11-06 江苏艾什顿科技有限公司 5G edge computing gateway and application thereof
CN113312973B (en) * 2021-04-25 2023-06-02 北京信息科技大学 Gesture recognition key point feature extraction method and system
CN114967927B (en) * 2022-05-30 2024-04-16 桂林电子科技大学 Intelligent gesture interaction method based on image processing
CN115576431B (en) * 2022-11-18 2023-02-28 北京蔚领时代科技有限公司 VR gesture coding and recognizing method and device
CN116880687B (en) * 2023-06-07 2024-03-19 黑龙江科技大学 Suspension touch method based on monocular multi-algorithm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982557A (en) * 2012-11-06 2013-03-20 桂林电子科技大学 Method for processing space hand signal gesture command based on depth camera
CN107423698A (en) * 2017-07-14 2017-12-01 华中科技大学 A kind of gesture method of estimation based on convolutional neural networks in parallel
CN109919077A (en) * 2019-03-04 2019-06-21 网易(杭州)网络有限公司 Gesture recognition method, device, medium and calculating equipment
CN110188598A (en) * 2019-04-13 2019-08-30 大连理工大学 A kind of real-time hand Attitude estimation method based on MobileNet-v2

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2013148582A (en) * 2013-10-30 2015-05-10 ЭлЭсАй Корпорейшн IMAGE PROCESSING PROCESSOR CONTAINING A GESTURE RECOGNITION SYSTEM WITH A COMPUTER-EFFECTIVE FIXED HAND POSITION RECOGNITION
RU2014108870A (en) * 2014-03-06 2015-09-20 ЭлЭсАй Корпорейшн IMAGE PROCESSOR CONTAINING A GESTURE RECOGNITION SYSTEM WITH A FIXED BRUSH POSITION RECOGNITION BASED ON THE FIRST AND SECOND SET OF SIGNS
EP3203412A1 (en) * 2016-02-05 2017-08-09 Delphi Technologies, Inc. System and method for detecting hand gestures in a 3d space

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982557A (en) * 2012-11-06 2013-03-20 桂林电子科技大学 Method for processing space hand signal gesture command based on depth camera
CN107423698A (en) * 2017-07-14 2017-12-01 华中科技大学 A kind of gesture method of estimation based on convolutional neural networks in parallel
CN109919077A (en) * 2019-03-04 2019-06-21 网易(杭州)网络有限公司 Gesture recognition method, device, medium and calculating equipment
CN110188598A (en) * 2019-04-13 2019-08-30 大连理工大学 A kind of real-time hand Attitude estimation method based on MobileNet-v2

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Personalized Sketch-Based Image Retrieval by Convolutional Neural Network and Deep Transfer Learning; QI QI et al.; IEEE; 2019-02-12; pp. 16537-16549 *

Also Published As

Publication number Publication date
CN110569817A (en) 2019-12-13

Similar Documents

Publication Publication Date Title
CN110569817B (en) System and method for realizing gesture recognition based on vision
Shriram et al. Deep learning-based real-time AI virtual mouse system using computer vision to avoid COVID-19 spread
Sarkar et al. Hand gesture recognition systems: a survey
Ma et al. Kinect sensor-based long-distance hand gesture recognition and fingertip detection with depth information
Shibly et al. Design and development of hand gesture based virtual mouse
CN110210426B (en) Method for estimating hand posture from single color image based on attention mechanism
Hongyong et al. Finger tracking and gesture recognition with kinect
Liu et al. Hand gesture recognition based on single-shot multibox detector deep learning
Joseph et al. Visual gesture recognition for text writing in air
CN113378770A (en) Gesture recognition method, device, equipment, storage medium and program product
Tsagaris et al. Colour space comparison for skin detection in finger gesture recognition
Shin et al. Hand region extraction and gesture recognition using entropy analysis
AlSaedi et al. A new hand gestures recognition system
Abdallah et al. An overview of gesture recognition
Harshitha et al. HCI using hand gesture recognition for digital sand model
Khan et al. Computer vision based mouse control using object detection and marker motion tracking
Alam et al. Affine transformation of virtual 3D object using 2D localization of fingertips
CN104123008A (en) Man-machine interaction method and system based on static gestures
CN114581535A (en) Method, device, storage medium and equipment for marking key points of user bones in image
Ghosh et al. Real-time 3d markerless multiple hand detection and tracking for human computer interaction applications
Birdal et al. Region based hand gesture recognition
Mou et al. Attention based dual branches fingertip detection network and virtual key system
Le et al. Remote mouse control using fingertip tracking technique
Mishra et al. Anchors Based Method for Fingertips Position Estimation from a Monocular RGB Image using Deep Neural Network
Man et al. Thumbstick: a novel virtual hand gesture interface

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant