CN116909393A - Gesture recognition-based virtual reality input system - Google Patents

Gesture recognition-based virtual reality input system

Info

Publication number
CN116909393A
CN116909393A (application CN202310819033.8A)
Authority
CN
China
Prior art keywords
finger
gesture
point
hand
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310819033.8A
Other languages
Chinese (zh)
Inventor
范泉涌
柳天昕
李家旋
张乃宗
许斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202310819033.8A priority Critical patent/CN116909393A/en
Publication of CN116909393A publication Critical patent/CN116909393A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a virtual reality input system based on gesture recognition. First, a gesture recognition network extracts image feature information to generate a gesture position prediction map, a gesture depth map and a hand Mesh map. An input system is then built from the generated hand joint position information: the design uses fingertips pointing upwards for input, whether a finger is pressed down is judged from the coordinate changes of the fingertip key points in three-dimensional space, the user can check and confirm whether the input result is correct, and finally the confirmed result is output. For the complex and rapid hand motions found in virtual reality devices and scenes, the application provides a solution for information input based on gesture images, improves input accuracy under abnormal gesture conditions, and enhances the user experience.

Description

Gesture recognition-based virtual reality input system
Technical Field
The application relates to gesture recognition technology, in particular to intelligent input in the field of virtual reality.
Background
Gestures are a highly expressive form of human body language, so gesture recognition is an important aspect of human pose recognition. Gestures can convey a great deal of important information, for example in sign language, computer vision and human-computer interaction, and in recent years gesture recognition has become a hot research topic. Human-computer interaction technology mainly covers interaction between a person and an actuator (such as a robot) and interaction between the actuator and its environment. The significance of the former is that a person can take over planning and decision making that are difficult for an actuator in an unknown or non-deterministic environment; the significance of the latter is that a robot can perform work tasks in harsh or remote environments that humans cannot reach. Gesture recognition is also an emerging mode of human-computer interaction and is widely used in intelligent input methods, smart homes, sign language recognition systems, intelligent transportation, virtual reality, medical robots and other fields.
Current gesture recognition methods mainly fall into two categories: contact gesture recognition based on sensors and non-contact gesture recognition based on computer vision.
Sensor-based gesture recognition acquires hand information through wearable devices: the sensor is worn on the hand and captures the hand motion state and trajectory. Its advantages are a simple algorithm, high accuracy of the acquired data, a small amount of data to process, and immunity to external conditions such as illumination, colour and camera resolution; its disadvantages are the high cost of the wearable device, inconvenient operation, and the difficulty of large-scale production and use.
Computer-vision-based gesture recognition has a simple, low-cost data acquisition pipeline and can meet the gesture requirements of human-computer interaction scenarios well. However, existing methods mainly segment the hand with a skin-colour model to realise gesture recognition and detection, and finally track the moving gesture with an inter-frame difference method.
Gesture input in a virtual reality scene involves dynamic gesture recognition, and many methods can already recognise dynamic gestures. However, accurately recognising dynamic gestures and generating a hand model when the hand motion is complex and rapid in certain scenes, or when hand information is occluded, remains an open problem.
Disclosure of Invention
The application provides a gesture recognition-based virtual reality input system, which mainly comprises a multi-modal dynamic gesture recognition method and a virtual keyboard design, and can realise more efficient information input based on gesture image recognition.
In a first aspect, an embodiment of the application provides a method for predicting hand joint positions in real time with direct feedback to the user; this part is a ResNet neural network for three-dimensional hand joint detection and includes: constructing and preprocessing the data set required for training the network model; extracting features from the data images with a feature extractor for learning; extracting features from the 2D images and generating the coordinates of the 21 hand key points; and inputting the gesture motion information into a convolutional neural network model that outputs the three-dimensional spatial information of the gesture motion. In a second aspect, the application provides a hand joint rotation prediction method that uses a fully connected IK neural network to regress joint angles from the key point coordinates, realising end-to-end joint rotation from joint positions, and then reconstructs the joint positions by forward kinematics to generate a Mesh map of the hand. In a third aspect, a gesture input system is built from the generated hand joint position information: the design uses fingertips pointing upwards for input; whether a key is pressed is judged from the change of the relative position of a fingertip key point and a specific key in three-dimensional space; whether the pressed key belongs to the left-hand or right-hand region is judged from the coordinate relation between fingertip key points; after the specific key is determined, the user can check and confirm whether the input result is correct; when a fingertip is operated by mistake, the user can delete the wrong input by pressing the enter key; and finally the confirmed input result is output.
In the scheme provided by the application, a gesture recognition model is used to recognise multiple images containing gesture motions, so the three-dimensional spatial information of the gesture motion in these images can be obtained and corresponding operations can be performed on it, avoiding misrecognition of gesture motions caused by hand occlusion.
In one possible implementation, a plurality of sample images is acquired, where the sample images are multiple images containing gesture motions after data enhancement; an initial gesture recognition model is trained on these sample images to obtain the gesture recognition model.
According to the scheme provided by the application, the three-dimensional spatial information of the gesture motions recorded in the sample images can be obtained in advance, and the initial gesture recognition model is then trained with a number of target sample images, so the trained model acquires the ability to recognise the gesture motions and types recorded in multiple images. In this way, a moving hand can be captured in a real-time environment with a monocular camera and a three-dimensional model of the hand motion can be generated.
In this scheme, the hand model data set required for network training mainly comprises 2D-annotated images, 3D-annotated images and a synthetic data set. 2D-annotated images are relatively easy to obtain from existing data sets; since existing training data sets rarely contain 3D annotations, hand motion capture data (MoCap data) is used instead. The samples contain multiple images of gesture motions, and in order to cover enough gesture variation the data set is augmented with the following operations (a code sketch of these operations follows the list):
1) Apply translation, rotation, mirror (symmetric) transformations and similar operations to the data set, and at the same time introduce the influence of different illumination conditions in the same environment and of different gesture images under harsh environmental conditions on the gesture recognition result;
2) Assume that each finger of a gesture in the data set is independent of the other fingers, and expand the images in the gesture data set finger by finger;
3) Assume that any interpolation in quaternion space between the rest pose and the poses of the expanded data set is valid.
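As a hedged illustration of the augmentation listed above, the following Python/NumPy sketch applies translation, rotation, mirroring and an illumination (brightness) jitter to a gesture image. The parameter ranges and the use of OpenCV for the geometric warps are assumptions for illustration, not values fixed by the application.

import numpy as np
import cv2  # assumed available for the geometric warps

def augment_gesture_image(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply translation, rotation, mirror and brightness jitter (illustrative ranges)."""
    h, w = img.shape[:2]
    # random translation of up to 10% of the image size
    tx, ty = rng.uniform(-0.1, 0.1, size=2) * (w, h)
    # random rotation of up to +/-30 degrees around the image centre
    angle = rng.uniform(-30, 30)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    m[:, 2] += (tx, ty)
    out = cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REFLECT)
    # random horizontal mirror (symmetric transformation)
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # simulate different illumination conditions by scaling brightness
    gain = rng.uniform(0.6, 1.4)
    out = np.clip(out.astype(np.float32) * gain, 0, 255).astype(np.uint8)
    return out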
In the scheme provided by the application, a ResNet neural network is introduced in the first part of the network; compared with a traditional backbone network, it maintains a higher spatial resolution for image classification.
In one possible implementation, the main network structure is ResNet50, activated by the ReLU function. The initial image is preprocessed so that the input size is 128×128; after the convolutional neural network, a feature volume V is obtained and fed into two encoders to extract its 2D and 3D features. For an image of size H×W, the feature-extraction encoder and decoder have 2 levels of image patches, from 4×4 pixels to 8×8 pixels. At each level, N×N patches form a window. The window size N is fixed for all levels, so the bottom level has (H/4N)×(W/4N) windows and the top level has (H/8N)×(W/8N) windows. The information of the whole image is then aggregated to obtain the 21 hand key points. Global information is extracted with global average pooling at scales 1, 2, 3 and 6, concatenated with the input features, and mapped through a convolutional layer to obtain the prediction map Y of the current image.
The features F extracted by the feature-extraction encoder and the global image features are fed into two decoders at the same time to predict the depth information of the gesture image, finally generating the gesture depth position map L.
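A minimal PyTorch sketch of this feature-extraction stage is given below: a ResNet50 backbone produces the feature volume V, global information is gathered with average pooling at scales 1, 2, 3 and 6 (pyramid-pooling style) and concatenated back onto the features, and a final convolution maps to per-key-point prediction maps. The channel counts and layer choices are assumptions for illustration, not the application's exact architecture.

import torch
import torch.nn as nn
import torchvision

class KeypointHead(nn.Module):
    """ResNet50 features + pyramid average pooling + conv mapping to 21 key-point maps."""
    def __init__(self, num_keypoints: int = 21):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        # keep everything up to the last residual stage -> feature volume V (2048 channels)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.pool_scales = (1, 2, 3, 6)
        self.reduce = nn.ModuleList(
            [nn.Conv2d(2048, 256, kernel_size=1) for _ in self.pool_scales]
        )
        self.head = nn.Conv2d(2048 + 256 * len(self.pool_scales), num_keypoints, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.backbone(x)                                  # feature volume V
        h, w = v.shape[-2:]
        pooled = [v]
        for scale, reduce in zip(self.pool_scales, self.reduce):
            p = nn.functional.adaptive_avg_pool2d(v, scale)   # global context at this scale
            p = nn.functional.interpolate(reduce(p), size=(h, w), mode="bilinear",
                                          align_corners=False)
            pooled.append(p)
        return self.head(torch.cat(pooled, dim=1))            # prediction map Y

# usage: y = KeypointHead()(torch.randn(1, 3, 128, 128))  # -> shape (1, 21, 4, 4)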
In one possible implementation, ResNet50 is more favourable to gradient back-propagation, which allows a deeper network, makes training easier and reduces the vanishing-gradient phenomenon. The Residual modules allow the network to reduce and then restore dimensionality, and the 1×1 convolution layers effectively reduce the amount of computation and the number of parameters, improving the processing efficiency of the convolution layers in the feature extraction module. Dropout is omitted from the network; Batch Normalization is used so that, for each batch of data, the dimension corresponding to each channel of the feature matrix follows a distribution with mean 0 and variance 1, which accelerates convergence of the network and improves accuracy.
In one possible implementation, the activation function in the network is the ReLU function, which is hard-saturated for x < 0 and has derivative 1 for x > 0, so the gradient is not attenuated for x > 0. This alleviates the vanishing-gradient problem, speeds up convergence and gives the neural network sparse representation ability. The ReLU activation function is defined as follows:
f(x)=max(0,x)
In one possible implementation, the decoder uses a neural window fully-connected conditional random field (CRF) to obtain the depth information of the gesture image. The window-based fully-connected CRF splits the whole graphical model into several patch-based windows; each window contains N×N image patches, and each patch acts as one node composed of n×n pixels. Within a window all nodes are connected to each other, i.e. fully connected, while there are no connections between different windows, which greatly reduces the amount of computation. The unary energy of each node is computed by a unary network:
ψ_u(x_i) = θ_u(I, x_i)
The pairwise energy between nodes is more complex: it is built from the predicted values of the current node and the other nodes together with a weight computed from information such as colour and position. For each node, its pairwise energies with all the other nodes in the window are summed. The weight functions α and β are then computed. The dot product of Q and K gives the score between each node and every other node; after the position term P is added, SoftMax outputs the weights that determine how information is propagated, and these weights are multiplied with the prediction Y to perform the message passing.
In the solution provided by the application, the second part of the network uses an inverse kinematics (IK) method. The IK task is usually solved by iterative optimisation; here a fully connected IK neural network regresses the joint angles from the key point coordinates, so the hand pose prior can be learned directly from data, and MoCap data serves as an additional data source that provides full supervision during training. This part realises end-to-end inference of joint rotations from joint positions, after which the joint positions are reconstructed by forward kinematics. The IK process can correct noisy 3D predictions of the ResNet, and the joint rotations are represented at the original skeleton scale.
In one possible implementation, the IK method first feeds the gesture position prediction map Y produced by the first part of the network into a CNN that generates two branches: the first branch regresses the three-dimensional joints R of the hand by deconvolution, and the second branch generates the hand shape parameter β and twist angle φ through a 7-layer fully connected neural network with batch normalization, activated by a Sigmoid function. The results of the two branches are fed into the IK module to regress the pose parameter θ, and finally the Mesh map of the hand is generated in combination with the shape parameter β.
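The data flow of this implementation can be summarised by the following hedged PyTorch-style sketch: a deconvolution branch regresses the 3D joints R from the prediction map Y, a 7-layer fully connected branch with batch normalisation and Sigmoid activation outputs the shape parameter β and twist angle φ, an IK module regresses the pose parameter θ, and a MANO-style layer produces the hand Mesh. The module interfaces (`ik_module`, `mano_layer`, the parameter dimensions) are placeholders standing in for the components described in the text, not a specific library API.

import torch
import torch.nn as nn

class ShapeTwistNet(nn.Module):
    """7-layer fully connected branch with batch norm; outputs beta and phi (assumed sizes)."""
    def __init__(self, in_dim: int, hidden: int = 256, n_beta: int = 10, n_phi: int = 15):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(6):
            layers += [nn.Linear(d, hidden), nn.BatchNorm1d(hidden), nn.Sigmoid()]
            d = hidden
        layers += [nn.Linear(d, n_beta + n_phi)]
        self.net = nn.Sequential(*layers)
        self.n_beta = n_beta

    def forward(self, feat):
        out = self.net(feat)
        return out[:, :self.n_beta], out[:, self.n_beta:]   # beta, phi

def hand_mesh_pipeline(y_map, deconv_branch, fc_branch, ik_module, mano_layer):
    """Y -> (R, beta, phi) -> theta -> Mesh, mirroring the description above."""
    joints_r = deconv_branch(y_map)          # 3D joints R from deconvolution
    feat = y_map.flatten(start_dim=1)
    beta, phi = fc_branch(feat)              # shape and twist parameters
    theta = ik_module(joints_r, phi)         # pose parameters from inverse kinematics
    return mano_layer(theta, beta)           # hand Mesh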
In one possible implementation, the IK network is activated with the Sigmoid function, whose range is (0, 1); it is monotonically increasing, easy to optimise and easy to differentiate, and its derivative can be written directly in terms of the function itself. The function and its derivative are defined as follows:
f(x) = 1 / (1 + e^(-x)), f'(x) = f(x)(1 - f(x))
in one possible implementation, when it comes to training an IK network, it is desirable to have paired samples of 3D hand joint positions and corresponding joint rotation angles. The MANO model is accompanied by a data set, i.e., moCap data, containing 1554 real human hand poses from 31 subjects. Initially, the rotation is in an axis angle representation, converting them into a quaternion representation, which makes interpolation between the two poses easier. However, this dataset alone still cannot contain enough pose changes, so the dataset is augmented with the data enhancement method described above.
In one possible implementation, the localisation of the 21 hand key points proceeds as follows: the gesture image is first fed into the neural network, which extracts its features; a reference bone of the palm model is defined, and the hand key points of the model are searched in a large rectangular region centred on the palm reference bone. The principle is:
Let d be the distance, along the camera depth direction, of the palm skeletal node extracted by the neural network and centred on the reference bone; let the palm key point coordinates be (x_p, y_p, d_p) and the wrist key point coordinates be (x_r, y_r, d_r). Gesture key points are searched within a rectangular pixel region of width W and height H centred on the reference bone point.
Let T_0 = {} denote the set of gesture key points at the initial time, and let d_ij denote the distance to the camera of the pixel m_ij in row i, column j of the rectangular region. Therefore:
where k is the number of searches, Threshold is the threshold on the difference, fed to the network, between the reference bone node and the gesture key points in the rectangular region, abs(d_p - d_ij) is the absolute value of the difference between the distances of the palm node and the pixel, and T_k is the finally detected set of gesture key points.
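A hedged NumPy sketch of this search is given below: starting from the reference bone point, every pixel in a W×H rectangle is compared with the palm node depth d_p, and pixels whose depth difference stays below the threshold are collected as candidate key points. Variable names mirror the symbols above; the default threshold value is an assumption.

import numpy as np

def search_keypoints(depth: np.ndarray, center_xy, d_p: float,
                     width: int, height: int, threshold: float = 0.03):
    """Collect candidate gesture key points in a W x H rectangle around the reference bone.

    depth:     per-pixel distance to the camera, shape (rows, cols)
    center_xy: (x, y) pixel position of the reference bone point
    d_p:       depth of the palm key point
    """
    cx, cy = center_xy
    x0, x1 = max(0, cx - width // 2), min(depth.shape[1], cx + width // 2)
    y0, y1 = max(0, cy - height // 2), min(depth.shape[0], cy + height // 2)
    t_k = []                                  # T_k: detected key-point set
    for i in range(y0, y1):
        for j in range(x0, x1):
            d_ij = depth[i, j]                # distance of pixel m_ij to the camera
            if abs(d_p - d_ij) < threshold:   # abs(d_p - d_ij) compared with the threshold
                t_k.append((j, i, d_ij))
    return t_k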
In one possible implementation, the IK network models the hand with MANO: β are the coefficients of the PCA shape basis learned from hand scans, the pose parameter θ represents the joint rotations in axis-angle form, and J(θ) gives the joint positions. After the hand motion is generated, the hand model is deformed as follows:
where the two correction terms are the shape blend shape and the pose blend shape, respectively. The proposed hand model is then defined as:
Since not only the pose but also the shape of the hand is of interest, the shape parameter β of the MANO model is estimated from the predicted joint positions. Because the predictions are scale-normalised, the estimated shape can only represent the relative shape of the hand,
where the first term ensures, for each bone b, that the bone length l_b(β) of the deformed hand model matches the predicted 3D bone length derived from the ResNet 3D prediction; ref is the reference bone of the deformed MANO model. The second term acts as an L2 regulariser on the shape parameters and is weighted by λ_β.
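The shape fitting described above can be illustrated, under stated assumptions, as a small least-squares objective: for each bone, the length of the deformed model is pushed towards the predicted 3D bone length (both normalised by the reference bone), plus an L2 penalty λ_β‖β‖². The `bone_length_fn` mapping β to bone lengths stands in for the MANO shape model and is a placeholder, not a real library call.

import torch

def shape_fitting_loss(beta, bone_length_fn, pred_lengths, ref_bone: int, lambda_beta=1e-3):
    """First term: match relative bone lengths; second term: L2 regulariser on beta.

    beta:           (10,) MANO-style shape coefficients being optimised
    bone_length_fn: callable beta -> (num_bones,) bone lengths of the deformed model
    pred_lengths:   (num_bones,) bone lengths taken from the 3D prediction
    ref_bone:       index of the reference bone used for scale normalisation
    """
    model_lengths = bone_length_fn(beta)
    model_rel = model_lengths / model_lengths[ref_bone]
    pred_rel = pred_lengths / pred_lengths[ref_bone]
    return torch.sum((model_rel - pred_rel) ** 2) + lambda_beta * torch.sum(beta ** 2)

# usage sketch: optimise beta by gradient descent
# beta = torch.zeros(10, requires_grad=True)
# opt = torch.optim.Adam([beta], lr=1e-2)
# loss = shape_fitting_loss(beta, bone_length_fn, pred_lengths, ref_bone=0)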
In one possible implementation, the loss function of the ResNet part of the network is formulated as follows:
where the first term ensures that the regressed feature map F_j stays close to the ground truth F_GT, and the second term uses only the 2D-annotated images to ensure that the encoder is trained as well as possible; it is defined as follows:
where ‖·‖_F is the Frobenius norm and the last term is the L2 regularisation of the network weights that prevents overfitting.
In one possible implementation, the loss function of the IK part of the network is formulated as follows:
L = L_cos + L_l2 + L_xyz + L_norm
where L_cos measures the distance between the cosine of the interpolated angle and the predicted value, L_l2 directly supervises the predicted quaternion, L_xyz measures the 3D coordinate error of the hand after making the corresponding gesture, and L_norm is mainly used for the normalisation operation.
in the scheme provided by the application, the 21 key points of the hand are defined on one side of the palm center, and mainly comprise:
key point 0: taking a central point at the junction of the wrist and the palm;
key point 1: taking the point between the centre of the wrist-palm junction and the root of the thumb;
key point 2: taking a point at the root position of the thumb;
key point 3: taking the point of the joint part between the root part of the thumb and the fingertip of the thumb;
key point 4: taking the point of the finger tip position of the thumb;
key point 5: taking the point of the root position of the index finger;
key point 6: taking the point from the root of the index finger to the first joint part in the direction of the tip of the index finger;
key point 7: taking the point from the root of the index finger to the second joint part in the direction of the tip of the index finger;
key point 8: taking the point of the position of the finger tip of the index finger;
key point 9: taking a point at the root position of the middle finger;
key point 10: taking the point from the root of the middle finger to the first joint part of the middle finger in the direction of the fingertip;
key point 11: taking the point at the second joint from the root of the middle finger in the fingertip direction;
key point 12: taking the point at the middle finger tip position;
key point 13: taking the point of the root position of the ring finger;
key point 14: taking the point from the root of the ring finger to the first joint part in the direction of the fingertip;
key point 15: taking the point from the root of the ring finger to the second joint part in the fingertip direction;
key point 16: taking the point of the finger tip position of the ring finger;
key point 17: taking a point at the root position of the little finger;
key point 18: taking the point from the root of the little finger to the first joint part in the direction of the fingertip;
key point 19: taking the point from the root of the little finger to the second joint part in the direction of the fingertip;
key point 20: the point of the position of the finger tip of the little finger is taken.
In the scheme provided by the application, to avoid situations where hand occlusion makes it impossible to judge whether a finger is pressed down, the gesture input system is designed around input with the fingertips facing upwards. The main key positions of the virtual keyboard are rearranged to match the typing habits of the 26-key input method used in normal fingertip-down typing, and the keyboard is given a slight curvature for better user experience. The virtual keyboard is divided into two halves, responsible for the left and right hand respectively, as follows (see the mapping sketch after this paragraph): the left index finger is responsible for the six letters T, G, B, R, F and V; the left middle finger for the three letters E, D and C; the left ring finger for the three letters W, S and X; the left little finger for the three letters Q, A and Z; the right index finger for the six letters Y, H, N, U, J and M; the right middle finger for the two letters I and K and the punctuation mark ","; the right ring finger for the two letters O and L and the punctuation mark "."; the right little finger for P and the punctuation marks ";" and "/"; and the left and right thumbs are both responsible for the space bar.
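The division of the virtual keyboard described above can be written down directly as a mapping from fingers to keys; the Python dictionary below simply restates that assignment (where the extracted punctuation is ambiguous, the standard 26-key layout is assumed), with the space bar shared by both thumbs.

# Finger-to-key assignment of the split virtual keyboard, as described above.
FINGER_KEYS = {
    "left_index":   ["T", "G", "B", "R", "F", "V"],
    "left_middle":  ["E", "D", "C"],
    "left_ring":    ["W", "S", "X"],
    "left_pinky":   ["Q", "A", "Z"],
    "right_index":  ["Y", "H", "N", "U", "J", "M"],
    "right_middle": ["I", "K", ","],
    "right_ring":   ["O", "L", "."],
    "right_pinky":  ["P", ";", "/"],
    "left_thumb":   [" "],   # both thumbs share the space bar
    "right_thumb":  [" "],
}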
In one possible implementation, based on the hand key point information obtained above, the five fingertip key points are used: key point 4 of the thumb tip (denoted p_4), key point 8 of the index finger tip (denoted p_8), key point 12 of the middle finger tip (denoted p_12), key point 16 of the ring finger tip (denoted p_16) and key point 20 of the little finger tip (denoted p_20). It is also necessary to judge whether the left hand or the right hand pressed the key, which is decided from the relation between the x coordinates of the fingertip key points (a helper sketch follows the two conditions):
when x_4 < x_8, x_8 < x_12, x_12 < x_16 and x_16 < x_20 all hold, the hand is judged to be the left hand;
when x_4 > x_8, x_8 > x_12, x_12 > x_16 and x_16 > x_20 all hold, the hand is judged to be the right hand.
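As a direct transcription of these two conditions, a small helper might look like this (the arguments are the x components of the five fingertip key points p_4, p_8, p_12, p_16 and p_20):

def classify_hand(x4: float, x8: float, x12: float, x16: float, x20: float) -> str:
    """Decide which hand pressed the key from fingertip x coordinates."""
    if x4 < x8 < x12 < x16 < x20:
        return "left"
    if x4 > x8 > x12 > x16 > x20:
        return "right"
    return "unknown"   # neither ordering holds (e.g. a partially occluded hand)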
In one possible implementation, a virtual reality input system based on gesture recognition is designed. Whether a key is pressed is judged from the change of the relative position of a fingertip key point and a specific key in three-dimensional space: with the palm initially resting on the keyboard, the three-dimensional coordinates of the fingertip key points at their initial positions are recorded before the fingers start typing; when the three-dimensional coordinates of a fingertip key point change beyond a certain range, the finger is judged to be pressed down. After a press is detected, whether the key was pressed by the left or the right hand is judged by the method above; once the specific finger is determined, the user can select the information to be entered from the keyboard region that finger is responsible for and confirm whether the input result is correct. When a fingertip is operated by mistake and a wrong input occurs, the user can delete the wrong input by pressing the enter key, and finally the confirmed input result is output.
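A hedged sketch of this press detection follows: the resting 3D position of each fingertip key point is recorded first, and a key press is reported once a fingertip moves away from its resting position by more than a chosen range. The displacement threshold is an illustrative assumption.

import numpy as np

class KeyPressDetector:
    """Detect a key press from the displacement of fingertip key points in 3D space."""
    def __init__(self, threshold: float = 0.015):
        self.threshold = threshold          # displacement range treated as a press (assumed)
        self.rest_positions = None          # resting fingertip coordinates, shape (5, 3)

    def calibrate(self, fingertip_xyz: np.ndarray) -> None:
        """Record the initial fingertip positions with the palm resting on the keyboard."""
        self.rest_positions = fingertip_xyz.copy()

    def pressed_fingers(self, fingertip_xyz: np.ndarray):
        """Return indices (0=thumb .. 4=little finger) whose displacement exceeds the range."""
        if self.rest_positions is None:
            raise RuntimeError("call calibrate() with the resting hand first")
        disp = np.linalg.norm(fingertip_xyz - self.rest_positions, axis=1)
        return [i for i, d in enumerate(disp) if d > self.threshold]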
The effects given in this summary are merely effects of embodiments rather than all effects of the application. The above technical solution has the following advantages or beneficial effects:
compared with the prior art, the application provides an input system applied to virtual reality equipment, which mainly comprises an image acquisition preprocessing part, a ResNet part, an IK part and a virtual reality input system, wherein the network structure and parameters of the ResNet and IK methods can be adjusted and optimized according to specific application scenes. The method can achieve the purpose that 21 key point coordinates of a hand and a three-dimensional virtual multi-mode hand model are obtained from the plane position and depth information in a video stream, improve the accuracy of gesture motion recognition in a virtual reality scene, further design an input system based on gesture recognition, select an input scheme that finger fingertips face upwards and a semitransparent virtual keyboard floats on a palm, solve the problem that gesture motion cannot be accurately recognized due to mutual shielding of the fingers and the keyboard and the fingers to a certain extent, and provide an effective method for gesture input in the virtual reality scene.
The method provides a preliminary solution for accurate information input when hand motion is complex and rapid in a virtual reality device or scene, or when hand information is occluded; it improves the accuracy of gesture recognition under abnormal conditions and improves the user experience.
Drawings
To describe the embodiments of the application or the prior-art solutions more clearly, the drawings used in their description are briefly introduced below.
FIG. 1 is a schematic flow chart of an algorithm of a dynamic gesture recognition part in the application;
FIG. 2 is a schematic flow diagram of a convolutional neural network portion of the dynamic gesture recognition algorithm of the present application;
FIG. 3 is a schematic diagram of a complete flow of the virtual reality input system of the present application;
fig. 4 is a schematic diagram of the effect finally achieved by the virtual reality input system of the present application.
Detailed Description
The following describes the embodiments of the application clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the application are shown. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort fall within the protection scope of the application.
Referring first to fig. 1, a multi-modal dynamic gesture recognition method is designed:
a schematic flow diagram of a convolutional neural network portion of the partial dynamic gesture recognition algorithm is shown in fig. 2.
Step 1: gesture data set required for building network training
The data sets used in the method are mainly three public data sets: the Stereo Hand Pose Tracking Benchmark (STB), the test set of Dexter+Object (DO) and the synthetic data set RHD. The public data sets carry 2D annotations, and the 3D joint positions of the hands are obtained with the help of the MoCap data set. The data are augmented as introduced above, the initial images are preprocessed, and the image size is uniformly adjusted to the input size required by network training.
Step 2: training a network model of a first part
ResNet50 is introduced in the first part of the network. The prepared data set is first fed into the built ResNet network model; the input original images are of size 128×128. After the convolutional neural network, a feature volume V is obtained and fed into two encoders to extract its 2D and 3D features. Splitting the image into 4×4 and 8×8 pixel patches reduces the number of network parameters and the amount of computation; the information of the whole image is then aggregated to obtain the 21 hand key points. Global information is extracted with global average pooling at scales 1, 2, 3 and 6, concatenated with the input features, and mapped through a convolutional layer to obtain the prediction map Y of the current image.
Step 3: obtaining gesture depth images
This part mainly fuses the obtained features to obtain deeper information about the image: the features F extracted by the feature-extraction encoder and the global image features are fed into two decoders at the same time to predict the depth information of the gesture image, finally generating the gesture depth position map L.
Step 4: training a network model of the second part
The second part of the network uses the inverse kinematics (IK) method: a fully connected IK neural network regresses the joint angles from the key point coordinates and is trained with the help of MoCap data. This part realises end-to-end inference of joint rotations from joint positions, after which the joint positions are reconstructed by forward kinematics;
first, the gesture position prediction map Y generated by the first part of the network is fed into a CNN that produces two branches: the first branch regresses the three-dimensional joints R of the hand by deconvolution, and the second branch generates the hand shape parameter β and twist angle φ through a 7-layer fully connected neural network with batch normalization, activated by a Sigmoid function. The results of the two branches are fed into the IK module to regress the pose parameter θ, and finally the Mesh map of the hand is generated in combination with the shape parameter β.
Referring next to fig. 3, a virtual reality input system based on gesture recognition is designed:
and designing a virtual reality input system according to the gesture position prediction diagram Y, the gesture depth position diagram L and the gesture Mesh diagram generated by the gesture recognition network.
As described above, the input scheme with the fingertips facing upwards and a semi-transparent virtual keyboard floating above the palm is selected. Whether a key is pressed is judged from the change of the relative position of a fingertip key point and a specific key in three-dimensional space: with the palm initially resting on the keyboard, the three-dimensional coordinates of the fingertip key points at their initial positions are recorded before the fingers start typing; when the three-dimensional coordinates of a fingertip key point change beyond a certain limit, the finger is judged to be pressed down. The left or right hand is then determined by the method above; once the specific finger is known, the user can select the information to be entered from the keyboard region that finger is responsible for and confirm whether the input result is correct. When a fingertip is operated by mistake and a wrong input occurs, the user can delete the wrong input by pressing the enter key, and finally the confirmed input result is output.
The technology of the application can be applied to gesture input scenarios in fields such as virtual reality and gesture interaction. The above embodiments are only examples of the application and do not limit its patent scope; all equivalent structures or equivalent processes based on the content of the application, or its direct or indirect application in other related technical fields, fall within the patent protection scope of the application.

Claims (14)

1. A virtual reality input system based on gesture recognition, characterized in that:
firstly, building a gesture image data set used by a network, wherein the built data set contains the attribute and the type of gesture actions;
secondly, realizing multi-mode dynamic gesture recognition by using a neural network, wherein the multi-mode dynamic gesture recognition comprises two parts of hand key point position prediction and joint rotation prediction, the first part of the network inputs a real image into a network model to extract image features, extracts image global feature information to predict hand key point position information, and inputs the extracted global feature information into a decoder to obtain a gesture depth position map; the second part of the network takes the hand key point position prediction as input to finally regress joint rotation, and outputs a hand Mesh chart;
thirdly, designing a virtual reality input system based on gesture recognition: an input scheme is selected in which the fingertips face upwards and a semi-transparent keyboard floats above the palm; whether a key is pressed is judged from the change of the relative position of a fingertip key point and a specific key in three-dimensional space; the user can select the key to be pressed; and finally the user's input result is output.
2. The method as recited in claim 1, wherein the method further comprises:
firstly, preparing the hand model data set required for network training, which mainly comprises 2D-annotated images, 3D-annotated images and a synthetic data set: the 2D-annotated images come from public hand training data sets and the 3D annotations from hand motion capture (MoCap) data, with each sample containing multiple images of gesture motions; the constructed gesture data set is fed into the built neural network model for training, the model learns from the different gesture types and attributes in the data, and a hand prediction network model is finally output; at the same time, in order to cover enough gesture variation and learn more feature information of the hand, the gesture data set is augmented, mainly by:
1) Applying translation, rotation, mirror (symmetric) transformations and similar operations to the data set, while introducing the influence of different illumination conditions in the same environment and of different gesture images under harsh environmental conditions on the gesture recognition result;
2) Assuming that each finger in a gesture image in the data set is independent of the other fingers, and expanding the images in the gesture data set finger by finger;
3) Assuming that any interpolation in quaternion space between the rest pose and the poses of the expanded data set is valid.
3. The method of claim 1, wherein the first part of the network introduces a ResNet neural network that maintains a higher spatial resolution for image classification than a conventional backbone network;
in terms of network structure, ResNet50 is used and activated with the ReLU function; the initial image is preprocessed so that the input original image is of size 128×128; after the convolutional neural network a feature volume V is obtained and fed into two encoders to extract its 2D and 3D features; for an image of size H×W, the feature-extraction encoder and decoder have 2 levels of image patches, from 4×4 pixels to 8×8 pixels; at each level, N×N patches form a window, and the window size N is fixed for all levels, so the bottom level has (H/4N)×(W/4N) windows and the top level has (H/8N)×(W/8N) windows; the information of the whole image is then aggregated to obtain the 21 hand key points; global information is extracted with global average pooling at scales 1, 2, 3 and 6, concatenated with the input features, and mapped through a convolutional layer to obtain the prediction map Y of the current image.
4. A method as claimed in claim 1 and claim 3, wherein the method further comprises:
the features F extracted by the feature-extraction encoder and the global image features are fed into two decoders at the same time to predict the depth information of the gesture image, finally generating the gesture depth position map L;
the decoder uses a neural window fully-connected conditional random field (CRF) to obtain the depth information of the gesture image; the window-based fully-connected CRF splits the whole graphical model into several patch-based windows, each window containing N×N image patches, each patch acting as one node composed of n×n pixels; within a window all nodes are connected to each other, i.e. fully connected, while there are no connections between different windows, which greatly reduces the amount of computation; the unary energy of each node is computed by a unary network:
ψ_u(x_i) = θ_u(I, x_i)
the pairwise energy between nodes is more complex: it is built from the predicted values of the current node and the other nodes together with a weight computed from information such as colour and position; for each node, its pairwise energies with all the other nodes in the window are summed; the weight functions α and β are then computed; the dot product of Q and K gives the score between each node and every other node; after the position term P is added, SoftMax outputs the weights that determine how information is propagated, and these weights are multiplied with the prediction Y to perform the message passing.
5. The method of claim 1, wherein the second part of the network uses an inverse kinematics (IK) method; the IK task is usually solved by iterative optimisation, and here the joint angles are regressed from the key point coordinates with a fully connected IK neural network trained with MoCap data; this part realises end-to-end inference of joint rotations from joint positions, after which the joint positions are reconstructed by forward kinematics;
the IK method first feeds the gesture position prediction map Y generated by the first part of the network into a CNN that produces two branches: the first branch regresses the three-dimensional joints R of the hand by deconvolution, and the second branch generates the hand shape parameter β and twist angle φ through a 7-layer fully connected neural network with batch normalization, activated by a Sigmoid function; the results of the two branches are fed into the IK module to regress the pose parameter θ, and finally the Mesh map of the hand is generated in combination with the shape parameter β.
6. A method according to claim 1 and claim 3, wherein the method further comprises:
localising the 21 hand key points, which specifically comprises: first feeding the gesture image into the neural network, which extracts its features; defining a reference bone of the palm model and searching the hand key points of the model in a large rectangular region centred on the palm reference bone, the principle being:
let d be the distance, along the camera depth direction, of the palm skeletal node extracted by the neural network and centred on the reference bone; let the palm key point coordinates be (x_p, y_p, d_p) and the wrist key point coordinates be (x_r, y_r, d_r); gesture key points are searched within a rectangular pixel region of width W and height H centred on the reference bone point, where:
T_0 = {} denotes the set of gesture key points at the initial time and d_ij denotes the distance to the camera of the pixel m_ij in row i, column j of the rectangular region; therefore:
where k is the number of searches, Threshold is the threshold on the difference, fed to the network, between the reference bone node and the gesture key points in the rectangular region, abs(d_p - d_ij) is the absolute value of the difference between the distances of the palm node and the pixel, and T_k is the finally detected set of gesture key points.
7. A method as claimed in claim 1 and claim 3, wherein the method further comprises:
a loss function is defined for the first part of the network, formulated as follows:
where the first term ensures that the regressed feature map F_j stays close to the ground truth F_GT, and the second term uses only the 2D-annotated images to ensure that the encoder is trained as well as possible; it is defined as follows:
where ‖·‖_F is the Frobenius norm and the last term is the L2 regularisation of the network weights that prevents overfitting.
8. The method as claimed in claim 1 and claim 4, wherein the method further comprises:
a loss function is defined for the second part of the network, formulated as follows:
L = L_cos + L_l2 + L_xyz + L_norm
where L_cos measures the distance between the cosine of the interpolated angle and the predicted value, L_l2 directly supervises the predicted quaternion, L_xyz measures the 3D coordinate error of the hand after making the corresponding gesture, and L_norm is mainly used for the normalisation operation.
9. The method of claim 1 and claim 6, wherein the 21 hand key points are defined on the palm side and mainly comprise:
key point 0: taking a central point at the junction of the wrist and the palm;
key point 1: taking a central point at the junction of the wrist and the palm and a central point at the root position of the thumb;
key point 2: taking a point at the root position of the thumb;
key point 3: taking the point of the joint part between the root part of the thumb and the fingertip of the thumb;
key point 4: taking the point of the finger tip position of the thumb;
key point 5: taking the point of the root position of the index finger;
key point 6: taking the point from the root of the index finger to the first joint part in the direction of the tip of the index finger;
key point 7: taking the point from the root of the index finger to the second joint part in the direction of the tip of the index finger;
key point 8: taking the point of the position of the finger tip of the index finger;
key point 9: taking a point at the root position of the middle finger;
key point 10: taking the point from the root of the middle finger to the first joint part of the middle finger in the direction of the fingertip;
key point 11: taking the point at the second joint from the root of the middle finger in the fingertip direction;
key point 12: taking the point at the middle finger tip position;
key point 13: taking the point of the root position of the ring finger;
key point 14: taking the point from the root of the ring finger to the first joint part in the direction of the fingertip;
key point 15: taking the point from the root of the ring finger to the second joint part in the fingertip direction;
key point 16: taking the point of the finger tip position of the ring finger;
key point 17: taking a point at the root position of the little finger;
key point 18: taking the point from the root of the little finger to the first joint part in the direction of the fingertip;
key point 19: taking the point from the root of the little finger to the second joint part in the direction of the fingertip;
key point 20: the point of the position of the finger tip of the little finger is taken.
10. The method of claim 1 and claim 6, wherein, to avoid situations where hand occlusion makes it impossible to judge whether a finger is pressed down, the gesture input system is designed around input with the fingertips facing upwards; the main key positions of the virtual keyboard are rearranged to match the typing habits of the 26-key input method used in normal fingertip-down typing, and the keyboard is given a slight curvature for better user experience; the virtual keyboard is divided into two halves, responsible for the left and right hand respectively, as follows: the left index finger is responsible for the six letters T, G, B, R, F and V; the left middle finger for the three letters E, D and C; the left ring finger for the three letters W, S and X; the left little finger for the three letters Q, A and Z; the right index finger for the six letters Y, H, N, U, J and M; the right middle finger for the two letters I and K and the punctuation mark ","; the right ring finger for the two letters O and L and the punctuation mark "."; the right little finger for P and the punctuation marks ";" and "/"; and the left and right thumbs are both responsible for the space bar;
based on the hand key point information obtained above, the five fingertip key points are used: key point 4 of the thumb tip (denoted p_4), key point 8 of the index finger tip (denoted p_8), key point 12 of the middle finger tip (denoted p_12), key point 16 of the ring finger tip (denoted p_16) and key point 20 of the little finger tip (denoted p_20); it is also necessary to judge whether the left hand or the right hand pressed the key, which is decided from the relation between the x coordinates of the fingertip key points, mainly as follows:
when x_4 < x_8, x_8 < x_12, x_12 < x_16 and x_16 < x_20 all hold, the hand is judged to be the left hand;
when x_4 > x_8, x_8 > x_12, x_12 > x_16 and x_16 > x_20 all hold, the hand is judged to be the right hand.
11. The method as claimed in claim 1 and claim 10, wherein the method further comprises:
the method comprises the steps of designing a virtual reality input system based on gesture recognition, establishing a three-dimensional space coordinate system, judging whether a finger is pressed down according to coordinate changes of a finger tip key point in a three-dimensional space, namely, an initial position palm is placed on a keyboard, recording three-dimensional coordinate information of the initial position of the finger tip key point before the finger starts typing, judging that the finger is pressed down when the x-axis coordinate and the y-axis coordinate of the finger tip key point change and the change range exceeds a certain limit, judging whether the finger is left hand or right hand according to the method after judging that the finger is pressed down, after judging that a specific finger, a user can select the introduced keyboard area which is in charge of the finger and the specific information to be input and confirm whether the input result is correct, and when the finger tip is in misoperation and the misoperation occurs, the user can delete the misoperation input result after confirming the user finally, outputting the input result.
12. The method of claims 1-5, wherein the network structure and parameters of the ResNet and IK methods can be adjusted and optimised according to the specific application scenario.
13. The method of the application comprises an image acquisition and preprocessing part, a ResNet part, an IK part and a virtual reality input system; the coordinates of the 21 hand key points and a three-dimensional virtual multi-modal hand model can finally be obtained from the planar position and depth information in a video stream, improving the accuracy of gesture recognition under abnormal conditions; an input system based on gesture recognition is further designed, providing an effective method for gesture input in virtual reality scenes and improving the user experience.
14. The method of claims 1-10, wherein the recognised gestures, including but not limited to finger, palm and wrist motion gestures, can be applied in fields such as virtual reality and gesture interaction.
[ EXAMPLES ]
The application designs a virtual reality input system based on gesture recognition that can accurately recognise the three-dimensional pose and joint angles of gestures, providing an effective method for accurate gesture recognition in virtual reality scenes. At the same time, an input scheme is designed in which a semi-transparent virtual keyboard floats above the palm and the upward-facing palm clicks the virtual keyboard, which avoids occlusion of the keyboard by the fingers and of the fingers by each other and achieves a more accurate input effect.
CN202310819033.8A 2023-07-05 2023-07-05 Gesture recognition-based virtual reality input system Pending CN116909393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310819033.8A CN116909393A (en) 2023-07-05 2023-07-05 Gesture recognition-based virtual reality input system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310819033.8A CN116909393A (en) 2023-07-05 2023-07-05 Gesture recognition-based virtual reality input system

Publications (1)

Publication Number Publication Date
CN116909393A true CN116909393A (en) 2023-10-20

Family

ID=88354177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310819033.8A Pending CN116909393A (en) 2023-07-05 2023-07-05 Gesture recognition-based virtual reality input system

Country Status (1)

Country Link
CN (1) CN116909393A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117472189A (en) * 2023-12-27 2024-01-30 大连三通科技发展有限公司 Typing or touch control realization method with physical sense
CN117472189B (en) * 2023-12-27 2024-04-09 大连三通科技发展有限公司 Typing or touch control realization method with physical sense

Similar Documents

Publication Publication Date Title
Cheng et al. Jointly network: a network based on CNN and RBM for gesture recognition
CN107066935B (en) Hand posture estimation method and device based on deep learning
Gao et al. Dynamic hand gesture recognition based on 3D hand pose estimation for human–robot interaction
Tian et al. Gesture recognition based on multilevel multimodal feature fusion
De Smedt et al. Heterogeneous hand gesture recognition using 3D dynamic skeletal data
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN105807926B (en) A kind of unmanned plane man-machine interaction method based on three-dimensional continuous dynamic hand gesture recognition
Hasan et al. RETRACTED ARTICLE: Static hand gesture recognition using neural networks
CN110210426B (en) Method for estimating hand posture from single color image based on attention mechanism
CN114265498B (en) Method for combining multi-mode gesture recognition and visual feedback mechanism
Huang et al. Deepfinger: A cascade convolutional neuron network approach to finger key point detection in egocentric vision with mobile camera
CN106502390B A kind of visual human's interactive system and method based on dynamic 3D Handwritten Digit Recognition
CN112148128B (en) Real-time gesture recognition method and device and man-machine interaction system
CN111444488A (en) Identity authentication method based on dynamic gesture
Zheng et al. Static Hand Gesture Recognition Based on Gaussian Mixture Model and Partial Differential Equation.
CN113255602A (en) Dynamic gesture recognition method based on multi-modal data
Zhang et al. Handsense: smart multimodal hand gesture recognition based on deep neural networks
Xu et al. Robust hand gesture recognition based on RGB-D Data for natural human–computer interaction
Guo et al. Research on optimization of static gesture recognition based on convolution neural network
Nooruddin et al. HGR: Hand-gesture-recognition based text input method for AR/VR wearable devices
Sun et al. Gesture recognition algorithm based on multi‐scale feature fusion in RGB‐D images
CN116909393A (en) Gesture recognition-based virtual reality input system
CN114821764A (en) Gesture image recognition method and system based on KCF tracking detection
Joshi et al. Design of a Virtual Mouse Using Gesture Recognition and Machine Learning
Ikram et al. Real time hand gesture recognition using leap motion controller based on CNN-SVM architechture

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication