CN111753669A - Hand data identification method, system and storage medium based on graph convolution network - Google Patents

Hand data identification method, system and storage medium based on graph convolution network

Info

Publication number
CN111753669A
Authority
CN
China
Prior art keywords
dimensional
image
hand
coordinates
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010473675.3A
Other languages
Chinese (zh)
Inventor
黄昌正
周言明
陈曦
霍炼楚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhaoqing Anke Electronic Technology Co ltd
Guangzhou Huanjing Technology Co ltd
Original Assignee
Zhaoqing Anke Electronic Technology Co ltd
Guangzhou Huanjing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhaoqing Anke Electronic Technology Co ltd, Guangzhou Huanjing Technology Co ltd filed Critical Zhaoqing Anke Electronic Technology Co ltd
Priority to CN202010473675.3A priority Critical patent/CN111753669A/en
Priority to PCT/CN2020/099766 priority patent/WO2021237875A1/en
Publication of CN111753669A publication Critical patent/CN111753669A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/32Normalisation of the pattern dimensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hand data identification method, system and storage medium based on a graph convolution network. The method comprises the following steps: acquiring a hand image in a preset state; extracting a feature image, key point coordinates and a two-dimensional heat map from the hand image; combining the feature image and the two-dimensional heat map to generate a feature vector; generating three-dimensional joint point position coordinates from the feature vector and the key point coordinates; and restoring the hand posture from the three-dimensional joint point position coordinates. With the invention, interaction personnel can accurately complete the virtual interaction process without wearing special gloves, which simplifies the equipment required for virtual interaction and widens the application scenarios to a certain extent. The invention can be widely applied in the technical field of computer vision.

Description

Hand data identification method, system and storage medium based on graph convolution network
Technical Field
The invention relates to the technical field of computer vision, in particular to a hand data identification method, a hand data identification system and a hand data identification storage medium based on a graph convolution network.
Background
In virtual reality interaction, hand gesture recognition is typically performed by wearing a special glove that tracks hand gesture data; the virtual equipment receives the real-time hand gesture and displays it in the virtual reality interface, thereby improving the sense of realism. However, the special glove and its supporting facilities severely limit the application range, so that virtual equipment cannot be effectively popularized.
Disclosure of Invention
To solve the above technical problems, the present invention aims to provide a hand data recognition method, system and storage medium based on a graph convolution network, which can widen application scenarios to a certain extent.
A first aspect of an embodiment of the present invention provides:
a hand data identification method based on a graph convolution network comprises the following steps:
acquiring a hand image in a preset state;
extracting a feature image, a key point coordinate and a two-dimensional thermal image of the hand image;
combining the feature image and the two-dimensional thermal image to generate a feature vector;
generating three-dimensional joint point position coordinates according to the feature vectors and the key point coordinates;
and restoring the hand posture according to the position coordinates of the three-dimensional joint points.
Further, the extracting of the key point coordinates and the two-dimensional heat map of the hand image comprises:
extracting key point feature positions from the hand image by adopting a stacked hourglass network;
predicting the two-dimensional heat map from the key point feature positions, and determining the key point coordinates.
Further, the combining the feature image and the two-dimensional thermal image to generate a feature vector includes:
converting the dimensional size of the two-dimensional thermal image to a dimensional size of the feature image;
and calculating to obtain a feature vector through a convolution network according to the feature image and the two-dimensional heat map after size conversion.
Further, the generating three-dimensional joint point position coordinates according to the feature vectors and the key point coordinates comprises:
calculating to obtain the vertex coordinates of the three-dimensional mesh according to the feature vectors;
and calculating to obtain the position coordinates of the three-dimensional joint points according to the vertex coordinates and the key point coordinates.
Further, the calculating according to the feature vector to obtain the vertex coordinates of the three-dimensional mesh specifically includes:
and calculating all vertex coordinates of the three-dimensional grid by adopting a graph convolution network according to the feature vector.
Further, the calculating according to the vertex coordinates and the key point coordinates to obtain three-dimensional joint point position coordinates specifically includes:
and regressing the position coordinates of the three-dimensional joint points by adopting a linear graph convolution network according to the vertex coordinates and the key point coordinates.
Further, the restoring of the hand posture according to the three-dimensional joint point position coordinates specifically comprises:
and restoring the hand gesture corresponding to the hand image in a virtual reality interface according to the three-dimensional joint point position coordinates.
A second aspect of an embodiment of the present invention provides:
a hand data recognition system based on a graph convolution network comprises:
the acquisition module is used for acquiring a hand image in a preset state;
the extraction module is used for extracting a characteristic image, a key point coordinate and a two-dimensional thermal image of the hand image;
a combination module for combining the feature image and the two-dimensional thermal image to generate a feature vector;
the generating module is used for generating three-dimensional joint point position coordinates according to the feature vectors and the key point coordinates;
and the restoring module is used for restoring the hand gesture according to the three-dimensional joint point position coordinates.
A third aspect of embodiments of the present invention provides:
a hand data recognition system based on a graph convolution network comprises:
at least one memory for storing a program;
at least one processor for loading the program to execute the hand data identification method based on the graph convolution network.
A fourth aspect of an embodiment of the present invention provides:
a computer readable storage medium having stored therein processor executable instructions for implementing the graph convolution network based hand data recognition method when executed by a processor.
The embodiment of the invention has the following beneficial effects: the embodiment acquires a hand image in a preset state, extracts a feature image, key point coordinates and a two-dimensional heat map from the hand image, combines the feature image and the two-dimensional heat map to generate a feature vector, generates three-dimensional joint point position coordinates from the feature vector and the key point coordinates, and finally restores the hand gesture from the three-dimensional joint point position coordinates. An interacting person can thus complete the virtual interaction process without wearing a special glove, which simplifies the equipment required for virtual interaction and widens the application scenarios to a certain extent.
Drawings
FIG. 1 is a flowchart of a hand data recognition method based on graph convolution network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a stacked hourglass network configuration according to one embodiment;
fig. 3 is a schematic distribution diagram of 21 joint nodes according to an embodiment.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Referring to fig. 1, an embodiment of the present invention provides a hand data recognition method based on a graph convolution network, and the embodiment is applied to a control server, where the control server may communicate with a plurality of terminal devices. The terminal device may be a camera, a virtual display device, or the like.
The present embodiment includes steps S11-S15:
s11, acquiring a hand image in a preset state; the hand image can be acquired by a common RGB camera. The preset state refers to that the hand is in the center of the image in the shooting scene, and meanwhile, the image proportion occupied by the hand is moderate.
S12, extracting a characteristic image, a key point coordinate and a two-dimensional thermal image of the hand image; in particular, a stacked hourglass network may be used to extract keypoint pixel locations from hand images, predict hand keypoint heat maps, and determine initial keypoint coordinates.
As shown in fig. 2, the stacked hourglass network is a network architecture with a symmetric structure. In this step, multi-scale features of the network layers are used to recognize the gesture: each network layer on the downsampling path that produces low-resolution features has a corresponding network layer on the upsampling path. The overall architecture first uses convolution and pooling operations to reduce the features to a very low resolution, e.g. 4 × 4. At each max-pooling step, the network adds a new convolution branch that extracts features directly from the pre-pooling resolution, similar to a residual connection; these features are fused with the features extracted after the subsequent upsampling operation. After reaching the lowest resolution, the network upsamples the features by nearest-neighbour interpolation, combining information at different scales, and adds the result element-wise to the previously stored skip features. Once the output resolution is reached, two final convolutions are applied. The final output of the network is a set of keypoint heat maps that predict, at each pixel, the probability of each of the 21 keypoints shown in fig. 3. In fig. 2, C1 to C4 form the downsampling path, in which the resolution of the feature map is gradually reduced; C1a, C2a, C3a and C4a are backup copies of the corresponding feature maps before downsampling. The feature map at the lowest resolution is gradually upsampled, and each restored-resolution feature map is combined with its backed-up counterpart to obtain C1b, C2b, C3b and C4b. Different feature maps extract different hand keypoints, which yields better accuracy.
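The pool / skip / upsample / element-wise-add pattern of one hourglass level can be sketched with plain NumPy (the module structure and the identity `inner` function in the example are illustrative assumptions; a real hourglass applies learned convolutions at every level):

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling over an (H, W, C) feature map (H and W even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling, as used on the hourglass decoder path."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def hourglass_level(x, inner):
    """One symmetric hourglass level: keep a skip branch at the original
    resolution, pool, process the low-resolution path with `inner`
    (in the real network, further hourglass levels and convolutions),
    upsample, and fuse by element-wise addition as described for fig. 2."""
    skip = x                     # backup copy before downsampling (C1a..C4a)
    down = max_pool2(x)          # low-resolution path
    up = upsample2(inner(down))  # process, then restore resolution
    return skip + up             # element-wise fusion (C1b..C4b)
```

For example, `hourglass_level(x, lambda f: f)` on a 4 × 4 × 1 map adds each pixel to the nearest-neighbour upsampling of its 2 × 2 block maximum.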
S13, combining the feature image and the two-dimensional heat map to generate a feature vector. The feature vector is the feature vector of the keypoints. Specifically, when the feature image and the two-dimensional heat map are combined, they are input into a residual network composed of 8 residual layers and 4 pooling layers to generate the keypoint feature vector.
In some embodiments, step S13 may be implemented by:
converting the size of the two-dimensional heat map into the size of the feature image; a 1 × 1 convolution can be used to convert the size of the two-dimensional heat map containing the keypoints into the size of the feature image.
And calculating to obtain a feature vector through a convolution network according to the feature image and the two-dimensional heat map after size conversion.
In this embodiment, the structure of the convolutional network is similar to ResNet-18 and consists of 8 residual layers and 4 pooling layers. Using this convolutional network for the feature vector calculation improves the accuracy of the result.
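As an illustrative sketch of the combination step: a 1 × 1 convolution is simply a per-pixel linear map over channels, so matching the 21 heat-map channels to the feature-image channel count and stacking the two tensors can be written directly in NumPy (the random weights and function names below are stand-ins for the learned parameters; the real network follows this with the ResNet-18-like residual network):

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (H, W, C_in), weight: (C_in, C_out) -> (H, W, C_out)."""
    return x @ weight

def fuse_heatmaps(features, heatmaps, seed=0):
    """Project the 21 keypoint heat maps to the feature image's channel
    count with a 1x1 convolution, then concatenate along the channel
    axis. The random weights are stand-ins for learned parameters."""
    rng = np.random.default_rng(seed)
    c_in, c_out = heatmaps.shape[-1], features.shape[-1]
    w = rng.standard_normal((c_in, c_out))
    projected = conv1x1(heatmaps, w)           # size-converted heat maps
    return np.concatenate([features, projected], axis=-1)
```

Fusing an 8 × 8 × 64 feature image with 8 × 8 × 21 heat maps this way yields an 8 × 8 × 128 tensor ready for the residual network.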
S14, generating three-dimensional joint point position coordinates according to the feature vectors and the key point coordinates;
specifically, the step is to calculate vertex coordinates of the three-dimensional mesh according to the feature vector, and then calculate position coordinates of the three-dimensional joint points according to the vertex coordinates and the coordinates of the key points.
In some embodiments, the vertex coordinates of the three-dimensional mesh are obtained by calculation according to the feature vector, which may be specifically implemented by the following steps:
and calculating all vertex coordinates of the three-dimensional mesh by adopting a graph convolution network according to the feature vectors.
Specifically, key point feature vectors are input into a graph convolution network, the graph convolution network outputs 3D coordinates of all vertexes in a 3D mesh through calculation of a series of network layers, and the 3D mesh of the hand surface is reconstructed by using the 3D coordinates of the vertexes in the 3D mesh.
The hand 3D mesh is a graph structure in nature; therefore, the 3D mesh may be represented by an undirected graph $M = (V, E, W)$, where $V = \{v_i\}_{i=1}^{N}$ is the set of $N$ vertices in the mesh, $E = \{e_i\}_{i=1}^{E}$ is the set of $E$ edges in the mesh, and $W = \{w_{ij}\}_{N \times N}$ is the adjacency matrix.
Define a signal $f = (f_1, \ldots, f_N)^T \in R^{N \times F}$ on the vertices of graph $M$, representing the $F$-dimensional features of the $N$ vertices in the 3D mesh. In Chebyshev convolution, the graph convolution on a signal $f \in R^{N \times F_{in}}$ is defined as Equation 1:
$$g_\theta * f = \sum_{k=0}^{K-1} T_k(\tilde{L})\, f\, \theta_k \quad (1)$$
where $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$ is the Chebyshev polynomial of order $k$, with $T_0 = 1$ and $T_1 = x$; $\tilde{L} = 2L/\lambda_{max} - I_N$ is the rescaled Laplacian; $\lambda_{max}$ is the maximum eigenvalue of the graph Laplacian $L$; $\theta_k \in R^{F_{in} \times F_{out}}$ are the trainable parameters of the graph convolution layer; and $f_{out} \in R^{N \times F_{out}}$ is the output signal of the graph convolution layer.
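Equation 1 can be implemented directly in NumPy. The sketch below follows the Chebyshev recurrence term by term; the dense matrices and the eigenvalue computation are used here for clarity, whereas practical implementations use sparse operations and a precomputed $\lambda_{max}$:

```python
import numpy as np

def chebyshev_graph_conv(f, L, thetas):
    """Chebyshev graph convolution (Equation 1).
    f: (N, F_in) vertex signal; L: (N, N) graph Laplacian;
    thetas: list of K trainable (F_in, F_out) weight matrices."""
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()        # largest eigenvalue of L
    L_tilde = 2.0 * L / lam_max - np.eye(n)      # rescaled Laplacian
    t_prev, t_curr = f, L_tilde @ f              # T_0(L~) f = f, T_1(L~) f = L~ f
    out = t_prev @ thetas[0]
    if len(thetas) > 1:
        out = out + t_curr @ thetas[1]
    for k in range(2, len(thetas)):
        # Chebyshev recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
        t_next = 2.0 * L_tilde @ t_curr - t_prev
        out = out + t_next @ thetas[k]
        t_prev, t_curr = t_curr, t_next
    return out                                   # (N, F_out) output signal
```

With `len(thetas) == 3` this matches the K = 3 setting used for all graph convolution layers below.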
On the predefined graph structure of the triangular mesh describing the hand surface, a graph coarsening operation is first performed, similar to pooling in a convolutional neural network. The Graclus multilevel clustering algorithm is used to coarsen the graph, and a tree structure is created to store the correspondence between vertices at adjacent coarsening levels. During forward propagation of the graph convolution, vertex features in the coarsened graph are upsampled to the corresponding child vertices in the finer graph structure, and graph convolution is then performed to update the features in the graph network. The parameter K of all graph convolution layers is set to 3.
Specifically, the feature vector extracted by the hourglass network is used as the input of the graph convolution. Two fully-connected layers convert the feature vector into 80 vertices with 64-dimensional features in the coarsened graph; the features are then upsampled during the convolution process and converted from low to high dimension. Through two upsampling layers and four graph convolution layers, the network outputs the 3D coordinates of 1280 mesh vertices.
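A minimal sketch of the two fully-connected layers that lift the hourglass feature vector onto the 80-vertex coarse graph might look as follows (the input and hidden sizes, and the ReLU between the layers, are assumptions; only the 80 × 64 output shape is stated in the text):

```python
import numpy as np

def lift_to_coarse_graph(feat, w1, w2):
    """Map a flat image feature vector to a coarse graph signal of
    80 vertices x 64 features via two fully-connected layers.
    feat: (D,); w1: (D, H); w2: (H, 5120). The ReLU is an assumption."""
    h = np.maximum(feat @ w1, 0.0)   # first fully-connected layer + ReLU
    v = h @ w2                       # second fully-connected layer
    return v.reshape(80, 64)         # coarse graph: 80 vertices, 64-dim features
```

The 80 × 64 signal is then repeatedly upsampled and refined by graph convolutions until the 1280 × 3 vertex coordinates are produced.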
In some embodiments, the three-dimensional joint point position coordinates are calculated from the vertex coordinates and the key point coordinates, which may be implemented by:
and (4) regressing the position coordinates of the three-dimensional joint points by adopting a linear graph convolution network according to the vertex coordinates and the key point coordinates.
In this embodiment, the 3D hand joint point position coordinates may be linearly regressed from the three-dimensional hand mesh vertex coordinates using a simplified linear graph convolution. The three-dimensional mesh vertex coordinates contain the coordinates of the key points of the whole hand, from which the three-dimensional coordinates of 21 joint nodes can be directly selected; as shown in fig. 3, joint nodes 0 to 20 cover the whole hand posture. A two-layer graph convolution network without a nonlinear activation module estimates the three-dimensional joint depth information directly from the three-dimensional mesh vertices and, together with the previously acquired two-dimensional key points, generates the three-dimensional joint point position coordinates.
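Because the regression uses no nonlinear activation, it reduces to a linear map from mesh vertices to joints. The sketch below is a simplified stand-in for the patent's two-layer linear graph convolution (the single dense regressor matrix is an illustrative assumption):

```python
import numpy as np

def regress_joints(vertices, regressor):
    """Linearly regress joint positions from mesh vertex coordinates:
    joints = regressor @ vertices.
    vertices: (N, 3) mesh vertex coordinates (e.g. N = 1280);
    regressor: (21, N) learned weighting of vertices per joint."""
    return regressor @ vertices
```

For instance, a regressor row with weights 0.5 on two vertices places that joint at their midpoint; a trained regressor concentrates each row's weight on the mesh region surrounding the corresponding joint.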
In the embodiment, joint point coordinates covering the whole hand gesture can be extracted, so that the accuracy in the virtual hand gesture synchronization process in the virtual reality is improved.
S15, restoring the hand gesture according to the three-dimensional joint point position coordinates. Specifically, the hand gesture corresponding to the hand image is restored in the virtual reality interface according to the three-dimensional joint point position coordinates, so that the hand gesture data in virtual reality synchronize with the actual hand gesture as closely as possible, enhancing synchronism during virtual interaction.
In summary, this embodiment acquires a hand image in a preset state, extracts a feature image, key point coordinates and a two-dimensional heat map from the hand image, combines the feature image and the two-dimensional heat map to generate a feature vector, generates three-dimensional joint point position coordinates from the feature vector and the key point coordinates, and finally restores the hand gesture from those coordinates. An interacting person can therefore complete the virtual interaction process without wearing a special glove, which simplifies the application equipment and widens the application scenarios to a certain extent.
The embodiment of the invention provides a hand data identification system based on a graph convolution network corresponding to the method shown in figure 1, which comprises the following steps:
the acquisition module is used for acquiring a hand image in a preset state;
the extraction module is used for extracting a characteristic image, a key point coordinate and a two-dimensional thermal image of the hand image;
a combination module for combining the feature image and the two-dimensional thermal image to generate a feature vector;
the generating module is used for generating three-dimensional joint point position coordinates according to the feature vectors and the key point coordinates;
and the restoring module is used for restoring the hand gesture according to the three-dimensional joint point position coordinates.
The content of the embodiment of the method of the invention is all applicable to the embodiment of the system, the function of the embodiment of the system is the same as the embodiment of the method, and the beneficial effect achieved by the embodiment of the system is the same as the beneficial effect achieved by the method.
The embodiment of the invention provides a hand data identification system based on a graph convolution network, which comprises:
at least one memory for storing a program;
at least one processor for loading the program to execute the hand data identification method based on the graph convolution network.
The content of the embodiment of the method of the invention is all applicable to the embodiment of the system, the function of the embodiment of the system is the same as the embodiment of the method, and the beneficial effect achieved by the embodiment of the system is the same as the beneficial effect achieved by the method.
Furthermore, an embodiment of the present invention provides a computer-readable storage medium, in which processor-executable instructions are stored, and when the processor-executable instructions are executed by a processor, the hand data identification method based on a graph convolution network is implemented.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A hand data identification method based on a graph convolution network is characterized by comprising the following steps:
acquiring a hand image in a preset state;
extracting a feature image, a key point coordinate and a two-dimensional thermal image of the hand image;
combining the feature image and the two-dimensional thermal image to generate a feature vector;
generating three-dimensional joint point position coordinates according to the feature vectors and the key point coordinates;
and restoring the hand posture according to the position coordinates of the three-dimensional joint points.
2. The hand data recognition method based on the graph convolution network according to claim 1, wherein the extracting of the key point coordinates and the two-dimensional thermal image of the hand image comprises:
extracting key point feature positions from the hand image by adopting a stacked hourglass network;
predicting the two-dimensional heat map from the key point feature positions, and determining the key point coordinates.
3. The hand data recognition method based on the graph convolution network according to claim 1, wherein the combining the feature image and the two-dimensional thermal image to generate a feature vector comprises:
converting the dimensional size of the two-dimensional thermal image to a dimensional size of the feature image;
and calculating to obtain a feature vector through a convolution network according to the feature image and the two-dimensional heat map after size conversion.
4. The hand data recognition method based on the graph convolution network according to claim 1, wherein the generating three-dimensional joint point position coordinates according to the feature vectors and the key point coordinates comprises:
calculating to obtain the vertex coordinates of the three-dimensional mesh according to the feature vectors;
and calculating to obtain the position coordinates of the three-dimensional joint points according to the vertex coordinates and the key point coordinates.
5. The hand data recognition method based on the graph convolution network as claimed in claim 4, wherein the vertex coordinates of the three-dimensional mesh calculated according to the feature vector are specifically:
and calculating all vertex coordinates of the three-dimensional grid by adopting a graph convolution network according to the feature vector.
6. The hand data recognition method based on graph convolution network as claimed in claim 4, wherein the three-dimensional joint point position coordinates calculated according to the vertex coordinates and the key point coordinates are specifically:
and regressing the position coordinates of the three-dimensional joint points by adopting a linear graph convolution network according to the vertex coordinates and the key point coordinates.
7. The hand data recognition method based on the graph convolution network as claimed in claim 1, wherein the hand gesture is restored according to the three-dimensional joint point position coordinates, and specifically includes:
and restoring the hand gesture corresponding to the hand image in a virtual reality interface according to the three-dimensional joint point position coordinates.
8. A hand data recognition system based on a graph convolution network is characterized by comprising:
the acquisition module is used for acquiring a hand image in a preset state;
the extraction module is used for extracting a characteristic image, a key point coordinate and a two-dimensional thermal image of the hand image;
a combination module for combining the feature image and the two-dimensional thermal image to generate a feature vector;
the generating module is used for generating three-dimensional joint point position coordinates according to the feature vectors and the key point coordinates;
and the restoring module is used for restoring the hand gesture according to the three-dimensional joint point position coordinates.
9. A hand data recognition system based on a graph convolution network is characterized by comprising:
at least one memory for storing a program;
at least one processor configured to load the program to perform the hand data recognition method based on a graph convolution network according to any one of claims 1 to 7.
10. A computer readable storage medium having stored therein processor executable instructions for implementing a method for graph convolution network based hand data recognition of any one of claims 1-7 when executed by a processor.
CN202010473675.3A 2020-05-29 2020-05-29 Hand data identification method, system and storage medium based on graph convolution network Pending CN111753669A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010473675.3A CN111753669A (en) 2020-05-29 2020-05-29 Hand data identification method, system and storage medium based on graph convolution network
PCT/CN2020/099766 WO2021237875A1 (en) 2020-05-29 2020-07-01 Hand data recognition method and system based on graph convolutional network, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010473675.3A CN111753669A (en) 2020-05-29 2020-05-29 Hand data identification method, system and storage medium based on graph convolution network

Publications (1)

Publication Number Publication Date
CN111753669A true CN111753669A (en) 2020-10-09

Family

ID=72674421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010473675.3A Pending CN111753669A (en) 2020-05-29 2020-05-29 Hand data identification method, system and storage medium based on graph convolution network

Country Status (2)

Country Link
CN (1) CN111753669A (en)
WO (1) WO2021237875A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260774B (en) * 2020-01-20 2023-06-23 北京百度网讯科技有限公司 Method and device for generating 3D joint point regression model
CN114565815B (en) * 2022-02-25 2023-11-03 包头市迪迦科技有限公司 Video intelligent fusion method and system based on three-dimensional model
CN114743273B (en) * 2022-04-28 2024-10-18 西安交通大学 Human skeleton behavior recognition method and system based on multi-scale residual error map convolution network
CN116486489B (en) * 2023-06-26 2023-08-29 江西农业大学 Three-dimensional hand object posture estimation method and system based on semantic perception graph convolution

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170083752A1 (en) * 2015-09-18 2017-03-23 Yahoo! Inc. Face detection
CN108830150A * 2018-05-07 2018-11-16 山东师范大学 Three-dimensional human body pose estimation method and device
CN109821239A * 2019-02-20 2019-05-31 网易(杭州)网络有限公司 Implementation method, apparatus, device and storage medium for a motion-sensing game
CN110427877A * 2019-08-01 2019-11-08 大连海事大学 Method for three-dimensional human body pose estimation based on structural information
CN110874865A (en) * 2019-11-14 2020-03-10 腾讯科技(深圳)有限公司 Three-dimensional skeleton generation method and computer equipment
CN110991319A (en) * 2019-11-29 2020-04-10 广州市百果园信息技术有限公司 Hand key point detection method, gesture recognition method and related device
CN111062261A (en) * 2019-11-25 2020-04-24 维沃移动通信(杭州)有限公司 Image processing method and device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095254A (en) * 2021-04-20 2021-07-09 清华大学深圳国际研究生院 Method and system for positioning key points of human body part
CN113095254B (en) * 2021-04-20 2022-05-24 清华大学深圳国际研究生院 Method and system for positioning key points of human body part
CN113724393A (en) * 2021-08-12 2021-11-30 北京达佳互联信息技术有限公司 Three-dimensional reconstruction method, device, equipment and storage medium
CN113724393B (en) * 2021-08-12 2024-03-19 北京达佳互联信息技术有限公司 Three-dimensional reconstruction method, device, equipment and storage medium
CN114067046A (en) * 2021-10-22 2022-02-18 苏州方正璞华信息技术有限公司 Method and system for reconstructing and displaying hand three-dimensional model by single picture
CN114067046B (en) * 2021-10-22 2024-09-06 苏州方正璞华信息技术有限公司 Method and system for reconstructing and displaying hand three-dimensional model by single picture
CN115909406A (en) * 2022-11-30 2023-04-04 广东海洋大学 Gesture recognition method based on multi-class classification
CN116243803A (en) * 2023-05-11 2023-06-09 南京鸿威互动科技有限公司 Action evaluation method, system, equipment and readable storage medium based on VR technology
CN116243803B (en) * 2023-05-11 2023-12-05 南京鸿威互动科技有限公司 Action evaluation method, system, equipment and readable storage medium based on VR technology

Also Published As

Publication number Publication date
WO2021237875A1 (en) 2021-12-02

Similar Documents

Publication Publication Date Title
CN111753669A (en) Hand data identification method, system and storage medium based on graph convolution network
CN110135455B (en) Image matching method, device and computer readable storage medium
JP6798183B2 (en) Image analyzer, image analysis method and program
CN110443842A Depth map prediction method based on view fusion
Wang et al. Laplacian pyramid adversarial network for face completion
CN107329962B Image retrieval database generation method, and augmented reality method and device
CN110929736A Multi-feature cascaded RGB-D saliency object detection method
CN111507333A (en) Image correction method and device, electronic equipment and storage medium
CN113593001A (en) Target object three-dimensional reconstruction method and device, computer equipment and storage medium
CN111369564B (en) Image processing method, model training method and model training device
Kim et al. Interactive 3D building modeling method using panoramic image sequences and digital map
CN114549338A (en) Method and device for generating electronic map and computer readable storage medium
CN113408540B (en) Synthetic aperture radar image overlap area extraction method and storage medium
CN113112518A (en) Feature extractor generation method and device based on spliced image and computer equipment
EP4292059A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN117576312A (en) Hand model construction method and device and computer equipment
CN116152334A (en) Image processing method and related equipment
CN115439325A (en) Low-resolution hyperspectral image processing method and device and computer program product
US7978914B2 (en) Image processing system
CN111311732B 3D human body mesh acquisition method and device
CN117935088A (en) Unmanned aerial vehicle image target detection method, system and storage medium based on full-scale feature perception and feature reconstruction
Amirkolaee et al. Convolutional neural network architecture for digital surface model estimation from single remote sensing image
CN116128792A (en) Image processing method and related equipment
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
CN116934591A (en) Image stitching method, device and equipment for multi-scale feature extraction and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination