WO2021237875A1 - Hand data recognition method and system based on graph convolutional network, and storage medium

Info

Publication number: WO2021237875A1
Application number: PCT/CN2020/099766
Authority: WO
Inventors: 黄昌正; 周言明; 陈曦; 霍炼楚
Original assignee: 广州幻境科技有限公司; 肇庆市安可电子科技有限公司
Priority date: 2020-05-29
Filing date: 2020-07-01
Publication date: 2021-12-02
Also published as: CN111753669A

Abstract

A hand data recognition method and system based on a graph convolutional network, and a storage medium. The method comprises the following steps: obtaining a hand image in a preset state (S11); extracting a feature image of the hand image, a key point coordinate, and a two-dimensional thermal image (S12); combining the feature image with the two-dimensional thermal image to generate a feature vector (S13); generating a three-dimensional joint point position coordinate according to the feature vector and the key point coordinate (S14); and restoring a hand gesture according to the three-dimensional joint point position coordinate (S15). By means of the method, in the virtual interaction process, interaction personnel can accurately complete the virtual interaction process without wearing specific gloves, thus simplifying an application device for the virtual interaction process, and broadening application scenarios to a certain extent.

Description

Hand data recognition method, system and storage medium based on graph convolutional network

Technical field

The present invention relates to the field of computer vision technology, in particular to a method, system and storage medium for hand data recognition based on graph convolutional networks.

Background technique

In virtual reality interaction, hand gestures are conventionally recognized by having the user wear a specific glove that tracks the hand posture data. The virtual device receives the real-time posture of the hand and tracks and displays it in the virtual reality interface to improve the sense of realism. However, such gloves and their supporting facilities severely limit the scope of application, so that virtual devices cannot be effectively popularized.

Summary of the invention

In order to solve the above technical problems, the purpose of the present invention is to provide a hand data recognition method, system and storage medium based on graph convolutional network, which can broaden application scenarios to a certain extent.

The first aspect of the embodiments of the present invention provides:

A hand data recognition method based on graph convolutional network includes the following steps:

Obtain a hand image in a preset state;

Extracting feature images, key point coordinates and two-dimensional thermal images of the hand image;

Combining the feature image and the two-dimensional thermal image to generate a feature vector;

Generating three-dimensional joint point position coordinates according to the feature vector and the key point coordinates;

Restore the hand posture according to the three-dimensional joint point position coordinates.

Further, the extracting the key point coordinates and the two-dimensional thermal image of the hand image includes:

Extracting key point feature positions from the first image by using a stacked hourglass network;

Predicting the two-dimensional heat map according to the characteristic positions of the key points, and determining the coordinates of the key points.

Further, the combining the characteristic image and the two-dimensional thermal image to generate a characteristic vector includes:

Converting the size of the two-dimensional thermal image into the size of the characteristic image;

According to the feature image and the size-converted two-dimensional heat map, a feature vector is calculated through a convolutional network.

Further, the generating three-dimensional joint point position coordinates according to the feature vector and the key point coordinates includes:

Calculating the vertex coordinates of the three-dimensional grid according to the feature vector;

The three-dimensional joint point position coordinates are calculated according to the vertex coordinates and the key point coordinates.

Further, calculating the vertex coordinates of the three-dimensional grid according to the feature vector is specifically:

According to the feature vector, the graph convolution network is used to calculate the coordinates of all the vertices of the three-dimensional grid.

Further, the calculation of the three-dimensional joint point position coordinates according to the vertex coordinates and the key point coordinates is specifically:

According to the vertex coordinates and the key point coordinates, a linear graph convolution network is used to regress the three-dimensional joint point position coordinates.

Further, the restoration of the hand posture according to the three-dimensional joint point position coordinates is specifically:

Restore the hand posture corresponding to the hand image in the virtual reality interface according to the three-dimensional joint point position coordinates.

The second aspect of the embodiments of the present invention provides:

A hand data recognition system based on graph convolutional network, including:

The acquisition module is used to acquire a hand image in a preset state;

An extraction module for extracting feature images, key point coordinates and two-dimensional thermal images of the hand image;

The combining module is used to combine the feature image and the two-dimensional thermal image to generate a feature vector;

A generating module, configured to generate three-dimensional joint point position coordinates according to the feature vector and the key point coordinates;

The restoration module is used to restore the hand posture according to the position coordinates of the three-dimensional joint points.

The third aspect of the embodiments of the present invention provides:

A hand data recognition system based on graph convolutional network, including:

At least one memory for storing programs;

At least one processor is configured to load the program to execute the hand data recognition method based on the graph convolutional network.

The fourth aspect of the embodiments of the present invention provides:

A computer-readable storage medium, in which instructions executable by a processor are stored, and the instructions executable by the processor are used to implement the hand data recognition method based on graph convolutional network when executed by the processor.

The beneficial effect of the embodiments of the present invention is as follows: a hand image in a preset state is obtained; the feature image, key point coordinates and two-dimensional thermal image of the hand image are extracted; the feature image and the two-dimensional thermal image are combined to generate a feature vector; three-dimensional joint point position coordinates are generated according to the feature vector and the key point coordinates; and finally the hand posture is restored according to the three-dimensional joint point position coordinates. As a result, during the virtual interaction process the interactor can complete the interaction without wearing special gloves, which simplifies the equipment required for virtual interaction and broadens the application scenarios to a certain extent.

Description of the drawings

FIG. 1 is a flowchart of a method for hand data recognition based on a graph convolutional network according to a specific embodiment of the present invention;

FIG. 2 is a schematic diagram of a stacked hourglass network structure according to a specific embodiment;

FIG. 3 is a schematic diagram of the distribution of 21 joint nodes in a specific embodiment.

Detailed description

The present invention will be further described in detail below in conjunction with the drawings and specific embodiments. The step numbers in the following embodiments are set only for ease of description and impose no restriction on the order of the steps. The execution order of the steps in the embodiments can be adaptively adjusted according to the understanding of those skilled in the art.

In the following description, “some embodiments” are referred to, which describe a subset of all possible embodiments, but it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments, and can be combined with each other without conflict.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terminology used herein is only for the purpose of describing the embodiments of the application, and is not intended to limit the application.

Referring to FIG. 1, an embodiment of the present invention provides a hand data recognition method based on a graph convolutional network. This embodiment is applied to a control server, and the control server can communicate with multiple terminal devices. The terminal devices may be cameras, virtual display devices, and so on.

This embodiment includes steps S11-S15:

S11. Obtain a hand image in a preset state; the hand image can be obtained through an ordinary RGB camera. The preset state means that the hand is at the center of the image in the shooting scene, and the hand occupies a moderate proportion of the image.
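As a non-limiting illustration, the preset state of step S11 could be checked from a bounding box of the hand. The sketch below is only an assumption about how "at the center" and "a moderate proportion" might be quantified: the function name `is_preset_state`, the bounding-box input, and both thresholds are hypothetical and are not specified by this embodiment.

```python
def is_preset_state(bbox, image_size, max_center_offset=0.15, area_range=(0.1, 0.6)):
    """Return True if the hand box is roughly centred and moderately sized.

    bbox: (x0, y0, x1, y1) in pixels; image_size: (width, height).
    max_center_offset and area_range are hypothetical thresholds, expressed
    as fractions of the image size and image area respectively.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    # Normalised centre of the box, in [0, 1] along each axis.
    cx = (x0 + x1) / 2 / width
    cy = (y0 + y1) / 2 / height
    centred = abs(cx - 0.5) <= max_center_offset and abs(cy - 0.5) <= max_center_offset
    # Fraction of the image area covered by the box.
    area = (x1 - x0) * (y1 - y0) / (width * height)
    return centred and area_range[0] <= area <= area_range[1]


centred_hand = is_preset_state((200, 150, 440, 330), (640, 480))
corner_hand = is_preset_state((0, 0, 100, 100), (640, 480))
```

A frame that fails such a check could simply be skipped before step S12.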

S12. Extract the feature image, key point coordinates and two-dimensional thermal image of the hand image; specifically, a stacked hourglass network can be used to extract the key point pixel positions from the hand image, predict the hand key point heat maps, and determine the initial key point coordinates.
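As a non-limiting illustration, determining the key point coordinates from the predicted heat maps may be sketched as taking the location of the peak of each heat map. This is one common choice and is an assumption here; the function name `keypoints_from_heatmaps` and the 64*64 heat map size are illustrative and are not specified by this embodiment.

```python
import numpy as np


def keypoints_from_heatmaps(heatmaps: np.ndarray) -> np.ndarray:
    """Return a (K, 2) array of (x, y) pixel coordinates, one per heat map.

    heatmaps: array of shape (K, H, W), one channel per key point.
    """
    k, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)  # peak position in each flattened map
    ys, xs = np.unravel_index(flat_idx, (h, w))        # convert back to row / column
    return np.stack([xs, ys], axis=1)


# Toy example: 21 heat maps of size 64*64, each with a single known peak.
rng = np.random.default_rng(0)
true_xy = rng.integers(0, 64, size=(21, 2))
maps = np.zeros((21, 64, 64))
maps[np.arange(21), true_xy[:, 1], true_xy[:, 0]] = 1.0
pred_xy = keypoints_from_heatmaps(maps)
```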

As shown in FIG. 2, the stacked hourglass network is a symmetrical network architecture. In this step, its multi-scale features are used to recognize gestures: for each network layer used in obtaining low-resolution features, there is a corresponding network layer in the up-sampling process. The overall network architecture first uses convolution and pooling operations to reduce the features to a very low resolution, such as 4*4. At each max pooling step, the network adds a new convolution branch that extracts features directly at the resolution before pooling, similar to a residual operation, and these features are later fused with the features produced by the subsequent up-sampling operation. After reaching the lowest resolution, the network up-samples the features by nearest neighbor interpolation, combining information at different scales, and adds the previously branched features element-wise. When the output resolution is reached, two more convolutions are used for the final calculation. The final output of the network is a set of key point heat maps, which predict, for each pixel, the probability that each of the 21 key points shown in FIG. 3 is present. As shown in FIG. 2, C1 to C4 form the down-sampling process, in which the resolution of the feature map is gradually reduced, and C1a, C2a, C3a and C4a are backups of the corresponding feature maps before down-sampling. The feature map at the lowest resolution is gradually up-sampled, and each restored feature map is combined with the corresponding backup feature map to obtain C1b, C2b, C3b and C4b. Extracting different key points of the hand from feature maps at different scales achieves better accuracy.
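As a non-limiting illustration, the down-sampling, backup, up-sampling and element-wise fusion described above may be sketched on a single-channel feature map. The convolutions of the actual network are deliberately omitted, so the sketch only shows the data flow; the function names and the 64*64 input size are assumptions.

```python
import numpy as np


def max_pool_2x2(x):
    """Halve the resolution of an (H, W) map by 2x2 max pooling (H and W even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def upsample_nearest_2x(x):
    """Double the resolution by nearest neighbor interpolation."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def hourglass(x, depth):
    """Keep a backup, pool, recurse, up-sample, then add the backup element-wise."""
    if depth == 0:
        return x
    backup = x                          # copy before down-sampling (C1a..C4a)
    low = max_pool_2x2(x)               # one down-sampling step (C1..C4)
    low = hourglass(low, depth - 1)
    restored = upsample_nearest_2x(low)
    return restored + backup            # element-wise fusion (C1b..C4b)


feature_map = np.arange(64 * 64, dtype=float).reshape(64, 64)
out = hourglass(feature_map, depth=4)   # lowest resolution is 64 / 2**4 = 4, i.e. 4*4
```

With four down-sampling steps, a 64*64 map reaches the 4*4 resolution mentioned above and is restored to 64*64 at the output.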

S13. Combine the feature image and the two-dimensional thermal image to generate a feature vector; the feature vector is the key point feature vector. Specifically, after the feature image and the two-dimensional thermal image are combined, the result is input to a residual network composed of 8 residual layers and 4 pooling layers to generate the key point feature vector.

In some embodiments, step S13 can be implemented through the following steps:

The size of the two-dimensional thermal image is converted into the size of the feature image; a 1*1 convolution can be used to convert the two-dimensional thermal image containing the key points to the size of the feature image.

According to the feature image and the transformed two-dimensional heat map, the feature vector is calculated through the convolutional network.

In this embodiment, the structure of the convolutional network is similar to ResNet-18 and consists of 8 residual layers and 4 pooling layers. Using this convolutional network to calculate the feature vector improves the accuracy of the calculation result.
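As a non-limiting illustration, the size conversion by 1*1 convolution may be sketched as a per-pixel mixing of channels. The sketch assumes that "size" refers to the number of channels, that the spatial resolutions already match, and that the two maps are fused by element-wise addition; the channel count of 256, the random weights and the function name `conv1x1` are all assumptions, and the subsequent residual network is not shown.

```python
import numpy as np


def conv1x1(x, weight):
    """1*1 convolution: mix channels independently at every pixel.

    x: (C_in, H, W); weight: (C_out, C_in); result: (C_out, H, W).
    """
    return np.einsum("oc,chw->ohw", weight, x)


rng = np.random.default_rng(0)
features = rng.standard_normal((256, 64, 64))   # hypothetical feature image with 256 channels
heatmaps = rng.random((21, 64, 64))             # 21 key point heat maps
w = rng.standard_normal((256, 21)) * 0.1        # stand-in for learned 1*1 weights

heat_as_features = conv1x1(heatmaps, w)         # now the same size as the feature image
combined = features + heat_as_features          # assumed fusion: element-wise sum
```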

S14: Generate three-dimensional joint point position coordinates according to the feature vector and key point coordinates;

Specifically, this step is to first calculate the vertex coordinates of the three-dimensional grid according to the feature vector, and then calculate the three-dimensional joint point position coordinates according to the vertex coordinates and the key point coordinates.

In some embodiments, the vertex coordinates of the three-dimensional grid are calculated according to the feature vector, which can be specifically implemented by the following steps:

The key point feature vector is input to the graph convolutional network, which outputs the 3D coordinates of all vertices of the 3D grid through a series of network layers; these vertex coordinates are then used to reconstruct the 3D grid of the hand surface.

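The 3D grid of the hand is essentially a graph structure. Therefore, the 3D grid can be represented by an undirected graph $M=(V,\varepsilon,W)$, where $V=\{v_i\}_{i=1}^{N}$ is the set of $N$ vertices in the grid, $\varepsilon=\{e_i\}_{i=1}^{E}$ is the set of $E$ edges in the grid, and $W=(w_{ij})_{N\times N}$ is the adjacency matrix.

Define the signal $f=(f_1,\dots,f_N)^T\in\mathbb{R}^{N\times F}$ on the vertices of the graph $M$, which is used to represent the $F$-dimensional features of the $N$ vertices in the 3D grid. In the Chebyshev graph convolution, the graph convolution operation on the signal $f_{in}\in\mathbb{R}^{N\times F_{in}}$ is defined as formula 1:

$$f_{out}=\sum_{k=0}^{K-1}T_k(\tilde{L})\,f_{in}\,\theta_k \qquad (1)$$

Among them, $T_k(x)=2xT_{k-1}(x)-T_{k-2}(x)$ is the Chebyshev polynomial of order $k$, with $T_0=1$ and $T_1=x$; $\tilde{L}=2L/\lambda_{max}-I_N\in\mathbb{R}^{N\times N}$ is the rescaled Laplacian, where $L$ is the graph Laplacian of $M$ and $\lambda_{max}$ is the maximum eigenvalue of $L$; $\theta_k\in\mathbb{R}^{F_{in}\times F_{out}}$ are the trainable parameters in the graph convolutional layer; and $f_{out}\in\mathbb{R}^{N\times F_{out}}$ is the output signal of the graph convolutional layer.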
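As a non-limiting illustration, the Chebyshev graph convolution described above may be sketched as follows, using the recurrence for the Chebyshev polynomials applied to the rescaled Laplacian. The sketch assumes that L is the symmetric normalized graph Laplacian, which is a common choice but is not stated in this embodiment; the function name, the toy 4-vertex graph and the random parameters are likewise assumptions.

```python
import numpy as np


def chebyshev_graph_conv(f_in, W, thetas):
    """Compute sum over k of T_k(L_tilde) @ f_in @ theta_k for k = 0..K-1.

    f_in: (N, F_in) vertex features; W: (N, N) symmetric adjacency matrix;
    thetas: list of K arrays, each of shape (F_in, F_out).
    """
    n = W.shape[0]
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Assumed form of L: symmetric normalized Laplacian I - D^-1/2 W D^-1/2.
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(n)      # rescaled Laplacian

    t_prev, t_curr = f_in, L_tilde @ f_in        # T_0(L_tilde) f and T_1(L_tilde) f
    out = t_prev @ thetas[0]
    if len(thetas) > 1:
        out = out + t_curr @ thetas[1]
    for theta in thetas[2:]:
        # Recurrence T_k = 2 x T_{k-1} - T_{k-2}, applied to the signal.
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + t_curr @ theta
    return out


# Toy graph: 4 vertices connected in a cycle, F_in = 3, F_out = 2, K = 3.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
f_in = rng.standard_normal((4, 3))
thetas = [rng.standard_normal((3, 2)) for _ in range(3)]
out = chebyshev_graph_conv(f_in, W, thetas)
```

Working on the signal rather than on explicit matrix powers keeps each step a single matrix product, which is why the recurrence form is usually preferred.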

On the predefined graph structure of the triangle mesh that represents the hand surface, a graph coarsening operation is first performed, similar to pooling in a convolutional neural network. The Graclus multi-level clustering algorithm is used to coarsen the graph, and a tree structure is created to store the correspondence between the vertices of the graphs at adjacent coarsening levels. During forward propagation, the graph convolutional network up-samples the vertex features of the coarsened graph to the corresponding child vertices in the finer graph structure, and graph convolution is then performed to update the features in the graph. The parameter K of all graph convolution layers is set to 3.
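As a non-limiting illustration, the up-sampling of vertex features from a coarsened graph to its child vertices may be sketched with a parent index array that plays the role of the stored tree structure. The Graclus clustering itself is not implemented here; the `parent` array, the function name and the toy sizes are hypothetical.

```python
import numpy as np


def upsample_to_children(coarse_feats, parent):
    """Copy the features of each coarse vertex to its child vertices.

    coarse_feats: (N_coarse, F) features on the coarsened graph.
    parent: (N_fine,) array; parent[i] is the coarse vertex into which fine
    vertex i was merged during coarsening. Returns (N_fine, F).
    """
    return coarse_feats[parent]


# Hypothetical clustering: 6 fine vertices merged pairwise into 3 coarse vertices.
parent = np.array([0, 0, 1, 1, 2, 2])
coarse = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
fine = upsample_to_children(coarse, parent)
```

After this step, a graph convolution on the finer graph would update the copied features so that sibling vertices no longer share identical values.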

Specifically, the feature vector extracted from the hourglass network is used as the input of the graph convolution. Through two fully connected layers, the feature vector is converted into 64-dimensional features on the 80 vertices of the coarsened graph, and these features are then up-sampled from the coarse graph to finer graphs. Through two up-sampling layers and four graph convolutional layers, the network outputs the 3D coordinates of 1280 mesh vertices.
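As a non-limiting illustration, the tensor sizes through this stage may be traced as follows. Only the figures 80, 64, 1280, two fully connected layers, two up-sampling layers and four graph convolutional layers come from this embodiment. The up-sampling factor of 4 is inferred from 80 * 4 * 4 = 1280, and the latent size, hidden size, intermediate channel widths and layer ordering are assumptions. The graph convolutions are replaced by per-vertex linear maps with random weights, so the sketch tracks shapes only.

```python
import numpy as np

rng = np.random.default_rng(0)


def fully_connected(x, out_dim):
    """Stand-in dense layer with random weights (shape bookkeeping only)."""
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.01
    return x @ w


def graph_layer(x, out_dim):
    """Stand-in for a graph convolution: per-vertex linear map, no neighbor mixing."""
    return fully_connected(x, out_dim)


def upsample(x, factor):
    """Each coarse vertex passes its features to `factor` child vertices."""
    return np.repeat(x, factor, axis=0)


latent = rng.standard_normal(512)                    # hypothetical feature vector size
x = fully_connected(latent, 256)                     # fully connected layer 1
x = fully_connected(x, 80 * 64).reshape(80, 64)      # fully connected layer 2 -> 80 vertices
shapes = [x.shape]

x = upsample(x, 4)                                   # up-sampling layer 1: 80 -> 320
x = graph_layer(x, 64)                               # graph convolutional layer 1
x = graph_layer(x, 32)                               # graph convolutional layer 2
shapes.append(x.shape)

x = upsample(x, 4)                                   # up-sampling layer 2: 320 -> 1280
x = graph_layer(x, 32)                               # graph convolutional layer 3
x = graph_layer(x, 3)                                # graph convolutional layer 4 -> 3D coordinates
shapes.append(x.shape)
```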

In some embodiments, the three-dimensional joint point position coordinates are calculated according to the vertex coordinates and the key point coordinates, which can be implemented in the following ways:

According to the vertex coordinates and key point coordinates, a linear graph convolution network is used to regress the three-dimensional joint point position coordinates.

In this embodiment, a simplified linear graph convolution can be used to linearly regress the position coordinates of the 3D hand joint points from the vertex coordinates of the three-dimensional hand mesh. The vertex coordinates of the three-dimensional mesh cover the key points of the entire hand, from which the three-dimensional coordinates of the 21 joint nodes can be directly selected. As shown in FIG. 3, a hand has 21 joint points, numbered 0 to 20, which together cover the entire hand posture. A two-layer graph convolutional network without nonlinear activation modules is used to estimate the 3D joint depth information directly from the 3D mesh vertices, and the previously obtained 2D key points are then used to generate the 3D joint position coordinates.
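As a non-limiting illustration, a two-layer graph convolution without nonlinear activation, followed by the combination of the estimated depth with the 2D key points, may be sketched as follows. Several elements are assumptions rather than features of this embodiment: the pooling matrix `joint_weights` that maps 1280 vertices to 21 joints, the ring-shaped toy graph, the hidden width of 16, the random weights, and the simple concatenation of (x, y) with the depth value.

```python
import numpy as np


def linear_gcn_depth(vertices, a_hat, w1, w2, joint_weights):
    """Two graph convolution layers with no activation, then pooling to joints.

    vertices: (N, 3) mesh vertex coordinates; a_hat: (N, N) normalized adjacency;
    w1: (3, H) and w2: (H, 1) layer weights; joint_weights: (21, N).
    Returns a (21,) array of estimated joint depths.
    """
    h = a_hat @ vertices @ w1        # layer 1, purely linear
    h = a_hat @ h @ w2               # layer 2, purely linear, one value per vertex
    return (joint_weights @ h)[:, 0]


def lift_to_3d(keypoints_2d, depth):
    """Attach a depth to each 2D key point: (21, 2) and (21,) -> (21, 3)."""
    return np.concatenate([keypoints_2d, depth[:, None]], axis=1)


rng = np.random.default_rng(0)
n = 1280
eye = np.eye(n)
ring = np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)    # toy graph: each vertex has 2 neighbors
a_hat = (ring + eye) / 3.0                                   # self-loops added, every degree is 3
vertices = rng.standard_normal((n, 3))
w1 = rng.standard_normal((3, 16))
w2 = rng.standard_normal((16, 1))
joint_weights = rng.random((21, n))
joint_weights /= joint_weights.sum(axis=1, keepdims=True)    # each joint is a weighted average of vertices
keypoints_2d = rng.random((21, 2)) * 64

depth = linear_gcn_depth(vertices, a_hat, w1, w2, joint_weights)
joints_3d = lift_to_3d(keypoints_2d, depth)
```

Because no activation is applied, the whole regressor is a single linear map of the vertex coordinates, which is what makes the regression linear.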

In this embodiment, the coordinates of the joint points covering the entire hand posture can be extracted, thereby improving the accuracy of the virtual hand posture synchronization process in virtual reality.

S15: Restore the hand posture according to the three-dimensional joint point position coordinates. Specifically, the hand posture corresponding to the hand image is restored in the virtual reality interface according to the three-dimensional joint point position coordinates, so that the hand posture data in the virtual reality is synchronized with the actual hand posture to the greatest extent, which enhances synchronization during the virtual interaction process.

In summary, this embodiment obtains a hand image in a preset state, extracts the feature image, key point coordinates and two-dimensional thermal image of the hand image, combines the feature image and the two-dimensional thermal image to generate a feature vector, generates the three-dimensional joint point position coordinates according to the feature vector and the key point coordinates, and finally restores the hand posture according to the three-dimensional joint point position coordinates. In the virtual interaction process, the interactor can therefore complete the interaction without wearing special gloves, which simplifies the equipment required for virtual interaction and broadens the application scenarios to a certain extent.

The embodiment of the present invention provides a hand data recognition system based on a graph convolutional network corresponding to the method in FIG. 1, including:

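The acquisition module is used to acquire a hand image in a preset state;

An extraction module for extracting feature images, key point coordinates and two-dimensional thermal images of the hand image;

The combining module is used to combine the feature image and the two-dimensional thermal image to generate a feature vector;

A generating module, configured to generate three-dimensional joint point position coordinates according to the feature vector and the key point coordinates;

The restoration module is used to restore the hand posture according to the position coordinates of the three-dimensional joint points.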

The contents of the method embodiments of the present invention are all applicable to the system embodiments, and the specific functions implemented by the system embodiments are the same as those of the foregoing method embodiments, and the beneficial effects achieved are also the same as those of the foregoing method.

The embodiment of the present invention provides a hand data recognition system based on graph convolutional network, including:

At least one memory for storing programs;

At least one processor is configured to load the program to execute the hand data recognition method based on the graph convolutional network.

In addition, an embodiment of the present invention provides a computer-readable storage medium, in which instructions executable by a processor are stored, and the instructions, when executed by the processor, are used to implement the hand data recognition method based on the graph convolutional network.

The above is a detailed description of the preferred implementation of the present invention, but the present invention is not limited to the described embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention. These equivalent modifications or replacements are all included in the scope defined by the claims of this application.

Claims

1. A hand data recognition method based on graph convolutional network, characterized in that it comprises the following steps:

Obtaining a hand image in a preset state;

Extracting feature images, key point coordinates and two-dimensional thermal images of the hand image;

Combining the feature image and the two-dimensional thermal image to generate a feature vector;

Generating three-dimensional joint point position coordinates according to the feature vector and the key point coordinates;

Restoring the hand posture according to the three-dimensional joint point position coordinates.

2. The hand data recognition method based on graph convolutional network according to claim 1, wherein said extracting the key point coordinates and two-dimensional thermal image of the hand image comprises:

Extracting key point feature positions from the first image by using a stacked hourglass network;

Predicting the two-dimensional heat map according to the characteristic positions of the key points, and determining the coordinates of the key points.

3. The hand data recognition method based on graph convolutional network according to claim 1, wherein said combining said characteristic image and said two-dimensional thermal image to generate a characteristic vector comprises:

Converting the size of the two-dimensional thermal image into the size of the characteristic image;

Calculating a feature vector through a convolutional network according to the feature image and the size-converted two-dimensional heat map.

4. The hand data recognition method based on graph convolutional network according to claim 1, wherein said generating three-dimensional joint point position coordinates according to said feature vector and said key point coordinates comprises:

Calculating the vertex coordinates of the three-dimensional grid according to the feature vector;

Calculating the three-dimensional joint point position coordinates according to the vertex coordinates and the key point coordinates.

5. The hand data recognition method based on graph convolutional network according to claim 4, wherein said calculating the vertex coordinates of the three-dimensional grid according to the feature vector is specifically:

According to the feature vector, the graph convolution network is used to calculate the coordinates of all the vertices of the three-dimensional grid.

6. The hand data recognition method based on graph convolutional network according to claim 4, wherein said calculating the three-dimensional joint point position coordinates according to the vertex coordinates and the key point coordinates is specifically:

According to the vertex coordinates and the key point coordinates, a linear graph convolution network is used to regress the three-dimensional joint point position coordinates.

7. The hand data recognition method based on graph convolutional network according to claim 1, wherein said restoring hand posture according to said three-dimensional joint point position coordinates is specifically:

Restoring the hand posture corresponding to the hand image in the virtual reality interface according to the three-dimensional joint point position coordinates.

8. A hand data recognition system based on graph convolutional network, characterized in that it comprises:

The acquisition module is used to acquire a hand image in a preset state;

An extraction module for extracting feature images, key point coordinates and two-dimensional thermal images of the hand image;

The combining module is used to combine the feature image and the two-dimensional thermal image to generate a feature vector;

A generating module, configured to generate three-dimensional joint point position coordinates according to the feature vector and the key point coordinates;

The restoration module is used to restore the hand posture according to the position coordinates of the three-dimensional joint points.

9. A hand data recognition system based on graph convolutional network, characterized in that it comprises:

At least one memory for storing programs;

At least one processor is configured to load the program to execute the hand data recognition method based on graph convolutional network according to any one of claims 1-7.

10. A computer-readable storage medium storing instructions executable by a processor, wherein the instructions, when executed by the processor, are used to implement the hand data recognition method based on graph convolutional network according to any one of claims 1-7.