CN112115894A

CN112115894A - Training method and device for hand key point detection model and electronic equipment

Info

Publication number: CN112115894A
Application number: CN202011016465.8A
Authority: CN
Inventors: 董亚娇
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2020-12-22
Anticipated expiration: 2040-09-24
Also published as: CN112115894B

Abstract

The present disclosure relates to a method and an apparatus for training a hand key point detection model, an electronic device, and a storage medium. The method comprises: acquiring three-dimensional position labeling information of hand key points in a real hand sample image, wherein the three-dimensional position labeling information is obtained by recognizing the real hand sample image with a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is trained on a real hand sample set and a virtual hand sample set; and training a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, to obtain a trained second hand key point detection model. The first hand key point detection model is a teacher model, and the second hand key point detection model is a student model. The method improves the accuracy with which the second hand key point detection model detects hand key points.

Description

Training method and device for hand key point detection model and electronic equipment

Technical Field

The present disclosure relates to the field of hand key point recognition technologies, and in particular, to a method and an apparatus for training a hand key point detection model, an electronic device, and a storage medium.

Background

Hand key point detection is an important component of interaction in mixed reality scenes such as AR (Augmented Reality) and VR (Virtual Reality). In this technology, three-dimensional position information of the hand key points in a real hand image is generally recognized on a mobile terminal by a trained hand key point detection model.

In the related art, labeling the three-dimensional position information of real hand key points is very difficult. A hand key point detection model is therefore generally trained on computer-generated virtual hand sample images and the three-dimensional position information of the hand key points in those images, and the trained model is deployed on a mobile terminal to identify the hand key points in captured hand images. However, computer-generated virtual hand data (the virtual hand sample images and the three-dimensional position information of their hand key points) differ from real hand data and are not realistic enough to pass for real data, so a hand key point detection model trained in this way detects hand key points with low accuracy.

Disclosure of Invention

The present disclosure provides a method and an apparatus for training a hand key point detection model, an electronic device, and a storage medium, so as to at least solve the problem in the related art that the accuracy of detection of a hand key point by a hand key point detection model obtained through training is low. The technical scheme of the disclosure is as follows:

according to a first aspect of the embodiments of the present disclosure, there is provided a method for training a hand keypoint detection model, including:

acquiring three-dimensional position marking information of hand key points in a real hand sample image; the three-dimensional position labeling information is obtained by identifying the real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises the real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image;

training a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position marking information of the hand key points in the real hand sample image to obtain a trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the trained second hand key point detection model is a student model.

In an exemplary embodiment, the training of a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, to obtain a trained second hand key point detection model, includes:

inputting the real hand sample image into a second hand key point detection model to be trained to obtain three-dimensional position prediction information of the hand key points in the real hand sample image;

determining a loss value of the second hand key point detection model to be trained according to a difference value between the three-dimensional position prediction information and the three-dimensional position marking information;

and adjusting the model parameters of the second hand key point detection model to be trained according to the loss value to obtain the trained second hand key point detection model.
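The three steps above (predict, compute a loss from the difference to the teacher's labels, adjust parameters) can be illustrated with a minimal sketch. It assumes a toy linear student, a mean squared error loss, and plain gradient descent; the disclosure does not specify the loss function, the optimizer, or the network, and the names `predict`, `mse_loss`, and `train_step` are hypothetical.

```python
from typing import List, Sequence

Vec = List[float]


def predict(weights: List[Vec], features: Sequence[float]) -> Vec:
    """Hypothetical linear student: one output coordinate per weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def mse_loss(pred: Sequence[float], label: Sequence[float]) -> float:
    """Loss value from the difference between prediction and teacher label."""
    return sum((p - t) ** 2 for p, t in zip(pred, label)) / len(pred)


def train_step(weights: List[Vec], features: Sequence[float],
               label: Sequence[float], lr: float = 0.1) -> float:
    """One gradient-descent update in place; returns the loss before the update."""
    pred = predict(weights, features)
    loss = mse_loss(pred, label)
    n = len(pred)
    for i, row in enumerate(weights):
        grad_scale = 2.0 * (pred[i] - label[i]) / n  # d(loss)/d(pred[i])
        for j, x in enumerate(features):
            row[j] -= lr * grad_scale * x
    return loss


# Toy data: an assumed image feature vector and a teacher-provided label
# (a single key point flattened as u, v, depth).
features = [1.0, 0.5]
teacher_label = [0.2, -0.4, 0.9]
weights = [[0.0, 0.0] for _ in teacher_label]
losses = [train_step(weights, features, teacher_label) for _ in range(50)]
```

In this toy setting the loss shrinks at every step; a real student would be a compact neural network trained over many images rather than a single feature vector.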

In an exemplary embodiment, the inputting the real hand sample image into a second hand keypoint detection model to be trained to obtain three-dimensional position prediction information of hand keypoints in the real hand sample image includes:

performing segmentation processing on the real hand sample image to obtain a plurality of key area images in the real hand sample image; the key area image comprises a hand key point;

inputting the multiple key area images in the real hand sample image into a second hand key point detection model to be trained, and performing convolution processing and full-connection processing on the multiple key area images through the second hand key point detection model to be trained to obtain three-dimensional position prediction information of the hand key points in the real hand sample image.
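As a rough illustration of splitting an image into key area images, the sketch below crops fixed-size windows from an image stored as a nested list. The disclosure does not say how the regions are chosen; this sketch assumes candidate center positions are already available, and `crop_region` and `split_into_key_regions` are hypothetical names.

```python
def crop_region(image, center_row, center_col, half_size):
    """Crop a square window around a center, clamped to the image bounds."""
    rows, cols = len(image), len(image[0])
    top = max(0, center_row - half_size)
    bottom = min(rows, center_row + half_size + 1)
    left = max(0, center_col - half_size)
    right = min(cols, center_col + half_size + 1)
    return [row[left:right] for row in image[top:bottom]]


def split_into_key_regions(image, centers, half_size=1):
    """Return one cropped region per assumed key point center (row, col)."""
    return [crop_region(image, r, c, half_size) for r, c in centers]


# Toy 4x4 single-channel "image" whose pixel values encode their position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
regions = split_into_key_regions(image, centers=[(1, 1), (0, 3)])
```

Windows near the border come out smaller because the crop is clamped; a real pipeline would more likely pad or resize so that every region has the same shape before the convolution layers.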

In an exemplary embodiment, the pre-trained first hand keypoint detection model is trained by:

inputting the real hand sample image and the virtual hand sample image into a first hand key point detection model to be trained respectively to obtain two-dimensional position prediction information of hand key points in the real hand sample image and three-dimensional position prediction information of the hand key points in the virtual hand sample image;

obtaining a first sub-loss value according to a difference value between two-dimensional position prediction information of a hand key point in the real hand sample image and the two-dimensional position marking information, obtaining a second sub-loss value according to a difference value between three-dimensional position prediction information of the hand key point in the virtual hand sample image and the three-dimensional position marking information, and obtaining a total loss value of the first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value;

and adjusting the model parameters of the first hand key point detection model to be trained according to the total loss value to obtain the pre-trained first hand key point detection model.
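The combination of the two sub-loss values can be sketched as follows. The disclosure only states that the total loss is obtained "according to" the first and second sub-loss values; the weighted sum, the mean squared difference, and the weights `w_2d` and `w_3d` used here are assumptions for illustration.

```python
def mean_squared_diff(pred, label):
    """Mean squared difference over all coordinates of all key points."""
    total, count = 0.0, 0
    for p, t in zip(pred, label):
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def teacher_total_loss(pred_2d_real, label_2d_real,
                       pred_3d_virtual, label_3d_virtual,
                       w_2d=1.0, w_3d=1.0):
    """Assumed weighted sum of the 2D (real) and 3D (virtual) sub-losses."""
    first_sub_loss = mean_squared_diff(pred_2d_real, label_2d_real)
    second_sub_loss = mean_squared_diff(pred_3d_virtual, label_3d_virtual)
    return w_2d * first_sub_loss + w_3d * second_sub_loss


total = teacher_total_loss(
    pred_2d_real=[(1.0, 2.0)], label_2d_real=[(1.0, 4.0)],
    pred_3d_virtual=[(0.0, 0.0, 1.0)], label_3d_virtual=[(0.0, 0.0, 0.0)],
)
```

Real samples contribute only through their two-dimensional labels and virtual samples only through their three-dimensional labels, which is what lets a single model learn from both sets.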

In an exemplary embodiment, the set of virtual hand samples is obtained by:

acquiring a plurality of different hand models and background images corresponding to the hand models;

performing three-dimensional rendering processing on each hand model on a background image corresponding to each hand model respectively to obtain a virtual hand sample image corresponding to each hand model and three-dimensional position marking information of a hand key point in each virtual hand sample image; the virtual hand sample image corresponding to each hand model comprises a background image of the corresponding hand model;

and constructing the virtual hand sample set according to the virtual hand sample images and the three-dimensional position marking information of the hand key points in the virtual hand sample images.
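A minimal sketch of assembling the virtual hand sample set is given below. The renderer is passed in as a callable because the disclosure does not name a rendering engine; `toy_render` is a stand-in that produces a string instead of an image, and the pinhole projection in `project_joints` (with its focal length and principal point) is an assumed camera model, not something stated in the source.

```python
def project_joints(joints_xyz, focal=500.0, cx=128.0, cy=128.0):
    """Assumed pinhole projection: returns (u, v, depth) for each 3D joint."""
    return [(focal * x / z + cx, focal * y / z + cy, z) for x, y, z in joints_xyz]


def build_virtual_sample_set(hand_models, backgrounds, render):
    """Render each hand model on its background and keep image plus 3D labels."""
    samples = []
    for model, background in zip(hand_models, backgrounds):
        image, labels_3d = render(model, background)
        samples.append({"image": image, "keypoints_3d": labels_3d})
    return samples


def toy_render(model, background):
    """Stand-in renderer: a descriptive string plays the role of the image."""
    image = f"{model['name']} over {background}"
    return image, project_joints(model["joints_xyz"])


hand_models = [{"name": "hand_a", "joints_xyz": [(0.0, 0.0, 0.5), (0.01, 0.0, 0.5)]}]
virtual_set = build_virtual_sample_set(hand_models, ["bg_0"], toy_render)
```

Because the joint positions of a rendered hand model are known exactly, the three-dimensional labels come for free with each rendered image.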

In an exemplary embodiment, the method further comprises:

acquiring a real hand test image, inputting the real hand test image into the trained second hand key point detection model, and obtaining three-dimensional position prediction information of the hand key points in the real hand test image; the real hand test image carries three-dimensional position marking information of corresponding hand key points;

screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image, wherein the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position marking information is larger than a preset difference value;

according to the target hand test image and the three-dimensional position marking information of the hand key points in the target hand test image, retraining the trained second hand key point detection model to obtain a retrained second hand key point detection model;

and updating the trained second hand key point detection model into the retrained second hand key point detection model.
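The screening step above can be sketched as a simple filter. The disclosure does not define how the difference value is measured; this sketch assumes the mean Euclidean distance over key points, and `screen_hard_samples` and the stub predictor are hypothetical.

```python
import math


def mean_keypoint_error(pred, label):
    """Mean Euclidean distance between predicted and labeled 3D key points."""
    return sum(math.dist(p, t) for p, t in zip(pred, label)) / len(pred)


def screen_hard_samples(samples, predict_fn, max_error):
    """Keep test samples whose prediction error exceeds the preset difference."""
    hard = []
    for image, label in samples:
        if mean_keypoint_error(predict_fn(image), label) > max_error:
            hard.append((image, label))
    return hard


# Toy test set: each item is (image identifier, labeled 3D key points).
test_samples = [
    ("img_a", [(0.0, 0.0, 0.0)]),
    ("img_b", [(0.0, 0.0, 3.0)]),
]
stub_predict = lambda image: [(0.0, 0.0, 0.0)]  # stand-in for the student model
hard_samples = screen_hard_samples(test_samples, stub_predict, max_error=1.0)
```

The selected samples would then be fed back into training, so that retraining concentrates on the images the student currently handles worst.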

According to a second aspect of the embodiments of the present disclosure, there is provided a method for detecting a hand keypoint, including:

collecting hand images;

inputting the hand image into a trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key points in the hand image; the trained second hand keypoint detection model is obtained by training according to the training method of the hand keypoint detection model described in any embodiment of the first aspect.

According to a third aspect of the embodiments of the present disclosure, there is provided a training device for a hand keypoint detection model, comprising:

an information acquisition unit configured to perform acquisition of three-dimensional position labeling information of a hand key point in a real hand sample image; the three-dimensional position labeling information is obtained by identifying the real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises the real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image;

the first training unit is configured to execute training of a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, and obtain the trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the trained second hand key point detection model is a student model.

In an exemplary embodiment, the first training unit is further configured to perform inputting the real hand sample image into a second hand keypoint detection model to be trained, to obtain three-dimensional position prediction information of hand keypoints in the real hand sample image; determining a loss value of the second hand key point detection model to be trained according to a difference value between the three-dimensional position prediction information and the three-dimensional position marking information; and adjusting the model parameters of the second hand key point detection model to be trained according to the loss value to obtain the trained second hand key point detection model.

In an exemplary embodiment, the first training unit is further configured to perform a segmentation process on the real hand sample image, resulting in a plurality of key area images in the real hand sample image; the key area image comprises a hand key point; inputting the multiple key area images in the real hand sample image into a second hand key point detection model to be trained, and performing convolution processing and full-connection processing on the multiple key area images through the second hand key point detection model to be trained to obtain three-dimensional position prediction information of the hand key points in the real hand sample image.

In an exemplary embodiment, the apparatus further includes a second training unit configured to perform inputting the real hand sample image and the virtual hand sample image into a first hand key point detection model to be trained, respectively, to obtain two-dimensional position prediction information of hand key points in the real hand sample image and three-dimensional position prediction information of hand key points in the virtual hand sample image; obtaining a first sub-loss value according to a difference value between two-dimensional position prediction information of a hand key point in the real hand sample image and the two-dimensional position marking information, obtaining a second sub-loss value according to a difference value between three-dimensional position prediction information of the hand key point in the virtual hand sample image and the three-dimensional position marking information, and obtaining a total loss value of the first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value; and adjusting the model parameters of the first hand key point detection model to be trained according to the total loss value to obtain the pre-trained first hand key point detection model.

In an exemplary embodiment, the apparatus further includes a sample set acquisition unit configured to perform acquiring a plurality of different hand models and background images corresponding to the hand models; performing three-dimensional rendering processing on each hand model on a background image corresponding to each hand model respectively to obtain a virtual hand sample image corresponding to each hand model and three-dimensional position marking information of a hand key point in each virtual hand sample image; the virtual hand sample image corresponding to each hand model comprises a background image of the corresponding hand model; and constructing the virtual hand sample set according to the virtual hand sample images and the three-dimensional position marking information of the hand key points in the virtual hand sample images.

In an exemplary embodiment, the apparatus further includes a model updating unit, further configured to perform acquiring a real hand test image, and input the real hand test image into the trained second hand keypoint detection model, so as to obtain three-dimensional position prediction information of the hand keypoints in the real hand test image; the real hand test image carries three-dimensional position marking information of corresponding hand key points; screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image, wherein the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position marking information is larger than a preset difference value; according to the target hand test image and the three-dimensional position marking information of the hand key points in the target hand test image, retraining the trained second hand key point detection model to obtain a retrained second hand key point detection model; and updating the trained second hand key point detection model into the retrained second hand key point detection model.

According to a fourth aspect of the embodiments of the present disclosure, there is provided a device for detecting a hand keypoint, comprising:

a hand image acquisition unit configured to perform acquisition of a hand image;

a position information determining unit configured to perform inputting the hand image into a trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key point in the hand image; the trained second hand keypoint detection model is obtained by training according to the training method of the hand keypoint detection model described in any embodiment of the first aspect.

According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a method of training a hand keypoint detection model as described in any embodiment of the first aspect and a method of detecting hand keypoints as described in any embodiment of the second aspect.

According to a sixth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of training a hand keypoint detection model as described in any of the embodiments of the first aspect and the method of detecting hand keypoints as described in any of the embodiments of the second aspect.

According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, wherein at least one processor of a device reads the computer program from the readable storage medium and executes it, so that the device performs the method of training a hand keypoint detection model as described in any of the first aspect embodiments and the method of detecting hand keypoints as described in any of the second aspect embodiments.

The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:

A pre-trained first hand key point detection model is obtained by training on a real hand sample set, which comprises real hand sample images and two-dimensional position labeling information of the hand key points in them, and a virtual hand sample set, which comprises virtual hand sample images and three-dimensional position labeling information of the hand key points in them. Three-dimensional position labeling information of the hand key points in the real hand sample image is then obtained through the pre-trained first hand key point detection model. Finally, a second hand key point detection model to be trained is trained according to the real hand sample image and that three-dimensional position labeling information, to obtain a trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the trained second hand key point detection model is a student model. The second hand key point detection model is thus trained on real hand sample images and three-dimensional position labeling information of their hand key points. Because the pre-trained first hand key point detection model is trained on both the real hand sample set and the virtual hand sample set, the three-dimensional position labeling information it outputs for the real hand sample image can approach real hand data. The second hand key point detection model trained on this information therefore has higher model precision, which improves the accuracy with which the trained second hand key point detection model detects hand key points and overcomes the low detection accuracy of a hand key point detection model trained on virtual hand data.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is an application environment diagram illustrating a method for training a hand keypoint detection model according to an exemplary embodiment.

FIG. 2 is a flow diagram illustrating a method of training a hand keypoint detection model, according to an example embodiment.

FIG. 3a is a schematic diagram illustrating an RGB image containing a real human hand according to an example embodiment.

FIG. 3b is a schematic diagram illustrating two-dimensional position information for a joint point of a human hand according to an exemplary embodiment.

FIG. 3c is a schematic diagram illustrating an RGB image containing a virtual human hand, according to an example embodiment.

FIG. 4 is a flowchart illustrating steps for training a second hand keypoint detection model to be trained in accordance with an exemplary embodiment.

FIG. 5 is a flow diagram illustrating another method of training a hand keypoint detection model, according to an example embodiment.

FIG. 6 is a flow diagram illustrating yet another method of training a hand keypoint detection model, according to an example embodiment.

FIG. 7 is a block diagram illustrating a training apparatus for a hand keypoint detection model, according to an example embodiment.

FIG. 8 is a block diagram illustrating an apparatus for detecting a hand keypoint, according to an exemplary embodiment.

FIG. 9 is an internal block diagram of an electronic device shown in accordance with an example embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

The training method of the hand key point detection model provided by the disclosure can be applied to the application environment shown in FIG. 1. Referring to FIG. 1, the application environment includes a terminal 110, which is an electronic device with a hand key point detection function, such as a smartphone, a tablet computer, a notebook computer, or a personal computer. FIG. 1 takes a smartphone as an example of the terminal 110. The terminal 110 obtains three-dimensional position labeling information of the hand key points in a real hand sample image; the three-dimensional position labeling information is obtained by recognizing the real hand sample image with a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is trained on a real hand sample set and a virtual hand sample set; the real hand sample set comprises the real hand sample image and two-dimensional position labeling information of the hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of the hand key points in the virtual hand sample image. The terminal 110 then trains a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, to obtain a trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the trained second hand key point detection model is a student model.

The trained second hand key point detection model can output three-dimensional position prediction information of the hand key points in a real hand image or a virtual hand image. This information can be used for gesture recognition, finger orientation prediction, finger joint rotation angle prediction, and the like, and plays an important role in mixed reality scenes such as AR (Augmented Reality) and VR (Virtual Reality).

FIG. 2 is a flowchart illustrating a method for training a hand key point detection model according to an exemplary embodiment. As shown in FIG. 2, the method is used in the terminal shown in FIG. 1 and includes the following steps:

in step S210, three-dimensional position labeling information of a hand key point in a real hand sample image is obtained; the three-dimensional position marking information is obtained by identifying a real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises a real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of the hand key points in the virtual hand sample image.

The real hand sample image is an RGB image containing a real human hand, as shown in FIG. 3a, and can be captured by a camera such as a monocular camera. The hand key points are the joint points of the human hand, and each joint point has corresponding two-dimensional position information in the image, as shown in FIG. 3b. The three-dimensional position labeling information of a hand key point adds depth information to the two-dimensional position information of that key point, where depth is the distance between the hand key point and a visual reference point (such as the position of the camera lens).
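The relation between the two kinds of label can be made concrete with a small data-structure sketch. The field names `u`, `v`, and `depth`, and the helper `lift_to_3d`, are hypothetical; the disclosure only says that the three-dimensional label is the two-dimensional position plus a depth value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyPoint2D:
    u: float  # horizontal image coordinate
    v: float  # vertical image coordinate


@dataclass(frozen=True)
class KeyPoint3D:
    u: float
    v: float
    depth: float  # distance from the visual reference point (e.g. the lens)


def lift_to_3d(point: KeyPoint2D, depth: float) -> KeyPoint3D:
    """Attach a depth value to a 2D key point to form a 3D label."""
    return KeyPoint3D(point.u, point.v, depth)


wrist_3d = lift_to_3d(KeyPoint2D(120.0, 80.0), depth=0.45)
```

The units of the coordinates and of the depth value are left open here, since the source does not state them.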

The real hand sample set comprises a plurality of real hand sample images and two-dimensional position labeling information of hand key points in each real hand sample image; the two-dimensional position labeling information of the hand key points in the real hand sample image is obtained by labeling, such as manual labeling.

The virtual hand sample set comprises a plurality of virtual hand sample images and three-dimensional position labeling information of the hand key points in each virtual hand sample image. A virtual hand sample image is an RGB image containing a virtual human hand and may be generated by a three-dimensional rendering technique, as shown in FIG. 3c; the three-dimensional position labeling information of the hand key points in the virtual hand sample image is generated together with the image during rendering.

The pre-trained first hand key point detection model is a teacher model, specifically a large neural network model with a complex structure and a large amount of computation, such as a convolutional neural network or a deep neural network; the disclosure does not specifically limit it. In an actual scene, the pre-trained first hand key point detection model may be a large neural network model with a large amount of computation and a large number of parameters, or one with a large network capacity. The pre-trained first hand key point detection model has high model precision and fitting capability and can accurately output the three-dimensional position information of the hand key points in the real hand sample image, but it is time-consuming and computationally demanding, and is therefore not suitable for deployment on a mobile terminal with weak processing capability. In addition, the large neural network of the first hand key point detection model has good transfer capability and can handle the domain adaptation problem between the real hand sample set and the virtual hand sample set. Therefore, even though the real hand sample set has no three-dimensional position labeling information of the hand key points, the model can, with the help of the three-dimensional position labeling information of the hand key points in the virtual hand sample set, predict three-dimensional position information well on real hand images.

It should be noted that the present disclosure targets mobile terminals, which have limited processing capability and high real-time requirements; if a large neural network model such as the pre-trained first hand key point detection model were deployed on a mobile terminal, real-time prediction of the three-dimensional position information of hand key points would be difficult to achieve.

Specifically, the terminal acquires a real hand sample image, which may be a hand image obtained by real-time shooting or a hand image obtained from a network or a local database; and then inputting the real hand sample image into a pre-trained first hand key point detection model, and performing image detection processing, such as convolution processing and full-connection processing, on the real hand sample image through the pre-trained first hand key point detection model to obtain three-dimensional position information of the hand key points in the real hand sample image, wherein the three-dimensional position information is used as three-dimensional position marking information of the hand key points in the real hand sample image. Therefore, the three-dimensional position marking information of the hand key points in the real hand sample image is output through the pre-trained first hand key point detection model, and the three-dimensional position information of the hand key points is not required to be marked manually, so that the marking process of the three-dimensional position information of the hand key points in the real hand sample image is simplified, the obtaining efficiency of the three-dimensional position marking information of the hand key points in the real hand sample image is further improved, and the problem that the three-dimensional position information of the real hand key points is difficult to mark is solved; and the defect that errors are easy to occur in the manual marking of the three-dimensional position information of the hand key points is avoided, and the accuracy of the obtained three-dimensional position marking information of the hand key points is further improved.
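The labeling step described above amounts to running the teacher over the real images and storing its outputs as labels. A minimal sketch follows, with a stub standing in for the pre-trained first model; `build_pseudo_labeled_set` and `toy_teacher` are hypothetical names, and the stub's output carries no meaning beyond illustration.

```python
def build_pseudo_labeled_set(real_images, teacher_predict_3d):
    """Pair each real image with the teacher's 3D output, used as its label."""
    return [(image, teacher_predict_3d(image)) for image in real_images]


def toy_teacher(image):
    """Stand-in for the pre-trained first model: one (u, v, depth) key point."""
    return [(float(len(image)), 0.0, 0.5)]


pseudo_labeled = build_pseudo_labeled_set(["img_a", "img_bcd"], toy_teacher)
```

The resulting pairs are what the second (student) model is trained on, so no manual three-dimensional annotation of the real images is needed.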

Further, the pre-trained first hand key point detection model is trained as follows. First, RGB pictures containing virtual hands, together with the three-dimensional coordinate information of the corresponding hand key points in those pictures, are generated by a three-dimensional rendering technology and are used, respectively, as virtual hand sample images and as three-dimensional position marking information of the hand key points in the virtual hand sample images; a virtual hand sample set is then constructed from the virtual hand sample images and the three-dimensional position marking information of the hand key points in the virtual hand sample images. Next, RGB pictures containing real human hands captured by a monocular camera, together with manually marked two-dimensional coordinate information of the hand key points in those pictures, are acquired and used, respectively, as real hand sample images and as two-dimensional position marking information of the hand key points in the real hand sample images; a real hand sample set is then constructed from the real hand sample images and the two-dimensional position marking information of the hand key points in the real hand sample images. Finally, the first hand key point detection model to be trained is trained on the real hand sample set and the virtual hand sample set until its loss value is smaller than a first preset threshold value, and the resulting model is taken as the pre-trained first hand key point detection model, through which three-dimensional position information of the hand key points in a real hand sample image can be output.

It should be noted that three-dimensional position information of the hand key points in the virtual hand sample image may be output by the first hand key point detection model trained in advance.

It should be noted that the real hand sample set and the virtual hand sample set are used together as the training sample set of the first hand key point detection model for two main reasons: (1) although a three-dimensional rendering technology can directly generate virtual data comprising RGB pictures of virtual human hands and the three-dimensional coordinate information of the corresponding hand key points in those pictures, the limitations of the rendering technology restrict the backgrounds and hand types in the generated pictures, so the currently generated virtual data still differs to some extent from real data and cannot pass for real data; (2) although RGB pictures containing real hands can be captured in any scene and can contain rich background and hand type information, the three-dimensional coordinate information of real hand key points is very difficult to label, and only the two-dimensional coordinate information of real hand key points can be obtained. Training the first hand key point detection model on the real hand sample set and the virtual hand sample set together therefore enables the trained first hand key point detection model to accurately predict the three-dimensional position information of hand key points.

In step S220, a second hand key point detection model to be trained is trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image to obtain a trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the trained second hand key point detection model is a student model.

The second hand key point detection model is a student model, specifically a small neural network model with a simple model structure and a small calculation amount, such as a convolutional neural network with a simple model structure and a small calculation amount, a deep neural network and the like, and the disclosure is not limited specifically; in an actual scene, the second hand key point detection model may refer to a small neural network model with small calculated amount and parameter amount, or may refer to a small neural network model with small network capacity; in addition, the second hand key point detection model is low in fitting capacity and low in calculation capacity requirement, and is suitable for being operated in a mobile terminal, such as a smart phone, so that the effect of detecting the three-dimensional position information of the hand key points in real time is achieved.

In the technical field of knowledge distillation, a large neural network model with a complex model structure and a large calculation amount, such as the first hand key point detection model, is called a teacher model; a small neural network model with a simple structure and a small calculation amount, such as the second hand key point detection model, is called a student model. In this scheme, the three-dimensional position labeling information of the hand key points in the real hand sample image output by the teacher model (the first hand key point detection model) is used to train the student model (the second hand key point detection model). In this way, the domain adaptation capability of the first hand key point detection model between the real hand sample set and the virtual hand sample set can be indirectly transferred to the second hand key point detection model, so that the second hand key point detection model's prediction of the three-dimensional position information of hand key points approaches that of the first hand key point detection model.

Specifically, the terminal trains the second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image until the loss value of the second hand key point detection model is smaller than a second preset threshold value, and takes the model at that point as the trained second hand key point detection model. Training the second hand key point detection model multiple times on the real hand sample image and the three-dimensional position marking information of the hand key points in the real hand sample image improves the accuracy of the three-dimensional position information of the hand key points output by the trained second hand key point detection model, and overcomes the low hand key point detection accuracy of a hand key point detection model trained on virtual hand data.

Furthermore, the trained second hand key point detection model is mounted in the mobile terminal, so that the three-dimensional position information of the hand key points in the real hand image or the virtual hand image can be predicted in real time, and special hardware equipment does not need to be additionally connected, so that the use threshold of the three-dimensional position information prediction technology of the hand key points is reduced.

In the above training method of the hand key point detection model, the pre-trained first hand key point detection model is trained on a real hand sample set, which comprises real hand sample images and two-dimensional position marking information of the hand key points in the real hand sample images, and on a virtual hand sample set, which comprises virtual hand sample images and three-dimensional position marking information of the hand key points in the virtual hand sample images. Three-dimensional position marking information of the hand key points in the real hand sample image is then obtained through the pre-trained first hand key point detection model. Finally, the second hand key point detection model to be trained is trained according to the real hand sample image and the three-dimensional position marking information of the hand key points in the real hand sample image to obtain the trained second hand key point detection model, where the pre-trained first hand key point detection model is a teacher model and the trained second hand key point detection model is a student model. The second hand key point detection model is thus trained on real hand sample images and their three-dimensional position marking information. Because the pre-trained first hand key point detection model is trained on both the real hand sample set and the virtual hand sample set, the three-dimensional position marking information that it outputs for the hand key points in the real hand sample image can approach real hand data, so the second hand key point detection model trained on the real hand sample image and this marking information has higher model precision. This improves the accuracy with which the trained second hand key point detection model detects hand key points, and overcomes the low hand key point detection accuracy of a hand key point detection model trained on virtual hand data.

In an exemplary embodiment, as shown in fig. 4, in step S220, a second hand keypoint detection model to be trained is trained according to a real hand sample image and three-dimensional position labeling information of hand keypoints in the real hand sample image, so as to obtain a trained second hand keypoint detection model, which may specifically be implemented by the following steps:

in step S410, the real hand sample image is input into the second hand keypoint detection model to be trained, and three-dimensional position prediction information of the hand keypoints in the real hand sample image is obtained.

Specifically, the terminal inputs the real hand sample image into the second hand key point detection model to be trained, extracts image features of the real hand sample image through the second hand key point detection model to be trained, and performs neural network processing (such as convolution processing, pooling processing, full-connection processing and the like) on the image features of the real hand sample image to obtain three-dimensional position prediction information of the hand key points in the real hand sample image.

In step S420, a loss value of the second hand keypoint detection model to be trained is determined according to a difference between the three-dimensional position prediction information and the three-dimensional position labeling information.

And the loss value of the second hand key point detection model is used for measuring the detection accuracy of the second hand key point detection model on the hand key points.

Specifically, the terminal obtains the difference value between the three-dimensional position prediction information of the hand key points in the real hand sample image and the corresponding three-dimensional position marking information, and calculates the loss value of the second hand key point detection model to be trained from this difference value using a loss function (such as a minimum mean square error loss function).
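As an illustration of the loss calculation in step S420, the following sketch computes a mean squared error over key point coordinates. It assumes that each key point is an (x, y, z) tuple and that the error is averaged over all coordinates; the disclosure names the loss function only as an example and does not fix the averaging convention, so `keypoint_mse` is a hypothetical helper.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def keypoint_mse(pred: List[Point3D], label: List[Point3D]) -> float:
    """Mean squared error over all coordinates of all key points."""
    if len(pred) != len(label) or not pred:
        raise ValueError("pred and label must be non-empty and the same length")
    total = 0.0
    for p, q in zip(pred, label):
        # Squared difference between predicted and marked coordinates.
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / (3 * len(pred))


# Two key points; only the z coordinate of the first one is off, by 3.
loss = keypoint_mse(
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(0.0, 0.0, 3.0), (1.0, 1.0, 1.0)],
)
```

A smaller value indicates that the three-dimensional position prediction information is closer to the three-dimensional position marking information.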

In step S430, according to the loss value, the model parameters of the second hand keypoint detection model to be trained are adjusted to obtain the trained second hand keypoint detection model.

Specifically, the terminal determines a model parameter updating gradient of the second hand key point detection model to be trained according to its loss value; updates the model parameters of the second hand key point detection model to be trained by back propagation according to the model parameter updating gradient to obtain a second hand key point detection model with updated model parameters; and trains the second hand key point detection model with the updated model parameters again until its loss value is smaller than a second preset threshold value, at which point the resulting model is taken as the trained second hand key point detection model.

For example, the terminal uses the second hand key point detection model with the updated model parameters as the second hand key point detection model to be trained and repeats steps S410 to S430 to continuously adjust its model parameters until the loss value obtained from the second hand key point detection model is smaller than the second preset threshold value, and then takes the resulting model as the trained second hand key point detection model.
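The loop formed by steps S410 to S430 can be sketched with a deliberately tiny stand-in for the student model. The sketch assumes a one-parameter-pair linear model `y = w * x + b` fitted by plain gradient descent, which is a simplification and not the disclosed network; `train_student`, the learning rate, and the step limit are all hypothetical.

```python
from typing import List, Tuple


def train_student(
    features: List[float],
    teacher_labels: List[float],
    threshold: float = 1e-4,
    lr: float = 0.1,
    max_steps: int = 10_000,
) -> Tuple[float, float, float]:
    """Fit a toy student y = w * x + b to teacher labels by gradient descent."""
    w, b = 0.0, 0.0
    n = len(features)
    loss = float("inf")
    for _ in range(max_steps):
        # S410: forward pass of the student model.
        preds = [w * x + b for x in features]
        # S420: mean squared error against the teacher's labels.
        errors = [p - y for p, y in zip(preds, teacher_labels)]
        loss = sum(e * e for e in errors) / n
        if loss < threshold:
            break  # loss is below the (second) preset threshold
        # S430: gradient of the loss, followed by a parameter update.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, features)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss


# Teacher labels follow y = 2x + 1, so the student should approach w=2, b=1.
w, b, final_loss = train_student([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

In a real network the gradient in the S430 step would be obtained by back propagation rather than by the closed-form expressions used here.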

It should be noted that, if the second hand key point detection model were trained directly on the real hand sample set and the virtual hand sample set used to train the first hand key point detection model, its prediction of the three-dimensional position information of hand key points would be very poor. The reasons are that the virtual data in the virtual hand sample set cannot pass for real data, and that the second hand key point detection model (a small neural network model) has fewer parameters to optimize and generalizes less well than a large neural network model (such as the first hand key point detection model), so it cannot solve the domain adaptation problem between the virtual hand sample set and the real hand sample set. Therefore, the second hand key point detection model is trained using the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image output by the pre-trained first hand key point detection model, so that the domain adaptation capability of the first hand key point detection model can be indirectly transferred to the second hand key point detection model.

According to the technical scheme provided by the embodiment of the disclosure, the purpose of training the second hand key point detection model according to the real hand sample image and the three-dimensional position marking information of the hand key points in the real hand sample image is realized, the defect that the detection accuracy of the hand key point detection model obtained by training only through virtual hand data is low on the hand key points is avoided, and the detection accuracy of the second hand key point detection model obtained by training on the hand key points is improved; meanwhile, the three-dimensional position information of the hand key points in the real hand sample image output by the pre-trained first hand key point detection model is used as the three-dimensional position marking information of the hand key points in the real hand sample image and is not required to be obtained through manual marking, so that the defects that the process of manually marking the three-dimensional position information of the hand key points is complicated and errors are easy to occur are overcome, the accuracy of the obtained three-dimensional position marking information of the hand key points in the real hand sample image is improved, and the model precision of a second hand key point detection model obtained through training based on the three-dimensional position marking information of the hand key points in the real hand sample image is further improved.

In an exemplary embodiment, in step S410, inputting the real hand sample image into the second hand keypoint detection model to be trained to obtain three-dimensional position prediction information of the hand keypoints in the real hand sample image includes: performing segmentation processing on the real hand sample image to obtain a plurality of key area images in the real hand sample image, each key area image containing a hand key point; and inputting the plurality of key area images in the real hand sample image into the second hand key point detection model to be trained, and performing convolution processing and full-connection processing on each of the plurality of key area images through the second hand key point detection model to be trained to obtain three-dimensional position prediction information of the hand key points in the real hand sample image.

The key area image is an area image containing a hand key point, and the image size of the key area image is smaller than that of the whole real hand sample image.

Specifically, the terminal inputs the real hand sample image into a pre-trained hand image segmentation model, which segments the real hand sample image to obtain a plurality of key area images in the real hand sample image; the terminal then inputs the plurality of key area images into the second hand key point detection model to be trained, which performs convolution processing and full-connection processing on each key area image to obtain three-dimensional position prediction information of the hand key point in each key area image, and the three-dimensional position prediction information of the hand key points in the real hand sample image is obtained from the three-dimensional position prediction information of the hand key points in the individual key area images.
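One possible way to combine the per-region predictions is sketched below. The disclosure does not specify how the results of the individual key area images are merged, so this sketch assumes that each key area image is an axis-aligned box and that region-local coordinates are shifted back into full-image coordinates; `Box`, `crop`, `predict_from_regions`, and `centre_model` are hypothetical names.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, height, width); layout is an assumption
Point3D = Tuple[float, float, float]


def crop(image: Image, box: Box) -> Image:
    """Cut one key area image out of the full image."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def predict_from_regions(
    image: Image,
    boxes: List[Box],
    region_model: Callable[[Image], Point3D],
) -> List[Point3D]:
    """Run the model on each key area image and merge the results."""
    keypoints = []
    for box in boxes:
        top, left, _, _ = box
        x, y, z = region_model(crop(image, box))
        # Shift the region-local (x, y) back into full-image coordinates.
        keypoints.append((x + left, y + top, z))
    return keypoints


def centre_model(region: Image) -> Point3D:
    """Hypothetical model: reports the centre of the region at depth 0."""
    return ((len(region[0]) - 1) / 2, (len(region) - 1) / 2, 0.0)


image = [[0.0] * 4 for _ in range(4)]
merged = predict_from_regions(image, [(0, 0, 2, 2), (2, 2, 2, 2)], centre_model)
```

Because each key area image is smaller than the whole image, the model processes fewer pixels per key point, which is the efficiency benefit the embodiment relies on.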

The pre-trained hand image segmentation model is obtained by training a preset neural network (such as a convolutional neural network) on acquired image recognition training samples, and is used for performing segmentation processing on an input hand image to obtain a plurality of key area images, each containing a hand key point, corresponding to the hand image; each image recognition training sample comprises an input hand image and the labeling information of the plurality of key area images corresponding to that hand image.

According to the technical scheme provided by the embodiment of the disclosure, a real hand sample image is firstly segmented to obtain a plurality of key area images in the real hand sample image, and then the plurality of key area images in the real hand sample image are input into a second hand key point detection model to be trained, so that the second hand key point detection model only needs to detect a plurality of key area images containing hand key points in the real hand sample image, and does not need to detect invalid or redundant images in the whole hand area image, thereby simplifying the image detection range of the hand key points and further improving the detection efficiency of the hand key points; meanwhile, the interference of invalid or redundant images in the whole hand region image is reduced, and the detection accuracy of the hand key points is improved.

In an exemplary embodiment, the training method of the hand keypoint detection model provided by the present disclosure further includes a training step of the first hand keypoint detection model, which specifically includes the following steps: respectively inputting the real hand sample image and the virtual hand sample image into a first hand key point detection model to be trained to obtain two-dimensional position prediction information of hand key points in the real hand sample image and three-dimensional position prediction information of the hand key points in the virtual hand sample image; obtaining a first sub-loss value according to a difference value between the two-dimensional position prediction information and the two-dimensional position marking information of the hand key points in the real hand sample image, obtaining a second sub-loss value according to a difference value between the three-dimensional position prediction information and the three-dimensional position marking information of the hand key points in the virtual hand sample image, and obtaining a total loss value of the first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value; and adjusting the model parameters of the first hand key point detection model to be trained according to the total loss value to obtain the pre-trained first hand key point detection model.

The first sub-loss value is used for measuring the prediction accuracy of the first hand key point detection model on the two-dimensional position information of the hand key points in the real hand sample image; the second sub-loss value is used for measuring the prediction accuracy of the first hand key point detection model on the three-dimensional position information of the hand key points in the virtual hand sample image; the total loss value is used for measuring the detection accuracy of the first hand key point detection model on the hand key points.

Specifically, the terminal inputs a real hand sample image into a first hand key point detection model to be trained, extracts image features of the real hand sample image through the first hand key point detection model to be trained, and performs neural network processing (such as convolution processing, pooling processing, full-connection processing and the like) on the image features of the real hand sample image to obtain two-dimensional position prediction information of hand key points in the real hand sample image; similarly, inputting the virtual hand sample image into a first hand key point detection model to be trained to obtain three-dimensional position prediction information of the hand key points in the virtual hand sample image; obtaining a difference value between two-dimensional position prediction information and two-dimensional position marking information of a hand key point in a real hand sample image, and obtaining a first sub-loss value by combining a loss function (for example, combining a minimum mean square error loss function); obtaining a difference value between three-dimensional position prediction information and three-dimensional position marking information of a hand key point in a virtual hand sample image, and obtaining a second sub-loss value by combining a loss function (for example, combining a minimum mean square error loss function); and adding the first sub-loss value and the second sub-loss value, or acquiring the weighted sum of the first sub-loss value and the second sub-loss value to obtain the total loss value of the first hand key point detection model to be trained.
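The combination of the two sub-loss values can be sketched as below. The weights `w_real` and `w_virtual` are hypothetical parameters: setting both to 1.0 gives the plain sum, and other values give the weighted sum mentioned above. The averaging convention inside `mse` is likewise an assumption.

```python
from typing import Sequence


def mse(pred: Sequence[Sequence[float]], label: Sequence[Sequence[float]]) -> float:
    """Mean squared error over all coordinates of all key points."""
    squared = [(a - b) ** 2 for p, q in zip(pred, label) for a, b in zip(p, q)]
    return sum(squared) / len(squared)


def total_loss(
    pred_2d, label_2d, pred_3d, label_3d,
    w_real: float = 1.0, w_virtual: float = 1.0,
) -> float:
    """Weighted sum of the 2D (real) and 3D (virtual) sub-loss values."""
    first_sub_loss = mse(pred_2d, label_2d)    # real images, 2D marking information
    second_sub_loss = mse(pred_3d, label_3d)   # virtual images, 3D marking information
    return w_real * first_sub_loss + w_virtual * second_sub_loss


# First sub-loss: (1 + 1) / 2 = 1.0; second sub-loss: 9 / 3 = 3.0.
loss = total_loss(
    [(0.0, 0.0)], [(1.0, 1.0)],
    [(0.0, 0.0, 0.0)], [(0.0, 0.0, 3.0)],
    w_virtual=0.5,
)
```

Using separate weights is one way to balance the influence of the real hand sample set against that of the virtual hand sample set during training.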

Then, the terminal determines a model parameter updating gradient of the first hand key point detection model to be trained according to the total loss value; adjusts the model parameters of the first hand key point detection model to be trained by back propagation according to the model parameter updating gradient to obtain a first hand key point detection model with adjusted model parameters; and retrains the first hand key point detection model with adjusted model parameters, for example by using it as the first hand key point detection model to be trained and repeating the above training process to continuously adjust its model parameters, until the loss value of the first hand key point detection model is smaller than a first preset threshold value, at which point the resulting model is taken as the pre-trained first hand key point detection model.

According to the technical scheme provided by the embodiment of the disclosure, the first hand key point detection model is trained for multiple times, so that the three-dimensional position information of the hand key points in the real hand sample image can be output according to the trained first hand key point detection model, and the three-dimensional position information of the hand key points is not required to be manually marked, so that the marking process of the three-dimensional position information of the hand key points in the real hand sample image is simplified, the acquisition efficiency of the three-dimensional position information of the hand key points in the real hand sample image is further improved, and the problem that the three-dimensional position marking information of the real hand key points is difficult to acquire is solved; meanwhile, the second hand key point detection model is trained by utilizing the three-dimensional position information of the hand key points in the real hand sample image output by the first hand key point detection model, so that the capability of adapting the first hand key point detection model domain is favorably and indirectly transferred to the second hand key point detection model, the second hand key point detection model obtained by training can achieve a better three-dimensional position information prediction effect of the hand key points in the real hand sample image under the condition that only the three-dimensional position marking information of the hand key points in the virtual hand sample image and the two-dimensional position marking information of the hand key points in the real hand sample image exist, and the real-time three-dimensional position information prediction of the hand key points can be realized in the mobile terminal.

In an exemplary embodiment, the training method of the hand keypoint detection model provided by the present disclosure further includes a step of constructing a virtual hand sample set, which specifically includes the following steps: acquiring a plurality of different hand models and background images corresponding to the hand models; respectively performing three-dimensional rendering processing on each hand model on the background image corresponding to each hand model to obtain a virtual hand sample image corresponding to each hand model and three-dimensional position marking information of a hand key point in each virtual hand sample image; the virtual hand sample image corresponding to each hand model comprises a background image of the corresponding hand model; and constructing a virtual hand sample set according to the virtual hand sample images and the three-dimensional position marking information of the hand key points in the virtual hand sample images.

The hand model refers to a three-dimensional model of a hand, and specifically refers to a hand model drawn by a designer.

For example, the terminal acquires a plurality of different hand models drawn by a designer, and configures a background image for each hand model to obtain a background image corresponding to each hand model; performing three-dimensional rendering processing on each hand model on the background image corresponding to each hand model respectively by using a three-dimensional rendering processing technology to obtain a virtual hand sample image corresponding to each hand model; the method comprises the steps that when virtual hand sample images corresponding to all hand models are generated through a three-dimensional rendering technology, three-dimensional position marking information of hand key points in the virtual hand sample images corresponding to all hand models can be generated; and constructing a virtual hand sample set according to the virtual hand sample images and the three-dimensional position marking information of the hand key points in the virtual hand sample images.
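The construction of the virtual hand sample set can be outlined as follows. This is only a structural sketch: `render` is a hypothetical callback standing in for the three-dimensional rendering technology, images are represented by string handles, and the joint positions of the hand model are copied unchanged as marking information, whereas a real renderer would typically express them in its own camera coordinates.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class HandModel:
    name: str
    joints_3d: List[Point3D]  # key point positions of the designer-drawn model


@dataclass
class VirtualSample:
    image: str                    # hypothetical handle to the rendered image
    keypoints_3d: List[Point3D]   # 3D position marking information


def build_virtual_set(
    models: List[HandModel],
    backgrounds: List[str],
    render: Callable[[HandModel, str], str],
) -> List[VirtualSample]:
    """Render each hand model on its own background and keep its 3D labels."""
    if len(models) != len(backgrounds):
        raise ValueError("each hand model needs exactly one background image")
    samples = []
    for model, background in zip(models, backgrounds):
        image = render(model, background)
        # The marking information comes from the model geometry itself,
        # so no manual annotation is involved.
        samples.append(VirtualSample(image, list(model.joints_3d)))
    return samples


def fake_render(model: HandModel, background: str) -> str:
    """Hypothetical renderer that only records what would be drawn."""
    return f"{model.name}@{background}"


virtual_set = build_virtual_set(
    [HandModel("fist", [(0.0, 0.0, 1.0)]), HandModel("open", [(0.1, 0.2, 1.5)])],
    ["kitchen", "street"],
    fake_render,
)
```

Varying both the hand models and the backgrounds is what enriches the types of virtual hand sample images in the resulting set.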

According to the technical scheme provided by the embodiment of the disclosure, virtual hand sample images corresponding to different hand models and three-dimensional position marking information of hand key points in the virtual hand sample images are generated on background images corresponding to different hand models through a three-dimensional rendering technology; the virtual hand sample set is constructed according to the virtual hand sample images and the three-dimensional position marking information of the hand key points in the virtual hand sample images, so that the types of the virtual hand sample images in the virtual hand sample set can be enriched; meanwhile, the method is beneficial to training the first hand key point detection model to be trained subsequently according to the virtual hand sample set and the real hand sample set, so that the trained first hand key point detection model can accurately predict the three-dimensional position information of the hand key point.

In an exemplary embodiment, the training method of the hand keypoint detection model provided by the present disclosure further includes an updating step of the second hand keypoint detection model, which specifically includes the following steps: acquiring a real hand test image, inputting the real hand test image into a trained second hand key point detection model, and obtaining three-dimensional position prediction information of hand key points in the real hand test image; the real hand test image carries three-dimensional position marking information of corresponding hand key points; screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image, wherein the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position marking information is larger than a preset difference value; according to the target hand test image and the three-dimensional position marking information of the hand key points in the target hand test image, retraining the trained second hand key point detection model to obtain a retrained second hand key point detection model; and updating the trained second hand key point detection model into a retrained second hand key point detection model.

The real hand test image is a re-collected real hand image (not the real hand sample image) and is used for testing the trained second hand key point detection model; the real hand test image carries three-dimensional position marking information of corresponding hand key points, and the three-dimensional position marking information can be obtained through prediction of a pre-trained first hand key point detection model.

The difference value between the three-dimensional position prediction information of the hand key point of the target hand test image and the corresponding three-dimensional position marking information is larger than a preset difference value, which indicates that the hand key point of the target hand test image is detected wrongly by the second hand key point detection model obtained through training.

Specifically, the terminal collects real hand test images and inputs them into the trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key points in each real hand test image; compares the three-dimensional position prediction information of the hand key points in each real hand test image with the corresponding three-dimensional position marking information to obtain a comparison result; according to the comparison result, screens out, from the collected real hand test images, those for which the difference value between the three-dimensional position prediction information of the hand key points and the corresponding three-dimensional position marking information is larger than a preset difference value, as target hand test images; and retrains the trained second hand key point detection model according to the target hand test images and the three-dimensional position marking information of the hand key points in the target hand test images until the difference value between the three-dimensional position prediction information output by the retrained second hand key point detection model for the hand key points in the target hand test images and the corresponding three-dimensional position marking information is less than or equal to the preset difference value, and then updates the trained second hand key point detection model to the retrained second hand key point detection model.
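The screening of target hand test images can be sketched as follows. The disclosure does not define how the difference value is measured, so this sketch assumes the largest Euclidean distance between any predicted key point and its marking information; `select_target_images`, `dummy_student`, and the string image identifiers are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]
Sample = Tuple[str, List[Point3D]]  # (image id, 3D position marking information)


def max_keypoint_error(pred: List[Point3D], label: List[Point3D]) -> float:
    """Largest Euclidean distance between a predicted and a marked key point."""
    return max(math.dist(p, q) for p, q in zip(pred, label))


def select_target_images(
    samples: List[Sample],
    student_predict: Callable[[str], List[Point3D]],
    preset_difference: float,
) -> List[Sample]:
    """Keep the test images on which the student's error exceeds the preset value."""
    targets = []
    for image, label in samples:
        if max_keypoint_error(student_predict(image), label) > preset_difference:
            targets.append((image, label))
    return targets


def dummy_student(image: str) -> List[Point3D]:
    """Hypothetical student that always predicts one key point at the origin."""
    return [(0.0, 0.0, 0.0)]


test_set = [("easy", [(0.0, 0.0, 0.1)]), ("hard", [(0.0, 0.0, 2.0)])]
targets = select_target_images(test_set, dummy_student, preset_difference=0.5)
```

The images in `targets` would then be used, together with their marking information, to retrain the second hand key point detection model.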

According to the technical scheme provided by the embodiment of the disclosure, the target hand test images on which the trained second hand key point detection model detects incorrectly are used to retrain that model, so that the detection accuracy of the second hand key point detection model on the hand key points is improved; meanwhile, the generalization capability of the second hand key point detection model is improved, the defect that overfitting of the second hand key point detection model leads to low prediction accuracy of the three-dimensional position information of the hand key points is avoided, and the detection accuracy of the second hand key point detection model on the hand key points is further improved.
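The screening step described above can be sketched as follows. This is a minimal illustration only: the disclosure does not specify how the "difference value" is computed, so the largest per-key-point Euclidean distance is assumed here, and the names `keypoint_error`, `select_target_images` and `preset_difference` are hypothetical.

```python
import math

def keypoint_error(pred, label):
    """Largest Euclidean distance between predicted and labelled 3D key points.

    Assumption: the disclosure only speaks of a "difference value"; the
    maximum per-key-point distance is one possible choice.
    """
    return max(math.dist(p, q) for p, q in zip(pred, label))

def select_target_images(predictions, labels, preset_difference):
    """Return indices of test images whose error exceeds the preset difference."""
    return [
        i
        for i, (pred, label) in enumerate(zip(predictions, labels))
        if keypoint_error(pred, label) > preset_difference
    ]

# Two toy test images with two key points each.
labels = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
]
predictions = [
    [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)],  # small error, kept out of retraining
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],  # large error on the second key point
]
targets = select_target_images(predictions, labels, preset_difference=0.5)
```

Only the images returned by `select_target_images` would then be fed back into retraining, together with their three-dimensional position marking information.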

Fig. 5 is a flowchart illustrating another training method for a hand key point detection model according to an exemplary embodiment, where as shown in fig. 5, the training method for a hand key point detection model is used in the terminal shown in fig. 1, and includes the following steps:

In step S501, a real hand sample set and a virtual hand sample set are acquired, the real hand sample set includes a real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set includes a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image.

In step S502, the real hand sample image and the virtual hand sample image are respectively input into the first hand keypoint detection model to be trained, so as to obtain two-dimensional position prediction information of the hand keypoints in the real hand sample image and three-dimensional position prediction information of the hand keypoints in the virtual hand sample image.

In step S503, a first sub-loss value is obtained according to a difference between the two-dimensional position prediction information and the two-dimensional position labeling information of the hand key point in the real hand sample image, and a second sub-loss value is obtained according to a difference between the three-dimensional position prediction information and the three-dimensional position labeling information of the hand key point in the virtual hand sample image.

In step S504, a total loss value of the first hand keypoint detection model to be trained is obtained according to the first sub loss value and the second sub loss value.

In step S505, the model parameters of the first hand keypoint detection model to be trained are adjusted according to the total loss value, so as to obtain a pre-trained first hand keypoint detection model.

In step S506, the real hand sample image is input into the first hand key point detection model trained in advance, and the three-dimensional position labeling information of the hand key points in the real hand sample image is obtained.

In step S507, the real hand sample image is input into the second hand keypoint detection model to be trained, and three-dimensional position prediction information of the hand keypoints in the real hand sample image is obtained.

In step S508, a loss value of the second hand keypoint detection model to be trained is determined according to a difference between the three-dimensional position prediction information and the three-dimensional position labeling information of the hand keypoints in the real hand sample image.

In step S509, the model parameters of the second hand key point detection model to be trained are adjusted according to the loss value, so as to obtain the trained second hand key point detection model.

According to the technical scheme provided by the embodiment of the disclosure, the second hand key point detection model is trained according to the real hand sample image and the three-dimensional position marking information of the hand key points in the real hand sample image; this avoids the defect that a hand key point detection model trained only on virtual hand data detects hand key points with low accuracy, and therefore improves the detection accuracy of the trained second hand key point detection model on the hand key points.
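Steps S503 and S504 can be illustrated with a short sketch. The disclosure states only that the total loss is obtained "according to" the two sub-loss values; the mean squared error and the weighted sum with weights `w_2d` and `w_3d` used below are assumptions, and all function names are hypothetical.

```python
def mean_squared_error(pred, label):
    """Mean squared coordinate difference over all key points."""
    diffs = [
        (p - q) ** 2
        for pred_pt, label_pt in zip(pred, label)
        for p, q in zip(pred_pt, label_pt)
    ]
    return sum(diffs) / len(diffs)

def teacher_total_loss(pred_2d, label_2d, pred_3d, label_3d, w_2d=1.0, w_3d=1.0):
    """Combine the 2D loss on real samples with the 3D loss on virtual samples."""
    first_sub_loss = mean_squared_error(pred_2d, label_2d)   # real hand, 2D labels
    second_sub_loss = mean_squared_error(pred_3d, label_3d)  # virtual hand, 3D labels
    # Assumption: a weighted sum; the source does not fix the combination rule.
    return w_2d * first_sub_loss + w_3d * second_sub_loss

real_pred_2d = [(0.0, 0.0), (1.0, 1.0)]
real_label_2d = [(0.0, 0.0), (1.0, 3.0)]
virtual_pred_3d = [(0.0, 0.0, 0.0)]
virtual_label_3d = [(0.0, 0.0, 3.0)]
total = teacher_total_loss(real_pred_2d, real_label_2d,
                           virtual_pred_3d, virtual_label_3d)
```

In this toy case the first sub-loss is 1.0 and the second is 3.0, so the total loss with unit weights is 4.0.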

In an exemplary embodiment, as shown in fig. 6, the present disclosure also provides an application scenario applying the above-mentioned training method of the hand keypoint detection model. Specifically, the application of the training method of the hand key point detection model in the application scene is as follows:

The terminal uses a three-dimensional rendering technology to generate RGB pictures containing a virtual human hand together with three-dimensional coordinate information of the corresponding hand key points, and takes them as virtual data. Meanwhile, two-dimensional coordinate information of the hand key points is marked in RGB pictures containing a real human hand acquired by a monocular camera, and these pictures together with the two-dimensional coordinate information of the corresponding hand key points are taken as real data. The real data and the virtual data are then combined as training data for a large neural network model with a complex structure and a large calculation amount (such as the first hand key point detection model), and the large neural network model is trained on this training data to obtain a trained large neural network model. The trained large neural network model then predicts, through forward propagation, on the training data containing a real human hand, and the three-dimensional coordinate information of the hand key points that it predicts is taken as three-dimensional coordinate marking information of the real hand key points. Next, an image containing a real hand is input as a training sample into a pre-established small neural network model (such as the second hand key point detection model), and the three-dimensional coordinate information predicted by the large neural network model is used as the target output of the small neural network model in order to train it. A minimum mean square error loss function is constructed from the forward propagation prediction result of the small neural network model and the prediction result of the large neural network model, and the network parameters of the small neural network model are continuously updated through back propagation so that the effect of the small neural network model approaches that of the large neural network model.
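The teacher-student training loop described above can be sketched with a deliberately tiny example. Everything here is a stand-in: `teacher_predict` is a hypothetical fixed function in place of the trained large model, the student is a linear map from one scalar feature to a single 3D key point rather than a convolutional network, and plain gradient descent replaces whatever optimizer an actual implementation would use.

```python
def teacher_predict(x):
    """Hypothetical stand-in for the trained large model: feature -> 3D point."""
    return [2.0 * x + 1.0, -x, 0.5 * x - 3.0]

def student_predict(params, x):
    """Linear student: one (weight, bias) pair per output coordinate."""
    return [w * x + b for w, b in params]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def train_student(xs, steps=200, lr=0.5):
    params = [[0.0, 0.0] for _ in range(3)]
    # The teacher's predictions serve as 3D labels for the real samples.
    targets = [teacher_predict(x) for x in xs]
    history = []
    n = len(xs)
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(3)]
        total = 0.0
        for x, t in zip(xs, targets):
            pred = student_predict(params, x)      # forward propagation
            total += mse(pred, t)
            for k in range(3):                     # gradient of the MSE loss
                err = 2.0 * (pred[k] - t[k]) / 3.0
                grads[k][0] += err * x
                grads[k][1] += err
        history.append(total / n)
        for k in range(3):                         # parameter update
            params[k][0] -= lr * grads[k][0] / n
            params[k][1] -= lr * grads[k][1] / n
    return params, history

params, history = train_student([-1.0, -0.5, 0.0, 0.5, 1.0])
```

After training, the student's parameters approach those of the teacher function and the recorded loss in `history` falls towards zero, which mirrors the stated goal of bringing the small model's output close to the large model's output.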

According to the technical scheme, the strong transfer capability of the large neural network model is used to address the domain adaptation problem between virtual data and real data, and the trained large neural network model labels the real hand data in three dimensions, which addresses the difficulty of obtaining three-dimensional labels for real human hand key points. The small neural network model is then trained on the labelled real hand data, so that the domain adaptation capability is indirectly passed on to it; as a result, even when only virtual three-dimensional marking data and real two-dimensional marking data are available, the small neural network model can achieve a good three-dimensional coordinate prediction effect on real data, and the three-dimensional coordinate information of human hand key points can be predicted in real time on a mobile terminal.

In an exemplary embodiment, the present disclosure further provides a method for detecting a hand key point, which specifically includes the following steps: collecting hand images; inputting the hand image into a trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key points in the hand image; the trained second hand keypoint detection model is obtained by training according to the training method of the hand keypoint detection model of any embodiment of the disclosure.

The hand image may be a real hand image, such as an RGB image including a human hand, or a virtual hand image, such as an RGB image including a virtual human hand.

For example, the terminal obtains an RGB picture including a human hand captured by a monocular camera (e.g., a mobile phone camera), and inputs the RGB picture including the human hand into a trained second hand key point detection model, so as to obtain three-dimensional position prediction information of a hand key point in the RGB picture including the human hand. It should be noted that the above technical solution can also be applied to a server, and is not described in detail herein.

It should be noted that, through the trained second hand keypoint detection model, three-dimensional position prediction information of the hand keypoints in the real hand image may be output, and three-dimensional position prediction information of the hand keypoints in the virtual hand image may also be output, which is not limited in the present disclosure.

Further, the terminal can connect all the hand key points according to the three-dimensional position prediction information of the hand key points and the hand skeleton structure to obtain a hand key point connection diagram; and inquiring the corresponding relation between the hand key point connection diagram and the gesture information to obtain the gesture information corresponding to the hand key point connection diagram as the gesture information corresponding to the human hand in the hand image.

Of course, the terminal can also determine the finger orientation in the hand image according to the three-dimensional position prediction information of the hand key points; in addition, in an AR/VR application scene, the terminal can also calculate the rotation angle of each hand key point according to the three-dimensional position prediction information of the hand key points, and control the corresponding hand model to move according to the rotation angle of each hand key point, so that the hand model imitates whatever action the human hand performs.
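One way to derive angles from predicted three-dimensional key points is sketched below. The disclosure does not define how the "rotation angle" is computed; this sketch assumes it can be approximated by the bending angle between consecutive bones along one finger, and the function names and the four-point finger chain are hypothetical.

```python
import math

def bone(a, b):
    """Vector from key point a to key point b."""
    return tuple(q - p for p, q in zip(a, b))

def angle_between(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def finger_bend_angles(chain):
    """Bending angle at each inner joint of an ordered chain of key points."""
    bones = [bone(a, b) for a, b in zip(chain, chain[1:])]
    return [angle_between(u, v) for u, v in zip(bones, bones[1:])]

# A straight finger and a finger bent by a right angle at both inner joints.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
straight_angles = finger_bend_angles(straight)
bent_angles = finger_bend_angles(bent)
```

The resulting angles could be applied to the matching joints of a rigged hand model; coincident key points (zero-length bones) are not handled in this sketch.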

According to the method for detecting the hand key points, external special hardware equipment such as a multi-view camera system or a depth camera system is not needed, and the three-dimensional position information of the hand key points in the hand image can be accurately predicted in real time using only the hand image acquired by a monocular camera, which lowers the barrier to predicting the three-dimensional position information of the hand key points; meanwhile, the three-dimensional position information of the hand key points in the hand image can be accurately predicted through the hand key point detection model obtained by training on real data, the defect that a hand key point detection model trained on virtual hand data detects hand key points with low accuracy is overcome, and the detection accuracy of the hand key points is further improved.

In an exemplary embodiment, inputting the hand image into the trained second hand keypoint detection model to obtain three-dimensional position prediction information of the hand keypoints in the hand image includes: carrying out image enhancement processing on the hand image to obtain the hand image after the image enhancement processing; and inputting the hand image subjected to the image enhancement processing into a trained second hand key point detection model, and performing convolution processing and full-connection processing on the hand image subjected to the image enhancement processing through the trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key points in the hand image.

For example, the terminal performs image enhancement processing on the hand image, such as image denoising processing, image resolution enhancement processing, image contrast enhancement processing, and the like, to obtain the hand image after the image enhancement processing; the terminal then inputs the hand image subjected to the image enhancement processing into the trained second hand key point detection model, extracts image features from it through the trained second hand key point detection model, and performs neural network processing (such as convolution processing, pooling processing, full-connection processing and the like) on these image features to obtain three-dimensional position prediction information of the hand key points in the hand image.

According to the technical scheme provided by the embodiment of the disclosure, the image enhancement processing is performed on the hand image, and then the hand image after the image enhancement processing is input into the trained second hand key point detection model, so that the accuracy of the three-dimensional position prediction information of the hand key points in the hand image output by the trained second hand key point detection model is favorably improved.
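As an illustration of the image contrast enhancement mentioned above, the following sketch applies a min-max contrast stretch to a small grayscale image held as nested lists. The disclosure does not name a specific enhancement algorithm, so this choice and the function name `stretch_contrast` are assumptions; a real pipeline would normally operate on RGB arrays with an image library.

```python
def stretch_contrast(image, low=0, high=255):
    """Linearly rescale pixel values so they span the range [low, high]."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in image]
    scale = (high - low) / (hi - lo)
    return [[round(low + (v - lo) * scale) for v in row] for row in image]

# A low-contrast 2x2 grayscale patch.
image = [[100, 110], [120, 150]]
enhanced = stretch_contrast(image)
```

The enhanced image, rather than the raw one, would then be passed to the trained second hand key point detection model.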

It should be understood that although the steps in the flowcharts of fig. 2, 4 and 5 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in the sequence indicated by the arrows. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in fig. 2, 4, and 5 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

FIG. 7 is a block diagram illustrating a training apparatus for a hand key point detection model in accordance with an exemplary embodiment. Referring to fig. 7, the apparatus includes an information obtaining unit 710 and a first training unit 720.

An information obtaining unit 710 configured to perform obtaining three-dimensional position labeling information of a hand key point in a real hand sample image; the three-dimensional position marking information is obtained by identifying a real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises a real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of the hand key points in the virtual hand sample image.

A first training unit 720, configured to perform training on a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, so as to obtain a trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the second hand key point detection model is a student model.

In an exemplary embodiment, the first training unit 720 is further configured to perform inputting the real hand sample image into the second hand keypoint detection model to be trained, and obtain three-dimensional position prediction information of the hand keypoints in the real hand sample image; determining a loss value of a second hand key point detection model to be trained according to a difference value between the three-dimensional position prediction information and the three-dimensional position marking information; and adjusting the model parameters of the second hand key point detection model to be trained according to the loss value to obtain the trained second hand key point detection model.

In an exemplary embodiment, the first training unit 720 is further configured to perform a segmentation process on the real hand sample image, resulting in a plurality of key area images in the real hand sample image; the key area image comprises a hand key point; inputting a plurality of key area images in the real hand sample image into a second hand key point detection model to be trained, and performing convolution processing and full connection processing on the plurality of key area images through the second hand key point detection model to be trained respectively to obtain three-dimensional position prediction information of the hand key points in the real hand sample image.

In an exemplary embodiment, the training device for the hand key point detection model further includes a second training unit configured to perform inputting the real hand sample image and the virtual hand sample image into the first hand key point detection model to be trained, respectively, and obtain two-dimensional position prediction information of the hand key points in the real hand sample image and three-dimensional position prediction information of the hand key points in the virtual hand sample image; obtaining a first sub-loss value according to a difference value between two-dimensional position prediction information and two-dimensional position marking information of a hand key point in a real hand sample image, obtaining a second sub-loss value according to a difference value between three-dimensional position prediction information and three-dimensional position marking information of the hand key point in a virtual hand sample image, and obtaining a total loss value of a first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value; and adjusting the model parameters of the first hand key point detection model to be trained according to the total loss value to obtain the pre-trained first hand key point detection model.

In an exemplary embodiment, the training device for the hand keypoint detection model further includes a sample set acquisition unit configured to perform acquiring a plurality of different hand models and background images corresponding to the respective hand models; respectively performing three-dimensional rendering processing on each hand model on the background image corresponding to each hand model to obtain a virtual hand sample image corresponding to each hand model and three-dimensional position marking information of a hand key point in each virtual hand sample image; the virtual hand sample image corresponding to each hand model comprises a background image of the corresponding hand model; and constructing a virtual hand sample set according to the virtual hand sample images and the three-dimensional position marking information of the hand key points in the virtual hand sample images.

In an exemplary embodiment, the training device for the hand key point detection model further includes a model updating unit configured to perform acquiring a real hand test image, and input the real hand test image into the trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key points in the real hand test image; the real hand test image carries three-dimensional position marking information of corresponding hand key points; screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image, wherein the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position marking information is larger than a preset difference value; according to the target hand test image and the three-dimensional position marking information of the hand key points in the target hand test image, retraining the trained second hand key point detection model to obtain a retrained second hand key point detection model; and updating the trained second hand key point detection model into a retrained second hand key point detection model.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

FIG. 8 is a block diagram illustrating an apparatus for detecting a hand keypoint, according to an exemplary embodiment. Referring to fig. 8, the apparatus includes a hand image acquisition unit 810 and a position information determination unit 820.

A hand image acquisition unit 810 configured to perform acquiring a hand image.

A position information determination unit 820 configured to perform input of a hand image into a trained second hand key point detection model, resulting in three-dimensional position prediction information of a hand key point in the hand image; the trained second hand keypoint detection model is obtained by training according to the training method of the hand keypoint detection model in any embodiment of the disclosure.

Fig. 9 is a block diagram of an electronic device 900 for performing the above-described training method of the hand key point detection model or the detection method of the hand key points, according to an exemplary embodiment. For example, the electronic device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so forth.

Referring to fig. 9, electronic device 900 may include one or more of the following components: processing component 902, memory 904, power supply component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.

The processing component 902 generally controls overall operation of the electronic device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.

The memory 904 is configured to store various types of data to support operation at the electronic device 900. Examples of such data include instructions for any application or method operating on the electronic device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 906 provides power to the various components of the electronic device 900. The power supply component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 900.

The multimedia component 908 includes a screen that provides an output interface between the electronic device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals. I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 914 includes one or more sensors for providing status evaluations of various aspects of the electronic device 900. For example, the sensor component 914 may detect an open/closed state of the electronic device 900 and the relative positioning of components, such as the display and keypad of the electronic device 900. The sensor component 914 may also detect a change in the position of the electronic device 900 or of a component of the electronic device 900, the presence or absence of user contact with the electronic device 900, the orientation or acceleration/deceleration of the electronic device 900, and a change in the temperature of the electronic device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 916 is configured to facilitate wired or wireless communication between the electronic device 900 and other devices. The electronic device 900 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described training method of the hand key point detection model or detection method of the hand key points.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the electronic device 900 to perform the above method of training a hand keypoint detection model or method of detecting hand keypoints is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, there is also provided a computer program product, which includes a computer program stored in a readable storage medium, from which at least one processor of an electronic device reads and executes the computer program, so that the electronic device performs the training method of the hand keypoint detection model or the detection method of the hand keypoint as described in any embodiment of the present disclosure.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A training method of a hand key point detection model is characterized by comprising the following steps:

2. The method for training a hand key point detection model according to claim 1, wherein the training a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image to obtain the trained second hand key point detection model comprises:

3. The method for training a hand key point detection model according to claim 2, wherein the step of inputting the real hand sample image into a second hand key point detection model to be trained to obtain three-dimensional position prediction information of a hand key point in the real hand sample image comprises:

4. A method for training a hand keypoint detection model according to claim 1, wherein said pre-trained first hand keypoint detection model is trained by:

5. A method for training a hand keypoint detection model according to any one of claims 1 to 4, further comprising:

6. A method for detecting key points of a hand, comprising:

collecting hand images;

inputting the hand image into a trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key points in the hand image; the trained second hand key point detection model is trained according to the training method of the hand key point detection model according to any one of claims 1 to 5.

7. A training device for a hand key point detection model, comprising:

8. A hand key point detection device, comprising:

a position information determining unit configured to perform inputting the hand image into a trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key point in the hand image; the trained second hand key point detection model is trained according to the training method of the hand key point detection model according to any one of claims 1 to 5.

9. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 6.

10. A storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-6.