CN112115894B - Training method and device of hand key point detection model and electronic equipment - Google Patents

Training method and device of hand key point detection model and electronic equipment

Info

Publication number
CN112115894B
CN112115894B (application CN202011016465.8A)
Authority
CN
China
Prior art keywords
hand
detection model
trained
real
dimensional position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011016465.8A
Other languages
Chinese (zh)
Other versions
CN112115894A (en)
Inventor
董亚娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011016465.8A
Publication of CN112115894A
Application granted
Publication of CN112115894B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a training method, a training device, an electronic device and a storage medium for a hand key point detection model, wherein the method comprises the following steps: acquiring three-dimensional position labeling information of hand key points in a real hand sample image; the three-dimensional position labeling information is obtained by identifying the real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training on a real hand sample set and a virtual hand sample set; training a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image to obtain a trained second hand key point detection model; the first hand key point detection model is a teacher model, and the second hand key point detection model is a student model. By adopting the method, the detection accuracy of the hand key points by the second hand key point detection model is improved.

Description

Training method and device of hand key point detection model and electronic equipment
Technical Field
The disclosure relates to the technical field of hand key point recognition, in particular to a training method and device of a hand key point detection model, electronic equipment and a storage medium.
Background
In hand key point detection technology, the three-dimensional position information of hand key points in a real hand image is generally identified by a trained hand key point detection model carried in a mobile terminal, which is an important component of interaction in mixed reality scenes such as AR (Augmented Reality), VR (Virtual Reality) and the like.
In the related art, because the three-dimensional position information of real hand key points is very difficult to annotate, the hand key point detection model is generally trained through virtual hand sample images generated by a computer and the three-dimensional position information of the hand key points in the virtual hand sample images, and the trained hand key point detection model is carried in a mobile terminal so as to identify the hand key points in photographed hand images; however, virtual hand data, namely the computer-generated virtual hand sample images and the three-dimensional position information of the hand key points in them, still differ to some extent from real hand data and cannot pass for real data, so the detection accuracy of the hand key point detection model obtained by such training is low.
Disclosure of Invention
The disclosure provides a training method and device for a hand key point detection model, an electronic device and a storage medium, so as to at least solve the problem in the related art that the hand key point detection model obtained through training has low detection accuracy for hand key points. The technical scheme of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a training method of a hand keypoint detection model, including:
acquiring three-dimensional position labeling information of hand key points in a real hand sample image; the three-dimensional position labeling information is obtained by identifying the real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises the real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image;
training a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image to obtain a trained second hand key point detection model; the first hand key point detection model trained in advance is a teacher model, and the second hand key point detection model is a student model.
In an exemplary embodiment, the training of the second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image to obtain a trained second hand key point detection model includes:
inputting the real hand sample image into a second hand key point detection model to be trained, and obtaining three-dimensional position prediction information of hand key points in the real hand sample image;
determining a loss value of the second hand key point detection model to be trained according to the difference value between the three-dimensional position prediction information and the three-dimensional position labeling information;
and according to the loss value, adjusting the model parameters of the second hand key point detection model to be trained to obtain the trained second hand key point detection model.
In an exemplary embodiment, the inputting the real hand sample image into the second hand keypoint detection model to be trained, to obtain three-dimensional position prediction information of the hand keypoints in the real hand sample image includes:
dividing the real hand sample image to obtain a plurality of key area images in the real hand sample image; the key area image comprises a hand key point;
Inputting a plurality of key area images in the real hand sample image into a second hand key point detection model to be trained, and respectively carrying out convolution processing and full connection processing on the plurality of key area images through the second hand key point detection model to be trained to obtain three-dimensional position prediction information of hand key points in the real hand sample image.
In an exemplary embodiment, the pre-trained first hand keypoint detection model is trained by:
respectively inputting the real hand sample image and the virtual hand sample image into a first hand key point detection model to be trained to obtain two-dimensional position prediction information of hand key points in the real hand sample image and three-dimensional position prediction information of hand key points in the virtual hand sample image;
obtaining a first sub-loss value according to the difference value between the two-dimensional position prediction information of the hand key points in the real hand sample image and the two-dimensional position labeling information, obtaining a second sub-loss value according to the difference value between the three-dimensional position prediction information of the hand key points in the virtual hand sample image and the three-dimensional position labeling information, and obtaining a total loss value of the first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value;
And according to the total loss value, adjusting the model parameters of the first hand key point detection model to be trained to obtain the first hand key point detection model trained in advance.
In an exemplary embodiment, the virtual hand sample set is obtained by:
acquiring a plurality of different hand models and background images corresponding to the hand models;
respectively carrying out three-dimensional rendering processing on each hand model on the background image corresponding to each hand model to obtain virtual hand sample images corresponding to each hand model and three-dimensional position labeling information of hand key points in each virtual hand sample image; the virtual hand sample image corresponding to each hand model contains a background image of the corresponding hand model;
and constructing the virtual hand sample set according to the virtual hand sample images and the three-dimensional position labeling information of the hand key points in the virtual hand sample images.
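For illustration only, the construction of the virtual hand sample set described above might be sketched as follows in Python; the renderer callable, the file layout, and the JSON manifest are assumptions, since the disclosure does not name a rendering engine or a storage format.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List, Sequence, Tuple

@dataclass
class VirtualSample:
    image_path: str                    # rendered RGB image containing the virtual hand
    keypoints_3d: List[List[float]]    # per-keypoint (x, y, depth) labeling information

def build_virtual_sample_set(
    hand_models: Sequence,
    backgrounds: Sequence,
    renderer: Callable[[object, object], Tuple[str, List[List[float]]]],
) -> List[VirtualSample]:
    """Render each hand model onto its corresponding background image and keep
    the 3D keypoint labels produced alongside the rendered picture."""
    samples = []
    for model, background in zip(hand_models, backgrounds):
        image_path, keypoints_3d = renderer(model, background)  # hypothetical renderer
        samples.append(VirtualSample(image_path, keypoints_3d))
    return samples

def save_sample_set(samples: List[VirtualSample], manifest_path: str) -> None:
    with open(manifest_path, "w") as f:
        json.dump([asdict(s) for s in samples], f)
```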
In an exemplary embodiment, the method further comprises:
acquiring a real hand test image, inputting the real hand test image into the trained second hand key point detection model, and obtaining three-dimensional position prediction information of hand key points in the real hand test image; the real hand test image carries three-dimensional position labeling information of corresponding hand key points;
Screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image with the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position labeling information being larger than the preset difference value;
retraining the trained second hand key point detection model according to the target hand test image and the three-dimensional position marking information of the hand key points in the target hand test image to obtain a retrained second hand key point detection model;
and updating the trained second hand key point detection model into the retrained second hand key point detection model.
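The test-time screening and retraining described in this embodiment could look like the sketch below; the error measure (mean Euclidean distance over keypoints) and the PyTorch tensor shapes are assumptions.

```python
import torch

def select_hard_examples(model, test_images, test_labels, preset_difference):
    """Return the real hand test samples whose mean 3D keypoint error exceeds
    the preset difference value; these become the target hand test images."""
    hard_examples = []
    model.eval()
    with torch.no_grad():
        for image, label_3d in zip(test_images, test_labels):
            pred_3d = model(image.unsqueeze(0)).squeeze(0)           # (K, 3) prediction
            error = (pred_3d - label_3d).norm(dim=-1).mean().item()  # mean per-keypoint distance
            if error > preset_difference:
                hard_examples.append((image, label_3d))
    return hard_examples
```

The returned target samples can then be fed back through the same training routine used for the original student model, and the updated weights replace the previously trained second model.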
According to a second aspect of the embodiments of the present disclosure, there is provided a method for detecting a hand key point, including:
collecting hand images;
inputting the hand image into a trained second hand key point detection model to obtain three-dimensional position prediction information of hand key points in the hand image; the trained second hand keypoint detection model is obtained by training the hand keypoint detection model according to the training method of any one of the embodiments of the first aspect.
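A minimal usage sketch of the detection method of this aspect; the 224x224 input size, the preprocessing choice, and the single-image batch are assumptions.

```python
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # assumed input resolution of the student model
    transforms.ToTensor(),
])

def detect_hand_keypoints(student_model, image_path: str) -> torch.Tensor:
    """Run the trained second (student) model on a collected RGB hand image and
    return the predicted 3D positions of the hand key points, shape (K, 3)."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    student_model.eval()
    with torch.no_grad():
        return student_model(image).squeeze(0)
```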
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a hand keypoint detection model, including:
an information acquisition unit configured to perform acquisition of three-dimensional position annotation information of hand key points in a real hand sample image; the three-dimensional position labeling information is obtained by identifying the real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises the real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image;
the first training unit is configured to perform training on a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, so as to obtain a trained second hand key point detection model; the first hand key point detection model trained in advance is a teacher model, and the second hand key point detection model is a student model.
In an exemplary embodiment, the first training unit is further configured to perform inputting the real hand sample image into a second hand keypoint detection model to be trained, and obtain three-dimensional position prediction information of hand keypoints in the real hand sample image; determining a loss value of the second hand key point detection model to be trained according to the difference value between the three-dimensional position prediction information and the three-dimensional position labeling information; and according to the loss value, adjusting the model parameters of the second hand key point detection model to be trained to obtain the trained second hand key point detection model.
In an exemplary embodiment, the first training unit is further configured to perform segmentation processing on the real hand sample image, so as to obtain a plurality of key area images in the real hand sample image; the key area image comprises a hand key point; inputting a plurality of key area images in the real hand sample image into a second hand key point detection model to be trained, and respectively carrying out convolution processing and full connection processing on the plurality of key area images through the second hand key point detection model to be trained to obtain three-dimensional position prediction information of hand key points in the real hand sample image.
In an exemplary embodiment, the apparatus further includes a second training unit configured to perform inputting the real hand sample image and the virtual hand sample image into a first hand keypoint detection model to be trained, respectively, to obtain two-dimensional position prediction information of hand keypoints in the real hand sample image and three-dimensional position prediction information of hand keypoints in the virtual hand sample image; obtaining a first sub-loss value according to the difference value between the two-dimensional position prediction information of the hand key points in the real hand sample image and the two-dimensional position labeling information, obtaining a second sub-loss value according to the difference value between the three-dimensional position prediction information of the hand key points in the virtual hand sample image and the three-dimensional position labeling information, and obtaining a total loss value of the first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value; and according to the total loss value, adjusting the model parameters of the first hand key point detection model to be trained to obtain the first hand key point detection model trained in advance.
In an exemplary embodiment, the apparatus further comprises a sample set acquisition unit configured to perform acquiring a plurality of different hand models and corresponding background images of each of the hand models; respectively carrying out three-dimensional rendering processing on each hand model on the background image corresponding to each hand model to obtain virtual hand sample images corresponding to each hand model and three-dimensional position labeling information of hand key points in each virtual hand sample image; the virtual hand sample image corresponding to each hand model contains a background image of the corresponding hand model; and constructing the virtual hand sample set according to the virtual hand sample images and the three-dimensional position labeling information of the hand key points in the virtual hand sample images.
In an exemplary embodiment, the apparatus further includes a model updating unit further configured to perform acquisition of a real hand test image, input the real hand test image into the trained second hand keypoint detection model, and obtain three-dimensional position prediction information of hand keypoints in the real hand test image; the real hand test image carries three-dimensional position labeling information of corresponding hand key points; screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image with the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position labeling information being larger than the preset difference value; retraining the trained second hand key point detection model according to the target hand test image and the three-dimensional position marking information of the hand key points in the target hand test image to obtain a retrained second hand key point detection model; and updating the trained second hand key point detection model into the retrained second hand key point detection model.
According to a fourth aspect of embodiments of the present disclosure, there is provided a device for detecting a hand key point, including:
a hand image acquisition unit configured to perform acquisition of hand images;
a position information determining unit configured to perform inputting the hand image into a trained second hand keypoint detection model, and obtain three-dimensional position prediction information of hand keypoints in the hand image; the trained second hand keypoint detection model is obtained by training the hand keypoint detection model according to the training method of any one of the embodiments of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the training method of the hand keypoint detection model as described in any of the embodiments of the first aspect and the detection method of the hand keypoints as described in any of the embodiments of the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a storage medium comprising: the instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of training the hand keypoint detection model described in any one of the embodiments of the first aspect, and the method of detecting hand keypoints described in any one of the embodiments of the second aspect.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a device reads and executes the computer program, causing the device to perform the training method of the hand keypoint detection model described in any one of the embodiments of the first aspect, and the detection method of the hand keypoints described in any one of the embodiments of the second aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
A pre-trained first hand key point detection model is obtained by training on a real hand sample set, comprising a real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and a virtual hand sample set, comprising a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image. Three-dimensional position labeling information of hand key points in the real hand sample image is then obtained through the pre-trained first hand key point detection model. Finally, a second hand key point detection model to be trained is trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, to obtain a trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the second hand key point detection model is a student model. In this way, the second hand key point detection model is trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image. Because the pre-trained first hand key point detection model is obtained through training on both the real hand sample set and the virtual hand sample set, the three-dimensional position labeling information it outputs for the real hand sample image approaches real hand data; therefore, the second hand key point detection model trained on the real hand sample image and this labeling information has higher model precision, avoiding the defect of low hand key point detection accuracy in the trained second hand key point detection model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is an application environment diagram illustrating a training method of a hand keypoint detection model according to an exemplary embodiment.
FIG. 2 is a flow chart illustrating a method of training a hand keypoint detection model, according to an exemplary embodiment.
Fig. 3a is a schematic diagram showing an RGB image containing a real human hand, according to an example embodiment.
FIG. 3b is a schematic diagram illustrating two-dimensional position information of hand key points, according to an example embodiment.
Fig. 3c is a schematic diagram illustrating an RGB image containing a virtual human hand, according to an example embodiment.
FIG. 4 is a flowchart illustrating steps for training a second hand keypoint detection model to be trained, according to an exemplary embodiment.
FIG. 5 is a flowchart illustrating another method of training a hand keypoint detection model, according to an exemplary embodiment.
FIG. 6 is a flowchart illustrating a method of training a further hand keypoint detection model, according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating a training apparatus for a hand keypoint detection model, according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating a hand keypoint detection device, according to an exemplary embodiment.
Fig. 9 is an internal structural diagram of an electronic device, which is shown according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The training method of the hand key point detection model provided by the disclosure can be applied to an application environment as shown in fig. 1. Referring to fig. 1, the application environment diagram includes a terminal 110, and the terminal 110 is an electronic device with a hand key point detection function; the electronic device may be a smart phone, a tablet computer, a notebook computer, a personal computer, or the like. In fig. 1, the terminal 110 is taken as a smart phone for illustration. Referring to fig. 1, the terminal 110 obtains three-dimensional position labeling information of hand key points in a real hand sample image; the three-dimensional position labeling information is obtained by identifying the real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training on a real hand sample set and a virtual hand sample set; the real hand sample set comprises the real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image. The terminal then trains a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image to obtain a trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the second hand key point detection model is a student model. The trained second hand key point detection model can output three-dimensional position prediction information of hand key points in a real hand image or a virtual hand image, and this three-dimensional position prediction information can be used for gesture recognition, finger orientation prediction, finger joint rotation angle prediction and the like, which play an important role in mixed reality scenes such as AR (Augmented Reality), VR (Virtual Reality) and the like.
Fig. 2 is a flowchart illustrating a training method of a hand keypoint detection model according to an exemplary embodiment, and as shown in fig. 2, the training method of the hand keypoint detection model is used in the terminal shown in fig. 1, and includes the following steps:
in step S210, three-dimensional position labeling information of hand key points in a real hand sample image is obtained; the three-dimensional position labeling information is obtained by identifying a real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises a real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image.
The real hand sample image refers to an RGB image containing a real hand, as shown in fig. 3a, and may be obtained by shooting with a camera, such as a monocular camera; the hand key points refer to joint points of a human hand, and each joint point in the human hand has corresponding two-dimensional position information in an image, as shown in fig. 3 b; the three-dimensional position labeling information of the hand key points refers to adding depth information, namely the distance between each hand key point and a visual reference point (such as the position point of an imaging lens) based on two-dimensional position information containing the hand key points.
The real hand sample set comprises a plurality of real hand sample images and two-dimensional position labeling information of hand key points in each real hand sample image; the two-dimensional position labeling information of the hand key points in the real hand sample image is obtained through labeling, for example, manual labeling.
The virtual hand sample set comprises a plurality of virtual hand sample images and three-dimensional position labeling information of hand key points in each virtual hand sample image; the virtual hand sample image refers to an RGB image containing a virtual human hand and can be generated by a three-dimensional rendering technique, as shown in fig. 3 c; when the virtual hand sample image is generated by the three-dimensional rendering technique, the three-dimensional position labeling information of the hand key points in the virtual hand sample image is generated along with it.
The first hand key point detection model trained in advance refers to a teacher model, specifically refers to a large neural network model with a complex model structure and large calculation amount, such as a convolutional neural network, a deep neural network and the like with a complex model structure and large calculation amount, and the disclosure is not limited in particular; in an actual scene, the pre-trained first hand key point detection model can be a neural network large model with larger calculation amount and parameter amount, or can be a neural network large model with larger network capacity; in addition, the model precision and fitting capacity of the pre-trained first hand key point detection model are high, three-dimensional position information of hand key points in a real hand sample image can be accurately output, but the time for obtaining the three-dimensional position information is long, the operation capacity requirement is high, and the method is not suitable for being mounted in a mobile terminal with weak processing capacity. In addition, because the neural network large model of the first hand key point detection model has good migration capability, the domain adaptation problem between the real hand sample set and the virtual hand sample set can be solved, and even if the three-dimensional position labeling information of the hand key points in the real hand sample set is not available, the neural network large model of the first hand key point detection model can achieve good three-dimensional position information prediction effect in the real hand image through the assistance of the three-dimensional position labeling information of the hand key points in the virtual hand sample set.
It should be noted that the application scenario of the present disclosure is a mobile terminal, whose processing capability is limited while the real-time requirement is high; if the large neural network model, i.e. the pre-trained first hand key point detection model, were mounted in the mobile terminal, it would be difficult to achieve real-time prediction of the three-dimensional position information of the hand key points.
Specifically, the terminal acquires a real hand sample image, which may be a hand image obtained by real-time shooting, or may be a hand image obtained from a network or a local database; and inputting the real hand sample image into a pre-trained first hand key point detection model, and performing image detection processing, such as convolution processing and full connection processing, on the real hand sample image through the pre-trained first hand key point detection model to obtain three-dimensional position information of the hand key points in the real hand sample image, wherein the three-dimensional position information is used as three-dimensional position labeling information of the hand key points in the real hand sample image. In this way, the three-dimensional position labeling information of the hand key points in the real hand sample image is output through the pre-trained first hand key point detection model, and the three-dimensional position information of the hand key points does not need to be labeled manually, so that the labeling flow of the three-dimensional position information of the hand key points in the real hand sample image is simplified, the acquisition efficiency of the three-dimensional position labeling information of the hand key points in the real hand sample image is further improved, and the problem that the three-dimensional position information labeling of the real hand key points is very difficult is solved; the defect that errors are easy to occur in the three-dimensional position information of the hand key points through manual marking is avoided, and the accuracy of the obtained three-dimensional position marking information of the hand key points is further improved.
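A sketch of this labeling step, assuming a PyTorch teacher model that maps an RGB tensor to a (K, 3) tensor of keypoint coordinates; the tensor shapes are assumptions, and the point is only that the teacher's outputs are stored as the student's training labels in place of manual annotation.

```python
import torch

def pseudo_label_real_images(teacher_model, real_images):
    """Use the pre-trained first (teacher) hand key point detection model to
    produce 3D position 'labeling information' for real hand sample images."""
    teacher_model.eval()
    labels_3d = []
    with torch.no_grad():
        for image in real_images:                     # each image: (3, H, W) tensor
            pred = teacher_model(image.unsqueeze(0))  # (1, K, 3)
            labels_3d.append(pred.squeeze(0))         # store as the annotation
    return labels_3d
```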
Further, the pre-trained first hand key point detection model is trained as follows. First, an RGB picture containing a virtual hand and the three-dimensional coordinate information of the corresponding hand key points in that picture are generated by utilizing a three-dimensional rendering technology, and respectively serve as the virtual hand sample image and the three-dimensional position labeling information of the hand key points in the virtual hand sample image; a virtual hand sample set is then constructed according to the virtual hand sample image and the three-dimensional position labeling information of the hand key points in the virtual hand sample image. Next, RGB pictures containing real human hands, acquired by a monocular camera, and the two-dimensional coordinate information (obtained by manual labeling) of the hand key points in those pictures are acquired, and correspondingly serve as a real hand sample image and the two-dimensional position labeling information of the hand key points in the real hand sample image; a real hand sample set is constructed according to the real hand sample image and the two-dimensional position labeling information of the hand key points in the real hand sample image. A first hand key point detection model to be trained is then trained according to the real hand sample set and the virtual hand sample set until the loss value of the trained first hand key point detection model is smaller than a first preset threshold value, and the trained first hand key point detection model is taken as the pre-trained first hand key point detection model; the three-dimensional position information of the hand key points in the real hand sample image can then be output through the pre-trained first hand key point detection model.
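Since each teacher training step needs real samples with 2D labels and virtual samples with 3D labels, the two sample sets have to be drawn from together; the simple paired batching below is an assumption (the disclosure does not fix a batching scheme).

```python
import random

def mixed_batches(real_set, virtual_set, batch_size):
    """Yield (real_batch, virtual_batch) pairs so every training step of the
    first model sees real images with 2D labels and virtual images with 3D labels.
    real_set: list of (image, keypoints_2d); virtual_set: list of (image, keypoints_3d)."""
    random.shuffle(real_set)
    random.shuffle(virtual_set)
    steps = min(len(real_set), len(virtual_set)) // batch_size
    for i in range(steps):
        yield (real_set[i * batch_size:(i + 1) * batch_size],
               virtual_set[i * batch_size:(i + 1) * batch_size])
```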
The three-dimensional position information of the hand keypoints in the virtual hand sample image may be output by the first hand keypoint detection model trained in advance.
It should be noted that the real hand sample set and the virtual hand sample set are used together as the training sample set of the first hand key point detection model mainly for two reasons: (1) although virtual data, namely RGB pictures containing virtual hands and the three-dimensional coordinate information of the corresponding hand key points in those pictures, can be generated directly with the three-dimensional rendering technology, the backgrounds and hand shapes in the generated pictures are limited by that technology, so the virtual data generated at present still differs from real data to some extent and cannot pass for real data; (2) although the acquisition scenes of RGB pictures containing real hands are not limited and such pictures can contain rich background and hand information, the three-dimensional coordinate information of real hand key points is very difficult to annotate, and only the two-dimensional coordinate information of the real hand key points can be obtained. Therefore, the real hand sample set and the virtual hand sample set are used together to train the first hand key point detection model, so that the trained first hand key point detection model can accurately predict the three-dimensional position information of hand key points.
In step S220, training the second hand keypoint detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand keypoints in the real hand sample image to obtain a trained second hand keypoint detection model; the first hand key point detection model trained in advance is a teacher model, and the second hand key point detection model trained is a student model.
The second hand key point detection model refers to a student model, specifically refers to a neural network small model with a simple model structure and small calculation amount, such as a convolutional neural network with a simple model structure and small calculation amount, a deep neural network and the like, and the specific disclosure is not limited; in an actual scene, the second hand key point detection model can be a neural network small model with smaller calculated amount and parameter amount, and can also be a neural network small model with smaller network capacity; in addition, the fitting capability of the second hand key point detection model is low, the requirement on the computing capability is low, and the method is suitable for being operated in a mobile terminal, such as a smart phone, so that the effect of detecting three-dimensional position information of the hand key points in real time is achieved.
In the technical field of knowledge distillation, a large neural network model with a complex model structure and large calculation amount, namely a first hand key point detection model, is called a teacher model; the neural network small model with simple structure and small calculation amount, namely the second hand key point detection model, is called a student model; according to the technical scheme, the three-dimensional position marking information of the hand key points in the real hand sample image output by the teacher model, namely the first hand key point detection model with a complex model structure and large calculation amount, is utilized to train the student model, namely the second hand key point detection model with a simple model structure and small calculation amount, and the domain adaptation capability of the first hand key point detection model to the real hand sample set and the virtual hand sample set can be indirectly migrated to the second hand key point detection model, so that the three-dimensional position information prediction effect of the hand key points by the second hand key point detection model is close to the three-dimensional position information prediction effect of the hand key points by the first hand key point detection model.
Specifically, the terminal trains a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position marking information of the hand key points in the real hand sample image until the loss value of the trained second hand key point detection model is smaller than a second preset threshold value, and the trained second hand key point detection model is used as a second hand key point detection model obtained through training. In this way, according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, the second hand key point detection model to be trained is trained for multiple times, so that the accuracy of the three-dimensional position information of the hand key points output by the trained second hand key point detection model is improved, and the defect that the detection accuracy of the hand key points by the hand key point detection model obtained by training only the virtual hand data is lower is overcome.
Furthermore, the trained second hand key point detection model is carried in the mobile terminal, so that real-time prediction of three-dimensional position information of hand key points in a real hand image or a virtual hand image can be realized, special hardware equipment is not required to be additionally connected, and the use threshold of a three-dimensional position information prediction technology of the hand key points is reduced.
In the above training method of the hand key point detection model, a pre-trained first hand key point detection model is obtained by training on a real hand sample set, comprising a real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and a virtual hand sample set, comprising a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image. Three-dimensional position labeling information of hand key points in the real hand sample image is then obtained through the pre-trained first hand key point detection model. Finally, a second hand key point detection model to be trained is trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, to obtain a trained second hand key point detection model; the pre-trained first hand key point detection model is a teacher model, and the second hand key point detection model is a student model. The method thereby trains the second hand key point detection model according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image. Because the pre-trained first hand key point detection model is obtained by training on both the real hand sample set and the virtual hand sample set, the three-dimensional position labeling information it outputs for the real hand sample image approaches real hand data; therefore, the second hand key point detection model trained on the real hand sample image and this labeling information has higher model accuracy, avoiding the defect of low hand key point detection accuracy in the trained second hand key point detection model.
In an exemplary embodiment, as shown in fig. 4, in step S220, according to the real hand sample image and the three-dimensional position labeling information of the hand keypoints in the real hand sample image, training the second hand keypoint detection model to be trained to obtain a trained second hand keypoint detection model, which may be specifically implemented by the following steps:
in step S410, the real hand sample image is input into the second hand keypoint detection model to be trained, so as to obtain three-dimensional position prediction information of the hand keypoints in the real hand sample image.
Specifically, the terminal inputs the real hand sample image into a second hand key point detection model to be trained, extracts image features of the real hand sample image through the second hand key point detection model to be trained, and performs neural network processing (such as convolution processing, pooling processing, full connection processing and the like) on the image features of the real hand sample image to obtain three-dimensional position prediction information of hand key points in the real hand sample image.
In step S420, a loss value of the second hand key point detection model to be trained is determined according to the difference between the three-dimensional position prediction information and the three-dimensional position labeling information.
The loss value of the second hand key point detection model is used for measuring the detection accuracy of the second hand key point detection model to the hand key points.
Specifically, the terminal acquires the difference value between the three-dimensional position prediction information of the hand key points in the real hand sample image and the corresponding three-dimensional position labeling information, and calculates the loss value of the second hand key point detection model to be trained by combining a loss function (such as a minimum mean square error loss function).
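Assuming both tensors have shape (N, K, 3), the loss value of step S420 with the minimum mean square error loss mentioned above reduces to a one-liner; the framework choice is an assumption.

```python
import torch.nn.functional as F

def student_loss(pred_3d, label_3d):
    """Loss value of the second model to be trained: mean squared difference
    between predicted and labeled 3D keypoint coordinates."""
    return F.mse_loss(pred_3d, label_3d)
```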
In step S430, according to the loss value, the model parameters of the second hand key point detection model to be trained are adjusted, so as to obtain the trained second hand key point detection model.
Specifically, the terminal determines a model parameter update gradient of the second hand key point detection model to be trained according to the loss value of the second hand key point detection model to be trained; updates the model parameters of the second hand key point detection model to be trained by back propagation according to the model parameter update gradient, to obtain a second hand key point detection model with updated model parameters; and retrains the second hand key point detection model with updated model parameters until the loss value of the trained second hand key point detection model is smaller than a second preset threshold value, and then takes the resulting model as the trained second hand key point detection model.
For example, the terminal takes the second hand key point detection model with updated model parameters as the second hand key point detection model to be trained, and repeatedly executes steps S410 to S430 to continuously adjust the model parameters of the second hand key point detection model until the loss value obtained from the model after training is smaller than the second preset threshold value; the model at that point is taken as the trained second hand key point detection model.
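Repeating steps S410 to S430 until the loss value falls below the second preset threshold might be sketched as below; the Adam optimizer, learning rate, threshold value, and per-sample batching are assumptions.

```python
import torch
import torch.nn.functional as F

def train_student(student_model, real_images, labels_3d,
                  second_preset_threshold=1e-3, lr=1e-4, max_epochs=100):
    """Adjust the student model's parameters by back propagation until the
    mean training loss is smaller than the preset threshold (or max_epochs)."""
    optimizer = torch.optim.Adam(student_model.parameters(), lr=lr)
    for _ in range(max_epochs):
        total_loss = 0.0
        for image, label_3d in zip(real_images, labels_3d):
            pred_3d = student_model(image.unsqueeze(0)).squeeze(0)  # (K, 3) prediction
            loss = F.mse_loss(pred_3d, label_3d)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        if total_loss / len(real_images) < second_preset_threshold:
            break
    return student_model
```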
It should be noted that, if the second hand keypoint detection model is directly trained by using the real hand sample set and the virtual hand sample set for training the first hand keypoint detection model, because virtual data in the virtual hand sample set cannot reach a spurious level, and parameters to be optimized of the second hand keypoint detection model (small neural network model) are relatively small, the generalization capability is not good for a large neural network model (such as the first hand keypoint detection model), and the domain adaptation problem between the virtual hand sample set and the real hand sample set cannot be solved, so when the second hand keypoint detection model is jointly trained by directly using the real hand sample set and the virtual hand sample set, the prediction effect of the second hand keypoint detection model on the three-dimensional position information of the hand keypoints is very poor; therefore, the three-dimensional position marking information of the hand key points in the real hand sample image identified by the real hand sample image and the pre-trained first hand key point detection model is utilized to train the second hand key point detection model, and the adaptive capacity of the first hand key point detection model domain can be indirectly transferred to the second hand key point detection model.
According to the technical scheme provided by the embodiment of the disclosure, the purpose of training the second hand key point detection model according to the real hand sample image and the three-dimensional position marking information of the hand key points in the real hand sample image is achieved, and the defect that the hand key point detection model obtained by training only the virtual hand data is low in detection accuracy of the hand key points is avoided, so that the detection accuracy of the hand key points of the second hand key point detection model obtained by training is improved; meanwhile, the three-dimensional position information of the hand key points in the real hand sample image output by the pre-trained first hand key point detection model is used as the three-dimensional position marking information of the hand key points in the real hand sample image, the three-dimensional position marking information is not needed to be obtained through manual marking, the defect that the process of manually marking the three-dimensional position information of the hand key points is complex and error occurs easily is overcome, the accuracy of the obtained three-dimensional position marking information of the hand key points in the real hand sample image is improved, and the model accuracy of the second hand key point detection model obtained through training based on the three-dimensional position marking information of the hand key points in the real hand sample image is further improved.
In an exemplary embodiment, in step S410, inputting the real hand sample image into the second hand key point detection model to be trained to obtain three-dimensional position prediction information of the hand key points in the real hand sample image includes: dividing the real hand sample image to obtain a plurality of key area images in the real hand sample image, where each key area image contains a hand key point; and inputting the plurality of key area images in the real hand sample image into the second hand key point detection model to be trained, and respectively carrying out convolution processing and full connection processing on the plurality of key area images through the second hand key point detection model to be trained, to obtain the three-dimensional position prediction information of the hand key points in the real hand sample image.
The key area image refers to an area image containing key points of the hand, and the image size of the key area image is smaller than that of the whole real hand sample image.
Specifically, the terminal inputs the real hand sample image into a pre-trained hand image segmentation model, and segments the real hand sample image through the pre-trained hand image segmentation model to obtain a plurality of key area images in the real hand sample image; then, the terminal respectively inputs the plurality of key area images in the real hand sample image into the second hand key point detection model to be trained, respectively carries out convolution processing and full connection processing on the plurality of key area images through the second hand key point detection model to be trained to obtain the three-dimensional position prediction information of the hand key point in each key area image, and obtains the three-dimensional position prediction information of the hand key points in the real hand sample image according to the three-dimensional position prediction information of the hand key point in each key area image.
The hand image segmentation model is obtained by identifying training samples according to collected images and training based on a preset neural network (such as a convolutional neural network), and is used for carrying out segmentation processing on an input hand image to obtain a plurality of key area images corresponding to the hand image and containing hand key points; the image recognition training sample comprises input hand images and a plurality of key area image annotation information corresponding to the hand images.
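One possible shape for the convolution and full connection processing applied to each key area image is sketched below; the 64x64 crop size, the channel widths, and the one-keypoint-per-region output are assumptions, since the disclosure does not specify an architecture.

```python
import torch
import torch.nn as nn

class RegionKeypointHead(nn.Module):
    """Small student sub-network applied to one 64x64 key area crop; it outputs
    the 3D coordinates (x, y, depth) of the hand key point inside that region."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
        )
        self.fc = nn.Linear(64 * 8 * 8, 3)                          # full connection

    def forward(self, region):                  # region: (N, 3, 64, 64)
        features = self.conv(region)
        return self.fc(features.flatten(1))     # (N, 3)

def predict_from_regions(head: RegionKeypointHead, region_crops) -> torch.Tensor:
    """Stack one 3D prediction per key area crop into a (K, 3) tensor."""
    return torch.stack([head(crop.unsqueeze(0)).squeeze(0) for crop in region_crops])
```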
According to the technical scheme provided by the embodiment of the disclosure, the real hand sample image is segmented to obtain a plurality of key area images in the real hand sample image, and then the plurality of key area images in the real hand sample image are input into the second hand key point detection model to be trained, so that the second hand key point detection model only needs to detect a plurality of key area images containing hand key points in the real hand sample image, and invalid or redundant images in the whole hand area image are not needed to be detected, thereby simplifying the image detection range of the hand key points and further improving the detection efficiency of the hand key points; meanwhile, the interference of invalid or redundant images in the whole hand area image is reduced, and the detection accuracy of the hand key points is improved.
In an exemplary embodiment, the training method of the hand keypoint detection model provided in the present disclosure further includes a training step of the first hand keypoint detection model, which specifically includes the following contents: respectively inputting a real hand sample image and a virtual hand sample image into a first hand key point detection model to be trained to obtain two-dimensional position prediction information of hand key points in the real hand sample image and three-dimensional position prediction information of hand key points in the virtual hand sample image; obtaining a first sub-loss value according to the difference between the two-dimensional position prediction information and the two-dimensional position labeling information of the hand key points in the real hand sample image, obtaining a second sub-loss value according to the difference between the three-dimensional position prediction information and the three-dimensional position labeling information of the hand key points in the virtual hand sample image, and obtaining a total loss value of the first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value; and adjusting the model parameters of the first hand key point detection model to be trained according to the total loss value to obtain a pre-trained first hand key point detection model.
The first sub-loss value measures the prediction accuracy of the first hand key point detection model on the two-dimensional position information of the hand key points in the real hand sample image; the second sub-loss value measures its prediction accuracy on the three-dimensional position information of the hand key points in the virtual hand sample image; and the total loss value measures the detection accuracy of the first hand key point detection model for hand key points.
Specifically, the terminal inputs the real hand sample image into the first hand key point detection model to be trained, extracts the image features of the real hand sample image, and performs neural network processing (such as convolution, pooling and full connection processing) on these features to obtain the two-dimensional position prediction information of the hand key points in the real hand sample image; similarly, the virtual hand sample image is input into the first hand key point detection model to be trained to obtain the three-dimensional position prediction information of the hand key points in the virtual hand sample image. The terminal then obtains the difference between the two-dimensional position prediction information and the two-dimensional position labeling information of the hand key points in the real hand sample image and applies a loss function (for example, a mean square error loss function) to obtain the first sub-loss value, and likewise obtains the difference between the three-dimensional position prediction information and the three-dimensional position labeling information of the hand key points in the virtual hand sample image and applies a loss function to obtain the second sub-loss value; the total loss value of the first hand key point detection model to be trained is the sum, or a weighted sum, of the first sub-loss value and the second sub-loss value.
Then, the terminal determines the model parameter update gradient of the first hand key point detection model to be trained according to the total loss value, and adjusts the model parameters through back propagation according to this gradient to obtain a first hand key point detection model with adjusted parameters. The adjusted model is then trained again: it is taken as the first hand key point detection model to be trained and the above training process is repeated to keep adjusting its parameters until the loss value of the trained model is smaller than a first preset threshold, at which point the trained model is taken as the pre-trained first hand key point detection model.
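A minimal sketch of one training step of the first (teacher) model under the two-branch supervision described above, assuming the model returns a dictionary with a two-dimensional output head for real images and a three-dimensional output head for virtual images; the loss weights and the mean square error loss are illustrative choices, not requirements of the disclosure.

```python
import torch
import torch.nn.functional as F

def teacher_train_step(model, optimizer, real_img, real_kpts_2d, virt_img, virt_kpts_3d,
                       w_2d: float = 1.0, w_3d: float = 1.0):
    """One update of the first hand key point detection model (teacher).

    real_img / real_kpts_2d : real hand sample images and 2D position labels
    virt_img / virt_kpts_3d : virtual hand sample images and 3D position labels
    """
    pred_2d = model(real_img)["kpts_2d"]          # two-dimensional position prediction
    pred_3d = model(virt_img)["kpts_3d"]          # three-dimensional position prediction

    loss_2d = F.mse_loss(pred_2d, real_kpts_2d)   # first sub-loss value
    loss_3d = F.mse_loss(pred_3d, virt_kpts_3d)   # second sub-loss value
    total_loss = w_2d * loss_2d + w_3d * loss_3d  # weighted sum -> total loss value

    optimizer.zero_grad()
    total_loss.backward()                         # back propagation to adjust parameters
    optimizer.step()
    return total_loss.item()
```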
According to the technical scheme provided by the embodiments of the present disclosure, the first hand key point detection model is trained over multiple iterations so that it can output the three-dimensional position information of the hand key points in real hand sample images; because this information no longer needs to be labeled manually, the labeling process is simplified, the acquisition efficiency of the three-dimensional position information of hand key points in real hand sample images is improved, and the difficulty of obtaining three-dimensional position labels for real hand key points is resolved. Meanwhile, training the second hand key point detection model on the three-dimensional position information output by the first hand key point detection model indirectly transfers the domain adaptation capability of the first model to the second model, so that, using only the three-dimensional position labeling information of hand key points in virtual hand sample images and the two-dimensional position labeling information of hand key points in real hand sample images, the trained second hand key point detection model can achieve good three-dimensional position prediction on real hand images and can predict the three-dimensional positions of hand key points in real time on a mobile terminal.
In an exemplary embodiment, the training method of the hand keypoint detection model provided by the present disclosure further includes a step of constructing a virtual hand sample set, which specifically includes the following contents: acquiring a plurality of different hand models and background images corresponding to the hand models; respectively carrying out three-dimensional rendering processing on each hand model on the background image corresponding to each hand model to obtain virtual hand sample images corresponding to each hand model and three-dimensional position labeling information of hand key points in each virtual hand sample image; the virtual hand sample image corresponding to each hand model contains a background image of the corresponding hand model; and constructing a virtual hand sample set according to each virtual hand sample image and the three-dimensional position labeling information of the hand key points in each virtual hand sample image.
The hand model refers to a three-dimensional model of a hand, and specifically refers to a hand model drawn by a designer.
For example, the terminal acquires a plurality of different hand models drawn by a designer, and configures a background image for each hand model respectively to obtain background images corresponding to the hand models; performing three-dimensional rendering processing on each hand model on the background image corresponding to each hand model by using a three-dimensional rendering processing technology, so as to obtain a virtual hand sample image corresponding to each hand model; generating virtual hand sample images corresponding to all hand models through a three-dimensional rendering technology, and simultaneously generating three-dimensional position labeling information of hand key points in the virtual hand sample images corresponding to all hand models; and constructing a virtual hand sample set according to each virtual hand sample image and the three-dimensional position labeling information of the hand key points in each virtual hand sample image.
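A hedged sketch of the sample-set construction described above: `render_hand_model` is a hypothetical helper standing in for the three-dimensional rendering step, assumed to return a composited RGB image (PIL-style, with a `save` method) together with the three-dimensional key point coordinates it rendered as a nested list.

```python
import json
from pathlib import Path

def build_virtual_sample_set(hand_models, backgrounds, render_hand_model, out_dir="virtual_set"):
    """Construct a virtual hand sample set from hand models and their background images.

    render_hand_model(model, background) is assumed to return (image, kpts_3d), i.e. the
    rendered virtual hand sample image and the 3D position labeling information it implies.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    samples = []
    for i, (model, background) in enumerate(zip(hand_models, backgrounds)):
        image, kpts_3d = render_hand_model(model, background)   # rendering yields labels for free
        image_path = out / f"virtual_{i:05d}.png"
        image.save(image_path)
        samples.append({"image": str(image_path), "kpts_3d": kpts_3d})
    with open(out / "annotations.json", "w") as f:
        json.dump(samples, f)                                   # 3D position labeling information
    return samples
```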
According to the technical scheme provided by the embodiment of the disclosure, through a three-dimensional rendering technology, virtual hand sample images corresponding to different hand models and three-dimensional position labeling information of hand key points in the virtual hand sample images are generated on background images corresponding to the different hand models; according to each virtual hand sample image and the three-dimensional position labeling information of the hand key points in each virtual hand sample image, a virtual hand sample set is constructed, and the types of the virtual hand sample images in the virtual hand sample set are enriched; meanwhile, the training of the first hand key point detection model to be trained according to the virtual hand sample set and the real hand sample set is facilitated, so that the trained first hand key point detection model can accurately predict three-dimensional position information of hand key points.
In an exemplary embodiment, the training method of the hand keypoint detection model provided by the present disclosure further includes a step of updating the second hand keypoint detection model, which specifically includes the following contents: acquiring a real hand test image, inputting the real hand test image into a trained second hand key point detection model, and obtaining three-dimensional position prediction information of hand key points in the real hand test image; the real hand test image carries three-dimensional position labeling information of the corresponding hand key points; screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image with the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position labeling information being larger than the preset difference value; retraining the trained second hand key point detection model according to the target hand test image and the three-dimensional position labeling information of the hand key points in the target hand test image to obtain a retrained second hand key point detection model; and updating the trained second hand key point detection model into a retrained second hand key point detection model.
The real hand test image is a re-collected real hand image (not the real hand sample image) and is used for testing the trained second hand key point detection model; the real hand test image carries the three-dimensional position labeling information of the corresponding hand key points, and can be obtained through prediction of a pre-trained first hand key point detection model.
If the difference between the three-dimensional position prediction information of the hand key points in a target hand test image and the corresponding three-dimensional position labeling information is larger than the preset difference value, it indicates that the trained second hand key point detection model has detected the hand key points in that target hand test image incorrectly.
Specifically, the terminal acquires a real hand test image, and inputs the real hand test image into a trained second hand key point detection model to obtain three-dimensional position prediction information of hand key points in the real hand test image; comparing the three-dimensional position prediction information of the hand key points in the real hand test image with the corresponding three-dimensional position labeling information to obtain a comparison result; according to the comparison result, screening out a real hand test image with the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position labeling information being larger than the preset difference value from the acquired real hand test image, and taking the real hand test image as a target hand test image; and retraining the trained second hand key point detection model according to the target hand test image and the three-dimensional position labeling information of the hand key points in the target hand test image until the difference between the three-dimensional position prediction information of the hand key points contained in the target hand test image output by the retrained second hand key point detection model and the corresponding three-dimensional position labeling information is smaller than or equal to a preset difference value, and updating the trained second hand key point detection model into the retrained second hand key point detection model.
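A schematic version of the screening step, assuming both the model output and the labels are tensors of per-keypoint (x, y, z) coordinates; the error measure and threshold handling here are illustrative stand-ins for the "difference value" comparison in the text.

```python
import torch

@torch.no_grad()
def select_hard_test_images(model, test_loader, max_error: float):
    """Screen real hand test images whose prediction error exceeds a preset threshold.

    Returns the (image, label) pairs on which the trained second model is judged to have
    detected the hand key points incorrectly; these target images are used for retraining.
    """
    hard_samples = []
    for image, kpts_3d_label in test_loader:              # labels come from the teacher model
        pred = model(image)                                # (B, num_keypoints, 3)
        # mean per-keypoint Euclidean distance as the "difference value" per image
        error = (pred - kpts_3d_label).norm(dim=-1).mean(dim=-1)
        for i in torch.nonzero(error > max_error).flatten():
            hard_samples.append((image[i], kpts_3d_label[i]))
    return hard_samples
```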
According to the technical scheme provided by the embodiments of the present disclosure, the trained second hand key point detection model is retrained on the target hand test images that it previously detected incorrectly, which improves its accuracy in detecting hand key points; at the same time, the generalization capability of the second hand key point detection model is improved and the low three-dimensional prediction accuracy caused by over-fitting is avoided, further improving the detection accuracy of the second hand key point detection model for hand key points.
Fig. 5 is a flowchart illustrating another method for training a hand keypoint detection model according to an exemplary embodiment, and as shown in fig. 5, the method for training a hand keypoint detection model is used in the terminal shown in fig. 1, and includes the following steps:
in step S501, a real hand sample set and a virtual hand sample set are acquired, the real hand sample set including a real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set including a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image.
In step S502, the real hand sample image and the virtual hand sample image are input into the first hand keypoint detection model to be trained, respectively, to obtain two-dimensional position prediction information of the hand keypoints in the real hand sample image and three-dimensional position prediction information of the hand keypoints in the virtual hand sample image.
In step S503, a first sub-loss value is obtained according to a difference between the two-dimensional position prediction information and the two-dimensional position labeling information of the hand key points in the real hand sample image, and a second sub-loss value is obtained according to a difference between the three-dimensional position prediction information and the three-dimensional position labeling information of the hand key points in the virtual hand sample image.
In step S504, a total loss value of the first hand keypoint detection model to be trained is obtained according to the first sub-loss value and the second sub-loss value.
In step S505, according to the total loss value, the model parameters of the first hand keypoint detection model to be trained are adjusted to obtain a pre-trained first hand keypoint detection model.
In step S506, the real hand sample image is input into a pre-trained first hand keypoint detection model, and three-dimensional position labeling information of the hand keypoints in the real hand sample image is obtained.
In step S507, the real hand sample image is input into the second hand keypoint detection model to be trained, so as to obtain three-dimensional position prediction information of the hand keypoints in the real hand sample image.
In step S508, a loss value of the second hand keypoint detection model to be trained is determined according to a difference between the three-dimensional position prediction information and the three-dimensional position labeling information of the hand keypoints in the real hand sample image.
In step S509, according to the loss value, the model parameters of the second hand keypoint detection model to be trained are adjusted to obtain a trained second hand keypoint detection model.
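Steps S506 to S509 amount to a teacher-student distillation loop; the following is a minimal sketch under the assumption that both models map an image batch directly to a tensor of three-dimensional key point coordinates, with the pre-trained first model acting as the frozen teacher.

```python
import torch
import torch.nn.functional as F

def distill_student(teacher, student, real_loader, optimizer, epochs: int = 10):
    """Schematic loop for steps S506-S509: the pre-trained teacher pseudo-labels the
    real hand sample images, and the lightweight student is fitted to those labels."""
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for real_img in real_loader:
            with torch.no_grad():
                kpts_3d_label = teacher(real_img)           # S506: teacher output as 3D labels
            kpts_3d_pred = student(real_img)                # S507: student 3D prediction
            loss = F.mse_loss(kpts_3d_pred, kpts_3d_label)  # S508: loss from the difference
            optimizer.zero_grad()
            loss.backward()                                 # S509: adjust student parameters
            optimizer.step()
    return student
```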
According to the technical scheme provided by the embodiments of the present disclosure, the second hand keypoint detection model is trained on the real hand sample image and the three-dimensional position labeling information of the hand keypoints in the real hand sample image, which avoids the low detection accuracy of a hand keypoint detection model trained only on virtual hand data and thereby improves the detection accuracy of the trained second hand keypoint detection model.
In an exemplary embodiment, as shown in fig. 6, the present disclosure further provides an application scenario, where the application scenario applies the training method of the hand keypoint detection model described above. Specifically, the training method of the hand key point detection model is applied to the application scene as follows:
the terminal uses a three-dimensional rendering technology to generate RGB pictures containing virtual human hands together with the three-dimensional coordinate information of the corresponding hand key points, which serve as virtual data; at the same time, it labels the two-dimensional coordinate information of the hand key points in RGB pictures of real human hands captured with a monocular camera, and takes these pictures and their two-dimensional coordinates as real data. The real data and the virtual data are then combined as the training data of a large neural network model with a complex structure and a large amount of computation (such as the first hand key point detection model), and the large model is trained on this data. The trained large neural network model then predicts, through forward propagation, the training data containing real human hands, and the three-dimensional coordinates of the hand key points it predicts are taken as the three-dimensional coordinate labeling information of the real hand key points. Next, images containing real human hands are input as training samples into a pre-established small neural network model (such as the second hand key point detection model), with the three-dimensional coordinates predicted by the large neural network model as the training target of the small model; a mean square error loss function is constructed from the forward-propagation predictions of the small model and the predictions of the large model, and the network parameters of the small model are continuously updated through back propagation so that the effect of the small neural network model approaches that of the large neural network model.
According to the above technical scheme, the strong migration capability of the large neural network model is used to solve the domain adaptation problem between virtual data and real data, and the trained large model provides three-dimensional labels for the real hand data, which solves the difficulty of obtaining three-dimensional labels for real hand key points. The labeled real hand data is then used to train the small neural network model, indirectly transferring the domain adaptation capability to it, so that with only virtual three-dimensional labels and real two-dimensional labels the small model can achieve a good three-dimensional coordinate prediction effect on real data and can predict the three-dimensional coordinates of human hand key points in real time on the mobile terminal.
In an exemplary embodiment, the present disclosure further provides a method for detecting a hand key point, which specifically includes the following steps: collecting hand images; inputting the hand image into a trained second hand key point detection model to obtain three-dimensional position prediction information of the hand key points in the hand image; the trained second hand keypoint detection model is obtained by training according to the training method of the hand keypoint detection model according to any embodiment of the disclosure.
The hand image may be a real hand image, such as an RGB image including a human hand, or may be a virtual hand image, such as an RGB image including a virtual human hand.
For example, the terminal obtains an RGB picture containing a human hand captured by a monocular camera (such as a mobile phone camera), and inputs the picture into the trained second hand key point detection model to obtain the three-dimensional position prediction information of the hand key points in the picture. It should be noted that the above technical solution may also be applied to a server, which will not be described in detail here.
The three-dimensional position prediction information of the hand key points in the real hand image may be output through the trained second hand key point detection model, or the three-dimensional position prediction information of the hand key points in the virtual hand image may be output, which is not limited in the disclosure.
Further, the terminal can also connect the hand key points according to their three-dimensional position prediction information and the hand skeleton structure to obtain a hand key point connection diagram, and then query the correspondence between hand key point connection diagrams and gesture information to obtain the gesture information corresponding to the connection diagram, which is taken as the gesture information of the human hand in the hand image.
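As an illustration of the connection diagram mentioned above, the sketch below links predicted key points along an assumed (partial) hand skeleton; the edge indices are hypothetical and would follow whatever key point ordering the deployed model actually uses.

```python
# Illustrative only: a partial hand skeleton given as index pairs into the predicted
# key point array (a full 21-point topology would list every finger chain).
HAND_SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4),    # thumb chain
                       (0, 5), (5, 6), (6, 7), (7, 8)]    # index-finger chain

def build_connection_graph(kpts_3d):
    """Connect predicted 3D key points along the hand skeleton structure.

    kpts_3d: sequence of (x, y, z) predictions; returns a list of 3D line segments
    that can be matched against stored gesture templates to look up gesture information.
    """
    return [(kpts_3d[a], kpts_3d[b]) for a, b in HAND_SKELETON_EDGES]
```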
Of course, the terminal can also determine the finger orientation in the hand image according to the three-dimensional position prediction information of the hand key points. In addition, in an AR/VR application scenario, the terminal may calculate the rotation angle of each hand key point from its three-dimensional position prediction information and drive the corresponding hand model accordingly, so that whatever action the real hand performs, the hand model simulates the corresponding action.
According to the method for detecting the hand key points, external special hardware equipment such as a multi-camera system or a depth camera system is not required, and three-dimensional position information of the hand key points in the hand images can be accurately predicted in real time only by using the hand images acquired by the monocular cameras, so that the prediction threshold of the three-dimensional position information of the hand key points is reduced; meanwhile, the hand key point detection model obtained through real data training can accurately predict the three-dimensional position information of the hand key points in the hand image, the defect that the detection accuracy of the hand key points is low in the hand key point detection model obtained through virtual hand data training is avoided, and the detection accuracy of the hand key points is further improved.
In an exemplary embodiment, inputting the hand image into the trained second hand keypoint detection model to obtain three-dimensional position prediction information of the hand keypoints in the hand image, including: performing image enhancement processing on the hand image to obtain the hand image after the image enhancement processing; inputting the hand image subjected to the image enhancement processing into a trained second hand key point detection model, and carrying out convolution processing and full connection processing on the hand image subjected to the image enhancement processing through the trained second hand key point detection model to obtain three-dimensional position prediction information of hand key points in the hand image.
For example, the terminal performs image enhancement processing on the hand image, such as image denoising processing, image resolution enhancement processing, image contrast enhancement processing, and the like, to obtain the hand image after the image enhancement processing; inputting the hand image after the image enhancement processing into a trained second hand key point detection model, extracting image features in the hand image after the image enhancement processing through the trained second hand key point detection model, and performing neural network processing (such as convolution processing, pooling processing, full connection processing and the like) on the image features in the hand image after the image enhancement processing to obtain three-dimensional position prediction information of hand key points in the hand image.
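A schematic inference path combining a simple enhancement step with the trained second model; the normalization used here is only a stand-in for the denoising, resolution and contrast enhancement named in the text, and the input size is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def detect_hand_keypoints(model, hand_image: torch.Tensor, size: int = 224):
    """Schematic inference: light image enhancement followed by the trained
    second hand key point detection model.

    hand_image: (3, H, W) RGB tensor in [0, 255] from a monocular camera frame.
    """
    x = hand_image.float() / 255.0
    x = (x - x.mean()) / (x.std() + 1e-6)          # simple contrast normalization, standing in
                                                   # for the enhancement step in the text
    x = F.interpolate(x.unsqueeze(0), size=(size, size),
                      mode="bilinear", align_corners=False)
    model.eval()
    return model(x)                                # (1, num_keypoints, 3) position prediction
```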
According to the technical scheme provided by the embodiment of the disclosure, the image enhancement processing is performed on the hand image, and then the hand image after the image enhancement processing is input into the trained second hand key point detection model, so that the accuracy of the three-dimensional position prediction information of the hand key points in the hand image output by the trained second hand key point detection model is further improved.
It should be understood that, although the steps in the flowcharts of fig. 2, 4 and 5 are shown in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, the execution order of the steps is not strictly limited and they may be performed in other orders. Moreover, at least some of the steps in fig. 2, 4 and 5 may include sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Fig. 7 is a block diagram of a training apparatus for a hand keypoint detection model, according to an exemplary embodiment. Referring to fig. 7, the apparatus includes an information acquisition unit 710 and a first training unit 720.
An information acquisition unit 710 configured to perform acquisition of three-dimensional position annotation information of hand keypoints in a real hand sample image; the three-dimensional position labeling information is obtained by identifying a real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises a real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image.
The first training unit 720 is configured to perform training on the second hand keypoint detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand keypoints in the real hand sample image, so as to obtain a trained second hand keypoint detection model; the pre-trained first hand keypoint detection model is a teacher model, and the trained second hand keypoint detection model is a student model.
In an exemplary embodiment, the first training unit 720 is further configured to perform inputting the real hand sample image into the second hand keypoint detection model to be trained, to obtain three-dimensional position prediction information of the hand keypoints in the real hand sample image; determining a loss value of a second hand key point detection model to be trained according to the difference value between the three-dimensional position prediction information and the three-dimensional position labeling information; and according to the loss value, adjusting the model parameters of the second hand key point detection model to be trained to obtain the trained second hand key point detection model.
In an exemplary embodiment, the first training unit 720 is further configured to perform a segmentation process on the real hand sample image to obtain a plurality of key area images in the real hand sample image; the key region image contains hand key points; inputting a plurality of key area images in the real hand sample image into a second hand key point detection model to be trained, and respectively carrying out convolution processing and full connection processing on the plurality of key area images through the second hand key point detection model to be trained to obtain three-dimensional position prediction information of hand key points in the real hand sample image.
In an exemplary embodiment, the training device of the hand keypoint detection model further includes a second training unit configured to perform inputting the real hand sample image and the virtual hand sample image into the first hand keypoint detection model to be trained, respectively, to obtain two-dimensional position prediction information of the hand keypoints in the real hand sample image and three-dimensional position prediction information of the hand keypoints in the virtual hand sample image; obtaining a first sub-loss value according to the difference value between the two-dimensional position prediction information and the two-dimensional position labeling information of the hand key points in the real hand sample image, obtaining a second sub-loss value according to the difference value between the three-dimensional position prediction information and the three-dimensional position labeling information of the hand key points in the virtual hand sample image, and obtaining a total loss value of a first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value; and according to the total loss value, adjusting the model parameters of the first hand key point detection model to be trained to obtain a pre-trained first hand key point detection model.
In an exemplary embodiment, the training device of the hand keypoint detection model further includes a sample set acquisition unit configured to perform acquisition of a plurality of different hand models and background images corresponding to the respective hand models; respectively carrying out three-dimensional rendering processing on each hand model on the background image corresponding to each hand model to obtain virtual hand sample images corresponding to each hand model and three-dimensional position labeling information of hand key points in each virtual hand sample image; the virtual hand sample image corresponding to each hand model contains a background image of the corresponding hand model; and constructing a virtual hand sample set according to each virtual hand sample image and the three-dimensional position labeling information of the hand key points in each virtual hand sample image.
In an exemplary embodiment, the training device of the hand keypoint detection model further includes a model updating unit configured to perform acquisition of a real hand test image, input the real hand test image into the trained second hand keypoint detection model, and obtain three-dimensional position prediction information of the hand keypoints in the real hand test image; the real hand test image carries three-dimensional position labeling information of the corresponding hand key points; screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image with the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position labeling information being larger than the preset difference value; retraining the trained second hand key point detection model according to the target hand test image and the three-dimensional position labeling information of the hand key points in the target hand test image to obtain a retrained second hand key point detection model; and updating the trained second hand key point detection model into a retrained second hand key point detection model.
The specific manner in which the various units of the apparatus in the above embodiments perform their operations has been described in detail in the embodiments of the method and will not be elaborated here.
Fig. 8 is a block diagram illustrating a hand keypoint detection device, according to an exemplary embodiment. Referring to fig. 8, the apparatus includes a hand image acquisition unit 810 and a position information determination unit 820.
The hand image acquisition unit 810 is configured to perform acquisition of hand images.
A position information determining unit 820 configured to perform inputting the hand image into the trained second hand keypoint detection model, to obtain three-dimensional position prediction information of the hand keypoints in the hand image; the trained second hand keypoint detection model is trained according to the training method of the hand keypoint detection model in any embodiment of the disclosure.
The specific manner in which the various units of the apparatus in the above embodiments perform their operations has been described in detail in the embodiments of the method and will not be elaborated here.
Fig. 9 is a block diagram of an electronic device 900 for performing the above-described training method of the hand keypoint detection model or the detection method of hand keypoints, according to an exemplary embodiment. For example, the electronic device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 9, the electronic device 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls overall operation of the electronic device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 902 may include one or more processors 920 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 902 can include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operations at the electronic device 900. Examples of such data include instructions for any application or method operating on the electronic device 900, contact data, phonebook data, messages, pictures, video, and so forth. The memory 904 may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The power supply component 906 provides power to the various components of the electronic device 900. Power supply components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 900.
The multimedia component 908 comprises a screen between the electronic device 900 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. When the electronic device 900 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC) configured to receive external audio signals when the electronic device 900 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 904 or transmitted via the communication component 916. In some embodiments, the audio component 910 further includes a speaker for outputting audio signals.

The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor assembly 914 includes one or more sensors for providing status assessments of various aspects of the electronic device 900. For example, the sensor assembly 914 may detect the on/off state of the electronic device 900 and the relative positioning of components, such as its display and keypad; it may also detect a change in the position of the electronic device 900 or of one of its components, the presence or absence of user contact with the electronic device 900, the orientation or acceleration/deceleration of the electronic device 900, and a change in its temperature. The sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communication between the electronic device 900 and other devices, either wired or wireless. The electronic device 900 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 916 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for performing the above-described training method of the hand keypoint detection model or the detection method of hand keypoints.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory 904 including instructions executable by the processor 920 of the electronic device 900 to perform the training method of the hand keypoint detection model or the hand keypoint detection method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
In an exemplary embodiment, a computer program product is also provided, the program product comprising a computer program stored in a readable storage medium, from which at least one processor of an electronic device reads and executes the computer program, such that the electronic device performs the training method of the hand keypoint detection model or the detection method of the hand keypoints described in any of the embodiments of the disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. The training method of the hand key point detection model is characterized by comprising the following steps of:
acquiring three-dimensional position labeling information of hand key points in a real hand sample image; the three-dimensional position labeling information is obtained by identifying the real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises the real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image;
training a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image to obtain a trained second hand key point detection model; the first hand key point detection model trained in advance is a teacher model, and the trained second hand key point detection model is a student model.
2. The method for training a hand keypoint detection model according to claim 1, wherein training the second hand keypoint detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand keypoints in the real hand sample image to obtain a trained second hand keypoint detection model comprises:
inputting the real hand sample image into a second hand key point detection model to be trained, and obtaining three-dimensional position prediction information of hand key points in the real hand sample image;
determining a loss value of the second hand key point detection model to be trained according to the difference value between the three-dimensional position prediction information and the three-dimensional position labeling information;
and according to the loss value, adjusting the model parameters of the second hand key point detection model to be trained to obtain the trained second hand key point detection model.
3. The method for training a hand keypoint detection model according to claim 2, wherein the inputting the real hand sample image into a second hand keypoint detection model to be trained, to obtain three-dimensional position prediction information of hand keypoints in the real hand sample image, comprises:
Dividing the real hand sample image to obtain a plurality of key area images in the real hand sample image; the key area image comprises a hand key point;
inputting a plurality of key area images in the real hand sample image into a second hand key point detection model to be trained, and respectively carrying out convolution processing and full connection processing on the plurality of key area images through the second hand key point detection model to be trained to obtain three-dimensional position prediction information of hand key points in the real hand sample image.
4. The method for training a hand keypoint detection model according to claim 1, wherein the pre-trained first hand keypoint detection model is trained by:
respectively inputting the real hand sample image and the virtual hand sample image into a first hand key point detection model to be trained to obtain two-dimensional position prediction information of hand key points in the real hand sample image and three-dimensional position prediction information of hand key points in the virtual hand sample image;
obtaining a first sub-loss value according to the difference value between the two-dimensional position prediction information of the hand key points in the real hand sample image and the two-dimensional position labeling information, obtaining a second sub-loss value according to the difference value between the three-dimensional position prediction information of the hand key points in the virtual hand sample image and the three-dimensional position labeling information, and obtaining a total loss value of the first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value;
And according to the total loss value, adjusting the model parameters of the first hand key point detection model to be trained to obtain the first hand key point detection model trained in advance.
5. The method of training a hand keypoint detection model according to any one of claims 1 to 4, characterized in that said virtual hand sample set is obtained by:
acquiring a plurality of different hand models and background images corresponding to the hand models;
respectively carrying out three-dimensional rendering processing on each hand model on the background image corresponding to each hand model to obtain virtual hand sample images corresponding to each hand model and three-dimensional position labeling information of hand key points in each virtual hand sample image; the virtual hand sample image corresponding to each hand model contains a background image of the corresponding hand model;
and constructing the virtual hand sample set according to the virtual hand sample images and the three-dimensional position labeling information of the hand key points in the virtual hand sample images.
6. The method of training a hand keypoint detection model according to any one of claims 1 to 4, further comprising:
Acquiring a real hand test image, inputting the real hand test image into the trained second hand key point detection model, and obtaining three-dimensional position prediction information of hand key points in the real hand test image; the real hand test image carries three-dimensional position labeling information of corresponding hand key points;
screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image with the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position labeling information being larger than the preset difference value;
retraining the trained second hand key point detection model according to the target hand test image and the three-dimensional position marking information of the hand key points in the target hand test image to obtain a retrained second hand key point detection model;
and updating the trained second hand key point detection model into the retrained second hand key point detection model.
7. The method for detecting the hand key points is characterized by comprising the following steps of:
Collecting hand images;
inputting the hand image into a trained second hand key point detection model to obtain three-dimensional position prediction information of hand key points in the hand image; the trained second hand keypoint detection model is trained according to the training method of the hand keypoint detection model of any one of claims 1 to 6.
8. A training device for a hand keypoint detection model, comprising:
an information acquisition unit configured to perform acquisition of three-dimensional position annotation information of hand key points in a real hand sample image; the three-dimensional position labeling information is obtained by identifying the real hand sample image through a pre-trained first hand key point detection model, and the pre-trained first hand key point detection model is obtained by training a real hand sample set and a virtual hand sample set; the real hand sample set comprises the real hand sample image and two-dimensional position labeling information of hand key points in the real hand sample image, and the virtual hand sample set comprises a virtual hand sample image and three-dimensional position labeling information of hand key points in the virtual hand sample image;
The first training unit is configured to perform training on a second hand key point detection model to be trained according to the real hand sample image and the three-dimensional position labeling information of the hand key points in the real hand sample image, so as to obtain a trained second hand key point detection model; the first hand key point detection model trained in advance is a teacher model, and the trained second hand key point detection model is a student model.
9. The training device of the hand keypoint detection model according to claim 8, wherein the first training unit is further configured to perform inputting the real hand sample image into a second hand keypoint detection model to be trained, to obtain three-dimensional position prediction information of hand keypoints in the real hand sample image; determining a loss value of the second hand key point detection model to be trained according to the difference value between the three-dimensional position prediction information and the three-dimensional position labeling information; and according to the loss value, adjusting the model parameters of the second hand key point detection model to be trained to obtain the trained second hand key point detection model.
10. The training device of the hand keypoint detection model according to claim 9, wherein the first training unit is further configured to perform a segmentation process on the real hand sample image to obtain a plurality of key region images in the real hand sample image; the key area image comprises a hand key point; inputting a plurality of key area images in the real hand sample image into a second hand key point detection model to be trained, and respectively carrying out convolution processing and full connection processing on the plurality of key area images through the second hand key point detection model to be trained to obtain three-dimensional position prediction information of hand key points in the real hand sample image.
11. The training device of a hand keypoint detection model according to claim 8, further comprising a second training unit configured to perform inputting the real hand sample image and the virtual hand sample image into a first hand keypoint detection model to be trained, respectively, to obtain two-dimensional position prediction information of hand keypoints in the real hand sample image, and three-dimensional position prediction information of hand keypoints in the virtual hand sample image; obtaining a first sub-loss value according to the difference value between the two-dimensional position prediction information of the hand key points in the real hand sample image and the two-dimensional position labeling information, obtaining a second sub-loss value according to the difference value between the three-dimensional position prediction information of the hand key points in the virtual hand sample image and the three-dimensional position labeling information, and obtaining a total loss value of the first hand key point detection model to be trained according to the first sub-loss value and the second sub-loss value; and according to the total loss value, adjusting the model parameters of the first hand key point detection model to be trained to obtain the first hand key point detection model trained in advance.
12. The training device of a hand keypoint detection model according to any of the claims 8 to 11, characterized in that the device further comprises a sample set acquisition unit configured to perform acquisition of a plurality of different hand models and corresponding background images of each of the hand models; respectively carrying out three-dimensional rendering processing on each hand model on the background image corresponding to each hand model to obtain virtual hand sample images corresponding to each hand model and three-dimensional position labeling information of hand key points in each virtual hand sample image; the virtual hand sample image corresponding to each hand model contains a background image of the corresponding hand model; and constructing the virtual hand sample set according to the virtual hand sample images and the three-dimensional position labeling information of the hand key points in the virtual hand sample images.
13. The training device of the hand keypoint detection model according to any one of claims 8 to 11, further comprising a model updating unit configured to perform acquisition of a real hand test image, input the real hand test image into the trained second hand keypoint detection model, and obtain three-dimensional position prediction information of hand keypoints in the real hand test image; the real hand test image carries three-dimensional position labeling information of corresponding hand key points; screening out a target hand test image from the real hand test image; the target hand test image is a real hand test image with the difference value between the three-dimensional position prediction information of the included hand key points and the corresponding three-dimensional position labeling information being larger than the preset difference value; retraining the trained second hand key point detection model according to the target hand test image and the three-dimensional position marking information of the hand key points in the target hand test image to obtain a retrained second hand key point detection model; and updating the trained second hand key point detection model into the retrained second hand key point detection model.
14. A hand keypoint detection device, comprising:
a hand image acquisition unit configured to acquire a hand image;
a position information determining unit configured to input the hand image into a trained second hand keypoint detection model to obtain three-dimensional position prediction information of hand keypoints in the hand image, wherein the trained second hand keypoint detection model is trained according to the training method of the hand keypoint detection model of any one of claims 1 to 6.
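At inference time the detection device of claim 14 reduces to a single forward pass. A minimal sketch, again assuming the two-output model used in the earlier sketches and a hand image that has already been cropped and converted to a (3, H, W) float tensor:

import torch

def detect_hand_keypoints_3d(model, hand_image_tensor):
    # hand_image_tensor: preprocessed float tensor of shape (3, H, W).
    model.eval()
    with torch.no_grad():
        _, pred_3d = model(hand_image_tensor.unsqueeze(0))  # assumed (2D, 3D) output pair
    return pred_3d.view(-1, 3)  # one (x, y, z) prediction per hand keypoint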
15. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 7.
16. A storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 1 to 7.
CN202011016465.8A 2020-09-24 2020-09-24 Training method and device of hand key point detection model and electronic equipment Active CN112115894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011016465.8A CN112115894B (en) 2020-09-24 2020-09-24 Training method and device of hand key point detection model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011016465.8A CN112115894B (en) 2020-09-24 2020-09-24 Training method and device of hand key point detection model and electronic equipment

Publications (2)

Publication Number Publication Date
CN112115894A CN112115894A (en) 2020-12-22
CN112115894B true CN112115894B (en) 2023-08-25

Family

ID=73800615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011016465.8A Active CN112115894B (en) 2020-09-24 2020-09-24 Training method and device of hand key point detection model and electronic equipment

Country Status (1)

Country Link
CN (1) CN112115894B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990298B (en) * 2021-03-11 2023-11-24 北京中科虹霸科技有限公司 Key point detection model training method, key point detection method and device
CN113822254B (en) * 2021-11-24 2022-02-25 腾讯科技(深圳)有限公司 Model training method and related device
CN115167673A (en) * 2022-07-06 2022-10-11 中科传媒科技有限责任公司 Method, device, equipment and storage medium for realizing virtual gesture synchronization
CN114972958B (en) * 2022-07-27 2022-10-04 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214282A (en) * 2018-08-01 2019-01-15 中南民族大学 A kind of three-dimension gesture critical point detection method and system neural network based
CN109740567A (en) * 2019-01-18 2019-05-10 北京旷视科技有限公司 Key point location model training method, localization method, device and equipment
CN110163048A (en) * 2018-07-10 2019-08-23 腾讯科技(深圳)有限公司 Identification model training method, recognition methods and the equipment of hand key point
WO2019223582A1 (en) * 2018-05-24 2019-11-28 Beijing Didi Infinity Technology And Development Co., Ltd. Target detection method and system
CN111126268A (en) * 2019-12-24 2020-05-08 北京奇艺世纪科技有限公司 Key point detection model training method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019223582A1 (en) * 2018-05-24 2019-11-28 Beijing Didi Infinity Technology And Development Co., Ltd. Target detection method and system
CN110163048A (en) * 2018-07-10 2019-08-23 腾讯科技(深圳)有限公司 Identification model training method, recognition methods and the equipment of hand key point
CN109214282A (en) * 2018-08-01 2019-01-15 中南民族大学 A kind of three-dimension gesture critical point detection method and system neural network based
CN109740567A (en) * 2019-01-18 2019-05-10 北京旷视科技有限公司 Key point location model training method, localization method, device and equipment
CN111126268A (en) * 2019-12-24 2020-05-08 北京奇艺世纪科技有限公司 Key point detection model training method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on virtual hand interaction technology based on computer vision; Zhang Jing; Wanfang Data; full text *

Also Published As

Publication number Publication date
CN112115894A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN112115894B (en) Training method and device of hand key point detection model and electronic equipment
US20210326587A1 (en) Human face and hand association detecting method and a device, and storage medium
CN110782468B (en) Training method and device of image segmentation model and image segmentation method and device
CN106651955B (en) Method and device for positioning target object in picture
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
RU2577188C1 (en) Method, apparatus and device for image segmentation
CN109859096A (en) Image Style Transfer method, apparatus, electronic equipment and storage medium
CN105809704A (en) Method and device for identifying image definition
CN109410276B (en) Key point position determining method and device and electronic equipment
CN106557759B (en) Signpost information acquisition method and device
CN110211134B (en) Image segmentation method and device, electronic equipment and storage medium
CN107133354B (en) Method and device for acquiring image description information
CN109961094B (en) Sample acquisition method and device, electronic equipment and readable storage medium
CN111666917A (en) Attitude detection and video processing method and device, electronic equipment and storage medium
CN111047526A (en) Image processing method and device, electronic equipment and storage medium
WO2021034211A1 (en) Method and system of transfer of motion of subject from video onto animated character
CN111476057B (en) Lane line acquisition method and device, and vehicle driving method and device
CN111178298A (en) Human body key point detection method and device, electronic equipment and storage medium
CN111724361B (en) Method and device for displaying focus in real time, electronic equipment and storage medium
CN110930351A (en) Light spot detection method and device and electronic equipment
CN110929616B (en) Human hand identification method and device, electronic equipment and storage medium
CN110738267B (en) Image classification method, device, electronic equipment and storage medium
CN112529846A (en) Image processing method and device, electronic equipment and storage medium
CN114581525A (en) Attitude determination method and apparatus, electronic device, and storage medium
CN111046927A (en) Method and device for processing labeled data, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant