CN114185429A - Method for positioning gesture key points or estimating gesture, electronic device and storage medium - Google Patents

Method for positioning gesture key points or estimating gesture, electronic device and storage medium

Info

Publication number
CN114185429A
Authority
CN
China
Prior art keywords
data set
image
gesture
hand
background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111334862.4A
Other languages
Chinese (zh)
Other versions
CN114185429B (en)
Inventor
朱铭德
丛林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yixian Advanced Technology Co ltd
Original Assignee
Hangzhou Yixian Advanced Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yixian Advanced Technology Co ltd filed Critical Hangzhou Yixian Advanced Technology Co ltd
Priority to CN202111334862.4A priority Critical patent/CN114185429B/en
Publication of CN114185429A publication Critical patent/CN114185429A/en
Application granted granted Critical
Publication of CN114185429B publication Critical patent/CN114185429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/187Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods

Abstract

The application relates to a method for gesture key point positioning or posture estimation, an electronic device and a storage medium. The construction process of the gesture key point positioning or posture estimation model comprises the following steps: acquiring a basic data set and training a basic model on it; acquiring image data of gestures in different scenes, determining the position or posture labels of the key points of the hand on the images, and determining the regions where the pixel blocks of the hand are located to obtain a foreground data set; determining a background data set; superposing the data in the foreground data set and the data in the background data set and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or posture estimation model. The method greatly improves the training effect of the data on the model, solves the problem of poor generalization performance of the gesture interaction function in the related art, and improves the generalization performance of the gesture interaction function.

Description

Method for positioning gesture key points or estimating gesture, electronic device and storage medium
Technical Field
The present disclosure relates to the field of virtual reality and augmented reality technologies, and in particular, to a method for positioning or estimating a gesture key point, an electronic device, and a storage medium.
Background
With the development of Virtual Reality (VR) and Augmented Reality (AR) and the continuous expansion of the application scenarios of AR/VR glasses, the gesture interaction function is becoming more and more important.
Because the illumination, texture and user habits of many scenes are not controllable, and deep learning predictions on unseen scenes and data are likewise not controllable, the generalization performance of the gesture interaction function in the related art is poor. Here, generalization performance means that in as many scenes as possible, as many users as possible can use the gesture interaction function normally without an obvious drop in effect.
No effective solution has yet been proposed for the problem of poor generalization performance of the gesture interaction function in the related art.
Disclosure of Invention
The embodiment of the application provides a method for positioning gesture key points or estimating gestures, an electronic device and a storage medium, so as to at least solve the problem of poor generalization performance of gesture interaction functions in the related art.
In a first aspect, an embodiment of the present application provides a method for positioning or estimating a gesture keypoint, where the method includes:
inputting an image containing a gesture to a gesture key point positioning or gesture estimation model to obtain a position or gesture result of each key point of the hand on the image output by the model;
wherein the construction process of the gesture key point positioning or posture estimation model comprises the following steps:
acquiring a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises: images containing a hand, and position or posture labels of the key points of the hand;
acquiring image data of gestures in different scenes, determining positions or posture labels of key points of a hand on the image, and determining a region where pixel blocks of the hand are located to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
superposing the data in the foreground data set and the data in the background data set, and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or posture estimation model.
In a second aspect, an embodiment of the present application further provides a method for positioning or estimating a gesture keypoint, where the method includes:
inputting an image containing a gesture to a gesture key point positioning or gesture estimation model to obtain a position or gesture result of each key point of the hand on the image output by the model;
wherein the construction process of the gesture key point positioning or posture estimation model comprises the following steps:
acquiring a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises: images containing a hand, and position or posture labels of the key points of the hand;
acquiring image data of gestures in different scenes, determining positions or posture labels of key points of a hand on the image, and determining a region where pixel blocks of the hand are located to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
superposing the data in the foreground data set and the data in the background data set, and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set, the target data set and a difficult case data set to obtain the gesture key point positioning or posture estimation model, wherein, in the model training process, the difficult case data set is determined from training data whose error is larger than a preset threshold value.
In some of these embodiments, the maintenance process for the difficult case data set includes:
acquiring the hands in training data whose error is larger than the preset threshold, randomly extracting a corresponding background image from the background data set for each acquired hand, and superposing the hand and the background image to generate a difficult case image, thereby forming the difficult case data set;
and when data is taken out of the difficult case data set, deleting the taken data from the difficult case data set.
In some of these embodiments, the process of fine-tuning the basic model based on the basic data set, the target data set and the difficult case data set includes:
summarizing the basic data set, the target data set and the difficult case data set according to a first preset proportion to obtain a training data set;
and fine-tuning the basic model based on the training data set.
In some of these embodiments, the determining of the foreground data set comprises:
inputting the image data to the basic model, and outputting the position or posture result of each key point of the hand in the image by the basic model; visualizing the result;
and determining an image with the precision meeting the preset requirement according to the result, and determining the area where the pixel block of the hand is located on the image to obtain the foreground data set.
In some embodiments, the process of obtaining image data of gestures in different scenes comprises: fixing the position of the camera and the position of the depth camera, and calibrating internal and external parameters to obtain a camera set; acquiring image data of gestures in different scenes by the camera group, wherein the image data comprises a depth map and an image which are aligned by timestamps;
the process of determining pixel blocks of the hand on the image comprises: aligning the depth map and the image, projecting the position or posture result of each key point of the hand to a pixel to obtain 2D information of each key point on the depth map after alignment, extracting a pixel block of the hand on the image according to the 2D information and a region growing method,
determining a mask area of the hand on the image according to the pixel blocks of the hand on the image; shrinking the mask area inward by a pixels and expanding it outward by b pixels to construct a region to be processed, and performing a fine matting operation on the region to be processed to obtain the foreground data set.
In some of these embodiments, the process of overlaying the data in the foreground data set and the background data set comprises:
performing a transform enhancement operation on pixel blocks of a hand within the region in the foreground dataset to update the foreground dataset;
randomly extracting foreground images from the foreground data set, for each extracted foreground image, randomly extracting a corresponding background image from the background data set,
and overlapping the foreground image and the background image.
In some of these embodiments, the process of determining the background data set comprises:
acquiring the basic data set; acquiring an open source background data set, and removing a background image containing a hand in the open source background data set to update the open source background data set; acquiring a shot scene background data set;
unifying the image size, summarizing the data in the basic data set, the open source background data set and the scene background data set to obtain the background data set,
and, in the case of randomly extracting a corresponding background image from the background data set, randomly extracting the corresponding background image from the background data set according to a second preset proportion among the basic data set, the open source background data set and the scene background data set.
In a third aspect, an embodiment of the present application provides an electronic apparatus, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform the method for positioning or estimating a pose of a gesture key point.
In a fourth aspect, the present application provides a storage medium having a computer program stored therein, where the computer program is configured to execute the method for positioning or posture estimation of a gesture key point when running.
Compared with the related art, the method for gesture key point positioning or posture estimation provided by the embodiments of the application inputs an image containing a gesture to a gesture key point positioning or posture estimation model and obtains the position or posture result of each key point of the hand on the image output by the model. The construction process of the gesture key point positioning or posture estimation model comprises: acquiring a basic data set and training a basic model on it; acquiring image data of gestures in different scenes, determining the position or posture labels of the key points of the hand on the images, and determining the regions where the pixel blocks of the hand are located to obtain a foreground data set; determining a background data set; superposing the data in the foreground data set and the data in the background data set and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or posture estimation model. This greatly improves the training effect of the data on the model, solves the problem of poor generalization performance of the gesture interaction function in the related art, and improves the generalization performance of the gesture interaction function.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of an application environment of a method for gesture keypoint location or pose estimation according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for constructing a gesture keypoint location or pose estimation model according to a first embodiment of the present application;
FIG. 3 is a flow chart of a method of constructing a gesture keypoint location or pose estimation model according to a second embodiment of the present application;
FIG. 4 is a flow chart for determining a foreground data set according to a third embodiment of the present application;
FIG. 5 is a schematic representation of 2D keypoints after visualization according to a third embodiment of the present application;
FIG. 6 is a schematic representation of 3D keypoints after visualization according to a third embodiment of the present application;
FIG. 7 is a schematic representation of a pose after visualization according to a third embodiment of the present application;
FIG. 8 is a flow chart for determining a foreground data set according to a fourth embodiment of the present application;
FIG. 9 is a schematic diagram of a hand mask of an image according to a fourth embodiment of the present application;
FIG. 10 is a diagram illustrating a fourth embodiment of constructing a region to be processed on an image according to the present application;
FIG. 11 is a flow chart of determining a background data set according to a fifth embodiment of the present application;
FIG. 12 is a flow chart of superimposing data in a foreground data set and a background data set according to a sixth embodiment of the present application;
FIG. 13 is a schematic view of a hand mask according to a sixth embodiment of the present application;
FIG. 14 is a schematic illustration of pixel block enhancement according to a sixth embodiment of the present application;
FIG. 15 is a schematic representation of 2D keypoints after visualization according to a sixth embodiment of the application;
fig. 16 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The method for positioning or estimating the gesture key points provided by the present application can be applied to an application environment shown in fig. 1, where fig. 1 is an application environment schematic diagram of the method for positioning or estimating the gesture key points according to the embodiment of the present application, and as shown in fig. 1, when performing the positioning or estimating the gesture key points, the server 101 inputs an image including a gesture to a gesture key point positioning or posture estimation model, and obtains a position or posture result of each key point of the hand on the image output by the model; the server 101 may be implemented by a stand-alone server or a server cluster composed of a plurality of servers. The hand referred to in the present application includes a hand and an arm.
The embodiment provides a method for positioning or estimating gesture key points, which comprises the following steps: inputting an image containing a gesture to a gesture key point positioning or posture estimation model, and obtaining a position or posture result of each key point of a hand on the image output by the model, wherein fig. 2 is a flowchart of a method for constructing the gesture key point positioning or posture estimation model according to a first embodiment of the present application, and as shown in fig. 2, the flowchart includes the following steps:
step S201, acquiring a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises images containing a hand and position or posture labels of the key points of the hand. Optionally, images containing gestures in different scenes can be collected by photographing, and the position or posture information of the key points of the gestures is labeled on the images to obtain the basic data set; the data volume of the basic data set can be about 500,000 images. The present application does not limit the specific training method of the basic model or the output form of its labels; for example, the output form may be 2D or 3D key points, or the posture (including rotation and translation) of each joint;
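The application does not fix a particular training recipe. Purely as an illustrative sketch (the backbone, the number of key points, the loss and the optimizer below are assumptions, not taken from this application), a 2D key point regressor could be trained on the basic data set roughly as follows:

```python
import torch
import torch.nn as nn

NUM_KEYPOINTS = 21  # illustrative assumption for the number of hand key points

class KeypointRegressor(nn.Module):
    """Tiny 2D key point regressor used only to sketch the base-model training."""
    def __init__(self, num_keypoints=NUM_KEYPOINTS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_keypoints * 2)  # (x, y) per key point

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.head(f).view(-1, NUM_KEYPOINTS, 2)

def train_base_model(loader, epochs=10, device="cpu"):
    """loader yields (images, keypoints) with keypoints of shape (B, 21, 2), normalized to [0, 1]."""
    model = KeypointRegressor().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for images, keypoints in loader:
            pred = model(images.to(device))
            loss = criterion(pred, keypoints.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```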
step S202, acquiring image data of gestures in different scenes, determining positions or posture labels of key points of the hand on the image, and determining a region where pixel blocks of the hand are located to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
for example, the process of acquiring image data of gestures in different scenes may be: fixing the positions of the camera (such as a color camera, a grayscale camera, an infrared camera, etc.) and the depth camera required by the target scene and calibrating their intrinsic and extrinsic parameters; using this fixed camera group to collect various gestures in scenes where performance is relatively good, and re-collecting and replacing data whose precision in the basic data set is insufficient, wherein the collected data comprise a depth map and a target image aligned by timestamps. As another example, if the image data is a color image, the data can be collected directly in front of a green screen to obtain image data of gestures in different scenes;
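Purely as an illustration of the timestamp alignment mentioned above (the frame representation and the tolerance are assumptions), depth frames and camera frames could be paired by nearest timestamp:

```python
def align_by_timestamp(color_frames, depth_frames, max_dt=0.010):
    """Pair each color frame with the depth frame closest in time.

    color_frames / depth_frames: lists of (timestamp_seconds, frame) tuples,
    both sorted by timestamp. Pairs farther apart than max_dt seconds are dropped.
    """
    pairs, j = [], 0
    for t_c, color in color_frames:
        # advance the depth pointer while the next depth frame is closer in time
        while j + 1 < len(depth_frames) and \
                abs(depth_frames[j + 1][0] - t_c) <= abs(depth_frames[j][0] - t_c):
            j += 1
        t_d, depth = depth_frames[j]
        if abs(t_d - t_c) <= max_dt:
            pairs.append((color, depth))
    return pairs
```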
step S203, superposing the data in the foreground data set and the data in the background data set, and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or posture estimation model. The harmonization method is not limited; BargainNet, DoveNet, etc. may be used, and they are not described further herein.
Through steps S201 to S203, compared with the problem of poor generalization performance of the gesture interaction function in the prior art, the construction process of the gesture key point positioning or posture estimation model used in this embodiment dynamically generates the target data set during the training stage by randomly selecting and combining data from the foreground data set and the background data set, and deletes it from memory after a single training pass, which reduces the storage space that the model construction process occupies on the system. In addition, the synthesized images are harmonized in this embodiment to reduce the difference between the hand and the background, which greatly improves how realistic the images look, prevents the neural network from learning features that do not exist in practice, and eliminates the influence of such non-existent features on the training effect, thereby guiding the neural network to learn features with strong universality. As a result, the gesture key point positioning or posture estimation model of this embodiment performs more stably when used by different users in different scenes, can accurately recognize gestures in different scenes, and solves the problem of poor generalization performance of the gesture interaction function in the related art.
Fig. 3 is a flowchart of a method for constructing a gesture keypoint localization or pose estimation model according to a second embodiment of the present application, and as shown in fig. 3, the flowchart includes the following steps:
step S301, acquiring a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises images containing a hand and position or posture labels of the key points of the hand;
step S302, acquiring image data of gestures in different scenes, determining positions or posture labels of key points of the hand on the image, and determining a region where pixel blocks of the hand are located to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
step S303, superposing the data in the foreground data set and the data in the background data set, and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set, the target data set and a difficult case data set to obtain the gesture key point positioning or posture estimation model, wherein the difficult case data set is determined, during the model training process, from training data whose error is larger than a preset threshold value.
Through steps S301 to S303, compared with the problem of poor generalization performance of the gesture interaction function in the prior art, this embodiment not only achieves the beneficial effects described in the first embodiment, but also strengthens the neural network's learning of difficult cases: data with a large error during the model training process are selected and used for training multiple times, which in effect increases the training weight of the difficult case data and guides the neural network to learn features of high complexity. This further improves the training effect of the neural network, so that the gesture key point positioning or posture estimation model of this embodiment is more stable when used by different users in different scenes, can accurately recognize gestures in different scenes, and solves the problem of poor generalization performance of the gesture interaction function in the related art.
In some of these embodiments, fig. 4 is a flowchart of determining a foreground data set according to the third embodiment of the present application, and as shown in fig. 4, the flowchart includes the following steps:
step S401, inputting image data to a basic model, and outputting the position or posture result of each key point of the hand in the image by the basic model; visualizing the position or pose; for example, the acquired image data is labeled with the base model: inputting each frame of image into the basic model to obtain the position or posture information of each key point of the hand corresponding to the image; visualizing the obtained information for screening;
step S402, determining an image with the precision meeting the preset requirement according to the visualized position or posture, and determining the area where the pixel block of the hand is located on the image to obtain a foreground data set;
the process of determining the images whose precision meets the preset requirement may be as follows: fig. 5 is a schematic diagram of visualized 2D key points according to the third embodiment of the present application, fig. 6 is a schematic diagram of visualized 3D key points according to the third embodiment of the present application, and fig. 7 is a schematic diagram of a visualized posture according to the third embodiment of the present application; data whose accuracy meets the training requirements are manually screened according to the visualized images shown in fig. 5 to 7. Data with low precision can be directly discarded, or a part of them can be selected for manual labeling. In this way, a group of images whose precision is still usable is obtained, together with their key points or postures and the depth maps captured at the same time as the images;
the process of determining the region where the pixel block of the hand is located on the image may be: performing pixel alignment between the depth map and the image, projecting the key points or posture onto pixels to obtain 2D key point information on the aligned depth map, extracting the pixel blocks of the hand and the arm with a region-growing method according to the 2D key point information, and then determining the mask information of the hand. If the scene is gesture recognition from a first-person view, all pixels within a certain distance can be segmented directly according to depth and projected onto the corresponding image according to the intrinsic and extrinsic parameters to obtain the mask information of the image. It should be noted that, if the image data is a color image, the coarse mask information does not have to be obtained with the depth camera; instead, the mask information may be extracted with relatively mature green-screen matting algorithms.
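A minimal sketch of the region-growing step on the aligned depth map (the seed handling, depth tolerance and 4-connectivity are assumptions, not requirements of this application):

```python
import numpy as np
from collections import deque

def grow_hand_mask(depth, seeds, depth_tol=20, max_depth=1500):
    """Region growing on an aligned depth map (uint16, millimetres).

    depth: (H, W) depth image aligned to the color image.
    seeds: list of (x, y) key point positions projected onto the depth map.
    depth_tol: maximum depth difference (mm) allowed between neighbouring pixels.
    max_depth: pixels farther than this are treated as background and never added.
    """
    h, w = depth.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    queue = deque()
    for x, y in seeds:
        x, y = int(round(x)), int(round(y))
        if 0 <= x < w and 0 <= y < h and 0 < depth[y, x] < max_depth:
            mask[y, x] = 1
            queue.append((x, y))
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-connectivity
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h and not mask[ny, nx]:
                d = depth[ny, nx]
                if 0 < d < max_depth and abs(int(d) - int(depth[y, x])) <= depth_tol:
                    mask[ny, nx] = 1
                    queue.append((nx, ny))
    return mask
```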
Considering that, in the case of using a depth camera, there is a certain error at the edge of the hand due to the accuracy of the depth camera, and the obtained mask accuracy needs to be further improved, in some embodiments, fig. 8 is a flowchart for determining a foreground data set according to the fourth embodiment of the present application, and as shown in fig. 8, after determining a pixel block of the hand on an image, the flowchart includes the following steps:
step S801, determining a mask area of the hand on the image according to the pixel block of the hand on the image;
s802, shrinking a pixels in the mask area and expanding b pixels to construct an area to be processed, and performing fine matting operation on the area to be processed to obtain a foreground data set; therefore, the mask accuracy of the acquired hand is ensured through fine matting;
for example, fig. 9 is a schematic diagram of a hand mask of an image according to the fourth embodiment of the present application. As shown in fig. 9, the region where the pixel block of the hand is located is determined on the image; the mask obtained at this point is relatively coarse. Fig. 10 is a schematic diagram of constructing a region to be processed on the image according to the fourth embodiment of the present application. As shown in fig. 10, the obtained mask is shrunk inward by a pixels, and the pixels inside (i.e., the white region) are considered a hand with high confidence; the mask is expanded outward by b pixels, and the part outside this contour (i.e., the black region) is considered stable background. The middle area (i.e., the gray region) may be either hand or background; it is the region to be processed and the part that needs fine matting;
specifically, the fine matting operation may use the GrabCut matting algorithm, which is only an illustration and not a limitation; in other embodiments, other matting algorithms, such as KNN matting, may also be used. The matting algorithm is initialized with the region to be processed to complete the fine matting task, thereby obtaining the foreground data set, which comprises groups of data, each group comprising: the image, the hand mask of the image, and the position or posture of each key point of the hand on the image.
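A minimal sketch of the trimap construction and mask-initialized GrabCut refinement using OpenCV; the values of a and b, the kernel and the iteration count are illustrative assumptions, and another matting algorithm such as KNN matting could be substituted:

```python
import cv2
import numpy as np

def refine_hand_mask(image_bgr, coarse_mask, a=5, b=5, iters=5):
    """Refine a coarse hand mask with GrabCut initialized from a trimap.

    coarse_mask: uint8 mask, 1 = hand (coarse), 0 = background.
    a: pixels the mask is shrunk inward (sure foreground).
    b: pixels the mask is expanded outward (everything beyond is sure background).
    """
    kernel = np.ones((3, 3), np.uint8)
    sure_fg = cv2.erode(coarse_mask, kernel, iterations=a)   # white region: confident hand
    maybe = cv2.dilate(coarse_mask, kernel, iterations=b)    # grey band: to be resolved

    gc_mask = np.full(coarse_mask.shape, cv2.GC_BGD, np.uint8)  # outside: stable background
    gc_mask[maybe > 0] = cv2.GC_PR_FGD                          # band: possible hand
    gc_mask[sure_fg > 0] = cv2.GC_FGD                           # inside: definite hand

    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, gc_mask, None, bgd_model, fgd_model, iters,
                cv2.GC_INIT_WITH_MASK)
    fine_mask = np.where((gc_mask == cv2.GC_FGD) | (gc_mask == cv2.GC_PR_FGD), 1, 0)
    return fine_mask.astype(np.uint8)
```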
In some of these embodiments, fig. 11 is a flowchart of determining a background data set according to a fifth embodiment of the present application, as shown in fig. 11, the flowchart includes the following steps:
step S1101, acquiring the basic data set; acquiring an open source background data set, and removing background images containing large hands from the open source background data set to update the open source background data set; and acquiring a photographed scene background data set;
the basic data set is the same one used to train the basic model; it contains not only some background information but also some labeled hand information, so hand data are retained in part of the background data set in preparation for obtaining images with crossed or overlapping hands during subsequent data synthesis, which further improves the generalization performance of the gesture interaction function. The open source background data set consists of existing data sets containing various scenes or backgrounds, for example LSUN/COCO/Google-Landmarks, etc.; images containing large human hands need to be removed from the open source background data set before it is used as background. Specifically, an existing gesture algorithm can be run once over the acquired open source images: if an image contains a hand it is rejected, and if it does not contain a hand it is retained; alternatively, the images to be retained can be determined by manual screening. The scene background data set consists of background images acquired with the actually used camera in scenes that may actually occur, and no hand is needed in these scenes. It should be noted that this embodiment is a preferred embodiment; in other embodiments, only one or two of the basic data set, the open source background data set and the scene background data set may be obtained;
step S1102, unifying the image size, and summarizing the data in the basic data set, the open source background data set and the scene background data set to obtain a background data set;
for example, the images in the basic data set, the open source background data set and the scene background data set are processed to a uniform size; specifically, the size may be 640 × 480. After the image sizes are unified, the data in the basic data set, the open source background data set and the scene background data set are summarized to obtain the background data set. It should be noted that, when a background image is randomly extracted from the background data set, it may be randomly extracted according to a second preset proportion among the basic data set, the open source background data set and the scene background data set; the second preset proportion is not limited and may be determined according to the actual scene. For example, in this embodiment, the second preset proportion of the basic data set, the open source background data set and the scene background data set may be 0.2:0.5:0.3.
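A minimal sketch of drawing one background image according to the second preset proportion (the data sets are represented here simply as lists of images; the 0.2:0.5:0.3 split follows the example above):

```python
import random

def sample_background(basic_set, open_source_set, scene_set,
                      ratios=(0.2, 0.5, 0.3)):
    """Randomly pick one background image: first choose the source data set
    according to the second preset proportion, then an image uniformly within it."""
    sources = [basic_set, open_source_set, scene_set]
    source = random.choices(sources, weights=ratios, k=1)[0]
    return random.choice(source)
```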
In some of these embodiments, fig. 12 is a flowchart of superimposing data in a foreground data set and a background data set according to a sixth embodiment of the application, as shown in fig. 12, the flowchart comprising the steps of:
step S1201, performing a transformation enhancement operation on pixel blocks of the hand within the region in the foreground dataset to update the foreground dataset;
optionally, fig. 13 is a schematic diagram of a hand mask according to the sixth embodiment of the present application, and fig. 14 is a schematic diagram of pixel block enhancement according to the sixth embodiment of the present application. As shown in fig. 13 and fig. 14, the pixel blocks of the image inside the mask may be extracted and some data enhancement may be performed according to the actual task requirements; for example, the enhancement may be translation, rotation, scaling, stretching, brightness adjustment, contrast adjustment, and the like. It should be noted that fig. 15 is a schematic diagram of visualized 2D key points according to the sixth embodiment of the present application; as shown in fig. 15, the positions or postures of the key points of the hand are transformed together with the enhancement applied to the hand;
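A minimal sketch of such an enhancement step, applying one random similarity transform to the hand pixel block, its mask and its 2D key points so that the labels stay consistent with the pixels (the parameter ranges are illustrative assumptions):

```python
import cv2
import numpy as np
import random

def augment_hand(patch, mask, keypoints_2d):
    """patch: (H, W, 3) hand pixel block; mask: (H, W) uint8; keypoints_2d: (N, 2)."""
    h, w = mask.shape
    angle = random.uniform(-30, 30)          # degrees
    scale = random.uniform(0.8, 1.2)
    tx, ty = random.uniform(-0.1, 0.1) * w, random.uniform(-0.1, 0.1) * h

    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    m[0, 2] += tx
    m[1, 2] += ty

    patch_aug = cv2.warpAffine(patch, m, (w, h))
    mask_aug = cv2.warpAffine(mask, m, (w, h), flags=cv2.INTER_NEAREST)

    pts = np.hstack([keypoints_2d, np.ones((len(keypoints_2d), 1))])  # homogeneous coords
    kps_aug = (m @ pts.T).T                                           # same transform on labels

    # simple brightness / contrast jitter on the pixel block only
    alpha, beta = random.uniform(0.8, 1.2), random.uniform(-20, 20)
    patch_aug = cv2.convertScaleAbs(patch_aug, alpha=alpha, beta=beta)
    return patch_aug, mask_aug, kps_aug
```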
step S1202, randomly extracting foreground images from a foreground data set, and randomly extracting corresponding background images from a background data set for each extracted foreground image;
step S1203, superimposing the foreground image and the background image, for example, the foreground image of the hand may be overlaid on the background image in an overlaying manner;
it should be noted that, when the case of two hands crossing or overlapping is taken into account and data from the basic data set are added to the background data set, the background image determined in this way carries its own hand labels from the basic data set. In this case, the position or posture labels of the key points of the hand in the target data set need to include both the hand labels of the basic data set image used as background and the labels of the hand newly superposed as foreground; both serve as hand labels in the target data set. By synthesizing image data with crossed and overlapping hands, the gesture key point positioning or posture estimation method using this model can also handle two-hand gestures, which further improves the generalization performance of the gesture interaction function.
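A minimal sketch of the overlay and label-merging step (array shapes and the placement argument are assumptions; a harmonization network such as DoveNet or BargainNet would be applied to the composed image afterwards and is only indicated by a comment here):

```python
import numpy as np

def compose_sample(background, bg_hand_labels, fg_patch, fg_mask, fg_labels,
                   top_left=(0, 0)):
    """Overlay a hand foreground onto a background image and merge labels.

    background: (H, W, 3). bg_hand_labels: key point / pose labels already present
    in the background (e.g. when a basic-data-set image is reused), or an empty list.
    fg_patch / fg_mask: hand pixel block and its mask; fg_labels: (N, 2) key points.
    Assumes the foreground fits inside the background at top_left.
    """
    out = background.copy()
    x0, y0 = top_left
    h, w = fg_mask.shape
    roi = out[y0:y0 + h, x0:x0 + w]
    m = fg_mask[..., None].astype(bool)
    roi[:] = np.where(m, fg_patch, roi)          # cover the background with the hand

    # shift the foreground labels into the background coordinate frame
    shifted = fg_labels.copy()
    shifted[:, 0] += x0
    shifted[:, 1] += y0
    merged_labels = list(bg_hand_labels) + [shifted]  # keep both hands' labels
    # (image harmonization of `out` would be performed here)
    return out, merged_labels
```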
In some embodiments, after the training data are obtained and before the training data set is input to the basic model, a transform enhancement operation is further performed on the image data in the training data set to update the training data set. Specifically, some image enhancement methods may be selected according to the actual task, for example random cropping, affine transformation, rotation, flipping, contrast transformation, brightness transformation, stretching, and the like, and the final labels are modified correspondingly;
because data synthesis is completed during the training stage, and the enhancement operations applied at synthesis time make the obtainable data theoretically unlimited, this embodiment can further improve generalization performance compared with the offline data generation of the related art, and it depends far less on storage space. For example, storing ten thousand background images and ten thousand foreground images already yields on the order of one hundred million combined images before any enhancement is considered, and applying enhancement operations such as random cropping, affine transformation, rotation, flipping, contrast transformation, brightness transformation and stretching to them yields a massive amount of image data, so this embodiment puts far less storage pressure on the system. In addition, because these massive image data differ from one another, training never has to reuse exactly the same data, which further ensures the generalization performance of the gesture interaction function.
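A minimal sketch of performing the synthesis on the fly inside a training data set object, so that composed images exist only for the step that uses them (the class reuses the augment_hand and compose_sample helpers sketched above; the epoch size and interfaces are assumptions):

```python
import random
from torch.utils.data import Dataset

class SynthesizedGestureDataset(Dataset):
    """Composes foreground hands onto random backgrounds at __getitem__ time,
    so nothing synthetic has to be stored on disk between training steps."""

    def __init__(self, foregrounds, backgrounds, epoch_size=100000):
        # foregrounds: list of (patch, mask, labels); backgrounds: list of images
        self.foregrounds = foregrounds
        self.backgrounds = backgrounds
        self.epoch_size = epoch_size

    def __len__(self):
        return self.epoch_size

    def __getitem__(self, _):
        patch, mask, labels = random.choice(self.foregrounds)
        background = random.choice(self.backgrounds)
        patch, mask, labels = augment_hand(patch, mask, labels)       # sketched earlier
        image, labels = compose_sample(background, [], patch, mask, labels)
        # a harmonization step and further whole-image enhancement would follow here
        return image, labels
```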
In some of these embodiments, the process of fine-tuning the basic model based on the basic data set, the target data set and the difficult case data set includes: summarizing the basic data set, the target data set and the difficult case data set according to a first preset proportion to obtain a training data set, and fine-tuning the basic model based on the training data set. The proportion of the three kinds of data can be adjusted according to the actual situation; a reasonable proportion is 4:1:1, i.e., if a batch has 192 images, 128 images are extracted from the basic data set, 32 from the target data set, and 32 from the difficult case data set, and if the difficult case data set is empty, data from the target data set can be used instead;
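A minimal sketch of assembling one batch according to the first preset proportion, with the fallback to target data when the difficult case data set is empty (the data-set and queue representations are assumptions):

```python
import random

def build_batch(basic_set, target_set, hard_queue, batch_size=192, ratio=(4, 1, 1)):
    """Mix one batch from the three data sets according to the first preset proportion."""
    total = sum(ratio)
    n_basic = batch_size * ratio[0] // total      # 128 for a 192-image batch
    n_target = batch_size * ratio[1] // total     # 32
    n_hard = batch_size - n_basic - n_target      # 32

    batch = random.sample(basic_set, n_basic) + random.sample(target_set, n_target)
    for _ in range(n_hard):
        if hard_queue:
            batch.append(hard_queue.popleft())       # taken data are removed from the set
        else:
            batch.append(random.choice(target_set))  # no difficult data yet: use target data
    random.shuffle(batch)
    return batch
```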
the method for constructing the difficult case data set comprises the following steps: in the model training process, the error of the training data is calculated; once the error is larger than a preset threshold (the threshold can be adjusted according to actual requirements), the hand data in those training data are acquired, a corresponding background image is randomly extracted from the background data set for each acquired piece of hand data, the hand data and the background image are superposed and harmonized to obtain difficult case data, and the difficult case data are appended to the end of the data sequence of the difficult case data set. Each time data are taken out, they are taken from the head of the sequence and deleted from the difficult case data set after being taken. In this way the training weight of the difficult case data is in effect increased, training on the difficult case sequence is strengthened, and a new background is used each time, which ensures the generalization performance of the model.
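A minimal sketch of that maintenance logic as a FIFO queue (the error measure, the threshold value and the reuse of the compose_sample helper from the earlier sketch are assumptions):

```python
import random
from collections import deque

hard_queue = deque()   # difficult case data set; build_batch consumes it from the head

def update_hard_examples(samples, errors, backgrounds, threshold=0.05):
    """Push newly synthesized difficult cases to the tail of the queue.

    samples: list of (hand_patch, hand_mask, labels) used in the last training step.
    errors: per-sample training error; samples above `threshold` are reused.
    """
    for (patch, mask, labels), err in zip(samples, errors):
        if err > threshold:
            background = random.choice(backgrounds)      # a fresh random background each time
            image, merged = compose_sample(background, [], patch, mask, labels)
            # (harmonization of the composed image would be applied here)
            hard_queue.append((image, merged))           # appended to the end of the sequence
```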
In addition, in combination with the method for positioning or estimating the gesture keypoints in the foregoing embodiments, the embodiments of the present application may provide a storage medium to implement. The storage medium having stored thereon a computer program; the computer program, when executed by a processor, implements any of the above-described embodiments of a method for gesture keypoint location or pose estimation.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of gesture keypoint location or pose estimation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In one embodiment, fig. 16 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application, and as shown in fig. 16, there is provided an electronic device, which may be a server, and an internal structure diagram of which may be as shown in fig. 16. The electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor is used for providing calculation and control capability, the network interface is used for communicating with an external terminal through network connection, the internal memory is used for providing an environment for an operating system and the running of a computer program, the computer program is executed by the processor to realize a method for positioning or estimating the gesture key point, and the database is used for storing data.
Those skilled in the art will appreciate that the structure shown in fig. 16 is a block diagram of only a portion of the structure relevant to the present application, and does not constitute a limitation on the electronic device to which the present application is applied, and a particular electronic device may include more or less components than those shown in the drawings, or combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of gesture keypoint location or pose estimation, the method comprising:
inputting an image containing a gesture to a gesture key point positioning or gesture estimation model to obtain a position or gesture result of each key point of the hand on the image output by the model;
wherein the construction process of the gesture key point positioning or posture estimation model comprises the following steps:
acquiring a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises: images containing a hand, and position or posture labels of the key points of the hand;
acquiring image data of gestures in different scenes, determining positions or posture labels of key points of a hand on the image, and determining a region where pixel blocks of the hand are located to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
superposing the data in the foreground data set and the data in the background data set, and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or posture estimation model.
2. A method of gesture keypoint location or pose estimation, the method comprising:
inputting an image containing a gesture to a gesture key point positioning or gesture estimation model to obtain a position or gesture result of each key point of the hand on the image output by the model;
wherein the construction process of the gesture key point positioning or posture estimation model comprises the following steps:
acquiring a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises: images containing a hand, and position or posture labels of the key points of the hand;
acquiring image data of gestures in different scenes, determining positions or posture labels of key points of a hand on the image, and determining a region where pixel blocks of the hand are located to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
superposing the data in the foreground data set and the data in the background data set, and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set, the target data set and a difficult case data set to obtain the gesture key point positioning or posture estimation model, wherein, in the model training process, the difficult case data set is determined from training data whose error is larger than a preset threshold value.
3. The method of claim 2, wherein the maintenance process for the difficult case data set comprises:
acquiring the hands in training data whose error is larger than the preset threshold, randomly extracting a corresponding background image from the background data set for each acquired hand, and superposing the hand and the background image to generate a difficult case image, thereby forming the difficult case data set;
and when data is taken out of the difficult case data set, deleting the taken data from the difficult case data set.
4. The method of claim 3, wherein the process of fine-tuning the basic model based on the basic data set, the target data set and the difficult case data set comprises:
summarizing the basic data set, the target data set and the difficult case data set according to a first preset proportion to obtain a training data set;
and fine-tuning the basic model based on the training data set.
5. The method of claim 2, wherein the determining of the foreground data set comprises:
inputting the image data to the basic model, and outputting the position or posture result of each key point of the hand in the image by the basic model; visualizing the result;
and determining an image with the precision meeting the preset requirement according to the result, and determining the area where the pixel block of the hand is located on the image to obtain the foreground data set.
6. The method of claim 5, wherein the process of obtaining image data of gestures in different scenes comprises: fixing the position of the camera and the position of the depth camera, and calibrating internal and external parameters to obtain a camera set; acquiring image data of gestures in different scenes by the camera group, wherein the image data comprises a depth map and an image which are aligned by timestamps;
the process of determining pixel blocks of the hand on the image comprises: aligning the depth map and the image, projecting the position or posture result of each key point of the hand to a pixel to obtain 2D information of each key point on the depth map after alignment, extracting a pixel block of the hand on the image according to the 2D information and a region growing method,
determining a mask area of the hand on the image according to the pixel blocks of the hand on the image; shrinking the mask area inward by a pixels and expanding it outward by b pixels to construct a region to be processed, and performing a fine matting operation on the region to be processed to obtain the foreground data set.
7. The method of claim 2, wherein the process of superimposing the data in the foreground data set and the background data set comprises:
performing a transform enhancement operation on pixel blocks of a hand within the region in the foreground dataset to update the foreground dataset;
randomly extracting foreground images from the foreground data set, for each extracted foreground image, randomly extracting a corresponding background image from the background data set,
and overlapping the foreground image and the background image.
8. The method of claim 3 or 7, wherein the determining a background data set comprises:
acquiring the basic data set; acquiring an open source background data set, and removing a background image containing a hand in the open source background data set to update the open source background data set; acquiring a shot scene background data set;
unifying the image size, summarizing the data in the basic data set, the open source background data set and the scene background data set to obtain the background data set,
and, in the case of randomly extracting a corresponding background image from the background data set, randomly extracting the corresponding background image from the background data set according to a second preset proportion among the basic data set, the open source background data set and the scene background data set.
9. An electronic device comprising a memory and a processor, wherein the memory has stored thereon a computer program, and the processor is configured to execute the computer program to perform the method of gesture keypoint location or pose estimation of any of claims 1 to 8.
10. A storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the method of gesture keypoint location or pose estimation of any of claims 1 to 8 when run.
CN202111334862.4A 2021-11-11 2021-11-11 Gesture key point positioning or gesture estimating method, electronic device and storage medium Active CN114185429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111334862.4A CN114185429B (en) 2021-11-11 2021-11-11 Gesture key point positioning or gesture estimating method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111334862.4A CN114185429B (en) 2021-11-11 2021-11-11 Gesture key point positioning or gesture estimating method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114185429A true CN114185429A (en) 2022-03-15
CN114185429B CN114185429B (en) 2024-03-26

Family

ID=80601543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111334862.4A Active CN114185429B (en) 2021-11-11 2021-11-11 Gesture key point positioning or gesture estimating method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114185429B (en)

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130069867A1 (en) * 2010-06-01 2013-03-21 Sayaka Watanabe Information processing apparatus and method and program
US20140118335A1 (en) * 2012-10-30 2014-05-01 Primesense Ltd. Depth mapping with enhanced resolution
US20190034714A1 (en) * 2016-02-05 2019-01-31 Delphi Technologies, Llc System and method for detecting hand gestures in a 3d space
US20180122114A1 (en) * 2016-08-19 2018-05-03 Beijing Sensetime Technology Development Co., Ltd. Method and apparatus for processing video image and electronic device
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
US20190107894A1 (en) * 2017-10-07 2019-04-11 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN109635621A (en) * 2017-10-07 2019-04-16 塔塔顾问服务有限公司 For the system and method based on deep learning identification gesture in first person
CN108229318A (en) * 2017-11-28 2018-06-29 北京市商汤科技开发有限公司 The training method and device of gesture identification and gesture identification network, equipment, medium
US20190279524A1 (en) * 2018-03-06 2019-09-12 Digital Surgery Limited Techniques for virtualized tool interaction
US20200387698A1 (en) * 2018-07-10 2020-12-10 Tencent Technology (Shenzhen) Company Limited Hand key point recognition model training method, hand key point recognition method and device
CN109308459A (en) * 2018-09-05 2019-02-05 南京大学 Gesture estimation method based on finger attention model and key point topological model
US20210097270A1 (en) * 2018-10-30 2021-04-01 Beijing Dajia Internet Information Technology Co., Ltd. Method and device for detecting hand gesture key points
CN111199169A (en) * 2018-11-16 2020-05-26 北京微播视界科技有限公司 Image processing method and device
US20210124425A1 (en) * 2019-01-04 2021-04-29 Beijing Dajia Internet Information Technology Co., Ltd. Method and electronic device of gesture recognition
CN111124108A (en) * 2019-11-22 2020-05-08 Oppo广东移动通信有限公司 Model training method, gesture control method, device, medium and electronic equipment
CN111062263A (en) * 2019-11-27 2020-04-24 杭州易现先进科技有限公司 Method, device, computer device and storage medium for hand pose estimation
CN111209861A (en) * 2020-01-06 2020-05-29 浙江工业大学 Dynamic gesture action recognition method based on deep learning
WO2021189847A1 (en) * 2020-09-03 2021-09-30 平安科技(深圳)有限公司 Training method, apparatus and device based on image classification model, and storage medium
CN112836597A (en) * 2021-01-15 2021-05-25 西北大学 Multi-hand posture key point estimation method based on cascade parallel convolution neural network
CN112749512A (en) * 2021-01-18 2021-05-04 杭州易现先进科技有限公司 Method and system for optimizing gesture estimation and electronic device
CN113221738A (en) * 2021-05-11 2021-08-06 广州虎牙科技有限公司 Gesture recognition method and device, electronic equipment and readable storage medium
CN113393563A (en) * 2021-05-26 2021-09-14 杭州易现先进科技有限公司 Method, system, electronic device and storage medium for automatically labeling key points

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG CHENG; HE JIAN; WANG WEIDONG: "Traffic police command gesture recognition fusing spatial context and temporal features", Acta Electronica Sinica, No. 05 *

Also Published As

Publication number Publication date
CN114185429B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN109829930B (en) Face image processing method and device, computer equipment and readable storage medium
JP7114774B2 (en) Face fusion model training method, apparatus and electronic equipment
CN107507217B (en) Method and device for making certificate photo and storage medium
US20180204052A1 (en) A method and apparatus for human face image processing
US8660305B2 (en) Method and apparatus for removing a visual object from a visual data stream
US11954828B2 (en) Portrait stylization framework using a two-path image stylization and blending
EP3091510B1 (en) Method and system for producing output images
CN106598235B (en) Gesture identification method, device and virtual reality device for virtual reality device
CN111882627A (en) Image processing method, video processing method, device, equipment and storage medium
CN109886144A (en) Virtual examination forwarding method, device, computer equipment and storage medium
CN111402360A (en) Method, apparatus, computer device and storage medium for generating a human body model
Rodriguez-Pardo et al. Seamlessgan: Self-supervised synthesis of tileable texture maps
US20210312599A1 (en) Automatic synthesis of a content-aware sampling region for a content-aware fill
CN108564058B (en) Image processing method and device and computer readable storage medium
CN113593001A (en) Target object three-dimensional reconstruction method and device, computer equipment and storage medium
AU2019200269B2 (en) An interactive user interface and its corresponding engine for improving image completion quality
KR101582225B1 (en) System and method for providing interactive augmented reality service
CN114185429A (en) Method for positioning gesture key points or estimating gesture, electronic device and storage medium
CN113239867B (en) Mask area self-adaptive enhancement-based illumination change face recognition method
CN115457206A (en) Three-dimensional model generation method, device, equipment and storage medium
US10586311B2 (en) Patch validity test
CN110084744B (en) Image processing method, image processing device, computer equipment and storage medium
CN116310659B (en) Training data set generation method and device
US20230326137A1 (en) Garment rendering techniques
CN116977539A (en) Image processing method, apparatus, computer device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant