CN114185429B - Gesture key point positioning or gesture estimation method, electronic device and storage medium - Google Patents

Gesture key point positioning or gesture estimation method, electronic device and storage medium

Info

Publication number
CN114185429B
CN114185429B CN202111334862.4A CN202111334862A
Authority
CN
China
Prior art keywords
data set
gesture
image
background
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111334862.4A
Other languages
Chinese (zh)
Other versions
CN114185429A (en)
Inventor
朱铭德
丛林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yixian Advanced Technology Co ltd
Original Assignee
Hangzhou Yixian Advanced Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yixian Advanced Technology Co ltd filed Critical Hangzhou Yixian Advanced Technology Co ltd
Priority to CN202111334862.4A priority Critical patent/CN114185429B/en
Publication of CN114185429A publication Critical patent/CN114185429A/en
Application granted granted Critical
Publication of CN114185429B publication Critical patent/CN114185429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/187 - Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/194 - Segmentation; Edge detection involving foreground-background segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods

Abstract

The application relates to a gesture key point positioning or gesture estimation method, an electronic device and a storage medium, wherein the construction process of the gesture key point positioning or gesture estimation model comprises the following steps: acquiring a basic data set, and training a basic model on the basic data set; acquiring image data of gestures in different scenes, determining position or pose labels for each key point of the hand on the images, and determining the region of the hand's pixel block to obtain a foreground data set; determining a background data set; superposing data from the foreground data set and the background data set, and harmonizing the result to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or gesture estimation model. Through the present application, the training effect of the data on the model is greatly improved, the problem of poor generalization performance of gesture interaction functions in the related art is solved, and the generalization performance of the gesture interaction function is improved.

Description

Gesture key point positioning or gesture estimation method, electronic device and storage medium
Technical Field
The present disclosure relates to the field of virtual reality and augmented reality technologies, and in particular, to a method, an electronic device, and a storage medium for gesture key point positioning or gesture estimation.
Background
With the development of Virtual Reality (VR) and Augmented Reality (AR) technologies, and the continuous expansion of the application scenarios of AR/VR glasses, gesture interaction functions are becoming increasingly important.
Since the illumination, textures, and user habits of many scenes are uncontrollable, and the predictions of deep learning on unseen scenes and data are likewise uncontrollable, the generalization performance of gesture interaction functions in the related art is poor. Here, generalization performance means that as many users as possible can use the gesture interaction function normally, in as many scenes as possible, without significant degradation of effect.
No effective solution has yet been proposed for the problem of poor generalization performance of gesture interaction functions in the related art.
Disclosure of Invention
The embodiments of the present application provide a method for gesture key point positioning or gesture estimation, an electronic device and a storage medium, which are intended to at least solve the problem of poor generalization performance of gesture interaction functions in the related art.
In a first aspect, embodiments of the present application provide a method for gesture keypoint location or gesture estimation, the method comprising:
inputting an image containing a gesture to a gesture key point positioning or gesture estimation model to obtain a position or gesture result of each key point of a hand on the image output by the model;
the construction process of the gesture key point positioning or gesture estimation model comprises the following steps:
obtaining a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises: images containing hands, and position or pose labels for each key point of the hand;
acquiring image data of gestures under different scenes, determining the position or gesture labels of key points of the hands on the images, and determining the areas of pixel blocks of the hands to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
superposing the data in the foreground data set and the background data set, and harmonizing the data to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or gesture estimation model.
In a second aspect, embodiments of the present application further provide a method for gesture keypoint positioning or gesture estimation, the method including:
inputting an image containing a gesture to a gesture key point positioning or gesture estimation model to obtain a position or gesture result of each key point of a hand on the image output by the model;
the construction process of the gesture key point positioning or gesture estimation model comprises the following steps:
obtaining a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises: images containing hands, and position or pose labels for each key point of the hand;
acquiring image data of gestures under different scenes, determining the position or gesture labels of key points of the hands on the images, and determining the areas of pixel blocks of the hands to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
superposing the data in the foreground data set and the background data set, and harmonizing the data to obtain a target data set; and fine-tuning the basic model based on the basic data set, the target data set and the difficult-case data set to obtain the gesture key point positioning or gesture estimation model, wherein, during model training, the difficult-case data set is determined from training data whose error is larger than a preset threshold.
In some of these embodiments, the maintenance process of the difficult-case data set includes:
acquiring the hands in training data whose error is larger than a preset threshold, randomly extracting a corresponding background image from the background data set for each acquired hand, and superposing the hand and the background image to generate a difficult-case image, so as to form the difficult-case data set;
when data is fetched from the difficult-case data set, the fetched data is deleted from the difficult-case data set.
In some of these embodiments, the process of fine-tuning the base model based on the basic data set, the target data set and the difficult-case data set includes:
summarizing the basic data set, the target data set and the difficult-case data set according to a first preset proportion to obtain a training data set;
the base model is fine-tuned based on the training data set.
In some of these embodiments, the determining of the foreground data set includes:
inputting the image data to the basic model, and outputting the position or posture result of each key point of the hand in the image by the basic model; visualizing the result;
and determining an image with the precision meeting the preset requirement according to the result, and determining the area of the pixel block of the hand on the image to obtain the foreground data set.
In some of these embodiments, the process of acquiring image data of gestures in different scenes includes: fixing the position of the camera and the position of the depth camera, and calibrating internal and external parameters to obtain a camera set; acquiring image data of gestures in different scenes by the camera group, wherein the image data comprises a depth map and an image with aligned time stamps;
the process of determining the pixel block of a hand on the image includes: aligning the depth map and the image, projecting the position or pose result of each key point of the hand into pixel coordinates to obtain 2D information for each key point on the aligned depth map, and extracting the pixel block of the hand on the image according to the 2D information and a region growing method,
determining a mask area of the hand on the image according to the pixel block of the hand on the image; and shrinking the mask area inwards by a pixels and expanding it outwards by b pixels to construct a region to be processed, and performing a fine matting operation on the region to be processed to obtain the foreground data set.
In some of these embodiments, the process of overlaying the foreground data set with the data in the background data set comprises:
performing a transform enhancement operation on blocks of pixels of a hand within the region in the foreground dataset to update the foreground dataset;
randomly extracting foreground images from the foreground data set, and randomly extracting a corresponding background image from the background data set for each extracted foreground image,
and superposing the foreground image and the background image.
In some of these embodiments, the process of determining the background data set includes:
acquiring the basic data set; acquiring an open source background data set, and removing a background image containing hands in the open source background data set to update the open source background data set; acquiring a shot scene background data set;
unifying the image size, summarizing the data in the basic data set, the open source background data set and the scene background data set to obtain the background data set,
and under the condition that the corresponding background images are randomly extracted from the background data set, randomly extracting the corresponding background images from the background data set according to a second preset proportion of the basic data set, the open source background data set and the scene background data set.
In a third aspect, embodiments of the present application provide an electronic device comprising a memory having a computer program stored therein and a processor configured to run the computer program to perform a method of gesture keypoint location or pose estimation.
In a fourth aspect, embodiments of the present application provide a storage medium having a computer program stored therein, wherein the computer program is configured to perform a method of gesture keypoint location or pose estimation at runtime.
Compared with the related art, the gesture key point positioning or gesture estimation method provided by the embodiments of the present application inputs an image containing a gesture into a gesture key point positioning or gesture estimation model and obtains the position or pose result of each key point of the hand on the image output by the model. The construction process of the gesture key point positioning or gesture estimation model comprises the following steps: acquiring a basic data set, and training a basic model on the basic data set; acquiring image data of gestures in different scenes, determining position or pose labels for each key point of the hand on the images, and determining the region of the hand's pixel block to obtain a foreground data set; determining a background data set; superposing the data in the foreground data set and the background data set, and harmonizing the data to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or gesture estimation model. This greatly improves the training effect of the data on the model, solves the problem of poor generalization performance of gesture interaction functions in the related art, and improves the generalization performance of the gesture interaction function.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic illustration of an application environment of a method of gesture keypoint location or pose estimation according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of constructing a gesture keypoint location or pose estimation model according to a first embodiment of the present application;
FIG. 3 is a flow chart of a method of constructing a gesture keypoint location or pose estimation model according to a second embodiment of the present application;
FIG. 4 is a flow chart of determining a foreground data set according to a third embodiment of the present application;
FIG. 5 is a schematic diagram after 2D keypoint visualization according to a third embodiment of the present application;
FIG. 6 is a schematic diagram after 3D keypoint visualization according to a third embodiment of the present application;
FIG. 7 is a schematic diagram after visualization of a gesture according to a third embodiment of the present application;
FIG. 8 is a flow chart of determining a foreground dataset according to a fourth embodiment of the present application;
FIG. 9 is a schematic diagram of a hand mask of an image according to a fourth embodiment of the present application;
FIG. 10 is a schematic view of constructing a region to be processed on an image according to a fourth embodiment of the present application;
FIG. 11 is a flow chart of determining a background data set according to a fifth embodiment of the present application;
FIG. 12 is a flow chart of superimposing data in a foreground data set and a background data set according to a sixth embodiment of the present application;
FIG. 13 is a schematic view of a hand mask according to a sixth embodiment of the present application;
FIG. 14 is a schematic diagram of pixel block enhancement according to a sixth embodiment of the present application;
FIG. 15 is a schematic diagram after 2D keypoint visualization according to a sixth embodiment of the present application;
FIG. 16 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments obtained by one of ordinary skill in the art, without inventive effort, based on the embodiments provided herein are intended to be within the scope of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art can apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and should not be construed as exceeding this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The method for positioning or estimating the gesture keypoints can be applied to an application environment shown in fig. 1, fig. 1 is a schematic diagram of an application environment of a method for positioning or estimating the gesture keypoints according to an embodiment of the present application, as shown in fig. 1, when performing positioning or estimating the gesture keypoints, a server 101 inputs an image including a gesture to a gesture keypoint positioning or estimating model to obtain a position or gesture result of each keypoint of a hand on the image output by the model; the server 101 may be implemented as a stand-alone server or a server cluster including a plurality of servers. It should be noted that, the hand referred to in this application includes a hand and an arm.
The embodiment provides a method for positioning a gesture key point or estimating a gesture, which comprises the following steps: inputting an image containing a gesture to a gesture keypoint positioning or gesture estimation model to obtain a position or gesture result of each keypoint of a hand on the image output by the model, wherein fig. 2 is a flowchart of a method for constructing the gesture keypoint positioning or gesture estimation model according to the first embodiment of the present application, as shown in fig. 2, the flowchart includes the following steps:
step S201, a basic data set is obtained, and a basic model is trained on the basic data set, wherein the basic data set comprises: images containing gestures captured in different scenes, annotated with gesture key point positions or pose information; the data volume of the basic data set can be about 500,000 images; the specific training method of the basic model and the label output form of the basic model are not limited, for example, the label output form can be 2D or 3D key points, the pose (including rotation and translation) of each joint, and the like;
step S202, acquiring image data of gestures in different scenes, determining the position or gesture labels of key points of the hands on the images, and determining the areas of pixel blocks of the hands to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
for example, the process of acquiring image data of gestures in different scenes may be: fixing the position of a camera (such as a color camera, a grayscale camera, an infrared camera, and the like) and the position of a depth camera required by the target scene, and calibrating the intrinsic and extrinsic parameters; using the fixed camera group, which gives better acquisition results, various gestures in the scene are collected, and data with insufficient precision in the basic data set are re-acquired and replaced, wherein the acquired data comprise a depth map and a target image with aligned timestamps; as another example, if the image data is a color image, the data may be collected directly against a green screen to obtain image data of gestures in different scenes;
step S203, superposing data in the foreground data set and the background data set, and harmonizing the data to obtain a target data set; the basic model is fine-tuned based on the basic data set and the target data set to obtain the gesture key point positioning or gesture estimation model; the harmonization method is not limited, and BargainNet, DoveNet, and the like may be used, which are not described here.
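For illustration only, the dynamic generation of a single target-data sample in step S203 might look like the following minimal Python sketch; here `harmonize` stands in for a pretrained harmonization network (for example a DoveNet-style model wrapped as a callable), and the dictionary keys and function names are hypothetical rather than part of this application:

```python
import random
import numpy as np

def make_target_sample(foreground_set, background_set, harmonize):
    """One dynamically generated target-data sample (step S203): draw a random
    foreground and background, alpha-composite them, then harmonize."""
    fg = random.choice(foreground_set)            # assumed keys: image, mask, label
    bg = random.choice(background_set)            # HxWx3 image of the same size
    alpha = fg["mask"].astype(np.float32)[..., None] / 255.0
    comp = fg["image"] * alpha + bg * (1.0 - alpha)      # overlay hand on background
    comp = harmonize(comp.astype(np.uint8), fg["mask"])  # reduce fg/bg appearance gap
    return comp, fg["label"]                      # key point labels carry over
```

Because the sample is built and discarded inside the training loop, nothing beyond the two source pools needs to be stored.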
Through steps S201 to S203, and in contrast to the poor generalization performance of gesture interaction functions in the prior art, the construction process of the gesture key point positioning or gesture estimation model used in this embodiment dynamically generates the target data set during the training stage by randomly selecting and compositing data from the foreground data set and the background data set, and deletes the data from memory after a single use, which reduces the occupation of system storage space during model construction. Harmonizing the synthesized images reduces the difference between the hand and the background and greatly improves the realism of the images, which prevents the neural network from learning features that do not actually exist and eliminates the influence of such features on training. The network is thereby guided to learn features with strong generality, so the gesture key point positioning or gesture estimation model of this embodiment performs more stably across different scenes and different users, accurately recognizes gestures in different scenes, and solves the problem of poor generalization performance of gesture interaction functions in the related art.
The embodiment also provides a method for constructing a gesture key point positioning or gesture estimation model, and fig. 3 is a flowchart of a method for constructing a gesture key point positioning or gesture estimation model according to a second embodiment of the present application, as shown in fig. 3, where the flowchart includes the following steps:
step S301, a basic data set is obtained, and a basic model is trained on the basic data set, wherein the basic data set comprises: images containing hands, and position or pose labels for each key point of the hand;
step S302, acquiring image data of gestures in different scenes, determining the position or gesture labels of key points of the hands on the images, and determining the areas of pixel blocks of the hands to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
step S303, superposing data in the foreground data set and the background data set, and harmonizing the data to obtain a target data set; the basic model is fine-tuned based on the basic data set, the target data set and the difficult-case data set to obtain the gesture key point positioning or gesture estimation model, wherein, during model training, the difficult-case data set is determined from training data whose error is larger than a preset threshold.
Through steps S301 to S303, besides the beneficial effects described in the first embodiment, selecting the data with larger errors during model training as data for repeated training effectively strengthens the learning of difficult cases: the neural network's learning of hard examples is enhanced, and the network is guided to learn features of higher complexity, further improving the training effect. The gesture key point positioning or gesture estimation model of this embodiment is therefore more stable when used by different users in different scenes, accurately recognizes gestures in different scenes, and solves the problem of poor generalization performance of gesture interaction functions in the related art.
In some of these embodiments, fig. 4 is a flowchart of determining a foreground data set according to a third embodiment of the present application, as shown in fig. 4, the flowchart comprising the steps of:
step S401, inputting the image data to the basic model, and outputting, by the basic model, the position or pose result of each key point of the hand in the image; visualizing the position or pose; for example, the acquired image data is annotated with the basic model: each frame of image is input into the basic model to obtain the position or pose information of each key point of the hand corresponding to that image, and the obtained information is visualized for screening;
step S402, determining an image with the precision meeting the preset requirement according to the visualized position or posture, and determining the area of the pixel block of the hand on the image to obtain a foreground data set;
the process of determining the images whose accuracy meets the preset requirement can be as follows: fig. 5 is a schematic diagram after 2D key points are visualized according to the third embodiment of the present application, fig. 6 is a schematic diagram after 3D key points are visualized according to the third embodiment, and fig. 7 is a schematic diagram after a pose is visualized according to the third embodiment; data whose accuracy meets the training requirements are manually screened according to visualized images such as those shown in fig. 5 to fig. 7; data with insufficient precision can be discarded directly, or a part of them can be selected for manual labeling; in this way, a set of images, key points or poses with good precision, and depth maps captured at the same time as the images are obtained;
the process of determining the area where the pixel block of the hand is located on the image may be: performing pixel alignment between the depth map and the image, projecting the key points or pose onto the pixels to obtain 2D key point information on the aligned depth map, extracting the pixel blocks of the hand and arm with a region growing method according to the 2D key point information, and then determining the mask information of the hand; if the scene is gesture recognition from a first-person view, all pixels within a certain distance can be segmented directly according to depth and projected onto the corresponding image according to the intrinsic and extrinsic parameters to obtain the mask information of the image; it should be noted that if the image data is a color image, the coarse mask information may be obtained without a depth camera, by using a relatively mature green-screen matting algorithm instead.
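As a minimal sketch of the region-growing step (not the patent's reference implementation), the following assumes `depth` is a depth map already pixel-aligned with the image, `seeds_2d` holds the projected 2D key points, and the tolerance `tol` is an assumed parameter:

```python
import numpy as np
from collections import deque

def grow_hand_mask(depth, seeds_2d, tol=15):
    """Grow a hand/arm mask on the aligned depth map from the projected 2D
    key points: a pixel joins the region when its depth is within `tol`
    of a 4-connected neighbor already in the region."""
    h, w = depth.shape
    mask = np.zeros((h, w), np.uint8)
    queue = deque()
    for u, v in seeds_2d:                      # seed the region at each key point
        y, x = int(v), int(u)
        if 0 <= y < h and 0 <= x < w and depth[y, x] > 0:
            mask[y, x] = 255
            queue.append((y, x))
    while queue:                               # breadth-first region growing
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and mask[ny, nx] == 0
                    and depth[ny, nx] > 0
                    and abs(int(depth[ny, nx]) - int(depth[y, x])) < tol):
                mask[ny, nx] = 255
                queue.append((ny, nx))
    return mask
```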
Considering that in the case of using a depth camera, limited by the accuracy of the depth camera, there may be a certain error at the edge of the hand, and the resulting mask accuracy needs to be further improved, in some embodiments, fig. 8 is a flowchart for determining the foreground data set according to the fourth embodiment of the present application, as shown in fig. 8, after determining the pixel block of the hand on the image, the flowchart includes the following steps:
step S801, determining mask areas of the hands on the image according to the pixel blocks of the hands on the image;
step S802, shrinking the mask area inwards by a pixels and expanding it outwards by b pixels to construct a region to be processed, and performing a fine matting operation on the region to be processed to obtain the foreground data set; the fine matting thereby ensures the mask accuracy of the obtained hand;
for example, fig. 9 is a schematic diagram of a hand mask of an image according to the fourth embodiment of the present application; as shown in fig. 9, after the area where the pixel block of the hand is located on the image is determined, the foreground data set contains a rough mask. Fig. 10 is a schematic diagram of constructing a region to be processed on the image according to the fourth embodiment of the present application; as shown in fig. 10, the obtained mask is eroded by a pixels, and the pixels inside (the white area) are considered hand with high confidence; the mask is expanded by b pixels, and everything outside that contour (the black area) is considered stable background; the middle band (the gray area) may be either hand or background and is taken as the region to be processed, i.e., the part that needs fine matting;
specifically, when performing the fine matting operation, the GrabCut matting algorithm may be used; this is only illustrative, not limiting, and in other embodiments other matting algorithms, such as KNN matting, may also be used; the matting algorithm is initialized with this partition to complete the fine matting task, yielding the foreground data set, which comprises groups of data, each group comprising: an image, the hand mask of the image, and the position or pose of each key point of the hand on the image.
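As a concrete, hedged example of the trimap-plus-matting idea (using OpenCV's GrabCut rather than any implementation prescribed by this application; the kernel sizes merely stand in for the a-pixel shrink and b-pixel expansion):

```python
import cv2
import numpy as np

def refine_mask_grabcut(image, coarse_mask, a=5, b=10, iters=3):
    """Erode the coarse mask by roughly `a` pixels (sure hand), dilate it by
    roughly `b` pixels (outside is sure background), and let GrabCut decide
    the uncertain band in between (the gray area of fig. 10)."""
    sure_fg = cv2.erode(coarse_mask, np.ones((2 * a + 1, 2 * a + 1), np.uint8))
    maybe = cv2.dilate(coarse_mask, np.ones((2 * b + 1, 2 * b + 1), np.uint8))
    gc = np.full(coarse_mask.shape, cv2.GC_BGD, np.uint8)  # sure background
    gc[maybe > 0] = cv2.GC_PR_FGD                          # band: hand or background
    gc[sure_fg > 0] = cv2.GC_FGD                           # sure hand
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, gc, None, bgd, fgd, iters, cv2.GC_INIT_WITH_MASK)
    fine = (gc == cv2.GC_FGD) | (gc == cv2.GC_PR_FGD)
    return (fine * 255).astype(np.uint8)
```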
In some of these embodiments, fig. 11 is a flowchart of determining a background data set according to a fifth embodiment of the present application, as shown in fig. 11, the flowchart comprising the steps of:
step S1101, obtaining the basic data set; obtaining an open source background data set, and removing background images that contain large hands from the open source background data set to update it; and obtaining a photographed scene background data set;
the basic data set is the same data set used for training the basic model; it contains not only background information but also labeled hand information, and retaining this hand data in part of the background data set prepares for synthesizing images in which two hands cross or overlap, further improving the generalization performance of the gesture interaction function. The open source background data set consists of existing data sets containing various scenes or backgrounds, for example LSUN/COCO/Google-Landmarks; images in the open source background data set that contain large hands must be removed before the remainder can be used as backgrounds. Specifically, an existing gesture algorithm can be run on the collected open source images: if an image contains hands, it is removed; otherwise it is retained. Alternatively, the images to be retained can be determined by manual screening. The scene background data set consists of background images collected with the actually used camera in scenes that may actually occur; care should be taken that no hands appear in these scenes. It should be noted that this is a preferred embodiment; in other embodiments, only one or two of the basic data set, the open source background data set and the scene background data set may be acquired;
step S1102, unifying the image sizes, and summarizing the data in the basic data set, the open source background data set and the scene background data set to obtain the background data set;
for example, the images in the basic data set, the open source background data set and the scene background data set are processed to a uniform size consistent with the network input, specifically, for example, 640 x 480 resolution; after the image sizes are unified, the data in the three data sets are summarized to obtain the background data set. It should be noted that, when background images are randomly extracted from the background data set, they can be extracted according to a second preset proportion across the basic data set, the open source background data set and the scene background data set; the choice of the second preset proportion is not limited, and a suitable proportion can be determined for the actual scene; for example, in this embodiment, the second preset proportion of the basic data set, the open source background data set and the scene background data set may be 0.2:0.5:0.3.
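A one-function sketch of this weighted draw, assuming the three background pools are in-memory lists (the names and the exact mechanism are illustrative assumptions, not part of the application):

```python
import random

def sample_background(basic_bgs, open_source_bgs, scene_bgs,
                      weights=(0.2, 0.5, 0.3)):
    """Draw one background image: first pick the source data set according to
    the second preset proportion, then pick uniformly inside that source."""
    pool = random.choices([basic_bgs, open_source_bgs, scene_bgs],
                          weights=weights, k=1)[0]
    return random.choice(pool)
```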
In some of these embodiments, fig. 12 is a flowchart of superimposing data in a foreground dataset and a background dataset according to a sixth embodiment of the present application, as shown in fig. 12, the flowchart comprising the steps of:
step S1201, performing a transformation enhancing operation on the pixel blocks of the hand in the region in the foreground data set to update the foreground data set;
alternatively, fig. 13 is a schematic diagram of a hand mask according to the sixth embodiment of the present application, and fig. 14 is a schematic diagram of pixel block enhancement according to the sixth embodiment of the present application; as shown in fig. 13 and fig. 14, the pixel block of the image inside the mask may be taken out and given some data enhancement according to the actual task requirements, for example translation, rotation, scaling, stretching, brightness adjustment, or contrast adjustment; it should be noted that fig. 15 is a schematic diagram after 2D key point visualization according to the sixth embodiment of the present application, and as shown in fig. 15, the key point positions or poses of the hand are transformed consistently with the enhancement applied to the hand;
step S1202, randomly extracting foreground images from the foreground data set, and randomly extracting a corresponding background image from the background data set for each extracted foreground image;
step S1203, superposing the foreground image and the background image; for example, the foreground image of the hand may be pasted over the background image;
it should be noted that if the case of hands crossing or overlapping is considered when determining the background images, and data from the basic data set are also added to the background data set, then, since the basic data set carries hand labels, the position or pose label of each key point of the hand in the target data set needs to comprise both the hand labels of the basic-data-set image used as the background and the hand labels of the newly superposed foreground image, and the union of the two is taken as the hand label in the target data set. By synthesizing image data in which hands cross and overlap, the method for gesture key point positioning or gesture estimation using the model can also handle gestures involving both hands, further improving the generalization performance of the gesture interaction function.
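As an illustrative sketch of steps S1201 to S1203 combined (the transform ranges, the function name, and the shape conventions are assumptions of this sketch, not requirements of the application; `bg_kpts` covers the case where the background comes from the basic data set and already carries a hand label):

```python
import cv2
import numpy as np

def augment_and_overlay(fg_img, fg_mask, fg_kpts, bg_img, bg_kpts=None):
    """Apply one random similarity transform to the hand pixel block, its mask,
    and its 2D key points together, paste the result onto the background, and
    take the union of hand labels for the two-hand case."""
    h, w = bg_img.shape[:2]
    angle = np.random.uniform(-30, 30)             # assumed enhancement ranges
    scale = np.random.uniform(0.8, 1.2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += np.random.uniform(-20, 20, size=2)  # random translation
    img_t = cv2.warpAffine(fg_img, M, (w, h))
    mask_t = cv2.warpAffine(fg_mask, M, (w, h), flags=cv2.INTER_NEAREST)
    ones = np.ones((len(fg_kpts), 1), np.float32)
    kpts_t = np.hstack([fg_kpts, ones]) @ M.T      # key points follow the hand
    alpha = (mask_t > 0)[..., None]
    out = np.where(alpha, img_t, bg_img)           # overlay hand on background
    labels = ([bg_kpts] if bg_kpts is not None else []) + [kpts_t]  # label union
    return out, labels
```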
In some embodiments, after obtaining the training data, before inputting the training data set to the basic model, a transformation enhancing operation is further performed on the image data in the training data set to update the training data set, specifically, some image enhancing methods may be selected according to the actual task, for example, the methods may be random clipping, affine transformation, rotation, flipping, contrast transformation, brightness transformation, stretching, and the like, and the final label is modified correspondingly;
because the data synthesis is completed in the training stage, the theoretically available data is unlimited thanks to the enhancement operations applied during synthesis; compared with the offline data generation of the related art, this further improves generalization while depending far less on storage space. For example, storing 10,000 background images and 10,000 foreground images already yields 100 million combinations before any enhancement is considered, and applying random cropping, affine transformation, rotation, flipping, contrast transformation, brightness transformation, stretching and other enhancement operations to those 100 million images yields a massive amount of image data, so this embodiment puts little storage pressure on the system. In addition, because these massive image data differ from one another, the same data never needs to be trained on more than once in the training stage, which further ensures the generalization performance of the gesture interaction function.
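The training-time synthesis described above can be pictured as a PyTorch-style dataset; this is a sketch under the assumption that `synthesize` bundles the compositing, harmonization and enhancement steps, none of which are defined by the snippet itself:

```python
import random
from torch.utils.data import Dataset

class OnTheFlyGestureDataset(Dataset):
    """Every __getitem__ composes a fresh (foreground, background) pair, so the
    number of distinct raw composites is the product of the two pool sizes
    (10,000 x 10,000 = 100 million) even before per-sample enhancement."""

    def __init__(self, foregrounds, backgrounds, synthesize):
        self.foregrounds = foregrounds    # e.g. 10,000 matted hand samples
        self.backgrounds = backgrounds    # e.g. 10,000 background images
        self.synthesize = synthesize      # composite + harmonize + enhance

    def __len__(self):
        return len(self.foregrounds) * len(self.backgrounds)

    def __getitem__(self, idx):
        fg = self.foregrounds[idx % len(self.foregrounds)]
        bg = random.choice(self.backgrounds)
        image, label = self.synthesize(fg, bg)   # nothing is written to disk
        return image, label
```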
In some of these embodiments, the process of fine-tuning the base model based on the basic data set, the target data set and the difficult-case data set includes: summarizing the basic data set, the target data set and the difficult-case data set according to a first preset proportion to obtain a training data set, and fine-tuning the base model on the training data set. The proportion of the three data sets can be adjusted to the actual situation; a reasonable proportion is 4:1:1. In other words, if one batch has 192 images, 128 images are extracted from the basic data set, 32 images from the target data set, and 32 images from the difficult-case data set; if the difficult-case data set has no data, data from the target data set can be used in its place;
the construction method of the difficult-case data set is as follows: during model training, the error of the training data is calculated; once the error is larger than a preset threshold (the threshold can be adjusted to the actual requirements), the hand data in that training sample are acquired, a corresponding background is randomly extracted from the background data set for each acquired hand, and the hand data and the background are superposed and harmonized to obtain a difficult-case sample, which is placed at the tail of the data sequence of the difficult-case data set; each time data is fetched, fetching starts from the head of the sequence, and the fetched data is deleted from the difficult-case data set. This effectively increases the training weight of difficult samples, so training on the difficult-case sequence is strengthened, while a new background can be substituted each time, which preserves the generalization performance of the model.
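A compact sketch of the difficult-case bookkeeping and the 4:1:1 batch assembly; the queue, the set names and the functions are hypothetical, and `composite` is assumed to perform the superposition-plus-harmonization described above:

```python
import random
from collections import deque

hard_queue = deque()   # the difficult-case data set, kept as a FIFO sequence

def push_hard(hand, background_set, composite):
    """When a sample's training error exceeds the preset threshold, re-composite
    its hand onto a random background and append it at the tail of the queue."""
    hard_queue.append(composite(hand, random.choice(background_set)))

def next_batch(basic_set, target_set, sizes=(128, 32, 32)):
    """Assemble one 192-image batch at the 4:1:1 first preset proportion;
    difficult-case items are fetched from the head and deleted on use, with
    the target data set filling in whenever the queue runs empty."""
    batch = random.sample(basic_set, sizes[0]) + random.sample(target_set, sizes[1])
    for _ in range(sizes[2]):
        batch.append(hard_queue.popleft() if hard_queue else random.choice(target_set))
    return batch
```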
In addition, in combination with the method for gesture keypoint positioning or gesture estimation in the above embodiments, the embodiments of the present application may provide a storage medium for implementation. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements the method of gesture keypoint location or gesture estimation of any of the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program when executed by a processor implements a method of gesture keypoint location or pose estimation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In one embodiment, fig. 16 is a schematic diagram of the internal structure of an electronic device according to an embodiment of the present application. As shown in fig. 16, an electronic device is provided, which may be a server. The electronic device includes a processor, a network interface, an internal memory, and a non-volatile memory connected by an internal bus, where the non-volatile memory stores an operating system, computer programs and a database. The processor provides computing and control capability, the network interface communicates with an external terminal through a network connection, the internal memory provides an environment for the operation of the operating system and the computer programs, the computer programs are executed by the processor to implement a gesture key point positioning or pose estimation method, and the database stores data.
It will be appreciated by those skilled in the art that the structure shown in fig. 16 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the electronic device to which the present application is applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It should be understood by those skilled in the art that the technical features of the above-described embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above-described embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A method of gesture keypoint location or pose estimation, the method comprising:
inputting an image containing a gesture to a gesture key point positioning or gesture estimation model to obtain a position or gesture result of each key point of a hand on the image output by the model;
the construction process of the gesture key point positioning or gesture estimation model comprises the following steps:
obtaining a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises: images containing hands, and position or pose labels for each key point of the hand;
acquiring image data of gestures under different scenes, determining the position or gesture labels of key points of the hands on the images, and determining the areas of pixel blocks of the hands to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
superposing the data in the foreground data set and the background data set, and harmonizing the data to obtain a target data set; and fine-tuning the basic model based on the basic data set and the target data set to obtain the gesture key point positioning or gesture estimation model.
2. A method of gesture keypoint location or pose estimation, the method comprising:
inputting an image containing a gesture to a gesture key point positioning or gesture estimation model to obtain a position or gesture result of each key point of a hand on the image output by the model;
the construction process of the gesture key point positioning or gesture estimation model comprises the following steps:
obtaining a basic data set, and training a basic model on the basic data set, wherein the basic data set comprises: images containing hands, and position or pose labels for each key point of the hand;
acquiring image data of gestures under different scenes, determining the position or gesture labels of key points of the hands on the images, and determining the areas of pixel blocks of the hands to obtain a foreground data set; and determining a background data set, wherein the background data set comprises a background image;
superposing the data in the foreground data set and the background data set, and harmonizing the data to obtain a target data set; and fine-tuning the basic model based on the basic data set, the target data set and the difficult case data set to obtain the gesture key point positioning or gesture estimation model, wherein in the model training process, the difficult case data set is determined according to training data with errors larger than a preset threshold value.
3. The method of claim 2, wherein the maintenance process of the difficult-to-case dataset comprises:
acquiring hands in training data with errors larger than a preset threshold, randomly extracting corresponding background images from the background data set for each acquired hand, and overlapping the hands and the background images to generate difficult-case images so as to form the difficult-case data set;
when data is fetched from the difficult-case data set, the fetched data is deleted from the difficult-case data set.
4. A method according to claim 3, wherein the process of fine-tuning the base model based on the base dataset, the target dataset, and the refractory dataset comprises:
summarizing the basic data set, the target data set and the difficult data set according to a first preset proportion to obtain a training data set;
the base model is fine-tuned based on the training data set.
5. The method of claim 2, wherein the determining of the foreground dataset comprises:
inputting the image data to the basic model, and outputting the position or posture result of each key point of the hand in the image by the basic model; visualizing the result;
and determining an image with the precision meeting the preset requirement according to the result, and determining the area of the pixel block of the hand on the image to obtain the foreground data set.
6. The method of claim 5, wherein acquiring image data of gestures in different scenes comprises: fixing the position of the camera and the position of the depth camera, and calibrating internal and external parameters to obtain a camera set; acquiring image data of gestures in different scenes by the camera group, wherein the image data comprises a depth map and an image with aligned time stamps;
the process of determining the pixel block of a hand on the image includes: aligning the depth map and the image, projecting the position or pose result of each key point of the hand into pixel coordinates to obtain 2D information for each key point on the aligned depth map, and extracting the pixel block of the hand on the image according to the 2D information and a region growing method,
determining a mask area of the hand on the image according to the pixel block of the hand on the image; and shrinking the mask area inwards by a pixels and expanding it outwards by b pixels to construct a region to be processed, and performing a fine matting operation on the region to be processed to obtain the foreground data set.
7. The method of claim 2, wherein the process of overlaying the foreground dataset with the data in the background dataset comprises:
performing a transform enhancement operation on blocks of pixels of a hand within the region in the foreground dataset to update the foreground dataset;
randomly extracting foreground images from the foreground data set, and randomly extracting a corresponding background image from the background data set for each extracted foreground image,
and superposing the foreground image and the background image.
8. The method according to claim 3 or 7, wherein the process of determining a background data set comprises:
acquiring the basic data set; acquiring an open source background data set, and removing a background image containing hands in the open source background data set to update the open source background data set; acquiring a shot scene background data set;
unifying the image size, summarizing the data in the basic data set, the open source background data set and the scene background data set to obtain the background data set,
and under the condition that the corresponding background images are randomly extracted from the background data set, randomly extracting the corresponding background images from the background data set according to a second preset proportion of the basic data set, the open source background data set and the scene background data set.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of gesture keypoint location or pose estimation of any of claims 1 to 8.
10. A storage medium having stored therein a computer program, wherein the computer program is arranged to perform the method of gesture keypoint location or pose estimation of any of claims 1 to 8 at run-time.
CN202111334862.4A 2021-11-11 2021-11-11 Gesture key point positioning or gesture estimating method, electronic device and storage medium Active CN114185429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111334862.4A CN114185429B (en) 2021-11-11 2021-11-11 Gesture key point positioning or gesture estimating method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111334862.4A CN114185429B (en) 2021-11-11 2021-11-11 Gesture key point positioning or gesture estimating method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114185429A CN114185429A (en) 2022-03-15
CN114185429B true CN114185429B (en) 2024-03-26

Family

ID=80601543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111334862.4A Active CN114185429B (en) 2021-11-11 2021-11-11 Gesture key point positioning or gesture estimating method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114185429B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011253292A (en) * 2010-06-01 2011-12-15 Sony Corp Information processing system, method and program
US9019267B2 (en) * 2012-10-30 2015-04-28 Apple Inc. Depth mapping with enhanced resolution
EP3203412A1 (en) * 2016-02-05 2017-08-09 Delphi Technologies, Inc. System and method for detecting hand gestures in a 3d space
WO2018033156A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Video image processing method, device, and electronic apparatus
US11189379B2 (en) * 2018-03-06 2021-11-30 Digital Surgery Limited Methods and systems for using multiple data structures to process surgical data
CN110163048B (en) * 2018-07-10 2023-06-02 腾讯科技(深圳)有限公司 Hand key point recognition model training method, hand key point recognition method and hand key point recognition equipment
CN109446994B (en) * 2018-10-30 2020-10-30 北京达佳互联信息技术有限公司 Gesture key point detection method and device, electronic equipment and storage medium
CN109858524B (en) * 2019-01-04 2020-10-16 北京达佳互联信息技术有限公司 Gesture recognition method and device, electronic equipment and storage medium

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN109635621A (en) * 2017-10-07 2019-04-16 塔塔顾问服务有限公司 For the system and method based on deep learning identification gesture in first person
CN108229318A (en) * 2017-11-28 2018-06-29 北京市商汤科技开发有限公司 The training method and device of gesture identification and gesture identification network, equipment, medium
CN109308459A (en) * 2018-09-05 2019-02-05 南京大学 Gesture estimation method based on finger attention model and key point topological model
CN111199169A (en) * 2018-11-16 2020-05-26 北京微播视界科技有限公司 Image processing method and device
CN111124108A (en) * 2019-11-22 2020-05-08 Oppo广东移动通信有限公司 Model training method, gesture control method, device, medium and electronic equipment
CN111062263A (en) * 2019-11-27 2020-04-24 杭州易现先进科技有限公司 Method, device, computer device and storage medium for hand pose estimation
CN111209861A (en) * 2020-01-06 2020-05-29 浙江工业大学 Dynamic gesture action recognition method based on deep learning
WO2021189847A1 (en) * 2020-09-03 2021-09-30 平安科技(深圳)有限公司 Training method, apparatus and device based on image classification model, and storage medium
CN112836597A (en) * 2021-01-15 2021-05-25 西北大学 Multi-hand posture key point estimation method based on cascade parallel convolution neural network
CN112749512A (en) * 2021-01-18 2021-05-04 杭州易现先进科技有限公司 Method and system for optimizing gesture estimation and electronic device
CN113221738A (en) * 2021-05-11 2021-08-06 广州虎牙科技有限公司 Gesture recognition method and device, electronic equipment and readable storage medium
CN113393563A (en) * 2021-05-26 2021-09-14 杭州易现先进科技有限公司 Method, system, electronic device and storage medium for automatically labeling key points

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Traffic police command gesture recognition fusing spatial context and temporal features; Zhang Cheng; He Jian; Wang Weidong; Acta Electronica Sinica (Issue 05); full text *

Also Published As

Publication number Publication date
CN114185429A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN109859098B (en) Face image fusion method and device, computer equipment and readable storage medium
CN109829930B (en) Face image processing method and device, computer equipment and readable storage medium
US10824910B2 (en) Image processing method, non-transitory computer readable storage medium and image processing system
CN109493417B (en) Three-dimensional object reconstruction method, device, equipment and storage medium
CN107507217B (en) Method and device for making certificate photo and storage medium
CN114332183A (en) Image registration method and device, computer equipment and storage medium
CN116012432A (en) Stereoscopic panoramic image generation method and device and computer equipment
Song et al. Weakly-supervised stitching network for real-world panoramic image generation
CN117078790B (en) Image generation method, device, computer equipment and storage medium
JP2003087549A (en) Device and method for compositing image and computer readable recording medium having image composite processing program recorded thereon
CN114185429B (en) Gesture key point positioning or gesture estimating method, electronic device and storage medium
KR101582225B1 (en) System and method for providing interactive augmented reality service
US10706509B2 (en) Interactive system for automatically synthesizing a content-aware fill
US20230098437A1 (en) Reference-Based Super-Resolution for Image and Video Enhancement
CN113766147B (en) Method for embedding image in video, and method and device for acquiring plane prediction model
CN113255700B (en) Image feature map processing method and device, storage medium and terminal
US20230145498A1 (en) Image reprojection and multi-image inpainting based on geometric depth parameters
CN110189247B (en) Image generation method, device and system
US10586311B2 (en) Patch validity test
JP2003087550A (en) Device and method for compositing image and computer readable recording medium having image composite processing program recorded thereon
CN116740720B (en) Photographing document bending correction method and device based on key point guidance
CN111311491B (en) Image processing method and device, storage medium and electronic equipment
WO2021176877A1 (en) Image processing device, image processing method, and image processing program
WO2024007968A1 (en) Methods and system for generating an image of a human
CN117011333A (en) Object tracking method and system based on implicit nerve field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant