CN112801064A

CN112801064A - Model training method, electronic device and storage medium

Info

Publication number: CN112801064A
Application number: CN202110386076.2A
Authority: CN
Inventors: 石彪; 李廷照; 张举勇; 户磊
Original assignee: Beijing Dilusense Technology Co Ltd; Hefei Dilusense Technology Co Ltd
Current assignee: Beijing Dilusense Technology Co Ltd; Hefei Dilusense Technology Co Ltd
Priority date: 2021-04-12
Filing date: 2021-04-12
Publication date: 2021-05-14

Abstract

The embodiment of the invention relates to the field of data processing, and discloses a model training method, electronic equipment and a storage medium. In some embodiments of the present invention, a model training method comprises the following steps: acquiring an image training set and annotation data; the annotation data comprises a human hand three-dimensional model obtained based on image training set optimization; training to obtain a human hand capturing model by utilizing each frame training image and marking data in the image training set based on a predefined loss function at least comprising loss constraint terms of adjacent frames; the hand capturing model is a neural network model used for judging hand posture parameters and hand shape parameters represented by a single hand image, so that a hand motion model can be obtained based on the hand posture parameters, the hand shape parameters and the hand parameterized model. This embodiment makes it possible to obtain, based on a single image, the parameters required in the process of building a model of the movements of the human hand based on a parameterized model.

Description

Model training method, electronic device and storage medium

Technical Field

The embodiment of the invention relates to the field of data processing, in particular to a model training method, electronic equipment and a storage medium.

Background

The reconstruction and attribute identification of three-dimensional hands are always important research directions in the field of machine vision. At present, the related work of human hand reconstruction based on deep learning in academic circles can be roughly divided into two categories, namely parameterized model reconstruction and unparameterized model reconstruction. The non-parametric model reconstruction mainly takes a multi-view depth map and a multi-view color map as input, a human hand model is obtained through a large amount of data learning, and the data acquisition difficulty is high. The parameterized model reconstruction mainly learns the parameters of the human hand model and then fits the target gesture model through the parameters. Both methods have respective advantages and disadvantages.

With the rapid development of deep learning technology and hardware technology in recent years, the evolution from the use of machine learning optimization to the obtaining of models through deep learning gradually makes real-time gesture motion capture have an opportunity to be realized. However, the current reconstruction method based on the parametric model often needs to obtain the parameters required in the parametric model based on multiple images.

Disclosure of Invention

An object of an embodiment of the present invention is to provide a model training method, an electronic device, and a storage medium, which enable parameters required in a process of constructing a hand motion model based on a parameterized model to be obtained based on a single image.

In order to solve the above technical problem, an embodiment of the present invention provides a model training method, including the following steps: acquiring an image training set and annotation data; the annotation data comprises a human hand three-dimensional model obtained based on image training set optimization; training to obtain a human hand capturing model by utilizing each frame training image and marking data in the image training set based on a predefined loss function at least comprising loss constraint terms of adjacent frames; the hand capturing model is a neural network model used for judging hand posture parameters and hand shape parameters represented by a single hand image, so that a hand motion model can be obtained based on the hand posture parameters, the hand shape parameters and the hand parameterized model.

An embodiment of the present invention also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method as mentioned in the above embodiments.

Embodiments of the present invention also provide a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the model training method mentioned in the above embodiments.

According to the model training method, the electronic device and the storage medium provided by the embodiment, the hand capturing model capable of obtaining the hand posture parameters and the shape parameters based on the single hand image is obtained through training based on the image training set and the labeling data, so that the hand motion model can be obtained based on the single hand image in practical application. Because the hand motion model can be obtained based on a single image, the requirement on input data is lower, so that the hand capture model is wider in application and stronger in transportability, and can be applied to various scenes in the industry, thereby generating huge commercial value. In addition, the loss constraint term of adjacent frames is added, so that the change of the angular speed of the joint rotation is smaller, the joint rotation is smoother, and the influence of jitter on the obtained parameters is reduced.

Additionally, an image training set is acquired, comprising: controlling a plurality of image acquisition devices to shoot hand images synchronously; the arrangement directions of the plurality of image acquisition devices are different; and constructing an image training set according to the shot hand image.

In addition, according to the shot human hand image, an image training set is constructed, and the method comprises the following steps: performing data enhancement on the shot hand image; and taking the hand image subjected to data enhancement and the shot hand image as training images of the image training set.

In addition, obtaining annotation data includes: optimizing to obtain a three-dimensional model of the hand as labeled data based on input data of a predefined optimization objective function and an optimization algorithm; the input data of the optimization algorithm comprises a human hand contour map, point cloud data, two-dimensional key point data and three-dimensional key point data which are obtained based on an image training set.

In addition, the optimization objective function is related to two-dimensional keypoint errors, three-dimensional keypoint errors, human hand parameter errors, parameter differences of adjacent frames, and contour errors.

In addition, the functional expression of the adjacent frame loss constraint term is:

L_Smooth=||θ_pre-θ_cur||₂ ²+||β_pre-β_cur||₂ ²a formula a;

wherein L is_SmoothRepresenting adjacent frame loss constraint, theta_preRepresenting a human hand pose parameter, beta, predicted based on a previous frame of training image_preRepresenting a human hand shape parameter, theta, predicted based on a previous frame of training image_curRepresenting the hand pose parameter, beta, predicted based on the current frame training image_curRepresenting the human hand shape parameters predicted based on the current frame training image.

In addition, the loss function further comprises a point cloud loss constraint term, a two-dimensional key point loss constraint term and a three-dimensional key point loss constraint term.

In addition, the hand capture model is a network model obtained by cascading a lightweight neural network and a gated cycle unit model.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, which correspond to the figures in which like reference numerals refer to similar elements and which are not to scale unless otherwise specified.

FIG. 1 is a flow chart of a model training method according to a first embodiment of the present invention;

FIG. 2 is a flow chart of a model training method according to a second embodiment of the present invention;

FIG. 3 is a schematic diagram of an image acquisition system in the model training method shown in FIG. 2;

fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. However, it will be appreciated by those of ordinary skill in the art that numerous technical details are set forth in order to provide a better understanding of the present application in various embodiments of the present invention. However, the technical solution claimed in the present application can be implemented without these technical details and various changes and modifications based on the following embodiments.

A first embodiment of the present invention relates to a model training method including the steps of: acquiring an image training set and annotation data; the annotation data comprises a human hand three-dimensional model obtained based on image training set optimization; training to obtain a human hand capturing model by utilizing each frame training image and marking data in the image training set based on a predefined loss function at least comprising loss constraint terms of adjacent frames; the hand capturing model is a neural network model used for judging hand posture parameters and hand shape parameters represented by a single hand image, so that a hand motion model can be obtained based on the hand posture parameters, the hand shape parameters and the hand parameterized model. In the embodiment, based on the image training set and the labeling data, a hand capturing model capable of obtaining hand posture parameters and hand shape parameters based on a single hand image is obtained through training, so that a hand motion model can be obtained based on the single hand image in practical application. Because the hand motion model can be obtained based on a single image, the requirement on input data is lower, so that the hand capture model is wider in application and stronger in transportability, and can be applied to various scenes in the industry, thereby generating huge commercial value. In addition, the loss constraint term of adjacent frames is added, so that the change of the angular speed of the joint rotation is smaller, the joint rotation is smoother, and the influence of jitter on the obtained parameters is reduced.

The following describes details of the model training method according to the present embodiment. The following disclosure provides implementation details for the purpose of facilitating understanding, and is not necessary to practice the present solution.

The model training method in the present embodiment is applied to an electronic device. The electronic device may be a terminal, a server, a cloud server, or the like. As shown in fig. 1, the model training method includes the following steps:

step 101: and acquiring an image training set and labeling data.

Specifically, the annotation data comprises a human hand three-dimensional model obtained based on optimization of an image training set. The image training set comprises a plurality of training images, and each training image is a hand image in various postures and under various backgrounds. The human hand three-dimensional model is a point cloud model of the human hand with fixed human hand posture parameters and human hand shape parameters corresponding to the image training set.

Step 102: and training to obtain a human hand capturing model by utilizing each frame training image and the labeled data in the image training set based on a predefined loss function at least comprising a loss constraint term of adjacent frames.

Specifically, the hand capturing model is a neural network model for judging hand posture parameters and hand shape parameters represented by a single hand image, so that a hand motion model is obtained based on the hand posture parameters, the hand shape parameters and a hand parameterized model. The input data of the hand capturing model is a single hand image, and the output data is hand posture parameters and hand shape parameters in the parameterized model. The parameterized model is a model which is constructed in advance and can generate a hand motion model corresponding to a single hand image based on the input hand posture parameters and hand shape parameters of the single hand image. The hand motion model is a three-dimensional model of the motion of a hand capable of representing a single hand image.

It is worth mentioning that the electronic equipment obtains a hand capturing model capable of obtaining hand posture parameters and hand shape parameters based on a single hand image through training based on the image training set and the labeling data, so that a hand motion model can be obtained based on the single hand image in practical application. Because the hand motion model can be obtained based on a single image, the requirement on input data is lower, so that the hand capture model is wider in application and stronger in transportability, and can be applied to various scenes in the industry, thereby generating huge commercial value. In addition, the loss constraint term of adjacent frames is added, so that the change of the angular speed of the joint rotation is smaller, the joint rotation is smoother, and the influence of jitter on the obtained parameters is reduced.

Alternatively, the human hand image is a color (RGB) image.

In one example, the functional expression of the adjacent frame loss constraint term is:

L_Smooth=||θ_pre-θ_cur||₂ ²+||β_pre-β_cur||₂ ²a formula a;

It is worth mentioning that the influence of jitter on the modeling can be effectively reduced by comparing the previous frame with the current frame.

In one example, the loss function further includes a point cloud loss constraint term, a two-dimensional keypoint loss constraint term, and a three-dimensional keypoint loss constraint term. Specifically, the process of training the model may use three different loss constraint types, respectively: conventional loss constraints, adjacent frame loss constraints, and point cloud constraints. The traditional loss adopts the common Euclidean distance constraint after the three-dimensional key points are projected on a two-dimensional image plane and marked with two-dimensional key points. The inventors have found that using the keypoint constraint training model alone has two drawbacks: firstly, the relation between adjacent frames is not considered, and the model jitter is serious; secondly, no real hand point cloud is used for constraint, and the model is not accurate enough in the depth direction. The addition of the adjacent frame loss constraint and the point cloud loss constraint can be chosen to achieve more accurate and coherent motion capture. In order to make the joint rotation of adjacent frames smoother in the parameterized model, the change of the angular velocity (including the magnitude and direction) of the joint rotation should be as small as possible. Wherein, the average rotation angular velocity of a certain key point in two adjacent frames can be calculated by dividing the difference of rotation vectors of the adjacent frames by unit time. Two average rotation angular velocities (vectors) can be calculated for three adjacent frames. Thus, alternatively, the regularization term loss may be designed to minimize the difference between the two average rotational angular velocities, i.e., the L1 mode of the difference between the two rotational angular velocities, and so on. Similarly, the translation vector should be the difference (change) in translation speed. The calculation of the point cloud loss constraint may calculate a point-to-point error of the parameterized model generated human hand model based on the three-dimensional model of the human hand in the annotation data. Therefore, the total loss function can be composed of a two-dimensional key point loss constraint term, a three-dimensional key point loss constraint term, a point cloud loss constraint term and an adjacent frame loss constraint term as shown in formula b. The function expression of the two-dimensional key point loss constraint term is shown as a formula c, the function expression of the three-dimensional key point loss constraint term is shown as a formula d, and the function expression of the point cloud loss constraint term is shown as a formula e.

L_total=λ₁L_2D+λ₂L_3D+λ₃L_Point+λ₄L_SmoothA formula b;

in the formula b, L_totalRepresenting the total loss function, λ₁Weight, λ, representing a two-dimensional keypoint loss constraint term₂Weight, λ, representing a three-dimensional keypoint loss constraint term₃Weight, λ, representing a point cloud loss constraint term₄Representing the weight of the adjacent frame loss constraint term. Lambda [ alpha ]₁、λ₂、λ₃And λ₄Generally chosen empirically. L is_2DRepresenting two-dimensional keypoint loss constraints, whichThe calculation is shown in formula c. L is_3DAnd (4) representing a three-dimensional key point loss constraint term, wherein the calculation mode is shown as formula d. L is_PointAnd (4) representing a point cloud loss constraint term, wherein the calculation formula is shown as a formula e. L is_SmoothAn adjacent frame loss constraint term is expressed and the calculation formula is shown as formula a above, and the loss term is mainly used to prevent jitter.

L_2D=||p_gt2-p_l2||₂ ²Formula c;

in formula c, p_gt2Two-dimensional key point information representing hand marks (ground-truth) is obtained by detection of a third-party algorithm (such as a two-dimensional key point detection algorithm), p_l2Attitude information representing a prediction of the network.

L_3D=v||p_gt3-p_l3||₂ ²A formula d;

in formula d, p_gt3Three-dimensional key point information representing a hand label, p_gt3The three parts of X, Y and Z are included, wherein (X, Y) is the two-dimensional key point information detected by the third-party algorithm, and Z is the value of the coordinate (X, Y) on the depth map. p is a radical of_l3And representing the position of the three-dimensional key point obtained by the network prediction parameters. v is a one-hot vector consisting of 0 and 1. Due to factors such as self-shielding of the human hand and the like, the Z value of the key point of the self-shielding part is inconsistent with the value of the coordinate XY on the depth map, and some rules can be set as required to filter out the self-shielding key point. For example, after obtaining three-dimensional key points of a human hand, there is a confidence value of each three-dimensional key point, and the confidence value represents the probability that the three-dimensional key point is correct. By setting a confidence threshold, the detected three-dimensional keypoints can be filtered. And setting the numerical value of the element corresponding to the correct three-dimensional key point in the one-hot vector to be 1, and setting the numerical value of the element corresponding to the incorrect three-dimensional key point to be 0.

L_Point=w||n_gt ^T(V_gt-V_pred) ||₂ ²A formula e;

in formula e, w is a one-hot vector consisting of 0 and 1, and if a three-dimensional relationship is found on the depth mapAnd the point corresponding to the key point is 1, otherwise, the point is 0. V_gtRepresenting grid points, V_predPoints, n, corresponding to the human hand three-dimensional model generated by representing the prediction parameters and the human hand three-dimensional model obtained by optimization_gt ^TRepresenting the grid point normal. And searching a corresponding point between the human hand three-dimensional model generated by the prediction parameters and the human hand three-dimensional model obtained through optimization through the Kd-Tree.

Optionally, the human hand capture model is a network model obtained by cascading a lightweight neural network (ShuffleNetv 2) model and a gated cyclic unit (gru) model. Specifically, in view of the real-time requirement in the aspect of efficiency, according to the efficiency tests of different models, the following conclusions are drawn: some small networks, together with a recurrent neural network, can achieve better results. For example: the network model obtained by cascading the lightweight neural network and the gated cyclic unit model is more excellent in performance and higher in speed. Therefore, optionally, the human hand capture model may use the network model as a base network during the model training process. The input data of the hand capturing model is a single hand image, the output data is hand posture parameters and hand shape parameters in the parameterized model, and the hand motion model can be obtained by inputting the parameters into the parameterized model.

Optionally, the human hand posture parameters of the continuous frames output by the ShuffleNet v2 network are input into the gru network, so that the continuous frame effect can be continuously optimized.

The above description is only for illustrative purposes and does not limit the technical aspects of the present invention.

Compared with the prior art, according to the model training method provided by the embodiment, the electronic equipment is trained to obtain the hand capture model capable of obtaining the hand posture parameters and the hand shape parameters based on a single hand image based on the image training set and the labeling data, so that the hand motion model can be obtained based on the single hand image in practical application. Because the hand motion model can be obtained based on a single image, the requirement on input data is lower, so that the hand capture model is wider in application and stronger in transportability, and can be applied to various scenes in the industry, thereby generating huge commercial value. In addition, the loss constraint term of adjacent frames is added, so that the change of the angular speed of the joint rotation is smaller, the joint rotation is smoother, and the influence of jitter on the obtained parameters is reduced.

A second embodiment of the present invention relates to a model training method. The embodiment is further detailed on the basis of the first embodiment, and illustrates a process of obtaining an image training set and labeling data.

Specifically, as shown in fig. 2, the present embodiment includes steps 201 to 204, where step 204 is substantially the same as step 102 in the first embodiment, and is not repeated here. The following mainly introduces the differences:

step 201: and controlling a plurality of image acquisition devices to synchronously shoot the hand images.

Specifically, the arrangement orientations of the plurality of image capturing devices are different.

In one example, a multi-view multi-camera based image acquisition system and a subsequent data enhancement algorithm are used to obtain a desired training set of images. Alternatively, an image capturing system composed of a plurality of image capturing devices is shown in fig. 3. The first camera 301, the second camera 302, and the third camera 303 are respectively located at different positions of the human hand 304. Alternatively, the captured training images include 40 common gesture types with different directions, and the captured objects include 30 individuals with different ages and different hand types. Optionally, in consideration of the influence factor of the color of the clothes at the wrist, the acquisition object wears sleeves of different colors to perform acquisition in the acquisition process. The process of image acquisition system set-up using multiple cameras is probably as follows:

in the first step, a plurality of cameras are placed in different directions. Wherein the camera may be an RGB-D camera.

And step two, developing an image acquisition program to enable a plurality of cameras to acquire images synchronously.

Optionally, a method similar to human body multi-camera reconstruction can be used, the multi-camera data is used for optimizing a human hand three-dimensional model to be used as the marking data, and the accuracy of the marking data and the accuracy of the three-dimensional key points can be guaranteed to the greatest extent.

After the image acquisition system is built, gestures with different pitch angles and different rotation angles are acquired aiming at different individuals. In consideration of the fuzzy condition, the gesture change speed is adjusted at any time in the acquisition process. After data acquisition is completed, three-dimensional key points and two-dimensional key points of the human hand need to be obtained. The three-dimensional key points and the two-dimensional key points can be obtained by a key point detection library and a manual labeling method. Specifically, three-dimensional key points and two-dimensional key points in a training image are detected through a key point detection library, and then the three-dimensional key points and the two-dimensional key points which are missed to be detected are manually marked on the basis of a detection result.

Step 202: and constructing an image training set according to the shot hand image.

Specifically, the electronic device may use the shot hand image as a training image to construct an image training set, and may also change the shot hand image to obtain the image training set.

In one example, the electronic device performs data enhancement on a captured human hand image; and taking the hand image subjected to data enhancement and the shot hand image as training images of the image training set. Specifically, because the background in the acquisition process is single, the electronic equipment can extract the hand image by performing data enhancement on the data, and the hand image is fused into different background pictures through a picture fusion algorithm to obtain image data of different backgrounds so as to enrich the image training set. The data enhancement processing performed on the hand image may include operations of randomly rotating, randomly changing the size, randomly cutting, changing the illumination, normalizing, and the like, on the shot hand image.

Step 203: and optimizing to obtain a hand three-dimensional model as annotation data based on the input data of the predefined optimization objective function and optimization algorithm.

Specifically, the input data of the optimization algorithm comprises a human hand contour (mask) graph, point cloud data, two-dimensional key point data and three-dimensional key point data which are obtained based on an image training set. The point cloud data refers to point cloud data of a human hand in the training image, the two-dimensional key point data refers to two-dimensional key point data of the human hand in the training image, and the three-dimensional key point data refers to three-dimensional key point data of the human hand in the training image.

In one example, the optimization objective function is related to two-dimensional keypoint errors, three-dimensional keypoint errors, human hand parameter errors, parameter differences of adjacent frames, and contour errors. The human hand parameters comprise human hand posture parameters and human hand shape parameters.

In one example, the optimization objective function is:

min(loss_total)=min(L_2d+L_3d+L_mano+L_t1-t0+L_mask) Formula f;

in the formula f, min (loss)_total) To optimize the objective, L_2dIs a two-dimensional key point error term, and the calculation formula is formula g, L_3dThe error term of the three-dimensional key point is calculated by the formula h, L_manoFor the hand parameter error term, the calculation formula is the formula i, L_t1-t0Limiting terms for adjacent frames are related to parameter difference of adjacent frames, and the calculation formula is formula j, L_maskThe calculation formula is formula k for the contour error term.

Formula g;

in the formula g, the first step is,x _trepresenting two-dimensional keypoints projecting three-dimensional keypoints of a three-dimensional model of a human hand,x _gtrepresenting the true two-dimensional key points of the image,Trepresenting the number of training images in the set of training images.

A formula h;

in the formula h, the first and second groups,Q _trepresents the three-dimensional key points of the three-dimensional model of the human hand,Q _gtrepresenting three-dimensional keypoints resulting from two-dimensional keypoint projections,Trepresenting the number of training images in the set of training images.

A formula i;

in the formula i, the first and second groups,θ _trepresenting a predicted hand pose parameter of the human hand,θ _gtrepresenting the real hand posture parameters of the human hand,β _trepresenting the predicted shape parameters of the human hand,β _gtrepresenting the parameters of the shape of the real human hand,V _trepresenting a human hand model generated from predicted human hand parameters,V _gtrepresents a model of a real human hand,Trepresenting the number of training images in the set of training images.

Formula j;

in the formula j, the first and second groups,θ _trepresenting the hand pose parameter for the t-th frame,θ _t+1and representing the hand posture parameter of the T +1 th frame, wherein T represents the number of training images in the training image set.

A formula k;

in the formula k, the first and second groups,Mask _trepresenting the predicted contour of the human hand,Mask _gtrepresenting the true human hand contour and T representing the number of training images in the set of training images.

In the formula, the two-dimensional key point error term and the three-dimensional key point error term respectively limit the gesture of the hand in a two-dimensional space and a three-dimensional space, and the hand parameter error term mainly calculates the distance from the hand point to the point, so that the shape coefficient and the gesture parameter of the hand can be limited. The constraint term of the adjacent frames limits the pose jitter between the adjacent frames by calculating the parameter difference of the adjacent frames. The contour error term is used for calculating the loss of the contour of the three-dimensional human hand projected on a two-dimensional space so as to obtain better shape parameters.

It should be noted that, as can be understood by those skilled in the art, in practical application, other optimization objective functions may also be constructed to optimize to obtain a human hand three-dimensional model, and this embodiment is merely an example, and does not limit a specific functional relationship of the optimization objective functions in practical application.

It should be noted that, as can be understood by those skilled in the art, in the present embodiment, the three-dimensional model of the human hand is obtained through an optimization algorithm, and in practical application, the three-dimensional model of the human hand may also be obtained through other manners, and the present embodiment is only an example.

Step 204: and training to obtain a human hand capturing model by utilizing each frame training image and the labeled data in the image training set based on a predefined loss function at least comprising a loss constraint term of adjacent frames.

In one example, after training the human hand capture model, the human hand capture model may be applied to various scenes. Specifically, inputting a shot single color image into a hand capturing model to obtain hand posture parameters and hand shape parameters in the single color image; and obtaining a hand motion model in the single color image through a parameterized model based on the obtained hand posture parameters and the hand shape parameters.

In the embodiment, the inventor considers that the hardware condition of the current hand motion capture meets the requirement, but a real-time hand motion capture technology based on a single-sheet color graph does not exist in the industry, the main reason is that the difficulty in acquiring the training data is high, most of the training data is an open data set, the efficiency is low, and the related technology has extremely high hardware requirements or cannot achieve the real-time effect. Currently, the process techniques that can be employed include the parametric and non-parametric methods mentioned above. The non-parametric method, for example, the minimum Hand (minimum _ handle) method, has the disadvantages of too large computational cost, small gesture diversity, and inapplicable industrial application scenarios. In the model training method provided by the embodiment, a parameterization method based on an RGB image is provided, which has the advantages of small calculation amount and large generalization, so that the RGB image is used as an input in the embodiment, and the method has a wider application scene and is closer to the use of daily life. Furthermore, the embodiment provides a real-time hand motion capture method based on a single-sheet colorful picture, which is easy to arrange and high in efficiency. A model capable of outputting hand posture parameters is trained through a color image of a camera, and a reconstructed hand motion model can be obtained through a parameterization method. Because the network efficiency is high and the speed is high, the method can be flexibly applied to a plurality of fields needing three-dimensional hand actions, and the development of the three-dimensional field is further accelerated. The embodiment provides a real-time hand motion capture method based on a single picture, describes the content of training data used by the technology and an image acquisition system thereof, elaborates the network structure, loss function and the like used by training in detail, has high application value and innovation space, and has strong guiding significance for optimizing and improving the motion capture algorithm in the field.

Compared with the prior art, the model training method provided by the embodiment trains the hand capture model capable of obtaining the hand posture parameters and the hand shape parameters based on the single hand image based on the image training set and the labeling data, so that the hand motion model can be obtained based on the single hand image in practical application. Because the hand motion model can be obtained based on a single image, the requirement on input data is lower, so that the hand capture model is wider in application and stronger in transportability, and can be applied to various scenes in the industry, thereby generating huge commercial value. In addition, the loss constraint term of adjacent frames is added, so that the change of the angular speed of the joint rotation is smaller, the joint rotation is smoother, and the influence of jitter on the obtained parameters is reduced. In addition, the image training set is expanded in a data enhancement mode, so that the acquisition mode of the image training set is simpler and the efficiency is higher.

The steps of the above methods are divided for clarity, and the implementation may be combined into one step or split some steps, and the steps are divided into multiple steps, so long as the same logical relationship is included, which are all within the protection scope of the present patent; it is within the scope of the patent to add insignificant modifications to the algorithms or processes or to introduce insignificant design changes to the core design without changing the algorithms or processes.

A third embodiment of the present invention relates to an electronic apparatus, as shown in fig. 4, including: at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401; the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 to enable the at least one processor 401 to perform the model training method as mentioned in the above embodiments.

The electronic device includes: one or more processors 401 and a memory 402, one processor 401 being exemplified in fig. 4. The processor 401 and the memory 402 may be connected by a bus or other means, and fig. 4 illustrates the connection by a bus as an example. Memory 402, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The processor 401 executes various functional applications of the device and data processing by executing nonvolatile software programs, instructions, and modules stored in the memory 402, thereby implementing the model training method described above.

The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store a list of options, etc. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 402 may optionally include memory located remotely from processor 401, which may be connected to an external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

One or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the model training method of any of the method embodiments described above.

The product can execute the method provided by the embodiment of the application, has corresponding functional modules and beneficial effects of the execution method, and can refer to the method provided by the embodiment of the application without detailed technical details in the embodiment.

A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.

That is, as can be understood by those skilled in the art, all or part of the steps in the method of the above embodiments may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps in the method of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments for practicing the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims

1. A method of model training, comprising:

acquiring an image training set and annotation data; the annotation data comprises a human hand three-dimensional model obtained based on the optimization of the image training set;

training to obtain a human hand capturing model by utilizing each frame training image in the image training set and the marking data based on a predefined loss function at least comprising a loss constraint term of adjacent frames; the human hand capturing model is a neural network model used for judging human hand posture parameters and human hand shape parameters represented by a single human hand image, so that a human hand motion model can be obtained based on the human hand posture parameters, the human hand shape parameters and a human hand parameterized model.

2. The model training method of claim 1, wherein obtaining the training set of images comprises:

controlling a plurality of image acquisition devices to shoot hand images synchronously; wherein the arrangement directions of the plurality of image acquisition devices are different;

and constructing the image training set according to the shot human hand image.

3. The model training method according to claim 2, wherein the constructing the image training set from the captured human hand image comprises:

performing data enhancement on the shot hand image;

and taking the human hand image subjected to data enhancement and the shot human hand image as a training image of the image training set.

4. The model training method of claim 1, wherein obtaining the annotation data comprises:

optimizing to obtain the human hand three-dimensional model as the labeling data based on the input data of a predefined optimization objective function and an optimization algorithm;

the input data of the optimization algorithm comprises a human hand contour map, point cloud data, two-dimensional key point data and three-dimensional key point data which are obtained based on the image training set.

5. The model training method of claim 4, wherein the optimization objective function is related to two-dimensional keypoint errors, three-dimensional keypoint errors, human hand parameter errors, parameter differences of adjacent frames, and contour errors.

6. The model training method of any one of claims 1 to 5, wherein the functional expression of the adjacent frame loss constraint term is:

L_Smooth=||θ_pre-θ_cur||₂ ²+||β_pre-β_cur||₂ ²a formula a;

7. The model training method of any one of claims 1 to 5, wherein the loss function further comprises a point cloud loss constraint term, a two-dimensional keypoint loss constraint term, and a three-dimensional keypoint loss constraint term.

8. The model training method according to any one of claims 1 to 5, wherein the human hand capture model is a network model obtained by cascading a lightweight neural network and a gated cyclic unit model.

9. An electronic device, comprising: at least one processor; and the number of the first and second groups,

a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method of any one of claims 1 to 8.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the model training method of any one of claims 1 to 8.