CN112801064A - Model training method, electronic device and storage medium - Google Patents

Model training method, electronic device and storage medium

Info

Publication number
CN112801064A
Authority
CN
China
Prior art keywords
model
image
hand
human hand
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110386076.2A
Other languages
Chinese (zh)
Inventor
石彪
李廷照
张举勇
户磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dilusense Technology Co Ltd
Hefei Dilusense Technology Co Ltd
Original Assignee
Beijing Dilusense Technology Co Ltd
Hefei Dilusense Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dilusense Technology Co Ltd, Hefei Dilusense Technology Co Ltd filed Critical Beijing Dilusense Technology Co Ltd
Priority to CN202110386076.2A priority Critical patent/CN112801064A/en
Publication of CN112801064A publication Critical patent/CN112801064A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the invention relate to the field of data processing and disclose a model training method, an electronic device and a storage medium. In some embodiments of the present invention, a model training method comprises the following steps: acquiring an image training set and annotation data, where the annotation data comprises a three-dimensional human hand model obtained by optimization over the image training set; and training a human hand capture model using each frame of training image in the image training set together with the annotation data, based on a predefined loss function that includes at least an adjacent-frame loss constraint term. The human hand capture model is a neural network model that estimates the hand pose parameters and hand shape parameters represented by a single human hand image, so that a human hand motion model can be obtained from the hand pose parameters, the hand shape parameters and a parameterized hand model. This embodiment makes it possible to obtain, from a single image, the parameters required to build a human hand motion model based on a parameterized model.

Description

Model training method, electronic device and storage medium
Technical Field
The embodiments of the invention relate to the field of data processing, and in particular to a model training method, an electronic device and a storage medium.
Background
The reconstruction and attribute recognition of three-dimensional human hands have long been important research directions in machine vision. At present, academic work on deep-learning-based human hand reconstruction can be roughly divided into two categories: parameterized model reconstruction and non-parameterized model reconstruction. Non-parameterized model reconstruction mainly takes multi-view depth maps and multi-view color maps as input and learns a human hand model from a large amount of data, so data acquisition is difficult. Parameterized model reconstruction mainly learns the parameters of a human hand model and then fits the target gesture model through these parameters. Both methods have their respective advantages and disadvantages.
With the rapid development of deep learning and hardware technology in recent years, the shift from machine-learning-based optimization to models obtained through deep learning has gradually made real-time gesture motion capture feasible. However, current reconstruction methods based on a parameterized model often need multiple images to obtain the parameters required by the parameterized model.
Disclosure of Invention
An object of embodiments of the present invention is to provide a model training method, an electronic device and a storage medium, which make it possible to obtain, from a single image, the parameters required to construct a human hand motion model based on a parameterized model.
In order to solve the above technical problem, an embodiment of the present invention provides a model training method, including the following steps: acquiring an image training set and annotation data, where the annotation data comprises a three-dimensional human hand model obtained by optimization over the image training set; and training a human hand capture model using each frame of training image in the image training set together with the annotation data, based on a predefined loss function that includes at least an adjacent-frame loss constraint term. The human hand capture model is a neural network model that estimates the hand pose parameters and hand shape parameters represented by a single human hand image, so that a human hand motion model can be obtained from the hand pose parameters, the hand shape parameters and a parameterized hand model.
An embodiment of the present invention also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method as mentioned in the above embodiments.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the model training method mentioned in the above embodiments.
According to the model training method, the electronic device and the storage medium provided by this embodiment, a human hand capture model that can obtain hand pose parameters and hand shape parameters from a single hand image is trained from the image training set and the annotation data, so that a hand motion model can be obtained from a single hand image in practical applications. Because the hand motion model can be obtained from a single image, the requirements on input data are lower, making the hand capture model more widely applicable and more portable; it can be applied to various scenarios in industry and thus generate considerable commercial value. In addition, the adjacent-frame loss constraint term makes the change in the angular velocity of joint rotation smaller, so joint rotation is smoother and the influence of jitter on the obtained parameters is reduced.
In addition, acquiring the image training set comprises: controlling a plurality of image acquisition devices to synchronously capture human hand images, the placement orientations of the plurality of image acquisition devices being different; and constructing the image training set from the captured hand images.
In addition, constructing the image training set from the captured human hand images comprises: performing data augmentation on the captured hand images; and using both the augmented hand images and the originally captured hand images as training images of the image training set.
In addition, obtaining the annotation data comprises: optimizing a three-dimensional human hand model, which serves as the annotation data, based on a predefined optimization objective function and the input data of an optimization algorithm; the input data of the optimization algorithm comprises a human hand contour map, point cloud data, two-dimensional keypoint data and three-dimensional keypoint data obtained from the image training set.
In addition, the optimization objective function is related to the two-dimensional keypoint error, the three-dimensional keypoint error, the human hand parameter error, the parameter difference between adjacent frames, and the contour error.
In addition, the functional expression of the adjacent-frame loss constraint term is:
$L_{Smooth} = \|\theta_{pre} - \theta_{cur}\|_2^2 + \|\beta_{pre} - \beta_{cur}\|_2^2$  (formula a)
where $L_{Smooth}$ denotes the adjacent-frame loss constraint term, $\theta_{pre}$ and $\beta_{pre}$ denote the human hand pose and shape parameters predicted from the previous frame of training image, and $\theta_{cur}$ and $\beta_{cur}$ denote the human hand pose and shape parameters predicted from the current frame of training image.
In addition, the loss function further comprises a point cloud loss constraint term, a two-dimensional keypoint loss constraint term and a three-dimensional keypoint loss constraint term.
In addition, the hand capture model is a network model obtained by cascading a lightweight neural network and a gated recurrent unit model.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not drawn to scale unless otherwise specified.
FIG. 1 is a flow chart of a model training method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a model training method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of an image acquisition system in the model training method shown in FIG. 2;
fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. However, it will be appreciated by those of ordinary skill in the art that numerous technical details are set forth in order to provide a better understanding of the present application; the technical solution claimed in the present application can nevertheless be implemented without these technical details, and with various changes and modifications based on the following embodiments.
A first embodiment of the present invention relates to a model training method comprising the following steps: acquiring an image training set and annotation data, where the annotation data comprises a three-dimensional human hand model obtained by optimization over the image training set; and training a human hand capture model using each frame of training image in the image training set together with the annotation data, based on a predefined loss function that includes at least an adjacent-frame loss constraint term. The human hand capture model is a neural network model that estimates the hand pose parameters and hand shape parameters represented by a single human hand image, so that a human hand motion model can be obtained from the hand pose parameters, the hand shape parameters and a parameterized hand model. In this embodiment, a hand capture model that can obtain hand pose and shape parameters from a single hand image is trained from the image training set and the annotation data, so that a hand motion model can be obtained from a single hand image in practical applications. Because the hand motion model can be obtained from a single image, the requirements on input data are lower, making the hand capture model more widely applicable and more portable; it can be applied to various scenarios in industry and thus generate considerable commercial value. In addition, the adjacent-frame loss constraint term makes the change in the angular velocity of joint rotation smaller, so joint rotation is smoother and the influence of jitter on the obtained parameters is reduced.
The following describes the details of the model training method of this embodiment. The implementation details below are provided to facilitate understanding and are not necessary for practicing the present solution.
The model training method in the present embodiment is applied to an electronic device. The electronic device may be a terminal, a server, a cloud server, or the like. As shown in fig. 1, the model training method includes the following steps:
step 101: and acquiring an image training set and labeling data.
Specifically, the annotation data comprises a three-dimensional human hand model obtained by optimization over the image training set. The image training set comprises a plurality of training images, each of which is a human hand image in one of various poses and against one of various backgrounds. The three-dimensional human hand model is a point cloud model of the human hand, with fixed hand pose parameters and hand shape parameters, corresponding to the image training set.
Step 102: Train a human hand capture model using each frame of training image in the image training set and the annotation data, based on a predefined loss function that includes at least an adjacent-frame loss constraint term.
Specifically, the hand capture model is a neural network model that estimates the hand pose parameters and hand shape parameters represented by a single hand image, so that a hand motion model can be obtained from the hand pose parameters, the hand shape parameters and a parameterized hand model. The input of the hand capture model is a single hand image; the output is the hand pose parameters and hand shape parameters of the parameterized model. The parameterized model is a pre-constructed model that can generate the hand motion model corresponding to a single hand image from the input hand pose and shape parameters of that image. The hand motion model is a three-dimensional model capable of representing the hand motion in a single hand image.
It is worth mentioning that the electronic device trains, from the image training set and the annotation data, a hand capture model that can obtain hand pose and shape parameters from a single hand image, so that a hand motion model can be obtained from a single hand image in practical applications. Because the hand motion model can be obtained from a single image, the requirements on input data are lower, making the hand capture model more widely applicable and more portable; it can be applied to various scenarios in industry and thus generate considerable commercial value. In addition, the adjacent-frame loss constraint term makes the change in the angular velocity of joint rotation smaller, so joint rotation is smoother and the influence of jitter on the obtained parameters is reduced.
Optionally, the human hand image is a color (RGB) image.
In one example, the functional expression of the adjacent-frame loss constraint term is:
$L_{Smooth} = \|\theta_{pre} - \theta_{cur}\|_2^2 + \|\beta_{pre} - \beta_{cur}\|_2^2$  (formula a)
where $L_{Smooth}$ denotes the adjacent-frame loss constraint term, $\theta_{pre}$ and $\beta_{pre}$ denote the hand pose and shape parameters predicted from the previous frame of training image, and $\theta_{cur}$ and $\beta_{cur}$ denote the hand pose and shape parameters predicted from the current frame of training image.
It is worth mentioning that comparing the previous frame with the current frame effectively reduces the influence of jitter on the modeling.
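To make formula a concrete, the following is a minimal PyTorch sketch of the adjacent-frame loss constraint term; it is an illustration rather than code from the patent, and the tensor shapes are assumptions.

```python
import torch

def smooth_loss(theta_pre: torch.Tensor, beta_pre: torch.Tensor,
                theta_cur: torch.Tensor, beta_cur: torch.Tensor) -> torch.Tensor:
    """Adjacent-frame loss constraint term (formula a): squared L2 distance
    between the pose/shape parameters predicted for consecutive frames."""
    return ((theta_pre - theta_cur) ** 2).sum() + ((beta_pre - beta_cur) ** 2).sum()
```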
In one example, the loss function further includes a point cloud loss constraint term, a two-dimensional keypoint loss constraint term and a three-dimensional keypoint loss constraint term. Specifically, the training process may use three different types of loss constraint: conventional loss constraints, adjacent-frame loss constraints, and point cloud constraints. The conventional loss adopts the common Euclidean distance constraint between the three-dimensional keypoints projected onto the two-dimensional image plane and the annotated two-dimensional keypoints. The inventors found that training the model with the keypoint constraint alone has two drawbacks: first, the relation between adjacent frames is not considered, so the model jitters severely; second, no real hand point cloud is used as a constraint, so the model is not accurate enough in the depth direction. The adjacent-frame loss constraint and the point cloud loss constraint can therefore be added to achieve more accurate and coherent motion capture. To make the joint rotation of adjacent frames smoother in the parameterized model, the change in the angular velocity of joint rotation (both magnitude and direction) should be as small as possible. The average rotational angular velocity of a keypoint over two adjacent frames can be computed by dividing the difference of the rotation vectors of the adjacent frames by the unit time; over three adjacent frames, two average rotational angular velocities (vectors) can be computed. Optionally, the regularization loss may thus be designed to minimize the difference between these two average rotational angular velocities, e.g. the L1 norm of their difference. Similarly, for the translation vector, the difference (change) in translation velocity should be used. The point cloud loss constraint may be computed as the point-to-point error between the hand model generated by the parameterized model and the three-dimensional hand model in the annotation data. Therefore, the total loss function may consist of a two-dimensional keypoint loss constraint term, a three-dimensional keypoint loss constraint term, a point cloud loss constraint term and an adjacent-frame loss constraint term, as shown in formula b. The functional expression of the two-dimensional keypoint loss constraint term is shown in formula c, that of the three-dimensional keypoint loss constraint term in formula d, and that of the point cloud loss constraint term in formula e.
$L_{total} = \lambda_1 L_{2D} + \lambda_2 L_{3D} + \lambda_3 L_{Point} + \lambda_4 L_{Smooth}$  (formula b)
In formula b, $L_{total}$ denotes the total loss function; $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ denote the weights of the two-dimensional keypoint, three-dimensional keypoint, point cloud and adjacent-frame loss constraint terms, respectively, and are generally chosen empirically. $L_{2D}$ denotes the two-dimensional keypoint loss constraint term, computed as in formula c. $L_{3D}$ denotes the three-dimensional keypoint loss constraint term, computed as in formula d. $L_{Point}$ denotes the point cloud loss constraint term, computed as in formula e. $L_{Smooth}$ denotes the adjacent-frame loss constraint term, computed as in formula a above; this term is mainly used to prevent jitter.
$L_{2D} = \|p_{gt2} - p_{l2}\|_2^2$  (formula c)
In formula c, $p_{gt2}$ denotes the annotated (ground-truth) two-dimensional keypoint information of the hand, obtained by a third-party algorithm (e.g. a two-dimensional keypoint detection algorithm), and $p_{l2}$ denotes the corresponding pose information predicted by the network.
$L_{3D} = v\,\|p_{gt3} - p_{l3}\|_2^2$  (formula d)
In formula d, $p_{gt3}$ denotes the annotated three-dimensional keypoint information of the hand, consisting of three components X, Y and Z, where (X, Y) is the two-dimensional keypoint information detected by the third-party algorithm and Z is the value at coordinate (X, Y) on the depth map. $p_{l3}$ denotes the three-dimensional keypoint positions obtained from the parameters predicted by the network. $v$ is a one-hot vector consisting of 0s and 1s. Owing to factors such as self-occlusion of the hand, the Z value of a keypoint in a self-occluded region is inconsistent with the depth-map value at coordinate (X, Y), and rules can be set as needed to filter out such self-occluded keypoints. For example, each detected three-dimensional keypoint comes with a confidence value representing the probability that it is correct; by setting a confidence threshold, the detected three-dimensional keypoints can be filtered. The element of the one-hot vector corresponding to a correct three-dimensional keypoint is set to 1, and that corresponding to an incorrect one is set to 0.
$L_{Point} = w\,\|n_{gt}^{T}(V_{gt} - V_{pred})\|_2^2$  (formula e)
In formula e, $w$ is a one-hot vector consisting of 0s and 1s: an element is 1 if a point corresponding to the three-dimensional keypoint is found on the depth map, and 0 otherwise. $V_{gt}$ denotes the grid points of the optimized three-dimensional hand model, $V_{pred}$ the corresponding points of the three-dimensional hand model generated from the predicted parameters, and $n_{gt}^{T}$ the grid point normals. The corresponding points between the hand model generated from the predicted parameters and the optimized hand model are found via a Kd-Tree.
Optionally, the hand capture model is a network model obtained by cascading a lightweight neural network (ShuffleNetV2) model and a gated recurrent unit (GRU) model. Specifically, in view of the real-time requirement, efficiency tests on different models lead to the following conclusion: some small networks, combined with a recurrent neural network, can achieve better results. For example, the network model obtained by cascading a lightweight neural network with a gated recurrent unit model performs better and runs faster. Therefore, optionally, the hand capture model may use this network model as the base network during training. The input of the hand capture model is a single hand image; the output is the hand pose parameters and hand shape parameters of the parameterized model, and feeding these parameters into the parameterized model yields the hand motion model.
Optionally, the hand pose parameters of consecutive frames output by the ShuffleNetV2 network are fed into the GRU network, which continuously refines the result over consecutive frames.
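The cascade can be sketched in PyTorch as follows. This is an illustrative reconstruction: the parameter dimensions (48 pose and 10 shape parameters, in the style of MANO) and the hidden size are assumptions, not values stated in the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

class HandCaptureModel(nn.Module):
    """Lightweight backbone (ShuffleNetV2) cascaded with a GRU: per-frame image
    features are temporally smoothed before regressing pose/shape parameters."""
    def __init__(self, n_pose=48, n_shape=10, hidden=256):
        super().__init__()
        self.n_pose = n_pose
        backbone = shufflenet_v2_x1_0(weights=None)
        backbone.fc = nn.Identity()                  # expose the 1024-d features
        self.backbone = backbone
        self.gru = nn.GRU(1024, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_pose + n_shape)

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.gru(feats)                     # smoothing over consecutive frames
        params = self.head(seq)                      # (B, T, n_pose + n_shape)
        return params[..., :self.n_pose], params[..., self.n_pose:]
```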
The above description is only for illustrative purposes and does not limit the technical aspects of the present invention.
Compared with the prior art, in the model training method provided by this embodiment, the electronic device trains, from the image training set and the annotation data, a hand capture model that can obtain hand pose and shape parameters from a single hand image, so that a hand motion model can be obtained from a single hand image in practical applications. Because the hand motion model can be obtained from a single image, the requirements on input data are lower, making the hand capture model more widely applicable and more portable; it can be applied to various scenarios in industry and thus generate considerable commercial value. In addition, the adjacent-frame loss constraint term makes the change in the angular velocity of joint rotation smaller, so joint rotation is smoother and the influence of jitter on the obtained parameters is reduced.
A second embodiment of the present invention relates to a model training method. This embodiment elaborates on the first embodiment and illustrates the process of obtaining the image training set and the annotation data.
Specifically, as shown in fig. 2, this embodiment includes steps 201 to 204, where step 204 is substantially the same as step 102 in the first embodiment and is not repeated here. The differences are mainly described below:
step 201: and controlling a plurality of image acquisition devices to synchronously shoot the hand images.
Specifically, the placement orientations of the plurality of image acquisition devices differ.
In one example, an image acquisition system based on multiple cameras at multiple viewpoints, together with a subsequent data augmentation algorithm, is used to obtain the desired image training set. Optionally, an image acquisition system composed of a plurality of image acquisition devices is shown in fig. 3: the first camera 301, the second camera 302 and the third camera 303 are located at different positions around the human hand 304. Optionally, the captured training images cover 40 common gesture types in different orientations, and the captured subjects comprise 30 individuals of different ages and hand types. Optionally, to account for the influence of clothing color at the wrist, the subjects wear sleeves of different colors during acquisition. The process of setting up the image acquisition system with multiple cameras is roughly as follows:
in the first step, a plurality of cameras are placed in different directions. Wherein the camera may be an RGB-D camera.
And step two, developing an image acquisition program to enable a plurality of cameras to acquire images synchronously.
Optionally, a method similar to multi-camera human body reconstruction can be used: the multi-camera data are used to optimize a three-dimensional hand model that serves as the annotation data, which guarantees the accuracy of the annotation data and of the three-dimensional keypoints to the greatest extent.
After the image acquisition system is set up, gestures with different pitch angles and rotation angles are captured for different individuals. To account for motion blur, the speed of gesture changes is adjusted as needed during acquisition. After data acquisition is completed, the three-dimensional and two-dimensional keypoints of the hand need to be obtained. They can be obtained with a keypoint detection library combined with manual annotation: the three-dimensional and two-dimensional keypoints in a training image are first detected by the keypoint detection library, and the missed keypoints are then annotated manually on the basis of the detection result.
Step 202: Construct an image training set from the captured hand images.
Specifically, the electronic device may directly use the captured hand images as training images to construct the image training set, or may transform the captured hand images to obtain the image training set.
In one example, the electronic device performs data augmentation on the captured hand images, and uses both the augmented hand images and the originally captured hand images as training images of the image training set. Specifically, because the background during acquisition is uniform, the electronic device can augment the data by extracting the hand region and fusing it into different background pictures with an image fusion algorithm, obtaining image data with different backgrounds to enrich the image training set. The data augmentation applied to the hand images may include random rotation, random resizing, random cropping, illumination changes, normalization and similar operations on the captured hand images, as in the sketch below.
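A plausible augmentation pipeline using torchvision; the parameter ranges and the target size are illustrative assumptions, not values from the patent.

```python
from torchvision import transforms

# Augmentation for the captured hand images (PIL image inputs).
augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                 # random rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random resize + crop
    transforms.ColorJitter(brightness=0.4, contrast=0.4),  # illumination change
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # normalization
                         std=[0.229, 0.224, 0.225]),
])
```

Note that geometric transforms such as rotation and cropping must also be applied to the keypoint annotations when the augmented images are used with keypoint supervision.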
Step 203: Optimize a three-dimensional hand model, serving as the annotation data, based on the predefined optimization objective function and the input data of the optimization algorithm.
Specifically, the input data of the optimization algorithm comprises a hand contour (mask) map, point cloud data, two-dimensional keypoint data and three-dimensional keypoint data obtained from the image training set. The point cloud data refers to the point cloud of the hand in a training image, the two-dimensional keypoint data to the two-dimensional keypoints of the hand in a training image, and the three-dimensional keypoint data to the three-dimensional keypoints of the hand in a training image.
In one example, the optimization objective function is related to the two-dimensional keypoint error, the three-dimensional keypoint error, the hand parameter error, the parameter difference between adjacent frames, and the contour error. The hand parameters comprise the hand pose parameters and the hand shape parameters.
In one example, the optimization objective function is:
$\min(loss_{total}) = \min(L_{2d} + L_{3d} + L_{mano} + L_{t1-t0} + L_{mask})$  (formula f)
In formula f, $\min(loss_{total})$ is the optimization objective; $L_{2d}$ is the two-dimensional keypoint error term, computed as in formula g; $L_{3d}$ is the three-dimensional keypoint error term, computed as in formula h; $L_{mano}$ is the hand parameter error term, computed as in formula i; $L_{t1-t0}$ is the adjacent-frame limiting term, related to the parameter difference between adjacent frames and computed as in formula j; and $L_{mask}$ is the contour error term, computed as in formula k.
$L_{2d} = \sum_{t=1}^{T} \|x_t - x_{gt}\|_2^2$  (formula g)
In formula g, $x_t$ denotes the two-dimensional keypoints obtained by projecting the three-dimensional keypoints of the three-dimensional hand model, $x_{gt}$ the true two-dimensional keypoints of the image, and $T$ the number of training images in the training image set.
$L_{3d} = \sum_{t=1}^{T} \|Q_t - Q_{gt}\|_2^2$  (formula h)
In formula h, $Q_t$ denotes the three-dimensional keypoints of the three-dimensional hand model, $Q_{gt}$ the three-dimensional keypoints obtained from the two-dimensional keypoint projections, and $T$ the number of training images in the training image set.
$L_{mano} = \sum_{t=1}^{T} \left( \|\theta_t - \theta_{gt}\|_2^2 + \|\beta_t - \beta_{gt}\|_2^2 + \|V_t - V_{gt}\|_2^2 \right)$  (formula i)
In formula i, $\theta_t$ denotes the predicted hand pose parameters, $\theta_{gt}$ the true hand pose parameters, $\beta_t$ the predicted hand shape parameters, $\beta_{gt}$ the true hand shape parameters, $V_t$ the hand model generated from the predicted hand parameters, $V_{gt}$ the real hand model, and $T$ the number of training images in the training image set.
$L_{t1-t0} = \sum_{t=1}^{T-1} \|\theta_{t+1} - \theta_t\|_2^2$  (formula j)
In formula j, $\theta_t$ denotes the hand pose parameters of the $t$-th frame, $\theta_{t+1}$ those of the $(t+1)$-th frame, and $T$ the number of training images in the training image set.
$L_{mask} = \sum_{t=1}^{T} \|Mask_t - Mask_{gt}\|_2^2$  (formula k)
In formula k, $Mask_t$ denotes the predicted hand contour, $Mask_{gt}$ the true hand contour, and $T$ the number of training images in the training image set.
In these formulas, the two-dimensional and three-dimensional keypoint error terms constrain the hand pose in two-dimensional and three-dimensional space, respectively; the hand parameter error term mainly computes point-to-point distances on the hand, thereby constraining the hand shape coefficients and pose parameters. The adjacent-frame constraint term limits pose jitter between adjacent frames by computing the parameter difference between them. The contour error term computes the loss of the three-dimensional hand contour projected onto two-dimensional space, in order to obtain better shape parameters.
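To tie formulas f-k together, here is a minimal PyTorch sketch of the optimization objective. The tensor shapes and the equal weighting of the terms are assumptions; a real optimizer would minimize this value over the hand parameters.

```python
import torch

def annotation_objective(x_t, x_gt,        # projected vs. annotated 2D keypoints, (T, K, 2)
                         Q_t, Q_gt,        # model vs. annotated 3D keypoints, (T, K, 3)
                         theta, theta_gt,  # pose parameters, (T, P)
                         beta, beta_gt,    # shape parameters, (T, S)
                         V_t, V_gt,        # predicted vs. reference mesh points, (T, N, 3)
                         mask_t, mask_gt): # rendered vs. annotated silhouettes, (T, H, W)
    l_2d   = ((x_t - x_gt) ** 2).sum()                        # formula g
    l_3d   = ((Q_t - Q_gt) ** 2).sum()                        # formula h
    l_mano = ((theta - theta_gt) ** 2).sum() \
           + ((beta - beta_gt) ** 2).sum() \
           + ((V_t - V_gt) ** 2).sum()                        # formula i
    l_t1t0 = ((theta[1:] - theta[:-1]) ** 2).sum()            # formula j
    l_mask = ((mask_t - mask_gt) ** 2).sum()                  # formula k
    return l_2d + l_3d + l_mano + l_t1t0 + l_mask             # formula f
```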
It should be noted that, as those skilled in the art will understand, other optimization objective functions may also be constructed in practical applications to optimize the three-dimensional hand model; this embodiment is merely an example and does not limit the specific functional form of the optimization objective function.
It should also be noted that, although in this embodiment the three-dimensional hand model is obtained through an optimization algorithm, in practical applications it may also be obtained in other ways; this embodiment is only an example.
Step 204: Train a human hand capture model using each frame of training image in the image training set and the annotation data, based on a predefined loss function that includes at least an adjacent-frame loss constraint term.
In one example, after the hand capture model is trained, it may be applied to various scenarios. Specifically, a single captured color image is input into the hand capture model to obtain the hand pose and shape parameters of that image; the hand motion model of the image is then obtained through the parameterized model from the obtained hand pose and shape parameters.
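A hypothetical end-to-end use of a trained model on one color image, reusing the HandCaptureModel sketch above; `mano_layer` stands in for a MANO-style parameterized hand model and is not defined in the patent.

```python
import cv2
import torch

model = HandCaptureModel().eval()

img = cv2.cvtColor(cv2.imread("hand.jpg"), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (224, 224))              # match the assumed backbone input size
x = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
x = x.unsqueeze(1)                             # (B=1, T=1, 3, H, W)

with torch.no_grad():
    theta, beta = model(x)                     # hand pose and shape parameters
# verts, joints = mano_layer(theta[:, 0], beta[:, 0])  # hand motion model
```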
In this embodiment, the inventors consider that although current hardware meets the requirements of hand motion capture, no real-time hand motion capture technique based on a single color image exists in industry. The main reasons are that training data are difficult to acquire (most available data are public datasets), efficiency is low, and related techniques either have extremely high hardware requirements or cannot run in real time. Currently applicable techniques include the parameterized and non-parameterized methods mentioned above. Non-parameterized methods, e.g. the Minimal Hand (minimal_hand) method, have the disadvantages of excessive computational cost and limited gesture diversity, and are unsuitable for industrial application scenarios. The model training method provided by this embodiment offers a parameterized method based on RGB images, which requires little computation and generalizes well; taking RGB images as input gives it a wider range of application scenarios, closer to everyday use. Furthermore, this embodiment provides a real-time hand motion capture method based on a single color picture that is easy to deploy and highly efficient: a model that outputs hand pose parameters is trained from the color images of a camera, and a reconstructed hand motion model is obtained through the parameterization method. Because the network is efficient and fast, the method can be flexibly applied in many fields requiring three-dimensional hand motion, further accelerating the development of the three-dimensional field. This embodiment thus proposes a real-time hand motion capture method based on a single picture, describes the training data used by the technique and its image acquisition system, and elaborates the network structure and loss functions used in training; it has high application value and room for innovation, and offers strong guidance for optimizing and improving motion capture algorithms in this field.
The above description is only for illustrative purposes and does not limit the technical aspects of the present invention.
Compared with the prior art, the model training method provided by this embodiment trains, from the image training set and the annotation data, a hand capture model that can obtain hand pose and shape parameters from a single hand image, so that a hand motion model can be obtained from a single hand image in practical applications. Because the hand motion model can be obtained from a single image, the requirements on input data are lower, making the hand capture model more widely applicable and more portable; it can be applied to various scenarios in industry and thus generate considerable commercial value. In addition, the adjacent-frame loss constraint term makes the change in the angular velocity of joint rotation smaller, so joint rotation is smoother and the influence of jitter on the obtained parameters is reduced. Moreover, expanding the image training set by data augmentation makes the acquisition of the training set simpler and more efficient.
The steps of the above methods are divided only for clarity of description; in implementation they may be combined into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as they contain the same logical relationship. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes, without altering its core design also falls within the protection scope of this patent.
A third embodiment of the present invention relates to an electronic apparatus, as shown in fig. 4, including: at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401; the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 to enable the at least one processor 401 to perform the model training method as mentioned in the above embodiments.
The electronic device includes one or more processors 401 and a memory 402; one processor 401 is taken as an example in fig. 4. The processor 401 and the memory 402 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 4. The memory 402, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs and modules. The processor 401 executes the various functional applications and data processing of the device by running the non-volatile software programs, instructions and modules stored in the memory 402, thereby implementing the model training method described above.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store a list of options, etc. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 402 may optionally include memory located remotely from processor 401, which may be connected to an external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the model training method of any of the method embodiments described above.
The above product can execute the method provided by the embodiments of the present application, and has the functional modules and beneficial effects corresponding to the method; for technical details not described in this embodiment, reference may be made to the method provided by the embodiments of the present application.
A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, as those skilled in the art will understand, all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments for practicing the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method of model training, comprising:
acquiring an image training set and annotation data; the annotation data comprises a human hand three-dimensional model obtained based on the optimization of the image training set;
training a human hand capture model by using each frame of training image in the image training set and the annotation data, based on a predefined loss function comprising at least an adjacent-frame loss constraint term; wherein the human hand capture model is a neural network model for estimating the human hand pose parameters and human hand shape parameters represented by a single human hand image, so that a human hand motion model can be obtained based on the human hand pose parameters, the human hand shape parameters and a parameterized human hand model.
2. The model training method of claim 1, wherein obtaining the training set of images comprises:
controlling a plurality of image acquisition devices to shoot hand images synchronously; wherein the arrangement directions of the plurality of image acquisition devices are different;
and constructing the image training set according to the shot human hand image.
3. The model training method according to claim 2, wherein the constructing the image training set from the captured human hand image comprises:
performing data enhancement on the shot hand image;
and taking the human hand image subjected to data enhancement and the shot human hand image as a training image of the image training set.
4. The model training method of claim 1, wherein obtaining the annotation data comprises:
optimizing to obtain the human hand three-dimensional model as the labeling data based on the input data of a predefined optimization objective function and an optimization algorithm;
the input data of the optimization algorithm comprises a human hand contour map, point cloud data, two-dimensional key point data and three-dimensional key point data which are obtained based on the image training set.
5. The model training method of claim 4, wherein the optimization objective function is related to two-dimensional keypoint errors, three-dimensional keypoint errors, human hand parameter errors, parameter differences of adjacent frames, and contour errors.
6. The model training method of any one of claims 1 to 5, wherein the functional expression of the adjacent frame loss constraint term is:
$L_{Smooth} = \|\theta_{pre} - \theta_{cur}\|_2^2 + \|\beta_{pre} - \beta_{cur}\|_2^2$  (formula a)
wherein $L_{Smooth}$ denotes the adjacent-frame loss constraint term, $\theta_{pre}$ and $\beta_{pre}$ denote the human hand pose and shape parameters predicted based on the previous frame of training image, and $\theta_{cur}$ and $\beta_{cur}$ denote the human hand pose and shape parameters predicted based on the current frame of training image.
7. The model training method of any one of claims 1 to 5, wherein the loss function further comprises a point cloud loss constraint term, a two-dimensional keypoint loss constraint term, and a three-dimensional keypoint loss constraint term.
8. The model training method according to any one of claims 1 to 5, wherein the human hand capture model is a network model obtained by cascading a lightweight neural network and a gated recurrent unit model.
9. An electronic device, comprising: at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method of any one of claims 1 to 8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the model training method of any one of claims 1 to 8.
CN202110386076.2A 2021-04-12 2021-04-12 Model training method, electronic device and storage medium Pending CN112801064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110386076.2A CN112801064A (en) 2021-04-12 2021-04-12 Model training method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110386076.2A CN112801064A (en) 2021-04-12 2021-04-12 Model training method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN112801064A true CN112801064A (en) 2021-05-14

Family

ID=75816728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110386076.2A Pending CN112801064A (en) 2021-04-12 2021-04-12 Model training method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN112801064A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20070076567A (en) * 2007-06-08 2007-07-24 충남대학교산학협력단 Design method of fit clothes using pattern making from 3-dimensional curved surface to 2-dimensional plane
CN102103756A (en) * 2009-12-18 2011-06-22 华为技术有限公司 Comic exaggeration method, device and system for human face digital image supporting position deflection
CN110348406A (en) * 2019-07-15 2019-10-18 广州图普网络科技有限公司 Parameter deducing method and device
CN110929616A (en) * 2019-11-14 2020-03-27 北京达佳互联信息技术有限公司 Human hand recognition method and device, electronic equipment and storage medium
CN111723688A (en) * 2020-06-02 2020-09-29 北京的卢深视科技有限公司 Human body action recognition result evaluation method and device and electronic equipment
CN111723687A (en) * 2020-06-02 2020-09-29 北京的卢深视科技有限公司 Human body action recognition method and device based on neural network
CN112488067A (en) * 2020-12-18 2021-03-12 北京的卢深视科技有限公司 Face pose estimation method and device, electronic equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHRISTIAN ZIMMERMANN et al.: "FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) *
HAO ZHANG et al.: "InteractionFusion: real-time reconstruction of hand poses and deformable objects in hand-object interactions", ACM Transactions on Graphics *
OSCAR KOLLER et al.: "Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
蒲俊福: "Research on video sign language recognition based on deep learning", China Doctoral Dissertations Full-text Database, Information Science and Technology *
郑新千: "Nonlinear parameterized three-dimensional human hand model and its application", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077383A (en) * 2021-06-07 2021-07-06 深圳追一科技有限公司 Model training method and model training device
CN113901971A (en) * 2021-12-09 2022-01-07 北京的卢深视科技有限公司 Body-building posture correction method and device, electronic equipment and storage medium
CN114968055A (en) * 2022-05-20 2022-08-30 重庆科创职业学院 Electronic glove synchronization system, method, device and storage medium
CN114968055B (en) * 2022-05-20 2023-07-07 重庆科创职业学院 Electronic glove synchronization system, method, equipment and storage medium
CN116030247A (en) * 2023-03-20 2023-04-28 之江实验室 Medical image sample generation method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Kanazawa et al. End-to-end recovery of human shape and pose
Zhu et al. Detailed human shape estimation from a single image by hierarchical mesh deformation
Xue et al. Learning attraction field representation for robust line segment detection
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
US10949649B2 (en) Real-time tracking of facial features in unconstrained video
CN112801064A (en) Model training method, electronic device and storage medium
Gupta et al. Rotation equivariant siamese networks for tracking
Xu et al. Predicting animation skeletons for 3d articulated models via volumetric nets
Labbé et al. Single-view robot pose and joint angle estimation via render & compare
JP2009157767A (en) Face image recognition apparatus, face image recognition method, face image recognition program, and recording medium recording this program
CN113706699A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
CN110942512A (en) Indoor scene reconstruction method based on meta-learning
CN115661246A (en) Attitude estimation method based on self-supervision learning
Cong et al. Weakly supervised 3d multi-person pose estimation for large-scale scenes based on monocular camera and single lidar
CN115761905A (en) Diver action identification method based on skeleton joint points
Baudron et al. E3d: event-based 3d shape reconstruction
Kourbane et al. A graph-based approach for absolute 3D hand pose estimation using a single RGB image
CN110598595A (en) Multi-attribute face generation algorithm based on face key points and postures
Mirani et al. Object recognition in different lighting conditions at various angles by deep learning method
Chao et al. Adversarial refinement network for human motion prediction
Huang et al. Life: Lighting invariant flow estimation
CN115862130B (en) Behavior recognition method based on human body posture and trunk sports field thereof
CN114049678B (en) Facial motion capturing method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514